Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement/Expose List/Array "builder" or "cast"er similar to pa.LargeListArray.from_arrays with expressions as inputs #17578

Open
deanm0000 opened this issue Jul 11, 2024 · 5 comments
Labels
A-dtype-list/array Area: list/array data type enhancement New feature or an improvement of an existing feature performance Performance issues or improvements

Comments

@deanm0000
Copy link
Collaborator

deanm0000 commented Jul 11, 2024

Description

This was inspired by this SO question

Here's a simpler example, imagine we have two list columns and we want to add them

df=pl.DataFrame({
    'a':[[1,2,3],[2,3,4],[5,7]],
    'b':[[2,3,4],[7,8,9],[1,2]]
})

One approach is

df.with_columns(
    c=(pl.col('a').explode()*pl.col('b').explode()).implode().over(pl.int_range(pl.len()))
)

but the over is really unnecessary. We can use pyarrow to do this instead

df.with_columns(
    c=pl.struct('a','b').map_batches(lambda s: pl.from_arrow(
        pa.LargeListArray.from_arrays(
            s.struct.field('a').to_arrow().offsets,
            s.struct.field('a').explode()*s.struct.field('b').explode()
        )
    ))
)

Performance

df=pl.DataFrame([
    pl.Series('a',np.random.randint(0,10,100000),dtype=pl.Float64),
    pl.Series('b',np.random.randint(0,10,100000),dtype=pl.Float64),
    pl.Series('c',np.random.randint(0,8,100000),dtype=pl.Float64)
]).group_by('c').agg('a','b').drop('c')
%%timeit
df.with_columns(
    c=pl.struct('a','b').map_batches(lambda s: pl.from_arrow(
        pa.LargeListArray.from_arrays(
            s.struct.field('a').to_arrow().offsets,
            (s.struct.field('a').explode()*s.struct.field('b').explode()).to_arrow()
            # if you don't do an explicit to_arrow above then it'll work but performance will be 60x worse
        )
    ))
)
2.07 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.with_columns(
    c=(pl.col('a').explode()*pl.col('b').explode()).implode().over(pl.int_range(pl.len()))
)
4.4 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So this is half the time.

@deanm0000 deanm0000 added enhancement New feature or an improvement of an existing feature performance Performance issues or improvements A-dtype-list/array Area: list/array data type labels Jul 11, 2024
@cmdlineluser
Copy link
Contributor

Is the Array type intended for these cases? (although it requires same shape)

df = pl.DataFrame({
    'a':[[1,2,3],[2,3,4],[5,7, None]],
    'b':[[2,3,4],[7,8,9],[1,2, None]]
}).cast(pl.Array(pl.Int64, 3))

df.with_columns(c = pl.col.a * pl.col.b)
shape: (3, 3)
┌───────────────┬───────────────┬───────────────┐
│ abc             │
│ ---------           │
│ array[i64, 3] ┆ array[i64, 3] ┆ array[i64, 3] │
╞═══════════════╪═══════════════╪═══════════════╡
│ [1, 2, 3]     ┆ [2, 3, 4]     ┆ [2, 6, 12]    │
│ [2, 3, 4]     ┆ [7, 8, 9]     ┆ [14, 24, 36]  │
│ [5, 7, null]  ┆ [1, 2, null]  ┆ [5, 14, null] │
└───────────────┴───────────────┴───────────────┘

@deanm0000
Copy link
Collaborator Author

@cmdlineluser Sure as long as the lists are the same size then that's better. I deliberately made the example different sizes to illustrate that there's still a need.

On the syntax, I'm not sure if it should a new method, or if it'd be part of cast or implode.
Maybe like

df=pl.DataFrame({'a':[1,2,3,4]})
# to list
df.select(pl.col('a').cast(pl.List(pl.Int64, [0,2,4]))
# to array
df.select(pl.col('a').cast(pl.List(pl.Int64, 2))

or

df=pl.DataFrame({'a':[1,2,3,4]})
# to list
df.select(pl.col('a').implode([0,2,4]))
# to array
df.select(pl.col('a').implode(2))

so if implode gets a list then it becomes a list, if it gets a scaler then it becomes an array.

In typing it out, I'm leaning more towards the latter.

@deanm0000
Copy link
Collaborator Author

use case #14711

@itamarst
Copy link
Contributor

Adding two Series with lists is implemented in #17823. Is that PR sufficient to close this issue too?

@deanm0000
Copy link
Collaborator Author

@itamarst no, the point isn't addition, that was just an example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-dtype-list/array Area: list/array data type enhancement New feature or an improvement of an existing feature performance Performance issues or improvements
Projects
None yet
Development

No branches or pull requests

3 participants