Add pl.Expr.struct.unnest() #13481

Closed
mkleinbort opened this issue Jan 6, 2024 · 10 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@mkleinbort

Description

Structs are a fantastic data type, and I like working with them. However, I don't know of a way to easily expand a struct column into multiple columns without using the DataFrame-level .unnest() (or looping over the field names).

I propose adding pl.Expr.struct.unnest

Example:

df = pl.DataFrame({'x': [{'x1':10, 'x2':20}]}) # Sample data

df.unnest('x') # Current solution to expand the dataframe

df.with_columns(pl.col('x').struct.unnest()) # Proposed addition

Note that I envision pl.col('x').struct.unnest() as the equivalent of pl.col('x1', 'x2')

To give a use case, suppose I want to write an expression to increase the value of x1 in the above example.

Today one can do

df.unnest('x').select(x=pl.struct(pl.col('x1')+1, 'x2')) 
# This works but requires the df-level unnest, meaning it can't be saved as an expression
@mkleinbort mkleinbort added the enhancement New feature or an improvement of an existing feature label Jan 6, 2024
@avimallu
Contributor

avimallu commented Jan 6, 2024

I don't think this is possible the way Expr is implemented right now in Polars. From the user guide section that I wrote with Ritchie's input:

Polars expressions always have a Fn(Series) -> Series signature and Struct is thus the data type that allows us to provide multiple columns as input/output of an expression. In other words, all expressions have to return a Series object, and Struct allows us to stay consistent with that requirement.

@mkleinbort
Author

Interesting, but not strictly true I think. Or at least there is some behaviour that appears as an exception to that rule.

df = pl.DataFrame({'x':[1,2], 'y':[3,4]})
df.with_columns(pl.col('x','y').add(1).name.suffix('_incremented'))

shape: (2, 4)
┌─────┬─────┬───────────────┬───────────────┐
│ x   ┆ y   ┆ x_incremented ┆ y_incremented │
│ --- ┆ --- ┆ ---           ┆ ---           │
│ i64 ┆ i64 ┆ i64           ┆ i64           │
╞═════╪═════╪═══════════════╪═══════════════╡
│ 1   ┆ 3   ┆ 2             ┆ 4             │
│ 2   ┆ 4   ┆ 3             ┆ 5             │
└─────┴─────┴───────────────┴───────────────┘

@avimallu
Contributor

avimallu commented Jan 6, 2024

I believe that internally it's still translated to series operations with the same function signature. From the user guide that mentions expression expansion:

a single expression that specifies multiple columns expands into a list of expressions (depending on the DataFrame schema), resulting in being able to select multiple columns + run computation on them!

@cmdlineluser
Contributor

#10919 (comment)

I think what's really lacking here is Expr.unnest(). While this may not be feasible in the near-future until we have expressions that can return dataframes [...]

@mcrumiller
Contributor

mcrumiller commented Jan 6, 2024

Could we not internally translate pl.col('a').struct.unnest() to:

df.select(
    pl.col('a').struct.field("field1"),
    pl.col('a').struct.field("field2"),
)

@deanm0000
Collaborator

@mcrumiller does df.schema "know" that 'a' has those fields or, at least, how many fields are in the struct?

@mkleinbort
Author

@deanm0000 - does this answer your question?

import polars as pl 
df = pl.DataFrame({'x': [{'x1':'A', 'x2':20}]})
df.schema
>>> OrderedDict([('x', Struct({'x1': String, 'x2': Int64}))])

@ritchie46
Member

This isn't possible naively. An expression can only return a single result series.

@mcrumiller's suggestion might work, but then we must rely on CSE (common subexpression elimination) to avoid duplicating the work done before the struct.

@cmdlineluser
Contributor

I think the latest struct commits will close this:

df.with_columns(pl.col("x").struct.field("*"))
# shape: (1, 3)
# ┌───────────┬─────┬─────┐
# │ x         ┆ x1  ┆ x2  │
# │ ---       ┆ --- ┆ --- │
# │ struct[2] ┆ i64 ┆ i64 │
# ╞═══════════╪═════╪═════╡
# │ {10,20}   ┆ 10  ┆ 20  │
# └───────────┴─────┴─────┘
df.with_columns(
    pl.col("x").struct.with_fields(
        pl.col("x").struct.field("x1") + 1
    )
)
# shape: (1, 1)
# ┌───────────┐
# │ x         │
# │ ---       │
# │ struct[2] │
# ╞═══════════╡
# │ {11,20}   │
# └───────────┘

@ritchie46
Member

Yes, I think we have a more general form now with the two last commits. The unnest can be done with a wildcard as shown by @cmdlineluser. There is also regex support and multiple column names.
