Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): additional schema_overrides param for more ergonomic DataFrame init #6230

Merged
merged 2 commits into from
Jan 17, 2023

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Jan 14, 2023

ref #6209.

This turned into more effort than anticipated... heads-up if you ever want to head into our constructor labyrinth; you may not come back - bring coffee :)

What it does

As @stinodego accurately pointed out in the linked issue, we do not currently have an ergonomic way to specify specific dtypes for just one or two columns out of a larger set at frame-init time; it tends to be either all or nothing, require the full list of columns, etc - this use-case can be said to Not Have A Good Story™.

The primary suggestion was to add a new dtypes param that can apply full or partial schema information, supplementing or overriding anything gleaned from columns, and allowing better delineation of column names/order/types. That is what this PR implements (though we have since settled on schema_overrides).

While I was at it I found that quite a few of the pl.from_<xyz> methods didn't actually support columns, though in the main they pass-through to underlying functions that do; these functions have been upgraded to offer both columns and schema_overrides).

Example

Before:
(have to refer to every column to override one)

df = pl.DataFrame(
    data = {
        "a": [1, 2 ,3],
        "b": [1, 2 ,3],
        "c": [1, 2 ,3],
        "d": [1, 2 ,3],
        "e": [1, 2 ,3],
    },
    columns = ["a", "b", "c", ("d",pl.Int8), "e"],
)
# or: columns = {"a":None, "b":None, "c":None, "d":pl.Int8, "e":None}`

^^ Now imagine how bad this gets for frames with 100s of columns ;)

After:
(previous code would still work, but now we have a much cleaner option)

df = pl.DataFrame(
    data = {
        "a": [1, 2 ,3],
        "b": [1, 2 ,3],
        "c": [1, 2 ,3],
        "d": [1, 2 ,3],
        "e": [1, 2 ,3],
    },
    schema_overrides = {"d": pl.Int8},
)

Bug fixes

Several; carefully walking through the constructors again and adding more unit tests uncovered a number of previously unreported issues. Existing type overrides could potentially be ignored (eg: dataclass/namedtuple input), cause errors, or (in one especially odd case involving pyarrow) reshape the frame repr output in an unexpected way. I wouldn't describe any of them as major, but all have now been fixed.

Next steps

This PR is completely backwards-compatible and additive to what we already have; aside from the bug fixes there are no changes to the current default behaviour, and no need for any changes in existing code. The new param can simply be used to write cleaner code going forward, and it lives happily alongside all of the overloads currently possible via columns.

Now that this gives us a more ergonomic way to apply partial schemas we have a better base to build on if there is interest in reworking it further (different param names like "schema", etc); this commit definitely doesn't address everything from the #6209 discussion, but it does take care of the biggest pain point.

I'd suggest closing out the existing issue, and starting a new one that refocuses on the remaining areas of interest?

(Should probably run through the docstrings and add more examples using schema_overrides too).

Also

Miscellaneous:

I know I've seen a request for this somewhere (possibly the Discord - couldn't find it in the open issue list), so I added the negative counterparts of our existing assert functions as they made sense for a couple of the updated tests:

  • assert_frame_not_equal
  • assert_series_not_equal

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Jan 14, 2023
@alexander-beedie alexander-beedie changed the title feat(python): additional dtypes param for DataFrame init feat(python): additional dtypes param for more ergonomic DataFrame init Jan 14, 2023
@stinodego
Copy link
Member

Thanks! I'll take a look tomorrow :)

@alexander-beedie alexander-beedie force-pushed the frame-init-dtypes-param branch 4 times, most recently from 7a87130 to 28d6c59 Compare January 15, 2023 12:01
@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 15, 2023

Just updated further, with extra tests and dtypes support for pl.from_pandas, pl.from_numpy, pl.from_arrow, and a rebase/squash to pickup/integrate recent commits.

@stinodego
Copy link
Member

stinodego commented Jan 15, 2023

@alexander-beedie I just added a comment to the original issue with some more thoughts and an alternative suggestion. Wondering what you think.

I just want to make sure we get this right! I want to avoid adding additional options to the constructor of our 'main' object that will be hard to deprecate/change later if we're not 100% sure.

Regardless, this PR contains a lot of nice improvements already! So whatever way we go, we can probably start from this PR.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 15, 2023

@alexander-beedie I just added a comment to the original issue with some more thoughts and an alternative suggestion. Wondering what you think.

@stinodego: I have made my counter-proposal, heh. I agree that we should rename the params, and also enforce a better separation of responsibilities - but I disagree that we should be overloading a single parameter with all of that. However, I think we can actually get a "best of both worlds" result here by combining our ideas - see what you think :)

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 15, 2023

@stinodego: should be good to go. I might actually have some connectivity in the air, so perhaps I'll try and set a "highest altitude" polars commit record later, with the columns deprecation... ;)

@alexander-beedie alexander-beedie changed the title feat(python): additional dtypes param for more ergonomic DataFrame init feat(python): additional schema_overrides param for more ergonomic DataFrame init Jan 15, 2023
@ritchie46
Copy link
Member

@stinodego: should be good to go. I might actually have some connectivity in the air, so perhaps I'll try and set a "highest altitude" polars commit record later, with the columns deprecation... ;)

LOL.. the polars sky club🚀😂

Copy link
Member

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments. Nice improvement!

py-polars/polars/convert.py Outdated Show resolved Hide resolved
py-polars/polars/convert.py Outdated Show resolved Hide resolved
py-polars/polars/convert.py Outdated Show resolved Hide resolved
py-polars/polars/convert.py Outdated Show resolved Hide resolved
py-polars/tests/unit/test_df.py Outdated Show resolved Hide resolved
@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 15, 2023

Just a few comments. Nice improvement!

Great, thanks. You still have the eye of the tiger ;) We're on the runway and moving now - if there's connectivity at 35000 feet I'll take care of it! (And if not, then tomorrow in Tokyo ;)

@alexander-beedie alexander-beedie force-pushed the frame-init-dtypes-param branch 3 times, most recently from 4ce5a69 to d352469 Compare January 16, 2023 07:39
@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 16, 2023

@stinodego: All remaining comments addressed, squashed/rebased against the latest commits; also changed the name of the ColumnsType alias to SchemaDefinition to better describe it, in anticipation of the imminent columns => schema update.

Copy link
Member

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we can ship it!

@alexander-beedie
Copy link
Collaborator Author

@ritchie46: ready to roll :)

@ritchie46
Copy link
Member

Alright! Thanks @alexander-beedie and @stinodego!

@ritchie46 ritchie46 merged commit d11cc15 into pola-rs:master Jan 17, 2023
@alexander-beedie
Copy link
Collaborator Author

Alright! Thanks @alexander-beedie and @stinodego!

Great! Will start on the second phase update shortly... ;)

@alexander-beedie alexander-beedie deleted the frame-init-dtypes-param branch January 17, 2023 13:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants