feat(rust): faster frame-init from list of dicts (when omitting fields), and ensure fields are read according to the declared schema #6472

Merged on Jan 26, 2023 (2 commits)

@alexander-beedie commented on Jan 26, 2023

Closes #6437.

A feature (more speed when omitting fields), and a fix (ensure fields are read in the declared schema order).

Optimisation

Currently all dictionary values are read, even if we declare that we only want a subset of the available fields; rather than drop the column data after load, this optimisation simply never loads the omitted data.
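To make the difference concrete, here is a minimal pure-Python sketch of the two strategies. It is only an illustration of the idea, not the Rust code in this PR; the helper names `load_then_drop` and `load_declared_only` are hypothetical, and the `reads` counter exists purely to show how many dictionary values each approach touches.

```python
from typing import Any


def load_then_drop(rows: list[dict[str, Any]], wanted: list[str]):
    """Read every value of every dict, then discard the unwanted columns."""
    columns: dict[str, list[Any]] = {}
    reads = 0
    for i, row in enumerate(rows):
        for key, value in row.items():
            # A key seen for the first time is back-filled with None for earlier rows.
            columns.setdefault(key, [None] * i).append(value)
            reads += 1
        for values in columns.values():
            if len(values) == i:  # key missing from this row: pad with None
                values.append(None)
    return {name: columns.get(name, [None] * len(rows)) for name in wanted}, reads


def load_declared_only(rows: list[dict[str, Any]], wanted: list[str]):
    """Look up only the declared fields; omitted fields are never read."""
    columns: dict[str, list[Any]] = {name: [] for name in wanted}
    reads = 0
    for row in rows:
        for name in wanted:
            if name in row:
                reads += 1
            columns[name].append(row.get(name))
    return columns, reads


rows = [{"a": 1, "b": 2, "c": 3}, {"b": 5, "a": 4, "c": 6}, {"c": 9, "b": 8}]
full, full_reads = load_then_drop(rows, ["a", "c"])
lean, lean_reads = load_declared_only(rows, ["a", "c"])
```

Both helpers produce the same two columns, but the second never touches the values stored under `"b"`.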

Fix

Dictionaries are not guaranteed to be passed into pl.from_dicts (or to pl.DataFrame init) with data stored in the same key order. Consequently a declared schema can end up reordering/renaming the loaded columns and/or moving values between columns if the keys of the first dictionary are not in the same order as the schema, or if the first dictionary is missing one or more fields.

  • Setup:
    import polars as pl
    
    d1 = {"a":1, "b":2, "c":3}
    d2 = {"b":2, "a":1, "c":3}
    d3 = {"c":3, "b":2}
  • First dict loaded has the keys in schema order; we get the expected result:
    pl.from_dicts( [d1,d2,d3], schema=["a","b","c"] )
    # shape: (3, 3)
    # ┌──────┬─────┬─────┐
    # │ a    ┆ b   ┆ c   │
    # │ ---  ┆ --- ┆ --- │
    # │ i64  ┆ i64 ┆ i64 │
    # ╞══════╪═════╪═════╡
    # │ 1    ┆ 2   ┆ 3   │
    # │ 1    ┆ 2   ┆ 3   │
    # │ null ┆ 2   ┆ 3   │
    # └──────┴─────┴─────┘
  • Initial dictionary keys don't match schema order, so values get read into the wrong columns:
    pl.from_dicts( [d2,d1,d3], schema=["a","b","c"] )
    # shape: (3, 3)
    # ┌─────┬──────┬─────┐
    # │ a   ┆ b    ┆ c   │
    # │ --- ┆ ---  ┆ --- │
    # │ i64 ┆ i64  ┆ i64 │
    # ╞═════╪══════╪═════╡
    # │ 2   ┆ 1    ┆ 3   │
    # │ 2   ┆ 1    ┆ 3   │
    # │ 2   ┆ null ┆ 3   │
    # └─────┴──────┴─────┘
  • Same again; value order now completely reversed from schema order:
    pl.from_dicts( [d3,d2,d1], schema=["a","b","c"] )
    # shape: (3, 3)
    # ┌─────┬─────┬──────┐
    # │ a   ┆ b   ┆ c    │
    # │ --- ┆ --- ┆ ---  │
    # │ i64 ┆ i64 ┆ i64  │
    # ╞═════╪═════╪══════╡
    # │ 3   ┆ 2   ┆ null │
    # │ 3   ┆ 2   ┆ 1    │
    # │ 3   ┆ 2   ┆ 1    │
    # └─────┴─────┴──────┘

This came about because the schema_overwrite behaviour for rows (where the values are unnamed and arrive in a fixed order) was being applied to dicts (where the fields are named and are not guaranteed to arrive in a fixed order).
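The mechanism can be modelled in a few lines of plain Python. This is a simplified model that happens to reproduce the outputs shown above, not the actual Rust implementation; `read_positional` and `read_named` are hypothetical names introduced for illustration.

```python
d1 = {"a": 1, "b": 2, "c": 3}
d2 = {"b": 2, "a": 1, "c": 3}
d3 = {"c": 3, "b": 2}


def read_positional(rows, schema):
    """Row-style behaviour: columns appear in first-seen key order and are
    then renamed by position to the schema names (the source of the bug)."""
    seen = []
    for row in rows:
        for key in row:
            if key not in seen:
                seen.append(key)
    rename = dict(zip(seen, schema))  # e.g. {"c": "a", "b": "b", "a": "c"}
    return [{new: row.get(old) for old, new in rename.items()} for row in rows]


def read_named(rows, schema):
    """Dict-style behaviour: each schema field is looked up by name."""
    return [{name: row.get(name) for name in schema} for row in rows]


buggy = read_positional([d3, d2, d1], ["a", "b", "c"])
fixed = read_named([d3, d2, d1], ["a", "b", "c"])
```

Under this model `buggy` matches the reversed table above (the values of `"c"` land in column `a`), while `fixed` keeps every value under its own key regardless of the order in which the dictionaries arrive.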

I don't think we should ever overwrite the field names of incoming dicts with the schema param, as the result is unstable. We should honour the schema as declared, reading the values in the given field order and potentially applying dtypes; if the caller ever does want a rename, that is an easy second step.
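As a rough sketch of that two-step approach in plain Python (not the polars API): the hypothetical `load` helper reads fields by name in schema order and applies a per-field cast, standing in for a dtype, and the hypothetical `rename` helper is the separate, explicit second step.

```python
from typing import Any, Callable


def load(rows: list[dict[str, Any]], schema: dict[str, Callable[[Any], Any]]):
    """Read each declared field by name, in schema order, casting non-null values."""
    columns: dict[str, list[Any]] = {name: [] for name in schema}
    for row in rows:
        for name, cast in schema.items():
            value = row.get(name)
            columns[name].append(None if value is None else cast(value))
    return columns


def rename(columns: dict[str, list[Any]], mapping: dict[str, str]):
    """Explicit second step: rename columns without touching their values."""
    return {mapping.get(name, name): values for name, values in columns.items()}


rows = [{"c": 3, "b": 2}, {"b": 2, "a": 1, "c": 3}]
loaded = load(rows, {"a": int, "b": float, "c": str})
renamed = rename(loaded, {"a": "id"})
```

Keeping the rename separate means the schema only ever selects and types fields, so the result does not depend on the key order of whichever dictionary happens to come first.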

…omitting fields), and ensure values read match declared schema field order
@github-actions bot added the enhancement and rust labels on Jan 26, 2023
@alexander-beedie added the fix and performance labels on Jan 26, 2023
@ritchie46 merged commit c0d5139 into pola-rs:master on Jan 26, 2023
@alexander-beedie deleted the dicts-optimisation-and-fix branch on January 30, 2023 17:46
Development

Successfully merging this pull request may close these issues.

DataFrame initialisation using dict columns parameters