feat(rust): faster frame-init from list of dicts (when omitting fields), and ensure fields are read according to the declared schema #6472

alexander-beedie · 2023-01-26T17:05:55Z

Closes #6437.

This PR contains a feature (faster loading when fields are omitted) and a fix (fields are read in the declared schema order).

Optimisation

Currently all dictionary values are read, even when the schema declares only a subset of the available fields. Rather than dropping the unwanted column data after load, this optimisation never loads the omitted data in the first place.
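To make the difference concrete, here is a minimal pure-Python sketch of the two strategies. It is only an illustration: the actual change lives in the Rust code, and the function names `load_then_drop` and `load_declared_only` are hypothetical, not part of Polars.

```python
# Illustrative model only; these helpers are hypothetical and not Polars APIs.

def load_then_drop(rows, wanted):
    """Read every value from every dict, then discard the unwanted columns."""
    columns = {}
    values_read = 0
    for i, row in enumerate(rows):
        for key, value in row.items():
            # Columns are created on first sight and padded with None.
            col = columns.setdefault(key, [None] * len(rows))
            col[i] = value
            values_read += 1
    kept = {name: columns.get(name, [None] * len(rows)) for name in wanted}
    return kept, values_read


def load_declared_only(rows, wanted):
    """Look up only the declared fields; omitted fields are never touched."""
    columns = {name: [] for name in wanted}
    values_read = 0
    for row in rows:
        for name in wanted:
            if name in row:
                values_read += 1
            columns[name].append(row.get(name))  # None when the key is absent
    return columns, values_read


rows = [{"a": 1, "b": 2, "c": 3}, {"b": 2, "a": 1, "c": 3}, {"c": 3, "b": 2}]
full, n_full = load_then_drop(rows, ["b"])
lean, n_lean = load_declared_only(rows, ["b"])
print(full == lean, n_full, n_lean)
```

Both strategies produce the same single column, but the second touches only the values belonging to declared fields, which is where the potential speed-up comes from.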

Fix

Dictionaries are not guaranteed to be passed into pl.from_dicts (or to pl.DataFrame init) with their keys in the same order. Consequently, a declared schema can end up reordering/renaming the loaded columns and/or moving values between columns if the keys of the first dictionary are not in the same order as the schema, or if the first dictionary is missing one or more fields.

Setup:

import polars as pl

d1 = {"a":1, "b":2, "c":3}
d2 = {"b":2, "a":1, "c":3}
d3 = {"c":3, "b":2}

First dict loaded has the keys in schema order; we get the expected result:

pl.from_dicts( [d1,d2,d3], schema=["a","b","c"] )
# shape: (3, 3)
# ┌──────┬─────┬─────┐
# │ a    ┆ b   ┆ c   │
# │ ---  ┆ --- ┆ --- │
# │ i64  ┆ i64 ┆ i64 │
# ╞══════╪═════╪═════╡
# │ 1    ┆ 2   ┆ 3   │
# │ 1    ┆ 2   ┆ 3   │
# │ null ┆ 2   ┆ 3   │
# └──────┴─────┴─────┘

Initial dictionary keys don't match schema order; values get read into different columns:

pl.from_dicts( [d2,d1,d3], schema=["a","b","c"] )
# shape: (3, 3)
# ┌─────┬──────┬─────┐
# │ a   ┆ b    ┆ c   │
# │ --- ┆ ---  ┆ --- │
# │ i64 ┆ i64  ┆ i64 │
# ╞═════╪══════╪═════╡
# │ 2   ┆ 1    ┆ 3   │
# │ 2   ┆ 1    ┆ 3   │
# │ 2   ┆ null ┆ 3   │
# └─────┴──────┴─────┘

Same again; value order now completely reversed from schema order:

pl.from_dicts( [d3,d2,d1], schema=["a","b","c"] )
# shape: (3, 3)
# ┌─────┬─────┬──────┐
# │ a   ┆ b   ┆ c    │
# │ --- ┆ --- ┆ ---  │
# │ i64 ┆ i64 ┆ i64  │
# ╞═════╪═════╪══════╡
# │ 3   ┆ 2   ┆ null │
# │ 3   ┆ 2   ┆ 1    │
# │ 3   ┆ 2   ┆ 1    │
# └─────┴─────┴──────┘

This came about because schema_overwrite row behaviour (where the values are in a fixed order and are not named) was being applied to dict behaviour (where the fields are named and are not guaranteed to be in a fixed order).
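The tables above can be reproduced with a simplified pure-Python model. This is a sketch under stated assumptions rather than the Rust implementation: it assumes the old path inferred column order from the first appearance of each key and then overwrote the names positionally, and both helper names are hypothetical.

```python
# Simplified model; helper names are hypothetical and not Polars APIs.

def read_then_overwrite_names(rows, schema):
    """Old behaviour (assumed): infer column order from key first-appearance,
    read values by key, then rename the columns positionally from the schema."""
    inferred = []
    for row in rows:
        for key in row:
            if key not in inferred:
                inferred.append(key)
    data = [[row.get(key) for row in rows] for key in inferred]
    return dict(zip(schema, data))


def read_by_declared_schema(rows, schema):
    """Fixed behaviour: look each declared field up by name, in schema order."""
    return {name: [row.get(name) for row in rows] for name in schema}


d1 = {"a": 1, "b": 2, "c": 3}
d2 = {"b": 2, "a": 1, "c": 3}
d3 = {"c": 3, "b": 2}
schema = ["a", "b", "c"]

old = read_then_overwrite_names([d3, d2, d1], schema)
new = read_by_declared_schema([d3, d2, d1], schema)
print(old)  # {'a': [3, 3, 3], 'b': [2, 2, 2], 'c': [None, 1, 1]}
print(new)  # {'a': [None, 1, 1], 'b': [2, 2, 2], 'c': [3, 3, 3]}
```

Under this model the positional rename reproduces the reversed table for `[d3, d2, d1]`, while the name-based lookup returns the same values regardless of the order in which the dicts or their keys arrive.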

I don't think we should ever overwrite the field names of incoming dicts with the schema param, as the result is unstable. We should honour the schema as declared, reading the values in the given field order and potentially applying dtypes; if the caller ever does want a rename, that is an easy second step.

feat(rust): potential for faster frame-init from list of dicts (when …

eda3a47

…omitting fields), and ensure values read match declared schema field order

github-actions bot added the enhancement and rust labels on Jan 26, 2023

fix test lint

17b77b0

alexander-beedie force-pushed the dicts-optimisation-and-fix branch from e3cc3b9 to 17b77b0 on January 26, 2023 17:19

alexander-beedie added the fix and performance labels on Jan 26, 2023

ritchie46 merged commit c0d5139 into pola-rs:master on Jan 26, 2023

alexander-beedie deleted the dicts-optimisation-and-fix branch on January 30, 2023 17:46
