[bug] Writing a column of type list with nulls results in the nulls being replaced with [] #1946

Closed
mkleinbort-ic opened this issue Feb 13, 2024 · 10 comments

Comments

@mkleinbort-ic

Writing a table with a column of type list[int] containing nulls results in the nulls being filled in with []

import polars as pl
import lance

df_test_before = pl.DataFrame({
    'x': [None, [1, 2, 3], []]
})

shape: (3, 1)
┌───────────┐
│ x         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ null      │
│ [1, 2, 3] │
│ []        │
└───────────┘

df_test_after = pl.from_arrow(lance.write_dataset(df_test_before, 'df_test.lance', mode='overwrite').to_table())

shape: (3, 1)
┌───────────┐
│ x         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ []        │
│ [1, 2, 3] │
│ []        │
└───────────┘
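
For anyone who wants to catch this in a test suite rather than by eyeballing the printed frames, here is a minimal sketch of a check. The helper name `nulls_turned_empty` is hypothetical (not part of Lance or Polars); it only compares plain Python lists, so it assumes you have already pulled the column out of each frame.

```python
from typing import List, Optional, Sequence


def nulls_turned_empty(
    before: Sequence[Optional[list]],
    after: Sequence[Optional[list]],
) -> List[int]:
    """Return the row indices where a null list came back as an empty list.

    Hypothetical helper for illustration; it works on plain Python values.
    """
    if len(before) != len(after):
        raise ValueError("row counts differ, cannot compare row by row")
    return [
        i
        for i, (b, a) in enumerate(zip(before, after))
        # A regression is a row that was None before and [] after the round trip.
        if b is None and a == []
    ]


# Values taken from the two frames printed above.
before = [None, [1, 2, 3], []]
after = [[], [1, 2, 3], []]
lost_rows = nulls_turned_empty(before, after)
```

With the frames from this report, the inputs would be `df_test_before['x'].to_list()` and `df_test_after['x'].to_list()`.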
@changhiskhan
Contributor

@westonpace is working on null support for plain encoder currently. I would expect this to land in a week or so. @westonpace is there extra work required to support nulls in list types?

@westonpace
Contributor

😰 I don't know about a week or so. I hope the encoders and MVP version of the v2 file writer will land in a week or so. However, I think there is still some work to go before everything percolates up to the top-level APIs (need to integrate the new format with the scanner, etc.). Maybe the end of the month is more realistic for when users can start using these features.

> @westonpace is there extra work required to support nulls in list types?

From the user perspective or from a development perspective?

Users shouldn't have to do anything. Once they upgrade Lance to the appropriate version it should just support writing nulls. Any old files written with the old format will still read nulls back as empty lists; there is no way to recover them.
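
To illustrate why the nulls cannot be recovered, here is a rough pure-Python sketch of an Arrow-style list layout (offsets, flat values, and a validity flag per row). This is an assumption about how to picture the problem, not Lance's actual encoder: the function names are hypothetical, and the idea is only that a writer which stores offsets and values but drops validity leaves a null row looking exactly like an empty one.

```python
from typing import List, Optional, Sequence, Tuple


def encode_lists(
    rows: Sequence[Optional[List[int]]],
) -> Tuple[List[int], List[int], List[bool]]:
    """Flatten rows into (offsets, values, validity), Arrow-style."""
    offsets, values, validity = [0], [], []
    for row in rows:
        if row is None:
            validity.append(False)  # null row: contributes no values
        else:
            validity.append(True)
            values.extend(row)
        offsets.append(len(values))
    return offsets, values, validity


def decode_lists(
    offsets: Sequence[int],
    values: Sequence[int],
    validity: Optional[Sequence[bool]] = None,
) -> List[Optional[List[int]]]:
    """Rebuild rows; without validity every row decodes as a list."""
    rows: List[Optional[List[int]]] = []
    for i in range(len(offsets) - 1):
        if validity is not None and not validity[i]:
            rows.append(None)
        else:
            rows.append(list(values[offsets[i]:offsets[i + 1]]))
    return rows


offsets, values, validity = encode_lists([None, [1, 2, 3], []])
with_validity = decode_lists(offsets, values, validity)
without_validity = decode_lists(offsets, values)  # validity was never stored
```

In this sketch the null row and the empty row both span zero values in `offsets`, so once the validity flags are gone nothing in the stored data distinguishes them.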

@westonpace
Contributor

#1929 is the tracking issue for the new format version

@mkleinbort

Thank you both, I'll keep a close eye on this. Keen to migrate to lance, pending this fix.

@mkleinbort

How is this coming along? I see there is a lot to do in the writer V2 issue.

@mkleinbort

Do you have an estimate for this feature? I'm about to kick off some refactoring next month and would love to move to Lance as part of it, but I'm waiting on this at the moment.

@wjones127
Contributor

The V2 format is in beta right now. I think if you want nullability it's a good time to try it out and migrate. More compressive encodings are coming soon.

@mkleinbort-ic
Author

mkleinbort-ic commented Jun 13, 2024

I don't think this is working at the moment (0.12.1):

import polars as pl 
import lance

df_test_before = pl.DataFrame({
    'x': [None, [1,2,3], []]
})

lance.write_dataset(df_test_before, 'df_test.lance', mode='overwrite', use_legacy_format=False)

>>> PanicException: not yet implemented: Implement encoding for field Field(id=0, name=x, type=large_list, children=[Field(id=1, name=item, type=int64), ])

@wjones127
Contributor

Hmm it might just be that we have it for list (what PyArrow defaults to) and not large list (what Polars defaults to). We should probably implement large list as well.

@mkleinbort-ic
Author

This seems to be fixed - closing the issue.
