[bug] Writing a column of type list with nulls results in the nulls being replaced with [] #1946

mkleinbort-ic · 2024-02-13T03:20:20Z

Writing a table with a column of type list[int] that contains nulls results in the nulls being replaced with [].

import polars as pl
import lance

df_test_before = pl.DataFrame({
    'x': [None, [1,2,3], []]
})

shape: (3, 1)
┌───────────┐
│ x         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ null      │
│ [1, 2, 3] │
│ []        │
└───────────┘

ds = lance.write_dataset(df_test_before, 'df_test.lance', mode='overwrite')
df_test_after = pl.from_arrow(ds.to_table())

shape: (3, 1)
┌───────────┐
│ x         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ []        │
│ [1, 2, 3] │
│ []        │
└───────────┘


changhiskhan · 2024-02-13T03:33:20Z

@westonpace is currently working on null support for the plain encoder. I would expect this to land in a week or so. @westonpace, is there extra work required to support nulls in list types?

westonpace · 2024-02-13T03:49:47Z

😰 I don't know about a week or so. I hope the encoders and an MVP version of the v2 file writer will land in a week or so. However, I think there is still some work to go before everything percolates up to the top-level APIs (we need to integrate the new format with the scanner, etc.). Maybe the end of the month is more realistic for when users can start using these features.

> @westonpace is there extra work required to support nulls in list types?

From the user perspective or from a development perspective?

Users shouldn't have to do anything. Once they upgrade Lance to the appropriate version, it should just support writing nulls. (Any old files written with the old format will still read nulls back as empty lists; there is no way to recover them.)
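A minimal sketch of why nulls in old files cannot be recovered. It assumes the old format keeps only Arrow-style list offsets and values and drops the validity information; that is an assumption made for illustration, not something confirmed against the Lance source. The helper names `encode` and `decode` are hypothetical.

```python
def encode(rows):
    """Flatten rows into Arrow-style offsets, values, and a validity list."""
    offsets, values, validity = [0], [], []
    for row in rows:
        validity.append(row is not None)   # False marks a null list
        values.extend(row or [])           # a null contributes no values
        offsets.append(len(values))
    return offsets, values, validity


def decode(offsets, values, validity=None):
    """Rebuild rows; without validity, a null looks like an empty list."""
    rows = []
    for i in range(len(offsets) - 1):
        if validity is not None and not validity[i]:
            rows.append(None)
        else:
            rows.append(values[offsets[i]:offsets[i + 1]])
    return rows


rows = [None, [1, 2, 3], []]
offsets, values, validity = encode(rows)

# With validity the null survives the round trip.
with_validity = decode(offsets, values, validity)

# Assumed old behaviour: validity is not stored, so null and [] both
# decode from a zero-length offset range and cannot be told apart.
without_validity = decode(offsets, values)

print(with_validity)
print(without_validity)
```

Under that assumption, the null row and the empty row both occupy a zero-length slot in the offsets, so nothing in the stored data distinguishes them once the validity is gone.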

westonpace · 2024-02-13T03:50:08Z

#1929 is the tracking issue for the new format version

mkleinbort · 2024-02-15T00:16:47Z

Thank you both, I'll keep a close eye on this. Keen to migrate to lance, pending this fix.

mkleinbort · 2024-02-29T18:45:07Z

How is this coming along? I see there is a lot to do in the writer V2 issue.

mkleinbort · 2024-06-11T08:26:46Z

Do you have an estimate for this feature? I'm about to kick off some refactoring next month and would love to move to lance as part of it, but I'm waiting on this at the moment.

wjones127 · 2024-06-11T18:43:43Z

The V2 format is in beta right now. I think if you want nullability it's a good time to try it out and migrate. More compressive encodings are coming soon.

mkleinbort-ic · 2024-06-13T16:04:34Z

I don't think this is working at the moment (0.12.1):

import polars as pl 
import lance

df_test_before = pl.DataFrame({
    'x': [None, [1,2,3], []]
})

lance.write_dataset(df_test_before, 'df_test.lance', mode='overwrite', use_legacy_format=False)

>>> PanicException: not yet implemented: Implement encoding for field Field(id=0, name=x, type=large_list, children=[Field(id=1, name=item, type=int64), ])

wjones127 · 2024-06-17T18:03:57Z

Hmm, it might just be that we have it for list (what PyArrow defaults to) and not large list (what Polars defaults to). We should probably implement large list as well.

mkleinbort-ic · 2024-07-09T12:19:21Z

This seems to be fixed - closing the issue.

mkleinbort mentioned this issue Feb 16, 2024

test: added nested field testcase for field of type list of ints #1939

Open

mkleinbort-ic closed this as completed Jul 9, 2024
