Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet statistics missing when reading Utf8 as Utf8View #12123

Closed
Tracked by #11752
alamb opened this issue Aug 22, 2024 · 5 comments · Fixed by #12232
Closed
Tracked by #11752

Parquet statistics missing when reading Utf8 as Utf8View #12123

alamb opened this issue Aug 22, 2024 · 5 comments · Fixed by #12232
Assignees
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Aug 22, 2024

Part of #11752

Describe the bug

One of the last remaining issues causing test failures when we enable reading StringView by default in #12092 is as follows:

failures:
    datasource::file_format::parquet::tests::fetch_metadata_with_size_hint
    datasource::file_format::parquet::tests::read_alltypes_plain_parquet
    datasource::file_format::parquet::tests::read_binary_alltypes_plain_parquet
    datasource::file_format::parquet::tests::read_merged_batches
    datasource::file_format::parquet::tests::test_statistics_from_parquet_metadata

To Reproduce

#12092

And then run:

cargo test -p datafusion --lib -- file_format::parquet

Expected behavior

The tests should pass

Additional context

The problem is that table schema is configured to be UTF8View but the file schema is using Utf8 (so the stats are returned as Utf8) and the accumulators can't deal updating a Utf8View from Utf8.

@XiangpengHao solved this issue in #11862 (comment) to thread the parameter and then and cast the file schema appropriately.

The code isn't great to start with and adding a new parameter makes it worse.

I also think there are some bugs lurking there that maybe we could improve if the code was more testable

@goldmedal
Copy link
Contributor

take

@goldmedal goldmedal removed their assignment Aug 27, 2024
@goldmedal
Copy link
Contributor

I unassigned myself because I'm not very familiar with this topic (StringView). I'll keep digging into the issue, but if anyone has an idea for a solution, feel free to take over.

@goldmedal
Copy link
Contributor

Hi @alamb I just quickly drafted a PR for this issue, #12198
Instead of passing the config, I think we can check the schema from the table schema. I’m trying to coerce the UTF-8 field in the file schema if the corresponding field in the table schema is a UTF-8 view. Does that make sense?

@wiedld
Copy link
Contributor

wiedld commented Aug 29, 2024

Have an alternative solution, done in the process of fixing #12119. PR up shortly.

@wiedld
Copy link
Contributor

wiedld commented Sep 4, 2024

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants