Support parquet_metadata for datafusion-cli #8413

Veeupup · 2023-12-04T14:01:01Z

Which issue does this PR close?

Closes #8367

Function behaves just like duckdb.

The above is from duckdb, and the below is from datafusion.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: veeupup <code@tanweime.com>

Veeupup · 2023-12-04T14:01:39Z

@alamb PTAL : )

how do you like the display format?

alamb · 2023-12-04T14:16:09Z

Thank you @Veeupup -- I will check this out later today!

simonvandel

Cool feature!

datafusion-cli/src/functions.rs

alamb

Thank you @Veeupup and @simonvandel -- this is a great start.

I would like to suggest that we follow DuckDB for parquet_metadata and not reinvent the wheel / formatting here.

That that means one output row per file, per row group
Each row has 23 columns

Here is an example of what duckdbs output looks like:

D select * from read_parquet('./parquet-testing/data/int64_decimal.parquet');
┌───────────────┐
│     value     │
│ decimal(10,2) │
├───────────────┤
│          1.00 │
│          2.00 │
│          3.00 │
│          4.00 │
│          5.00 │
│          6.00 │
│          7.00 │
│          8.00 │
│          9.00 │
│         10.00 │
│         11.00 │
│         12.00 │
│         13.00 │
│         14.00 │
│         15.00 │
│         16.00 │
│         17.00 │
│         18.00 │
│         19.00 │
│         20.00 │
│         21.00 │
│         22.00 │
│         23.00 │
│         24.00 │
├───────────────┤
│    24 rows    │
└───────────────┘

D .mode markdown
D select * from parquet_metadata('./parquet-testing/data/int64_decimal.parquet');

file_name	row_group_id	row_group_num_rows	row_group_num_columns	row_group_bytes	column_id	file_offset	num_values	path_in_schema	type	stats_min	stats_max	stats_null_count	stats_distinct_count	stats_min_value	stats_max_value	compression	encodings	index_page_offset	dictionary_page_offset	data_page_offset	total_compressed_size	total_uncompressed_size
./parquet-testing/data/int64_decimal.parquet	0	24	1	241	0		24	value	INT64	1.00	24.00	0				UNCOMPRESSED	BIT_PACKED, RLE, PLAIN			4	241	241
D


Here is the schema

D describe select * from parquet_metadata('./parquet-testing/data/int64_decimal.parquet');



|       column_name       | column_type | null | key | default | extra |
|-------------------------|-------------|------|-----|---------|-------|
| file_name               | VARCHAR     | YES  |     |         |       |
| row_group_id            | BIGINT      | YES  |     |         |       |
| row_group_num_rows      | BIGINT      | YES  |     |         |       |
| row_group_num_columns   | BIGINT      | YES  |     |         |       |
| row_group_bytes         | BIGINT      | YES  |     |         |       |
| column_id               | BIGINT      | YES  |     |         |       |
| file_offset             | BIGINT      | YES  |     |         |       |
| num_values              | BIGINT      | YES  |     |         |       |
| path_in_schema          | VARCHAR     | YES  |     |         |       |
| type                    | VARCHAR     | YES  |     |         |       |
| stats_min               | VARCHAR     | YES  |     |         |       |
| stats_max               | VARCHAR     | YES  |     |         |       |
| stats_null_count        | BIGINT      | YES  |     |         |       |
| stats_distinct_count    | BIGINT      | YES  |     |         |       |
| stats_min_value         | VARCHAR     | YES  |     |         |       |
| stats_max_value         | VARCHAR     | YES  |     |         |       |
| compression             | VARCHAR     | YES  |     |         |       |
| encodings               | VARCHAR     | YES  |     |         |       |
| index_page_offset       | BIGINT      | YES  |     |         |       |
| dictionary_page_offset  | BIGINT      | YES  |     |         |       |
| data_page_offset        | BIGINT      | YES  |     |         |       |
| total_compressed_size   | BIGINT      | YES  |     |         |       |
| total_uncompressed_size | BIGINT      | YES  |     |         |       |

alamb · 2023-12-05T20:56:29Z

datafusion-cli/src/functions.rs

+            .iter()
+            .map(|rg| {
+                format!(
+                    "num_columns: {}, num_rows: {}, total_byte_size: {}, sorting_columns: {:?}",


every row group should have the same number of columns, for what it is worth

it seems that each column in each row_group will be one row in duckdb, so there will be Num(cols) * Num(row_group) rows

I'll just mock the duckdb display format then ~

Veeupup · 2023-12-06T14:27:25Z

@alamb @simonvandel Hi! I made parquet_metadata behave just like duckdb!

The above is from duckdb, and the below is from datafusion. You can click on the picture to see more clearly!

Signed-off-by: veeupup <code@tanweime.com>

alamb

I tried this out and it is quite neat. Thank you @Veeupup . This and all your previous PRs have been a joy to review.

I think this could be merged as is and we could make improvements as a follow on, as it is new functionality. However, it would also be neat to fix them prior to merge

I left suggestions about what needed to be fixed

Request 1: Can we make string literal arguments work? Something like

❯ select * from parquet_metadata('/Users/andrewlamb/Software/arrow-datafusion/datafusion/core/tests/data/clickbench_hits_10.parquet');
Error during planning: parquet_metadata requires string argument as its input

Request 2 Do not panic when statistics are not set

❯ select * from parquet_metadata("/Users/andrewlamb/Software/arrow-datafusion/datafusion/core/tests/data/clickbench_hits_10.parquet");
thread 'main' panicked at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-49.0.0/src/file/statistics.rs:474:27:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I highlighted the problem I think in comments

Request 3: Add some tests

It would be really nice if you could add some regression tests to make sure we don't break this in the future. For example, could we read and verify the data that is read from the two files (one that has min/max stats set and one that does not?)

Thanks again

datafusion-cli/src/functions.rs

alamb · 2023-12-06T22:57:51Z

datafusion-cli/src/main.rs

@@ -185,6 +186,8 @@ pub async fn main() -> Result<()> {
        ctx.state().catalog_list(),
        ctx.state_weak_ref(),
    )));
+    // register `parquet_metadata` table function to get metadata from parquet files
+    ctx.register_udtf("parquet_metadata", Arc::new(ParquetMetadataFunc {}));


alamb

Love it -- thank you @Veeupup -- really nice work

alamb · 2023-12-07T22:16:08Z

This feature was so cool I started using it, and I found one suggestion for improvement already: #8464

* Support parquet_metadata for datafusion-cli Signed-off-by: veeupup <code@tanweime.com> * make tomlfmt happy * display like duckdb Signed-off-by: veeupup <code@tanweime.com> * add test & fix single quote --------- Signed-off-by: veeupup <code@tanweime.com>

Support parquet_metadata for datafusion-cli

99f069d

Signed-off-by: veeupup <code@tanweime.com>

make tomlfmt happy

b367b9e

simonvandel suggested changes Dec 5, 2023

View reviewed changes

datafusion-cli/src/functions.rs Outdated Show resolved Hide resolved

datafusion-cli/src/functions.rs Show resolved Hide resolved

datafusion-cli/src/functions.rs Outdated Show resolved Hide resolved

alamb mentioned this pull request Dec 5, 2023

DataFusion weekly project plan (Andrew Lamb) - Dec 4, 2023 #8420

Closed

7 tasks

alamb reviewed Dec 5, 2023

View reviewed changes

display like duckdb

5516189

Signed-off-by: veeupup <code@tanweime.com>

Veeupup force-pushed the cli_parquet_metadata branch from dfbd127 to 5516189 Compare December 6, 2023 14:30

Veeupup requested review from alamb and simonvandel December 6, 2023 14:30

alamb approved these changes Dec 6, 2023

View reviewed changes

add test & fix single quote

5f7d8c3

Veeupup force-pushed the cli_parquet_metadata branch from d237d66 to 5f7d8c3 Compare December 7, 2023 14:36

alamb approved these changes Dec 7, 2023

View reviewed changes

alamb merged commit 6767ea3 into apache:main Dec 7, 2023
22 checks passed

alamb mentioned this pull request Dec 7, 2023

Improve string statistics display in datafusion-cli parquet_metadata #8464

Closed

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support parquet_metadata for datafusion-cli #8413

Support parquet_metadata for datafusion-cli #8413

Veeupup commented Dec 4, 2023 •

edited

Loading

Veeupup commented Dec 4, 2023

alamb commented Dec 4, 2023

simonvandel left a comment

alamb left a comment

alamb Dec 5, 2023

Veeupup Dec 6, 2023

Veeupup commented Dec 6, 2023

alamb left a comment

alamb Dec 6, 2023

alamb left a comment

alamb commented Dec 7, 2023

Support parquet_metadata for datafusion-cli #8413

Support parquet_metadata for datafusion-cli #8413

Conversation

Veeupup commented Dec 4, 2023 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Veeupup commented Dec 4, 2023

alamb commented Dec 4, 2023

simonvandel left a comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

alamb Dec 5, 2023

Choose a reason for hiding this comment

Veeupup Dec 6, 2023

Choose a reason for hiding this comment

Veeupup commented Dec 6, 2023

alamb left a comment

Choose a reason for hiding this comment

Request 1: Can we make string literal arguments work? Something like

Request 2 Do not panic when statistics are not set

Request 3: Add some tests

alamb Dec 6, 2023

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

alamb commented Dec 7, 2023

Veeupup commented Dec 4, 2023 •

edited

Loading