Implement parquet_metadata function in datafusion-cli #8367

Closed
alamb opened this issue Nov 29, 2023 · 4 comments · Fixed by #8413
Labels
enhancement New feature or request

Comments

Contributor

alamb commented Nov 29, 2023

Is your feature request related to a problem or challenge?

When exploring Parquet files using datafusion-cli, I would often like to see how they are structured (how many row groups, whether they have statistics, etc.).

Describe the solution you'd like

I would like to create new functions for exploring Parquet metadata using the new User Defined Table Functions (🙇 to @Veeupup) introduced in #8306.

Ideally we could implement something like DuckDB's parquet_metadata: https://duckdb.org/docs/data/parquet/overview

(I think parquet_schema is already covered by describe 'filename.parquet'.)

Note: I think this should be done in datafusion-cli (not core DataFusion).
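
For illustration, here is a minimal sketch of the kind of query this would enable, modeled on the DuckDB function linked above. The column names (row_group_id, row_group_num_rows, path_in_schema, stats_min, stats_max) come from DuckDB's parquet_metadata output. Whether a datafusion-cli version would expose the same columns is an assumption, and 'filename.parquet' is a placeholder path.

```sql
-- Assumption: column names follow DuckDB's parquet_metadata output;
-- a datafusion-cli implementation may name or order them differently.
-- 'filename.parquet' is a placeholder path.

-- One row per column chunk: its row group and its min/max statistics
SELECT row_group_id, row_group_num_rows, path_in_schema, stats_min, stats_max
FROM parquet_metadata('filename.parquet');

-- How many row groups does the file have?
SELECT count(DISTINCT row_group_id) AS num_row_groups
FROM parquet_metadata('filename.parquet');
```

Exposing the metadata as a table function would mean the usual SQL tools (filters, aggregates, joins) work on it without any dedicated CLI command.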

Describe alternatives you've considered

No response

Additional context

No response

alamb added the enhancement label Nov 29, 2023
alamb changed the title from "Implement `pa" to "Implement parquet_metadata function in datafusion-cli" Nov 29, 2023
Contributor Author

alamb commented Nov 29, 2023

This would also be a great test of the user-defined table function feature, to see if we can build something slightly more complicated than read_csv.

Contributor

Veeupup commented Nov 30, 2023

I can help with this ticket as a follow-up PR to #8306.

Contributor

Veeupup commented Nov 30, 2023

After this ticket is finished, maybe we can put together a list of built-in table functions to implement, such as:

  • read_csv
  • read_parquet
  • read_json
  • ...

Contributor Author

alamb commented Nov 30, 2023

I am not sure about read_csv etc. -- datafusion-cli can already do the equivalent of read_csv (and the other formats), just with a different syntax.

Instead of

select * from read_csv('foo.csv')

It uses

select * from 'foo.csv'
