Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable creating and inserting to empty external tables via SQL #7276

Merged
merged 2 commits into from
Aug 14, 2023

Conversation

devinjdangelo
Copy link
Contributor

@devinjdangelo devinjdangelo commented Aug 13, 2023

Which issue does this PR close?

Follow on to #7244

Related to #7228 and #7036

Note: does not close #7228 since it does not provide a mechanism to create external table target if it does not exist. It does offer a partial workaround since you can now specify OPTIONS (insert_mode append_new_files) when creating external tables so insert into will add new files to the table.

Rationale for this change

Previous PRs adding support for inserting to ListingTables have not adequately tested an end to end process of creating an empty external table from SQL and subsequently inserting data into it. This PR adds test cases for this and adds a few enhancements / solves a few bugs to make the test cases pass.

What changes are included in this PR?

  • New test cases which create empty external table and insert data into it
  • Remove logic throwing error when parquet schema is specified via SQL since it is now supported
  • Support setting ListingTableInsertMode via Create External Table ... OPTIONS () statement
  • Update old test cases expecting setting parquet schema to fail
  • Fix bug in CsvSink which set header to always be true even if table specified this explicitly as false

Are these changes tested?

Yes

Are there any user-facing changes?

Creating empty external tables and inserting to them via SQL is possible now.

@github-actions github-actions bot added sql SQL Planner core Core DataFusion crate labels Aug 13, 2023
.collect()
.await?;

let expected = vec![
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very cool.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @devinjdangelo -- this code is both nicely written and well tested and I think moves DataFusion forward. We are still not quite there but we are so close

Thank you for making progress


/// tests insert into with end to end sql
/// create external table + insert into statements
async fn helper_test_insert_into_sql(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could use sqllogictest to test this as well --

Perhaps add to https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/sqllogictests/test_files/insert.slt

I can't remember how temporary directories work in sqllogictests though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually worked on this in #7283 which I just opened for copy.slt. Once those updates are in, I can cut another PR to add additional tests to insert.slt

fn create_external_table_parquet_sort_order() {
let sql = "create external table foo(a varchar, b varchar, c timestamp) stored as parquet location '/tmp/foo' with order (c)";
let expected = "CreateExternalTable: Bare { table: \"foo\" }";
quick_test(sql, expected);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I tried to run this locally it doesn't quite work

❯ create external table t(x integer, y varchar) stored as parquet location '/tmp/foo' with order (x);
0 rows in set. Query took 0.001 seconds.

❯ select * from t;
0 rows in set. Query took 0.002 seconds.

❯ insert into t values (1, 'foo'), (2, 'bar');
This feature is not implemented: Writing to a sorted listing table via insert into is not supported yet. To write to this table in the meantime, register an equivalent table with file_sort_order = vec![]

But it is getting very close.

Copy link
Contributor Author

@devinjdangelo devinjdangelo Aug 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that error I added intentionally. Inserting to a sorted table would work, but there is nothing to enforce that the sort order is preserved yet. So, my concern is a user inserts unsorted data to a sorted table, and then subsequent queries return incorrect results surprisingly.

We could add this as an issue to the streaming writes epic (support inserts to a sorted listingtable).

@alamb alamb merged commit 3bbf48a into apache:main Aug 14, 2023
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create new empty external table
2 participants