
Systematic Configuration in 'Create External Table' and 'Copy To' Options #9382

Merged · 35 commits · Mar 12, 2024

Conversation

@metesynnada (Contributor) commented Feb 28, 2024

Which issue does this PR close?

Closes #8994.
Related to #9369 as well.

Rationale for this change

Currently, the configuration options for 'Create External Table' and 'Copy To' are not systematically organized, which leads to confusion and unnecessary complexity for users.

What changes are included in this PR?

  • Introduced document-generatable table options in the same style as the 'datafusion.x' configs (see the standalone sketch after this list).
  • FileFormat now supports all configuration coming from the SQL syntax. File formats can create Writers the same way they create Readers, which removes the writer from FileSinkConfig.
  • TableOptions is extensible, which lets us unify object_store configurations such as AWS settings.
  • Removed StatementOptions.
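
As a rough standalone illustration of this direction — not DataFusion's actual code; the types and key routing below are simplified assumptions — a prefix-based, extensible options namespace could look like this:

// Standalone sketch only -- not the real TableOptions implementation.
// String keys such as "format.csv.has_header" are routed to typed fields,
// while unrecognized prefixes (e.g. "aws.*") land in an extension map.
use std::collections::HashMap;

#[derive(Debug, Default)]
struct CsvOptions {
    has_header: bool,
    quote: Option<char>,
}

#[derive(Debug, Default)]
struct TableOptions {
    csv: CsvOptions,
    /// Extension options (object stores etc.), keyed by their full name.
    extensions: HashMap<String, String>,
}

impl TableOptions {
    fn set(&mut self, key: &str, value: &str) -> Result<(), String> {
        if let Some(csv_key) = key.strip_prefix("format.csv.") {
            match csv_key {
                "has_header" => {
                    self.csv.has_header = value.parse().map_err(|e| format!("{e}"))?
                }
                "quote" => self.csv.quote = value.chars().next(),
                other => return Err(format!("unknown csv option: {other}")),
            }
            Ok(())
        } else {
            // Anything else (e.g. "aws.access_key_id") is an extension option.
            self.extensions.insert(key.to_owned(), value.to_owned());
            Ok(())
        }
    }
}

fn main() {
    let mut opts = TableOptions::default();
    opts.set("format.csv.has_header", "true").unwrap();
    opts.set("aws.access_key_id", "abc").unwrap();
    println!("{opts:?}");
}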

Issues I would like to discuss:

  • write_csv, write_parquet, etc. do not have counterparts to CsvReadOptions or ParquetReadOptions. I think we can add structures (like a CsvWriteOptions) to ensure a symmetric design, which reduces friction for users — see the hypothetical sketch after this list.
  • Limiting the use of free-standing strings.
  • Separating the options into scan and sink variants, like format.csv.sink.dateformat or format.csv.scan.has_header.
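
For the first item, a hypothetical CsvWriteOptions could mirror the existing CsvReadOptions builder. Nothing below exists in DataFusion; the struct, fields, and method names are invented purely to illustrate the proposed symmetric design:

// Hypothetical sketch -- CsvWriteOptions is not a DataFusion type; it is
// shaped by analogy with the CsvReadOptions builder.
#[derive(Clone, Debug)]
pub struct CsvWriteOptions {
    pub has_header: bool,
    pub delimiter: u8,
    pub date_format: Option<String>,
}

impl Default for CsvWriteOptions {
    fn default() -> Self {
        Self { has_header: true, delimiter: b',', date_format: None }
    }
}

impl CsvWriteOptions {
    pub fn new() -> Self {
        Self::default()
    }
    pub fn with_header(mut self, has_header: bool) -> Self {
        self.has_header = has_header;
        self
    }
    pub fn with_delimiter(mut self, delimiter: u8) -> Self {
        self.delimiter = delimiter;
        self
    }
    pub fn with_date_format(mut self, fmt: impl Into<String>) -> Self {
        self.date_format = Some(fmt.into());
        self
    }
}

A write_csv accepting such a struct would then read symmetrically to the read path, e.g. df.write_csv(path, CsvWriteOptions::new().with_header(false)) (a hypothetical signature).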

Note: This PR does not compile yet; it has a large impact on the proto file and its conversions, which I will implement after we settle on a design.

Are these changes tested?

With existing tests.

Are there any user-facing changes?

Yes, there is a breaking change here: COPY options other than partitioned_by and format must now have a prefix.

COPY source_table
TO 'test_files/scratch/copy/table_with_options/'
(format parquet,
'format.parquet.compression' snappy,
'format.parquet.compression::col1' 'zstd(5)',
'format.parquet.compression::col2' snappy,
'format.parquet.max_row_group_size' 12345,
'format.parquet.data_pagesize_limit' 1234,
'format.parquet.write_batch_size' 1234,
'format.parquet.writer_version' 2.0,
'format.parquet.dictionary_page_size_limit' 123,
'format.parquet.created_by' 'DF copy.slt',
'format.parquet.column_index_truncate_length' 123,
'format.parquet.data_page_row_count_limit' 1234,
'format.parquet.bloom_filter_enabled' true,
'format.parquet.bloom_filter_enabled::col1' false,
'format.parquet.bloom_filter_fpp::col2' 0.456,
'format.parquet.bloom_filter_ndv::col2' 456,
'format.parquet.encoding' plain,
'format.parquet.encoding::col1' DELTA_BINARY_PACKED,
'format.parquet.dictionary_enabled::col2' true,
'format.parquet.dictionary_enabled' false,
'format.parquet.statistics_enabled' page,
'format.parquet.statistics_enabled::col2' none,
'format.parquet.max_statistics_size' 123,
'format.parquet.bloom_filter_fpp' 0.001,
'format.parquet.bloom_filter_ndv' 100
)

or CREATE EXTERNAL TABLE

CREATE EXTERNAL TABLE csv_with_quote (
c1 VARCHAR,
c2 VARCHAR
) STORED AS CSV
WITH HEADER ROW
DELIMITER ','
OPTIONS ('format.csv.quote' '~')
LOCATION '../core/tests/data/quote.csv';

@metesynnada marked this pull request as draft — February 28, 2024 10:18
@github-actions bot added labels: sql (SQL Planner), logical-expr (Logical plan and expressions), core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)) — Feb 28, 2024
@metesynnada (Contributor, Author) commented:

@alamb @devinjdangelo if you have time, a look would be perfect. Please ignore proto errors, I just want to start a discussion around this.

@alamb (Contributor) commented Feb 28, 2024

> @alamb @devinjdangelo if you have time, a look would be perfect. Please ignore proto errors, I just want to start a discussion around this.

Thanks @metesynnada -- could you highlight some examples in this PR where the user API changed that you would like feedback on? I am being lazy and asking you in the hope that you can do this without much effort, rather than trying to grok the PR myself. If you don't have the time, I will find some time to do so.

@alamb (Contributor) commented Feb 28, 2024

> Now, COPY options besides partitioned_by and format must have a prefix.

Is there a reason we need to repeat the format so much? It looks quite repetitive. Maybe we could default it / automatically try to add it? (A sketch of that inference follows.)
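
A sketch of the suggested inference — the function name and behaviour here are illustrative assumptions, not code from this PR:

// Illustrative only: default the "format.<fmt>." prefix onto bare option
// keys, so 'compression' becomes 'format.parquet.compression' when the
// statement-level format is parquet.
fn qualify_option_key(key: &str, statement_format: &str) -> String {
    if key.starts_with("format.") {
        key.to_owned()
    } else {
        format!("format.{statement_format}.{key}")
    }
}

fn main() {
    assert_eq!(
        qualify_option_key("compression", "parquet"),
        "format.parquet.compression"
    );
    assert_eq!(
        qualify_option_key("format.csv.has_header", "csv"),
        "format.csv.has_header"
    );
}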

@@ -33,7 +33,7 @@ c2 VARCHAR
) STORED AS CSV
WITH HEADER ROW
DELIMITER ','
OPTIONS ('escape' '\"')
@metesynnada (Contributor, Author) commented on this diff:

This test was wrong: the previous code used as_bytes()[0], but we now reject the value outright if it is more than one character. \ is not an escape character in .slt files.

max_statistics_size 123,
bloom_filter_fpp 0.001,
bloom_filter_ndv 100
'format.parquet.compression' snappy,
@metesynnada (Contributor, Author) commented on this diff:

This is a solid example of how we change the option handling.

A Contributor commented:

It would be nice if the config options didn't need to be literal quoted e.g. 'format.parquet.compression'. I think it would be ideal if DFParser were aware of the custom syntax we are creating for options including the dots and double colons, so a user could simply write:

COPY table
TO 'file.parquet'
(
format.parquet.compression snappy,
format.parquet.compression::col1 zstd(5)
)

Related: #9274.

@metesynnada (Contributor, Author) replied:

Sounds good. We can enhance this behaviour after this PR.

@ozankabak (Contributor) left a review:

I carefully reviewed this PR and it LGTM.

After we merge this, let's make a quick follow-on PR to handle the following:

  1. Supporting options/configuration items without quotes (foo.bar.baz instead of 'foo.bar.baz').
  2. Supporting both WITH HEADER ROW and OPTIONS (csv.has_header true), with the former overriding the latter when both are present.
  3. Making the syntax uniform so that COPY TO and elsewhere use STORED AS PARQUET, and making OPTIONS (parquet.foo.bar baz) legal only when the format is specified through the STORED AS clause. This way we avoid the weird format parquet entry inside the options clause altogether.

Other than these three items, which we will tackle in the follow-on, I left only a few minor in-line comments to fix/improve in this PR.

Great work, @metesynnada 🚀

Review threads on datafusion/sqllogictest/test_files/copy.slt and datafusion/sql/src/statement.rs were resolved.
@metesynnada (Contributor, Author) commented:

I have addressed @ozankabak's review.

@alamb added the api change (Changes the API exposed to users of the crate) label — Mar 12, 2024
@alamb (Contributor) left a review:

Thank you @metesynnada

I skimmed this PR (I didn't study the implementation carefully, but I did look at the tests closely) and I think it is a very nice improvement and consolidation of the configuration. I think the pattern you introduce will extend very well.

I think it might make sense to update the documentation in https://arrow.apache.org/datafusion/user-guide/sql/write_options.html#available-options a bit to reflect the new format prefix.

I also marked the PR as "api-change", as the names of configuration settings now carry the appropriate format prefix:

COPY source_table TO 'test_files/scratch/copy/table/' (format parquet, 'parquet.compression' 'zstd(10)');

Instead of

COPY source_table TO 'test_files/scratch/copy/table/' (format parquet, 'compression' 'zstd(10)');

}

async fn create_external_table(
/// Asynchronously registers an object store and its configuration extensions
A Contributor commented on this diff:

❤️

@@ -396,12 +455,12 @@ mod tests {

// Missing region, use object_store defaults
let sql = format!("CREATE EXTERNAL TABLE test STORED AS PARQUET
OPTIONS('access_key_id' '{access_key_id}', 'secret_access_key' '{secret_access_key}') LOCATION '{location}'");
OPTIONS('aws.access_key_id' '{access_key_id}', 'aws.secret_access_key' '{secret_access_key}') LOCATION '{location}'");
A Contributor commented on this diff:
I think prefixing the configuration with a cloud-provider-specific prefix makes sense.


config_namespace_with_hashmap! {
pub struct ColumnOptions {
/// Sets if bloom filter is enabled for the column path.
A Contributor commented on this diff:

This is really nice to see in structured form here.

.with_header(value.has_header)
.with_delimiter(value.delimiter);

if let Some(v) = &value.date_format {
A Contributor commented on this diff:

This is a very nice example of how the new programmatic API improves things.

@@ -65,7 +65,7 @@ SHOW datafusion.execution.batch_size
datafusion.execution.batch_size 1

# set variable unknown variable
statement error DataFusion error: External error: could not find config namespace for key "aabbcc"
statement error DataFusion error: Invalid or Unsupported Configuration: could not find config namespace for key "aabbcc"
A Contributor commented on this diff:

👍

@alamb (Contributor) commented Mar 12, 2024

Given the large potential for conflicts with this PR, I am merging it now. Thanks again, everyone. I also have a follow-on idea that I will file.

@alamb merged commit 27b8a49 into apache:main — Mar 12, 2024
23 checks passed
@alamb (Contributor) commented Mar 12, 2024

Filed #9575 for backwards compatibility for options

@progval (Contributor) commented Mar 16, 2024

I noticed two other breaking changes with Parquet bloom filters (and possibly other options too):

  1. Previously, when create external table ran before set datafusion.execution.parquet.bloom_filter_enabled = true, Bloom filters were still used in queries issued after the set. Now they are only used if datafusion.execution.parquet.bloom_filter_enabled was set first.
  2. After this PR, set datafusion.execution.parquet.bloom_filter_enabled = true has no effect on "anonymous" tables (i.e. SELECT * FROM 'path/to/file.parquet' WHERE key = ...); they never seem to use Bloom filters.
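
A minimal sequence illustrating the first report — an editor's sketch, not code from the thread; the parquet path is hypothetical and the datafusion/tokio crates are assumed:

// Sketch of the ordering-sensitive behaviour described above.
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Table created *before* the SET: after this PR, queries on `t` no
    // longer pick up the bloom filter setting changed below.
    ctx.sql("CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION 'path/to/file.parquet'")
        .await?;
    ctx.sql("SET datafusion.execution.parquet.bloom_filter_enabled = true")
        .await?;
    // Previously this query used Bloom filters; now it only would if the
    // SET had run before CREATE EXTERNAL TABLE.
    ctx.sql("SELECT * FROM t WHERE key = 42").await?.show().await?;
    Ok(())
}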

@ozankabak (Contributor) commented:

Thanks for bringing these to our attention, @progval. I will discuss with @metesynnada to get more clarity on how to proceed (which of the changes you observed make sense, and which don't and should be fixed).

@metesynnada (Contributor, Author) commented:

> I noticed two other breaking changes with Parquet bloom filters (and possibly other options too): […]

I solved these issues in #9604 — can you check whether this is the expected behaviour?

@progval (Contributor) commented Mar 18, 2024

@metesynnada It is, thanks!

@progval (Contributor) commented Mar 18, 2024

Well, issue 2 is. Issue 1 is unchanged.

Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request May 1, 2024
The options to write_parquet changed.

write_json has a new argument that I defaulted to None. We can expose that config later.

Ref: apache/datafusion#9382
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request May 8, 2024
The options to write_parquet changed.

write_json has a new argument that I defaulted to None. We can expose that config later.

Ref: apache/datafusion#9382
andygrove pushed a commit to apache/datafusion-python that referenced this pull request May 8, 2024
* deps: upgrade datafusion to 37.1.0

* feat: re-implement SessionContext::tables

The method was removed upstream but is used in many tests for `datafusion-python`.

Ref: apache/datafusion#9627

* feat: upgrade dataframe write_parquet and write_json

The options to write_parquet changed.

write_json has a new argument that I defaulted to None. We can expose that config later.

Ref: apache/datafusion#9382

* feat: impl new ExecutionPlanProperties for DatasetExec

Ref: apache/datafusion#9346

* feat: add upstream variant and method params

- `WindowFunction` and `AggregateFunction` have `null_treatment` options.
- `ScalarValue` and `DataType` have new variants
- `SchemaProvider::table` now returns a `Result`

* lint: allow(deprecated) for make_scalar_function

* feat: migrate functions.rs

`datafusion` completed an Epic that ported many of the BuiltInFunctions enum variants to ScalarUDF.

I created new macros to simplify the port, and used these macros to refactor a few existing functions.

Ref: apache/datafusion#9285

* fixme: commented out last failing test

This is a bug upstream in datafusion

FAILED datafusion/tests/test_functions.py::test_array_functions - pyo3_runtime.PanicException: range end index 9 out of range for slice of length 8

* chore: update Cargo.toml package info
Labels: api change (Changes the API exposed to users of the crate), core (Core DataFusion crate), logical-expr (Logical plan and expressions), sql (SQL Planner), sqllogictest (SQL Logic Tests (.slt)), substrait
Development

Successfully merging this pull request may close these issues: Discussion for a Systematic Configuration in 'Create External Table' Options
7 participants