Support saveAsTable for writing orc and parquet #1134

Merged
merged 26 commits into from
Nov 17, 2020

tgravescs
Collaborator

@tgravescs tgravescs commented Nov 17, 2020

This adds support for saveAsTable and for CREATE TABLE ... AS SELECT SQL statements.
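
For reference, the two entry points this change targets look roughly like this from PySpark. This is a minimal sketch, not code from the PR: the helper names `save_as_table` and `ctas_sql`, the table names, and the overwrite mode are assumptions chosen for illustration. `DataFrameWriter.format`, `mode`, and `saveAsTable` are the standard Spark writer calls.

```python
def save_as_table(df, table_name, fmt="parquet", mode="overwrite"):
    """Write a DataFrame as a table through the DataFrameWriter API.

    `df` is expected to be a pyspark.sql.DataFrame. `fmt` would be
    "parquet" or "orc" for the formats this PR covers. The helper name
    and defaults are hypothetical.
    """
    df.write.format(fmt).mode(mode).saveAsTable(table_name)


def ctas_sql(table_name, source, fmt="parquet"):
    """Build the matching CREATE TABLE ... AS SELECT statement (hypothetical helper)."""
    return f"CREATE TABLE {table_name} USING {fmt} AS SELECT * FROM {source}"
```

With a live session this would be used as `save_as_table(df, "t_orc", fmt="orc")` or `spark.sql(ctas_sql("t_parquet", "src_view"))`, where `spark`, `df`, and `src_view` are assumed to exist already.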

This is mostly metadata operations followed by a call into the existing GpuInsertIntoHadoopFsRelationCommand.

I ended up copying Spark's DataSource class into a slightly modified GpuDataSource version. One of the main changes is that we pass in the provider and the file format, because we already have to figure those out in the GpuOverrides CreateDataSourceTableAsSelectCommandMeta.

Other than that, I split the parquet and orc tests into separate read and write files, and I added a number of new write tests.

I had to add a shim function because https://issues.apache.org/jira/browse/SPARK-32431 went into Spark 3.1 and that changed the call into checkColumnNameDuplication.

Also note that Spark hasn't enabled any of the DataSource V2 writers; it always falls back to the V1 version at this point.

fixes #1096

@tgravescs
Collaborator Author

build

Collaborator

@revans2 revans2 left a comment

For the code that was copied from Spark, I mostly checked that it matched what Spark has. I am a little concerned about the amount of code we have copied over, but it should be fine for now.

@jlowe jlowe added the SQL part of the SQL/Dataframe plugin label Nov 17, 2020
@jlowe jlowe added this to the Nov 9 - Nov 20 milestone Nov 17, 2020
@jlowe jlowe merged commit 2112f7c into NVIDIA:branch-0.3 Nov 17, 2020
sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this pull request Nov 20, 2020
* start saveAsTable

* Add GpuDataSource

* columnar file format

* Update to GpuFileFormat

* fix typo

* logging

* more logging

* change format parquet

* fix classof

* fix run to runColumnar

* using original providing instance for end

* remove unneeded code and pass in providers so don't calculate twice

* create shim for SchemaUtils checkSchemaColumnNameDuplication

Signed-off-by: Thomas Graves <tgraves@apache.org>

* fix typo with checkSchemaColumnNameDuplication

* fix name

* fix calling

* fix anothername

* fix none

* Fix provider vs FileFormat

* split read/write tests

* Write a bunch more tests for orc and parquet writing

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* cleanup and csv test

* Add more test

* Add bucket write test

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* remove debug logs

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Update for spark 3.1.0
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#1134)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Labels
SQL part of the SQL/Dataframe plugin
Development

Successfully merging this pull request may close these issues.

[FEA] Implement parquet CreateDataSourceTableAsSelectCommand
3 participants