Support saveAsTable for writing orc and parquet #1134

tgravescs · 2020-11-17T00:37:12Z

This adds support for saveAsTable and for create table as select (CTAS) SQL statements.
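For reference, these are the two user-facing paths this is meant to cover. The snippet is only a sketch and not code from this PR: the helper names and table names are made up for illustration, and it assumes an existing SparkSession (`spark`) and DataFrame (`df`).

```python
def ctas_statement(table, fmt, source):
    """Build a CREATE TABLE ... USING ... AS SELECT statement (hypothetical helper)."""
    if fmt not in ("parquet", "orc"):
        raise ValueError(f"unsupported format for this sketch: {fmt}")
    return f"CREATE TABLE {table} USING {fmt} AS SELECT * FROM {source}"


def write_both_ways(spark, df):
    """Exercise both paths; `spark` is a SparkSession and `df` a DataFrame."""
    # DataFrame API path: saveAsTable with an explicit format
    df.write.format("parquet").mode("overwrite").saveAsTable("example_parquet_table")

    # SQL path: create table as select from a temporary view
    df.createOrReplaceTempView("example_source")
    spark.sql(ctas_statement("example_orc_table", "orc", "example_source"))
```

Only `ctas_statement` can be run without a Spark session; `write_both_ways` needs a live session to do anything.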

This is mostly metadata operations followed by a call into the existing GpuInsertIntoHadoopFsRelationCommand.

I ended up copying Spark's DataSource and making a slightly modified GpuDataSource version of it. One of the main changes is that we pass in the provider and the file format, because we already have to figure those out in the GpuOverrides CreateDataSourceTableAsSelectCommandMeta.
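To illustrate the "resolve once, pass it along" idea, here is a minimal Python sketch. It is not the Scala code in this PR: `ResolvedWrite`, `resolve`, `tag_for_gpu`, `build_data_source`, and the `lookups` list are all hypothetical names, and the format names in the table are used only as plain strings.

```python
from dataclasses import dataclass

# Illustrative mapping from provider name to a GPU file format name.
GPU_FILE_FORMATS = {"parquet": "GpuParquetFileFormat", "orc": "GpuOrcFileFormat"}

# Records every resolution so a caller can see that it happens only once.
lookups = []


@dataclass(frozen=True)
class ResolvedWrite:
    provider: str
    gpu_file_format: str


def resolve(provider):
    """Look up the GPU file format for a provider name (case-insensitive)."""
    lookups.append(provider)
    key = provider.lower()
    if key not in GPU_FILE_FORMATS:
        raise ValueError(f"no GPU file format for provider: {provider}")
    return ResolvedWrite(key, GPU_FILE_FORMATS[key])


def tag_for_gpu(provider):
    """Planning step: returns the resolved pair, or None to stay on the CPU."""
    try:
        return resolve(provider)
    except ValueError:
        return None


def build_data_source(resolved):
    """Later step: takes the already resolved pair instead of looking it up again."""
    return f"{resolved.gpu_file_format} writer for provider '{resolved.provider}'"
```

Calling `tag_for_gpu("Parquet")` and handing its result to `build_data_source` leaves a single entry in `lookups`, which is the point of passing the pair in.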

Other than that, I split the parquet and orc tests into separate read and write files and added a bunch of different write tests.
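The general shape of a write test is: write the same data with the plugin off and on, read both back, and compare. The sketch below is not the test code from this PR. The helper names are hypothetical, and it assumes a SparkSession with the plugin loaded where `spark.rapids.sql.enabled` can be toggled at runtime.

```python
def rows_match(cpu_rows, gpu_rows):
    """Compare two collected results, ignoring row order."""
    def normalize(rows):
        return sorted((tuple(r) for r in rows), key=repr)
    return normalize(cpu_rows) == normalize(gpu_rows)


def write_then_read(spark, make_df, path, fmt, gpu_enabled):
    """Write with the plugin on or off, then read the files back."""
    spark.conf.set("spark.rapids.sql.enabled", str(gpu_enabled).lower())
    make_df(spark).write.format(fmt).mode("overwrite").save(path)
    # Read back with the plugin off so only the write path differs between runs
    spark.conf.set("spark.rapids.sql.enabled", "false")
    return spark.read.format(fmt).load(path).collect()


def check_write(spark, make_df, base_path, fmt):
    """Fail if the CPU-written and GPU-written data differ."""
    cpu = write_then_read(spark, make_df, f"{base_path}/cpu", fmt, False)
    gpu = write_then_read(spark, make_df, f"{base_path}/gpu", fmt, True)
    assert rows_match(cpu, gpu), f"{fmt} output differs between CPU and GPU writes"
```

`rows_match` sorts by `repr` so that rows containing nulls can still be ordered.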

I had to add a shim function because https://issues.apache.org/jira/browse/SPARK-32431 went into Spark 3.1 and that changed the call into checkColumnNameDuplication.
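The shim idea, roughly: pick an implementation based on the running Spark version so the rest of the code calls one stable method. The Python below is only a sketch of that pattern. The class and function names are hypothetical, and both shims share a stand-in duplicate check, because in the real code only the call into Spark differs.

```python
def parse_version(version):
    """Turn '3.1.0' or '3.1.0-SNAPSHOT' into a comparable tuple like (3, 1, 0)."""
    core = version.split("-")[0]
    return tuple(int(part) for part in core.split("."))


def _check_duplicates(names, case_sensitive):
    """Stand-in for Spark's check: raise on a repeated column name."""
    seen = set()
    for name in names:
        key = name if case_sensitive else name.lower()
        if key in seen:
            raise ValueError(f"duplicate column: {name}")
        seen.add(key)


class Spark30Shim:
    def check_column_name_duplication(self, names, case_sensitive):
        # Placeholder for the call shape used before SPARK-32431
        _check_duplicates(names, case_sensitive)


class Spark31Shim:
    def check_column_name_duplication(self, names, case_sensitive):
        # Placeholder for the call shape used after SPARK-32431
        _check_duplicates(names, case_sensitive)


def get_shim(spark_version):
    """Select the shim for the running Spark version."""
    if parse_version(spark_version) >= (3, 1, 0):
        return Spark31Shim()
    return Spark30Shim()
```

Callers would only ever use `get_shim(version).check_column_name_duplication(...)`, so a change on the Spark side stays contained in one class.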

Also note that Spark hasn't enabled any of the datasource v2 writers; it always falls back to the v1 version at this point.

fixes #1096

Signed-off-by: Thomas Graves <tgraves@apache.org>

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

tgravescs · 2020-11-17T13:59:07Z

build

revans2

For the code that was copied and pasted from Spark, I mostly checked that it matched what Spark has. I am a little concerned about the amount of code that we have copied over, but it should be fine for now.

tgravescs and others added 26 commits November 11, 2020 14:05

start saveAsTable

caa62bb

Add GpuDataSource

afc5076

columnar ifle format

563da0e

Update to GpuFileFormat

33c56ef

fix typo

1c55e0a

logging

22accae

more logging

77631d7

change format parquet

d791d2e

fix classof

8fe568e

fix run to runColumnar

5338861

using original providing instance for end

1e2f7c4

remove unneeded code and pass in providers so don't calculate twice

554646a

create shim for SchemaUtils checkSchemaColumnNameDuplication

1567c30

Signed-off-by: Thomas Graves <tgraves@apache.org>

fix typo with checkSchemaColumnNameDuplication

1e30a7d

fix name

4790771

fix calling

a941be4

fix anothername

5031682

fix none

ab4df19

Fix provider vs FileFormat

6f9f50d

split read/write tests

bdea8f2

Write a bunch more tests for orc and parquet writing

d080202

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

cleanup and csv test

5770c98

Add more test

e433e54

Add bucket write test

e72f1d5

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

remove debug logs

fc8d1f3

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

Update for spark 3.1.0

14fee28

revans2 approved these changes Nov 17, 2020

jlowe added the SQL label (part of the SQL/Dataframe plugin) Nov 17, 2020

jlowe added this to the Nov 9 - Nov 20 milestone Nov 17, 2020

jlowe approved these changes Nov 17, 2020

jlowe merged commit 2112f7c into NVIDIA:branch-0.3 Nov 17, 2020

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023

Update submodule cudf to 7575e8da54499990b51535a0f975acd02a493144 (NVIDIA#1134)

e017670

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
