
Add option to control number of partitions when converting from CSV to Parquet #915

Merged
merged 5 commits into branch-0.3 from max-partitions-csv-to-parquet on Oct 15, 2020

Conversation

@andygrove andygrove (Contributor) commented Oct 7, 2020

Signed-off-by: Andy Grove andygrove@nvidia.com

When we convert TPC-* CSV files to Parquet, the number and size of the Parquet files created are not consistent and depend on the number of executor cores running the conversion. This can lead to thousands of small files being created, which is not optimal for GPU processing.

This PR adds the option to control the number of partitions per table using coalesce or repartition.

Also, all three TPC-* benchmarks now have a ConvertFiles object with a main method, so the file conversion can be submitted via spark-submit with command-line arguments for all available options. For TPC-DS, there is an option to use partitioning when creating the Parquet files, since the underlying code supports that.

This closes #902
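
For illustration, here is a minimal sketch of the approach (the helper name and signature are hypothetical, modeled on the PR's per-table coalesce/repartition maps; only standard Spark APIs are used):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper: apply at most one partition control before writing.
// coalesce() narrows partitions without a shuffle; repartition() shuffles
// fully and can also grow the partition count.
def writeParquet(
    df: DataFrame,
    outputPath: String,
    coalesceTo: Option[Int],
    repartitionTo: Option[Int]): Unit = {
  val repart = (coalesceTo, repartitionTo) match {
    case (Some(n), None) => df.coalesce(n)
    case (None, Some(n)) => df.repartition(n)
    case (None, None)    => df
    case _ => throw new IllegalArgumentException(
      "specify at most one of coalesce and repartition")
  }
  repart.write.mode(SaveMode.Overwrite).parquet(outputPath)
}
```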

@andygrove andygrove added this to the Sep 28 - Oct 9 milestone Oct 7, 2020
@andygrove andygrove self-assigned this Oct 7, 2020
@andygrove andygrove added the benchmark (Benchmarking, benchmarking tools) label and removed the benchmarking label Oct 7, 2020
@@ -60,10 +62,14 @@ case class Table(
  private def setupWrite(
      spark: SparkSession,
      inputBase: String,
      maxPartitions: Option[Int],
Collaborator

Any reason not to also support growing the number of partitions? Basically, just make this numPartitions. You would need to use repartition in that case. I would find this useful for things like small-file testing, where I actually want to grow the number of partitions.

Contributor Author

Good point. I can look at this later today.

Contributor Author

@tgravescs I'm looking at this now, and I can see multiple options for doing this.

Option 1: Add specific command-line arguments for coalesce vs repartition. For the issue where some tables have too many files, coalesce is much more efficient than repartition, and for the tables that already have a small number of files, I don't want to increase the number.

Option 2: Dynamically look at the number of partitions in each table and then decide whether to coalesce or repartition to achieve the desired number of partitions, but this would mean that we could end up increasing partitions for some tables and decreasing for others.

I think option 1 is more explicit and would work well for both of our use cases. I'll start down this path, but let me know what you think.
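
To make the trade-off concrete, a small self-contained snippet (illustrative only; runnable in spark-shell or a local app):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("partition-demo").getOrCreate()
val df = spark.range(1000000L).toDF("id").repartition(2000)

// coalesce merges existing partitions locally (no shuffle), so it can only
// shrink the count -- cheap for collapsing thousands of small files (option 1).
println(df.coalesce(8).rdd.getNumPartitions)      // 8

// repartition performs a full shuffle, so it can also grow the count --
// what the small-file-testing case needs, at a higher cost.
println(df.repartition(200).rdd.getNumPartitions) // 200
```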

Member

I've used coalesce in the past and it works great, but it can be problematic when applied blindly across all tables in the dataset. For example, these benchmarks typically contain many tables that scale very slowly or not at all (e.g., tables containing regions, warehouse locations, etc.) that, for best performance, should just be one file. But there are other tables at the largest scale (e.g., tables containing individual sales records) that would be disastrous to force into a single file.

This may need to be handled in a benchmark-specific way, e.g., tiny tables that don't really scale with the rest of the data are always coalesced into a single file, and there are two user-specified settings: one for the number of partitions to use for "medium" scale tables, and another for "large" scale tables.
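
A sketch of that benchmark-specific idea (the table names and tier boundaries here are illustrative assumptions, not anything this PR implements):

```scala
// Tiny dimension tables that barely scale collapse to a single file, while
// medium and large fact tables get user-specified partition counts.
def targetPartitions(
    table: String,
    mediumPartitions: Int,
    largePartitions: Int): Int = table match {
  case "region" | "warehouse"      => 1                // scales slowly or not at all
  case "store_sales" | "web_sales" => largePartitions  // largest-scale tables
  case _                           => mediumPartitions
}
```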

@andygrove andygrove (Contributor Author) Oct 12, 2020

Perhaps it would make sense to allow the --coalesce and --repartition arguments to operate on individual tables as a starting point, e.g. --coalesce table1=1 table2=24 --repartition table3=200, so that we can fully control partition counts manually, and then build higher-level logic on top in the future to do this in a more automated way.
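
A hypothetical parser for that syntax, where each argument value is a name=count pair collected into a map (the function name is an assumption, not the merged code):

```scala
// Parses e.g. Seq("table1=1", "table2=24") into Map("table1" -> 1, "table2" -> 24),
// suitable for both the --coalesce and --repartition argument lists.
def parsePartitionArgs(pairs: Seq[String]): Map[String, Int] =
  pairs.map { pair =>
    pair.split("=", 2) match {
      case Array(table, n) => table -> n.toInt
      case _ => throw new IllegalArgumentException(s"expected table=count, got '$pair'")
    }
  }.toMap
```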

@andygrove andygrove changed the title [WIP] Add option to limit number of partitions when converting from CSV to Parquet Add option to limit number of partitions when converting from CSV to Parquet Oct 8, 2020
@andygrove andygrove changed the title Add option to limit number of partitions when converting from CSV to Parquet Add option to control number of partitions when converting from CSV to Parquet Oct 13, 2020
@andygrove (Contributor Author)

These changes have been tested with TPC-DS but not with TPC-H or TPC-xBB.

@andygrove (Contributor Author)

build

.write
.mode("overwrite")
val df = readCSV(spark, inputBase)
val repart = (coalesce.get(name), repartition.get(name)) match {
Collaborator

So if I specify it in both, I get the coalesce? Perhaps we should error?
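
One way the overlap could be rejected up front (a sketch with assumed names, not necessarily how the merged code handles it):

```scala
// Fail fast if any table is named in both --coalesce and --repartition.
def validatePartitionArgs(
    coalesce: Map[String, Int],
    repartition: Map[String, Int]): Unit = {
  val both = coalesce.keySet.intersect(repartition.keySet)
  if (both.nonEmpty) {
    throw new IllegalArgumentException(
      s"tables listed in both --coalesce and --repartition: ${both.mkString(", ")}")
  }
}
```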

@andygrove (Contributor Author)

I'm not sure what happened, but after applying the suggested changes, I see other changes in this PR. I will rebase.

Add command-line arguments for applying coalesce and repartition on a per-table basis

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove force-pushed the max-partitions-csv-to-parquet branch from 2551bad to 3186f39 on October 14, 2020 14:55
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove (Contributor Author)

@jlowe @tgravescs Thanks for the reviews. I've addressed your feedback now.

@andygrove (Contributor Author)

build

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove (Contributor Author)

build

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
@andygrove (Contributor Author)

build

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@jlowe (Member)

jlowe commented Oct 15, 2020

build

@tgravescs tgravescs merged commit 24ce1c9 into NVIDIA:branch-0.3 Oct 15, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Oct 20, 2020
Add option to control number of partitions when converting from CSV to Parquet (NVIDIA#915)

* Add command-line arguments for applying coalesce and repartition on a per-table basis

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Move command-line validation logic and address other feedback

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Update copyright years and fix import order

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Update docs/benchmarks.md

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Remove withPartitioning option from TPC-H and TPC-xBB file conversion

Signed-off-by: Andy Grove <andygrove@nvidia.com>

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
tgravescs added a commit that referenced this pull request Oct 21, 2020
Add some more checks to databricks build scripts
sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this pull request Nov 20, 2020
Add option to control number of partitions when converting from CSV to Parquet (NVIDIA#915)
sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this pull request Nov 20, 2020
Add some more checks to databricks build scripts
@andygrove andygrove deleted the max-partitions-csv-to-parquet branch December 17, 2020 15:26
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Add option to control number of partitions when converting from CSV to Parquet (NVIDIA#915)
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Add some more checks to databricks build scripts
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Add option to control number of partitions when converting from CSV to Parquet (NVIDIA#915)
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Add some more checks to databricks build scripts
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
[auto-merge] bot-auto-merge-branch-23.02 to branch-23.04 [skip ci] [bot]
Labels
benchmark (Benchmarking, benchmarking tools)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Benchmark CSV to Parquet conversion should have explicit partitioning
4 participants