Add option to control number of partitions when converting from CSV to Parquet #915
Conversation
@@ -60,10 +62,14 @@ case class Table(
  private def setupWrite(
      spark: SparkSession,
      inputBase: String,
      maxPartitions: Option[Int],
Any reason to not also support growing the number of partitions? Basically just make this `numPartitions`. You would need to use `repartition` in that case. I would find this useful for small-file testing, where I actually want to grow the number of partitions.
Good point. I can look at this later today.
@tgravescs I'm looking at this now and there are multiple options for doing this that I can see.
Option 1: Add specific command-line arguments for `coalesce` vs `repartition`. For the issue where some tables have too many files, coalesce is much more efficient than repartition, and for the tables that already have a small number of files, I don't want to increase the number.
Option 2: Dynamically look at the number of partitions in each table and then decide whether to coalesce or repartition to achieve the desired number of partitions, but this would mean that we could end up increasing partitions for some tables and decreasing for others.
I think option 1 is more explicit and would work well for both of our use cases? I'll start down this path but let me know what you think.
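For intuition, the difference between the two strategies can be sketched as pure logic. This is a hypothetical helper, not code from this PR: `coalesce` can only reduce the partition count (it merges existing partitions without a shuffle), while `repartition` shuffles to exactly the requested count.

```scala
// Hypothetical sketch, not the PR's actual code: models the partition count
// that results from each strategy.
def resultingPartitions(current: Int, target: Int, useCoalesce: Boolean): Int =
  if (useCoalesce) math.min(current, target) // coalesce never increases partitions
  else target // repartition shuffles to exactly `target`
```

So a table written as 1000 small files can be coalesced down to 24 cheaply, but growing 8 partitions to 200 requires `repartition` and its full shuffle.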
I've used `coalesce` in the past and it works great, but it can be problematic when applied blindly across all tables in the dataset. For example, there are typically many tables in these benchmarks that scale very slowly or not at all (e.g., tables containing regions, warehouse locations, etc.) that for best performance should be just one file. But there are other tables at the largest scale (e.g., tables containing individual sales records) that would be disastrous to force into a single file.

This may need to be handled in a benchmark-specific way, e.g., tiny tables that don't really scale with the rest of the data are always coalesced into a single file, and there are two user-specified settings: one for the number of partitions to use for "medium" scale tables, and another for "large" scale tables.
Perhaps it would make sense to allow the `--coalesce` and `--repartition` arguments to operate on individual tables as a starting point, e.g. `--coalesce table1=1 table2=24 --repartition table3=200`, so that we have the ability to fully control partition sizes manually, and then we can build higher-level logic on top in the future to do this in a more automated way.
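Parsing such per-table values is straightforward; here is an illustrative sketch (the helper name and error message are assumptions, not the PR's actual code) that turns `table1=1 table2=24` style arguments into a `Map[String, Int]`:

```scala
// Illustrative sketch: parse "table=N" pairs into a map of table name to
// desired partition count, rejecting malformed entries.
def parseTablePartitions(args: Seq[String]): Map[String, Int] =
  args.map { arg =>
    arg.split("=", 2) match {
      case Array(table, n) if n.nonEmpty && n.forall(_.isDigit) => table -> n.toInt
      case _ => throw new IllegalArgumentException(s"Expected table=N but got '$arg'")
    }
  }.toMap
```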
These changes have been tested with TPC-DS but not with TPC-H or TPC-xBB.
      .write
      .mode("overwrite")
    val df = readCSV(spark, inputBase)
    val repart = (coalesce.get(name), repartition.get(name)) match {
so if I specify it in both I get the coalesce? Perhaps we should error?
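A fail-fast validation along the lines the reviewer suggests could look like this. This is a hedged sketch with an assumed helper name, not the PR's actual code: reject any table that appears in both `--coalesce` and `--repartition` rather than silently preferring one.

```scala
// Sketch: error out if any table is named in both --coalesce and --repartition,
// since the two requests are contradictory.
def validatePartitionArgs(coalesce: Map[String, Int],
                          repartition: Map[String, Int]): Unit = {
  val overlap = coalesce.keySet.intersect(repartition.keySet)
  require(overlap.isEmpty,
    s"Tables appear in both --coalesce and --repartition: ${overlap.mkString(", ")}")
}
```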
I'm not sure what happened, but after applying the suggested changes, I see other changes in this PR. I will rebase.
Force-pushed "… per-table basis" from 2551bad to 3186f39 (Signed-off-by: Andy Grove <andygrove@nvidia.com>)
@jlowe @tgravescs Thanks for the reviews. I've addressed your feedback now.
…o Parquet (NVIDIA#915)

* Add command-line arguments for applying coalesce and repartition on a per-table basis
* Move command-line validation logic and address other feedback
* Update copyright years and fix import order
* Update docs/benchmarks.md
* Remove withPartitioning option from TPC-H and TPC-xBB file conversion

Signed-off-by: Andy Grove <andygrove@nvidia.com>
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
* Add some more checks to databricks build scripts
* remove extra newline
* use the right -gt for bash
* Add new python file for databricks cluster utils
* Fix up scripts
* databricks scripts working
* Pass in sshkey
* cluster creation script mods
* fix
* fix pub key
* fix missing quote
* fix $
* update public key to be param
* Add public key value
* cleanup
* modify permissions
* change loc cluster id file
* fix extra /
* quote public key
* try different setting cluster id
* debug
* try again
* try readfile
* try again
* try quotes
* cleanup
* Add option to control number of partitions when converting from CSV to Parquet (#915)
* Benchmark runner script (#918)
* Add argument for number of iterations
* Fix docs
* add license
* improve documentation for the configuration files
* Add missing line-continuation symbol in example
* Remove hard-coded spark-submit-template.txt and add --template argument. Also make all arguments required.
* Update benchmarking guide to link to the benchmark python script
* Add --template to example and fix markdown header
* Add legacy config to clear active Spark 3.1.0 session in tests (#970)
* XFail tests until final fix can be put in (#968)
* Stop reporting totalTime metric for GpuShuffleExchangeExec (#973)
* Add create script, add more parameters, etc
* rework some scripts
* fix is_cluster_running
* put slack back in
* update text
* remove datetime
* send output to stderr

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Andy Grove <andygrove@users.noreply.github.com>
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Co-authored-by: Robert (Bobby) Evans <bobby@apache.org>
[auto-merge] bot-auto-merge-branch-23.02 to branch-23.04 [skip ci] [bot]
Signed-off-by: Andy Grove <andygrove@nvidia.com>
When we convert TPC-* CSV files to Parquet, the number and size of the Parquet files created are not consistent and depend on the number of executor cores running the conversion. This can lead to thousands of small files being created, which is not optimal for the GPU.

This PR adds the option to control the number of partitions per table using `coalesce` or `repartition`.

Also, all three TPC-* benchmarks now have a `ConvertFiles` object with a `main` method, so that the file conversion can be submitted with `spark-submit` with command-line arguments for all available options. For TPC-DS, there is an option to use partitioning when creating the Parquet files, since the underlying code supports that.

This closes #902
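The per-table dispatch hinted at in the diff (`(coalesce.get(name), repartition.get(name)) match { ... }`) can be sketched as pure logic. The `Action` type and helper below are illustrative stand-ins for the actual `df.coalesce(n)` / `df.repartition(n)` calls, not the PR's code:

```scala
// Hedged sketch of the per-table dispatch: coalesce when the table is in the
// coalesce map, repartition when in the repartition map, otherwise leave the
// DataFrame as-is; a table in both maps is a configuration error.
sealed trait Action
case class Coalesce(n: Int) extends Action
case class Repartition(n: Int) extends Action
case object NoChange extends Action

def chooseAction(name: String,
                 coalesce: Map[String, Int],
                 repartition: Map[String, Int]): Action =
  (coalesce.get(name), repartition.get(name)) match {
    case (Some(n), None) => Coalesce(n)
    case (None, Some(n)) => Repartition(n)
    case (None, None)    => NoChange
    case (Some(_), Some(_)) =>
      throw new IllegalArgumentException(s"Table '$name' is in both maps")
  }
```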