[FEA] Support ZSTD compression with Parquet and Orc #3037

Closed
tgravescs opened this issue Jul 27, 2021 · 24 comments · Fixed by #6362
Labels: feature request (New feature or request)

@tgravescs (Collaborator)

Is your feature request related to a problem? Please describe.
Feature request from a user asking for ZSTD-compressed data support for Parquet with spark-rapids.
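
For reference, a minimal sketch of how a user would request ZSTD Parquet output from Spark; the app name and output path below are placeholders, and spark.sql.parquet.compression.codec is the standard Spark setting (it also appears in the configurations later in this thread):

from pyspark.sql import SparkSession

# Placeholder session; spark.sql.parquet.compression.codec selects the Parquet codec.
spark = (
    SparkSession.builder
    .appName("zstd-parquet-example")
    .config("spark.sql.parquet.compression.codec", "zstd")
    .getOrCreate()
)

# Any Parquet write in this session is now ZSTD-compressed; /tmp/zstd_out is a placeholder path.
spark.range(1000).write.mode("overwrite").parquet("/tmp/zstd_out")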

@tgravescs added the 'feature request' and '? - Needs Triage' labels on Jul 27, 2021
@tgravescs (Collaborator, Author)

Need to check into CUDF support

@jlowe (Member) commented Jul 27, 2021

cudf does not support ZSTD. They are in the process of migrating to using nvcomp for handling codecs. nvcomp does not yet support ZSTD, but it is on their radar to investigate.

@Salonijain27 removed the '? - Needs Triage' label on Jul 27, 2021
@sameerz (Collaborator) commented Apr 22, 2022

Depends on rapidsai/cudf#9056

@jbrennan333 (Collaborator)

#6362 is the WIP PR to add support for Parquet/ORC write compression in spark-rapids, but because the cuDF feature is still experimental, we may want to hold off on enabling it. See #6362 (comment).
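
For reference, a minimal sketch of opting into the experimental path from Spark by setting the LIBCUDF_NVCOMP_POLICY environment variable for the driver and executors, mirroring the configuration shown later in this thread; the app name is a placeholder:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("zstd-experimental")  # placeholder app name
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # Opt in to cudf's experimental nvcomp ZSTD codec (same keys as the configs later in this thread).
    .config("spark.driverEnv.LIBCUDF_NVCOMP_POLICY", "ALWAYS")
    .config("spark.executorEnv.LIBCUDF_NVCOMP_POLICY", "ALWAYS")
    .getOrCreate()
)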

@jbrennan333 (Collaborator) commented Sep 22, 2022

We discussed this in our stand-up yesterday. The consensus was that we should enable this early in the 22.12 branch and then do more rigorous testing to ensure it is stable. Among the things we need to verify:

  • Run some more write-intensive workloads and validate results.
  • Generate data with GPU and CPU, and compare resulting data size. Early testing on this showed that GPU produced smaller data than CPU for parquet, but larger data than CPU for ORC. We need to confirm these results and test them at higher scales (initial tests were done at scale 100 on my desktop machine). If the data size is larger than that produced by CPU, we need to determine whether it is an acceptable increase and either push for improvements or document it.
  • Do more round-trip testing - generate data with gpu, then test that cpu jobs can read that data and produce correct results. I did this with nds2.0 on my desktop, but we should verify at larger scales and possibly with different queries/data.
  • Make an early release available for interested customers to allow them to test with ZSTD compression.
  • Measure ZSTD compression performance with Spark. Need to identify appropriate benchmarks for this. nds2.0 doesn't write much data, so it doesn't seem like a good test. One option is to use the nds2.0 data conversion from CSV to parquet/orc with ZSTD compression (a rough PySpark sketch of that conversion follows this list).
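
As a rough illustration of that conversion step, here is a minimal PySpark sketch (not the actual NDS transcode script); the paths, delimiter, and table subset are placeholder assumptions:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nds-csv-to-parquet-zstd")  # placeholder app name
    .config("spark.sql.parquet.compression.codec", "zstd")
    .getOrCreate()
)

raw_root = "/data/raw_sf100"           # placeholder raw CSV location
out_root = "/data/parquet_sf100_zstd"  # placeholder output location

for table in ["store_sales", "item"]:  # placeholder subset of the NDS tables
    (spark.read
        .option("delimiter", "|")      # assumes pipe-delimited raw data; adjust as needed
        .csv(f"{raw_root}/{table}")
        .write.mode("overwrite")
        .parquet(f"{out_root}/{table}"))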

@jbrennan333 (Collaborator) commented Oct 12, 2022

Initial testing on desktop.

  • Run zstd compression integration tests
  • NDS2.0 Data Conversion - scale 100 (desktop)
      • convert from raw data to parquet with no compression
      • convert from raw data to parquet/zstd using CPU
      • convert from raw data to parquet/zstd using GPU
      • compare sizes of all three
      • verify data matches between CPU parquet/zstd and GPU parquet/zstd
  • NDS2.0 Power Run - scale 100 (desktop) on CPU using parquet/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 100 (desktop) on CPU using parquet/zstd data generated by GPU.
  • NDS2.0 Power Run - scale 100 (desktop) on GPU using parquet/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 100 (desktop) on GPU using parquet/zstd data generated by GPU.
  • Compare results from these four runs.
  • NDS2.0 Data Conversion - scale 100 (desktop)
      • convert from raw data to orc with no compression
      • convert from raw data to orc/zstd using CPU
      • convert from raw data to orc/zstd using GPU
      • compare sizes of all three
      • verify data matches between CPU zstd and GPU zstd
  • NDS2.0 Power Run - scale 100 (desktop) on CPU using orc/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 100 (desktop) on CPU using orc/zstd data generated by GPU.
  • NDS2.0 Power Run - scale 100 (desktop) on GPU using orc/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 100 (desktop) on GPU using orc/zstd data generated by GPU.
  • Compare results from these four runs

@jbrennan333 changed the title from '[FEA] Support ZSTD compression with Parquet' to '[FEA] Support ZSTD compression with Parquet and Orc' on Oct 13, 2022
@jbrennan333 (Collaborator)

For my scale 100 (desktop) data gen tests, I regenerated the raw data using a parallel setting of 16 to produce larger source files:

python3 nds_gen_data.py local 100 16 /opt/data/nds2/raw_sf100_parallel_16

@jbrennan333 (Collaborator)

These are overall sizes for data converted from /opt/data/nds2/raw_sf100_parallel_16

100998016	raw_sf100_parallel_16
30800556	cpu_orc_sf100_none
23827772	cpu_orc_sf100_zstd
36334628	cpu_parquet_sf100_none
26837096	cpu_parquet_sf100_zstd
28221372	gpu_orc_sf100_none
28230416	gpu_orc_sf100_zstd
34250312	gpu_parquet_sf100_none
27570448	gpu_parquet_sf100_zstd

Note that most of the reduction in size comes from the conversion from raw CSV data to ORC/Parquet.

CPU raw to PARQUET compression ratio: 2.78
CPU PARQUET to PARQUET ZSTD compression ratio: 1.35
CPU overall raw to PARQUET ZSTD compression ratio: 3.76

GPU raw to PARQUET compression ratio: 2.95
GPU PARQUET to PARQUET ZSTD compression ratio: 1.24
GPU overall raw to PARQUET ZSTD compression ratio: 3.66

CPU raw to ORC compression ratio: 3.28
CPU ORC to ORC ZSTD compression ratio: 1.29
CPU overall raw to ORC ZSTD compression ratio: 4.23

GPU raw to ORC compression ratio: 3.58
GPU ORC to ORC ZSTD compression ratio: 0.99
GPU overall raw to ORC ZSTD compression ratio: 3.58
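
For clarity, the ratios above are just quotients of the directory sizes listed; a quick check in Python (sizes copied from the listing above):

sizes = {
    "raw": 100998016,
    "cpu_parquet_none": 36334628,
    "cpu_parquet_zstd": 26837096,
    "gpu_parquet_none": 34250312,
    "gpu_parquet_zstd": 27570448,
}

print(round(sizes["raw"] / sizes["cpu_parquet_none"], 2))               # 2.78  CPU raw -> PARQUET
print(round(sizes["cpu_parquet_none"] / sizes["cpu_parquet_zstd"], 2))  # 1.35  CPU PARQUET -> PARQUET ZSTD
print(round(sizes["raw"] / sizes["cpu_parquet_zstd"], 2))               # 3.76  CPU overall raw -> PARQUET ZSTD
print(round(sizes["gpu_parquet_none"] / sizes["gpu_parquet_zstd"], 2))  # 1.24  GPU PARQUET -> PARQUET ZSTD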

@jbrennan333 (Collaborator)

When comparing output of power runs between CPU and GPU at scale 100, I consistently see a difference in query79, regardless of compression (none, zstd, snappy). So I don't think it is a zstd issue, but it might need more investigation.

@jbrennan333 (Collaborator)

The q79 diff looks like this:

=== Comparing Query: query79 ===
Collecting rows from DataFrame
Collected 100 rows in 0.11267280578613281 seconds
Collecting rows from DataFrame
Collected 100 rows in 0.10186338424682617 seconds
Row 99: 
[None, None, 'Bethel', 12213922, None, None]
[None, None, 'Bethel', 22767968, None, None]

@jbrennan333 (Collaborator)

Power-run output is pretty small, but I am noticing some differences in the size of the Parquet output for CPU vs GPU:

4.2M	power-output-cpu-parquet-cpu-zstd
4.2M	power-output-cpu-parquet-gpu-zstd
6.0M	power-output-gpu-parquet-cpu-zstd
6.0M	power-output-gpu-parquet-gpu-zstd

Most of the difference seems to be in the query98 output:

1.7M	power-output-cpu-parquet-cpu-zstd/query98
3.6M	power-output-gpu-parquet-cpu-zstd/query98

Inspecting one of the partitions for query98 shows that in the GPU output, two of the columns are uncompressed:

############ Column(i_item_desc) ############
name: i_item_desc
path: i_item_desc
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: UNCOMPRESSED (space_saved: 0%)

############ Column(i_category) ############
name: i_category
path: i_category
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: UNCOMPRESSED (space_saved: 0%)

For CPU, these columns are compressed, although for i_category the compressed size appears to be larger than the uncompressed size:

############ Column(i_item_desc) ############
name: i_item_desc
path: i_item_desc
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: ZSTD (space_saved: 64%)

############ Column(i_category) ############
name: i_category
path: i_category
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: ZSTD (space_saved: -26%)
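
For anyone reproducing this, a hedged sketch of pulling the same per-column codec information with pyarrow instead of parquet-tools; the file name is a placeholder:

import pyarrow.parquet as pq

# Placeholder path to one output partition file.
md = pq.ParquetFile("part-00000.zstd.parquet").metadata

for rg in range(md.num_row_groups):
    for col_idx in range(md.num_columns):
        col = md.row_group(rg).column(col_idx)
        # compression is e.g. 'ZSTD' or 'UNCOMPRESSED'; the two sizes show whether it helped.
        print(col.path_in_schema, col.compression,
              col.total_compressed_size, col.total_uncompressed_size)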

@revans2 (Collaborator) commented Oct 18, 2022

> When comparing output of power runs between CPU and GPU at scale 100, I consistently see a difference in query79, regardless of compression (none, zstd, snappy). So I don't think it is a zstd issue, but it might need more investigation.

The field that differs is ss_ticket_number, which is a group-by key but not in the sort order for the output. My guess is that this is just ambiguous ordering in the output (see the sketch below).
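
A minimal illustration (not query79 itself) of why such a column can legitimately differ between two correct results; the values are taken from the q79 diff above, and the column names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ambiguous-ordering").getOrCreate()

# Two rows that tie on the sort key but differ in a column that is not sorted on.
df = spark.createDataFrame(
    [("Bethel", 12213922), ("Bethel", 22767968)],
    ["city", "ticket_number"],
)

# Both rows are equal under the ORDER BY, so either ticket_number may appear first;
# CPU and GPU plans are free to break the tie differently and both results are correct.
df.orderBy("city").limit(1).show()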

@jbrennan333 (Collaborator)

Will want this fix in CUDF: rapidsai/cudf#11869

@jbrennan333 (Collaborator) commented Oct 19, 2022

Testing on the spark2a cluster at 3TB scale.

  • NDS2.0 Data Conversion - scale 3000 (spark2a)
      • convert from raw data to parquet with no compression
      • will use existing CPU generated /data/nds2.0/parquet_sf3k_decimal_zstd for comparisons
      • convert from raw data to parquet/zstd using GPU
      • compare sizes of all three
      • verify data matches between CPU parquet/zstd and GPU parquet/zstd
  • NDS2.0 Power Run - scale 3000 on CPU using parquet/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 3000 on CPU using parquet/zstd data generated by GPU.
  • NDS2.0 Power Run - scale 3000 on GPU using parquet/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 3000 on GPU using parquet/zstd data generated by GPU.
  • Compare results from these four runs.
  • NDS2.0 Data Conversion - scale 3000 (spark2a)
      • convert from raw data to orc with no compression
      • convert from raw data to orc/zstd using CPU
      • convert from raw data to orc/zstd using GPU
      • compare sizes of all three
      • verify data matches between CPU ORC/zstd and GPU ORC/zstd
  • NDS2.0 Power Run - scale 3000 on CPU using orc/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 3000 on CPU using orc/zstd data generated by GPU.
  • NDS2.0 Power Run - scale 3000 on GPU using orc/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 3000 on GPU using orc/zstd data generated by GPU.
  • Compare results from these four runs

@jbrennan333 (Collaborator) commented Oct 20, 2022

Spark2a (a100) data conversion sizes (via hadoop fs -count):

       DIRS         FILES       SIZE
          26         3016      2958431049504 /data/nds2.0/raw_sf3k
       12064        12080      1109313171766 /data/jimb/cpu_parquet_sf3k_decimal_none
       12064        12080       779117287097 /data/jimb/cpu_parquet_sf3k_decimal_zstd

       12064        12080       911568648724 /data/jimb/gpu_parquet_sf3k_decimal_none
       12064        12080       835715788420 /data/jimb/gpu_parquet_sf3k_decimal_zstd

CPU raw to PARQUET NONE compression ratio: 2.66
CPU PARQUET to CPU PARQUET ZSTD compression ratio: 1.42
CPU overall raw to PARQUET ZSTD compression ratio: 3.80

GPU raw to GPU PARQUET compression ratio: 3.24
GPU PARQUET to GPU PARQUET ZSTD compression ratio: 1.09
GPU overall raw to PARQUET ZSTD compression ratio: 3.53

@jbrennan333 (Collaborator)

Comparing sizes of 3TB power-run outputs:

         104          251            8855415 /data/jimb/power-output-cpu-parquet-cpu-zstd
         104          231           12755903 /data/jimb/power-output-gpu-parquet-cpu-zstd

         104          269            8982069 /data/jimb/power-output-cpu-parquet-gpu-zstd
         104          235           12762506 /data/jimb/power-output-gpu-parquet-gpu-zstd

There is no significant difference in output size, for either the GPU or the CPU runs, between using GPU-generated and CPU-generated input data.

In the runs using zstd input data generated by the CPU, the GPU output data is about 1.44x larger.
In the runs using zstd input data generated by the GPU, the GPU output data is about 1.42x larger.

Similar to my scale 100 results, most of the difference is in query98 results.

hadoop fs -count /data/jimb/power-output-cpu-parquet-cpu-zstd/* | sort -k3 -n | tail
           1            2              11810 /data/jimb/power-output-cpu-parquet-cpu-zstd/query17
           1            2              14063 /data/jimb/power-output-cpu-parquet-cpu-zstd/query31
           1            2              16877 /data/jimb/power-output-cpu-parquet-cpu-zstd/query73
           1            2              21508 /data/jimb/power-output-cpu-parquet-cpu-zstd/query66
           1            2              22966 /data/jimb/power-output-cpu-parquet-cpu-zstd/query39_part2
           1            2             462955 /data/jimb/power-output-cpu-parquet-cpu-zstd/query39_part1
           1            5             924995 /data/jimb/power-output-cpu-parquet-cpu-zstd/query64
           1           10            1489572 /data/jimb/power-output-cpu-parquet-cpu-zstd/query34
           1           30            2651055 /data/jimb/power-output-cpu-parquet-cpu-zstd/query71
           1            8            2929125 /data/jimb/power-output-cpu-parquet-cpu-zstd/query98
hadoop fs -count /data/jimb/power-output-gpu-parquet-cpu-zstd/* | sort -k3 -n | tail
           1            2              10257 /data/jimb/power-output-gpu-parquet-cpu-zstd/query17
           1            2              13227 /data/jimb/power-output-gpu-parquet-cpu-zstd/query31
           1            2              16399 /data/jimb/power-output-gpu-parquet-cpu-zstd/query73
           1            2              16614 /data/jimb/power-output-gpu-parquet-cpu-zstd/query66
           1            2              22002 /data/jimb/power-output-gpu-parquet-cpu-zstd/query39_part2
           1            2             462628 /data/jimb/power-output-gpu-parquet-cpu-zstd/query39_part1
           1            4             907409 /data/jimb/power-output-gpu-parquet-cpu-zstd/query64
           1            7            1465788 /data/jimb/power-output-gpu-parquet-cpu-zstd/query34
           1           16            3105661 /data/jimb/power-output-gpu-parquet-cpu-zstd/query71
           1            6            6479179 /data/jimb/power-output-gpu-parquet-cpu-zstd/query98

I think this may be due to UNCOMPRESSED i_item_desc string columns:

GPU:
parquet-tools inspect part-00000-7f2a657c-3a23-4024-b884-9acd56bb239e-c000.zstd.parquet
...
############ Column(i_item_desc) ############
name: i_item_desc
path: i_item_desc
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: UNCOMPRESSED (space_saved: 0%)
CPU:
parquet-tools inspect part-00000-8e5d6fbf-78fd-4bd2-a5ea-136b58ed2da6-c000.zstd.parquet
...
############ Column(i_item_desc) ############
name: i_item_desc
path: i_item_desc
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: ZSTD (space_saved: 64%)

@jbrennan333 (Collaborator) commented Oct 24, 2022

Spark2a (a100) data conversion sizes for orc (via hadoop fs -count):

          26         3016      2958431049504 /data/nds2.0/raw_sf3k

       12064        12080      1109611149766 /data/jimb/cpu_orc_sf3k_decimal_none
       12064        12080       718204423250 /data/nds2.0/orc_sf3k_decimal_zstd

       12064        12080       797612029660 /data/jimb/gpu_orc_sf3k_decimal_none
       12064        12080       797831485879 /data/jimb/gpu_orc_sf3k_decimal_zstd

CPU raw to CPU ORC NONE compression ratio: 2.67
CPU ORC to CPU ORC ZSTD compression ratio: 1.55
CPU overall raw to ORC ZSTD compression ratio: 4.12

GPU raw to GPU ORC NONE compression ratio: 3.71
GPU ORC to GPU ORC ZSTD compression ratio: 1.0
GPU overall raw to ORC ZSTD compression ratio: 3.71

@jbrennan333 (Collaborator)

As a sanity check, I ran the nds2.0 power run on a dataproc cluster with P4 GPUs, with zstd parquet input data, and wrote the output in parquet/zstd format.
No failures.

@jbrennan333 (Collaborator)

Comparing sizes of 3TB power-run outputs for ORC:

         104          254            9776040 /data/jimb/power-output-cpu-orc-cpu-zstd
         104          232           17211185 /data/jimb/power-output-gpu-orc-cpu-zstd

         104          267            9850470 /data/jimb/power-output-cpu-orc-gpu-zstd
         104          235           17244382 /data/jimb/power-output-gpu-orc-gpu-zstd

There is no significant difference in output size, for either the GPU or the CPU runs, between using GPU-generated and CPU-generated input data.

In the runs using zstd input data generated by the CPU, the GPU output data is about 1.76x larger.
In the runs using zstd input data generated by the GPU, the GPU output data is about 1.75x larger.

hadoop fs -count /data/jimb/power-output-cpu-orc-cpu-zstd/* | sort -k3 -n | tail
           1            2               8741 /data/jimb/power-output-cpu-orc-cpu-zstd/query12
           1            2              13893 /data/jimb/power-output-cpu-orc-cpu-zstd/query31
           1            2              14015 /data/jimb/power-output-cpu-orc-cpu-zstd/query66
           1            2              14850 /data/jimb/power-output-cpu-orc-cpu-zstd/query73
           1            2              19958 /data/jimb/power-output-cpu-orc-cpu-zstd/query39_part2
           1            2             458098 /data/jimb/power-output-cpu-orc-cpu-zstd/query39_part1
           1            5             801487 /data/jimb/power-output-cpu-orc-cpu-zstd/query64
           1           11            1503282 /data/jimb/power-output-cpu-orc-cpu-zstd/query34
           1            8            3001069 /data/jimb/power-output-cpu-orc-cpu-zstd/query98
           1           32            3716342 /data/jimb/power-output-cpu-orc-cpu-zstd/query71
hadoop fs -count /data/jimb/power-output-gpu-orc-cpu-zstd/* | sort -k3 -n | tail
           1            2              16377 /data/jimb/power-output-gpu-orc-cpu-zstd/query31
           1            2              17921 /data/jimb/power-output-gpu-orc-cpu-zstd/query17
           1            2              21630 /data/jimb/power-output-gpu-orc-cpu-zstd/query73
           1            2              29545 /data/jimb/power-output-gpu-orc-cpu-zstd/query39_part2
           1            2              36416 /data/jimb/power-output-gpu-orc-cpu-zstd/query2
           1            2             723937 /data/jimb/power-output-gpu-orc-cpu-zstd/query39_part1
           1            4            1422589 /data/jimb/power-output-gpu-orc-cpu-zstd/query64
           1            8            1987830 /data/jimb/power-output-gpu-orc-cpu-zstd/query34
           1           16            5528351 /data/jimb/power-output-gpu-orc-cpu-zstd/query71
           1            6            7114846 /data/jimb/power-output-gpu-orc-cpu-zstd/query98

The sizes for GPU are consistently larger. Query98 is again the biggest difference, at about 2.37x the CPU version.
The main difference I see in the metadata is that for CPU the compression size is 262144, while for GPU it is 65536.
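
For reference, on the CPU path that 262144 comes from the ORC writer's orc.compress.size setting (the default ORC compression chunk size); below is a hedged sketch of pinning it explicitly via Spark's spark.hadoop.* pass-through, assuming a Spark version whose ORC datasource accepts zstd. It is not clear whether the GPU (cuDF) writer honors the same knob.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orc-zstd-compress-size")  # placeholder app name
    # spark.hadoop.* keys are copied into the Hadoop Configuration used by the CPU ORC writer.
    .config("spark.hadoop.orc.compress.size", "262144")
    .config("spark.sql.orc.compression.codec", "zstd")
    .getOrCreate()
)

# Placeholder write; this only affects the CPU ORC writer, not the GPU path.
spark.range(1000).write.mode("overwrite").orc("/tmp/orc_zstd_out")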

@jbrennan333 (Collaborator)

ORC NDS2.0 SCALE 3000 Data Conversion using CPU

Total conversion time for 24 tables was 3241.63s
Time to convert 'customer_address' was 170.7648s
Time to convert 'customer_demographics' was 15.9405s
Time to convert 'date_dim' was 3.8864s
Time to convert 'warehouse' was 0.2383s
Time to convert 'ship_mode' was 2.0146s
Time to convert 'time_dim' was 2.9487s
Time to convert 'reason' was 1.9077s
Time to convert 'income_band' was 2.1465s
Time to convert 'item' was 10.3129s
Time to convert 'store' was 2.5937s
Time to convert 'call_center' was 2.3034s
Time to convert 'customer' was 333.3341s
Time to convert 'web_site' was 0.2279s
Time to convert 'store_returns' was 122.8344s
Time to convert 'household_demographics' was 0.2081s
Time to convert 'web_page' was 0.5157s
Time to convert 'promotion' was 0.2138s
Time to convert 'catalog_page' was 0.4268s
Time to convert 'inventory' was 19.3619s
Time to convert 'catalog_returns' was 23.9985s
Time to convert 'web_returns' was 37.9469s
Time to convert 'web_sales' was 151.2772s
Time to convert 'catalog_sales' was 320.9935s
Time to convert 'store_sales' was 2015.2364s

ORC NDS2.0 SCALE 3000 Data Conversion using GPU

Total conversion time for 24 tables was 2365.16s
Time to convert 'customer_address' was 143.8623s
Time to convert 'customer_demographics' was 5.5104s
Time to convert 'date_dim' was 2.0233s
Time to convert 'warehouse' was 1.9286s
Time to convert 'ship_mode' was 1.0685s
Time to convert 'time_dim' was 0.2953s
Time to convert 'reason' was 1.4386s
Time to convert 'income_band' was 0.1812s
Time to convert 'item' was 5.8370s
Time to convert 'store' was 0.7381s
Time to convert 'call_center' was 0.2208s
Time to convert 'customer' was 23.7831s
Time to convert 'web_site' was 2.0297s
Time to convert 'store_returns' was 135.3156s
Time to convert 'household_demographics' was 0.1854s
Time to convert 'web_page' was 0.2322s
Time to convert 'promotion' was 0.2122s
Time to convert 'catalog_page' was 0.2964s
Time to convert 'inventory' was 9.1562s
Time to convert 'catalog_returns' was 72.8754s
Time to convert 'web_returns' was 38.5083s
Time to convert 'web_sales' was 333.8525s
Time to convert 'catalog_sales' was 576.3461s
Time to convert 'store_sales' was 1009.2584s

Total speedup for GPU was about 1.37x. This is just a single run and I have not spent any significant time trying to optimize it. You can see that for the largest table, store_sales, the speedup was nearly 2x, but on the next largest table, catalog_sales, the GPU was about 1.8x slower.
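
For clarity, the speedups quoted here are simply CPU time divided by GPU time from the two tables above:

cpu_total, gpu_total = 3241.63, 2365.16
cpu_store_sales, gpu_store_sales = 2015.2364, 1009.2584
cpu_catalog_sales, gpu_catalog_sales = 320.9935, 576.3461

print(round(cpu_total / gpu_total, 2))                  # ~1.37x overall speedup
print(round(cpu_store_sales / gpu_store_sales, 2))      # ~2.0x for store_sales
print(round(gpu_catalog_sales / cpu_catalog_sales, 2))  # ~1.8x slower for catalog_sales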

@jbrennan333 (Collaborator)

PARQUET NDS2.0 SCALE 3000 Data Conversion using CPU

Total conversion time for 24 tables was 3010.75s
Time to convert 'customer_address' was 182.2625s
Time to convert 'customer_demographics' was 16.8170s
Time to convert 'date_dim' was 4.1716s
Time to convert 'warehouse' was 2.4555s
Time to convert 'ship_mode' was 2.4329s
Time to convert 'time_dim' was 3.1540s
Time to convert 'reason' was 2.2615s
Time to convert 'income_band' was 2.3192s
Time to convert 'item' was 8.4986s
Time to convert 'store' was 2.6609s
Time to convert 'call_center' was 2.6035s
Time to convert 'customer' was 304.6228s
Time to convert 'web_site' was 2.6494s
Time to convert 'store_returns' was 116.6746s
Time to convert 'household_demographics' was 0.2331s
Time to convert 'web_page' was 0.4761s
Time to convert 'promotion' was 0.2269s
Time to convert 'catalog_page' was 0.5668s
Time to convert 'inventory' was 19.6026s
Time to convert 'catalog_returns' was 23.8653s
Time to convert 'web_returns' was 36.3303s
Time to convert 'web_sales' was 145.0934s
Time to convert 'catalog_sales' was 306.9950s
Time to convert 'store_sales' was 1823.7792s

Spark configuration follows:

('spark.driver.maxResultSize', '2GB')
('spark.app.name', 'NDS - transcode')
('spark.sql.adaptive.enabled', 'true')
('spark.driver.memory', '50G')
('spark.executor.memory', '32G')
('spark.sql.shuffle.partitions', '1024')
('spark.locality.wait', '0')
('spark.executor.cores', '16')
('spark.app.id', 'app-20221026140612-0010')
('spark.sql.parquet.compression.codec', 'zstd')

PARQUET NDS2.0 SCALE 3000 Data Conversion using GPU

Total conversion time for 24 tables was 2326.97s
Time to convert 'customer_address' was 135.5229s
Time to convert 'customer_demographics' was 6.7232s
Time to convert 'date_dim' was 1.2225s
Time to convert 'warehouse' was 2.8716s
Time to convert 'ship_mode' was 1.3489s
Time to convert 'time_dim' was 1.8186s
Time to convert 'reason' was 1.7725s
Time to convert 'income_band' was 1.2412s
Time to convert 'item' was 6.9413s
Time to convert 'store' was 0.3417s
Time to convert 'call_center' was 0.4945s
Time to convert 'customer' was 21.6170s
Time to convert 'web_site' was 0.2297s
Time to convert 'store_returns' was 129.1592s
Time to convert 'household_demographics' was 0.2200s
Time to convert 'web_page' was 0.2152s
Time to convert 'promotion' was 0.2361s
Time to convert 'catalog_page' was 0.2131s
Time to convert 'inventory' was 3.6131s
Time to convert 'catalog_returns' was 65.8336s
Time to convert 'web_returns' was 35.4903s
Time to convert 'web_sales' was 307.7802s
Time to convert 'catalog_sales' was 591.1566s
Time to convert 'store_sales' was 1010.9048s

Spark configuration follows:
('spark.app.id', 'app-20221026154733-0013')
('spark.app.name', 'NDS - transcode')
('spark.locality.wait', '0')
('spark.driver.maxResultSize', '2GB')
('spark.driver.memory', '50G')
('spark.driverEnv.LIBCUDF_NVCOMP_POLICY', 'ALWAYS')
('spark.executor.cores', '16')
('spark.executor.instances', '8')
('spark.executor.memory', '32G')
('spark.executor.resource.gpu.amount', '1')
('spark.executorEnv.LIBCUDF_NVCOMP_POLICY', 'ALWAYS')
('spark.plugins', 'com.nvidia.spark.SQLPlugin')
('spark.rapids.memory.pinnedPool.size', '8G')
('spark.rapids.memory.host.spillStorageSize', '32G')
('spark.rapids.sql.batchSizeBytes', '1GB')
('spark.rapids.sql.concurrentGpuTasks', '4')
('spark.sql.adaptive.enabled', 'true')
('spark.sql.files.maxPartitionBytes', '2gb')
('spark.sql.parquet.compression.codec', 'zstd')
('spark.sql.shuffle.partitions', '200')
('spark.task.resource.gpu.amount', '0.0625')

Total overall speedup for GPU was about 1.30x. The speedup for the largest file (store_sales) was about 1.8x. The next largest file (catalog_sales) was about 1.9x slower (a speedup of 0.52).

@jbrennan333 (Collaborator) commented Oct 26, 2022

I collected the size/time info for converting to Parquet/zstd and ORC/zstd at 3TB scale:
parquet-zstd-conversion-times-spark2a.pdf
orc-zstd-conversion-times-spark2a.pdf

@jbrennan333 (Collaborator) commented Oct 26, 2022

Overall, for NDS2.0 data conversion from raw CSV at 3TB scale on the GPU:

  • for Parquet we saw a speedup of about 1.3x with an increase in space of about 1.07x vs CPU;
  • for ORC we saw a speedup of about 1.37x with an increase in space of about 1.11x vs CPU.

@jbrennan333 (Collaborator)

When we convert from CSV to Parquet and ORC, the GPU produces smaller files (a higher compression ratio) than the CPU when compression is set to NONE:

  • Scale 100 CPU raw CSV to PARQUET NONE compression ratio: 2.78
    Scale 100 GPU raw CSV to PARQUET NONE compression ratio: 2.95

  • Scale 3000 CPU raw CSV to PARQUET NONE compression ratio: 2.66
    Scale 3000 GPU raw CSV to PARQUET NONE compression ratio: 3.24

  • Scale 100 CPU raw CSV to ORC NONE compression ratio: 3.28
    Scale 100 GPU raw CSV to ORC NONE compression ratio: 3.58

  • Scale 3000 CPU raw CSV to ORC NONE compression ratio: 2.67
    Scale 3000 GPU raw CSV to ORC NONE compression ratio: 3.71

When we compare the Parquet/ORC files with no compression to those with ZSTD compression, the CPU gets a better compression ratio; in particular, we see essentially no benefit on the GPU for these ORC files compared to using no compression:

  • Scale 100 CPU PARQUET to PARQUET ZSTD compression ratio: 1.35
    Scale 100 GPU PARQUET to PARQUET ZSTD compression ratio: 1.24

  • Scale 3000 CPU PARQUET to PARQUET ZSTD compression ratio: 1.42
    Scale 3000 GPU PARQUET to PARQUET ZSTD compression ratio: 1.09

  • Scale 100 CPU ORC to ORC ZSTD compression ratio: 1.29
    Scale 100 GPU ORC to ORC ZSTD compression ratio: 0.99

  • Scale 3000 CPU ORC to ORC ZSTD compression ratio: 1.55
    Scale 3000 GPU ORC to ORC ZSTD compression ratio: 1.0

The overall compression ratio from CSV to Parquet/ORC ZSTD is about the same at scale 100 and scale 3000:

  • Scale 100 CPU overall raw CSV to PARQUET ZSTD compression ratio: 3.76
    Scale 100 GPU overall raw CSV to PARQUET ZSTD compression ratio: 3.66

  • Scale 3000 CPU overall raw CSV to PARQUET ZSTD compression ratio: 3.80
    Scale 3000 GPU overall raw CSV to PARQUET ZSTD compression ratio: 3.53

  • Scale 100 CPU overall raw CSV to ORC ZSTD compression ratio: 4.23
    Scale 100 GPU overall raw CSV to ORC ZSTD compression ratio: 3.58

  • Scale 3000 CPU overall raw to ORC ZSTD compression ratio: 4.12
    Scale 3000 GPU overall raw to ORC ZSTD compression ratio: 3.71

rapids-bot (bot) pushed a commit to rapidsai/cudf that referenced this issue on Nov 4, 2022:
NVCOMP zstd compression was added in 22.10, but marked experimental, meaning you have to define the environment variable `LIBCUDF_NVCOMP_POLICY=ALWAYS` to enable it.  After completing validation testing using the spark rapids plugin as documented here: NVIDIA/spark-rapids#3037, we believe that we can now change the zstd compression status to stable, which will enable it in cudf by default.  `LIBCUDF_NVCOMP_POLICY=STABLE` is the default value.

Authors:
  - Jim Brennan (https://github.com/jbrennan333)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #12059