[FEA] Support ZSTD compression with Parquet and Orc #3037

Closed
tgravescs opened this issue Jul 27, 2021 · 24 comments · Fixed by #6362
Labels: feature request (New feature or request)

@tgravescs (Collaborator)

Is your feature request related to a problem? Please describe.
Feature request from a user asking for ZSTD-compressed data support for Parquet with spark-rapids.
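
For reference, a minimal sketch of how a user would request ZSTD Parquet output from Spark; the app name and output path below are placeholders, and spark.sql.parquet.compression.codec is the standard Spark setting (it also appears in the configurations later in this thread):

from pyspark.sql import SparkSession

# Placeholder session; spark.sql.parquet.compression.codec selects the Parquet codec.
spark = (
    SparkSession.builder
    .appName("zstd-parquet-example")
    .config("spark.sql.parquet.compression.codec", "zstd")
    .getOrCreate()
)

# Any Parquet write in this session is now ZSTD-compressed; /tmp/zstd_out is a placeholder path.
spark.range(1000).write.mode("overwrite").parquet("/tmp/zstd_out")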

@tgravescs added the 'feature request' and '? - Needs Triage' labels on Jul 27, 2021
@tgravescs (Collaborator, Author)

Need to check into CUDF support

@jlowe (Member) commented Jul 27, 2021

cudf does not support ZSTD. They are in the process of migrating to using nvcomp for handling codecs. nvcomp does not yet support ZSTD, but it is on their radar to investigate.

@Salonijain27 removed the '? - Needs Triage' label on Jul 27, 2021
@sameerz (Collaborator) commented Apr 22, 2022

Depends on rapidsai/cudf#9056

@jbrennan333 (Collaborator)

#6362 is the WIP PR to add support for Parquet/ORC write compression in spark-rapids, but because the cuDF feature is still experimental, we may want to hold off on enabling it. See #6362 (comment).
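
For reference, a minimal sketch of opting into the experimental path from Spark by setting the LIBCUDF_NVCOMP_POLICY environment variable for the driver and executors, mirroring the configuration shown later in this thread; the app name is a placeholder:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("zstd-experimental")  # placeholder app name
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # Opt in to cudf's experimental nvcomp ZSTD codec (same keys as the configs later in this thread).
    .config("spark.driverEnv.LIBCUDF_NVCOMP_POLICY", "ALWAYS")
    .config("spark.executorEnv.LIBCUDF_NVCOMP_POLICY", "ALWAYS")
    .getOrCreate()
)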

@jbrennan333 (Collaborator) commented Sep 22, 2022

We discussed this in our stand-up yesterday. The consensus was that we should enable this early in the 22.12 branch and then do more rigorous testing to ensure it is stable. Among the things we need to verify:

  • Run some more write-intensive workloads and validate results.
  • Generate data with GPU and CPU, and compare resulting data size. Early testing on this showed that GPU produced smaller data than CPU for parquet, but larger data than CPU for ORC. We need to confirm these results and test them at higher scales (initial tests were done at scale 100 on my desktop machine). If the data size is larger than that produced by CPU, we need to determine whether it is an acceptable increase and either push for improvements or document it.
  • Do more round-trip testing - generate data with gpu, then test that cpu jobs can read that data and produce correct results. I did this with nds2.0 on my desktop, but we should verify at larger scales and possibly with different queries/data.
  • Make an early release available for interested customers to allow them to test with ZSTD compression.
  • Measure ZSTD compression performance with Spark. Need to identify appropriate benchmarks for this. nds2.0 doesn't write much data, so it doesn't seem like a good test. One option is to use the nds2.0 data conversion from CSV to parquet/orc with ZSTD compression (a rough PySpark sketch of that conversion follows this list).
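
As a rough illustration of that conversion step, here is a minimal PySpark sketch (not the actual NDS transcode script); the paths, delimiter, and table subset are placeholder assumptions:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nds-csv-to-parquet-zstd")  # placeholder app name
    .config("spark.sql.parquet.compression.codec", "zstd")
    .getOrCreate()
)

raw_root = "/data/raw_sf100"           # placeholder raw CSV location
out_root = "/data/parquet_sf100_zstd"  # placeholder output location

for table in ["store_sales", "item"]:  # placeholder subset of the NDS tables
    (spark.read
        .option("delimiter", "|")      # assumes pipe-delimited raw data; adjust as needed
        .csv(f"{raw_root}/{table}")
        .write.mode("overwrite")
        .parquet(f"{out_root}/{table}"))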

@jbrennan333 (Collaborator) commented Oct 12, 2022

Initial testing on desktop.

  • Run zstd compression integration tests
  • NDS2.0 Data Conversion - scale 100 (desktop)
      • convert from raw data to parquet with no compression
      • convert from raw data to parquet/zstd using CPU
      • convert from raw data to parquet/zstd using GPU
      • compare sizes of all three
      • verify data matches between CPU parquet/zstd and GPU parquet/zstd
  • NDS2.0 Power Run - scale 100 (desktop) on CPU using parquet/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 100 (desktop) on CPU using parquet/zstd data generated by GPU.
  • NDS2.0 Power Run - scale 100 (desktop) on GPU using parquet/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 100 (desktop) on GPU using parquet/zstd data generated by GPU.
  • Compare results from these four runs.
  • NDS2.0 Data Conversion - scale 100 (desktop)
      • convert from raw data to orc with no compression
      • convert from raw data to orc/zstd using CPU
      • convert from raw data to orc/zstd using GPU
      • compare sizes of all three
      • verify data matches between CPU zstd and GPU zstd
  • NDS2.0 Power Run - scale 100 (desktop) on CPU using orc/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 100 (desktop) on CPU using orc/zstd data generated by GPU.
  • NDS2.0 Power Run - scale 100 (desktop) on GPU using orc/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 100 (desktop) on GPU using orc/zstd data generated by GPU.
  • Compare results from these four runs

@jbrennan333 changed the title from '[FEA] Support ZSTD compression with Parquet' to '[FEA] Support ZSTD compression with Parquet and Orc' on Oct 13, 2022
@jbrennan333 (Collaborator)

For my scale 100 (desktop) data gen tests, I regenerated the raw data using a parallel setting of 16 to produce larger source files:

python3 nds_gen_data.py local 100 16 /opt/data/nds2/raw_sf100_parallel_16

@jbrennan333 (Collaborator)

These are overall sizes for data converted from /opt/data/nds2/raw_sf100_parallel_16

100998016	raw_sf100_parallel_16
30800556	cpu_orc_sf100_none
23827772	cpu_orc_sf100_zstd
36334628	cpu_parquet_sf100_none
26837096	cpu_parquet_sf100_zstd
28221372	gpu_orc_sf100_none
28230416	gpu_orc_sf100_zstd
34250312	gpu_parquet_sf100_none
27570448	gpu_parquet_sf100_zstd

Note that most of the reduction in size comes from the conversion from raw CSV data to ORC/Parquet.

CPU raw to PARQUET compression ratio: 2.78
CPU PARQUET to PARQUET ZSTD compression ratio: 1.35
CPU overall raw to PARQUET ZSTD compression ratio: 3.76

GPU raw to PARQUET compression ratio: 2.95
GPU PARQUET to PARQUET ZSTD compression ratio: 1.24
GPU overall raw to PARQUET ZSTD compression ratio: 3.66

CPU raw to ORC compression ratio: 3.28
CPU ORC to ORC ZSTD compression ratio: 1.29
CPU overall raw to ORC ZSTD compression ratio: 4.23

GPU raw to ORC compression ratio: 3.58
GPU ORC to ORC ZSTD compression ratio: 0.99
GPU overall raw to ORC ZSTD compression ratio: 3.58
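
For clarity, the ratios above are just quotients of the directory sizes listed; a quick check in Python (sizes copied from the listing above):

sizes = {
    "raw": 100998016,
    "cpu_parquet_none": 36334628,
    "cpu_parquet_zstd": 26837096,
    "gpu_parquet_none": 34250312,
    "gpu_parquet_zstd": 27570448,
}

print(round(sizes["raw"] / sizes["cpu_parquet_none"], 2))               # 2.78  CPU raw -> PARQUET
print(round(sizes["cpu_parquet_none"] / sizes["cpu_parquet_zstd"], 2))  # 1.35  CPU PARQUET -> PARQUET ZSTD
print(round(sizes["raw"] / sizes["cpu_parquet_zstd"], 2))               # 3.76  CPU overall raw -> PARQUET ZSTD
print(round(sizes["gpu_parquet_none"] / sizes["gpu_parquet_zstd"], 2))  # 1.24  GPU PARQUET -> PARQUET ZSTD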

@jbrennan333 (Collaborator)

When comparing output of power runs between CPU and GPU at scale 100, I consistently see a difference in query79, regardless of compression (none, zstd, snappy). So I don't think it is a zstd issue, but it might need more investigation.

@jbrennan333 (Collaborator)

The q79 diff looks like this:

=== Comparing Query: query79 ===
Collecting rows from DataFrame
Collected 100 rows in 0.11267280578613281 seconds
Collecting rows from DataFrame
Collected 100 rows in 0.10186338424682617 seconds
Row 99: 
[None, None, 'Bethel', 12213922, None, None]
[None, None, 'Bethel', 22767968, None, None]

@jbrennan333 (Collaborator)

Power-run output is pretty small, but I am noticing some differences in the size of the Parquet output for CPU vs GPU:

4.2M	power-output-cpu-parquet-cpu-zstd
4.2M	power-output-cpu-parquet-gpu-zstd
6.0M	power-output-gpu-parquet-cpu-zstd
6.0M	power-output-gpu-parquet-gpu-zstd

Most of the difference seems to be in the query98 output:

1.7M	power-output-cpu-parquet-cpu-zstd/query98
3.6M	power-output-gpu-parquet-cpu-zstd/query98

Inspecting one of the partitions for query98 shows that in the GPU output, two of the columns are uncompressed:

############ Column(i_item_desc) ############
name: i_item_desc
path: i_item_desc
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: UNCOMPRESSED (space_saved: 0%)

############ Column(i_category) ############
name: i_category
path: i_category
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: UNCOMPRESSED (space_saved: 0%)

For CPU, these columns are compressed, although for i_category the compressed size appears to be larger than the uncompressed size:

############ Column(i_item_desc) ############
name: i_item_desc
path: i_item_desc
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: ZSTD (space_saved: 64%)

############ Column(i_category) ############
name: i_category
path: i_category
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: ZSTD (space_saved: -26%)
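
For anyone reproducing this, a hedged sketch of pulling the same per-column codec information with pyarrow instead of parquet-tools; the file name is a placeholder:

import pyarrow.parquet as pq

# Placeholder path to one output partition file.
md = pq.ParquetFile("part-00000.zstd.parquet").metadata

for rg in range(md.num_row_groups):
    for col_idx in range(md.num_columns):
        col = md.row_group(rg).column(col_idx)
        # compression is e.g. 'ZSTD' or 'UNCOMPRESSED'; the two sizes show whether it helped.
        print(col.path_in_schema, col.compression,
              col.total_compressed_size, col.total_uncompressed_size)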

@revans2 (Collaborator) commented Oct 18, 2022

> When comparing output of power runs between CPU and GPU at scale 100, I consistently see a difference in query79, regardless of compression (none, zstd, snappy). So I don't think it is a zstd issue, but it might need more investigation.

The field that differs is ss_ticket_number, which is a group-by key but not in the sort order for the output. My guess is that this is just ambiguous ordering in the output (see the sketch below).
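
A minimal illustration (not query79 itself) of why such a column can legitimately differ between two correct results; the values are taken from the q79 diff above, and the column names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ambiguous-ordering").getOrCreate()

# Two rows that tie on the sort key but differ in a column that is not sorted on.
df = spark.createDataFrame(
    [("Bethel", 12213922), ("Bethel", 22767968)],
    ["city", "ticket_number"],
)

# Both rows are equal under the ORDER BY, so either ticket_number may appear first;
# CPU and GPU plans are free to break the tie differently and both results are correct.
df.orderBy("city").limit(1).show()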

@jbrennan333 (Collaborator)

Will want this fix in CUDF: rapidsai/cudf#11869

@jbrennan333 (Collaborator) commented Oct 19, 2022

Testing on the spark2a cluster at 3TB scale.

  • NDS2.0 Data Conversion - scale 3000 (spark2a)
      • convert from raw data to parquet with no compression
      • will use existing CPU generated /data/nds2.0/parquet_sf3k_decimal_zstd for comparisons
      • convert from raw data to parquet/zstd using GPU
      • compare sizes of all three
      • verify data matches between CPU parquet/zstd and GPU parquet/zstd
  • NDS2.0 Power Run - scale 3000 on CPU using parquet/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 3000 on CPU using parquet/zstd data generated by GPU.
  • NDS2.0 Power Run - scale 3000 on GPU using parquet/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 3000 on GPU using parquet/zstd data generated by GPU.
  • Compare results from these four runs.
  • NDS2.0 Data Conversion - scale 3000 (spark2a)
      • convert from raw data to orc with no compression
      • convert from raw data to orc/zstd using CPU
      • convert from raw data to orc/zstd using GPU
      • compare sizes of all three
      • verify data matches between CPU ORC/zstd and GPU ORC/zstd
  • NDS2.0 Power Run - scale 3000 on CPU using orc/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 3000 on CPU using orc/zstd data generated by GPU.
  • NDS2.0 Power Run - scale 3000 on GPU using orc/zstd data generated by CPU.
  • NDS2.0 Power Run - scale 3000 on GPU using orc/zstd data generated by GPU.
  • Compare results from these four runs

@jbrennan333 (Collaborator) commented Oct 20, 2022

Spark2a (a100) data conversion sizes (via hadoop fs -count):

       DIRS         FILES       SIZE
          26         3016      2958431049504 /data/nds2.0/raw_sf3k
       12064        12080      1109313171766 /data/jimb/cpu_parquet_sf3k_decimal_none
       12064        12080       779117287097 /data/jimb/cpu_parquet_sf3k_decimal_zstd

       12064        12080       911568648724 /data/jimb/gpu_parquet_sf3k_decimal_none
       12064        12080       835715788420 /data/jimb/gpu_parquet_sf3k_decimal_zstd

CPU raw to PARQUET NONE compression ratio: 2.66
CPU PARQUET to CPU PARQUET ZSTD compression ratio: 1.42
CPU overall raw to PARQUET ZSTD compression ratio: 3.80

GPU raw to GPU PARQUET compression ratio: 3.24
GPU PARQUET to GPU PARQUET ZSTD compression ratio: 1.09
GPU overall raw to PARQUET ZSTD compression ratio: 3.53

@jbrennan333 (Collaborator)

Comparing sizes of 3TB power-run outputs:

         104          251            8855415 /data/jimb/power-output-cpu-parquet-cpu-zstd
         104          231           12755903 /data/jimb/power-output-gpu-parquet-cpu-zstd

         104          269            8982069 /data/jimb/power-output-cpu-parquet-gpu-zstd
         104          235           12762506 /data/jimb/power-output-gpu-parquet-gpu-zstd

There is no significant difference in output size, for either the GPU or the CPU runs, between using GPU-generated and CPU-generated input data.

In the runs using zstd input data generated by the CPU, the GPU output data is about 1.44x larger.
In the runs using zstd input data generated by the GPU, the GPU output data is about 1.42x larger.

Similar to my scale 100 results, most of the difference is in query98 results.

hadoop fs -count /data/jimb/power-output-cpu-parquet-cpu-zstd/* | sort -k3 -n | tail
           1            2              11810 /data/jimb/power-output-cpu-parquet-cpu-zstd/query17
           1            2              14063 /data/jimb/power-output-cpu-parquet-cpu-zstd/query31
           1            2              16877 /data/jimb/power-output-cpu-parquet-cpu-zstd/query73
           1            2              21508 /data/jimb/power-output-cpu-parquet-cpu-zstd/query66
           1            2              22966 /data/jimb/power-output-cpu-parquet-cpu-zstd/query39_part2
           1            2             462955 /data/jimb/power-output-cpu-parquet-cpu-zstd/query39_part1
           1            5             924995 /data/jimb/power-output-cpu-parquet-cpu-zstd/query64
           1           10            1489572 /data/jimb/power-output-cpu-parquet-cpu-zstd/query34
           1           30            2651055 /data/jimb/power-output-cpu-parquet-cpu-zstd/query71
           1            8            2929125 /data/jimb/power-output-cpu-parquet-cpu-zstd/query98
hadoop fs -count /data/jimb/power-output-gpu-parquet-cpu-zstd/* | sort -k3 -n | tail
           1            2              10257 /data/jimb/power-output-gpu-parquet-cpu-zstd/query17
           1            2              13227 /data/jimb/power-output-gpu-parquet-cpu-zstd/query31
           1            2              16399 /data/jimb/power-output-gpu-parquet-cpu-zstd/query73
           1            2              16614 /data/jimb/power-output-gpu-parquet-cpu-zstd/query66
           1            2              22002 /data/jimb/power-output-gpu-parquet-cpu-zstd/query39_part2
           1            2             462628 /data/jimb/power-output-gpu-parquet-cpu-zstd/query39_part1
           1            4             907409 /data/jimb/power-output-gpu-parquet-cpu-zstd/query64
           1            7            1465788 /data/jimb/power-output-gpu-parquet-cpu-zstd/query34
           1           16            3105661 /data/jimb/power-output-gpu-parquet-cpu-zstd/query71
           1            6            6479179 /data/jimb/power-output-gpu-parquet-cpu-zstd/query98

I think this may be due to UNCOMPRESSED i_item_desc string columns:

GPU:
parquet-tools inspect part-00000-7f2a657c-3a23-4024-b884-9acd56bb239e-c000.zstd.parquet
...
############ Column(i_item_desc) ############
name: i_item_desc
path: i_item_desc
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: UNCOMPRESSED (space_saved: 0%)
CPU:
parquet-tools inspect part-00000-8e5d6fbf-78fd-4bd2-a5ea-136b58ed2da6-c000.zstd.parquet
...
############ Column(i_item_desc) ############
name: i_item_desc
path: i_item_desc
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: ZSTD (space_saved: 64%)

@jbrennan333 (Collaborator) commented Oct 24, 2022

Spark2a (a100) data conversion sizes for orc (via hadoop fs -count):

          26         3016      2958431049504 /data/nds2.0/raw_sf3k

       12064        12080      1109611149766 /data/jimb/cpu_orc_sf3k_decimal_none
       12064        12080       718204423250 /data/nds2.0/orc_sf3k_decimal_zstd

       12064        12080       797612029660 /data/jimb/gpu_orc_sf3k_decimal_none
       12064        12080       797831485879 /data/jimb/gpu_orc_sf3k_decimal_zstd

CPU raw to CPU ORC NONE compression ratio: 2.67
CPU ORC to CPU ORC ZSTD compression ratio: 1.55
CPU overall raw to ORC ZSTD compression ratio: 4.12

GPU raw to GPU ORC NONE compression ratio: 3.71
GPU ORC to GPU ORC ZSTD compression ratio: 1.0
GPU overall raw to ORC ZSTD compression ratio: 3.71

@jbrennan333 (Collaborator)

As a sanity check, I ran the nds2.0 power run on a dataproc cluster with P4 GPUs, with zstd parquet input data, and wrote the output in parquet/zstd format.
No failures.

@jbrennan333 (Collaborator)

Comparing sizes of 3TB power-run outputs for ORC:

         104          254            9776040 /data/jimb/power-output-cpu-orc-cpu-zstd
         104          232           17211185 /data/jimb/power-output-gpu-orc-cpu-zstd

         104          267            9850470 /data/jimb/power-output-cpu-orc-gpu-zstd
         104          235           17244382 /data/jimb/power-output-gpu-orc-gpu-zstd

There is no significant difference in output size, for either the GPU or the CPU runs, between using GPU-generated and CPU-generated input data.

In the runs using zstd input data generated by the CPU, the GPU output data is about 1.76x larger.
In the runs using zstd input data generated by the GPU, the GPU output data is about 1.75x larger.

hadoop fs -count /data/jimb/power-output-cpu-orc-cpu-zstd/* | sort -k3 -n | tail
           1            2               8741 /data/jimb/power-output-cpu-orc-cpu-zstd/query12
           1            2              13893 /data/jimb/power-output-cpu-orc-cpu-zstd/query31
           1            2              14015 /data/jimb/power-output-cpu-orc-cpu-zstd/query66
           1            2              14850 /data/jimb/power-output-cpu-orc-cpu-zstd/query73
           1            2              19958 /data/jimb/power-output-cpu-orc-cpu-zstd/query39_part2
           1            2             458098 /data/jimb/power-output-cpu-orc-cpu-zstd/query39_part1
           1            5             801487 /data/jimb/power-output-cpu-orc-cpu-zstd/query64
           1           11            1503282 /data/jimb/power-output-cpu-orc-cpu-zstd/query34
           1            8            3001069 /data/jimb/power-output-cpu-orc-cpu-zstd/query98
           1           32            3716342 /data/jimb/power-output-cpu-orc-cpu-zstd/query71
hadoop fs -count /data/jimb/power-output-gpu-orc-cpu-zstd/* | sort -k3 -n | tail
           1            2              16377 /data/jimb/power-output-gpu-orc-cpu-zstd/query31
           1            2              17921 /data/jimb/power-output-gpu-orc-cpu-zstd/query17
           1            2              21630 /data/jimb/power-output-gpu-orc-cpu-zstd/query73
           1            2              29545 /data/jimb/power-output-gpu-orc-cpu-zstd/query39_part2
           1            2              36416 /data/jimb/power-output-gpu-orc-cpu-zstd/query2
           1            2             723937 /data/jimb/power-output-gpu-orc-cpu-zstd/query39_part1
           1            4            1422589 /data/jimb/power-output-gpu-orc-cpu-zstd/query64
           1            8            1987830 /data/jimb/power-output-gpu-orc-cpu-zstd/query34
           1           16            5528351 /data/jimb/power-output-gpu-orc-cpu-zstd/query71
           1            6            7114846 /data/jimb/power-output-gpu-orc-cpu-zstd/query98

The sizes for GPU are consistently larger. Query98 is again the biggest difference, at about 2.37x the CPU version.
The main difference I see in the metadata is that for CPU the compression size is 262144, while for GPU it is 65536.
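
For reference, on the CPU path that 262144 comes from the ORC writer's orc.compress.size setting (the default ORC compression chunk size); below is a hedged sketch of pinning it explicitly via Spark's spark.hadoop.* pass-through, assuming a Spark version whose ORC datasource accepts zstd. It is not clear whether the GPU (cuDF) writer honors the same knob.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orc-zstd-compress-size")  # placeholder app name
    # spark.hadoop.* keys are copied into the Hadoop Configuration used by the CPU ORC writer.
    .config("spark.hadoop.orc.compress.size", "262144")
    .config("spark.sql.orc.compression.codec", "zstd")
    .getOrCreate()
)

# Placeholder write; this only affects the CPU ORC writer, not the GPU path.
spark.range(1000).write.mode("overwrite").orc("/tmp/orc_zstd_out")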

@jbrennan333 (Collaborator)

ORC NDS2.0 SCALE 3000 Data Conversion using CPU

Total conversion time for 24 tables was 3241.63s
Time to convert 'customer_address' was 170.7648s
Time to convert 'customer_demographics' was 15.9405s
Time to convert 'date_dim' was 3.8864s
Time to convert 'warehouse' was 0.2383s
Time to convert 'ship_mode' was 2.0146s
Time to convert 'time_dim' was 2.9487s
Time to convert 'reason' was 1.9077s
Time to convert 'income_band' was 2.1465s
Time to convert 'item' was 10.3129s
Time to convert 'store' was 2.5937s
Time to convert 'call_center' was 2.3034s
Time to convert 'customer' was 333.3341s
Time to convert 'web_site' was 0.2279s
Time to convert 'store_returns' was 122.8344s
Time to convert 'household_demographics' was 0.2081s
Time to convert 'web_page' was 0.5157s
Time to convert 'promotion' was 0.2138s
Time to convert 'catalog_page' was 0.4268s
Time to convert 'inventory' was 19.3619s
Time to convert 'catalog_returns' was 23.9985s
Time to convert 'web_returns' was 37.9469s
Time to convert 'web_sales' was 151.2772s
Time to convert 'catalog_sales' was 320.9935s
Time to convert 'store_sales' was 2015.2364s

ORC NDS2.0 SCALE 3000 Data Conversion using GPU

Total conversion time for 24 tables was 2365.16s
Time to convert 'customer_address' was 143.8623s
Time to convert 'customer_demographics' was 5.5104s
Time to convert 'date_dim' was 2.0233s
Time to convert 'warehouse' was 1.9286s
Time to convert 'ship_mode' was 1.0685s
Time to convert 'time_dim' was 0.2953s
Time to convert 'reason' was 1.4386s
Time to convert 'income_band' was 0.1812s
Time to convert 'item' was 5.8370s
Time to convert 'store' was 0.7381s
Time to convert 'call_center' was 0.2208s
Time to convert 'customer' was 23.7831s
Time to convert 'web_site' was 2.0297s
Time to convert 'store_returns' was 135.3156s
Time to convert 'household_demographics' was 0.1854s
Time to convert 'web_page' was 0.2322s
Time to convert 'promotion' was 0.2122s
Time to convert 'catalog_page' was 0.2964s
Time to convert 'inventory' was 9.1562s
Time to convert 'catalog_returns' was 72.8754s
Time to convert 'web_returns' was 38.5083s
Time to convert 'web_sales' was 333.8525s
Time to convert 'catalog_sales' was 576.3461s
Time to convert 'store_sales' was 1009.2584s

Total speedup for GPU was about 1.37x. This is just a single run and I have not spent any significant time trying to optimize it. You can see that for the largest table, store_sales, the speedup was nearly 2x, but on the next largest table, catalog_sales, the GPU was about 1.8x slower.
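
For clarity, the speedups quoted here are simply CPU time divided by GPU time from the two tables above:

cpu_total, gpu_total = 3241.63, 2365.16
cpu_store_sales, gpu_store_sales = 2015.2364, 1009.2584
cpu_catalog_sales, gpu_catalog_sales = 320.9935, 576.3461

print(round(cpu_total / gpu_total, 2))                  # ~1.37x overall speedup
print(round(cpu_store_sales / gpu_store_sales, 2))      # ~2.0x for store_sales
print(round(gpu_catalog_sales / cpu_catalog_sales, 2))  # ~1.8x slower for catalog_sales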

@jbrennan333 (Collaborator)

PARQUET NDS2.0 SCALE 3000 Data Conversion using CPU

Total conversion time for 24 tables was 3010.75s
Time to convert 'customer_address' was 182.2625s
Time to convert 'customer_demographics' was 16.8170s
Time to convert 'date_dim' was 4.1716s
Time to convert 'warehouse' was 2.4555s
Time to convert 'ship_mode' was 2.4329s
Time to convert 'time_dim' was 3.1540s
Time to convert 'reason' was 2.2615s
Time to convert 'income_band' was 2.3192s
Time to convert 'item' was 8.4986s
Time to convert 'store' was 2.6609s
Time to convert 'call_center' was 2.6035s
Time to convert 'customer' was 304.6228s
Time to convert 'web_site' was 2.6494s
Time to convert 'store_returns' was 116.6746s
Time to convert 'household_demographics' was 0.2331s
Time to convert 'web_page' was 0.4761s
Time to convert 'promotion' was 0.2269s
Time to convert 'catalog_page' was 0.5668s
Time to convert 'inventory' was 19.6026s
Time to convert 'catalog_returns' was 23.8653s
Time to convert 'web_returns' was 36.3303s
Time to convert 'web_sales' was 145.0934s
Time to convert 'catalog_sales' was 306.9950s
Time to convert 'store_sales' was 1823.7792s

Spark configuration follows:

('spark.driver.maxResultSize', '2GB')
('spark.app.name', 'NDS - transcode')
('spark.sql.adaptive.enabled', 'true')
('spark.driver.memory', '50G')
('spark.executor.memory', '32G')
('spark.sql.shuffle.partitions', '1024')
('spark.locality.wait', '0')
('spark.executor.cores', '16')
('spark.app.id', 'app-20221026140612-0010')
('spark.sql.parquet.compression.codec', 'zstd')

PARQUET NDS2.0 SCALE 3000 Data Conversion using GPU

Total conversion time for 24 tables was 2326.97s
Time to convert 'customer_address' was 135.5229s
Time to convert 'customer_demographics' was 6.7232s
Time to convert 'date_dim' was 1.2225s
Time to convert 'warehouse' was 2.8716s
Time to convert 'ship_mode' was 1.3489s
Time to convert 'time_dim' was 1.8186s
Time to convert 'reason' was 1.7725s
Time to convert 'income_band' was 1.2412s
Time to convert 'item' was 6.9413s
Time to convert 'store' was 0.3417s
Time to convert 'call_center' was 0.4945s
Time to convert 'customer' was 21.6170s
Time to convert 'web_site' was 0.2297s
Time to convert 'store_returns' was 129.1592s
Time to convert 'household_demographics' was 0.2200s
Time to convert 'web_page' was 0.2152s
Time to convert 'promotion' was 0.2361s
Time to convert 'catalog_page' was 0.2131s
Time to convert 'inventory' was 3.6131s
Time to convert 'catalog_returns' was 65.8336s
Time to convert 'web_returns' was 35.4903s
Time to convert 'web_sales' was 307.7802s
Time to convert 'catalog_sales' was 591.1566s
Time to convert 'store_sales' was 1010.9048s

Spark configuration follows:
('spark.app.id', 'app-20221026154733-0013')
('spark.app.name', 'NDS - transcode')
('spark.locality.wait', '0')
('spark.driver.maxResultSize', '2GB')
('spark.driver.memory', '50G')
('spark.driverEnv.LIBCUDF_NVCOMP_POLICY', 'ALWAYS')
('spark.executor.cores', '16')
('spark.executor.instances', '8')
('spark.executor.memory', '32G')
('spark.executor.resource.gpu.amount', '1')
('spark.executorEnv.LIBCUDF_NVCOMP_POLICY', 'ALWAYS')
('spark.plugins', 'com.nvidia.spark.SQLPlugin')
('spark.rapids.memory.pinnedPool.size', '8G')
('spark.rapids.memory.host.spillStorageSize', '32G')
('spark.rapids.sql.batchSizeBytes', '1GB')
('spark.rapids.sql.concurrentGpuTasks', '4')
('spark.sql.adaptive.enabled', 'true')
('spark.sql.files.maxPartitionBytes', '2gb')
('spark.sql.parquet.compression.codec', 'zstd')
('spark.sql.shuffle.partitions', '200')
('spark.task.resource.gpu.amount', '0.0625')

Total overall speedup for GPU was about 1.30x. The speedup for the largest file (store_sales) was about 1.8x. The next largest file (catalog_sales) was about 1.9x slower (a speedup of 0.52).

@jbrennan333 (Collaborator) commented Oct 26, 2022

I collected the size/time info for converting to Parquet/zstd and ORC/zstd at 3TB scale:
parquet-zstd-conversion-times-spark2a.pdf
orc-zstd-conversion-times-spark2a.pdf

@jbrennan333 (Collaborator) commented Oct 26, 2022

Overall, for NDS2.0 data conversion from raw CSV at 3TB scale on the GPU:

  • for Parquet we saw a speedup of about 1.3x with an increase in space of about 1.07x vs CPU;
  • for ORC we saw a speedup of about 1.37x with an increase in space of about 1.11x vs CPU.

@jbrennan333 (Collaborator)

When we convert from CSV to Parquet and ORC, the GPU produces smaller files (a higher compression ratio) than the CPU when compression is set to NONE:

  • Scale 100 CPU raw CSV to PARQUET NONE compression ratio: 2.78
    Scale 100 GPU raw CSV to PARQUET NONE compression ratio: 2.95

  • Scale 3000 CPU raw CSV to PARQUET NONE compression ratio: 2.66
    Scale 3000 GPU raw CSV to PARQUET NONE compression ratio: 3.24

  • Scale 100 CPU raw CSV to ORC NONE compression ratio: 3.28
    Scale 100 GPU raw CSV to ORC NONE compression ratio: 3.58

  • Scale 3000 CPU raw CSV to ORC NONE compression ratio: 2.67
    Scale 3000 GPU raw CSV to ORC NONE compression ratio: 3.71

When we compare the Parquet/ORC files with no compression to those with ZSTD compression, the CPU gets a better compression ratio; in particular, we see essentially no benefit on the GPU for these ORC files compared to using no compression:

  • Scale 100 CPU PARQUET to PARQUET ZSTD compression ratio: 1.35
    Scale 100 GPU PARQUET to PARQUET ZSTD compression ratio: 1.24

  • Scale 3000 CPU PARQUET to PARQUET ZSTD compression ratio: 1.42
    Scale 3000 GPU PARQUET to PARQUET ZSTD compression ratio: 1.09

  • Scale 100 CPU ORC to ORC ZSTD compression ratio: 1.29
    Scale 100 GPU ORC to ORC ZSTD compression ratio: 0.99

  • Scale 3000 CPU ORC to ORC ZSTD compression ratio: 1.55
    Scale 3000 GPU ORC to ORC ZSTD compression ratio: 1.0

The overall compression ratio from CSV to Parquet/ORC ZSTD is about the same at scale 100 and scale 3000:

  • Scale 100 CPU overall raw CSV to PARQUET ZSTD compression ratio: 3.76
    Scale 100 GPU overall raw CSV to PARQUET ZSTD compression ratio: 3.66

  • Scale 3000 CPU overall raw CSV to PARQUET ZSTD compression ratio: 3.80
    Scale 3000 GPU overall raw CSV to PARQUET ZSTD compression ratio: 3.53

  • Scale 100 CPU overall raw CSV to ORC ZSTD compression ratio: 4.23
    Scale 100 GPU overall raw CSV to ORC ZSTD compression ratio: 3.58

  • Scale 3000 CPU overall raw to ORC ZSTD compression ratio: 4.12
    Scale 3000 GPU overall raw to ORC ZSTD compression ratio: 3.71

rapids-bot (bot) pushed a commit to rapidsai/cudf that referenced this issue on Nov 4, 2022:
NVCOMP zstd compression was added in 22.10, but marked experimental, meaning you have to define the environment variable `LIBCUDF_NVCOMP_POLICY=ALWAYS` to enable it.  After completing validation testing using the spark rapids plugin as documented here: NVIDIA/spark-rapids#3037, we believe that we can now change the zstd compression status to stable, which will enable it in cudf by default.  `LIBCUDF_NVCOMP_POLICY=STABLE` is the default value.

Authors:
  - Jim Brennan (https://github.com/jbrennan333)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #12059