Include row group level stats when writing ORC files #10041

vuule · 2022-01-13T19:44:58Z

Closes #9964
Encodes row group level stats together with the stripe and file level stats and writes the encoded blobs into the protobuf, at the start of each stripe (other stats are in the file footer).
Adds put_bytes to ProtobufWriter to optimize writing of buffers.
Adds a new struct to represent the encoded ORC statistics, so that they are separated by granularity level (instead of being stored in a single vector).
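
As a rough illustration of the description above, here is a minimal sketch of how an already-encoded statistics blob can be written as a length-delimited protobuf field, with a bulk `put_bytes` instead of a byte-by-byte loop. The names `SketchProtobufWriter`, `put_blob_field`, and `EncodedStatsSketch` are hypothetical and are not the libcudf types; only `put_byte`, `put_bytes`, and `put_uint` echo names mentioned in this PR, and their bodies here are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Blob = std::vector<uint8_t>;

// Hypothetical container: one list of encoded blobs per granularity level,
// instead of a single vector holding all levels.
struct EncodedStatsSketch {
  std::vector<Blob> rowgroup_level;
  std::vector<Blob> stripe_level;
  std::vector<Blob> file_level;
};

// Protobuf wire types (values from the protobuf encoding rules).
enum class WireType : uint32_t { VARINT = 0, FIXED64 = 1, LENGTH_DELIMITED = 2 };

// Hypothetical minimal writer; not the libcudf ProtobufWriter.
class SketchProtobufWriter {
 public:
  // Appends one byte and returns the number of bytes written.
  std::size_t put_byte(uint8_t v)
  {
    buf_.push_back(v);
    return 1;
  }

  // Appends a whole buffer in one call and returns its size.
  std::size_t put_bytes(Blob const& bytes)
  {
    buf_.insert(buf_.end(), bytes.begin(), bytes.end());
    return bytes.size();
  }

  // Base-128 varint: 7 payload bits per byte, high bit set on all but the last.
  std::size_t put_uint(uint64_t v)
  {
    std::size_t written = 0;
    while (v > 0x7f) {
      written += put_byte(static_cast<uint8_t>((v & 0x7f) | 0x80));
      v >>= 7;
    }
    return written + put_byte(static_cast<uint8_t>(v));
  }

  // Writes tag, payload length, then the payload itself.
  std::size_t put_blob_field(uint32_t field_number, Blob const& blob)
  {
    assert(field_number > 0);
    uint64_t const tag = (static_cast<uint64_t>(field_number) << 3) |
                         static_cast<uint64_t>(WireType::LENGTH_DELIMITED);
    std::size_t written = put_uint(tag);
    written += put_uint(blob.size());
    written += put_bytes(blob);
    return written;
  }

  Blob const& buffer() const { return buf_; }

 private:
  Blob buf_;
};
```

Returning the written size from each call lets the caller accumulate the size of the enclosing message without a second pass over the buffer.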

codecov · 2022-01-13T21:27:26Z

Codecov Report

Merging #10041 (fe56f23) into branch-22.02 (967a333) will decrease coverage by 0.07%.
The diff coverage is n/a.

❗ Current head fe56f23 differs from pull request most recent head 6ea0a50. Consider uploading reports for the commit 6ea0a50 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.02   #10041      +/-   ##
================================================
- Coverage         10.49%   10.41%   -0.08%     
================================================
  Files               119      119              
  Lines             20305    20541     +236     
================================================
+ Hits               2130     2139       +9     
- Misses            18175    18402     +227

Impacted Files	Coverage Δ
python/custreamz/custreamz/kafka.py	`29.16% <0.00%> (-0.63%)`	⬇️
python/dask_cudf/dask_cudf/sorting.py	`92.66% <0.00%> (-0.25%)`	⬇️
python/dask_cudf/dask_cudf/core.py	`70.85% <0.00%> (-0.17%)`	⬇️
python/cudf/cudf/__init__.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/api/types.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/frame.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/index.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/parquet.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/dtypes.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/scalar.py	`0.00% <0.00%> (ø)`
... and 31 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 512e161...6ea0a50.

vuule · 2022-01-14T02:02:47Z

Measured a small performance regression due to the additional stats encoding. The difference is within a few percent (hard to measure exactly due to variance between runs).

vuule · 2022-01-14T17:10:41Z

Will look into adding a test; unsure whether this is possible with the available readers.

hyperbolic2346

Copyrights need to be updated on some files here. I like these changes; I think they are going in the right direction for readability.

cpp/src/io/orc/orc.cpp

cpp/src/io/orc/orc.h

cpp/src/io/orc/writer_impl.cu


wbo4958 · 2022-01-14T23:51:41Z

nd writes the encoded blobs into the protobuf, at the start of each stripe (other stats are in the file footer).
Adds put_bytes to ProtobufWriter to optimize writing of buffers.
Adds new struct to represent the encoded ORC

Maybe we can add a configuration option to enable or disable this feature once Spark has moved to an ORC version that has the fix.

vuule · 2022-01-15T00:01:40Z

Maybe we can add a configuration option to enable or disable this feature once Spark has moved to an ORC version that has the fix.

👍
@mythrocks is working on API changes that will allow callers to disable rowgroup level statistics, so they can effectively revert the behavioral changes in this PR.

jlowe · 2022-01-15T00:05:07Z

Maybe we can add a configuration option to enable or disable this feature once Spark has moved to an ORC version that has the fix.

That only works once everyone stops using the older Spark version(s) (and any other Java-based data processing frameworks) that are still using the old ORC version with the reading bug. While frameworks on the older ORC version are still in use, cudf applications could still end up creating ORC files from which those frameworks silently drop data when reading with predicate pushdown. Even though the spec says these stats are technically optional, it is very sketchy to be the only ORC writer on the planet that is not generating them.

I think it's fine making it possible in libcudf to avoid writing these stats, but IMO cudf applications should always ask for the stats to be generated unless they know there's no chance the files they're creating could be read by data processing frameworks that could be affected by the ORC reading bug.
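
For context on why readers care about these stats, the sketch below shows the general idea of predicate pushdown over row group min/max statistics. It is not the logic of any particular ORC reader and does not reproduce the reading bug discussed here; `RowGroupStats`, `may_match`, and `row_groups_to_read` are hypothetical names. The point it illustrates is that a row group without statistics has to be kept, since nothing proves it can be skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-row-group statistics for one integer column.
struct RowGroupStats {
  bool has_stats;  // false when the writer did not emit statistics
  int64_t min;
  int64_t max;
};

// Returns true when the row group could contain a value in [lo, hi].
// Missing statistics are treated conservatively: the row group is kept.
bool may_match(RowGroupStats const& stats, int64_t lo, int64_t hi)
{
  assert(lo <= hi);
  if (!stats.has_stats) { return true; }
  return stats.max >= lo && stats.min <= hi;
}

// Indices of the row groups that must be decoded for the range predicate.
std::vector<std::size_t> row_groups_to_read(std::vector<RowGroupStats> const& row_groups,
                                            int64_t lo,
                                            int64_t hi)
{
  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < row_groups.size(); ++i) {
    if (may_match(row_groups[i], lo, hi)) { selected.push_back(i); }
  }
  return selected;
}
```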

vuule · 2022-01-15T00:12:41Z

I think it's fine making it possible in libcudf to avoid writing these stats, but IMO cudf applications should always ask for the stats to be generated unless they know there's no chance the files they're creating could be read by data processing frameworks that could be affected by the ORC reading bug.

That's right. Writing all statistics will be the default.

vuule · 2022-01-15T03:18:45Z

rerun tests

nvdbaranec · 2022-01-18T16:04:10Z

cpp/src/io/orc/orc.cpp

  m_buf->data()[lpos + 2] = (uint8_t)(sz);
+
+  if (stats != nullptr) {
+    sz += put_uint(encode_field_number<decltype(*stats)>(2));  // 2: statistics


Nit: maybe field number (2 in this case) should be an enum. I see that it's used in a lot of places though, so maybe a followup.

vuule · 2022-01-18

It's doable, but each ORC message type would need its own enum, since the field numbers are not unique between messages (see https://orc.apache.org/specification/ORCv1/). We can have a set of (non-class) enums and still pass them as int. I would need to do this in a follow-up for this PR to make it into 22.02.
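
A minimal sketch of the follow-up idea discussed in this thread: one unscoped enum per ORC message type, still passed as `int`. The field numbers are taken from the ORC v1 specification; the namespace and enumerator names and the `encode_tag` helper are illustrative assumptions, not the libcudf code.

```cpp
#include <cassert>
#include <cstdint>

// One unscoped enum per message type, because field numbers repeat across messages.
namespace row_index_entry {
enum field : int { POSITIONS = 1, STATISTICS = 2 };
}  // namespace row_index_entry

namespace column_statistics {
enum field : int {
  NUMBER_OF_VALUES  = 1,
  INT_STATISTICS    = 2,
  DOUBLE_STATISTICS = 3,
  STRING_STATISTICS = 4
};
}  // namespace column_statistics

// Protobuf wire types (values from the protobuf encoding rules).
enum wire_type : uint32_t { VARINT = 0, FIXED64 = 1, LENGTH_DELIMITED = 2 };

// Hypothetical helper: combines a field number and a wire type into a tag.
// Unscoped enumerators convert implicitly to int, so call sites stay unchanged.
inline uint32_t encode_tag(int field_number, wire_type type)
{
  assert(field_number > 0);
  return (static_cast<uint32_t>(field_number) << 3) | static_cast<uint32_t>(type);
}
```

With this layout, a call such as `encode_tag(row_index_entry::STATISTICS, LENGTH_DELIMITED)` names the field instead of using a bare `2`, while the value 2 still means something different inside `column_statistics`.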

cpp/src/io/orc/orc.h

nvdbaranec

Looks good. Might not hurt to cook up some kind of tests for this.


vuule · 2022-01-19T08:58:18Z

rerun tests

galipremsagar · 2022-01-19T13:22:57Z

rerun tests

galipremsagar · 2022-01-19T15:03:52Z

@gpucibot merge

Depends on #10041.

The erstwhile ORC writer API exposed only a binary choice for the level of statistics: ENABLED/DISABLED. This commit allows the ORC writer to further choose whether statistics are collected at the ROW_GROUP or STRIPE level. This commit also includes the relevant changes to `java/` and `python/`.

Authors:
- MithunR (https://github.com/mythrocks)
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Jason Lowe (https://github.com/jlowe)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Christopher Harris (https://github.com/cwharris)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #10058
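
To make the granularity choice concrete, here is a small sketch of how a writer might map a requested statistics level to the statistics it emits. `StatsLevel`, `StatsPlan`, and `plan_for` are hypothetical names, and the mapping itself is an assumption based on this thread (the ROW_GROUP level writes all statistics, the STRIPE level omits the row group ones); it is not the API added in #10058.

```cpp
#include <cassert>

// Hypothetical granularity levels, coarsest to finest.
enum class StatsLevel { NONE, STRIPE, ROW_GROUP };

// Which groups of statistics a writer would emit for a given level.
struct StatsPlan {
  bool file_and_stripe;  // statistics stored outside the stripe data
  bool row_group;        // statistics stored in the row index of each stripe
};

// Assumed mapping: each finer level also includes the coarser statistics.
inline StatsPlan plan_for(StatsLevel level)
{
  switch (level) {
    case StatsLevel::NONE: return {false, false};
    case StatsLevel::STRIPE: return {true, false};
    case StatsLevel::ROW_GROUP: return {true, true};
  }
  assert(false && "unknown StatsLevel");
  return {false, false};
}
```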

vuule added 5 commits January 11, 2022 14:50

separate stats by level

b8ae756

encode rg stats

d6c18c8

rename putb

cb2b972

add put_bytes

95f018c

include rg stats in rg index entries

c5b62b9

vuule added the bug (Something isn't working), cuIO (cuIO issue), and non-breaking (Non-breaking change) labels Jan 13, 2022

vuule self-assigned this Jan 13, 2022

github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Jan 13, 2022

docs

f1f6958

fix; don't use optional

c078879

wbo4958 mentioned this pull request Jan 14, 2022

Revert "Disable orc write by default because of https://issues.apache.org/jira/browse/ORC-1075 (#4471)" NVIDIA/spark-rapids#4535

Merged

vuule marked this pull request as ready for review January 14, 2022 17:09

vuule requested a review from a team as a code owner January 14, 2022 17:09

vuule requested review from hyperbolic2346 and nvdbaranec January 14, 2022 17:09

hyperbolic2346 requested changes Jan 14, 2022

cpp/src/io/orc/orc.cpp (outdated, resolved)

cpp/src/io/orc/orc.h (outdated, resolved)

cpp/src/io/orc/writer_impl.cu (outdated, resolved)

vuule added 2 commits January 14, 2022 14:02

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

61733f9

…fea-rowgroup-stats

return written size from put_byte and put_bytes

4bfe814

reuse encode_field_number in protobuf writer

2535754

vuule added 2 commits January 14, 2022 17:10

copyright year

683a016

comment

61a8bec

mythrocks mentioned this pull request Jan 15, 2022

ORC writer API changes for granular statistics #10058

Merged

vuule requested a review from hyperbolic2346 January 15, 2022 02:49

hyperbolic2346 approved these changes Jan 15, 2022


nvdbaranec reviewed Jan 18, 2022


nvdbaranec approved these changes Jan 18, 2022


vuule added 2 commits January 18, 2022 10:59

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

fecf4d5

…fea-rowgroup-stats

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

509138b

…fea-rowgroup-stats

wbo4958 mentioned this pull request Jan 19, 2022

[FEA] Add File Statistic when writing the ORC file #10075

Closed

vuule added 5 commits January 18, 2022 19:38

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

febcd04

…fea-rowgroup-stats

host_span; static_assert

4c82d89

ProtobufType enum

2afedce

style

46a4a3c

copyright year

6ea0a50

vuule added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label Jan 19, 2022

rapids-bot bot merged commit f193d59 into rapidsai:branch-22.02 Jan 19, 2022

vuule deleted the fea-rowgroup-stats branch January 19, 2022 18:03

vuule mentioned this pull request May 3, 2022

Protobuf error on SPARK with cudf data [BUG] #10755

Closed
