
Fix compression in ORC writer #12194

Merged
merged 4 commits into branch-22.12 from bug-write_orc-compressission on Nov 21, 2022

Conversation

vuule
Contributor

@vuule vuule commented Nov 18, 2022

Description

Issues: #12066, #12170
There was a logic error in the ORC writer that prevented the use of compressed blocks in the output file. As a result, all ORC files were effectively written without compression, producing unnecessarily large files in many cases.
This PR makes minimal changes to fix the logic and use the compressed block whenever it is smaller than the uncompressed input, i.e. when the compression ratio is larger than one (see the sketch below).
It also fixes the offset at which compressed blocks are written, to avoid overwriting data as blocks are compacted in the output buffer.
Verified a reduction in file size for the files generated in benchmarks.
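For reference, a minimal host-side sketch of the selection logic described above (illustrative helper names, not the actual libcudf kernel code). Per the ORC spec, each compression block starts with a 3-byte little-endian header whose low bit marks an "original" (uncompressed) block and whose remaining bits hold the block length.

  #include <cstdint>
  #include <cstring>

  // Sketch only: writes the 3-byte ORC block header, (length << 1) | is_original,
  // as a little-endian value.
  inline void write_block_header(uint8_t* dst, uint32_t block_len, bool is_original)
  {
    uint32_t const header = (block_len << 1) | (is_original ? 1u : 0u);
    dst[0] = static_cast<uint8_t>(header);
    dst[1] = static_cast<uint8_t>(header >> 8);
    dst[2] = static_cast<uint8_t>(header >> 16);
  }

  // Use the compressed block only when it is strictly smaller than the source;
  // on compression failure (comp_ok == false) or incompressible data, fall back
  // to the raw bytes. Returns the total bytes written, including the header.
  inline uint32_t select_block(uint8_t* dst,
                               uint8_t const* src, uint32_t src_len,
                               uint8_t const* comp, uint32_t comp_len, bool comp_ok)
  {
    uint32_t const dst_len    = comp_ok ? comp_len : src_len;  // mirrors the ternary in the diff
    bool const use_compressed = dst_len < src_len;             // the fixed selection
    uint32_t const block_len  = use_compressed ? dst_len : src_len;

    write_block_header(dst, block_len, /*is_original=*/!use_compressed);
    std::memcpy(dst + 3, use_compressed ? comp : src, block_len);
    return 3 + block_len;
  }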

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Nov 18, 2022
@vuule vuule self-assigned this Nov 18, 2022
@github-actions github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Nov 18, 2022
@codecov

codecov bot commented Nov 18, 2022

Codecov Report

Base: 88.25% // Head: 88.25% // No change to project coverage 👍

Coverage data is based on head (08c0c5a) compared to base (782fba3).
Patch has no changes to coverable lines.

Additional details and impacted files
@@              Coverage Diff              @@
##           branch-22.12   #12194   +/-   ##
=============================================
  Coverage         88.25%   88.25%           
=============================================
  Files               137      137           
  Lines             22571    22571           
=============================================
  Hits              19921    19921           
  Misses             2650     2650           
Impacted Files                              Coverage Δ
python/dask_cudf/dask_cudf/backends.py      85.17% <0.00%> (ø)


@@ -1234,7 +1235,7 @@ __global__ void __launch_bounds__(1024)
                        ? results[ss.first_block + b].bytes_written
                        : src_len;
       uint32_t blk_size24{};
-      if (results[ss.first_block + b].status == compression_status::SUCCESS) {
+      if (src_len < dst_len) {
Contributor

What are the semantics here? "If we failed to produce a compressed block smaller than the uncompressed data (either by failure or because of unlucky compression), just use the uncompressed data"?

Contributor Author

That's exactly right.
Planning to add a comment.

Contributor Author

We could also just invert the original condition, but this gives us additional benefits with incompressible data.

Contributor

Right. This seems like a good way to do it.

Contributor

Wait, so there is no longer a check for SUCCESS compression status?

Contributor

It is, indirectly, in the code slightly above it:

  auto dst_len = (results[ss.first_block + b].status == compression_status::SUCCESS)
                       ? results[ss.first_block + b].bytes_written
                       : src_len;

Contributor

We don't throw any exception if the decomp process failed?

Contributor Author

Added a comment. The logic should be refactored, but I don't want to make any unnecessary changes at this point.

@vuule vuule marked this pull request as ready for review November 18, 2022 18:36
@vuule vuule requested a review from a team as a code owner November 18, 2022 18:36
@vuule vuule requested review from harrism and PointKernel and removed request for a team November 18, 2022 18:36
Contributor

@nvdbaranec nvdbaranec left a comment

Tested against the Spark plugin + integration tests. All passed. Suggest adding a comment on the if statement for clarity.

@ttnghia
Contributor

ttnghia commented Nov 18, 2022

Should this also close #12170?

@vuule
Contributor Author

vuule commented Nov 18, 2022

> Should this also close #12170?

Most likely it will address the issue. I don't see a way to verify the fix, since the file is not attached.

-        auto const dst_offset = b * (padded_block_header_size + padded_comp_block_size);
         inputs[ss.first_block + b] = {src + b * comp_blk_size, blk_size};
+        auto const dst_offset =
+          padded_block_header_size + b * (padded_block_header_size + padded_comp_block_size);
Contributor

Was this also wrongly computed before?

Contributor Author

Sort of. We insert a 3-byte header before each block when we compact the blocks in the output buffer, and the output buffer is actually the same buffer as the compressed data buffer. So without the additional padded_block_header_size offset we would overwrite the first three bytes of the first compressed block and then copy the block 3 bytes ahead, potentially doing even more damage to the block. This is why tests failed when @nvdbaranec fixed the SUCCESS condition locally. Luckily, this one change to the location where we write compressed blocks fixes all issues in compaction.
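To make that layout concrete, here is a small standalone sanity check of the reasoning above; the sizes below are assumed placeholders, not the values libcudf actually uses.

  #include <cassert>
  #include <cstdint>

  int main()
  {
    // Each slot in the shared compressed/output buffer: a padded spot for the
    // 3-byte block header followed by a padded spot for the compressed block.
    constexpr uint32_t block_header_size        = 3;
    constexpr uint32_t padded_block_header_size = 4;   // assumed padding
    constexpr uint32_t padded_comp_block_size   = 64;  // assumed per-block budget
    constexpr uint32_t slot = padded_block_header_size + padded_comp_block_size;

    for (uint32_t b = 0; b < 1000; ++b) {
      // Where the fixed code writes block b's compressed bytes.
      uint32_t const comp_dst = padded_block_header_size + b * slot;
      // Worst-case start of block b's data after compaction: every earlier block
      // kept padded_comp_block_size bytes, each preceded by a 3-byte header.
      uint32_t const compact_dst =
        b * (block_header_size + padded_comp_block_size) + block_header_size;
      // Compaction copies forward within the same buffer, so it is only safe if
      // the destination never overtakes the source. With the pre-fix offset
      // (b * slot) this already fails at b == 0: the header written at the start
      // of the slot clobbers the first bytes of the compressed block.
      assert(compact_dst <= comp_dst);
    }
    return 0;
  }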

Contributor Author

The code that copies the blocks:

  if (src != dst) {
    // Shift the block toward the front of the buffer, 1024 bytes (one pass of
    // the thread block) at a time. Each thread stages its byte in a register
    // before the barrier, so the overlapping copy is safe as long as dst
    // precedes src.
    for (uint32_t i = 0; i < blk_size; i += 1024) {
      uint8_t v = (i + t < blk_size) ? src[i + t] : 0;
      __syncthreads();
      if (i + t < blk_size) { dst[i + t] = v; }
    }
  }

@vuule
Contributor Author

vuule commented Nov 18, 2022

rerun tests

@jolorunyomi jolorunyomi merged commit 769dfbb into rapidsai:branch-22.12 Nov 21, 2022
@vuule vuule deleted the bug-write_orc-compressission branch August 10, 2023 03:26
Labels
bug (Something isn't working), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change)