Modify Parquet writer to produce column indexes #11171
Closed
Conversation
pass stats to gatherPages formatting
this should make additional ColumnIndex and OffsetIndex tests easier to write
implement readers for ColumnIndex, OffsetIndex, and PageLocation
truncate is broken for 2- and 4-byte values; UTF-8 now maxes at 4 bytes, so fix that as well
github-actions bot added the Java (Affects Java cuDF API), Python (Affects Python cuDF API), and libcudf (Affects libcudf (C++/CUDA) code) labels on Jun 29, 2022
This was referenced Jun 30, 2022
rapids-bot pushed a commit that referenced this pull request on Jul 4, 2022:

Adds some necessary structs to parquet.hpp, as well as methods to CompactProtocolReader/Writer, to address #9268. I can add tests if necessary once #11177 is merged, or testing can be deferred to a future PR (based on #11171).
Authors:
- https://github.com/etseidl
Approvers:
- Devavret Makkar (https://github.com/devavret)
- Yunsong Wang (https://github.com/PointKernel)
URL: #11178
Closing for now. Will resubmit a slimmer PR once #11179 is merged.
This PR addresses #9268 and closes #11038. It is still a work in progress, and I would like feedback on my approach.
The column indexes are actually two different structures: the column index proper, which is essentially per-page min/max statistics, and the offset index, which stores each page's location, compressed size, and first row index. Since the column index contains information already present in the EncColumnChunk structure, I calculate and encode the column index per chunk on the device, storing the result in a blob I added to the EncColumnChunk struct. The offset index requires information that is available only after writing the file, so it is created on the CPU and stored in the aggregate_writer_metadata struct. The indexes themselves are then written to the file just before the footer.
As part of this work, I've also included truncation of the statistics values, as recommended by the Parquet format. I've added a parameter, column_index_truncate_length, to the writer options/builder. It currently defaults to 64, which is the default used by parquet-mr. The truncation code required the addition of some UTF-8 helper functions, some of which may no longer be needed after #11160 is merged.

Also, while writing the code to determine the sort order of the pages, I found that decimal128 statistics were not handled correctly (although the actual values were written properly). There are changes to statistics_type_identification.cuh, temp_storage_wrapper.cuh, and typed_statistics_chunk.cuh to address this. I'm not sure what impact these changes would have on the ORC writer.
I've added some Python and Java bindings, but haven't completed them yet. I've added the unit tests I could think of, but welcome suggestions for further tests.
Sorry for the huge PR, but it's due to circumstances beyond my control that I couldn't submit this as several PRs. :(