[FEA] Optimization of repetition and definition level decoding in the parquet reader kernel. #12633

Closed · nvdbaranec opened this issue Jan 27, 2023 · 2 comments
Labels: cuIO, feature request, improvement, libcudf, Performance

@nvdbaranec (Contributor):

The parquet reader kernels (gpuDecodePages and gpuComputePageSizes) are heavily bottlenecked by the repetition/definition level decoding steps.

These streams provide the information needed to compute output value indices, nulls, and nesting depth. We currently use a single warp to decode batches of this data in a double-buffered fashion, handing the buffers off to three further warps that decode the actual output data.

The idea is that the level decoding warp can provide batches of information faster than the value decoding warps can consume them. In practice, this doesn't happen: the value decoding warps sit idle while the level decoding warp slogs through the data.
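A minimal toy of that warp split (illustrative names and a trivial "decode", not the actual libcudf code; launch with at least 64 threads per block):

```cuda
#include <cstdint>

constexpr int BATCH = 32;

__global__ void warp_split_decode(int const* levels, int* out, int num_batches)
{
  __shared__ int buf[2][BATCH];  // double-buffered level batches
  int const warp_id = threadIdx.x / 32;
  int const lane    = threadIdx.x % 32;

  for (int b = 0; b <= num_batches; ++b) {
    if (warp_id == 0 && b < num_batches) {
      // producer warp: "decode" batch b into one buffer slot
      buf[b & 1][lane] = levels[b * BATCH + lane];
    }
    if (warp_id == 1 && b > 0) {
      // consumer warp: turn the previous batch into output values
      out[(b - 1) * BATCH + lane] = buf[(b - 1) & 1][lane];
    }
    __syncthreads();  // hand buffers off between the two roles
  }
}
```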

A proposed optimization here is:

Change the fundamental way the kernel is structured to a more direct producer-consumer model. The primary bottleneck is the single warp that walks through the level data, which is laid out as:

[4 byte header + level data chunk] [4 byte header + level data chunk] ....

A better structure would be to have a single warp produce a list of all of the chunks and put them in a work queue. Each individual chunk would then get its own warp to do the decoding. So instead of having one warp decoding, say, 64 chunks, we could have 64 warps each decoding one chunk. There would have to be some additional coordination to determine sizes and output positions, but that would mostly take the form of a scan/prefix-sum across the warps, which is fast. In addition, if we do this, it seems possible that we could eliminate the double-buffered intermediate storage of output indices: each warp decoding the level stream could also simply copy/decode the output values and validity directly.
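A rough sketch of what that could look like, with made-up types, a trivial 1-byte-per-value "decode", and CUB for the block-wide prefix sum (none of this is the actual libcudf implementation):

```cuda
#include <cstdint>
#include <cstring>
#include <cub/cub.cuh>

constexpr int MAX_CHUNKS = 64;
constexpr int BLOCK_SIZE = 256;  // 8 warps per block in this toy

struct Chunk { int offset; int count; };  // byte offset + value count

__global__ void chunked_level_decode(uint8_t const* stream, int stream_len, int* out)
{
  __shared__ Chunk chunks[MAX_CHUNKS];
  __shared__ int out_pos[MAX_CHUNKS];
  __shared__ int num_chunks;

  // Phase 1: build the work queue. One thread walks the
  // [4 byte header + level data chunk] stream and records each chunk
  // (the real kernel could dedicate a whole warp to this).
  if (threadIdx.x == 0) {
    int n = 0, pos = 0;
    while (pos + 4 <= stream_len && n < MAX_CHUNKS) {
      int len;
      memcpy(&len, stream + pos, 4);      // header = chunk byte length
      chunks[n++] = Chunk{pos + 4, len};  // 1 byte per value in this toy
      pos += 4 + len;
    }
    num_chunks = n;
  }
  __syncthreads();

  // Phase 2: block-wide exclusive scan over chunk sizes gives each
  // chunk its output position -- the "additional coordination".
  using Scan = cub::BlockScan<int, BLOCK_SIZE>;
  __shared__ typename Scan::TempStorage tmp;
  int const count = (threadIdx.x < num_chunks) ? chunks[threadIdx.x].count : 0;
  int offset;
  Scan(tmp).ExclusiveSum(count, offset);  // collective: all threads call
  if (threadIdx.x < num_chunks) { out_pos[threadIdx.x] = offset; }
  __syncthreads();

  // Phase 3: one warp per chunk, decoding directly to the final output,
  // with no double-buffered intermediate storage.
  int const warp_id = threadIdx.x / 32;
  int const lane    = threadIdx.x % 32;
  for (int c = warp_id; c < num_chunks; c += BLOCK_SIZE / 32) {
    Chunk const ch = chunks[c];
    for (int i = lane; i < ch.count; i += 32) {
      out[out_pos[c] + i] = stream[ch.offset + i];  // stand-in "decode"
    }
  }
}
```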

@nvdbaranec (Contributor, Author):

Update: I've done quite a bit of work on this. Notably:

  • Coming up with a mechanism for decoding repetition/definition levels (as well as dictionaries) using a block-wide mechanic instead of just a single warp. It is considerably faster (up to 80% in the preprocess kernel and 50% in the decode kernel); a toy sketch of the block-wide idea follows this list.
  • Converting the remainder of the preprocess and decode kernels to also operate block-wide. The preprocess kernel optimization has been merged in #13203 (Optimization to decoding of parquet level streams).
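As a toy illustration of the block-wide mechanic (made-up types, at most one run per thread; the real level decoder handles Parquet's RLE/bit-packed hybrid encoding and streaming batches):

```cuda
#include <cub/cub.cuh>

constexpr int BLOCK = 128;

struct Run { int value; int length; };

__global__ void block_rle_expand(Run const* runs, int num_runs, int* out)
{
  using Scan = cub::BlockScan<int, BLOCK>;
  __shared__ typename Scan::TempStorage tmp;

  // One run per thread; inactive threads contribute a zero length.
  Run const run = (threadIdx.x < num_runs) ? runs[threadIdx.x] : Run{0, 0};
  int offset;
  Scan(tmp).ExclusiveSum(run.length, offset);  // collective: all threads call

  // Every thread expands its own run at its scanned offset, so the
  // whole block participates instead of a single warp.
  for (int i = 0; i < run.length; ++i) { out[offset + i] = run.value; }
}
```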

For the decode kernel, the overall optimization works, but growing the kernel from warp-based to block-based caused a large increase in register and shared memory usage that undid the performance wins. The full heavyweight kernel can be found here: https://github.com/nvdbaranec/cudf/blob/3b13588f01bfb5ba55970c660641478de5fbcb38/cpp/src/io/parquet/page_data.cu#L2460

The conclusion reached was that we need to start exploring multiple, smaller kernels that each target a specific type of page data (fixed-width vs. nested vs. dictionary, etc.) and run them in parallel on multiple streams. The idea is that the individual kernels stay slimmer and retain the optimization wins specific to their cases. A proof of concept for fixed-width, non-dictionary data can be found here: #13622
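A hedged sketch of that structure (kernel names, signatures, and the host-side bucketing are illustrative, not the actual libcudf entry points):

```cuda
#include <cuda_runtime.h>

// Pages are bucketed by type on the host; each slim, specialized kernel
// runs on its own stream so the launches can overlap on the device.
__global__ void decode_fixed_width_pages(/* pages in this bucket */) { /* slim fixed-width path */ }
__global__ void decode_nested_pages(/* pages in this bucket */)      { /* nesting-aware path */ }

void decode_all_pages(int num_fixed_pages, int num_nested_pages)
{
  cudaStream_t fixed_stream, nested_stream;
  cudaStreamCreate(&fixed_stream);
  cudaStreamCreate(&nested_stream);

  // One block per page; each kernel keeps only the registers/shared
  // memory its own case needs, instead of one heavyweight kernel.
  if (num_fixed_pages > 0)
    decode_fixed_width_pages<<<num_fixed_pages, 128, 0, fixed_stream>>>();
  if (num_nested_pages > 0)
    decode_nested_pages<<<num_nested_pages, 128, 0, nested_stream>>>();

  cudaStreamSynchronize(fixed_stream);
  cudaStreamSynchronize(nested_stream);
  cudaStreamDestroy(fixed_stream);
  cudaStreamDestroy(nested_stream);
}
```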

What remains now is to determine the exact mix of kernels to use. There are many possible ways to break them down, but starting with the cases that share the most similar shared-memory profile (e.g., dictionary vs. non-dictionary) is probably a good start.

@GregoryKimball (Contributor):

Closing in favor of #14953
