Parquet multi kernel optimization example. #13622

nvdbaranec · 2023-06-26T18:31:37Z

This PR is intended to be a concrete implementation of the multi-kernel parquet decode concept. It is not necessarily meant to be checked in in its current form. In short: having large monolithic kernels that can decode all permutations of page data (fixed-width, variable-width, nested, dictionaries, etc.) results in poor performance: increased worst-case shared memory usage, higher per-thread register counts, etc. I spent some time working on an optimization that made the decoder work in a more fundamentally block-wide way, but this ended up not being much of a win, for the reasons stated above. My initial pass was quite fast, and then as I slowly added new features it ground back down.

So the fundamental idea here is to break up the various types of data in the pages of a parquet file and use stripped down kernels specific to each type. Each kernel is launched on a separate, non-default stream to ensure overlapping. There are several benefits to doing it this way:

Each individual kernel can be launched with its own set of grid parameters. For example, fixed-width decoders might perform better with larger blocks than variable-width decoders (or vice-versa).
Shared memory usage can be dramatically dropped and limited to exactly what is needed for a given kernel.
The slimmer kernels can flow around the bigger ones as hardware becomes available. With the monolithic kernel, the single grid size guarantees you will always be running as the biggest, heaviest kernel we have.

This PR does several things:

It breaks up page_data.cu into several component pieces. decode.cuh represents common code shared between the various kernels. decode_general.cu contains the existing, monolithic kernel we already use. decode_fixed.cu contains an optimized version of the decode kernel that targets fixed-width, non-nested, non-dictionaried data.
The new kernel uses the rle_stream to implement a block-wide decoding scheme, instead of the overlapped-multiple-warp technique of the existing one.
At header decode time, each page is assigned a mask value (via get_kernel_mask) that represents the specific kernel it should be decoded with. At data decode time, we OR together the kernel bits of all the pages and launch only the kernels we need (see DecodePageData in page_data.cu).
Many functions and structures have been templatized to account for the variable grid sizes and shared memory requirements.
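To make the mask scheme above concrete, here is a minimal host-side sketch. It is not the code in this PR: `PageDesc`, `KERNEL_MASK_GENERAL`, the bit values and `kernels_to_launch` are assumptions made up for illustration; only `get_kernel_mask`, `kernel_mask` and `KERNEL_MASK_FIXED_WIDTH_NO_DICT` are names that appear in the PR. The strings stand in for actual kernel launches.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical bit values; the real enum in the PR may differ.
constexpr uint32_t KERNEL_MASK_GENERAL             = 1u << 0;
constexpr uint32_t KERNEL_MASK_FIXED_WIDTH_NO_DICT = 1u << 1;

// Simplified stand-in for per-page metadata (assumption, not the real PageInfo).
struct PageDesc {
  bool fixed_width;
  bool nested;
  bool dictionary;
  uint32_t kernel_mask;
};

// Header decode time: pick the one kernel this page should be decoded with.
uint32_t get_kernel_mask(PageDesc const& p)
{
  if (p.fixed_width && !p.nested && !p.dictionary) { return KERNEL_MASK_FIXED_WIDTH_NO_DICT; }
  return KERNEL_MASK_GENERAL;
}

// Data decode time: OR the per-page masks together, then "launch" only the
// kernels whose bit is set. Each kernel would skip pages whose mask differs.
std::vector<std::string> kernels_to_launch(std::vector<PageDesc>& pages)
{
  uint32_t kernel_mask = 0;
  for (auto& p : pages) {
    p.kernel_mask = get_kernel_mask(p);
    kernel_mask |= p.kernel_mask;
  }
  assert(kernel_mask != 0 || pages.empty());  // every page maps to some kernel

  std::vector<std::string> launched;
  if (kernel_mask & KERNEL_MASK_FIXED_WIDTH_NO_DICT) { launched.push_back("fixed"); }
  if (kernel_mask & KERNEL_MASK_GENERAL) { launched.push_back("general"); }
  return launched;
}
```

With a mix of plain fixed-width pages and dictionary pages, both entries come back; with only plain fixed-width pages, the general kernel is never launched.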

General performance is very good for this data subset: roughly 50% speedup across the board in the decode step. This does not lead to that level of speedup for the cudf benchmarks for 2 main reasons:

They tend to spend about half their time in nvcomp decompression.
Many of the benchmarks used a wide variety of data types, so this new kernel is only invoked on a subset of the data.

Nonetheless in situations where we have little or no compression, the wins are clear:

Before

| data_type | io            | bytes_per_second | encoded_file_size |
|-----------|---------------|------------------|-------------------|
| FLOAT     | DEVICE_BUFFER | 35470873266      | 510.303 MiB       |

After

| data_type | io            | bytes_per_second | encoded_file_size |
|-----------|---------------|------------------|-------------------|
| FLOAT     | DEVICE_BUFFER | 54260511096      | 510.303 MiB       |

For cases where we have to go through nvcomp and we're dealing with a mix of new and old kernels, the wins are more mild, even though the decode step itself is still in the 50% speedup ballpark. Below is an example with integers.

Before. Total load time was 37 ms. Only 18 ms of that was the parquet decoder.

After. Total load time was 29 ms. But only 10.5 ms of that was the parquet decoder, with a mix of the new and old kernels.

So a total decode time reduction from 18 ms to 10.5 ms.
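For anyone who wants to sanity-check the numbers quoted above, the arithmetic can be sketched as follows. The helper names are made up for illustration, and the load-time figures are assumed to all be in milliseconds.

```cpp
#include <cassert>

// Relative throughput gain given bytes-per-second before and after (0.5 == +50%).
double throughput_gain(double before_bps, double after_bps)
{
  assert(before_bps > 0.0);
  return after_bps / before_bps - 1.0;
}

// Fraction of wall time saved given durations before and after (0.4 == 40% less time).
double time_reduction(double before_ms, double after_ms)
{
  assert(before_ms > 0.0);
  return (before_ms - after_ms) / before_ms;
}
```

Plugging in the figures from this description: `throughput_gain(35470873266.0, 54260511096.0)` is about 0.53 for the uncompressed FLOAT case, `time_reduction(18.0, 10.5)` is about 0.42 for the decode step alone (roughly a 1.7x speedup), and `time_reduction(37.0, 29.0)` is about 0.22 for the whole load. The non-decode portion is essentially unchanged (19 ms before, 18.5 ms after), which is why the end-to-end win is milder.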

You may notice that in the "after" case, there does not appear to be any kernel overlap. There are two things going on here. First, the fixed-width kernel runs at 100% occupancy, so there's no unused hardware. Second, you can see that when it is near completion, the general purpose kernel starts to flow in (and also runs at considerably lower occupancy).

Related: I have a branch up (not an actual PR) with the full heavyweight implementation of all features in the block-wide style. Hypothetically, all we need to do is cherry pick out the individual chunks of logic from there to build more kernels. This branch also includes a change to rle_stream that distributes the level-decoding work for arbitrarily sized runs evenly across all warps instead of the greedy one-warp-per-run approach.

https://github.com/nvdbaranec/cudf/tree/parquet_decode_optimization
https://github.com/nvdbaranec/cudf/blob/parquet_decode_optimization/cpp/src/io/parquet/page_data.cu
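As a rough illustration of the "evenly across all warps" idea mentioned above (as opposed to one warp per run), here is a host-side sketch. It is not the `rle_stream` code from the branch; `Slice` and `split_evenly` are hypothetical names, and the real implementation works on the device with its own run bookkeeping.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A contiguous piece of one run assigned to one warp (hypothetical type).
struct Slice {
  std::size_t run;     // index of the run
  std::size_t offset;  // first value within the run
  std::size_t count;   // number of values to decode
};

// Divide the total number of values into num_warps nearly equal chunks,
// cutting across run boundaries where needed.
std::vector<std::vector<Slice>> split_evenly(std::vector<std::size_t> const& run_lengths,
                                             std::size_t num_warps)
{
  assert(num_warps > 0);
  std::size_t total = 0;
  for (auto len : run_lengths) { total += len; }

  std::vector<std::vector<Slice>> work(num_warps);
  std::size_t run = 0, offset = 0;
  for (std::size_t w = 0; w < num_warps; ++w) {
    // Warp w owns output values [w * total / num_warps, (w + 1) * total / num_warps).
    std::size_t remaining = (w + 1) * total / num_warps - w * total / num_warps;
    while (remaining > 0) {
      std::size_t take = std::min(remaining, run_lengths[run] - offset);
      if (take > 0) { work[w].push_back({run, offset, take}); }
      offset += take;
      remaining -= take;
      if (offset == run_lengths[run]) { ++run; offset = 0; }
    }
  }
  return work;
}
```

For runs of length 100, 4 and 4 split across 4 warps, each warp ends up with 27 values: the first three warps each take a piece of the long run, and the last warp takes its tail plus both short runs. With a greedy one-warp-per-run scheme, one warp would have carried 100 values while the others carried 4 or nothing.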

nvdbaranec · 2023-06-26T18:54:47Z

Open questions:

We would need to figure out what the right subset of kernels actually is. Based on our feature set, it's easy to imagine 16+ kernels with slightly different features. But that might be overkill. I tend to think the primary way of breaking them down should be based on shared memory usage differences.
There's more optimization that could be pushed through here. For example, the individual "copy this piece of data" functions are shared between the new and old kernels, but they all do a dictionary if-check. This could be if constexpr'd away if we wanted to go crazier.
This file layout is just a suggestion. page_data.cu was getting pretty crazy so I wanted to try out some ideas on splitting things up.
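On the `if constexpr` point above, a minimal sketch of what that could look like, assuming C++17. `read_value` and its parameters are made up for illustration and are not the actual copy functions in the decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// has_dict is a compile-time flag, so each instantiation contains only one of
// the two branches and the runtime dictionary check disappears.
template <bool has_dict>
int32_t read_value(std::vector<int32_t> const& page_values,
                   std::vector<int32_t> const& dictionary,
                   std::size_t i)
{
  assert(i < page_values.size());
  if constexpr (has_dict) {
    // page_values holds indices into the dictionary
    return dictionary[static_cast<std::size_t>(page_values[i])];
  } else {
    // page_values holds the data itself; the dictionary is never touched
    return page_values[i];
  }
}
```

A dictionary kernel would call `read_value<true>` and a plain fixed-width kernel `read_value<false>`, so the shared helper stays in one place while each kernel only pays for the path it uses.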

nvdbaranec · 2023-06-26T19:08:30Z

Also, there will be a decent amount of work to get this merged with the big string decoding change from @etseidl.

etseidl · 2023-06-26T20:18:27Z

cpp/src/io/parquet/page_data.cu

-      <<<dim_grid, dim_block, 0, stream.value()>>>(pages.device_ptr(), chunks, min_row, num_rows);
+  // invoke all relevant kernels. each one will only process the pages whose masks match
+  // their own, and early-out on the rest.
+  if (kernel_mask & KERNEL_MASK_FIXED_WIDTH_NO_DICT) {


I've been thinking about this one. Would it be better to break up pages into separate batches grouped by kernel_mask? That way, each batch could make its own way through pre-process and process, with each batch getting its own stream. Maybe fixed-width decode logic could overlap with some of the string preprocessing, for instance.
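A minimal host-side sketch of the batching idea, assuming each page already carries a single non-zero `kernel_mask`. `batch_by_mask` is a hypothetical helper, not something in the PR; assigning a stream to each batch and running pre-process and decode on it is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// For each distinct mask, collect the indices of the pages that carry it.
std::map<uint32_t, std::vector<std::size_t>> batch_by_mask(std::vector<uint32_t> const& page_masks)
{
  std::map<uint32_t, std::vector<std::size_t>> batches;
  for (std::size_t i = 0; i < page_masks.size(); ++i) {
    assert(page_masks[i] != 0);  // every page must have been assigned a kernel
    batches[page_masks[i]].push_back(i);
  }
  return batches;
}
```

Each entry of the returned map would then become one batch with its own stream, instead of every kernel scanning all pages and early-outing on the ones that do not match.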

Part of #13501. This adds support for decoding Parquet pages that are DELTA_BINARY_PACKED. In addition to adding delta support, this PR incorporates changes introduced in #13622, such as using a mask to determine which decoding kernels to run, and adding parameters to the `page_state_buffers_s` struct to reduce the amount of shared memory used.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- https://github.com/nvdbaranec
- Bradley Dice (https://github.com/bdice)
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #13637

GregoryKimball · 2024-01-29T20:47:21Z

FYI adding specialized decode kernels to parquet reader came up again today in a discussion about the query performance roadmap. I believe the next step was to group the different data types, nullability and encodings by register usage. Then from there we could start adding specialized kernels one at a time.

Please let me know if you agree.

nvdbaranec · 2024-03-04T16:41:03Z

Obsolete with #15159

…t decode (#15159)

This PR adds fixed-width and fixed-width dictionary microkernels to the parquet decoder, based on #13622.

We change `rle_stream` to allow us to process dictionary pages, so we are able to use it in `gpuDecodePageDataFixedDict`, allowing it to take a "count" to be decoded, which we provide as the number of valid (non-null) elements that were found in the definition stream. Most of the intrusive changes are in `rle_stream` so that it can work with this argument.

One big change is that prior to this we used to "spill" runs that would not fit in the current iteration. We have changed it so that we don't spill anymore, and the `runs` array may now hold a large run that won't be decoded until several calls to `decode` later. This opens the possibility for us to throw more decode threads at the accumulated run fairly easily, and that may be worked on in this PR or shortly after (load balancing a large run).

The code here is really mostly @nvdbaranec and makes use of @etseidl's great work on `rle_stream`.

It is marked in draft because it's not entirely done (not all testing has been performed). That said, NDS, nvbench and `PARQUET_TEST` pass.

In order to use it, please set `USE_FIXED_OP=2`, which enables both the `gpuDecodePageDataFixed` and `gpuDecodePageDataFixedDict` kernels.
Here are `USE_FIXED_OP=2` nvbench results against 24.04 on my RTX6000:

```
# parquet_read_decode

## [0] Quadro RTX 6000

| data_type | io_type | cardinality | run_length | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------------|---------------|---------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
| INTEGRAL | DEVICE_BUFFER | 0 | 1 | 24.299 ms | 2.84% | 20.192 ms | 0.38% | -4106.917 us | -16.90% | FAIL |
| INTEGRAL | DEVICE_BUFFER | 1000 | 1 | 23.668 ms | 3.20% | 20.042 ms | 3.93% | -3625.853 us | -15.32% | FAIL |
| INTEGRAL | DEVICE_BUFFER | 0 | 32 | 21.226 ms | 0.21% | 17.781 ms | 0.15% | -3444.841 us | -16.23% | FAIL |
| INTEGRAL | DEVICE_BUFFER | 1000 | 32 | 20.532 ms | 0.30% | 17.221 ms | 0.46% | -3311.125 us | -16.13% | FAIL |
| FLOAT | DEVICE_BUFFER | 0 | 1 | 12.201 ms | 0.18% | 9.438 ms | 0.49% | -2763.086 us | -22.65% | FAIL |
| FLOAT | DEVICE_BUFFER | 1000 | 1 | 14.530 ms | 0.50% | 12.578 ms | 0.50% | -1952.573 us | -13.44% | FAIL |
| FLOAT | DEVICE_BUFFER | 0 | 32 | 12.195 ms | 0.45% | 10.204 ms | 0.45% | -1990.871 us | -16.33% | FAIL |
| FLOAT | DEVICE_BUFFER | 1000 | 32 | 12.765 ms | 0.31% | 10.703 ms | 0.19% | -2061.599 us | -16.15% | FAIL |
| DECIMAL | DEVICE_BUFFER | 0 | 1 | 27.572 ms | 2.47% | 23.911 ms | 0.32% | -3661.435 us | -13.28% | FAIL |
| DECIMAL | DEVICE_BUFFER | 1000 | 1 | 12.410 ms | 0.55% | 11.530 ms | 3.89% | -880.084 us | -7.09% | FAIL |
| DECIMAL | DEVICE_BUFFER | 0 | 32 | 13.479 ms | 0.26% | 12.119 ms | 0.37% | -1360.157 us | -10.09% | FAIL |
| DECIMAL | DEVICE_BUFFER | 1000 | 32 | 9.583 ms | 0.43% | 8.182 ms | 0.23% | -1400.545 us | -14.62% | FAIL |
| TIMESTAMP | DEVICE_BUFFER | 0 | 1 | 34.390 ms | 0.50% | 32.211 ms | 0.41% | -2178.951 us | -6.34% | FAIL |
| TIMESTAMP | DEVICE_BUFFER | 1000 | 1 | 12.432 ms | 0.43% | 10.741 ms | 0.42% | -1691.559 us | -13.61% | FAIL |
| TIMESTAMP | DEVICE_BUFFER | 0 | 32 | 14.759 ms | 0.40% | 12.941 ms | 0.19% | -1817.825 us | -12.32% | FAIL |
| TIMESTAMP | DEVICE_BUFFER | 1000 | 32 | 10.613 ms | 0.32% | 8.791 ms | 0.19% | -1822.373 us | -17.17% | FAIL |
| DURATION | DEVICE_BUFFER | 0 | 1 | 14.849 ms | 0.29% | 12.812 ms | 0.21% | -2037.408 us | -13.72% | FAIL |
| DURATION | DEVICE_BUFFER | 1000 | 1 | 11.806 ms | 0.32% | 10.110 ms | 0.43% | -1695.815 us | -14.36% | FAIL |
| DURATION | DEVICE_BUFFER | 0 | 32 | 11.620 ms | 0.24% | 9.751 ms | 0.15% | -1869.041 us | -16.08% | FAIL |
| DURATION | DEVICE_BUFFER | 1000 | 32 | 10.307 ms | 0.27% | 8.398 ms | 0.19% | -1909.239 us | -18.52% | FAIL |
| STRING | DEVICE_BUFFER | 0 | 1 | 55.028 ms | 1.00% | 54.751 ms | 0.68% | -277.519 us | -0.50% | PASS |
| STRING | DEVICE_BUFFER | 1000 | 1 | 19.503 ms | 0.46% | 19.399 ms | 0.30% | -104.924 us | -0.54% | FAIL |
| STRING | DEVICE_BUFFER | 0 | 32 | 55.287 ms | 0.78% | 54.857 ms | 0.38% | -430.236 us | -0.78% | FAIL |
| STRING | DEVICE_BUFFER | 1000 | 32 | 15.392 ms | 0.62% | 15.527 ms | 1.62% | 135.949 us | 0.88% | FAIL |
| LIST | DEVICE_BUFFER | 0 | 1 | 85.392 ms | 0.64% | 85.956 ms | 0.36% | 564.047 us | 0.66% | FAIL |
| LIST | DEVICE_BUFFER | 1000 | 1 | 82.151 ms | 0.77% | 82.977 ms | 0.76% | 825.975 us | 1.01% | FAIL |
| LIST | DEVICE_BUFFER | 0 | 32 | 71.257 ms | 0.77% | 72.425 ms | 0.67% | 1.168 ms | 1.64% | FAIL |
| LIST | DEVICE_BUFFER | 1000 | 32 | 72.176 ms | 0.19% | 73.750 ms | 1.47% | 1.574 ms | 2.18% | FAIL |
| STRUCT | DEVICE_BUFFER | 0 | 1 | 66.675 ms | 1.41% | 66.663 ms | 1.26% | -11.513 us | -0.02% | PASS |
| STRUCT | DEVICE_BUFFER | 1000 | 1 | 39.667 ms | 0.41% | 39.758 ms | 0.50% | 91.341 us | 0.23% | PASS |
| STRUCT | DEVICE_BUFFER | 0 | 32 | 66.765 ms | 1.59% | 66.873 ms | 1.40% | 107.569 us | 0.16% | PASS |
| STRUCT | DEVICE_BUFFER | 1000 | 32 | 34.681 ms | 0.16% | 34.748 ms | 0.28% | 67.095 us | 0.19% | FAIL |

# parquet_read_io_compression

## [0] Quadro RTX 6000

| io_type | compression_type | cardinality | run_length | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------------|--------------------|---------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
| FILEPATH | SNAPPY | 0 | 1 | 1.438 s | 0.42% | 1.444 s | 0.32% | 5.797 ms | 0.40% | FAIL |
| HOST_BUFFER | SNAPPY | 0 | 1 | 1.376 s | 0.13% | 1.386 s | 0.14% | 10.316 ms | 0.75% | FAIL |
| DEVICE_BUFFER | SNAPPY | 0 | 1 | 1.341 s | 0.57% | 1.349 s | 0.11% | 8.160 ms | 0.61% | FAIL |
| FILEPATH | NONE | 0 | 1 | 1.220 s | 0.43% | 1.216 s | 0.48% | -4048.560 us | -0.33% | PASS |
| HOST_BUFFER | NONE | 0 | 1 | 1.154 s | 0.13% | 1.156 s | 0.06% | 2.176 ms | 0.19% | FAIL |
| DEVICE_BUFFER | NONE | 0 | 1 | 1.113 s | 0.02% | 1.115 s | 0.15% | 2.028 ms | 0.18% | FAIL |
| FILEPATH | SNAPPY | 1000 | 1 | 1.317 s | 0.75% | 1.326 s | 0.16% | 9.613 ms | 0.73% | FAIL |
| HOST_BUFFER | SNAPPY | 1000 | 1 | 1.291 s | 0.63% | 1.298 s | 0.68% | 7.390 ms | 0.57% | PASS |
| DEVICE_BUFFER | SNAPPY | 1000 | 1 | 1.274 s | 0.67% | 1.288 s | 0.64% | 13.666 ms | 1.07% | FAIL |
| FILEPATH | NONE | 1000 | 1 | 1.200 s | 0.64% | 1.218 s | 0.42% | 18.036 ms | 1.50% | FAIL |
| HOST_BUFFER | NONE | 1000 | 1 | 1.181 s | 0.62% | 1.191 s | 0.69% | 10.329 ms | 0.87% | FAIL |
| DEVICE_BUFFER | NONE | 1000 | 1 | 1.166 s | 0.79% | 1.178 s | 0.67% | 11.599 ms | 0.99% | FAIL |
| FILEPATH | SNAPPY | 0 | 32 | 1.094 s | 0.64% | 1.106 s | 0.71% | 11.885 ms | 1.09% | FAIL |
| HOST_BUFFER | SNAPPY | 0 | 32 | 1.086 s | 0.72% | 1.101 s | 0.74% | 14.529 ms | 1.34% | FAIL |
| DEVICE_BUFFER | SNAPPY | 0 | 32 | 1.082 s | 0.63% | 1.098 s | 0.67% | 16.798 ms | 1.55% | FAIL |
| FILEPATH | NONE | 0 | 32 | 1.050 s | 0.27% | 1.059 s | 0.73% | 9.672 ms | 0.92% | FAIL |
| HOST_BUFFER | NONE | 0 | 32 | 1.035 s | 0.67% | 1.048 s | 0.73% | 13.583 ms | 1.31% | FAIL |
| DEVICE_BUFFER | NONE | 0 | 32 | 1.034 s | 0.80% | 1.045 s | 0.64% | 11.315 ms | 1.09% | FAIL |
| FILEPATH | SNAPPY | 1000 | 32 | 1.117 s | 0.53% | 1.130 s | 0.49% | 12.376 ms | 1.11% | FAIL |
| HOST_BUFFER | SNAPPY | 1000 | 32 | 1.109 s | 0.49% | 1.123 s | 0.50% | 14.328 ms | 1.29% | FAIL |
| DEVICE_BUFFER | SNAPPY | 1000 | 32 | 1.106 s | 0.59% | 1.126 s | 0.17% | 20.326 ms | 1.84% | FAIL |
| FILEPATH | NONE | 1000 | 32 | 1.029 s | 0.53% | 1.045 s | 0.61% | 15.633 ms | 1.52% | FAIL |
| HOST_BUFFER | NONE | 1000 | 32 | 1.025 s | 0.49% | 1.040 s | 0.06% | 14.977 ms | 1.46% | FAIL |
| DEVICE_BUFFER | NONE | 1000 | 32 | 1.031 s | 0.26% | 1.038 s | 0.17% | 7.165 ms | 0.70% | FAIL |

# parquet_read_chunks

## [0] Quadro RTX 6000

| T | io_type | cardinality | run_length | byte_limit | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-----------|---------------|---------------|--------------|--------------|------------|-------------|------------|-------------|---------------|---------|----------|
| INTEGRAL | DEVICE_BUFFER | 0 | 1 | 0 | 24.646 ms | 0.07% | 20.620 ms | 0.25% | -4025.493 us | -16.33% | FAIL |
| INTEGRAL | DEVICE_BUFFER | 1000 | 1 | 0 | 24.005 ms | 0.35% | 20.284 ms | 0.24% | -3721.359 us | -15.50% | FAIL |
| INTEGRAL | DEVICE_BUFFER | 0 | 32 | 0 | 21.492 ms | 0.50% | 18.074 ms | 0.50% | -3418.141 us | -15.90% | FAIL |
| INTEGRAL | DEVICE_BUFFER | 1000 | 32 | 0 | 20.759 ms | 0.16% | 17.426 ms | 0.43% | -3332.946 us | -16.06% | FAIL |
| INTEGRAL | DEVICE_BUFFER | 0 | 1 | 500000 | 206.076 ms | 0.13% | 210.958 ms | 0.20% | 4.882 ms | 2.37% | FAIL |
| INTEGRAL | DEVICE_BUFFER | 1000 | 1 | 500000 | 211.737 ms | 0.06% | 206.936 ms | 0.05% | -4801.517 us | -2.27% | FAIL |
| INTEGRAL | DEVICE_BUFFER | 0 | 32 | 500000 | 191.610 ms | 0.10% | 190.775 ms | 0.19% | -835.413 us | -0.44% | FAIL |
| INTEGRAL | DEVICE_BUFFER | 1000 | 32 | 500000 | 191.122 ms | 0.21% | 189.118 ms | 0.18% | -2004.013 us | -1.05% | FAIL |
| FLOAT | DEVICE_BUFFER | 0 | 1 | 0 | 12.332 ms | 0.50% | 9.576 ms | 0.16% | -2755.392 us | -22.34% | FAIL |
| FLOAT | DEVICE_BUFFER | 1000 | 1 | 0 | 14.700 ms | 0.50% | 12.748 ms | 0.50% | -1951.554 us | -13.28% | FAIL |
| FLOAT | DEVICE_BUFFER | 0 | 32 | 0 | 12.429 ms | 5.39% | 10.384 ms | 0.16% | -2044.469 us | -16.45% | FAIL |
| FLOAT | DEVICE_BUFFER | 1000 | 32 | 0 | 13.010 ms | 4.35% | 10.871 ms | 0.19% | -2138.692 us | -16.44% | FAIL |
| FLOAT | DEVICE_BUFFER | 0 | 1 | 500000 | 97.573 ms | 0.23% | 66.537 ms | 0.16% | -31035.966 us | -31.81% | FAIL |
| FLOAT | DEVICE_BUFFER | 1000 | 1 | 500000 | 107.469 ms | 0.27% | 84.752 ms | 0.28% | -22716.950 us | -21.14% | FAIL |
| FLOAT | DEVICE_BUFFER | 0 | 32 | 500000 | 95.086 ms | 0.26% | 74.493 ms | 0.15% | -20592.478 us | -21.66% | FAIL |
| FLOAT | DEVICE_BUFFER | 1000 | 32 | 500000 | 95.634 ms | 0.18% | 73.872 ms | 0.21% | -21761.426 us | -22.75% | FAIL |
| DECIMAL | DEVICE_BUFFER | 0 | 1 | 0 | 28.070 ms | 0.57% | 24.134 ms | 0.42% | -3936.545 us | -14.02% | FAIL |
| DECIMAL | DEVICE_BUFFER | 1000 | 1 | 0 | 12.604 ms | 1.97% | 11.663 ms | 2.94% | -940.845 us | -7.46% | FAIL |
| DECIMAL | DEVICE_BUFFER | 0 | 32 | 0 | 13.551 ms | 0.28% | 12.220 ms | 0.18% | -1330.788 us | -9.82% | FAIL |
| DECIMAL | DEVICE_BUFFER | 1000 | 32 | 0 | 9.656 ms | 0.28% | 8.279 ms | 0.22% | -1377.165 us | -14.26% | FAIL |
| DECIMAL | DEVICE_BUFFER | 0 | 1 | 500000 | 97.520 ms | 0.50% | 59.952 ms | 0.07% | -37568.352 us | -38.52% | FAIL |
| DECIMAL | DEVICE_BUFFER | 1000 | 1 | 500000 | 69.184 ms | 0.34% | 60.300 ms | 0.40% | -8883.777 us | -12.84% | FAIL |
| DECIMAL | DEVICE_BUFFER | 0 | 32 | 500000 | 63.886 ms | 0.15% | 53.062 ms | 0.21% | -10824.109 us | -16.94% | FAIL |
| DECIMAL | DEVICE_BUFFER | 1000 | 32 | 500000 | 60.067 ms | 0.23% | 48.565 ms | 0.23% | -11501.788 us | -19.15% | FAIL |
| TIMESTAMP | DEVICE_BUFFER | 0 | 1 | 0 | 34.702 ms | 0.50% | 32.566 ms | 0.50% | -2136.855 us | -6.16% | FAIL |
| TIMESTAMP | DEVICE_BUFFER | 1000 | 1 | 0 | 12.558 ms | 0.41% | 10.874 ms | 0.41% | -1684.535 us | -13.41% | FAIL |
| TIMESTAMP | DEVICE_BUFFER | 0 | 32 | 0 | 14.803 ms | 0.18% | 13.131 ms | 0.24% | -1671.818 us | -11.29% | FAIL |
| TIMESTAMP | DEVICE_BUFFER | 1000 | 32 | 0 | 10.720 ms | 0.31% | 8.928 ms | 0.16% | -1791.785 us | -16.71% | FAIL |
| TIMESTAMP | DEVICE_BUFFER | 0 | 1 | 500000 | 105.994 ms | 0.31% | 93.799 ms | 0.13% | -12194.481 us | -11.50% | FAIL |
| TIMESTAMP | DEVICE_BUFFER | 1000 | 1 | 500000 | 87.865 ms | 0.20% | 69.684 ms | 0.23% | -18181.288 us | -20.69% | FAIL |
| TIMESTAMP | DEVICE_BUFFER | 0 | 32 | 500000 | 81.901 ms | 0.09% | 65.145 ms | 0.18% | -16755.397 us | -20.46% | FAIL |
| TIMESTAMP | DEVICE_BUFFER | 1000 | 32 | 500000 | 77.813 ms | 0.15% | 60.106 ms | 0.27% | -17707.050 us | -22.76% | FAIL |
| DURATION | DEVICE_BUFFER | 0 | 1 | 0 | 14.926 ms | 0.25% | 12.981 ms | 0.47% | -1945.295 us | -13.03% | FAIL |
| DURATION | DEVICE_BUFFER | 1000 | 1 | 0 | 11.977 ms | 0.41% | 10.241 ms | 0.45% | -1736.000 us | -14.49% | FAIL |
| DURATION | DEVICE_BUFFER | 0 | 32 | 0 | 11.707 ms | 0.25% | 9.898 ms | 0.14% | -1809.748 us | -15.46% | FAIL |
| DURATION | DEVICE_BUFFER | 1000 | 32 | 0 | 10.402 ms | 0.49% | 8.535 ms | 0.11% | -1867.021 us | -17.95% | FAIL |
| DURATION | DEVICE_BUFFER | 0 | 1 | 500000 | 84.585 ms | 0.14% | 68.369 ms | 0.24% | -16216.952 us | -19.17% | FAIL |
| DURATION | DEVICE_BUFFER | 1000 | 1 | 500000 | 85.377 ms | 0.20% | 66.735 ms | 0.21% | -18642.646 us | -21.84% | FAIL |
| DURATION | DEVICE_BUFFER | 0 | 32 | 500000 | 76.875 ms | 0.15% | 59.947 ms | 0.19% | -16928.464 us | -22.02% | FAIL |
| DURATION | DEVICE_BUFFER | 1000 | 32 | 500000 | 75.601 ms | 0.18% | 57.722 ms | 0.23% | -17879.257 us | -23.65% | FAIL |
| STRING | DEVICE_BUFFER | 0 | 1 | 0 | 55.084 ms | 0.50% | 55.018 ms | 0.78% | -66.611 us | -0.12% | PASS |
| STRING | DEVICE_BUFFER | 1000 | 1 | 0 | 19.502 ms | 0.27% | 19.617 ms | 0.49% | 115.500 us | 0.59% | FAIL |
| STRING | DEVICE_BUFFER | 0 | 32 | 0 | 55.233 ms | 0.72% | 55.394 ms | 0.77% | 161.028 us | 0.29% | PASS |
| STRING | DEVICE_BUFFER | 1000 | 32 | 0 | 15.445 ms | 0.53% | 15.674 ms | 0.50% | 229.270 us | 1.48% | FAIL |
| STRING | DEVICE_BUFFER | 0 | 1 | 500000 | 197.625 ms | 0.33% | 198.769 ms | 0.48% | 1.144 ms | 0.58% | FAIL |
| STRING | DEVICE_BUFFER | 1000 | 1 | 500000 | 90.339 ms | 0.48% | 90.797 ms | 0.58% | 457.714 us | 0.51% | FAIL |
| STRING | DEVICE_BUFFER | 0 | 32 | 500000 | 198.090 ms | 0.35% | 198.238 ms | 0.31% | 147.910 us | 0.07% | PASS |
| STRING | DEVICE_BUFFER | 1000 | 32 | 500000 | 67.181 ms | 0.17% | 67.796 ms | 0.34% | 615.115 us | 0.92% | FAIL |
| LIST | DEVICE_BUFFER | 0 | 1 | 0 | 90.825 ms | 0.75% | 91.843 ms | 0.58% | 1.019 ms | 1.12% | FAIL |
| LIST | DEVICE_BUFFER | 1000 | 1 | 0 | 87.731 ms | 0.63% | 88.633 ms | 0.69% | 901.582 us | 1.03% | FAIL |
| LIST | DEVICE_BUFFER | 0 | 32 | 0 | 76.089 ms | 0.47% | 77.107 ms | 0.39% | 1.018 ms | 1.34% | FAIL |
| LIST | DEVICE_BUFFER | 1000 | 32 | 0 | 77.148 ms | 0.69% | 78.719 ms | 0.63% | 1.571 ms | 2.04% | FAIL |
| LIST | DEVICE_BUFFER | 0 | 1 | 500000 | 1.470 s | 0.15% | 1.469 s | 0.09% | -336.597 us | -0.02% | PASS |
| LIST | DEVICE_BUFFER | 1000 | 1 | 500000 | 1.092 s | 0.15% | 1.092 s | 0.37% | 269.702 us | 0.02% | PASS |
| LIST | DEVICE_BUFFER | 0 | 32 | 500000 | 956.759 ms | 0.31% | 956.879 ms | 0.24% | 119.287 us | 0.01% | PASS |
| LIST | DEVICE_BUFFER | 1000 | 32 | 500000 | 959.021 ms | 0.43% | 957.862 ms | 0.24% | -1159.729 us | -0.12% | PASS |
| STRUCT | DEVICE_BUFFER | 0 | 1 | 0 | 66.450 ms | 2.05% | 66.381 ms | 1.19% | -69.128 us | -0.10% | PASS |
| STRUCT | DEVICE_BUFFER | 1000 | 1 | 0 | 39.866 ms | 0.50% | 39.669 ms | 0.36% | -197.133 us | -0.49% | FAIL |
| STRUCT | DEVICE_BUFFER | 0 | 32 | 0 | 66.996 ms | 1.50% | 66.634 ms | 1.54% | -362.507 us | -0.54% | PASS |
| STRUCT | DEVICE_BUFFER | 1000 | 32 | 0 | 34.995 ms | 0.18% | 34.746 ms | 0.24% | -249.649 us | -0.71% | FAIL |
| STRUCT | DEVICE_BUFFER | 0 | 1 | 500000 | 387.274 ms | 0.24% | 381.353 ms | 0.28% | -5920.825 us | -1.53% | FAIL |
| STRUCT | DEVICE_BUFFER | 1000 | 1 | 500000 | 313.150 ms | 0.23% | 308.024 ms | 0.07% | -5125.220 us | -1.64% | FAIL |
| STRUCT | DEVICE_BUFFER | 0 | 32 | 500000 | 388.062 ms | 0.49% | 381.960 ms | 0.12% | -6102.039 us | -1.57% | FAIL |
| STRUCT | DEVICE_BUFFER | 1000 | 32 | 500000 | 306.761 ms | 0.07% | 302.670 ms | 0.15% | -4090.417 us | -1.33% | FAIL |

# parquet_read_io_small_mixed

## [0] Quadro RTX 6000

| io_type | cardinality | run_length | num_string_cols | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-----------|---------------|--------------|-------------------|------------|-------------|------------|-------------|-------------|---------|----------|
| FILEPATH | 0 | 1 | 1 | 6.298 ms | 3.83% | 6.324 ms | 2.82% | 26.068 us | 0.41% | PASS |
| FILEPATH | 1000 | 1 | 1 | 4.380 ms | 9.20% | 4.389 ms | 7.10% | 8.962 us | 0.20% | PASS |
| FILEPATH | 0 | 32 | 1 | 5.849 ms | 3.40% | 5.846 ms | 3.32% | -2.973 us | -0.05% | PASS |
| FILEPATH | 1000 | 32 | 1 | 3.005 ms | 0.50% | 3.104 ms | 2.46% | 98.349 us | 3.27% | FAIL |
| FILEPATH | 0 | 1 | 2 | 6.827 ms | 1.81% | 6.828 ms | 1.85% | 0.700 us | 0.01% | PASS |
| FILEPATH | 1000 | 1 | 2 | 4.362 ms | 3.89% | 4.363 ms | 4.08% | 1.252 us | 0.03% | PASS |
| FILEPATH | 0 | 32 | 2 | 6.498 ms | 1.72% | 6.462 ms | 1.94% | -36.489 us | -0.56% | PASS |
| FILEPATH | 1000 | 32 | 2 | 3.020 ms | 0.61% | 3.098 ms | 2.52% | 78.572 us | 2.60% | FAIL |
| FILEPATH | 0 | 1 | 3 | 7.271 ms | 2.64% | 7.412 ms | 2.19% | 141.040 us | 1.94% | PASS |
| FILEPATH | 1000 | 1 | 3 | 4.368 ms | 6.54% | 4.360 ms | 2.48% | -7.720 us | -0.18% | PASS |
| FILEPATH | 0 | 32 | 3 | 7.236 ms | 3.17% | 7.096 ms | 2.32% | -139.268 us | -1.92% | PASS |
| FILEPATH | 1000 | 32 | 3 | 3.101 ms | 2.41% | 3.118 ms | 2.45% | 17.422 us | 0.56% | PASS |

# parquet_read_row_selection

## [0] Quadro RTX 6000

| column_selection | row_selection | str_to_categories | uses_pandas_metadata | timestamp_type | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|--------------------|-----------------|---------------------|------------------------|------------------|------------|-------------|------------|-------------|-----------|---------|----------|
| ALL | ALL | YES | YES | EMPTY | 1.463 s | 0.23% | 1.472 s | 0.21% | 8.307 ms | 0.57% | FAIL |
| ALL | NROWS | YES | YES | EMPTY | 1.669 s | 0.04% | 1.680 s | 0.02% | 11.129 ms | 0.67% | FAIL |
| ALL | ROW_GROUPS | YES | YES | EMPTY | 1.548 s | 0.01% | 1.555 s | 0.00% | 6.236 ms | 0.40% | FAIL |

# parquet_read_column_selection

## [0] Quadro RTX 6000

| column_selection | row_selection | str_to_categories | uses_pandas_metadata | timestamp_type | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|--------------------|-----------------|---------------------|------------------------|------------------|------------|-------------|------------|-------------|-----------|---------|----------|
| ALL | ALL | YES | YES | EMPTY | 1.458 s | 0.49% | 1.472 s | 0.10% | 13.627 ms | 0.93% | FAIL |
| ALTERNATE | ALL | YES | YES | EMPTY | 1.451 s | 0.01% | 1.463 s | 0.52% | 12.089 ms | 0.83% | FAIL |
| FIRST_HALF | ALL | YES | YES | EMPTY | 1.438 s | 0.02% | 1.446 s | 0.60% | 7.382 ms | 0.51% | FAIL |
| SECOND_HALF | ALL | YES | YES | EMPTY | 1.447 s | 0.02% | 1.456 s | 0.53% | 8.770 ms | 0.61% | FAIL |

# parquet_read_misc_options

## [0] Quadro RTX 6000

| column_selection | row_selection | str_to_categories | uses_pandas_metadata | timestamp_type | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|--------------------|-----------------|---------------------|------------------------|------------------|------------|-------------|------------|-------------|-----------|---------|----------|
| ALL | ALL | YES | YES | EMPTY | 1.456 s | 0.50% | 1.471 s | 0.49% | 15.157 ms | 1.04% | FAIL |
| ALL | ALL | YES | NO | EMPTY | 1.464 s | 0.14% | 1.473 s | 0.49% | 9.022 ms | 0.62% | FAIL |
| ALL | ALL | NO | YES | EMPTY | 1.457 s | 0.41% | 1.476 s | 0.23% | 19.434 ms | 1.33% | FAIL |
| ALL | ALL | NO | NO | EMPTY | 1.460 s | 0.47% | 1.476 s | 0.09% | 15.917 ms | 1.09% | FAIL |

# Summary

- Total Matches: 143
  - Pass (diff <= min_noise): 25
  - Unknown (infinite noise): 0
  - Failure (diff > min_noise): 118
```

Authors:
- Alessandro Bellina (https://github.com/abellina)
- https://github.com/nvdbaranec

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- https://github.com/nvdbaranec
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #15159

nvdbaranec added 2 commits June 23, 2023 14:00

First pass of refactoring parquet kernels.

5b8fce9

Full multi-kernel integration. Tuning.

5a259a4

nvdbaranec requested review from a team as code owners June 26, 2023 18:31

nvdbaranec requested review from karthikeyann and vuule June 26, 2023 18:31

github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Jun 26, 2023

nvdbaranec marked this pull request as draft June 26, 2023 18:31

ttnghia self-requested a review June 26, 2023 19:44

etseidl reviewed Jun 26, 2023

etseidl added a commit to etseidl/cudf that referenced this pull request Jun 27, 2023

start merging in changes from rapidsai#13622

9326321

etseidl added a commit to etseidl/cudf that referenced this pull request Jun 28, 2023

pull in changes from rapidsai#13622

0e181a8

etseidl mentioned this pull request Jun 29, 2023

[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader #13637

Merged

3 tasks

nvdbaranec mentioned this pull request Jun 30, 2023

[FEA] Optimization of repetition and definition level decoding in the parquet reader kernel. #12633

Closed

GregoryKimball mentioned this pull request Jul 3, 2023

[FEA] Support V2 encodings in Parquet reader and writer #13501

Closed

etseidl added a commit to etseidl/cudf that referenced this pull request Aug 3, 2023

start merging in changes from rapidsai#13622

e409d4f

etseidl added a commit to etseidl/cudf that referenced this pull request Aug 3, 2023

pull in changes from rapidsai#13622

41fc850

etseidl added a commit to etseidl/cudf that referenced this pull request Aug 3, 2023

start merging in changes from rapidsai#13622

8f5aa01

etseidl added a commit to etseidl/cudf that referenced this pull request Aug 3, 2023

pull in changes from rapidsai#13622

a12b8a4

GregoryKimball added feature request New feature or request 0 - Backlog In queue waiting for assignment cuIO cuIO issue labels Aug 7, 2023

GregoryKimball mentioned this pull request Sep 10, 2023

[FEA] Improve ORC reader filtering and performance #13882

Open

abellina mentioned this pull request Feb 1, 2024

[FEA] parquet: rle_stream for dictionary pages #14950

Open

abellina mentioned this pull request Feb 27, 2024

Add microkernels for fixed-width and fixed-width dictionary in Parquet decode #15159

Merged

3 tasks

nvdbaranec closed this Mar 4, 2024
