Parquet multi kernel optimization example. #13622

@nvdbaranec nvdbaranec commented Jun 26, 2023

This PR is intended to be a concrete implementation of the multi-kernel parquet decode concept. It is not necessarily meant to be checked in in its current form. In short: having large monolithic kernels that can decode all permutations of page data (fixed-width, variable-width, nested, dictionaries, etc.) results in poor performance because of increased worst-case shared memory usage, higher per-thread register counts, and so on. I spent some time working on an optimization that made the decoder work in a more fundamentally block-wide way, but this ended up not being much of a win for the reasons stated above. My initial pass was quite fast, and then as I slowly added new features it ground back down.

So the fundamental idea here is to break up the various types of data in the pages of a parquet file and use stripped down kernels specific to each type. Each kernel is launched on a separate, non-default stream to ensure overlapping. There are several benefits to doing it this way:

  • Each individual kernel can be launched with its own set of grid parameters. For example, fixed-width decoders might perform better with larger blocks than variable-width decoders (or vice-versa).
  • Shared memory usage can be dramatically dropped and limited to exactly what is needed for a given kernel.
  • The slimmer kernels can flow around the bigger ones as hardware becomes available. With the monolithic kernel, the single grid size guarantees you will always be running as the biggest, heaviest kernel we have.

This PR does several things:

  • It breaks up page_data.cu into several component pieces. decode.cuh represents common code shared between the various kernels. decode_general.cu contains the existing, monolithic kernel we already use. decode_fixed.cu contains an optimized version of the decode kernel that targets fixed-width, non-nested, non-dictionaried data.
  • The new kernel uses the rle_stream to implement a block-wide decoding scheme, instead of the overlapped-multiple-warp technique of the existing one.
  • At header decode time, each page is assigned a mask value (via get_kernel_mask) that represents the specific kernel it should be decoded with. At data decode time, we OR together the required kernel bits for all the pages and launch only the kernels we need (see DecodePageData in page_data.cu).
  • Many functions and structures have been templatized to account for the variable grid sizes and shared memory requirements.
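
The mask scheme described above can be sketched on the host side roughly as follows. This is a simplified illustration, not the code in this PR: `page_desc`, `get_kernel_mask_sketch`, `combined_kernel_mask`, `KERNEL_MASK_GENERAL`, and the bit values are assumptions made for the example. Only the names get_kernel_mask and `KERNEL_MASK_FIXED_WIDTH_NO_DICT` come from the PR itself.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical bit assignments; the real enum values in the PR may differ.
enum kernel_mask_bits : uint32_t {
  KERNEL_MASK_GENERAL             = 1u << 0,  // assumed name for the monolithic kernel
  KERNEL_MASK_FIXED_WIDTH_NO_DICT = 1u << 1,
};

// Simplified stand-in for the page properties known at header decode time.
struct page_desc {
  bool fixed_width;
  bool nested;
  bool dictionary;
};

// Pick the single kernel a page should be decoded with.
inline uint32_t get_kernel_mask_sketch(page_desc const& p)
{
  if (p.fixed_width && !p.nested && !p.dictionary) { return KERNEL_MASK_FIXED_WIDTH_NO_DICT; }
  return KERNEL_MASK_GENERAL;
}

// OR together the bits for all pages; the result tells the caller which
// kernels actually need to be launched.
inline uint32_t combined_kernel_mask(std::vector<page_desc> const& pages)
{
  uint32_t mask = 0;
  for (auto const& p : pages) { mask |= get_kernel_mask_sketch(p); }
  return mask;
}
```

A caller would then test each bit of the combined mask and launch the matching kernel, with each kernel skipping pages whose mask does not match its own.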

General performance is very good for this data subset: roughly 50% speedup across the board in the decode step. This does not lead to that level of speedup for the cudf benchmarks for 2 main reasons:

  • They tend to spend about half their time in nvcomp decompression.
  • Many of the benchmarks use a wide variety of data types, so this new kernel is only invoked on a subset of the data.

Nonetheless in situations where we have little or no compression, the wins are clear:

Before

| data_type |      io       | bytes_per_second | encoded_file_size |
|     FLOAT | DEVICE_BUFFER |  35470873266     |       510.303 MiB |

After

| data_type |      io       | bytes_per_second | encoded_file_size |
|     FLOAT | DEVICE_BUFFER | 54260511096      |       510.303 MiB |

For cases where we have to go through nvcomp and we're dealing with a mix of new and old kernels, the wins are milder, even though the decode step itself is still in the 50% speedup ballpark. Below is an example with integers.

Before. Total load time was 37 ms. Only 18 ms of that was the parquet decoder.

After. Total load time was 29 ms. But only 10.5 ms of that was the parquet decoder, with a mix of the new and old kernels.

So a total decode time reduction from 18 ms to 10.5 ms.

You may notice that in the "after" case, there does not appear to be any kernel overlap. There are two things going on here. First, the fixed-width kernel runs at 100% occupancy, so there's no unused hardware. Second, you can see that when it is near completion, the general purpose kernel starts to flow in (and also runs at considerably lower occupancy).

Related: I have a branch up (not an actual PR) with the full heavyweight implementation of all features in the block-wide style. Hypothetically, all we need to do is cherry pick out the individual chunks of logic from there to build more kernels. This branch also includes a change to rle_stream that distributes the level-decoding work for arbitrarily sized runs evenly across all warps instead of the greedy one-warp-per-run approach.

https://github.com/nvdbaranec/cudf/tree/parquet_decode_optimization
https://github.com/nvdbaranec/cudf/blob/parquet_decode_optimization/cpp/src/io/parquet/page_data.cu

@nvdbaranec nvdbaranec requested review from a team as code owners June 26, 2023 18:31
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Jun 26, 2023
@nvdbaranec nvdbaranec marked this pull request as draft June 26, 2023 18:31
@nvdbaranec

Open questions:

  • We would need to figure out what the right subset of kernels actually is. Based on our feature set, it's easy to imagine 16+ kernels with slightly different features. But that might be overkill. I tend to think the primary way of breaking them down should be based on shared memory usage differences.

  • There's more optimization that could be pushed through here. For example, the individual "copy this piece of data" functions are shared between the new and old kernels, but they all do a dictionary if-check. This could be if constexpr'd away if we wanted to go crazier.

  • This file layout is just a suggestion. page_data.cu was getting pretty crazy so I wanted to try out some ideas on splitting things up.
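
As a rough illustration of the `if constexpr` idea in the second bullet: a shared copy function can take the dictionary case as a template parameter so the check is resolved at compile time. This is only a sketch. `copy_value` and its signature are hypothetical and do not match the actual copy functions in the PR.

```cpp
#include <cstdint>

// Hypothetical per-value copy helper. `has_dict` is a compile-time flag, so
// the non-dictionary instantiation contains no dictionary branch at all.
template <bool has_dict>
inline int32_t copy_value(int32_t const* data, int32_t const* dict, int idx)
{
  if constexpr (has_dict) {
    // Dictionary-encoded: the stored value is an index into the dictionary.
    return dict[data[idx]];
  } else {
    // Plain-encoded: the stored value is the value itself; `dict` is unused.
    return data[idx];
  }
}
```

A dictionary kernel would call `copy_value<true>` and a non-dictionary kernel `copy_value<false>`, each getting its own stripped-down instantiation.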

@nvdbaranec

Also, there will be a decent amount of work to get this merged with the big string decoding change from @etseidl.

@ttnghia ttnghia self-requested a review June 26, 2023 19:44
<<<dim_grid, dim_block, 0, stream.value()>>>(pages.device_ptr(), chunks, min_row, num_rows);
// invoke all relevant kernels. each one will only process the pages whose masks match
// their own, and early-out on the rest.
if (kernel_mask & KERNEL_MASK_FIXED_WIDTH_NO_DICT) {

I've been thinking about this one. Would it be better to break up pages into separate batches grouped by kernel_mask? That way, each batch could make its own way through pre-process and process, with each batch getting its own stream. Maybe fixed-width decode logic could overlap with some of the string preprocessing, for instance.
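
A minimal host-side sketch of that batching idea, assuming each page already carries its kernel_mask as a plain integer. `group_pages_by_mask` is a hypothetical helper, not something in the PR, and stream assignment is only indicated in a comment.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Group page indices by kernel mask. Each resulting batch could then be run
// through pre-process and decode on its own stream (not shown here).
inline std::map<uint32_t, std::vector<std::size_t>> group_pages_by_mask(
  std::vector<uint32_t> const& page_masks)
{
  std::map<uint32_t, std::vector<std::size_t>> batches;
  for (std::size_t i = 0; i < page_masks.size(); ++i) {
    batches[page_masks[i]].push_back(i);
  }
  return batches;
}
```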

etseidl added a commit to etseidl/cudf that referenced this pull request Jun 27, 2023
etseidl added a commit to etseidl/cudf that referenced this pull request Jun 28, 2023
etseidl added a commit to etseidl/cudf that referenced this pull request Aug 3, 2023
etseidl added a commit to etseidl/cudf that referenced this pull request Aug 3, 2023
etseidl added a commit to etseidl/cudf that referenced this pull request Aug 3, 2023
etseidl added a commit to etseidl/cudf that referenced this pull request Aug 3, 2023
@GregoryKimball GregoryKimball added feature request New feature or request 0 - Backlog In queue waiting for assignment cuIO cuIO issue labels Aug 7, 2023
rapids-bot bot pushed a commit that referenced this pull request Aug 23, 2023
)

Part of #13501. This adds support for decoding Parquet pages that are DELTA_BINARY_PACKED.

In addition to adding delta support, this PR incorporates changes introduced in #13622, such as using a mask to determine which decoding kernels to run, and adding parameters to the `page_state_buffers_s` struct to reduce the amount of shared memory used.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - https://github.com/nvdbaranec
  - Bradley Dice (https://github.com/bdice)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #13637
@GregoryKimball

FYI adding specialized decode kernels to parquet reader came up again today in a discussion about the query performance roadmap. I believe the next step was to group the different data types, nullability and encodings by register usage. Then from there we could start adding specialized kernels one at a time.

Please let me know if you agree.

@nvdbaranec

Obsolete with #15159

@nvdbaranec nvdbaranec closed this Mar 4, 2024
rapids-bot bot pushed a commit that referenced this pull request Mar 14, 2024
…t decode (#15159)

This PR adds fixed-width and fixed-width dictionary micro-kernels for parquet decode, based on #13622.

We change `rle_stream` so that it can process dictionary pages, which lets us use it in `gpuDecodePageDataFixedDict`. It now takes a "count" of values to decode, which we provide as the number of valid (non-null) elements found in the definition stream. Most of the intrusive changes are in `rle_stream` so that it can work with this argument. One big change: previously we would "spill" runs that did not fit in the current iteration. We no longer spill, so the `runs` array can hold a large run that won't be decoded until several calls to `decode` later. This opens up the possibility of throwing more decode threads at the accumulated run fairly easily, and that may be worked on in this PR or shortly after (load balancing a large run).
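
To make the "count" argument and the no-spill behavior a bit more concrete, here is a toy single-threaded CPU model. It is not the actual `rle_stream` (which is a block-wide device-side decoder). The `run` struct, the `toy_rle_stream` class, and its use of a `std::deque` are all assumptions made for illustration. The point is only that a run larger than the requested count stays queued and is consumed over several `decode` calls.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// One RLE run: `remaining` copies of `value` are still to be decoded.
struct run {
  int32_t value;
  int32_t remaining;
};

// Toy model: runs stay in the queue until fully consumed, so a large run can
// span several decode() calls without being split off ("spilled") up front.
class toy_rle_stream {
 public:
  explicit toy_rle_stream(std::deque<run> runs) : runs_(std::move(runs)) {}

  // Decode at most `count` values, e.g. the number of valid (non-null)
  // elements found in the definition stream for this batch.
  std::vector<int32_t> decode(int32_t count)
  {
    std::vector<int32_t> out;
    while (count > 0 && !runs_.empty()) {
      run& r          = runs_.front();
      int32_t const n = std::min(count, r.remaining);
      out.insert(out.end(), static_cast<std::size_t>(n), r.value);
      r.remaining -= n;
      count -= n;
      if (r.remaining == 0) { runs_.pop_front(); }
    }
    return out;
  }

 private:
  std::deque<run> runs_;
};
```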

The code here is really mostly @nvdbaranec's and makes use of @etseidl's great work on `rle_stream`. It is marked as a draft because it's not entirely done (not all testing has been performed). That said, NDS, nvbench and `PARQUET_TEST` pass. To use it, set `USE_FIXED_OP=2`, which enables both the `gpuDecodePageDataFixed` and `gpuDecodePageDataFixedDict` kernels.

Here are `USE_FIXED_OP=2` nvbench results against 24.04 on my RTX6000:

```
# parquet_read_decode

## [0] Quadro RTX 6000

|  data_type  |    io_type    |  cardinality  |  run_length  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------------|---------------|---------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|  INTEGRAL   | DEVICE_BUFFER |       0       |      1       |  24.299 ms |       2.84% |  20.192 ms |       0.38% | -4106.917 us | -16.90% |   FAIL   |
|  INTEGRAL   | DEVICE_BUFFER |     1000      |      1       |  23.668 ms |       3.20% |  20.042 ms |       3.93% | -3625.853 us | -15.32% |   FAIL   |
|  INTEGRAL   | DEVICE_BUFFER |       0       |      32      |  21.226 ms |       0.21% |  17.781 ms |       0.15% | -3444.841 us | -16.23% |   FAIL   |
|  INTEGRAL   | DEVICE_BUFFER |     1000      |      32      |  20.532 ms |       0.30% |  17.221 ms |       0.46% | -3311.125 us | -16.13% |   FAIL   |
|    FLOAT    | DEVICE_BUFFER |       0       |      1       |  12.201 ms |       0.18% |   9.438 ms |       0.49% | -2763.086 us | -22.65% |   FAIL   |
|    FLOAT    | DEVICE_BUFFER |     1000      |      1       |  14.530 ms |       0.50% |  12.578 ms |       0.50% | -1952.573 us | -13.44% |   FAIL   |
|    FLOAT    | DEVICE_BUFFER |       0       |      32      |  12.195 ms |       0.45% |  10.204 ms |       0.45% | -1990.871 us | -16.33% |   FAIL   |
|    FLOAT    | DEVICE_BUFFER |     1000      |      32      |  12.765 ms |       0.31% |  10.703 ms |       0.19% | -2061.599 us | -16.15% |   FAIL   |
|   DECIMAL   | DEVICE_BUFFER |       0       |      1       |  27.572 ms |       2.47% |  23.911 ms |       0.32% | -3661.435 us | -13.28% |   FAIL   |
|   DECIMAL   | DEVICE_BUFFER |     1000      |      1       |  12.410 ms |       0.55% |  11.530 ms |       3.89% |  -880.084 us |  -7.09% |   FAIL   |
|   DECIMAL   | DEVICE_BUFFER |       0       |      32      |  13.479 ms |       0.26% |  12.119 ms |       0.37% | -1360.157 us | -10.09% |   FAIL   |
|   DECIMAL   | DEVICE_BUFFER |     1000      |      32      |   9.583 ms |       0.43% |   8.182 ms |       0.23% | -1400.545 us | -14.62% |   FAIL   |
|  TIMESTAMP  | DEVICE_BUFFER |       0       |      1       |  34.390 ms |       0.50% |  32.211 ms |       0.41% | -2178.951 us |  -6.34% |   FAIL   |
|  TIMESTAMP  | DEVICE_BUFFER |     1000      |      1       |  12.432 ms |       0.43% |  10.741 ms |       0.42% | -1691.559 us | -13.61% |   FAIL   |
|  TIMESTAMP  | DEVICE_BUFFER |       0       |      32      |  14.759 ms |       0.40% |  12.941 ms |       0.19% | -1817.825 us | -12.32% |   FAIL   |
|  TIMESTAMP  | DEVICE_BUFFER |     1000      |      32      |  10.613 ms |       0.32% |   8.791 ms |       0.19% | -1822.373 us | -17.17% |   FAIL   |
|  DURATION   | DEVICE_BUFFER |       0       |      1       |  14.849 ms |       0.29% |  12.812 ms |       0.21% | -2037.408 us | -13.72% |   FAIL   |
|  DURATION   | DEVICE_BUFFER |     1000      |      1       |  11.806 ms |       0.32% |  10.110 ms |       0.43% | -1695.815 us | -14.36% |   FAIL   |
|  DURATION   | DEVICE_BUFFER |       0       |      32      |  11.620 ms |       0.24% |   9.751 ms |       0.15% | -1869.041 us | -16.08% |   FAIL   |
|  DURATION   | DEVICE_BUFFER |     1000      |      32      |  10.307 ms |       0.27% |   8.398 ms |       0.19% | -1909.239 us | -18.52% |   FAIL   |
|   STRING    | DEVICE_BUFFER |       0       |      1       |  55.028 ms |       1.00% |  54.751 ms |       0.68% |  -277.519 us |  -0.50% |   PASS   |
|   STRING    | DEVICE_BUFFER |     1000      |      1       |  19.503 ms |       0.46% |  19.399 ms |       0.30% |  -104.924 us |  -0.54% |   FAIL   |
|   STRING    | DEVICE_BUFFER |       0       |      32      |  55.287 ms |       0.78% |  54.857 ms |       0.38% |  -430.236 us |  -0.78% |   FAIL   |
|   STRING    | DEVICE_BUFFER |     1000      |      32      |  15.392 ms |       0.62% |  15.527 ms |       1.62% |   135.949 us |   0.88% |   FAIL   |
|    LIST     | DEVICE_BUFFER |       0       |      1       |  85.392 ms |       0.64% |  85.956 ms |       0.36% |   564.047 us |   0.66% |   FAIL   |
|    LIST     | DEVICE_BUFFER |     1000      |      1       |  82.151 ms |       0.77% |  82.977 ms |       0.76% |   825.975 us |   1.01% |   FAIL   |
|    LIST     | DEVICE_BUFFER |       0       |      32      |  71.257 ms |       0.77% |  72.425 ms |       0.67% |     1.168 ms |   1.64% |   FAIL   |
|    LIST     | DEVICE_BUFFER |     1000      |      32      |  72.176 ms |       0.19% |  73.750 ms |       1.47% |     1.574 ms |   2.18% |   FAIL   |
|   STRUCT    | DEVICE_BUFFER |       0       |      1       |  66.675 ms |       1.41% |  66.663 ms |       1.26% |   -11.513 us |  -0.02% |   PASS   |
|   STRUCT    | DEVICE_BUFFER |     1000      |      1       |  39.667 ms |       0.41% |  39.758 ms |       0.50% |    91.341 us |   0.23% |   PASS   |
|   STRUCT    | DEVICE_BUFFER |       0       |      32      |  66.765 ms |       1.59% |  66.873 ms |       1.40% |   107.569 us |   0.16% |   PASS   |
|   STRUCT    | DEVICE_BUFFER |     1000      |      32      |  34.681 ms |       0.16% |  34.748 ms |       0.28% |    67.095 us |   0.19% |   FAIL   |

# parquet_read_io_compression

## [0] Quadro RTX 6000

|    io_type    |  compression_type  |  cardinality  |  run_length  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|---------------|--------------------|---------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|   FILEPATH    |       SNAPPY       |       0       |      1       |    1.438 s |       0.42% |    1.444 s |       0.32% |     5.797 ms |   0.40% |   FAIL   |
|  HOST_BUFFER  |       SNAPPY       |       0       |      1       |    1.376 s |       0.13% |    1.386 s |       0.14% |    10.316 ms |   0.75% |   FAIL   |
| DEVICE_BUFFER |       SNAPPY       |       0       |      1       |    1.341 s |       0.57% |    1.349 s |       0.11% |     8.160 ms |   0.61% |   FAIL   |
|   FILEPATH    |        NONE        |       0       |      1       |    1.220 s |       0.43% |    1.216 s |       0.48% | -4048.560 us |  -0.33% |   PASS   |
|  HOST_BUFFER  |        NONE        |       0       |      1       |    1.154 s |       0.13% |    1.156 s |       0.06% |     2.176 ms |   0.19% |   FAIL   |
| DEVICE_BUFFER |        NONE        |       0       |      1       |    1.113 s |       0.02% |    1.115 s |       0.15% |     2.028 ms |   0.18% |   FAIL   |
|   FILEPATH    |       SNAPPY       |     1000      |      1       |    1.317 s |       0.75% |    1.326 s |       0.16% |     9.613 ms |   0.73% |   FAIL   |
|  HOST_BUFFER  |       SNAPPY       |     1000      |      1       |    1.291 s |       0.63% |    1.298 s |       0.68% |     7.390 ms |   0.57% |   PASS   |
| DEVICE_BUFFER |       SNAPPY       |     1000      |      1       |    1.274 s |       0.67% |    1.288 s |       0.64% |    13.666 ms |   1.07% |   FAIL   |
|   FILEPATH    |        NONE        |     1000      |      1       |    1.200 s |       0.64% |    1.218 s |       0.42% |    18.036 ms |   1.50% |   FAIL   |
|  HOST_BUFFER  |        NONE        |     1000      |      1       |    1.181 s |       0.62% |    1.191 s |       0.69% |    10.329 ms |   0.87% |   FAIL   |
| DEVICE_BUFFER |        NONE        |     1000      |      1       |    1.166 s |       0.79% |    1.178 s |       0.67% |    11.599 ms |   0.99% |   FAIL   |
|   FILEPATH    |       SNAPPY       |       0       |      32      |    1.094 s |       0.64% |    1.106 s |       0.71% |    11.885 ms |   1.09% |   FAIL   |
|  HOST_BUFFER  |       SNAPPY       |       0       |      32      |    1.086 s |       0.72% |    1.101 s |       0.74% |    14.529 ms |   1.34% |   FAIL   |
| DEVICE_BUFFER |       SNAPPY       |       0       |      32      |    1.082 s |       0.63% |    1.098 s |       0.67% |    16.798 ms |   1.55% |   FAIL   |
|   FILEPATH    |        NONE        |       0       |      32      |    1.050 s |       0.27% |    1.059 s |       0.73% |     9.672 ms |   0.92% |   FAIL   |
|  HOST_BUFFER  |        NONE        |       0       |      32      |    1.035 s |       0.67% |    1.048 s |       0.73% |    13.583 ms |   1.31% |   FAIL   |
| DEVICE_BUFFER |        NONE        |       0       |      32      |    1.034 s |       0.80% |    1.045 s |       0.64% |    11.315 ms |   1.09% |   FAIL   |
|   FILEPATH    |       SNAPPY       |     1000      |      32      |    1.117 s |       0.53% |    1.130 s |       0.49% |    12.376 ms |   1.11% |   FAIL   |
|  HOST_BUFFER  |       SNAPPY       |     1000      |      32      |    1.109 s |       0.49% |    1.123 s |       0.50% |    14.328 ms |   1.29% |   FAIL   |
| DEVICE_BUFFER |       SNAPPY       |     1000      |      32      |    1.106 s |       0.59% |    1.126 s |       0.17% |    20.326 ms |   1.84% |   FAIL   |
|   FILEPATH    |        NONE        |     1000      |      32      |    1.029 s |       0.53% |    1.045 s |       0.61% |    15.633 ms |   1.52% |   FAIL   |
|  HOST_BUFFER  |        NONE        |     1000      |      32      |    1.025 s |       0.49% |    1.040 s |       0.06% |    14.977 ms |   1.46% |   FAIL   |
| DEVICE_BUFFER |        NONE        |     1000      |      32      |    1.031 s |       0.26% |    1.038 s |       0.17% |     7.165 ms |   0.70% |   FAIL   |

# parquet_read_chunks

## [0] Quadro RTX 6000

|     T     |    io_type    |  cardinality  |  run_length  |  byte_limit  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|-----------|---------------|---------------|--------------|--------------|------------|-------------|------------|-------------|---------------|---------|----------|
| INTEGRAL  | DEVICE_BUFFER |       0       |      1       |      0       |  24.646 ms |       0.07% |  20.620 ms |       0.25% |  -4025.493 us | -16.33% |   FAIL   |
| INTEGRAL  | DEVICE_BUFFER |     1000      |      1       |      0       |  24.005 ms |       0.35% |  20.284 ms |       0.24% |  -3721.359 us | -15.50% |   FAIL   |
| INTEGRAL  | DEVICE_BUFFER |       0       |      32      |      0       |  21.492 ms |       0.50% |  18.074 ms |       0.50% |  -3418.141 us | -15.90% |   FAIL   |
| INTEGRAL  | DEVICE_BUFFER |     1000      |      32      |      0       |  20.759 ms |       0.16% |  17.426 ms |       0.43% |  -3332.946 us | -16.06% |   FAIL   |
| INTEGRAL  | DEVICE_BUFFER |       0       |      1       |    500000    | 206.076 ms |       0.13% | 210.958 ms |       0.20% |      4.882 ms |   2.37% |   FAIL   |
| INTEGRAL  | DEVICE_BUFFER |     1000      |      1       |    500000    | 211.737 ms |       0.06% | 206.936 ms |       0.05% |  -4801.517 us |  -2.27% |   FAIL   |
| INTEGRAL  | DEVICE_BUFFER |       0       |      32      |    500000    | 191.610 ms |       0.10% | 190.775 ms |       0.19% |   -835.413 us |  -0.44% |   FAIL   |
| INTEGRAL  | DEVICE_BUFFER |     1000      |      32      |    500000    | 191.122 ms |       0.21% | 189.118 ms |       0.18% |  -2004.013 us |  -1.05% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |       0       |      1       |      0       |  12.332 ms |       0.50% |   9.576 ms |       0.16% |  -2755.392 us | -22.34% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |     1000      |      1       |      0       |  14.700 ms |       0.50% |  12.748 ms |       0.50% |  -1951.554 us | -13.28% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |       0       |      32      |      0       |  12.429 ms |       5.39% |  10.384 ms |       0.16% |  -2044.469 us | -16.45% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |     1000      |      32      |      0       |  13.010 ms |       4.35% |  10.871 ms |       0.19% |  -2138.692 us | -16.44% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |       0       |      1       |    500000    |  97.573 ms |       0.23% |  66.537 ms |       0.16% | -31035.966 us | -31.81% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |     1000      |      1       |    500000    | 107.469 ms |       0.27% |  84.752 ms |       0.28% | -22716.950 us | -21.14% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |       0       |      32      |    500000    |  95.086 ms |       0.26% |  74.493 ms |       0.15% | -20592.478 us | -21.66% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |     1000      |      32      |    500000    |  95.634 ms |       0.18% |  73.872 ms |       0.21% | -21761.426 us | -22.75% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |       0       |      1       |      0       |  28.070 ms |       0.57% |  24.134 ms |       0.42% |  -3936.545 us | -14.02% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |     1000      |      1       |      0       |  12.604 ms |       1.97% |  11.663 ms |       2.94% |   -940.845 us |  -7.46% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |       0       |      32      |      0       |  13.551 ms |       0.28% |  12.220 ms |       0.18% |  -1330.788 us |  -9.82% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |     1000      |      32      |      0       |   9.656 ms |       0.28% |   8.279 ms |       0.22% |  -1377.165 us | -14.26% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |       0       |      1       |    500000    |  97.520 ms |       0.50% |  59.952 ms |       0.07% | -37568.352 us | -38.52% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |     1000      |      1       |    500000    |  69.184 ms |       0.34% |  60.300 ms |       0.40% |  -8883.777 us | -12.84% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |       0       |      32      |    500000    |  63.886 ms |       0.15% |  53.062 ms |       0.21% | -10824.109 us | -16.94% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |     1000      |      32      |    500000    |  60.067 ms |       0.23% |  48.565 ms |       0.23% | -11501.788 us | -19.15% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |       0       |      1       |      0       |  34.702 ms |       0.50% |  32.566 ms |       0.50% |  -2136.855 us |  -6.16% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |     1000      |      1       |      0       |  12.558 ms |       0.41% |  10.874 ms |       0.41% |  -1684.535 us | -13.41% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |       0       |      32      |      0       |  14.803 ms |       0.18% |  13.131 ms |       0.24% |  -1671.818 us | -11.29% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |     1000      |      32      |      0       |  10.720 ms |       0.31% |   8.928 ms |       0.16% |  -1791.785 us | -16.71% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |       0       |      1       |    500000    | 105.994 ms |       0.31% |  93.799 ms |       0.13% | -12194.481 us | -11.50% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |     1000      |      1       |    500000    |  87.865 ms |       0.20% |  69.684 ms |       0.23% | -18181.288 us | -20.69% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |       0       |      32      |    500000    |  81.901 ms |       0.09% |  65.145 ms |       0.18% | -16755.397 us | -20.46% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |     1000      |      32      |    500000    |  77.813 ms |       0.15% |  60.106 ms |       0.27% | -17707.050 us | -22.76% |   FAIL   |
| DURATION  | DEVICE_BUFFER |       0       |      1       |      0       |  14.926 ms |       0.25% |  12.981 ms |       0.47% |  -1945.295 us | -13.03% |   FAIL   |
| DURATION  | DEVICE_BUFFER |     1000      |      1       |      0       |  11.977 ms |       0.41% |  10.241 ms |       0.45% |  -1736.000 us | -14.49% |   FAIL   |
| DURATION  | DEVICE_BUFFER |       0       |      32      |      0       |  11.707 ms |       0.25% |   9.898 ms |       0.14% |  -1809.748 us | -15.46% |   FAIL   |
| DURATION  | DEVICE_BUFFER |     1000      |      32      |      0       |  10.402 ms |       0.49% |   8.535 ms |       0.11% |  -1867.021 us | -17.95% |   FAIL   |
| DURATION  | DEVICE_BUFFER |       0       |      1       |    500000    |  84.585 ms |       0.14% |  68.369 ms |       0.24% | -16216.952 us | -19.17% |   FAIL   |
| DURATION  | DEVICE_BUFFER |     1000      |      1       |    500000    |  85.377 ms |       0.20% |  66.735 ms |       0.21% | -18642.646 us | -21.84% |   FAIL   |
| DURATION  | DEVICE_BUFFER |       0       |      32      |    500000    |  76.875 ms |       0.15% |  59.947 ms |       0.19% | -16928.464 us | -22.02% |   FAIL   |
| DURATION  | DEVICE_BUFFER |     1000      |      32      |    500000    |  75.601 ms |       0.18% |  57.722 ms |       0.23% | -17879.257 us | -23.65% |   FAIL   |
|  STRING   | DEVICE_BUFFER |       0       |      1       |      0       |  55.084 ms |       0.50% |  55.018 ms |       0.78% |    -66.611 us |  -0.12% |   PASS   |
|  STRING   | DEVICE_BUFFER |     1000      |      1       |      0       |  19.502 ms |       0.27% |  19.617 ms |       0.49% |    115.500 us |   0.59% |   FAIL   |
|  STRING   | DEVICE_BUFFER |       0       |      32      |      0       |  55.233 ms |       0.72% |  55.394 ms |       0.77% |    161.028 us |   0.29% |   PASS   |
|  STRING   | DEVICE_BUFFER |     1000      |      32      |      0       |  15.445 ms |       0.53% |  15.674 ms |       0.50% |    229.270 us |   1.48% |   FAIL   |
|  STRING   | DEVICE_BUFFER |       0       |      1       |    500000    | 197.625 ms |       0.33% | 198.769 ms |       0.48% |      1.144 ms |   0.58% |   FAIL   |
|  STRING   | DEVICE_BUFFER |     1000      |      1       |    500000    |  90.339 ms |       0.48% |  90.797 ms |       0.58% |    457.714 us |   0.51% |   FAIL   |
|  STRING   | DEVICE_BUFFER |       0       |      32      |    500000    | 198.090 ms |       0.35% | 198.238 ms |       0.31% |    147.910 us |   0.07% |   PASS   |
|  STRING   | DEVICE_BUFFER |     1000      |      32      |    500000    |  67.181 ms |       0.17% |  67.796 ms |       0.34% |    615.115 us |   0.92% |   FAIL   |
|   LIST    | DEVICE_BUFFER |       0       |      1       |      0       |  90.825 ms |       0.75% |  91.843 ms |       0.58% |      1.019 ms |   1.12% |   FAIL   |
|   LIST    | DEVICE_BUFFER |     1000      |      1       |      0       |  87.731 ms |       0.63% |  88.633 ms |       0.69% |    901.582 us |   1.03% |   FAIL   |
|   LIST    | DEVICE_BUFFER |       0       |      32      |      0       |  76.089 ms |       0.47% |  77.107 ms |       0.39% |      1.018 ms |   1.34% |   FAIL   |
|   LIST    | DEVICE_BUFFER |     1000      |      32      |      0       |  77.148 ms |       0.69% |  78.719 ms |       0.63% |      1.571 ms |   2.04% |   FAIL   |
|   LIST    | DEVICE_BUFFER |       0       |      1       |    500000    |    1.470 s |       0.15% |    1.469 s |       0.09% |   -336.597 us |  -0.02% |   PASS   |
|   LIST    | DEVICE_BUFFER |     1000      |      1       |    500000    |    1.092 s |       0.15% |    1.092 s |       0.37% |    269.702 us |   0.02% |   PASS   |
|   LIST    | DEVICE_BUFFER |       0       |      32      |    500000    | 956.759 ms |       0.31% | 956.879 ms |       0.24% |    119.287 us |   0.01% |   PASS   |
|   LIST    | DEVICE_BUFFER |     1000      |      32      |    500000    | 959.021 ms |       0.43% | 957.862 ms |       0.24% |  -1159.729 us |  -0.12% |   PASS   |
|  STRUCT   | DEVICE_BUFFER |       0       |      1       |      0       |  66.450 ms |       2.05% |  66.381 ms |       1.19% |    -69.128 us |  -0.10% |   PASS   |
|  STRUCT   | DEVICE_BUFFER |     1000      |      1       |      0       |  39.866 ms |       0.50% |  39.669 ms |       0.36% |   -197.133 us |  -0.49% |   FAIL   |
|  STRUCT   | DEVICE_BUFFER |       0       |      32      |      0       |  66.996 ms |       1.50% |  66.634 ms |       1.54% |   -362.507 us |  -0.54% |   PASS   |
|  STRUCT   | DEVICE_BUFFER |     1000      |      32      |      0       |  34.995 ms |       0.18% |  34.746 ms |       0.24% |   -249.649 us |  -0.71% |   FAIL   |
|  STRUCT   | DEVICE_BUFFER |       0       |      1       |    500000    | 387.274 ms |       0.24% | 381.353 ms |       0.28% |  -5920.825 us |  -1.53% |   FAIL   |
|  STRUCT   | DEVICE_BUFFER |     1000      |      1       |    500000    | 313.150 ms |       0.23% | 308.024 ms |       0.07% |  -5125.220 us |  -1.64% |   FAIL   |
|  STRUCT   | DEVICE_BUFFER |       0       |      32      |    500000    | 388.062 ms |       0.49% | 381.960 ms |       0.12% |  -6102.039 us |  -1.57% |   FAIL   |
|  STRUCT   | DEVICE_BUFFER |     1000      |      32      |    500000    | 306.761 ms |       0.07% | 302.670 ms |       0.15% |  -4090.417 us |  -1.33% |   FAIL   |

# parquet_read_io_small_mixed

## [0] Quadro RTX 6000

|  io_type  |  cardinality  |  run_length  |  num_string_cols  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|-----------|---------------|--------------|-------------------|------------|-------------|------------|-------------|-------------|---------|----------|
| FILEPATH  |       0       |      1       |         1         |   6.298 ms |       3.83% |   6.324 ms |       2.82% |   26.068 us |   0.41% |   PASS   |
| FILEPATH  |     1000      |      1       |         1         |   4.380 ms |       9.20% |   4.389 ms |       7.10% |    8.962 us |   0.20% |   PASS   |
| FILEPATH  |       0       |      32      |         1         |   5.849 ms |       3.40% |   5.846 ms |       3.32% |   -2.973 us |  -0.05% |   PASS   |
| FILEPATH  |     1000      |      32      |         1         |   3.005 ms |       0.50% |   3.104 ms |       2.46% |   98.349 us |   3.27% |   FAIL   |
| FILEPATH  |       0       |      1       |         2         |   6.827 ms |       1.81% |   6.828 ms |       1.85% |    0.700 us |   0.01% |   PASS   |
| FILEPATH  |     1000      |      1       |         2         |   4.362 ms |       3.89% |   4.363 ms |       4.08% |    1.252 us |   0.03% |   PASS   |
| FILEPATH  |       0       |      32      |         2         |   6.498 ms |       1.72% |   6.462 ms |       1.94% |  -36.489 us |  -0.56% |   PASS   |
| FILEPATH  |     1000      |      32      |         2         |   3.020 ms |       0.61% |   3.098 ms |       2.52% |   78.572 us |   2.60% |   FAIL   |
| FILEPATH  |       0       |      1       |         3         |   7.271 ms |       2.64% |   7.412 ms |       2.19% |  141.040 us |   1.94% |   PASS   |
| FILEPATH  |     1000      |      1       |         3         |   4.368 ms |       6.54% |   4.360 ms |       2.48% |   -7.720 us |  -0.18% |   PASS   |
| FILEPATH  |       0       |      32      |         3         |   7.236 ms |       3.17% |   7.096 ms |       2.32% | -139.268 us |  -1.92% |   PASS   |
| FILEPATH  |     1000      |      32      |         3         |   3.101 ms |       2.41% |   3.118 ms |       2.45% |   17.422 us |   0.56% |   PASS   |

# parquet_read_row_selection

## [0] Quadro RTX 6000

|  column_selection  |  row_selection  |  str_to_categories  |  uses_pandas_metadata  |  timestamp_type  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|--------------------|-----------------|---------------------|------------------------|------------------|------------|-------------|------------|-------------|-----------|---------|----------|
|        ALL         |       ALL       |         YES         |          YES           |      EMPTY       |    1.463 s |       0.23% |    1.472 s |       0.21% |  8.307 ms |   0.57% |   FAIL   |
|        ALL         |      NROWS      |         YES         |          YES           |      EMPTY       |    1.669 s |       0.04% |    1.680 s |       0.02% | 11.129 ms |   0.67% |   FAIL   |
|        ALL         |   ROW_GROUPS    |         YES         |          YES           |      EMPTY       |    1.548 s |       0.01% |    1.555 s |       0.00% |  6.236 ms |   0.40% |   FAIL   |

# parquet_read_column_selection

## [0] Quadro RTX 6000

|  column_selection  |  row_selection  |  str_to_categories  |  uses_pandas_metadata  |  timestamp_type  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|--------------------|-----------------|---------------------|------------------------|------------------|------------|-------------|------------|-------------|-----------|---------|----------|
|        ALL         |       ALL       |         YES         |          YES           |      EMPTY       |    1.458 s |       0.49% |    1.472 s |       0.10% | 13.627 ms |   0.93% |   FAIL   |
|     ALTERNATE      |       ALL       |         YES         |          YES           |      EMPTY       |    1.451 s |       0.01% |    1.463 s |       0.52% | 12.089 ms |   0.83% |   FAIL   |
|     FIRST_HALF     |       ALL       |         YES         |          YES           |      EMPTY       |    1.438 s |       0.02% |    1.446 s |       0.60% |  7.382 ms |   0.51% |   FAIL   |
|    SECOND_HALF     |       ALL       |         YES         |          YES           |      EMPTY       |    1.447 s |       0.02% |    1.456 s |       0.53% |  8.770 ms |   0.61% |   FAIL   |

# parquet_read_misc_options

## [0] Quadro RTX 6000

|  column_selection  |  row_selection  |  str_to_categories  |  uses_pandas_metadata  |  timestamp_type  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|--------------------|-----------------|---------------------|------------------------|------------------|------------|-------------|------------|-------------|-----------|---------|----------|
|        ALL         |       ALL       |         YES         |          YES           |      EMPTY       |    1.456 s |       0.50% |    1.471 s |       0.49% | 15.157 ms |   1.04% |   FAIL   |
|        ALL         |       ALL       |         YES         |           NO           |      EMPTY       |    1.464 s |       0.14% |    1.473 s |       0.49% |  9.022 ms |   0.62% |   FAIL   |
|        ALL         |       ALL       |         NO          |          YES           |      EMPTY       |    1.457 s |       0.41% |    1.476 s |       0.23% | 19.434 ms |   1.33% |   FAIL   |
|        ALL         |       ALL       |         NO          |           NO           |      EMPTY       |    1.460 s |       0.47% |    1.476 s |       0.09% | 15.917 ms |   1.09% |   FAIL   |

# Summary

- Total Matches: 143
  - Pass    (diff <= min_noise): 25
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  118
```
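For anyone reading the Status column: the Summary legend (`diff <= min_noise` for Pass, `diff > min_noise` for Failure, infinite noise for Unknown) suggests the rule sketched below. This is a reconstruction from the legend and the table values, not the comparison script's actual source; the function name `classify` and its signature are hypothetical. Under this reading, `FAIL` only means the difference exceeds the smaller of the two noise figures, so speedups (negative diffs) are flagged `FAIL` too.

```python
import math


def classify(ref_time, ref_noise, cmp_time, cmp_noise):
    """Classify one benchmark row (hypothetical helper, inferred from the legend).

    Times can be in any unit as long as both use the same one.
    Noise values are fractions (0.0050 means 0.50%).
    Returns (fractional_diff, status).
    """
    frac_diff = (cmp_time - ref_time) / ref_time

    # Assumption: an infinite noise value on either side makes the row undecidable.
    if math.isinf(ref_noise) or math.isinf(cmp_noise):
        return frac_diff, "UNKNOWN"

    # Assumption: the threshold is the smaller of the two noise figures.
    min_noise = min(ref_noise, cmp_noise)
    status = "PASS" if abs(frac_diff) <= min_noise else "FAIL"
    return frac_diff, status


# Rows copied from the tables above: (ref ms, ref noise, cmp ms, cmp noise, reported status)
rows = [
    (39.866, 0.0050, 39.669, 0.0036, "FAIL"),  # STRUCT, cardinality 1000, run_length 1
    (66.996, 0.0150, 66.634, 0.0154, "PASS"),  # STRUCT, cardinality 0, run_length 32
    (3.005, 0.0050, 3.104, 0.0246, "FAIL"),    # FILEPATH, 1000 / 32 / 1 string col
    (7.271, 0.0264, 7.412, 0.0219, "PASS"),    # FILEPATH, 0 / 1 / 3 string cols
]

for ref_t, ref_n, cmp_t, cmp_n, reported in rows:
    frac, status = classify(ref_t, ref_n, cmp_t, cmp_n)
    print(f"{frac:+.2%} -> {status} (table says {reported})")
```

The four sample rows are consistent with this rule, which is some evidence for the reading but not a guarantee that the tool applies exactly this threshold in every case.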

Authors:
  - Alessandro Bellina (https://github.com/abellina)
  - https://github.com/nvdbaranec

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - https://github.com/nvdbaranec
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15159
Labels: 0 - Backlog (In queue waiting for assignment), CMake (CMake build issue), cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code)