Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add microkernels for fixed-width and fixed-width dictionary in Parquet decode #15159

Merged

Conversation

abellina
Copy link
Contributor

@abellina abellina commented Feb 27, 2024

This PR adds parquet decode fixed width and fixed width dictionary kernels micro kernels based on #13622.

We change rle_stream to allow us to process dictionary pages, and so are able to use it in gpuDecodePageDataFixedDict allowing it to take a "count" to be decoded, which we provide as the number of valid (non-null) elements that were found in the definition stream. Most of the intrusive changes are in rle_stream so that it can work with this argument. One big change is that prior to this we used to "spill" runs that would not fit in the current iteration. We have changed it so that we don't spill anymore and we could have in the runs array a large run that won't be decoded until several calls to decode later. This opens the possibility for us to throw more decode threads at the accumulated run fairly easily, and that may be worked on in this PR or shortly after (load balancing a large run).

The code here is really mostly @nvdbaranec and makes use of @etseidl's great work on rle_stream. It is marked in draft because it's not entirely done (not all testing has been performed). That said, NDS, nvbench and PARQUET_TEST passes. In order to use it, please set USE_FIXED_OP=2 which means we enable both the gpuDecodePageDataFixed and gpuDecodePageDataFixedDict kernels.

Here are USE_FIXED_OP=2 nvbench results against 24.04 on my RTX6000:

# parquet_read_decode

## [0] Quadro RTX 6000

|  data_type  |    io_type    |  cardinality  |  run_length  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------------|---------------|---------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|  INTEGRAL   | DEVICE_BUFFER |       0       |      1       |  24.299 ms |       2.84% |  20.192 ms |       0.38% | -4106.917 us | -16.90% |   FAIL   |
|  INTEGRAL   | DEVICE_BUFFER |     1000      |      1       |  23.668 ms |       3.20% |  20.042 ms |       3.93% | -3625.853 us | -15.32% |   FAIL   |
|  INTEGRAL   | DEVICE_BUFFER |       0       |      32      |  21.226 ms |       0.21% |  17.781 ms |       0.15% | -3444.841 us | -16.23% |   FAIL   |
|  INTEGRAL   | DEVICE_BUFFER |     1000      |      32      |  20.532 ms |       0.30% |  17.221 ms |       0.46% | -3311.125 us | -16.13% |   FAIL   |
|    FLOAT    | DEVICE_BUFFER |       0       |      1       |  12.201 ms |       0.18% |   9.438 ms |       0.49% | -2763.086 us | -22.65% |   FAIL   |
|    FLOAT    | DEVICE_BUFFER |     1000      |      1       |  14.530 ms |       0.50% |  12.578 ms |       0.50% | -1952.573 us | -13.44% |   FAIL   |
|    FLOAT    | DEVICE_BUFFER |       0       |      32      |  12.195 ms |       0.45% |  10.204 ms |       0.45% | -1990.871 us | -16.33% |   FAIL   |
|    FLOAT    | DEVICE_BUFFER |     1000      |      32      |  12.765 ms |       0.31% |  10.703 ms |       0.19% | -2061.599 us | -16.15% |   FAIL   |
|   DECIMAL   | DEVICE_BUFFER |       0       |      1       |  27.572 ms |       2.47% |  23.911 ms |       0.32% | -3661.435 us | -13.28% |   FAIL   |
|   DECIMAL   | DEVICE_BUFFER |     1000      |      1       |  12.410 ms |       0.55% |  11.530 ms |       3.89% |  -880.084 us |  -7.09% |   FAIL   |
|   DECIMAL   | DEVICE_BUFFER |       0       |      32      |  13.479 ms |       0.26% |  12.119 ms |       0.37% | -1360.157 us | -10.09% |   FAIL   |
|   DECIMAL   | DEVICE_BUFFER |     1000      |      32      |   9.583 ms |       0.43% |   8.182 ms |       0.23% | -1400.545 us | -14.62% |   FAIL   |
|  TIMESTAMP  | DEVICE_BUFFER |       0       |      1       |  34.390 ms |       0.50% |  32.211 ms |       0.41% | -2178.951 us |  -6.34% |   FAIL   |
|  TIMESTAMP  | DEVICE_BUFFER |     1000      |      1       |  12.432 ms |       0.43% |  10.741 ms |       0.42% | -1691.559 us | -13.61% |   FAIL   |
|  TIMESTAMP  | DEVICE_BUFFER |       0       |      32      |  14.759 ms |       0.40% |  12.941 ms |       0.19% | -1817.825 us | -12.32% |   FAIL   |
|  TIMESTAMP  | DEVICE_BUFFER |     1000      |      32      |  10.613 ms |       0.32% |   8.791 ms |       0.19% | -1822.373 us | -17.17% |   FAIL   |
|  DURATION   | DEVICE_BUFFER |       0       |      1       |  14.849 ms |       0.29% |  12.812 ms |       0.21% | -2037.408 us | -13.72% |   FAIL   |
|  DURATION   | DEVICE_BUFFER |     1000      |      1       |  11.806 ms |       0.32% |  10.110 ms |       0.43% | -1695.815 us | -14.36% |   FAIL   |
|  DURATION   | DEVICE_BUFFER |       0       |      32      |  11.620 ms |       0.24% |   9.751 ms |       0.15% | -1869.041 us | -16.08% |   FAIL   |
|  DURATION   | DEVICE_BUFFER |     1000      |      32      |  10.307 ms |       0.27% |   8.398 ms |       0.19% | -1909.239 us | -18.52% |   FAIL   |
|   STRING    | DEVICE_BUFFER |       0       |      1       |  55.028 ms |       1.00% |  54.751 ms |       0.68% |  -277.519 us |  -0.50% |   PASS   |
|   STRING    | DEVICE_BUFFER |     1000      |      1       |  19.503 ms |       0.46% |  19.399 ms |       0.30% |  -104.924 us |  -0.54% |   FAIL   |
|   STRING    | DEVICE_BUFFER |       0       |      32      |  55.287 ms |       0.78% |  54.857 ms |       0.38% |  -430.236 us |  -0.78% |   FAIL   |
|   STRING    | DEVICE_BUFFER |     1000      |      32      |  15.392 ms |       0.62% |  15.527 ms |       1.62% |   135.949 us |   0.88% |   FAIL   |
|    LIST     | DEVICE_BUFFER |       0       |      1       |  85.392 ms |       0.64% |  85.956 ms |       0.36% |   564.047 us |   0.66% |   FAIL   |
|    LIST     | DEVICE_BUFFER |     1000      |      1       |  82.151 ms |       0.77% |  82.977 ms |       0.76% |   825.975 us |   1.01% |   FAIL   |
|    LIST     | DEVICE_BUFFER |       0       |      32      |  71.257 ms |       0.77% |  72.425 ms |       0.67% |     1.168 ms |   1.64% |   FAIL   |
|    LIST     | DEVICE_BUFFER |     1000      |      32      |  72.176 ms |       0.19% |  73.750 ms |       1.47% |     1.574 ms |   2.18% |   FAIL   |
|   STRUCT    | DEVICE_BUFFER |       0       |      1       |  66.675 ms |       1.41% |  66.663 ms |       1.26% |   -11.513 us |  -0.02% |   PASS   |
|   STRUCT    | DEVICE_BUFFER |     1000      |      1       |  39.667 ms |       0.41% |  39.758 ms |       0.50% |    91.341 us |   0.23% |   PASS   |
|   STRUCT    | DEVICE_BUFFER |       0       |      32      |  66.765 ms |       1.59% |  66.873 ms |       1.40% |   107.569 us |   0.16% |   PASS   |
|   STRUCT    | DEVICE_BUFFER |     1000      |      32      |  34.681 ms |       0.16% |  34.748 ms |       0.28% |    67.095 us |   0.19% |   FAIL   |

# parquet_read_io_compression

## [0] Quadro RTX 6000

|    io_type    |  compression_type  |  cardinality  |  run_length  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|---------------|--------------------|---------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|   FILEPATH    |       SNAPPY       |       0       |      1       |    1.438 s |       0.42% |    1.444 s |       0.32% |     5.797 ms |   0.40% |   FAIL   |
|  HOST_BUFFER  |       SNAPPY       |       0       |      1       |    1.376 s |       0.13% |    1.386 s |       0.14% |    10.316 ms |   0.75% |   FAIL   |
| DEVICE_BUFFER |       SNAPPY       |       0       |      1       |    1.341 s |       0.57% |    1.349 s |       0.11% |     8.160 ms |   0.61% |   FAIL   |
|   FILEPATH    |        NONE        |       0       |      1       |    1.220 s |       0.43% |    1.216 s |       0.48% | -4048.560 us |  -0.33% |   PASS   |
|  HOST_BUFFER  |        NONE        |       0       |      1       |    1.154 s |       0.13% |    1.156 s |       0.06% |     2.176 ms |   0.19% |   FAIL   |
| DEVICE_BUFFER |        NONE        |       0       |      1       |    1.113 s |       0.02% |    1.115 s |       0.15% |     2.028 ms |   0.18% |   FAIL   |
|   FILEPATH    |       SNAPPY       |     1000      |      1       |    1.317 s |       0.75% |    1.326 s |       0.16% |     9.613 ms |   0.73% |   FAIL   |
|  HOST_BUFFER  |       SNAPPY       |     1000      |      1       |    1.291 s |       0.63% |    1.298 s |       0.68% |     7.390 ms |   0.57% |   PASS   |
| DEVICE_BUFFER |       SNAPPY       |     1000      |      1       |    1.274 s |       0.67% |    1.288 s |       0.64% |    13.666 ms |   1.07% |   FAIL   |
|   FILEPATH    |        NONE        |     1000      |      1       |    1.200 s |       0.64% |    1.218 s |       0.42% |    18.036 ms |   1.50% |   FAIL   |
|  HOST_BUFFER  |        NONE        |     1000      |      1       |    1.181 s |       0.62% |    1.191 s |       0.69% |    10.329 ms |   0.87% |   FAIL   |
| DEVICE_BUFFER |        NONE        |     1000      |      1       |    1.166 s |       0.79% |    1.178 s |       0.67% |    11.599 ms |   0.99% |   FAIL   |
|   FILEPATH    |       SNAPPY       |       0       |      32      |    1.094 s |       0.64% |    1.106 s |       0.71% |    11.885 ms |   1.09% |   FAIL   |
|  HOST_BUFFER  |       SNAPPY       |       0       |      32      |    1.086 s |       0.72% |    1.101 s |       0.74% |    14.529 ms |   1.34% |   FAIL   |
| DEVICE_BUFFER |       SNAPPY       |       0       |      32      |    1.082 s |       0.63% |    1.098 s |       0.67% |    16.798 ms |   1.55% |   FAIL   |
|   FILEPATH    |        NONE        |       0       |      32      |    1.050 s |       0.27% |    1.059 s |       0.73% |     9.672 ms |   0.92% |   FAIL   |
|  HOST_BUFFER  |        NONE        |       0       |      32      |    1.035 s |       0.67% |    1.048 s |       0.73% |    13.583 ms |   1.31% |   FAIL   |
| DEVICE_BUFFER |        NONE        |       0       |      32      |    1.034 s |       0.80% |    1.045 s |       0.64% |    11.315 ms |   1.09% |   FAIL   |
|   FILEPATH    |       SNAPPY       |     1000      |      32      |    1.117 s |       0.53% |    1.130 s |       0.49% |    12.376 ms |   1.11% |   FAIL   |
|  HOST_BUFFER  |       SNAPPY       |     1000      |      32      |    1.109 s |       0.49% |    1.123 s |       0.50% |    14.328 ms |   1.29% |   FAIL   |
| DEVICE_BUFFER |       SNAPPY       |     1000      |      32      |    1.106 s |       0.59% |    1.126 s |       0.17% |    20.326 ms |   1.84% |   FAIL   |
|   FILEPATH    |        NONE        |     1000      |      32      |    1.029 s |       0.53% |    1.045 s |       0.61% |    15.633 ms |   1.52% |   FAIL   |
|  HOST_BUFFER  |        NONE        |     1000      |      32      |    1.025 s |       0.49% |    1.040 s |       0.06% |    14.977 ms |   1.46% |   FAIL   |
| DEVICE_BUFFER |        NONE        |     1000      |      32      |    1.031 s |       0.26% |    1.038 s |       0.17% |     7.165 ms |   0.70% |   FAIL   |

# parquet_read_chunks

## [0] Quadro RTX 6000

|     T     |    io_type    |  cardinality  |  run_length  |  byte_limit  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|-----------|---------------|---------------|--------------|--------------|------------|-------------|------------|-------------|---------------|---------|----------|
| INTEGRAL  | DEVICE_BUFFER |       0       |      1       |      0       |  24.646 ms |       0.07% |  20.620 ms |       0.25% |  -4025.493 us | -16.33% |   FAIL   |
| INTEGRAL  | DEVICE_BUFFER |     1000      |      1       |      0       |  24.005 ms |       0.35% |  20.284 ms |       0.24% |  -3721.359 us | -15.50% |   FAIL   |
| INTEGRAL  | DEVICE_BUFFER |       0       |      32      |      0       |  21.492 ms |       0.50% |  18.074 ms |       0.50% |  -3418.141 us | -15.90% |   FAIL   |
| INTEGRAL  | DEVICE_BUFFER |     1000      |      32      |      0       |  20.759 ms |       0.16% |  17.426 ms |       0.43% |  -3332.946 us | -16.06% |   FAIL   |
| INTEGRAL  | DEVICE_BUFFER |       0       |      1       |    500000    | 206.076 ms |       0.13% | 210.958 ms |       0.20% |      4.882 ms |   2.37% |   FAIL   |
| INTEGRAL  | DEVICE_BUFFER |     1000      |      1       |    500000    | 211.737 ms |       0.06% | 206.936 ms |       0.05% |  -4801.517 us |  -2.27% |   FAIL   |
| INTEGRAL  | DEVICE_BUFFER |       0       |      32      |    500000    | 191.610 ms |       0.10% | 190.775 ms |       0.19% |   -835.413 us |  -0.44% |   FAIL   |
| INTEGRAL  | DEVICE_BUFFER |     1000      |      32      |    500000    | 191.122 ms |       0.21% | 189.118 ms |       0.18% |  -2004.013 us |  -1.05% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |       0       |      1       |      0       |  12.332 ms |       0.50% |   9.576 ms |       0.16% |  -2755.392 us | -22.34% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |     1000      |      1       |      0       |  14.700 ms |       0.50% |  12.748 ms |       0.50% |  -1951.554 us | -13.28% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |       0       |      32      |      0       |  12.429 ms |       5.39% |  10.384 ms |       0.16% |  -2044.469 us | -16.45% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |     1000      |      32      |      0       |  13.010 ms |       4.35% |  10.871 ms |       0.19% |  -2138.692 us | -16.44% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |       0       |      1       |    500000    |  97.573 ms |       0.23% |  66.537 ms |       0.16% | -31035.966 us | -31.81% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |     1000      |      1       |    500000    | 107.469 ms |       0.27% |  84.752 ms |       0.28% | -22716.950 us | -21.14% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |       0       |      32      |    500000    |  95.086 ms |       0.26% |  74.493 ms |       0.15% | -20592.478 us | -21.66% |   FAIL   |
|   FLOAT   | DEVICE_BUFFER |     1000      |      32      |    500000    |  95.634 ms |       0.18% |  73.872 ms |       0.21% | -21761.426 us | -22.75% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |       0       |      1       |      0       |  28.070 ms |       0.57% |  24.134 ms |       0.42% |  -3936.545 us | -14.02% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |     1000      |      1       |      0       |  12.604 ms |       1.97% |  11.663 ms |       2.94% |   -940.845 us |  -7.46% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |       0       |      32      |      0       |  13.551 ms |       0.28% |  12.220 ms |       0.18% |  -1330.788 us |  -9.82% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |     1000      |      32      |      0       |   9.656 ms |       0.28% |   8.279 ms |       0.22% |  -1377.165 us | -14.26% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |       0       |      1       |    500000    |  97.520 ms |       0.50% |  59.952 ms |       0.07% | -37568.352 us | -38.52% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |     1000      |      1       |    500000    |  69.184 ms |       0.34% |  60.300 ms |       0.40% |  -8883.777 us | -12.84% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |       0       |      32      |    500000    |  63.886 ms |       0.15% |  53.062 ms |       0.21% | -10824.109 us | -16.94% |   FAIL   |
|  DECIMAL  | DEVICE_BUFFER |     1000      |      32      |    500000    |  60.067 ms |       0.23% |  48.565 ms |       0.23% | -11501.788 us | -19.15% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |       0       |      1       |      0       |  34.702 ms |       0.50% |  32.566 ms |       0.50% |  -2136.855 us |  -6.16% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |     1000      |      1       |      0       |  12.558 ms |       0.41% |  10.874 ms |       0.41% |  -1684.535 us | -13.41% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |       0       |      32      |      0       |  14.803 ms |       0.18% |  13.131 ms |       0.24% |  -1671.818 us | -11.29% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |     1000      |      32      |      0       |  10.720 ms |       0.31% |   8.928 ms |       0.16% |  -1791.785 us | -16.71% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |       0       |      1       |    500000    | 105.994 ms |       0.31% |  93.799 ms |       0.13% | -12194.481 us | -11.50% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |     1000      |      1       |    500000    |  87.865 ms |       0.20% |  69.684 ms |       0.23% | -18181.288 us | -20.69% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |       0       |      32      |    500000    |  81.901 ms |       0.09% |  65.145 ms |       0.18% | -16755.397 us | -20.46% |   FAIL   |
| TIMESTAMP | DEVICE_BUFFER |     1000      |      32      |    500000    |  77.813 ms |       0.15% |  60.106 ms |       0.27% | -17707.050 us | -22.76% |   FAIL   |
| DURATION  | DEVICE_BUFFER |       0       |      1       |      0       |  14.926 ms |       0.25% |  12.981 ms |       0.47% |  -1945.295 us | -13.03% |   FAIL   |
| DURATION  | DEVICE_BUFFER |     1000      |      1       |      0       |  11.977 ms |       0.41% |  10.241 ms |       0.45% |  -1736.000 us | -14.49% |   FAIL   |
| DURATION  | DEVICE_BUFFER |       0       |      32      |      0       |  11.707 ms |       0.25% |   9.898 ms |       0.14% |  -1809.748 us | -15.46% |   FAIL   |
| DURATION  | DEVICE_BUFFER |     1000      |      32      |      0       |  10.402 ms |       0.49% |   8.535 ms |       0.11% |  -1867.021 us | -17.95% |   FAIL   |
| DURATION  | DEVICE_BUFFER |       0       |      1       |    500000    |  84.585 ms |       0.14% |  68.369 ms |       0.24% | -16216.952 us | -19.17% |   FAIL   |
| DURATION  | DEVICE_BUFFER |     1000      |      1       |    500000    |  85.377 ms |       0.20% |  66.735 ms |       0.21% | -18642.646 us | -21.84% |   FAIL   |
| DURATION  | DEVICE_BUFFER |       0       |      32      |    500000    |  76.875 ms |       0.15% |  59.947 ms |       0.19% | -16928.464 us | -22.02% |   FAIL   |
| DURATION  | DEVICE_BUFFER |     1000      |      32      |    500000    |  75.601 ms |       0.18% |  57.722 ms |       0.23% | -17879.257 us | -23.65% |   FAIL   |
|  STRING   | DEVICE_BUFFER |       0       |      1       |      0       |  55.084 ms |       0.50% |  55.018 ms |       0.78% |    -66.611 us |  -0.12% |   PASS   |
|  STRING   | DEVICE_BUFFER |     1000      |      1       |      0       |  19.502 ms |       0.27% |  19.617 ms |       0.49% |    115.500 us |   0.59% |   FAIL   |
|  STRING   | DEVICE_BUFFER |       0       |      32      |      0       |  55.233 ms |       0.72% |  55.394 ms |       0.77% |    161.028 us |   0.29% |   PASS   |
|  STRING   | DEVICE_BUFFER |     1000      |      32      |      0       |  15.445 ms |       0.53% |  15.674 ms |       0.50% |    229.270 us |   1.48% |   FAIL   |
|  STRING   | DEVICE_BUFFER |       0       |      1       |    500000    | 197.625 ms |       0.33% | 198.769 ms |       0.48% |      1.144 ms |   0.58% |   FAIL   |
|  STRING   | DEVICE_BUFFER |     1000      |      1       |    500000    |  90.339 ms |       0.48% |  90.797 ms |       0.58% |    457.714 us |   0.51% |   FAIL   |
|  STRING   | DEVICE_BUFFER |       0       |      32      |    500000    | 198.090 ms |       0.35% | 198.238 ms |       0.31% |    147.910 us |   0.07% |   PASS   |
|  STRING   | DEVICE_BUFFER |     1000      |      32      |    500000    |  67.181 ms |       0.17% |  67.796 ms |       0.34% |    615.115 us |   0.92% |   FAIL   |
|   LIST    | DEVICE_BUFFER |       0       |      1       |      0       |  90.825 ms |       0.75% |  91.843 ms |       0.58% |      1.019 ms |   1.12% |   FAIL   |
|   LIST    | DEVICE_BUFFER |     1000      |      1       |      0       |  87.731 ms |       0.63% |  88.633 ms |       0.69% |    901.582 us |   1.03% |   FAIL   |
|   LIST    | DEVICE_BUFFER |       0       |      32      |      0       |  76.089 ms |       0.47% |  77.107 ms |       0.39% |      1.018 ms |   1.34% |   FAIL   |
|   LIST    | DEVICE_BUFFER |     1000      |      32      |      0       |  77.148 ms |       0.69% |  78.719 ms |       0.63% |      1.571 ms |   2.04% |   FAIL   |
|   LIST    | DEVICE_BUFFER |       0       |      1       |    500000    |    1.470 s |       0.15% |    1.469 s |       0.09% |   -336.597 us |  -0.02% |   PASS   |
|   LIST    | DEVICE_BUFFER |     1000      |      1       |    500000    |    1.092 s |       0.15% |    1.092 s |       0.37% |    269.702 us |   0.02% |   PASS   |
|   LIST    | DEVICE_BUFFER |       0       |      32      |    500000    | 956.759 ms |       0.31% | 956.879 ms |       0.24% |    119.287 us |   0.01% |   PASS   |
|   LIST    | DEVICE_BUFFER |     1000      |      32      |    500000    | 959.021 ms |       0.43% | 957.862 ms |       0.24% |  -1159.729 us |  -0.12% |   PASS   |
|  STRUCT   | DEVICE_BUFFER |       0       |      1       |      0       |  66.450 ms |       2.05% |  66.381 ms |       1.19% |    -69.128 us |  -0.10% |   PASS   |
|  STRUCT   | DEVICE_BUFFER |     1000      |      1       |      0       |  39.866 ms |       0.50% |  39.669 ms |       0.36% |   -197.133 us |  -0.49% |   FAIL   |
|  STRUCT   | DEVICE_BUFFER |       0       |      32      |      0       |  66.996 ms |       1.50% |  66.634 ms |       1.54% |   -362.507 us |  -0.54% |   PASS   |
|  STRUCT   | DEVICE_BUFFER |     1000      |      32      |      0       |  34.995 ms |       0.18% |  34.746 ms |       0.24% |   -249.649 us |  -0.71% |   FAIL   |
|  STRUCT   | DEVICE_BUFFER |       0       |      1       |    500000    | 387.274 ms |       0.24% | 381.353 ms |       0.28% |  -5920.825 us |  -1.53% |   FAIL   |
|  STRUCT   | DEVICE_BUFFER |     1000      |      1       |    500000    | 313.150 ms |       0.23% | 308.024 ms |       0.07% |  -5125.220 us |  -1.64% |   FAIL   |
|  STRUCT   | DEVICE_BUFFER |       0       |      32      |    500000    | 388.062 ms |       0.49% | 381.960 ms |       0.12% |  -6102.039 us |  -1.57% |   FAIL   |
|  STRUCT   | DEVICE_BUFFER |     1000      |      32      |    500000    | 306.761 ms |       0.07% | 302.670 ms |       0.15% |  -4090.417 us |  -1.33% |   FAIL   |

# parquet_read_io_small_mixed

## [0] Quadro RTX 6000

|  io_type  |  cardinality  |  run_length  |  num_string_cols  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|-----------|---------------|--------------|-------------------|------------|-------------|------------|-------------|-------------|---------|----------|
| FILEPATH  |       0       |      1       |         1         |   6.298 ms |       3.83% |   6.324 ms |       2.82% |   26.068 us |   0.41% |   PASS   |
| FILEPATH  |     1000      |      1       |         1         |   4.380 ms |       9.20% |   4.389 ms |       7.10% |    8.962 us |   0.20% |   PASS   |
| FILEPATH  |       0       |      32      |         1         |   5.849 ms |       3.40% |   5.846 ms |       3.32% |   -2.973 us |  -0.05% |   PASS   |
| FILEPATH  |     1000      |      32      |         1         |   3.005 ms |       0.50% |   3.104 ms |       2.46% |   98.349 us |   3.27% |   FAIL   |
| FILEPATH  |       0       |      1       |         2         |   6.827 ms |       1.81% |   6.828 ms |       1.85% |    0.700 us |   0.01% |   PASS   |
| FILEPATH  |     1000      |      1       |         2         |   4.362 ms |       3.89% |   4.363 ms |       4.08% |    1.252 us |   0.03% |   PASS   |
| FILEPATH  |       0       |      32      |         2         |   6.498 ms |       1.72% |   6.462 ms |       1.94% |  -36.489 us |  -0.56% |   PASS   |
| FILEPATH  |     1000      |      32      |         2         |   3.020 ms |       0.61% |   3.098 ms |       2.52% |   78.572 us |   2.60% |   FAIL   |
| FILEPATH  |       0       |      1       |         3         |   7.271 ms |       2.64% |   7.412 ms |       2.19% |  141.040 us |   1.94% |   PASS   |
| FILEPATH  |     1000      |      1       |         3         |   4.368 ms |       6.54% |   4.360 ms |       2.48% |   -7.720 us |  -0.18% |   PASS   |
| FILEPATH  |       0       |      32      |         3         |   7.236 ms |       3.17% |   7.096 ms |       2.32% | -139.268 us |  -1.92% |   PASS   |
| FILEPATH  |     1000      |      32      |         3         |   3.101 ms |       2.41% |   3.118 ms |       2.45% |   17.422 us |   0.56% |   PASS   |

# parquet_read_row_selection

## [0] Quadro RTX 6000

|  column_selection  |  row_selection  |  str_to_categories  |  uses_pandas_metadata  |  timestamp_type  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|--------------------|-----------------|---------------------|------------------------|------------------|------------|-------------|------------|-------------|-----------|---------|----------|
|        ALL         |       ALL       |         YES         |          YES           |      EMPTY       |    1.463 s |       0.23% |    1.472 s |       0.21% |  8.307 ms |   0.57% |   FAIL   |
|        ALL         |      NROWS      |         YES         |          YES           |      EMPTY       |    1.669 s |       0.04% |    1.680 s |       0.02% | 11.129 ms |   0.67% |   FAIL   |
|        ALL         |   ROW_GROUPS    |         YES         |          YES           |      EMPTY       |    1.548 s |       0.01% |    1.555 s |       0.00% |  6.236 ms |   0.40% |   FAIL   |

# parquet_read_column_selection

## [0] Quadro RTX 6000

|  column_selection  |  row_selection  |  str_to_categories  |  uses_pandas_metadata  |  timestamp_type  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|--------------------|-----------------|---------------------|------------------------|------------------|------------|-------------|------------|-------------|-----------|---------|----------|
|        ALL         |       ALL       |         YES         |          YES           |      EMPTY       |    1.458 s |       0.49% |    1.472 s |       0.10% | 13.627 ms |   0.93% |   FAIL   |
|     ALTERNATE      |       ALL       |         YES         |          YES           |      EMPTY       |    1.451 s |       0.01% |    1.463 s |       0.52% | 12.089 ms |   0.83% |   FAIL   |
|     FIRST_HALF     |       ALL       |         YES         |          YES           |      EMPTY       |    1.438 s |       0.02% |    1.446 s |       0.60% |  7.382 ms |   0.51% |   FAIL   |
|    SECOND_HALF     |       ALL       |         YES         |          YES           |      EMPTY       |    1.447 s |       0.02% |    1.456 s |       0.53% |  8.770 ms |   0.61% |   FAIL   |

# parquet_read_misc_options

## [0] Quadro RTX 6000

|  column_selection  |  row_selection  |  str_to_categories  |  uses_pandas_metadata  |  timestamp_type  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|--------------------|-----------------|---------------------|------------------------|------------------|------------|-------------|------------|-------------|-----------|---------|----------|
|        ALL         |       ALL       |         YES         |          YES           |      EMPTY       |    1.456 s |       0.50% |    1.471 s |       0.49% | 15.157 ms |   1.04% |   FAIL   |
|        ALL         |       ALL       |         YES         |           NO           |      EMPTY       |    1.464 s |       0.14% |    1.473 s |       0.49% |  9.022 ms |   0.62% |   FAIL   |
|        ALL         |       ALL       |         NO          |          YES           |      EMPTY       |    1.457 s |       0.41% |    1.476 s |       0.23% | 19.434 ms |   1.33% |   FAIL   |
|        ALL         |       ALL       |         NO          |           NO           |      EMPTY       |    1.460 s |       0.47% |    1.476 s |       0.09% | 15.917 ms |   1.09% |   FAIL   |

# Summary

- Total Matches: 143
  - Pass    (diff <= min_noise): 25
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  118

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

copy-pr-bot bot commented Feb 27, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Feb 27, 2024
@abellina abellina added Performance Performance related issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS labels Feb 27, 2024
dictionary

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@abellina abellina force-pushed the fixed_ukernel_rlestream_24.04_rebase branch from d6e2b02 to d66835d Compare February 27, 2024 20:27
@abellina
Copy link
Contributor Author

I force pushed a signed version of my commit, fyi.

@vuule vuule self-requested a review February 27, 2024 22:32
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went over everything, except the rle_stream changes; that one will take a bit more time.
The fixed width kernels look great!

Posted some questions and nitpicks.

cpp/src/io/parquet/decode_fixed.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_hdr.cu Outdated Show resolved Hide resolved
__global__ void __launch_bounds__(decode_block_size) gpuDecodePageDataFixed(
PageInfo* pages, device_span<ColumnChunkDesc const> chunks, size_t min_row, size_t num_rows)
{
__shared__ __align__(16) page_state_s state_g;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we gain anything by splitting page_state_s so that only the relevant members are included in the specialized kernels?
I assume its size is small compared to rle_run objects so it might not move the needle WRT shared memory use, but asking just in case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. We should add this to another follow on for us to tackle in the future. Will file. @nvdbaranec fyi.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed this: #15267

cpp/src/io/parquet/page_hdr.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_data.cuh Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_data.cuh Outdated Show resolved Hide resolved
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great so far. Just some random comments.

cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
__shared__ rle_run<level_t> def_runs[rle_run_buffer_size];
rle_stream<level_t, decode_block_size, rolling_buf_size> def_decoder{def_runs};

bool const nullable = s->col.max_level[level_type::DEFINITION] > 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know null counts by the time we get here? Maybe via metadata (v2 headers, page indexes)? It would be nice to skip a RLE decode of a stream of all 1's (esp. given how spark likes to make everything nullable).

Copy link
Contributor

@etseidl etseidl Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followup: I did a quick check with some data that I know has no nulls, but the schema says OPTIONAL, so the max def level is 1 (but always 1). Setting nullable to false reduces the dict decode time to 6.7ms from 9.8ms for this data set. Total times to decode integer data were 12.6ms for monolithic, 10.7ms for this PR (some flat, some dict), and 7.7ms with nullable = false.

For non-nullable we also don't need the nz_idx array, which saves 1000 bytes of shared mem, but the expense of populating it is pretty negligible. Just food for thought if we want to go with a kernel exclusively for flat, non-nullable (or no nulls present) data.

I was also thinking we could reason about the presence of nulls by examining the head of the def levels...if the first run has a size equal to num_values, then we know it's either all nulls or non nulls, depending on the encoded run value.


// should the size be 1/2 (128?)
__shared__ rle_run<uint32_t> dict_runs[rle_run_buffer_size]; // should be array of 6
rle_stream<uint32_t, decode_block_size, rolling_buf_size> dict_stream{dict_runs};
Copy link
Contributor

@etseidl etseidl Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could really go crazy and parameterize this on the number of bits used for the dictionary keys. I'd bet that in the large majority of cases the keys would fit in a uint16_t.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes? @nvdbaranec has plans for a "grand templatization" so I think this fits right in. Lets do as follow on.

cpp/src/io/parquet/page_hdr.cu Outdated Show resolved Hide resolved
@etseidl
Copy link
Contributor

etseidl commented Feb 29, 2024

Just to beat the no nulls case to death, I modified the definition of nullable.

  bool const no_nulls =
    s->page.num_input_values == (s->initial_rle_run[level_type::DEFINITION] >> 1) &&
    s->initial_rle_value[level_type::DEFINITION] == s->col.max_level[level_type::DEFINITION];
  bool const nullable = s->col.max_level[level_type::DEFINITION] > 0 and not no_nulls;

Then I modified the null probability in the parquet_read_decode benchmark to be 0 and compared monolithic, this PR, and my modification.

baseline
| data_type |    io_type    | cardinality | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | 
|-----------|---------------|-------------|------------|---------|-----------|-------|-----------|-------|------------------|
|  INTEGRAL | DEVICE_BUFFER |           0 |          1 |   1400x | 10.668 ms | 0.67% | 10.664 ms | 0.67% |      50346204219 |         
|  INTEGRAL | DEVICE_BUFFER |        1000 |          1 |     60x | 10.509 ms | 0.50% | 10.504 ms | 0.50% |      51109003358 |       
|  INTEGRAL | DEVICE_BUFFER |           0 |         32 |     62x | 11.500 ms | 0.50% | 11.495 ms | 0.50% |      46703489028 |       
|  INTEGRAL | DEVICE_BUFFER |        1000 |         32 |   1352x | 11.048 ms | 1.08% | 11.043 ms | 1.08% |      48614861959 |       

this PR
| data_type |    io_type    | cardinality | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | 
|-----------|---------------|-------------|------------|---------|-----------|-------|-----------|-------|------------------|
|  INTEGRAL | DEVICE_BUFFER |           0 |          1 |   1312x | 10.270 ms | 1.04% | 10.266 ms | 1.04% |      52297918168 |         
|  INTEGRAL | DEVICE_BUFFER |        1000 |          1 |    180x | 10.778 ms | 0.50% | 10.774 ms | 0.50% |      49832097922 |       
|  INTEGRAL | DEVICE_BUFFER |           0 |         32 |   1334x | 11.196 ms | 1.31% | 11.191 ms | 1.31% |      47973598600 |       
|  INTEGRAL | DEVICE_BUFFER |        1000 |         32 |   1418x | 10.533 ms | 5.63% | 10.528 ms | 5.60% |      50995279694 |       

my mods
| data_type |    io_type    | cardinality | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | 
|-----------|---------------|-------------|------------|---------|----------|-------|----------|-------|------------------|
|  INTEGRAL | DEVICE_BUFFER |           0 |          1 |    912x | 8.789 ms | 1.75% | 8.784 ms | 1.75% |      61117276983 |         
|  INTEGRAL | DEVICE_BUFFER |        1000 |          1 |   1675x | 8.908 ms | 7.62% | 8.902 ms | 7.52% |      60305784528 |       
|  INTEGRAL | DEVICE_BUFFER |           0 |         32 |    160x | 9.090 ms | 1.64% | 9.085 ms | 1.63% |      59093445255 |       
|  INTEGRAL | DEVICE_BUFFER |        1000 |         32 |   1751x | 8.518 ms | 3.61% | 8.513 ms | 3.61% |      63061423469 | 

@abellina abellina marked this pull request as ready for review March 4, 2024 15:17
@abellina abellina requested review from a team as code owners March 4, 2024 15:17
@abellina
Copy link
Contributor Author

abellina commented Mar 4, 2024

Moved out of draft. I've tested this with the tests we have on hand (spark integration tests, NDS, etc). And I am not seeing differences. I'll work on incorporating feedback and retesting.

Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving CMake changes

@nvdbaranec
Copy link
Contributor

Just to beat the no nulls case to death, I modified the definition of nullable.

  bool const no_nulls =
    s->page.num_input_values == (s->initial_rle_run[level_type::DEFINITION] >> 1) &&
    s->initial_rle_value[level_type::DEFINITION] == s->col.max_level[level_type::DEFINITION];
  bool const nullable = s->col.max_level[level_type::DEFINITION] > 0 and not no_nulls;

Interesting idea. So this would catch one common case where we have one giant repeated run of "not null". It's a little wonky, but I can't think of a failure case. If some writer somehow managed to encode an all-valid set of data using more than one run, this would still work - it would just traverse the data the normal way.

@etseidl
Copy link
Contributor

etseidl commented Mar 4, 2024

Interesting idea. So this would catch one common case where we have one giant repeated run of "not null". It's a little wonky, but I can't think of a failure case. If some writer somehow managed to encode an all-valid set of data using more than one run, this would still work - it would just traverse the data the normal way.

Exactly...can't guard against sub optimal writers :) But we could also augment this test with either V2 header stats or page indexes (when they're available for use...shameless #14973 plug 🤣).

cpp/src/io/parquet/decode_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_hdr.cu Outdated Show resolved Hide resolved
@nvdbaranec nvdbaranec self-requested a review March 11, 2024 15:28
@nvdbaranec
Copy link
Contributor

/ok to test

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. A few more nits that can be ignored. Thanks!

Not for this PR, but it would be nice to go back and change all of the various RLE decoders to use the new is_literal_run() and is_repeated_run().

cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
int const valid_map_offset = ni.valid_map_offset;
int const row_index_lower_bound = s->row_index_lower_bound;

__syncthreads();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this sync is needed. Might want to add a comment to remind future us why it's here.

@nvdbaranec
Copy link
Contributor

/ok to test

Comment on lines +169 to +175
if (!is_string_col(chunk) && !is_nested(chunk) && !is_byte_array(chunk) && !is_boolean(chunk)) {
if (page.encoding == Encoding::PLAIN) {
return decode_kernel_mask::FIXED_WIDTH_NO_DICT;
} else if (page.encoding == Encoding::PLAIN_DICTIONARY) {
return decode_kernel_mask::FIXED_WIDTH_DICT;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this moves to the bottom, the !is_string_col() check can go away. Should also check for Encoding::RLE_DICTIONARY since PLAIN_DICTIONARY is deprecated for V2 headers.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have a few questions (and a headache).

cpp/src/io/parquet/rle_stream.cuh Show resolved Hide resolved
cpp/src/io/parquet/rle_stream.cuh Outdated Show resolved Hide resolved
cpp/src/io/parquet/rle_stream.cuh Show resolved Hide resolved
cpp/src/io/parquet/rle_stream.cuh Outdated Show resolved Hide resolved
cpp/src/io/parquet/rle_stream.cuh Outdated Show resolved Hide resolved
@@ -154,154 +176,94 @@ struct rle_stream {
static constexpr int run_buffer_size = rle_stream_required_run_buffer_size<decode_threads>();

int level_bits;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts to having prefixes/suffixes to help tell data members apart from local variables?
not too relevant of this PR

Copy link
Contributor

@nvdbaranec nvdbaranec Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the past I've use m_ for member variables. Not sure I really loved it though.

cpp/src/io/parquet/rle_stream.cuh Show resolved Hide resolved
@nvdbaranec
Copy link
Contributor

/ok to test

@abellina
Copy link
Contributor Author

/ok to test

@nvdbaranec nvdbaranec added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Mar 13, 2024
@nvdbaranec
Copy link
Contributor

Added do-not-merge so I can finish looking at the odd crash in the python tests.

@nvdbaranec
Copy link
Contributor

/ok to test

@nvdbaranec nvdbaranec removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Mar 14, 2024
@nvdbaranec
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit 7d4aaaa into rapidsai:branch-24.04 Mar 14, 2024
76 checks passed
rapids-bot bot pushed a commit that referenced this pull request Mar 19, 2024
See #15297. The Parquet string decoder can become a bottleneck in the presence of strings of widely varying sizes. This PR is an attempt to address this, at least as a stop gap solution. A more complete solution may be to rework the string decoder to work in a block-wide fashion, such as the new micro-kernels added in #15159.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15304
rapids-bot bot pushed a commit that referenced this pull request Apr 24, 2024
Closes #15226. Part of #13501.  Adds support for reading and writing `BYTE_STREAM_SPLIT` encoded Parquet data. Includes a "microkernel" version like those introduced by #15159.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15311
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue Spark Functionality that helps Spark RAPIDS
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants