Add microkernels for fixed-width and fixed-width dictionary in Parquet decode #15159
Conversation
Force-pushed from d6e2b02 to d66835d (Signed-off-by: Alessandro Bellina <abellina@nvidia.com>).

I force pushed a signed version of my commit, fyi.
Went over everything, except the `rle_stream` changes; those will take a bit more time. The fixed width kernels look great! Posted some questions and nitpicks.
```cuda
__global__ void __launch_bounds__(decode_block_size) gpuDecodePageDataFixed(
  PageInfo* pages, device_span<ColumnChunkDesc const> chunks, size_t min_row, size_t num_rows)
{
  __shared__ __align__(16) page_state_s state_g;
```
Could we gain anything by splitting `page_state_s` so that only the relevant members are included in the specialized kernels? I assume its size is small compared to the `rle_run` objects, so it might not move the needle WRT shared memory use, but asking just in case.
Yes, I think so. We should add this to another follow-on for us to tackle in the future. Will file. @nvdbaranec fyi.
Filed this: #15267
Looks great so far. Just some random comments.
cpp/src/io/parquet/decode_fixed.cu
Outdated
```cuda
__shared__ rle_run<level_t> def_runs[rle_run_buffer_size];
rle_stream<level_t, decode_block_size, rolling_buf_size> def_decoder{def_runs};

bool const nullable = s->col.max_level[level_type::DEFINITION] > 0;
```
Do we know null counts by the time we get here? Maybe via metadata (v2 headers, page indexes)? It would be nice to skip a RLE decode of a stream of all 1's (esp. given how spark likes to make everything nullable).
Followup: I did a quick check with some data that I know has no nulls, but the schema says `OPTIONAL`, so the max def level is 1 (but always 1). Setting `nullable` to false reduces the dict decode time to 6.7ms from 9.8ms for this data set. Total times to decode integer data were 12.6ms for monolithic, 10.7ms for this PR (some flat, some dict), and 7.7ms with `nullable = false`.

For non-nullable data we also don't need the `nz_idx` array, which saves 1000 bytes of shared mem, but the expense of populating it is pretty negligible. Just food for thought if we want to go with a kernel exclusively for flat, non-nullable (or no nulls present) data.

I was also thinking we could reason about the presence of nulls by examining the head of the def levels... if the first run has a size equal to `num_values`, then we know it's either all nulls or all non-nulls, depending on the encoded run value.
```cuda
// should the size be 1/2 (128?)
__shared__ rle_run<uint32_t> dict_runs[rle_run_buffer_size];  // should be array of 6
rle_stream<uint32_t, decode_block_size, rolling_buf_size> dict_stream{dict_runs};
```
Could really go crazy and parameterize this on the number of bits used for the dictionary keys. I'd bet that in the large majority of cases the keys would fit in a `uint16_t`.
Yes! @nvdbaranec has plans for a "grand templatization", so I think this fits right in. Let's do as a follow-on.
Just to beat the no nulls case to death, I modified the definition of `nullable`:

```cuda
bool const no_nulls =
  s->page.num_input_values == (s->initial_rle_run[level_type::DEFINITION] >> 1) &&
  s->initial_rle_value[level_type::DEFINITION] == s->col.max_level[level_type::DEFINITION];
bool const nullable = s->col.max_level[level_type::DEFINITION] > 0 and not no_nulls;
```

Then I modified the null probability in the parquet_read_decode benchmark to be 0 and compared monolithic, this PR, and my modification.
Moved out of draft. I've tested this with the tests we have on hand (Spark integration tests, NDS, etc.), and I am not seeing differences. I'll work on incorporating feedback and retesting.
…move extra nz_idx
Approving CMake changes
Interesting idea. So this would catch one common case where we have one giant repeated run of "not null". It's a little wonky, but I can't think of a failure case. If some writer somehow managed to encode an all-valid set of data using more than one run, this would still work - it would just traverse the data the normal way.
Exactly... can't guard against sub-optimal writers :) But we could also augment this test with either V2 header stats or page indexes (when they're available for use... shameless #14973 plug 🤣).
/ok to test
Looking good. A few more nits that can be ignored. Thanks!
Not for this PR, but it would be nice to go back and change all of the various RLE decoders to use the new `is_literal_run()` and `is_repeated_run()`.
```cuda
int const valid_map_offset      = ni.valid_map_offset;
int const row_index_lower_bound = s->row_index_lower_bound;

__syncthreads();
```
I agree that this sync is needed. Might want to add a comment to remind future us why it's here.
/ok to test
```cuda
if (!is_string_col(chunk) && !is_nested(chunk) && !is_byte_array(chunk) && !is_boolean(chunk)) {
  if (page.encoding == Encoding::PLAIN) {
    return decode_kernel_mask::FIXED_WIDTH_NO_DICT;
  } else if (page.encoding == Encoding::PLAIN_DICTIONARY) {
    return decode_kernel_mask::FIXED_WIDTH_DICT;
  }
}
```
If this moves to the bottom, the `!is_string_col()` check can go away. Should also check for `Encoding::RLE_DICTIONARY`, since `PLAIN_DICTIONARY` is deprecated for V2 headers.
I just have a few questions (and a headache).
In `struct rle_stream`:

```cuda
static constexpr int run_buffer_size = rle_stream_required_run_buffer_size<decode_threads>();

int level_bits;
```
Thoughts on having prefixes/suffixes to help tell data members apart from local variables? Not too relevant to this PR.
In the past I've used `m_` for member variables. Not sure I really loved it, though.
/ok to test

/ok to test

Added do-not-merge so I can finish looking at the odd crash in the python tests.

/ok to test

/merge
See #15297. The Parquet string decoder can become a bottleneck in the presence of strings of widely varying sizes. This PR is an attempt to address this, at least as a stop gap solution. A more complete solution may be to rework the string decoder to work in a block-wide fashion, such as the new micro-kernels added in #15159. Authors: - Ed Seidl (https://github.com/etseidl) - Nghia Truong (https://github.com/ttnghia) Approvers: - Nghia Truong (https://github.com/ttnghia) - Vukasin Milovanovic (https://github.com/vuule) URL: #15304
Closes #15226. Part of #13501. Adds support for reading and writing `BYTE_STREAM_SPLIT` encoded Parquet data. Includes a "microkernel" version like those introduced by #15159. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Muhammad Haseeb (https://github.com/mhaseeb123) - Vukasin Milovanovic (https://github.com/vuule) URL: #15311
This PR adds Parquet decode fixed-width and fixed-width dictionary micro kernels, based on #13622.

We change `rle_stream` to allow us to process dictionary pages, and so are able to use it in `gpuDecodePageDataFixedDict`, allowing it to take a "count" to be decoded, which we provide as the number of valid (non-null) elements that were found in the definition stream. Most of the intrusive changes are in `rle_stream` so that it can work with this argument. One big change is that prior to this we used to "spill" runs that would not fit in the current iteration. We have changed it so that we don't spill anymore, and we could have in the `runs` array a large run that won't be decoded until several calls to `decode` later. This opens the possibility for us to throw more decode threads at the accumulated run fairly easily, and that may be worked on in this PR or shortly after (load balancing a large run).

The code here is really mostly @nvdbaranec's and makes use of @etseidl's great work on `rle_stream`. It is marked as draft because it's not entirely done (not all testing has been performed). That said, NDS, nvbench and `PARQUET_TEST` pass. In order to use it, please set `USE_FIXED_OP=2`, which enables both the `gpuDecodePageDataFixed` and `gpuDecodePageDataFixedDict` kernels.

Here are `USE_FIXED_OP=2` nvbench results against 24.04 on my RTX6000:

Checklist