
Fix writing of Parquet files with many fragments #11869

Merged
12 commits merged into rapidsai:branch-22.12 from feature/fragments_fix on Oct 20, 2022

Conversation

etseidl
Contributor

@etseidl etseidl commented Oct 6, 2022

Description

This PR fixes an error that can occur when very small page sizes are used when writing Parquet files. #11551 changed the page fragment size from a fixed 5000 rows to a value scaled by the requested maximum page size. For small page sizes, the number of fragments to process can exceed 64k. The number of fragments is used as the y dimension of the grid when launching gpuInitPageFragments, and when it exceeds 64k the kernel fails to launch, ultimately leading to an invalid memory access.
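To make the launch-geometry limit concrete, here is a minimal, self-contained CUDA sketch of the failure mode (the trivial kernel is a hypothetical stand-in, not gpuInitPageFragments): the y and z grid dimensions are capped at 65535 blocks, while x allows up to 2^31 - 1, so putting the fragment count on y fails once it passes 64k.

```cuda
// Illustrative only, not cudf code: launching with more than 65535 blocks in
// gridDim.y fails with a configuration error, while the same block count on
// gridDim.x is fine.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel() {}  // hypothetical stand-in for gpuInitPageFragments

int main()
{
  unsigned const num_columns   = 8;
  unsigned const num_fragments = 70000;  // small page sizes -> many fragments

  // Fragments on y: exceeds the 65535 limit, so the launch fails.
  dummy_kernel<<<dim3(num_columns, num_fragments), 128>>>();
  printf("fragments on y: %s\n", cudaGetErrorString(cudaGetLastError()));

  // Fragments on x: well within the 2^31 - 1 limit, so the launch succeeds
  // (the column count is now the dimension bounded by 64k).
  dummy_kernel<<<dim3(num_fragments, num_columns), 128>>>();
  printf("fragments on x: %s\n", cudaGetErrorString(cudaGetLastError()));

  cudaDeviceSynchronize();
  return 0;
}
```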

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label on Oct 6, 2022
@etseidl
Contributor Author

etseidl commented Oct 6, 2022

Alternative approaches to consider are either using a single dimension and calculating column and fragment indexes, or using a fixed y dimension and looping over fragments.

I added a test that fails when FRAGSWAP is set to 0.

nvbench and nsys profile don't show any performance degradation with the swap in place.
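For reference, the first alternative above (a single grid dimension with computed column and fragment indexes) would look roughly like the sketch below; it is illustrative only, not code from this PR, and the fragment bookkeeping is elided.

```cuda
// Sketch of a flat 1D grid covering every (column, fragment) pair: each block
// recovers its indexes by division/modulo, so neither count is individually
// limited to 64k (only their product is bounded by the 2^31 - 1 x-dimension limit).
__global__ void init_fragments_1d(unsigned num_fragments_per_column)
{
  unsigned const column_id = blockIdx.x / num_fragments_per_column;
  unsigned const frag_id   = blockIdx.x % num_fragments_per_column;
  // ... initialize fragment [column_id][frag_id] ...
  (void)column_id;
  (void)frag_id;
}

// Host side: launch num_columns * num_fragments_per_column blocks on x, e.g.
// init_fragments_1d<<<num_columns * num_fragments_per_column, block_size>>>(
//   num_fragments_per_column);
```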

@codecov

codecov bot commented Oct 6, 2022

Codecov Report

Base: 87.40% // Head: 88.12% // Increases project coverage by +0.71% 🎉

Coverage data is based on head (56ebf7c) compared to base (f72c4ce).
Patch coverage: 88.55% of modified lines in pull request are covered.

❗ Current head 56ebf7c differs from pull request most recent head a072482. Consider uploading reports for the commit a072482 to get more accurate results.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #11869      +/-   ##
================================================
+ Coverage         87.40%   88.12%   +0.71%     
================================================
  Files               133      133              
  Lines             21833    21905      +72     
================================================
+ Hits              19084    19304     +220     
+ Misses             2749     2601     -148     
| Impacted Files | Coverage Δ |
|---|---|
| python/cudf/cudf/core/dataframe.py | 93.77% <ø> (ø) |
| python/cudf/cudf/core/indexed_frame.py | 92.03% <ø> (ø) |
| python/cudf/cudf/core/udf/__init__.py | 97.05% <ø> (+47.05%) ⬆️ |
| python/cudf/cudf/io/orc.py | 92.94% <ø> (-0.09%) ⬇️ |
| python/cudf/cudf/testing/dataset_generator.py | 72.83% <ø> (-0.42%) ⬇️ |
| ...thon/dask_cudf/dask_cudf/tests/test_distributed.py | 18.86% <ø> (+4.94%) ⬆️ |
| python/cudf/cudf/core/_base_index.py | 82.20% <43.75%> (-3.35%) ⬇️ |
| python/cudf/cudf/io/text.py | 91.66% <66.66%> (-8.34%) ⬇️ |
| python/strings_udf/strings_udf/__init__.py | 84.31% <76.00%> (-12.57%) ⬇️ |
| python/cudf/cudf/core/index.py | 92.96% <95.71%> (+0.33%) ⬆️ |

... and 20 more


@etseidl
Contributor Author

etseidl commented Oct 6, 2022

A downside to simply swapping x and y is that the number of columns is now limited to 64k. A 1D grid would allow either the number of fragments or the number of columns to exceed the 64k limit, but obviously not both. Maybe a fixed y with looping (as suggested offline by @vuule) is the best fix?
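For illustration, a minimal sketch of the fixed-y-with-looping idea (not the code in this PR; fragment bookkeeping elided): the y dimension is clamped to a fixed block count and each block strides over the remaining fragments, so neither the column count nor the fragment count hits the 64k grid limit.

```cuda
// Illustrative sketch of the "fixed y with looping" alternative.
__global__ void init_fragments_looped(unsigned num_fragments_per_column)
{
  unsigned const column_id = blockIdx.x;  // columns stay on gridDim.x
  // Each block starts at fragment blockIdx.y and strides by gridDim.y, so a
  // small, fixed y dimension still covers an arbitrarily large fragment count.
  for (unsigned frag_id = blockIdx.y; frag_id < num_fragments_per_column;
       frag_id += gridDim.y) {
    // ... initialize fragment [column_id][frag_id] ...
  }
  (void)column_id;
}

// Host side: clamp y to a fixed value well below 65535, e.g.
// dim3 grid(num_columns, std::min(num_fragments_per_column, 256u));
// init_fragments_looped<<<grid, block_size>>>(num_fragments_per_column);
```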

@vuule vuule added the bug (Something isn't working) and non-breaking (Non-breaking change) labels on Oct 11, 2022
Contributor

@vuule vuule left a comment


I like this! Got nothing to contribute :)

@etseidl etseidl marked this pull request as ready for review October 17, 2022 21:08
@etseidl etseidl requested a review from a team as a code owner October 17, 2022 21:08
g->col = &col_desc[column_id];
g->start_row = fragments[column_id][frag_id].start_value_idx;
g->num_rows = fragments[column_id][frag_id].num_leaf_values;
uint32_t const lane_id = threadIdx.x & 0x1f;
Contributor

@bdice bdice Oct 17, 2022


There are some rather innocuous-seeming magic values in this function that all relate to cudf::detail::warp_size. I'll point them out, but I'm fine with doing nothing if we feel the current code is better left unchanged.

Suggested change
uint32_t const lane_id = threadIdx.x & 0x1f;
uint32_t const lane_id = threadIdx.x % cudf::detail::warp_size;

Contributor Author


Sounds good (although using the mod operator makes my teeth itch 🤣). Does anyone happen to know if there are constants anywhere for the max threadblock dimensions? Or are those per-card values?

Contributor

@bdice bdice Oct 18, 2022


I believe there are no constants for that, and that's why we defined cudf::detail::warp_size. It is a constant for all NVIDIA GPUs as far as I am aware.

These two snippets should compile out roughly the same. Compilers can recognize that unsigned modulo by $2^N$ is equivalent to bitwise-and with $2^N - 1$. Evidence: https://godbolt.org/z/r4c41va5P
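For illustration, the two forms being compared look roughly like this (assumed example, not code from this PR):

```cuda
// Both functions compute a lane index within a 32-thread warp. For an unsigned
// operand and a power-of-two divisor, the compiler lowers the modulo to the
// same bitwise AND, so the two compile to identical code.
__device__ unsigned lane_via_mod(unsigned tid) { return tid % 32u; }
__device__ unsigned lane_via_and(unsigned tid) { return tid & 0x1fu; }
```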

Contributor Author

@etseidl etseidl Oct 18, 2022


I went ahead and added a constexpr for the warp mask (before I read your reply)... there are several other instances of 0x1f sprinkled about in this file that can be replaced later.

Thanks for the link @bdice! Should I get rid of my mask constexpr and just use cudf::detail::warp_size everywhere?

Contributor


To be a bit more precise here: CUDA does provide warpSize, which is available inside device code, and the cudaGetDeviceProperties host function, which returns a struct containing the warp size. However, neither of them is a compile-time constant, so neither can be used in constant expressions (e.g. for declaring a C-style array or std::array).

The warp size is indeed constant across all current compute capabilities. That's not something NVIDIA promises, though, so the technically correct answer is that we can't use a compile-time constant, because someone could in theory run on a new architecture with a different warp size. In practice, NVIDIA has no plans to change the warp size AFAIK, and many examples of GPU code (even lots of code written by NVIDIA) define a warp_size constant. Lots of places already assume it is a compile-time constant and would have to be rewritten if a card with a different warp size ever appeared, so that's a much bigger problem to deal with another day if it ever changes :)
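As an illustration of the pattern described above, a project-level constant might look like the following sketch (this is not cudf's exact definition; example_kernel, block_size, and per_warp_totals are hypothetical):

```cuda
// A constexpr warp size can appear in constant expressions (array bounds,
// static_asserts), unlike the warpSize device builtin or
// cudaDeviceProp::warpSize, which are only known at run time.
namespace detail {
constexpr int warp_size = 32;
}

__global__ void example_kernel()
{
  constexpr int block_size = 128;
  // One slot per warp, sized at compile time thanks to the constexpr.
  __shared__ int per_warp_totals[block_size / detail::warp_size];
  per_warp_totals[threadIdx.x / detail::warp_size] = 0;
}
```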

uint32_t const column_id = blockIdx.x;
uint32_t const num_fragments_per_column = fragments.size().second;

uint32_t frag_id = blockIdx.y * 4 + (threadIdx.x >> 5);
Contributor


Suggested change
uint32_t frag_id = blockIdx.y * 4 + (threadIdx.x >> 5);
uint32_t frag_id = blockIdx.y * 4 + (threadIdx.x / cudf::detail::warp_size);

Contributor

@bdice bdice left a comment


A couple of magic values that don't need to be magical -- otherwise LGTM.

@etseidl
Contributor Author

etseidl commented Oct 19, 2022

Tests seem to be failing on the mimesis stuff now. Should I merge with 22.12 to pull in #11906?

@bdice
Contributor

bdice commented Oct 19, 2022

@etseidl Merging the upstream or commenting “rerun tests” should work.

@bdice
Contributor

bdice commented Oct 19, 2022

rerun tests

@vuule
Contributor

vuule commented Oct 20, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 98185fe into rapidsai:branch-22.12 Oct 20, 2022
@etseidl etseidl deleted the feature/fragments_fix branch October 20, 2022 22:23
Labels
bug (Something isn't working), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change)