[FEA] Fix case insensitive match on native parquet column pruning #10747

Closed
revans2 opened this issue Apr 27, 2022 · 7 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@revans2
Contributor

revans2 commented Apr 27, 2022

After NVIDIA/spark-rapids-jni#199 and NVIDIA/spark-rapids#5310 we will have an option to use native code to do column pruning and footer parsing for Parquet. One of the issues is that C++ does not have built-in APIs to convert a Unicode string to lower case. It can convert a single character at a time, and that works most of the time, but in some cases it gives the wrong result. This issue is to find a better way to make the strings lowercase.
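To illustrate the kind of problem described above, here is a minimal sketch of a byte-at-a-time lowering used for case-insensitive column name matching. The helper names `ascii_to_lower` and `names_match_ignore_case` are hypothetical and are not taken from spark-rapids-jni or libcudf; the sketch assumes column names are UTF-8 encoded and only maps the ASCII letters A-Z.

```cpp
#include <string>

// Hypothetical helper: lowercases only ASCII letters, one byte at a time.
// Bytes that belong to multi-byte UTF-8 sequences are left untouched.
std::string ascii_to_lower(const std::string& s) {
    std::string out = s;
    for (char& c : out) {
        if (c >= 'A' && c <= 'Z') {
            c = static_cast<char>(c - 'A' + 'a');
        }
    }
    return out;
}

// Hypothetical helper: case-insensitive comparison of two column names
// built on the byte-wise lowering above.
bool names_match_ignore_case(const std::string& a, const std::string& b) {
    return ascii_to_lower(a) == ascii_to_lower(b);
}
```

With this approach `"ColA"` matches `"cola"`, but `"À"` (bytes `C3 80`) does not match `"à"` (bytes `C3 A0`), because neither byte is an ASCII letter. That is one example of where single-character lowering falls short for non-ASCII names.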

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify labels Apr 27, 2022
@devavret
Contributor

Is this feature requested in the cuIO reader? Is case insensitivity part of the Parquet spec? I believe we can do this one layer above libcudf.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@GregoryKimball GregoryKimball added Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 28, 2022
@GregoryKimball
Contributor

@davidwendt, does strings::case::to_lower work with Unicode?

@davidwendt
Contributor

For reference: https://docs.rapids.ai/api/libcudf/stable/group__strings__case.html#ga8ec672aad6467cc71f37b1a3ac8179eb
There is no case namespace, just cudf::strings::to_lower.
There is no Unicode support anywhere in libcudf. All strings in libcudf are expected to be UTF-8.

@GregoryKimball
Contributor

@revans2 Is this still needed? Also, is this a Parquet project or a strings project?

@revans2
Contributor Author

revans2 commented Feb 20, 2024

This is not needed. It was a nice-to-have even when it was filed. Feel free to close it.

@vyasr vyasr closed this as completed Feb 23, 2024

5 participants