[BUG] Filtering with Hive results on zero rows with a external table created on top of cudf to_orc files #10868

Closed
mlahir1 opened this issue May 16, 2022 · 3 comments
Labels
bug Something isn't working

Comments

mlahir1 commented May 16, 2022

Filtering with Hive returns zero rows on an external table created on top of ORC files written by cudf to_orc. This issue happens with cudf 21.12 and not with 22.06; however, because of #10755, we cannot upgrade to 22.06.

Steps/Code to reproduce bug

import cudf

df = cudf.DataFrame()
df['a'] = range(1000)
df['b'] = df.a % 5  # values cycle through 0..4
df.to_orc('test.orc', index=False)

Put or copy the file to the HDFS location, then create an external table:

drop table if exists test1;
 CREATE EXTERNAL TABLE test1(
   visit_hi_id bigint,
   visit_low_id bigint)
 STORED AS ORC
 LOCATION
   'gs://<bucket path>/test1';
msck repair table test1; 

Run a query without a filter:

select visit_hi_id, visit_low_id from test1 limit 100;

+--------------+---------------+--+
| visit_hi_id  | visit_low_id  |
+--------------+---------------+--+
| 0            | 0             |
| 1            | 1             |
| 2            | 2             |
| 3            | 3             |
| 4            | 4             |
| 5            | 0             |
| 6            | 1             |
| 7            | 2             |
...
| 99           | 4             |
+--------------+---------------+--+
100 rows selected (1.638 seconds)

Query with where clause:

select visit_hi_id, visit_low_id from test1 where visit_low_id = 2 limit 10;
+--------------+---------------+--+
| visit_hi_id  | visit_low_id  |
+--------------+---------------+--+
+--------------+---------------+--+

Expected behavior
When I run the same query against files written by cudf from RAPIDS 22.02, I get the correct results:

> select visit_hi_id, visit_low_id from test1 where visit_low_id = 2 limit 10;
+--------------+---------------+--+
| visit_hi_id  | visit_low_id  |
+--------------+---------------+--+
| 2            | 2             |
| 7            | 2             |
| 12           | 2             |
| 17           | 2             |
| 22           | 2             |
| 27           | 2             |
| 32           | 2             |
| 37           | 2             |
| 42           | 2             |
| 47           | 2             |
+--------------+---------------+--+
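
The expected rows follow directly from how the reproducer builds its data: the second column is the first column modulo 5, so it equals 2 for 2, 7, 12, and so on. A minimal sketch in plain Python (no cudf or Hive needed) that models this; the helper name `filter_rows` is hypothetical, and it assumes rows come back in file order, as they do in the output above:

```python
# Plain-Python model of the reproducer data; no cudf or Hive required.
# Column names follow the Hive table definition in this report.
rows = [{"visit_hi_id": a, "visit_low_id": a % 5} for a in range(1000)]


def filter_rows(rows, low_id, limit):
    """Mimic `where visit_low_id = <low_id> limit <limit>` on in-order rows."""
    matches = [r for r in rows if r["visit_low_id"] == low_id]
    return matches[:limit]


expected = filter_rows(rows, low_id=2, limit=10)
for r in expected:
    print(r["visit_hi_id"], r["visit_low_id"])
```

Of the 1000 rows, 200 have visit_low_id = 2, so an empty result cannot be correct for this data.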
mlahir1 added the Needs Triage and bug labels May 16, 2022

mlahir1 commented May 16, 2022

Hive version: Hive 1.2.1000.2.6.5.0-292

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@GregoryKimball
Contributor

#10755 is fixed now, please let us know if this is still an issue.

bdice removed the Needs Triage label Mar 4, 2024