Add DateType support for AST expressions #3752

Closed
jlowe opened this issue Oct 5, 2021 · 4 comments · Fixed by #4431
Labels: feature request (New feature or request), good first issue (Good for newcomers)

Comments

@jlowe
Member

jlowe commented Oct 5, 2021

libcudf AST supports timestamp types, and Spark's DateType is treated as a timestamp type in libcudf. We should be able to extend the existing AST expression support to include DateType inputs.
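As a rough illustration of why this extension looks feasible: Spark stores a DateType value as a count of days since the Unix epoch, and a day-resolution timestamp is likewise a day count. The sketch below is plain Python, not plugin or libcudf code, and `to_epoch_days` is a hypothetical helper introduced only for this example. It shows that comparing dates reduces to comparing integers.

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def to_epoch_days(d: date) -> int:
    """Hypothetical helper: number of days between the Unix epoch and d."""
    return (d - EPOCH).days


# Two dates taken from this thread, converted to day counts.
issue_opened = to_epoch_days(date(2021, 10, 5))
picked_up = to_epoch_days(date(2021, 12, 22))

# Comparing the integer day counts agrees with comparing the dates themselves.
same_answer = (issue_opened < picked_up) == (date(2021, 10, 5) < date(2021, 12, 22))
```

Under that assumption, a date comparison inside an expression needs no more machinery than an integer comparison does.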

@jlowe added the "feature request", "good first issue", and "? - Needs Triage" labels on Oct 5, 2021
@sameerz removed the "? - Needs Triage" label on Oct 5, 2021
@HaoYang670 self-assigned this on Dec 22, 2021
@HaoYang670
Collaborator

I would like to try

@HaoYang670
Collaborator

HaoYang670 commented Dec 23, 2021

I am a little curious about what kind of optimization we could do on AST.

@revans2
Collaborator

revans2 commented Dec 30, 2021

> I am a little curious about what kind of optimization we could do on AST.

There are a few reasons. The biggest one is the context in which the expression runs, for example when the comparison is part of a join such as joining A and B on A.a > B.b. To do that join efficiently, we don't want to materialize all of the potential candidates with a cross join and then filter out just the rows that matched. Instead, we want to test the condition on the GPU before we decide whether to gather the rows. AST allows us to do that.
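The difference between the two join strategies can be sketched on the CPU in plain Python. This is only an illustration of the idea, not how the plugin or libcudf implements it. The function names and the sample columns are made up for the example.

```python
def cross_join_then_filter(left, right, predicate):
    """Materialize every candidate pair first, then filter the matches."""
    candidates = [(i, j) for i in range(len(left)) for j in range(len(right))]
    matches = [(i, j) for (i, j) in candidates if predicate(left[i], right[j])]
    return matches, len(candidates)


def conditional_join(left, right, predicate):
    """Test the predicate per pair and record only the matching row indices."""
    left_map, right_map = [], []
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            if predicate(a, b):
                left_map.append(i)
                right_map.append(j)
    return list(zip(left_map, right_map)), len(left_map)


a_col = [1, 5, 9]  # stands in for A.a
b_col = [4, 4, 8]  # stands in for B.b


def greater(a, b):
    return a > b


filtered, materialized_pairs = cross_join_then_filter(a_col, b_col, greater)
joined, gathered_pairs = conditional_join(a_col, b_col, greater)
```

Both functions return the same matching index pairs, but the first builds all 9 candidate pairs before filtering, while the second only ever stores the 5 pairs that satisfy A.a > B.b.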

The second reason is memory bandwidth. For example, say I want to run something like a + b + c + d. If I run it the normal, non-AST way, I need to add a + b and produce a temp result, then add that temp result to c and produce another temp result, then add that second temp result to d to get the final answer. That means I had to call 3 kernels, write out 3 columns of data to the GPU's memory, and read in 6 columns of data from the GPU's memory. With the AST, in theory we run 1 kernel, read 4 columns of data, and write out 1. That is 5 column transfers instead of 9, so it should speed things up by close to 2x, in theory.
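The arithmetic above can be checked with a small counting sketch. It is a simplified model that counts whole-column reads and writes only and ignores caches, kernel launch overhead, and everything else. The function names are hypothetical.

```python
def unfused_traffic(num_inputs: int) -> dict:
    """One kernel per binary add; each kernel reads 2 columns and writes 1."""
    kernels = num_inputs - 1
    return {"kernels": kernels, "reads": 2 * kernels, "writes": kernels}


def fused_traffic(num_inputs: int) -> dict:
    """A single kernel reads every input column once and writes one result."""
    return {"kernels": 1, "reads": num_inputs, "writes": 1}


unfused = unfused_traffic(4)  # a + b + c + d, one add at a time
fused = fused_traffic(4)      # the same expression evaluated in one pass

# Ratio of total column transfers, a crude proxy for the bandwidth saving.
ratio = (unfused["reads"] + unfused["writes"]) / (fused["reads"] + fused["writes"])
```

For four inputs this model gives 3 kernels, 6 reads, and 3 writes without fusion against 1 kernel, 4 reads, and 1 write with it, a transfer ratio of 9 to 5.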

We have started to work on the first use case with joins, because for some types of joins there is no other way to do it even remotely efficiently. For the second use case, we have not seen projections be enough of an issue that we have really started to tackle it.

@HaoYang670
Collaborator

Thank you for your explanation.
