[FEA] Let CPU handle Delta table's metadata related queries #5624

Closed
viadea opened this issue May 24, 2022 · 3 comments · Fixed by #5912
Labels: feature request, performance

viadea (Collaborator) commented May 24, 2022

I would like the CPU to handle Delta table metadata queries entirely.

The reason is that some Delta table metadata queries, such as the one reading _delta_log (JSON files), involve CPU fallbacks.
If the _delta_log is huge (say, millions of rows), the performance penalty of those CPU fallbacks is not trivial.

If the CPU simply handled those metadata queries, their performance should at least be similar to a CPU-only run.
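
To make the request concrete, here is a minimal sketch of one way a query's inputs might be recognized as Delta metadata: check whether any input path has a `_delta_log` directory component. This is only an illustrative heuristic; the helper names `is_delta_log_path` and `should_stay_on_cpu` are hypothetical and are not part of the plugin.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Directory that holds a Delta table's transaction log (JSON files).
DELTA_LOG_DIR = "_delta_log"


def is_delta_log_path(path: str) -> bool:
    """Return True if a directory component of `path` is exactly `_delta_log`.

    Hypothetical helper: works on plain paths and on URIs such as s3://...
    """
    parsed = urlparse(path)
    return DELTA_LOG_DIR in PurePosixPath(parsed.path).parts


def should_stay_on_cpu(input_paths) -> bool:
    """Assumed policy: keep the whole query on the CPU if any input is a Delta log file."""
    return any(is_delta_log_path(p) for p in input_paths)


if __name__ == "__main__":
    paths = ["s3://bucket/events/_delta_log/00000000000000000010.json"]
    print(should_stay_on_cpu(paths))
```

Matching on a whole path component rather than a substring avoids false positives on names that merely contain `_delta_log`, such as `my_delta_log_backup`.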

@viadea viadea added feature request New feature or request ? - Needs Triage Need team to review and classify labels May 24, 2022
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label May 25, 2022
@sameerz sameerz added the performance A performance related task/issue label May 25, 2022
@andygrove andygrove self-assigned this Jun 15, 2022
@andygrove andygrove added this to the Jun 6 - Jun 17 milestone Jun 15, 2022
@andygrove (Contributor)

Here is an example of some of the expensive transitions in a delta lake metadata query.

[Image: delta-lake-meta]

@andygrove (Contributor)

Setting spark.rapids.sql.optimizer.enabled=true removes some of the overhead here by avoiding moving to GPU for the final projection:

[Image: delta-lake-opt]
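
For anyone wanting to try the setting above, here is a small sketch that renders Spark properties as `spark-submit` `--conf` arguments. The helper `spark_submit_conf_args` is hypothetical, and the `spark.plugins` entry is an assumption about how the plugin is enabled in a given deployment; only `spark.rapids.sql.optimizer.enabled=true` comes from this thread.

```python
def spark_submit_conf_args(conf):
    """Turn a dict of Spark properties into a flat list of `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        # Spark expects lowercase true/false for boolean properties.
        if isinstance(value, bool):
            value = str(value).lower()
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    conf = {
        # Assumed: the RAPIDS Accelerator is loaded as a Spark plugin.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # The setting discussed in this comment.
        "spark.rapids.sql.optimizer.enabled": True,
    }
    print(" ".join(spark_submit_conf_args(conf)))
```

The same property can also be set on a running session, so the query plan can be compared with the setting on and off.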

viadea (Collaborator, Author) commented Jun 24, 2022

@andygrove I checked the query plan and run time before and after setting spark.rapids.sql.optimizer.enabled=true, and the run times are similar: about 30s in both cases.
The plans are different, though.
