[VL] Do not fallback write files if output columns contain Spark internal metadata #4661

Merged: 1 commit, Feb 6, 2024

@ulysses-you ulysses-you commented Feb 6, 2024

What changes were proposed in this pull request?

If the metadata on an output attribute was leaked by Spark itself, write files should not fall back. This PR strips Spark internal metadata manually before the check. See apache/spark#40776.
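
To make the idea concrete, here is a minimal, dependency-free sketch. It assumes the fallback check rejects any output column that carries non-empty metadata, which this description implies but does not spell out. `Column`, `internalKeys`, `stripInternal`, and `needsFallback` are hypothetical names for illustration only; they are not Spark or Gluten APIs. The only internal key confirmed in this thread is `__autoGeneratedAlias`; Spark's real key list is longer.

```scala
// Hypothetical stand-in for an output column; not Spark's Attribute class.
final case class Column(name: String, metadata: Map[String, String])

object InternalMetadataSketch {
  // Assumption: only `__autoGeneratedAlias` is listed here because it is the
  // one key named in this PR thread. Spark defines the full set itself.
  val internalKeys: Set[String] = Set("__autoGeneratedAlias")

  // Drop Spark-internal keys and keep any user-supplied metadata untouched.
  def stripInternal(col: Column): Column =
    col.copy(metadata = col.metadata.filter { case (k, _) => !internalKeys.contains(k) })

  // Assumed fallback rule: fall back only if metadata remains after cleanup.
  def needsFallback(output: Seq[Column]): Boolean =
    output.map(stripInternal).exists(_.metadata.nonEmpty)

  def main(args: Array[String]): Unit = {
    val leaked = Column("count(1)", Map("__autoGeneratedAlias" -> "true"))
    val userTagged = Column("c2", Map("comment" -> "user supplied"))
    println(needsFallback(Seq(leaked)))             // only leaked internal metadata
    println(needsFallback(Seq(leaked, userTagged))) // user metadata is still present
  }
}
```

With this rule, a column whose only metadata was leaked by Spark no longer forces a fallback, while user-defined metadata still does.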

How was this patch tested?

Added tests.

github-actions bot commented Feb 6, 2024

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename the commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

github-actions bot commented Feb 6, 2024

Run Gluten Clickhouse CI

@ulysses-you

cc @JkSelf thank you

@JkSelf JkSelf left a comment

Good catch. LGTM. Thanks.

@@ -234,6 +237,15 @@ class GlutenInsertSuite extends InsertSuite with GlutenSQLTestsBaseTrait {
}
}
}

testGluten("Do not fallback write files if output columns contain Spark internal metadata") {

@ulysses-you Which of the INTERNAL_METADATA_KEYS does this test cover? Can we add a comment here?

For this test, it's __autoGeneratedAlias.

@ulysses-you ulysses-you merged commit b486a55 into apache:main Feb 6, 2024
19 of 20 checks passed
@ulysses-you ulysses-you deleted the writefiles branch February 6, 2024 09:57
@GlutenPerfBot

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

| query | log/native_4661_time.csv | log/native_master_02_04_2024_d1b29e1bc_time.csv | difference | percentage |
|---|---|---|---|---|
| q1 | 31.95 | 33.68 | 1.728 | 105.41% |
| q2 | 24.65 | 24.23 | -0.419 | 98.30% |
| q3 | 36.08 | 37.37 | 1.295 | 103.59% |
| q4 | 38.81 | 35.96 | -2.842 | 92.68% |
| q5 | 70.43 | 70.39 | -0.043 | 99.94% |
| q6 | 7.09 | 6.99 | -0.092 | 98.70% |
| q7 | 84.75 | 84.51 | -0.239 | 99.72% |
| q8 | 84.25 | 84.14 | -0.112 | 99.87% |
| q9 | 119.11 | 121.28 | 2.175 | 101.83% |
| q10 | 43.70 | 43.47 | -0.229 | 99.48% |
| q11 | 20.16 | 20.46 | 0.301 | 101.49% |
| q12 | 26.87 | 26.23 | -0.642 | 97.61% |
| q13 | 44.87 | 45.25 | 0.381 | 100.85% |
| q14 | 18.76 | 15.93 | -2.836 | 84.88% |
| q15 | 29.52 | 26.74 | -2.777 | 90.59% |
| q16 | 13.70 | 14.11 | 0.408 | 102.97% |
| q17 | 103.53 | 102.75 | -0.774 | 99.25% |
| q18 | 149.05 | 148.66 | -0.390 | 99.74% |
| q19 | 12.50 | 13.50 | 0.994 | 107.95% |
| q20 | 26.16 | 26.71 | 0.547 | 102.09% |
| q21 | 223.66 | 221.36 | -2.295 | 98.97% |
| q22 | 13.64 | 13.64 | -0.005 | 99.96% |
| total | 1223.24 | 1217.37 | -5.867 | 99.52% |
