-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable reading StringViewArray
by default from Parquet
#12092
base: main
Are you sure you want to change the base?
Conversation
StringViewArray
by default from Parquet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrow-rs skips now the interval parts with 0? so interval tests are broken now
That is due to the arrow upgrade for sure -- you can see the changes needed here (in their own PR): #12032 |
bc5d7f7
to
27f7d1e
Compare
I think the last remaining failure is due to #12123 |
27f7d1e
to
ce5470d
Compare
@@ -264,8 +264,12 @@ impl PruningStatistics for BloomFilterStatistics { | |||
.iter() | |||
.map(|value| { | |||
match value { | |||
ScalarValue::Utf8(Some(v)) => sbbf.check(&v.as_str()), | |||
ScalarValue::Binary(Some(v)) => sbbf.check(v), | |||
ScalarValue::Utf8(Some(v)) | ScalarValue::Utf8View(Some(v)) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs its own ticket / test I think. Will do it shortly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #12499
Still working on benchmarks, but the code is looking good |
2275216
to
e8f7384
Compare
This comment was marked as outdated.
This comment was marked as outdated.
I ran benchmarks and posted results, -- clickbench looks great 🚀 . However, interesting |
I figured out what is going on (different than I thought). I believe The query is SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer")
FROM hits_partitioned
WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25; I will think about the best way to proceed here |
I plan to take a closer look at this on Saturday if no one has beat me to it, cc @alamb |
@@ -1034,14 +1034,22 @@ fn binary_to_string_coercion( | |||
) -> Option<DataType> { | |||
use arrow::datatypes::DataType::*; | |||
match (lhs_type, rhs_type) { | |||
// Note: added rules to coerce from BinaryView --> Utf8View |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #12500
@@ -264,8 +264,12 @@ impl PruningStatistics for BloomFilterStatistics { | |||
.iter() | |||
.map(|value| { | |||
match value { | |||
ScalarValue::Utf8(Some(v)) => sbbf.check(&v.as_str()), | |||
ScalarValue::Binary(Some(v)) => sbbf.check(v), | |||
ScalarValue::Utf8(Some(v)) | ScalarValue::Utf8View(Some(v)) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #12499
Thanks @XiangpengHao -- I
|
Found and filed another slowdown |
Here is a proposal I think that could potentially fix the regression: #12509 (comment) (basically push the casting down into ParquetExec). 🤔 |
55e0dc2
to
6500361
Compare
Benchmarks results clickbench_1:Is slower due to #12771
|
clickbench_extendedLooking good here
|
clickbench_partitionedSlowing down also due to #12509 / #12788
|
Current status: #11682 (comment) |
6500361
to
5b67afe
Compare
Draft as it builds on:
53.1.0
/ fix clippy #12724hits_partitioned
Which issue does this PR close?
Closes #11682
Rationale for this change
Reading data as
StringViewArray
is significantly faster thanStringArray
. We have been testing this behind a feature flag but it is now stable enough to enable by default.See blog post #11603:
Benchmark Results
What changes are included in this PR?
schema_force_view_types
to trueAre these changes tested?
Yes, by CI tests
Are there any user-facing changes?
If you see an error related to StringView use, you can disable this feature using the schema_force_string_view option
Context
@XiangpengHao debugged these tests previously using #11862