[SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error #14941

sameeragarwal · 2016-09-02T20:11:19Z

What changes were proposed in this pull request?

This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.
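To make the failure mode concrete, here is a minimal toy sketch of one plausible way a reused dictionary-id buffer can go wrong. The description above does not spell out the exact failing data layout, so this scenario is an assumption: a null slot keeps the id written for the previous row group, and that id is out of range for the next row group's smaller dictionary. The class and method names (`StaleDictionaryIdDemo`, `decodeAll`, `reproducesFailure`) are hypothetical and are not Spark's classes.

```java
// Toy model only: plain arrays stand in for Spark's column vectors and Parquet dictionaries.
public class StaleDictionaryIdDemo {

    // Decodes every slot without consulting null flags (the assumed pre-fix behavior).
    static int[] decodeAll(int[] dictionary, int[] dictionaryIds) {
        int[] out = new int[dictionaryIds.length];
        for (int i = 0; i < dictionaryIds.length; i++) {
            out[i] = dictionary[dictionaryIds[i]]; // a stale id may be out of range here
        }
        return out;
    }

    // Returns true if reusing the id buffer across two row groups triggers the exception.
    static boolean reproducesFailure() {
        int[] dictionaryIds = new int[4]; // one buffer, reused for both row groups

        // Row group 1: four dictionary entries, every slot populated.
        int[] firstDictionary = {10, 20, 30, 40};
        for (int i = 0; i < dictionaryIds.length; i++) {
            dictionaryIds[i] = i;
        }
        decodeAll(firstDictionary, dictionaryIds);

        // Row group 2: only two dictionary entries. Slot 3 is null in this row group,
        // so nothing overwrites its id and it still holds 3 from row group 1.
        int[] secondDictionary = {7, 8};
        dictionaryIds[0] = 1;
        dictionaryIds[1] = 0;
        dictionaryIds[2] = 1;

        try {
            decodeAll(secondDictionary, dictionaryIds);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("stale id triggers failure: " + reproducesFailure());
    }
}
```

Whether a real file hits this depends on how nulls and dictionary sizes are distributed across row groups, which would fit the "certain distribution" wording above.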

How was this patch tested?

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

sameeragarwal · 2016-09-02T20:27:24Z

cc @davies

davies · 2016-09-02T20:30:02Z

LGTM, pending jenkins.

heroldus · 2016-09-02T20:39:15Z

@sameeragarwal: Do you expect any performance impact from this commit? It adds an if (!column.isNullAt(i)) check for every single value read.

davies · 2016-09-02T21:06:08Z

@heroldus decodeDictionaryIds() is only used when a batch spans pages with different encodings (dictionary or plain), so it's not in the hot path. I think the performance impact should be fine.
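For readers following the thread, the guard under discussion can be sketched roughly as follows. This is a simplified stand-in, not the actual Spark code: plain arrays replace the column vector and the Parquet dictionary, and the `GuardedDictionaryDecode` class is hypothetical. Only the method name `decodeDictionaryIds` and the `isNullAt` check come from the comments above.

```java
import java.util.Arrays;

// Simplified sketch: arrays stand in for Spark's column vector and dictionary types.
public class GuardedDictionaryDecode {

    // Decodes dictionary ids into values, skipping slots flagged as null so that
    // whatever id is left in a null slot is never used as a dictionary index.
    static void decodeDictionaryIds(int[] dictionary, int[] dictionaryIds,
                                    boolean[] isNull, int[] out) {
        for (int i = 0; i < out.length; i++) {
            if (!isNull[i]) { // plays the role of !column.isNullAt(i)
                out[i] = dictionary[dictionaryIds[i]];
            }
        }
    }

    public static void main(String[] args) {
        int[] dictionary = {7, 8};
        int[] dictionaryIds = {1, 0, 1, 3}; // the id 3 in the last slot is out of range
        boolean[] isNull = {false, false, false, true};
        int[] out = new int[4];
        decodeDictionaryIds(dictionary, dictionaryIds, isNull, out);
        System.out.println(Arrays.toString(out));
    }
}
```

The cost is one extra branch per value, and only on the path that decodes ids eagerly, which matches the point that this is not the common case.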

heroldus · 2016-09-02T22:06:46Z

@davies Fine, thx.

SparkQA · 2016-09-02T22:11:32Z

Test build #64870 has finished for PR 14941 at commit efda298.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-09-02T22:15:34Z

Merging this into master and 2.0 branch, thanks!

… row groups shouldn't throw an error

This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14941 from sameeragarwal/parquet-exception-2.

(cherry picked from commit a2c9acb)
Signed-off-by: Davies Liu <davies.liu@gmail.com>

sameeragarwal · 2016-09-06T16:33:16Z

@heroldus @davies I'll try to benchmark the worst-case performance regression for this special case (reading row batches in which all but one of the pages are dictionary encoded). I'll let you know if we find a substantial regression.

…consecutive row groups shouldn't throw an error

## What changes were proposed in this pull request?

Backports #14941 in 2.0.

This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14941 from sameeragarwal/parquet-exception-2.

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14944 from sameeragarwal/branch-2.0.

Reusing dictionary column vectors for reading consecutive row groups shouldn't throw an error (efda298)

asfgit closed this in a2c9acb Sep 2, 2016

sameeragarwal mentioned this pull request Sep 2, 2016

[SPARK-16334][BACKPORT] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error #14944

Closed
