[SPARK-16334][BACKPORT] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error #14944

sameeragarwal · 2016-09-02T23:17:00Z

What changes were proposed in this pull request?

Backports #14941 to 2.0.

This patch fixes a bug in the vectorized Parquet reader caused by reusing the same dictionary column vector while reading consecutive row groups. Specifically, the issue manifests for certain distributions of dictionary-encoded and plain-encoded data, when the underlying bit-packed dictionary data is read into a column-vector-based data structure.
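To illustrate the general class of problem (not the exact condition this patch addresses), here is a toy Python sketch. It is not Spark code: `ReusedColumn`, `read_row_groups`, and `ROW_GROUPS` are hypothetical names invented for this example, and the model assumes every row group is fully dictionary encoded. It only shows how dictionary ids left over from one row group can become invalid against the next row group's dictionary when a reused buffer is not reset.

```python
class ReusedColumn:
    """Hypothetical stand-in for a column buffer that is reused across row groups."""

    def __init__(self):
        self.dictionary = None  # dictionary of the row group currently being read
        self.ids = []           # dictionary ids waiting to be decoded

    def reset(self):
        # Drop all state left over from the previous row group.
        self.dictionary = None
        self.ids.clear()

    def load(self, dictionary, ids):
        # Install the new dictionary, but append ids to whatever is already buffered.
        self.dictionary = dictionary
        self.ids.extend(ids)

    def decode(self):
        # Resolve every buffered id against the current dictionary.
        return [self.dictionary[i] for i in self.ids]


def read_row_groups(row_groups, reset_between_groups):
    """Decode each (dictionary, ids) row group through one shared ReusedColumn."""
    column = ReusedColumn()
    decoded = []
    for dictionary, ids in row_groups:
        if reset_between_groups:
            column.reset()
        column.load(dictionary, ids)
        decoded.append(column.decode())
    return decoded


# Two consecutive row groups, each with its own dictionary (made-up data).
ROW_GROUPS = [
    (["a", "b", "c"], [0, 2, 1]),
    (["x"], [0, 0]),
]

if __name__ == "__main__":
    print(read_row_groups(ROW_GROUPS, reset_between_groups=True))
    try:
        read_row_groups(ROW_GROUPS, reset_between_groups=False)
    except IndexError as err:
        print("stale ids from the first row group:", err)
```

In this toy model, skipping the reset leaves the first row group's ids in the buffer, so id `2` is looked up in the second row group's one-entry dictionary and the lookup fails.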

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14941 from sameeragarwal/parquet-exception-2.


sameeragarwal · 2016-09-02T23:17:51Z

cc @davies

SparkQA · 2016-09-03T00:25:36Z

Test build #64883 has finished for PR 14944 at commit facf221.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

sameeragarwal · 2016-09-03T00:29:42Z

seems like the failure is related to #14797?

gatorsmile · 2016-09-03T15:59:01Z

Yeah, I also hit it.

hvanhovell · 2016-09-05T06:51:04Z

retest this please

SparkQA · 2016-09-05T08:31:03Z

Test build #64930 has finished for PR 14944 at commit facf221.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-09-06T17:48:36Z

Merging this into 2.0 branch.

…consecutive row groups shouldn't throw an error

## What changes were proposed in this pull request?

Backports #14941 in 2.0.

This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14941 from sameeragarwal/parquet-exception-2.

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14944 from sameeragarwal/branch-2.0.

sameeragarwal closed this Sep 6, 2016
