[SPARK-16334][BACKPORT] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error #14944

Closed
wants to merge 1 commit into from

Conversation

sameeragarwal
Member

@sameeragarwal sameeragarwal commented Sep 2, 2016

What changes were proposed in this pull request?

Backports #14941 to branch-2.0.

This patch fixes a bug in the vectorized Parquet reader that is caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, the issue manifests for certain distributions of dictionary-encoded and plain-encoded data when we read the underlying bit-packed dictionary data into a column-vector-based data structure.
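
To make this class of bug concrete, here is a minimal Python sketch of how a dictionary-id buffer that is reused across row groups can go wrong when one group mixes dictionary-encoded and plain-encoded rows. It is only an illustration under stated assumptions: `ReusableIdBuffer`, `decode_row_group`, and the `reset_between_groups` flag are hypothetical names, not Spark's API, and the actual change is the diff in #14941.

```python
class ReusableIdBuffer:
    """Hypothetical stand-in for a reused dictionary-id vector (not Spark's API)."""

    def __init__(self):
        self.ids = []

    def reset(self):
        # Drop every id left over from a previous row group.
        self.ids = []

    def put(self, row, dict_id):
        # Grow on demand; slots not written for this row group keep old values.
        while len(self.ids) <= row:
            self.ids.append(None)
        self.ids[row] = dict_id


def decode_row_group(buffer, dictionary, dict_rows, plain_rows, reset_between_groups):
    """Decode one row group (toy model).

    dict_rows maps row -> dictionary id, plain_rows maps row -> plain value.
    A row is treated as dictionary-encoded whenever the buffer holds an id for it.
    """
    if reset_between_groups:
        buffer.reset()
    for row, dict_id in dict_rows.items():
        buffer.put(row, dict_id)

    decoded = []
    for row in range(len(dict_rows) + len(plain_rows)):
        dict_id = buffer.ids[row] if row < len(buffer.ids) else None
        if dict_id is not None:
            # A stale id from the previous row group may be out of range here.
            decoded.append(dictionary[dict_id])
        else:
            decoded.append(plain_rows[row])
    return decoded


def read_two_groups(reset_between_groups):
    """Read a fully dictionary-encoded group, then a mixed group, with one buffer."""
    buffer = ReusableIdBuffer()
    first = decode_row_group(
        buffer, ["a", "b", "c"], {0: 2, 1: 0, 2: 1}, {}, reset_between_groups)
    second = decode_row_group(
        buffer, ["x"], {0: 0}, {1: "y", 2: "z"}, reset_between_groups)
    return first, second
```

In this toy model, `read_two_groups(True)` decodes both groups, while `read_two_groups(False)` raises an `IndexError` on the second group because ids left over from the first group are looked up in the second group's smaller dictionary.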

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14941 from sameeragarwal/parquet-exception-2.

@sameeragarwal
Member Author

cc @davies

@SparkQA

SparkQA commented Sep 3, 2016

Test build #64883 has finished for PR 14944 at commit facf221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal
Member Author

Seems like the failure is related to #14797?

@gatorsmile
Member

Yeah, I also hit it.

@hvanhovell
Contributor

retest this please

@SparkQA

SparkQA commented Sep 5, 2016

Test build #64930 has finished for PR 14944 at commit facf221.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented Sep 6, 2016

Merging this into 2.0 branch.

asfgit pushed a commit that referenced this pull request Sep 6, 2016
…consecutive row groups shouldn't throw an error

## What changes were proposed in this pull request?

Backports #14941 in 2.0.

This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14941 from sameeragarwal/parquet-exception-2.

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14944 from sameeragarwal/branch-2.0.