Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error #14941

Closed

Conversation

sameeragarwal
Copy link
Member

What changes were proposed in this pull request?

This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.

How was this patch tested?

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

@sameeragarwal
Copy link
Member Author

cc @davies

@davies
Copy link
Contributor

davies commented Sep 2, 2016

LGTM, pending jenkins.

@heroldus
Copy link

heroldus commented Sep 2, 2016

@sameeragarwal: Do you expect any performace impact of this commit? It's an additional if (!column.isNullAt(i)) for every single value read.

@davies
Copy link
Contributor

davies commented Sep 2, 2016

@heroldus decodeDictionaryIds() is only used when a batch across pages with different encoding (dictionary or plain), so it's not in the hot pass, I think the performance impact should be fine.

@heroldus
Copy link

heroldus commented Sep 2, 2016

@davies Fine, thx.

@SparkQA
Copy link

SparkQA commented Sep 2, 2016

Test build #64870 has finished for PR 14941 at commit efda298.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor

davies commented Sep 2, 2016

Merging this into master and 2.0 branch, thanks!

@asfgit asfgit closed this in a2c9acb Sep 2, 2016
asfgit pushed a commit that referenced this pull request Sep 2, 2016
… row groups shouldn't throw an error

This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14941 from sameeragarwal/parquet-exception-2.

(cherry picked from commit a2c9acb)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
sameeragarwal added a commit to sameeragarwal/spark that referenced this pull request Sep 2, 2016
… row groups shouldn't throw an error

This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes apache#14941 from sameeragarwal/parquet-exception-2.
@sameeragarwal
Copy link
Member Author

@heroldus @davies I'll try to benchmark the worse case performance regression for this special case (while reading row batches of all but one dictionary encoded pages). I'll let you know if we find a substantial regression.

asfgit pushed a commit that referenced this pull request Sep 6, 2016
…consecutive row groups shouldn't throw an error

## What changes were proposed in this pull request?

Backports #14941 in 2.0.

This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal <sameeragcs.berkeley.edu>

Closes #14941 from sameeragarwal/parquet-exception-2.

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14944 from sameeragarwal/branch-2.0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants