Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use C++ to parse and filter parquet footers. #5310

Merged
merged 11 commits into from
May 11, 2022

Conversation

revans2
Copy link
Collaborator

@revans2 revans2 commented Apr 25, 2022

This still needs tests and documentation, but I wanted to get a PR up sooner than later.

This depends on NVIDIA/spark-rapids-jni#199

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
@jlowe jlowe added the performance A performance related task/issue label Apr 26, 2022
@jlowe jlowe added this to the Apr 18 - Apr 29 milestone Apr 26, 2022
ParquetMetadataConverter.range(file.start, file.start + file.length))
val footer = footerReader match {
case ParquetFooterReaderType.NATIVE =>
System.err.println("NATIVE FOOTER READER...")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switch to real logging once done with WIP

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with @gerashegalov

ParquetMetadataConverter.range(file.start, file.start + file.length))
val footer = footerReader match {
case ParquetFooterReaderType.NATIVE =>
System.err.println("NATIVE FOOTER READER...")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with @gerashegalov


@throws[IOException]
override def read(buf: ByteBuffer): Int =
if (buf.hasArray) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a small comment somewhere on why we need two implementations one for direct and one for JVM byte buffers.

"msg=constructor ParquetFileReader in class ParquetFileReader is deprecated"
)
private def addNamesAndCount(names: ArrayBuffer[String], children: ArrayBuffer[Int],
name: String, num_children: Int): Unit = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name: String, num_children: Int): Unit = {
name: String, numChildren: Int): Unit = {

@revans2
Copy link
Collaborator Author

revans2 commented May 9, 2022

build

@revans2
Copy link
Collaborator Author

revans2 commented May 9, 2022

build

@revans2
Copy link
Collaborator Author

revans2 commented May 9, 2022

@jlowe @abellina @gerashegalov could you please take another look? I think I have addressed all of the review comments.

jlowe
jlowe previously approved these changes May 9, 2022
Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All my comments are minor, mostly about potentially simplifying/optimizing the copies leveraging HostMemoryBuffer's ability to alias as a DirectByteBuffer. Nothing is must-fix on my end.

@revans2
Copy link
Collaborator Author

revans2 commented May 10, 2022

build

@revans2
Copy link
Collaborator Author

revans2 commented May 10, 2022

Thanks for the review @jlowe I think I have addressed everything.

@throws[IOException]
override def readFully(buf: ByteBuffer): Unit = {
val requested = buf.remaining()
val avail = available()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was curious why this readFully checks for available() throwing before the read, whereas others are checking the amount read and throwing after the read. But this is not a blocker.

abellina
abellina previously approved these changes May 10, 2022
jlowe
jlowe previously approved these changes May 10, 2022
@revans2 revans2 marked this pull request as draft May 10, 2022 19:23
@revans2
Copy link
Collaborator Author

revans2 commented May 10, 2022

Converted to draft because I hit a bug with the latest code that I need to debug.

@revans2 revans2 dismissed stale reviews from jlowe and abellina via 3ff8b21 May 10, 2022 19:45
@revans2 revans2 marked this pull request as ready for review May 10, 2022 19:46
@revans2
Copy link
Collaborator Author

revans2 commented May 10, 2022

build

@revans2
Copy link
Collaborator Author

revans2 commented May 10, 2022

I reverted the last set of changes to the StreamReader. They were causing issues when reading large files. I don't know why ByteBuffers are only used for large files but they are. Because the comments were nits I decided to revert this and I will file a follow on issue to see if we can make it work properly, along with some investigation into why tests were not taking that code path.

@revans2
Copy link
Collaborator Author

revans2 commented May 10, 2022

I filed #5452 as the follow on issue.

@revans2
Copy link
Collaborator Author

revans2 commented May 11, 2022

build

@revans2 revans2 merged commit 00c0a6c into NVIDIA:branch-22.06 May 11, 2022
@revans2 revans2 deleted the cpp_parquet_footer_parse branch May 11, 2022 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance A performance related task/issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants