Support faster copy for a custom DataSource V2 which supplies Arrow data #1622

Merged
merged 51 commits into from
Jan 29, 2021

tgravescs
Collaborator

This adds support for copying data to the GPU more efficiently when a DataSource V2 source provides the data as an ArrowColumnVector. The cuDF side of this has already been merged: rapidsai/cudf#7222

This currently supports only primitive types and strings. Decimal types and nested types are not supported; when the code sees one of those types, it falls back to the regular copy code.
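The fallback described above can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: `ColType`, `supportsArrowCopy`, and `choosePath` are hypothetical names, the type model is a simplified stand-in for Spark's data types, and it assumes the decision is made once for a whole batch, which the PR may or may not do.

```scala
// Hypothetical, simplified stand-in for a column type model (not Spark's DataType).
sealed trait ColType
case object IntCol extends ColType
case object DoubleCol extends ColType
case object StringCol extends ColType
final case class DecimalCol(precision: Int, scale: Int) extends ColType
final case class ArrayCol(element: ColType) extends ColType

object ArrowCopyFallback {
  // Primitive types and strings take the fast path; decimals and nested types do not.
  def supportsArrowCopy(t: ColType): Boolean = t match {
    case IntCol | DoubleCol | StringCol => true
    case _: DecimalCol | _: ArrayCol => false
  }

  // Assumption: the Arrow path is used only when every column is supported;
  // otherwise the whole batch goes through the regular copy code.
  def choosePath(schema: Seq[ColType]): String =
    if (schema.forall(supportsArrowCopy)) "arrow-copy" else "regular-copy"
}
```

For example, a schema of `Seq(IntCol, StringCol)` would take the Arrow path, while adding a `DecimalCol` column would send the batch through the regular copy.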

The integration tests require an extra jar containing a DataSource V2 that supplies ArrowColumnVector. I'm looking into pulling that code in and how best to automate those tests; I filed #1620 to track that.

You will notice I added a new AccessibleArrowColumnVector. This is because the Spark ArrowColumnVector doesn't expose its Arrow ValueVector publicly. I use reflection here to get hold of it, but another option is for users to use AccessibleArrowColumnVector directly. I would like people's feedback on whether to keep that or not.
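To make the reflection approach concrete, here is a minimal sketch of reading a private field through Java reflection. It does not use Spark or Arrow: `HiddenVectorHolder`, its `vector` field, and `ReflectiveAccess.underlyingVector` are hypothetical stand-ins for a column vector class that keeps its underlying vector private.

```scala
import scala.util.Try

// Hypothetical stand-in for a class that hides its underlying vector,
// similar in spirit to a column vector with a non-public ValueVector.
class HiddenVectorHolder(values: Array[Int]) {
  private val vector: Array[Int] = values
  def getInt(i: Int): Int = vector(i)
}

object ReflectiveAccess {
  // Look up the private field by name and make it readable via reflection.
  // Returns a Failure if the field is missing or access is denied.
  def underlyingVector(holder: HiddenVectorHolder): Try[Array[Int]] = Try {
    val field = classOf[HiddenVectorHolder].getDeclaredField("vector")
    field.setAccessible(true)
    field.get(holder).asInstanceOf[Array[Int]]
  }
}
```

The trade-off is that reflection depends on a private field name that can change between releases, whereas a class that exposes the vector directly avoids the lookup but requires the data source to use that class.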

I had to add a shim layer for the Arrow code because Spark 3.1.0 changed the Arrow version, and the ArrowBuf class is now different.
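One common way to structure such a shim is a small trait with one implementation per Spark version line, selected from the version string. The sketch below is illustrative only and is not the shim in this PR: `ArrowShim`, `Spark30xArrowShim`, `Spark31xArrowShim`, and `ArrowShimSelector` are hypothetical names, and `shimName` stands in for the version-specific Arrow buffer handling.

```scala
// Hypothetical shim interface; a real shim would wrap version-specific Arrow calls.
trait ArrowShim {
  def shimName: String
}

object Spark30xArrowShim extends ArrowShim {
  val shimName: String = "spark30x"
}

object Spark31xArrowShim extends ArrowShim {
  val shimName: String = "spark31x"
}

object ArrowShimSelector {
  // Pick a shim from the major and minor components of a Spark version string.
  def forSparkVersion(version: String): ArrowShim =
    version.split("\\.").take(2).map(_.toInt) match {
      case Array(3, 0) => Spark30xArrowShim
      case Array(3, minor) if minor >= 1 => Spark31xArrowShim
      case _ =>
        throw new IllegalArgumentException(s"Unsupported Spark version: $version")
    }
}
```

Callers then depend only on `ArrowShim`, so code that touches the differing buffer class stays isolated in the per-version implementations.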

tgravescs added the "feature request" (New feature or request) and "P0" (Must have for release) labels on Jan 29, 2021
tgravescs added this to the Jan 18 - Jan 29 milestone on Jan 29, 2021
tgravescs self-assigned this on Jan 29, 2021
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs
Collaborator Author

build

@tgravescs
Collaborator Author

build

@tgravescs
Collaborator Author

upmerging

@tgravescs
Collaborator Author

build

@tgravescs
Collaborator Author

build

"access its Arrow ValueVector", e)
}
case av: AccessibleArrowColumnVector =>
// val arrowVec = av.asInstanceOf[AccessibleArrowColumnVector]
Collaborator

nit: I think this can be removed.

Collaborator Author

oops missed that

@tgravescs
Collaborator Author

build

tgravescs merged commit f4c912a into NVIDIA:branch-0.4 on Jan 29, 2021
tgravescs deleted the sktDatasourceV2 branch on January 29, 2021 18:58
tgravescs restored the sktDatasourceV2 branch on February 10, 2021 21:34
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
…ata (NVIDIA#1622)

* Add in data source v2, csv file and test for arrow copy
* remove commented out line