
[FEA/Proposal] Use Arrow backed filesystem objects for reading remote files #7475

Closed
ayushdg opened this issue Mar 1, 2021 · 13 comments · Fixed by #7709
Labels: feature request (New feature or request), Python (Affects Python cuDF API)

Comments

@ayushdg
Member

ayushdg commented Mar 1, 2021

Is your feature request related to a problem? Please describe.
The current Python reading pipeline reads the entire buffer when reading from non-local data sources like HDFS, S3, etc. This is suboptimal when a user reads only a subset of the data (a few columns, or row-based filtering), since the entire buffer is still pulled from the data source.
All remote reading currently goes via fsspec, which provides a Python filesystem interface for reading remote sources. Since the Python filesystem object is not backed by a C++ object, the raw buffer needs to be read and passed down to libcudf.

This can hamper performance, especially in network-bound environments such as reading from cloud storage.
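
For illustration, a minimal sketch of the whole-buffer pattern described above (bucket, key, and column names are hypothetical; the s3:// protocol here requires s3fs):

```python
import io

import cudf
import fsspec

# Hypothetical remote path; fsspec resolves s3:// via s3fs.
with fsspec.open("s3://bucket/data.parquet", mode="rb") as f:
    raw = f.read()  # pulls the ENTIRE object into host memory

# Column selection only happens after the full download has completed.
df = cudf.read_parquet(io.BytesIO(raw), columns=["a"])
```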

Describe the solution you'd like
Use Arrow FileSystem objects wherever applicable. These objects are backed by Arrow C++ objects (CRandomAccessFile) that can be passed down to libcudf, which already has the logic to read buffers as needed based on the reader options and the file format.

Implementation Idea

  1. Python: add a utility that uses pyarrow.FileSystem to create a file object wherever applicable in ioutils.get_filepath_or_buffer (see the sketch after this list).
  2. Cython: get the C-backed file handle from the pyarrow file object, then use arrow_io_source to create an Arrow datasource that can be consumed directly by libcudf.
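
A minimal sketch of what step 1 might look like (the helper name is hypothetical, not the actual ioutils code; `FileSystem.from_uri` and `open_input_file` are real pyarrow APIs):

```python
import pyarrow as pa
import pyarrow.fs as pa_fs


def _try_arrow_filesystem(path):
    """Hypothetical helper: return an Arrow-backed file for URIs that
    pyarrow understands, or None to fall back to the fsspec route."""
    try:
        # from_uri returns (FileSystem, path within that filesystem)
        fs, fs_path = pa_fs.FileSystem.from_uri(path)
    except pa.ArrowInvalid:
        return None  # not a URI Arrow can handle
    # The returned NativeFile wraps a C++ RandomAccessFile, so Cython can
    # hand it to libcudf, which then reads only the byte ranges it needs.
    return fs.open_input_file(fs_path)
```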

Describe alternatives you've considered

Additional context
Some initial discussion on the topic
#2760 (comment)
#2760 (comment)

cc: @randerzander @vuule @rjzamora

ayushdg added the feature request (New feature or request) and Needs Triage (Need team to review and classify) labels on Mar 1, 2021
ayushdg changed the title from "[FEA/Proposal] Use Arrow Filesystem objects for supported remote files" to "[FEA/Proposal] Use Arrow backed filesystem objects for reading remote files" on Mar 1, 2021
kkraus14 added the Python (Affects Python cuDF API) and Cython labels and removed the Needs Triage label on Mar 1, 2021
@jdye64
Contributor

jdye64 commented Mar 23, 2021

So I made a quick prototype to test out the performance improvement that could be realized here, where only a couple of rows from a "large" parquet file might be desired. It works something like this:

  1. I created a user-defined external datasource that uses the Arrow S3 client, so no new dependencies. That probably isn't necessary, but I wanted to keep things separate to make the example clearer.
  2. This arrow_s3_datasource takes S3 options and configuration, then translates those into an actual S3 connection using Arrow's S3 support.
  3. The result is an Arrow RandomAccessFile, which we already have a datasource for. This means it can simply be passed to the existing parquet reader with no further code modifications.
  4. From there, everything works just as if it were a local file.

I saw quite significant performance improvements when reading a roughly 480 MB parquet file with 18 columns from S3, reading only a single row from that file vs. the entire thing. The prototype output is included below to show the speedup. Note the results are in microseconds, and less is better.

All Columns Runtime: 280,378,402; One Column Runtime: 16,646,308
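
The prototype itself lives in C++/Cython, but the mechanics can be sketched from Python with pyarrow alone (bucket, key, and region are hypothetical); the seekable RandomAccessFile is what lets the reader fetch only the byte ranges it needs:

```python
import pyarrow.fs as pa_fs
import pyarrow.parquet as pq

# Hypothetical bucket/key and region.
s3 = pa_fs.S3FileSystem(region="us-east-1")
f = s3.open_input_file("bucket/large_file.parquet")  # seekable, no full download

# Reading one column fetches only that column's byte ranges,
# rather than the whole ~480 MB object.
table = pq.read_table(f, columns=["col_0"])
```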

@jdye64
Contributor

jdye64 commented Mar 23, 2021

@kkraus14 would you be opposed to me making a PR for this? I have a couple of questions first to try to streamline the process.

  1. I personally don't think we need another external datasource for this. Could I just add another function to the existing Arrow datasource in C++? As you mentioned in the thread Ayush linked to, something that could be used with Arrow's C++ URI API, so s3://, hdfs://, mock://, etc.
  2. I would be interested in your opinion on capturing S3 configuration options from users. This would be more of a Python concern (a rough sketch of one option follows this list).
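
On question 2, one possibility is to accept a dict of user options and map it onto `pyarrow.fs.S3FileSystem` keyword arguments; a sketch with hypothetical values (the kwargs themselves are real S3FileSystem parameters):

```python
import pyarrow.fs as pa_fs

# All values hypothetical; keys are pyarrow.fs.S3FileSystem parameters.
user_options = {
    "access_key": "AKIA...",
    "secret_key": "...",
    "region": "us-west-2",
    "endpoint_override": "http://localhost:9000",  # e.g. a MinIO endpoint
}
s3 = pa_fs.S3FileSystem(**user_options)
```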

@kkraus14
Collaborator

I'd welcome a PR to iterate further!

@vuule
Contributor

vuule commented Mar 23, 2021

Could I just add another function to the existing arrow datasource in c++?

What would this function do? Create a datasource from a uri?

@jdye64
Contributor

jdye64 commented Mar 23, 2021

Could I just add another function to the existing arrow datasource in c++?

What would this function do? Create a datasource from a uri?

Pretty much. This is just me thinking off the top of my head, but it shouldn't be much more complicated than that. In the full "datasource" I made for the example, really the only work was in its constructor, so I was thinking just another function should suffice.

@vuule
Contributor

vuule commented Mar 23, 2021

Sounds great to me!

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@randerzander
Contributor

Still a desired [FEA]. PR has activity.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jdye64
Contributor

jdye64 commented May 24, 2021

Still desired. Waiting on upstream arrow changes.

rapids-bot bot pushed a commit that referenced this issue Jun 14, 2021

Arrow offers an API that allows users to provide a URI definition for target files. This PR uses that API and creates a new `arrow_io_source` constructor that accepts the URI from the user, creates the appropriate FileSystem instance, and configures it for access to that file.

This closes: #7475

Authors:
  - Jeremy Dyer (https://github.com/jdye64)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Robert Maynard (https://github.com/robertmaynard)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #7709
@rjzamora
Member

rjzamora commented Sep 3, 2021

@jdye64 - This was closed by #7709, but the cudf-python API will still pass through fsspec and read the entire file into host memory. Can we now pass down an Arrow filesystem object for S3? Can you comment on what needs to be done to address the remote-fs problem in Python/Dask?

@ayushdg
Member Author

ayushdg commented Sep 3, 2021

There's some ongoing work by @shridharathi to expose Arrow filesystem objects for s3 paths in #8961. The approach is essentially to check whether the path is an S3 path and, if so, create an Arrow filesystem object instead of going via the fsspec route. The PR currently implements this for csv only, but it would be extended more generally in ioutils to the other file formats as well.

There are still some open questions around translating fsspec-style storage options into parameters that PyArrow understands.
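
To illustrate the translation problem, here is a sketch of mapping s3fs-style option names onto `pyarrow.fs.S3FileSystem` parameters (the mapping is hypothetical and incomplete; the key names on both sides are real):

```python
# Hypothetical, incomplete mapping from s3fs option names to
# pyarrow.fs.S3FileSystem parameter names.
FSSPEC_TO_PYARROW = {
    "key": "access_key",
    "secret": "secret_key",
    "token": "session_token",
}


def translate_storage_options(storage_options):
    translated = {}
    for name, value in storage_options.items():
        arrow_name = FSSPEC_TO_PYARROW.get(name)
        if arrow_name is None:
            # e.g. s3fs "client_kwargs" has no one-to-one pyarrow equivalent
            raise NotImplementedError(f"Cannot translate option: {name}")
        translated[arrow_name] = value
    return translated
```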

That said, I feel it's worth reopening this issue since the Python side of things is still a WIP.

@vyasr
Contributor

vyasr commented May 13, 2024

This issue is now solved, since it is possible to read only the necessary parts of remote files. We may have to revisit whether we can keep that solution as part of #15193, but either way, this issue will either stay closed or we will need to come up with an alternative solution that doesn't use arrow::fs.

@vyasr vyasr closed this as completed May 13, 2024