
[FEA/Proposal] Use Arrow backed filesystem objects for reading remote files #7475

Closed
ayushdg opened this issue Mar 1, 2021 · 13 comments · Fixed by #7709
Labels: feature request (New feature or request), Python (Affects Python cuDF API)

Comments

@ayushdg
Member

ayushdg commented Mar 1, 2021

Is your feature request related to a problem? Please describe.
The current Python reading pipeline reads the entire buffer when reading from non-local data sources like HDFS, S3, etc. This is suboptimal when a user reads only a subset of the data (a few columns, or row-based filtering), since the entire buffer is still pulled from the data source.
All remote reading currently goes via fsspec, which provides a Python filesystem interface for reading remote sources. Since the Python filesystem object is not backed by a C++ object, the raw buffer needs to be read and passed down to libcudf.

This can hamper performance, especially in network-bound environments such as reading from cloud storage.
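
For illustration, a minimal sketch of the whole-buffer pattern described above (bucket, key, and column names are hypothetical; the s3:// protocol here requires s3fs):

```python
import io

import cudf
import fsspec

# Hypothetical remote path; fsspec resolves s3:// via s3fs.
with fsspec.open("s3://bucket/data.parquet", mode="rb") as f:
    raw = f.read()  # pulls the ENTIRE object into host memory

# Column selection only happens after the full download has completed.
df = cudf.read_parquet(io.BytesIO(raw), columns=["a"])
```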

Describe the solution you'd like
Use Arrow FileSystem objects wherever applicable. These objects are backed by Arrow C++ objects (CRandomAccessFile) that can be passed down to libcudf, which already has the logic to read buffers as needed based on the reader options and the file format.

Implementation Idea

  1. Python: add a utility that uses pyarrow.FileSystem to create a file object wherever applicable in ioutils.get_filepath_or_buffer (see the sketch after this list).
  2. Cython: get the C-backed file handle from the pyarrow file object, then use arrow_io_source to create an Arrow datasource that can be consumed directly by libcudf.
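
A minimal sketch of what step 1 might look like (the helper name is hypothetical, not the actual ioutils code; `FileSystem.from_uri` and `open_input_file` are real pyarrow APIs):

```python
import pyarrow as pa
import pyarrow.fs as pa_fs


def _try_arrow_filesystem(path):
    """Hypothetical helper: return an Arrow-backed file for URIs that
    pyarrow understands, or None to fall back to the fsspec route."""
    try:
        # from_uri returns (FileSystem, path within that filesystem)
        fs, fs_path = pa_fs.FileSystem.from_uri(path)
    except pa.ArrowInvalid:
        return None  # not a URI Arrow can handle
    # The returned NativeFile wraps a C++ RandomAccessFile, so Cython can
    # hand it to libcudf, which then reads only the byte ranges it needs.
    return fs.open_input_file(fs_path)
```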

Describe alternatives you've considered

Additional context
Some initial discussion on the topic
#2760 (comment)
#2760 (comment)

cc: @randerzander @vuule @rjzamora

ayushdg added the feature request (New feature or request) and Needs Triage (Need team to review and classify) labels on Mar 1, 2021
ayushdg changed the title from "[FEA/Proposal] Use Arrow Filesystem objects for supported remote files" to "[FEA/Proposal] Use Arrow backed filesystem objects for reading remote files" on Mar 1, 2021
kkraus14 added the Python (Affects Python cuDF API) and Cython labels and removed the Needs Triage label on Mar 1, 2021
@jdye64
Contributor

jdye64 commented Mar 23, 2021

So I made a quick prototype to test out the performance improvement that could be realized here, where only a couple of rows from a "large" parquet file might be desired. It works something like this:

  1. I created a user-defined external datasource that uses the Arrow S3 client, so no new dependencies. That probably isn't necessary, but I wanted to keep things separate to make the example clearer.
  2. This arrow_s3_datasource takes S3 options and configuration, then translates those into an actual S3 connection using Arrow's S3 support.
  3. The result is an Arrow RandomAccessFile, which we already have a datasource for. This means it can simply be passed to the existing parquet reader with no further code modifications.
  4. From there, everything works just as if it were a local file.

I saw quite significant performance improvements when reading a roughly 480 MB parquet file with 18 columns from S3, reading only a single row from that file vs. the entire thing. The prototype output is included below to show the speedup. Note the results are in microseconds, and less is better.

All Columns Runtime: 280,378,402; One Column Runtime: 16,646,308
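
The prototype itself lives in C++/Cython, but the mechanics can be sketched from Python with pyarrow alone (bucket, key, and region are hypothetical); the seekable RandomAccessFile is what lets the reader fetch only the byte ranges it needs:

```python
import pyarrow.fs as pa_fs
import pyarrow.parquet as pq

# Hypothetical bucket/key and region.
s3 = pa_fs.S3FileSystem(region="us-east-1")
f = s3.open_input_file("bucket/large_file.parquet")  # seekable, no full download

# Reading one column fetches only that column's byte ranges,
# rather than the whole ~480 MB object.
table = pq.read_table(f, columns=["col_0"])
```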

@jdye64
Contributor

jdye64 commented Mar 23, 2021

@kkraus14 would you be opposed to me making a PR for this? I have a couple of questions first to try to streamline the process.

  1. I personally don't think we need another external datasource for this. Could I just add another function to the existing Arrow datasource in C++? As you mentioned in the thread Ayush linked to, something that could be used with Arrow's C++ URI API, so s3://, hdfs://, mock://, etc.
  2. I would be interested in your opinion on capturing S3 configuration options from users. This would be more of a Python concern (a rough sketch of one option follows this list).
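
On question 2, one possibility is to accept a dict of user options and map it onto `pyarrow.fs.S3FileSystem` keyword arguments; a sketch with hypothetical values (the kwargs themselves are real S3FileSystem parameters):

```python
import pyarrow.fs as pa_fs

# All values hypothetical; keys are pyarrow.fs.S3FileSystem parameters.
user_options = {
    "access_key": "AKIA...",
    "secret_key": "...",
    "region": "us-west-2",
    "endpoint_override": "http://localhost:9000",  # e.g. a MinIO endpoint
}
s3 = pa_fs.S3FileSystem(**user_options)
```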

@kkraus14
Collaborator

I'd welcome a PR to iterate further!

@vuule
Contributor

vuule commented Mar 23, 2021

Could I just add another function to the existing arrow datasource in c++?

What would this function do? Create a datasource from a uri?

@jdye64
Contributor

jdye64 commented Mar 23, 2021

Could I just add another function to the existing arrow datasource in c++?

What would this function do? Create a datasource from a uri?

Pretty much. This is just me thinking off the top of my head, but it shouldn't be much more complicated than that. In the full "datasource" I made for the example, really the only work was in its constructor, so I was thinking just another function should suffice.

@vuule
Contributor

vuule commented Mar 23, 2021

Sounds great to me!

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@randerzander
Contributor

Still a desired [FEA]. PR has activity.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jdye64
Contributor

jdye64 commented May 24, 2021

Still desired. Waiting on upstream arrow changes.

rapids-bot bot pushed a commit that referenced this issue Jun 14, 2021

Arrow offers an API that allows users to provide a URI definition for target files. This PR uses that API and creates a new `arrow_io_source` constructor that accepts the URI from the user, creates the appropriate FileSystem instance, and configures it for access to that file.

This closes: #7475

Authors:
  - Jeremy Dyer (https://github.com/jdye64)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Robert Maynard (https://github.com/robertmaynard)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #7709
@rjzamora
Member

rjzamora commented Sep 3, 2021

@jdye64 - This was closed by #7709, but the cudf-python API will still pass through fsspec and read the entire file into host memory. Can we now pass down an Arrow filesystem object for S3? Can you comment on what needs to be done to address the remote-fs problem in Python/Dask?

@ayushdg
Member Author

ayushdg commented Sep 3, 2021

There's some ongoing work by @shridharathi to expose Arrow filesystem objects for s3 paths in #8961. The approach is essentially to check whether the path is an S3 path and, if so, create an Arrow filesystem object instead of going via the fsspec route. The PR currently implements this for csv only, but it would be extended more generally in ioutils to the other file formats as well.

There are still some open questions around translating fsspec-style storage options into parameters that PyArrow understands.
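
To illustrate the translation problem, here is a sketch of mapping s3fs-style option names onto `pyarrow.fs.S3FileSystem` parameters (the mapping is hypothetical and incomplete; the key names on both sides are real):

```python
# Hypothetical, incomplete mapping from s3fs option names to
# pyarrow.fs.S3FileSystem parameter names.
FSSPEC_TO_PYARROW = {
    "key": "access_key",
    "secret": "secret_key",
    "token": "session_token",
}


def translate_storage_options(storage_options):
    translated = {}
    for name, value in storage_options.items():
        arrow_name = FSSPEC_TO_PYARROW.get(name)
        if arrow_name is None:
            # e.g. s3fs "client_kwargs" has no one-to-one pyarrow equivalent
            raise NotImplementedError(f"Cannot translate option: {name}")
        translated[arrow_name] = value
    return translated
```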

That said, I feel it's worth reopening this issue since the Python side of things is still a WIP.

@vyasr
Contributor

vyasr commented May 13, 2024

This issue is now solved, since it is possible to read only the necessary parts of remote files. We may have to revisit whether we can keep that solution as part of #15193, but either way, this issue will either stay closed or we will need to come up with an alternative solution that doesn't use arrow::fs.

@vyasr vyasr closed this as completed May 13, 2024