[FEA] Update read_json to work with s3 paths. #66

Open
ayushdg opened this issue May 15, 2024 · 0 comments
Labels
enhancement New feature or request


ayushdg commented May 15, 2024

Is your feature request related to a problem? Please describe.

Currently, the logic in both get_all_files_under and read_json relies on the files being present locally and doesn't work cleanly with s3. Since dask, cudf, and pandas already support reading from s3 via fsspec or s3fs, Curator should update these methods to accept an s3:// path and read directly from s3.

Describe the solution you'd like
Existing Curator scripts and examples should work unchanged when DocumentDataset.read_json and get_all_files_under are given s3 paths.
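
As a rough illustration of what s3-aware file listing could look like, here is a minimal sketch. It is not Curator's implementation: `list_files_under` and its `storage_options` parameter are hypothetical names, and the remote branch assumes fsspec (plus s3fs for `s3://` URLs) is installed.

```python
import os


def list_files_under(root, storage_options=None):
    """Return a sorted list of every file under `root`, local or remote.

    Hypothetical helper for illustration only; not part of Curator.
    """
    if "://" in root:
        # Remote URL such as s3://bucket/prefix: delegate to fsspec.
        # Imported lazily so local-only use does not require fsspec/s3fs.
        from fsspec.core import url_to_fs

        fs, path = url_to_fs(root, **(storage_options or {}))
        # find() lists files recursively; unstrip_protocol() restores
        # the protocol prefix that fsspec strips from returned paths.
        return sorted(fs.unstrip_protocol(p) for p in fs.find(path))

    # Local directory: plain recursive walk, as before.
    return sorted(
        os.path.join(dirpath, name)
        for dirpath, _, filenames in os.walk(root)
        for name in filenames
    )
```

Dispatching on the URL scheme keeps the local code path untouched, so existing scripts that pass local directories would behave the same.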

Describe alternatives you've considered
The alternative is for users to read the datasets in directly with a different library or the Dask API and then create a DocumentDataset from the result.
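
For reference, the workaround might look roughly like the sketch below. The function name `read_jsonl_lazily` is hypothetical, the example path is a placeholder, and reading from `s3://` assumes dask and s3fs are installed. Wrapping the result in a DocumentDataset is left as a comment because the exact constructor signature is an assumption here.

```python
def read_jsonl_lazily(path, storage_options=None):
    """Read JSON Lines files into a Dask DataFrame, locally or from s3.

    Hypothetical helper for illustration only; not part of Curator.
    """
    # Imported lazily so that defining the helper does not require dask.
    import dask.dataframe as dd

    # lines=True treats each line as one JSON record; storage_options
    # is forwarded to fsspec (for example, credentials for s3fs).
    return dd.read_json(path, lines=True, storage_options=storage_options)


# Usage (placeholder bucket and prefix):
#   ddf = read_jsonl_lazily("s3://my-bucket/my-prefix/*.jsonl")
#   The resulting DataFrame could then be wrapped in a DocumentDataset
#   (assumed to accept a Dask DataFrame).
```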

