[FEA] Update read_json to work with s3 paths. #66

ayushdg · 2024-05-15T19:50:16Z

Is your feature request related to a problem? Please describe.

Currently, both get_all_files_under and read_json contain logic that relies on the files being present locally, so neither works cleanly with S3. Since dask/cudf/pandas already support reading from S3 via fsspec or s3fs, Curator should update these methods to accept an s3:// path and read directly from S3.

Describe the solution you'd like
With the existing Curator scripts/examples, DocumentDataset.read_json and get_all_files_under should work with s3:// paths.
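
For illustration, here is a rough sketch of how the file listing could handle both local and s3:// roots. This is not Curator's actual implementation: `is_remote_path` and `list_files_under` are hypothetical names, the `.jsonl` default suffix is an assumption, and the remote branch assumes fsspec (plus s3fs for S3) is installed.

```python
import os
from urllib.parse import urlparse


def is_remote_path(path: str) -> bool:
    """Return True for URL-style paths such as s3://bucket/prefix."""
    scheme = urlparse(path).scheme
    # Single-letter schemes are treated as Windows drive letters (C:\...).
    return len(scheme) > 1 and scheme != "file"


def list_files_under(root: str, suffix: str = ".jsonl") -> list[str]:
    """Recursively list files under root (hypothetical helper, not Curator's API)."""
    if is_remote_path(root):
        # Imported lazily so purely local use does not require fsspec.
        from fsspec.core import url_to_fs

        fs, stripped_root = url_to_fs(root)
        # fs.find() returns paths without the protocol, so add it back.
        found = [fs.unstrip_protocol(p) for p in fs.find(stripped_root)]
    else:
        found = [
            os.path.join(dirpath, name)
            for dirpath, _, names in os.walk(root)
            for name in names
        ]
    return sorted(p for p in found if p.endswith(suffix))
```

The returned paths keep their s3:// prefix, so they could in principle be handed to something like `dask.dataframe.read_json(files, lines=True)`, which resolves remote URLs through fsspec when s3fs is available.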

Describe alternatives you've considered
The alternative is for users to read the datasets in directly with a different library or the dask API and then create a DocumentDataset from the result.



ayushdg added the enhancement New feature or request label May 15, 2024

ryantwolf mentioned this issue Jul 3, 2024

Enable Sem-dedup #130

Merged

3 tasks

sarahyurick mentioned this issue Sep 9, 2024

Better mimic DocumentDataset's read_* functions to Dask's read_* functions #50

Open
