Is your feature request related to a problem? Please describe.
Currently, the logic in both get_all_files_under and read_json relies on the files being present locally and doesn't work cleanly with S3. Since dask/cudf/pandas already support reading from S3 via fsspec or s3fs, Curator should update these methods to accept an s3:// path and read directly from S3.
Describe the solution you'd like
Using the existing Curator scripts/examples, DocumentDataset.read_json and get_all_files_under should work with s3:// paths unchanged.
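As a rough illustration of the request, file listing could be made protocol-agnostic by resolving the filesystem through fsspec. This is a minimal sketch, not Curator's actual implementation; the function name mirrors Curator's get_all_files_under, but the signature and suffix filter here are assumptions.

```python
import fsspec

def get_all_files_under(path: str, suffix: str = ".jsonl") -> list[str]:
    """Hypothetical protocol-agnostic listing: fsspec picks the filesystem
    from the path's protocol, so the same code serves local paths and
    s3:// URLs (the latter requiring s3fs to be installed)."""
    fs, _, _ = fsspec.get_fs_token_paths(path)
    # fs.find returns paths with the protocol stripped; re-attach it so
    # downstream readers (dask/cudf/pandas) still see an s3:// URL.
    protocol = path.split("://")[0] + "://" if "://" in path else ""
    return sorted(protocol + f for f in fs.find(path) if f.endswith(suffix))
```

For a local directory the call behaves like a recursive glob; for `get_all_files_under("s3://bucket/prefix")` the same code lists objects in the bucket.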
Describe alternatives you've considered
The alternative is for users to read the datasets themselves with a different library or the Dask API directly, and then construct a DocumentDataset from the result.
Additional context
Add any other context or screenshots about the feature request here.