[FEA] Improve download and extract utility #80

Closed
ryantwolf opened this issue May 23, 2024 · 0 comments · Fixed by #82
Labels
enhancement New feature or request

Comments

@ryantwolf
Collaborator

Errors caused by incorrect use of the download and extraction utilities are hard to debug, and the purpose of these utilities can be unclear.

Describe the solution you'd like
When download_common_crawl is called with invalid snapshot numbers, it fails with this error:

Traceback (most recent call last):
  File "/workspace/NeMo-Curator/examples/download_common_crawl.py", line 53, in <module>
    main(attach_args().parse_args())
  File "/workspace/NeMo-Curator/examples/download_common_crawl.py", line 35, in main
    common_crawl = download_common_crawl(
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/download/commoncrawl.py", line 342, in download_common_crawl
    dataset = download_and_extract(
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/download/doc_builder.py", line 185, in download_and_extract
    df = dd.from_map(
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/io/io.py", line 992, in from_map
    raise ValueError("All `iterables` must have a non-zero length")

We should do the following:

  • Update the documentation to state explicitly that the snapshot numbers must be valid.
  • Provide a link to the list of valid snapshots in the documentation.
  • Catch this error and raise a new one with a more informative message.
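
As a rough illustration of the third point, here is a minimal sketch of catching the error and re-raising it with context. It is not the actual NeMo Curator code: InvalidSnapshotError, build_dataset, and fake_from_map are hypothetical names, and fake_from_map is only a stand-in that mimics the empty-input ValueError shown in the traceback so the example runs without Dask installed.

```python
class InvalidSnapshotError(ValueError):
    """Hypothetical error for a snapshot range that yields no files."""


def fake_from_map(func, urls):
    # Stand-in for dd.from_map (assumption): reproduces the message from the
    # traceback when it is given an empty list of inputs.
    if len(urls) == 0:
        raise ValueError("All `iterables` must have a non-zero length")
    return [func(url) for url in urls]


def build_dataset(urls, start_snapshot, end_snapshot, from_map=fake_from_map):
    """Build the dataset, translating the empty-input error into a clearer one."""
    try:
        # str.upper is a placeholder for the real per-file download/extract step.
        return from_map(str.upper, urls)
    except ValueError as err:
        if len(urls) == 0:
            # Chain the original exception so the low-level cause stays visible.
            raise InvalidSnapshotError(
                f"No files were found for snapshots {start_snapshot!r} to "
                f"{end_snapshot!r}. Check that both snapshot identifiers are "
                "valid and that the start does not come after the end."
            ) from err
        # Any other ValueError is unrelated to snapshots; re-raise it unchanged.
        raise
```

Using `raise ... from err` keeps the original Dask error attached as `__cause__`, so the clearer message does not hide the underlying failure. The same check could also run before the dataset is built, which would avoid relying on the wording of the downstream exception.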

Furthermore, we should clarify the purpose of the download_and_extract utility. Users should not be under the impression that NeMo Curator does web crawling. Optionally, we can provide a simple download and extraction utility that operates on raw URLs.

@ryantwolf ryantwolf added the enhancement New feature or request label May 23, 2024