[FEA] Improve download and extract utility #80

ryantwolf · 2024-05-23T22:37:49Z

Errors raised when the download and extraction utilities are used incorrectly are hard to debug, and the purpose of these utilities can be unclear.

Describe the solution you'd like
When download_common_crawl is given invalid snapshot numbers, it fails with this error:

Traceback (most recent call last):
  File "/workspace/NeMo-Curator/examples/download_common_crawl.py", line 53, in <module>
    main(attach_args().parse_args())
  File "/workspace/NeMo-Curator/examples/download_common_crawl.py", line 35, in main
    common_crawl = download_common_crawl(
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/download/commoncrawl.py", line 342, in download_common_crawl
    dataset = download_and_extract(
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/download/doc_builder.py", line 185, in download_and_extract
    df = dd.from_map(
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/io/io.py", line 992, in from_map
    raise ValueError("All `iterables` must have a non-zero length")

We should do the following:

Update the documentation to explicitly call out that the snapshot numbers must be valid.
Provide a link to the list of valid snapshots in the documentation.
Catch this error and rethrow it with a more informative message.
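
As a rough illustration of the third item, the rethrow could look something like the sketch below. This is not NeMo Curator's actual code: build_with_clear_error, fake_build, and SNAPSHOT_LIST_URL are hypothetical names introduced here, and the URL is only an assumed place to look up valid snapshots. The sketch assumes that an empty URL list is what triggers the dask error shown in the traceback.

```python
# Hypothetical sketch; none of these names exist in NeMo Curator.
SNAPSHOT_LIST_URL = "https://index.commoncrawl.org/"  # assumed location of the snapshot list


def build_with_clear_error(build_fn, urls, start_snapshot, end_snapshot):
    """Call build_fn(urls), translating the empty-input error into a clearer one."""
    try:
        return build_fn(urls)
    except ValueError as err:
        if len(urls) > 0:
            # Some other ValueError: leave it untouched.
            raise
        raise ValueError(
            f"No files were found between snapshots {start_snapshot!r} and "
            f"{end_snapshot!r}. Check that both snapshots are valid and that the "
            f"start is not after the end. See {SNAPSHOT_LIST_URL} for a list."
        ) from err


def fake_build(urls):
    """Stand-in for the dd.from_map call; fails on empty input the same way."""
    if not urls:
        raise ValueError("All `iterables` must have a non-zero length")
    return list(urls)


if __name__ == "__main__":
    try:
        build_with_clear_error(fake_build, [], "2021-04", "2020-50")
    except ValueError as err:
        print(err)
```

Chaining with `from err` keeps the original dask error in the traceback, so no debugging information is lost.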

Furthermore, we should clarify the purpose of the download_and_extract utility. Users should not be under the impression that NeMo Curator does web crawling. We can optionally provide a simple download and extraction utility that operates on raw URLs.
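
To make the last idea concrete, here is a minimal sketch of what a raw-URL utility might look like, using only the Python standard library. The function names (download_url, extract_text, download_and_extract_urls) are hypothetical and not part of NeMo Curator. "Extraction" here is assumed to mean nothing more than decompressing gzip files and reading them as UTF-8 text. The sketch fetches exactly the URLs it is given and follows no links, which is the distinction from web crawling.

```python
import gzip
import shutil
import urllib.parse
import urllib.request
from pathlib import Path


def download_url(url, output_dir):
    """Download one URL into output_dir and return the local path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Use the last path component of the URL as the local file name.
    name = Path(urllib.parse.urlparse(url).path).name
    target = output_dir / name
    if not target.exists():  # skip files that were already downloaded
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target


def extract_text(path):
    """Read a downloaded file as UTF-8 text, decompressing .gz files."""
    opener = gzip.open if Path(path).suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return handle.read()


def download_and_extract_urls(urls, output_dir):
    """Download each URL and return one record per URL with its text."""
    return [
        {"url": url, "text": extract_text(download_url(url, output_dir))}
        for url in urls
    ]
```

A call such as download_and_extract_urls(urls, "downloads") would return a list of dictionaries with "url" and "text" keys, one per input URL.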

ryantwolf added the enhancement New feature or request label May 23, 2024

ryantwolf mentioned this issue May 24, 2024

Improve Common Crawl download #82

Merged

3 tasks

ryantwolf closed this as completed in #82 Jun 28, 2024
