Errors when incorrectly using the download and extraction utilities are hard to debug, and the purpose of download and extraction can be unclear.
Describe the solution you'd like
For download_common_crawl, when the snapshot numbers are invalid, we get this error:
Traceback (most recent call last):
  File "/workspace/NeMo-Curator/examples/download_common_crawl.py", line 53, in <module>
    main(attach_args().parse_args())
  File "/workspace/NeMo-Curator/examples/download_common_crawl.py", line 35, in main
    common_crawl = download_common_crawl(
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/download/commoncrawl.py", line 342, in download_common_crawl
    dataset = download_and_extract(
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/download/doc_builder.py", line 185, in download_and_extract
    df = dd.from_map(
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/io/io.py", line 992, in from_map
    raise ValueError("All `iterables` must have a non-zero length")
We should do the following:
- Update the documentation to state explicitly that the snapshot numbers must be valid.
- Link to the list of valid snapshots from the documentation.
- Catch this error and re-raise it with a more informative message.
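As a rough illustration of the last point, here is a minimal sketch of catching the opaque `ValueError` and re-raising it with more context. The names `InvalidSnapshotError`, `check_snapshot_format`, and `run_with_clear_error` are hypothetical and not part of NeMo Curator. The `YYYY-WW` snapshot format and matching on the "non-zero length" message text are assumptions based on the traceback above.

```python
import re

# Assumption: snapshots are written as "YYYY-WW" (year, then week number).
_SNAPSHOT_RE = re.compile(r"^(\d{4})-(\d{2})$")


class InvalidSnapshotError(ValueError):
    """Hypothetical error type for snapshot arguments that cannot be used."""


def check_snapshot_format(snapshot: str) -> None:
    """Reject strings that cannot be a snapshot before any work starts."""
    match = _SNAPSHOT_RE.match(snapshot)
    if match is None or not 1 <= int(match.group(2)) <= 53:
        raise InvalidSnapshotError(
            f"Snapshot {snapshot!r} is not of the form 'YYYY-WW'."
        )


def run_with_clear_error(build_dataset, start_snapshot: str, end_snapshot: str):
    """Call build_dataset() and translate the empty-input error from dask."""
    check_snapshot_format(start_snapshot)
    check_snapshot_format(end_snapshot)
    try:
        return build_dataset()
    except ValueError as err:
        # Only rewrite the specific error seen in the traceback; pass on the rest.
        if "non-zero length" not in str(err):
            raise
        raise InvalidSnapshotError(
            f"No files were found for snapshots {start_snapshot!r} to "
            f"{end_snapshot!r}. Check both values against the published "
            "list of valid Common Crawl snapshots."
        ) from err
```

Chaining with `from err` keeps the original dask traceback available for debugging while the top-level message tells the user what to fix.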
Furthermore, we should clarify the purpose of the download_and_extract utility. Users should not be under the impression that NeMo Curator does web crawling. Optionally, we could provide a simple download and extraction utility that operates on raw URLs.
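To make the distinction from crawling concrete, a utility over raw URLs might look roughly like the sketch below. It only fetches the exact URLs it is given and never follows links. The function names `download_urls` and `extract_lines` are hypothetical, and treating each file as plain or gzip-compressed text is an assumption made for illustration.

```python
import gzip
import shutil
import urllib.parse
import urllib.request
from pathlib import Path


def download_urls(urls, output_dir):
    """Fetch each given URL into output_dir; no link discovery takes place."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in urls:
        # Name the local file after the last path component of the URL.
        target = out / Path(urllib.parse.urlparse(url).path).name
        if not target.exists():  # skip files downloaded on an earlier run
            with urllib.request.urlopen(url) as resp, open(target, "wb") as fh:
                shutil.copyfileobj(resp, fh)
        paths.append(target)
    return paths


def extract_lines(path):
    """Yield text lines from a plain or gzip-compressed file."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n")
```

Keeping download and extraction as two separate steps mirrors the naming of download_and_extract and makes it clear that the inputs are always a known list of files.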