Update nemo curator files #206

Merged
merged 2 commits on Aug 14, 2024
42 changes: 21 additions & 21 deletions docs/user-guide/download.rst
@@ -2,7 +2,7 @@
.. _data-curator-download:

======================================
Downloading and Extracting Text
Download and Extract Text
======================================
-----------------------------------------
Background
@@ -33,7 +33,7 @@ By "download", we typically mean the transfer of data from a web-hosted data sou
By "extraction", we typically mean the process of converting a data format from its raw form (e.g., ``.warc.gz``) to a standardized format (e.g., ``.jsonl``) and discarding irrelvant data.

* ``download_common_crawl`` will download and extract the compressed web archive files of Common Crawl snapshots to a target directory.
Common crawl has an S3 bucket and a direct HTTPS endpoint. If you want to use the S3 bucket, ensure you have properly setup your credentials with `s5cmd <https://github.com/peak/s5cmd>`_.
Common Crawl has an S3 bucket and a direct HTTPS endpoint. If you want to use the S3 bucket, ensure you have properly set up your credentials with `s5cmd <https://github.com/peak/s5cmd>`_.
Otherwise, the HTTPS endpoints will be used with ``wget``. Here is a small example of how to use it:

.. code-block:: python
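
    # Illustrative sketch: the import path, the starting snapshot value, and the exact
    # argument order are assumptions, not a verbatim copy of the NeMo Curator example.
    from nemo_curator.download import download_common_crawl

    common_crawl_dataset = download_common_crawl(
        "/extracted/output/folder",  # where the extracted .jsonl files are written
        "2020-50",                   # first snapshot included in the download (assumed value)
        "2021-04",                   # last snapshot included in the download
        output_type="jsonl",         # "jsonl" or "parquet"
    )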
@@ -47,7 +47,7 @@ By "extraction", we typically mean the process of converting a data format from
* ``"2021-04"`` is the last common crawl snapshot that will be included in the download.
* ``output_type="jsonl"`` is the file format that will be used for storing the data on disk. Currently ``"jsonl"`` and ``"parquet"`` are supported.

The user may choose to modify the HTML text extraction algorithm used in ``download_common_crawl``. See an example below.
You can choose to modify the HTML text extraction algorithm used in ``download_common_crawl``. See an example below.

.. code-block:: python

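    # Illustrative sketch: the extractor class name and the keyword used to pass it are
    # assumptions about the API rather than a verbatim example.
    from nemo_curator.download import ResiliparseExtractor, download_common_crawl

    extraction_algorithm = ResiliparseExtractor()  # swap in Resiliparse instead of the default jusText
    common_crawl_dataset = download_common_crawl(
        "/extracted/output/folder",
        "2020-50",
        "2021-04",
        output_type="jsonl",
        algorithm=extraction_algorithm,  # assumed keyword for the custom extractor
    )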
@@ -72,9 +72,9 @@ By "extraction", we typically mean the process of converting a data format from

NeMo Curator's Common Crawl extraction process looks like this under the hood:

1. Decode the HTML within the record from binary to text
2. If the HTML can be properly decoded, then with `pyCLD2 <https://github.com/aboSamoor/pycld2>`_, perform language detection on the input HTML
3. Finally, the extract the relevant text with `jusText <https://github.com/miso-belica/jusText>`_ or `Resiliparse <https://github.com/chatnoir-eu/chatnoir-resiliparse>`_ from the HTML and write it out as a single string within the 'text' field of a json entry within a `.jsonl` file
1. Decode the HTML within the record from binary to text.
2. If the HTML can be properly decoded, then with `pyCLD2 <https://github.com/aboSamoor/pycld2>`_, perform language detection on the input HTML.
3. Finally, extract the relevant text from the HTML with `jusText <https://github.com/miso-belica/jusText>`_ or `Resiliparse <https://github.com/chatnoir-eu/chatnoir-resiliparse>`_ and write it out as a single string within the ``text`` field of a JSON entry within a ``.jsonl`` file.
* ``download_wikipedia`` will download and extract the latest Wikipedia dump. Files are downloaded using ``wget``. Wikipedia might download more slowly than the other datasets because Wikipedia limits the number of downloads that can occur per IP address.

.. code-block:: python
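
    # Illustrative sketch (import path assumed); the parameters follow the descriptions below.
    from nemo_curator.download import download_wikipedia

    wikipedia_dataset = download_wikipedia(
        "/extracted/output/folder",  # where the extracted .jsonl files are written
        dump_date="20240201",        # omit to use the latest available dump
    )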
@@ -86,7 +86,7 @@ By "extraction", we typically mean the process of converting a data format from
* ``"/extracted/output/folder"`` is the path to on your local filesystem where the final extracted files will be placed.
* ``dump_date="20240201"`` fixes the Wikipedia dump to a specific date. If no date is specified, the latest dump is used.

* ``download_arxiv`` will download and extract latex versions of ArXiv papers. They are hosted on S3, so ensure you have properly setup your credentials with `s5cmd <https://github.com/peak/s5cmd>`_.
* ``download_arxiv`` will download and extract LaTeX versions of ArXiv papers. They are hosted on S3, so ensure you have properly set up your credentials with `s5cmd <https://github.com/peak/s5cmd>`_.

.. code-block:: python

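    # Illustrative sketch: the import path and argument are assumed to mirror the other
    # download helpers on this page.
    from nemo_curator.download import download_arxiv

    arxiv_dataset = download_arxiv("/extracted/output/folder")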
@@ -107,7 +107,7 @@ Related Scripts
In addition to the Python module described above, NeMo Curator provides several CLI scripts that you may find useful for performing the same function.

The :code:`download_and_extract` script within NeMo Curator is a generic tool that can be used to download and extract from a number of different
datasets. In general, it can be called as follows in order to download and extract text from the web
datasets. In general, it can be called as follows in order to download and extract text from the web:

.. code-block:: bash

@@ -116,9 +116,9 @@ datasets. In general, it can be called as follows in order to download and extra
--builder-config-file=<Path to .yaml file that describes how the data should be downloaded and extracted> \
--output-json-dir=<Path to output directory to which data will be written in .jsonl format>

This utility takes as input a list of URLs that point to files that contain prepared, unextracted data (e.g., pre-crawled web pages from Common Crawl), a config file that describes how to download and extract the data, and the output directory to where the extracted text will be written in jsonl format (one json written to each document per line). For each URL provided in the list of URLs, a corresponding jsonl file will be written to the output directory.
This utility takes as input a list of URLs that point to files that contain prepared, unextracted data (e.g., pre-crawled web pages from Common Crawl), a config file that describes how to download and extract the data, and the output directory where the extracted text will be written in jsonl format (one json written to each document per line). For each URL provided in the list of URLs, a corresponding jsonl file will be written to the output directory.

The config file that must be provided at runtime, should take the following form
The config file that must be provided at runtime should take the following form:

.. code-block:: yaml

@@ -137,11 +137,11 @@ Common Crawl Example


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Setup
Set Up Common Crawl
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you prefer, the download process can pull WARC files from S3 using `s5cmd <https://github.com/peak/s5cmd>`_.
This utility is preinstalled in the NeMo Framework Container, but you must have the necessary credentials within :code:`~/.aws/config` in order to use it.
If you would prefer to use this over `wget <https://en.wikipedia.org/wiki/Wget>`_ instead, you may set :code:`aws=True` in the :code:`download_params` as follows
If you prefer to use this method instead of `wget <https://en.wikipedia.org/wiki/Wget>`_, set :code:`aws=True` in the :code:`download_params` as follows:

.. code-block:: yaml

@@ -155,11 +155,11 @@ If you would prefer to use this over `wget <https://en.wikipedia.org/wiki/Wget>`


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Downloading and Extracting Common Crawl
Download and Extract Common Crawl
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As described in the first section of this document, the first step towards using the :code:`download_and_extract` for Common Crawl will be to create a list of URLs that point to the location of the WARC files hosted by Common Crawl.
Within NeMo Curator, we provide the utility :code:`get_common_crawl_urls` to obtain these urls. This utility can be run as follows
As described in the first section of this document, the first step in using the :code:`download_and_extract` for Common Crawl is to create a list of URLs that point to the location of the WARC files hosted by Common Crawl.
Within NeMo Curator, we provide the :code:`get_common_crawl_urls` utility to obtain these URLs. This utility can be run as follows:

.. code-block:: bash

@@ -170,9 +170,9 @@ Within NeMo Curator, we provide the utility :code:`get_common_crawl_urls` to obt
--output-warc-url-file=./url_data/warc_urls_cc_2020_50.txt

This script pulls the Common Crawl index from `https://index.commoncrawl.org` and saves the index to the file
specified by the argument :code:`--cc-snapshot-index-file`. It then retrieves all WARC urls between the
specified by the argument :code:`--cc-snapshot-index-file`. It then retrieves all WARC URLs between the
dates specified by the arguments :code:`--starting-snapshot` and :code:`--ending-snapshot`.
Finally, it writes all WARC urls to the text file :code:`--output-warc-urls`. This file is a simple text file
Finally, it writes all WARC URLs to the text file specified by :code:`--output-warc-url-file`. This file is a simple text file
with the following format::

https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/segments/1606141163411.0/warc/CC-MAIN-20201123153826-20201123183826-00000.warc.gz
@@ -182,10 +182,10 @@ with the following format::
https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/segments/1606141163411.0/warc/CC-MAIN-20201123153826-20201123183826-00004.warc.gz
...

For the CC-MAIN-2020-50 snapshot there are a total of 72,000 compressed WARC files each between 800 - 900 MB.
For the CC-MAIN-2020-50 snapshot, there are a total of 72,000 compressed WARC files, each between 800 MB and 900 MB.

Now with the prepared list of URLs, we can use the Common Crawl config included in the :code:`config` directory under the root directory of the repository. This config uses the download, data loader and extraction classes defined in the file :code:`nemo_curator/download/commoncrawl.py`.
With this config and the input list of URLs, the :code:`download_and_extract` utility can be used as follows for downloading and extracting text from Common Crawl
Now with the prepared list of URLs, we can use the Common Crawl config included in the :code:`config` directory under the root directory of the repository. This config uses the download, data loader, and extraction classes defined in the file :code:`nemo_curator/download/commoncrawl.py`.
With this config and the input list of URLs, the :code:`download_and_extract` utility can be used as follows for downloading and extracting text from Common Crawl:

.. code-block:: bash

@@ -196,7 +196,7 @@ With this config and the input list of URLs, the :code:`download_and_extract` ut


As the text is extracted from the WARC records, the prepared documents are written to the directory specified by :code:`--output-json-dir`. Here is an
example of a single line of an output `.jsonl` file extracted from a WARC record
example of a single line of an output `.jsonl` file extracted from a WARC record:

.. code-block:: json

38 changes: 21 additions & 17 deletions docs/user-guide/semdedup.rst
@@ -14,7 +14,7 @@ Unlike exact or fuzzy deduplication, which focus on textual similarity, semantic
As outlined in the paper `SemDeDup: Data-efficient learning at web-scale through semantic deduplication <https://arxiv.org/pdf/2303.09540>`_ by Abbas et al.,
this method can significantly reduce dataset size while maintaining or even improving model performance.
Semantic deduplication is particularly effective for large, uncurated web-scale datasets, where it can remove up to 50% of the data with minimal performance loss.
The semantic deduplication module in NeMo Curator uses embeddings from to identify and remove "semantic duplicates" - data pairs that are semantically similar but not exactly identical.
The semantic deduplication module in NeMo Curator uses embeddings to identify and remove "semantic duplicates" - data pairs that are semantically similar but not exactly identical.
While this documentation primarily focuses on text-based deduplication, the underlying principles can be extended to other modalities with appropriate embedding models.

-----------------------------------------
@@ -30,7 +30,7 @@ The SemDeDup algorithm consists of the following main steps:
5. Duplicate Removal: From each group of semantic duplicates within a cluster, one representative datapoint is kept (typically the one with the lowest cosine similarity to the cluster centroid) and the rest are removed.

-----------------------------------------
Configuration
Configure Semantic Deduplication
-----------------------------------------

Semantic deduplication in NeMo Curator can be configured using a YAML file. Here's an example `sem_dedup_config.yaml`:
@@ -73,7 +73,7 @@ Semantic deduplication in NeMo Curator can be configured using a YAML file. Here
You can customize this configuration file to suit your specific needs and dataset characteristics.

-----------------------------------------
Changing Embedding Models
Change Embedding Models
-----------------------------------------

One of the key advantages of the semantic deduplication module is its flexibility in using different pre-trained models for embedding generation.
@@ -125,15 +125,15 @@ The semantic deduplication process is controlled by two key threshold parameters
This value must be one of the thresholds listed in `eps_thresholds`.

This two-step approach offers several advantages:
- Flexibility to compute matches at multiple thresholds without rerunning the entire process.
- Ability to analyze the impact of different thresholds on your dataset.
- Option to fine-tune the final threshold based on specific needs without recomputing all matches.
* Flexibility to compute matches at multiple thresholds without rerunning the entire process.
* Ability to analyze the impact of different thresholds on your dataset.
* Option to fine-tune the final threshold based on specific needs without recomputing all matches.

Choosing appropriate thresholds:
- Lower thresholds (e.g., 0.001): More strict, resulting in less deduplication but higher confidence in the identified duplicates.
- Higher thresholds (e.g., 0.1): Less strict, leading to more aggressive deduplication but potentially removing documents that are only somewhat similar.
When choosing appropriate thresholds, consider the following:
* Lower thresholds (e.g., 0.001): More strict, resulting in less deduplication but higher confidence in the identified duplicates.
* Higher thresholds (e.g., 0.1): Less strict, leading to more aggressive deduplication but potentially removing documents that are only somewhat similar.

It's recommended to experiment with different threshold values to find the optimal balance between data reduction and maintaining dataset diversity and quality.
We recommend that you experiment with different threshold values to find the optimal balance between data reduction and maintaining dataset diversity and quality.
The impact of these thresholds can vary depending on the nature and size of your dataset.

Remember, if you want to extract data using a threshold that's not in `eps_thresholds`, you'll need to recompute the semantic matches with the new threshold included in the list.
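
For example, a sketch of this two-step workflow, reusing the ``semantic_dedup`` object constructed in the component example later on this page, might look like the following (the threshold values are placeholders):

.. code-block:: python

    # Compute the semantic match dataframes once, covering every value in eps_thresholds.
    semantic_dedup.compute_semantic_match_dfs()

    # Extract results at several thresholds without recomputing the matches.
    results_by_eps = {
        eps: semantic_dedup.extract_dedup_data(eps_to_extract=eps)
        for eps in [0.001, 0.01, 0.1]  # each value must appear in eps_thresholds
    }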
@@ -158,7 +158,8 @@ You can use the `add_id` module from NeMo Curator if needed:

To perform semantic deduplication, you can either use individual components or the SemDedup class with a configuration file:

Using individual components:
Use Individual Components
##########################

1. Embedding Creation:

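A hedged sketch of this step is shown below; the ``EmbeddingCreator`` class name, its parameters, and the example model identifier are assumptions about the API rather than an exact reproduction:

.. code-block:: python

    from nemo_curator import EmbeddingCreator          # assumed import path and class name
    from nemo_curator.datasets import DocumentDataset  # assumed import path

    # Input documents must already carry unique IDs (see the add_id note above).
    dataset = DocumentDataset.read_json("input_data/")  # assumed loader call

    embedding_creator = EmbeddingCreator(
        embedding_model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",  # example model (assumed)
        embedding_batch_size=128,
        embedding_output_dir="/path/to/embeddings",
        input_column="text",
    )
    embeddings_dataset = embedding_creator(dataset)
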
@@ -194,7 +195,7 @@ Using individual components:
)
clustered_dataset = clustering_model(embeddings_dataset)

1. Semantic Deduplication:
3. Semantic Deduplication:

.. code-block:: python

@@ -214,7 +215,10 @@ Using individual components:
semantic_dedup.compute_semantic_match_dfs()
deduplicated_dataset_ids = semantic_dedup.extract_dedup_data(eps_to_extract=0.07)

1. Alternatively, you can use the SemDedup class to perform all steps:
Use the SemDedup Class
#######################

Alternatively, you can use the SemDedup class to perform all steps:

.. code-block:: python

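    # Illustrative sketch only: the import paths, config loader, and call pattern are
    # assumptions about the API rather than a verbatim example.
    from nemo_curator import SemDedup, SemDedupConfig
    from nemo_curator.datasets import DocumentDataset

    # Load the YAML configuration described earlier on this page.
    config = SemDedupConfig.from_yaml("sem_dedup_config.yaml")

    # Input documents must already carry unique IDs.
    dataset = DocumentDataset.read_json("input_data/")

    # Run embedding creation, clustering, and duplicate extraction in one call.
    sem_dedup = SemDedup(config=config, input_column="text", id_column="id")
    deduplicated_dataset_ids = sem_dedup(dataset)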
@@ -255,10 +259,10 @@ Output

The semantic deduplication process produces a deduplicated dataset, typically reducing the dataset size by 20-50% while maintaining or improving model performance. The output includes:

1. Embeddings for each datapoint
2. Cluster assignments for each datapoint
3. A list of semantic duplicates
4. The final deduplicated dataset
1. Embeddings for each datapoint.
2. Cluster assignments for each datapoint.
3. A list of semantic duplicates.
4. The final deduplicated dataset.

-----------------------------------------
Performance Considerations