diff --git a/.gitignore b/.gitignore
new file mode 100644
index 00000000..b2a38eaf
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,153 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+*.cpp
+
+# C extensions
+*.so
+*.c
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+# For a library or package, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# Dask cache
+dask-worker-space/
+
+# Data downloaded and generated when running the examples.
+data/
+
+# SLURM Files
+*.out
+*.err
+
+# Text Editor / IDE Files
+.vscode
diff --git a/.style.yapf b/.style.yapf
new file mode 100644
index 00000000..4861cafe
--- /dev/null
+++ b/.style.yapf
@@ -0,0 +1,3 @@
+[style]
+based_on_style = google
+indent_width = 2
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 00000000..3bd14fc1
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,118 @@
+# Checklist
+
+We are glad you are contributing to NeMo Curator! Before you make a PR, be sure to read over this guide in detail.
+This checklist ensures that NeMo Curator stays easy to use for both users and developers.
+Not every step is necessary for every contribution, so read the linked sections for more information about each item.
+
+1. [Follow the general principles in your design](#general-principles)
+1. [Write your code in the proper place](#code-structure)
+1. [Write examples and documentation for using your code](#examples-and-documentation)
+1. [Format using the style guide](#python-style)
+1. [Write unit tests](#unit-tests)
+1. [Make a pull request](#pull-requests-pr-guidelines)
+
+## General principles
+1. **User-oriented**: make it easy for end users, even at the cost of writing more code in the background
+1. **Robust**: make it hard for users to make mistakes.
+1. **Reusable**: for every piece of code, think about how it can be reused in the future and make it easy to reuse.
+1. **Readable**: code should be easy to read.
+1. **Legal**: if you copy even one line of code from the Internet, make sure that its license is compatible with the license NeMo Curator uses. Give credit and link back to the source.
+1. **Sensible**: code should make sense. If you think a piece of code might be confusing, write comments.
+
+## Code Structure
+The repository is home to flexible Python modules, sample scripts, tests, and more.
+Here is a brief overview of where everything lives:
+- [config](config/) - A collection of example configuration files for many of the curator's modules.
+- [docs](docs/) - Walkthroughs and motivations for each of the modules.
+- [examples](examples/) - Example scripts for how users may want to compose the curator.
+- [nemo_curator](nemo_curator/) - The main home for all the NeMo Curator's Python APIs.
+ - [modules](nemo_curator/modules) - Classes for the modules.
+ - [filters](nemo_curator/filters) - Classes for the filters.
+ - [utils](nemo_curator/utils) - Common utilities for file/network operations.
+- [tests](tests/) - Unit tests for each module.
+
+## Examples and Documentation
+Examples provide an easy way for users to see how the curator works in action.
+There should be at least one example per module in the curator.
+They should be incredibly lightweight and rely on the core `nemo_curator` modules for their functionality.
+Most should be designed for a user to get up and running on their local machine, but distributed examples are welcome where they make sense.
+Python scripts should be the primary way to showcase your module.
+However, SLURM or other cluster scripts should be included if special steps are needed to run the module.
+
+The documentation should complement each example by going through the motivation behind why a user would use each module.
+It should include both an explanation of the module and how it is used in its corresponding example.
+The documentation should also cover potential pitfalls and performance considerations when running the module at scale.
+The existing examples and documentation should serve as a good reference for what is expected.
+
+## Python style
+We use ``black`` as our style guide. To fix your format run `pip install pre-commit && pre-commit install && pre-commit run --all`.
+
+1. Include docstrings for every class and method exposed to the user.
+1. Avoid wildcard imports (``from X import *``) unless ``__all__`` is defined in ``X.py``.
+1. Minimize the use of ``**kwargs``.
+1. Raising errors is preferred to ``assert``. Write ``if X: raise Error`` instead of ``assert X``.
+1. Classes are preferred to standalone methods.
+1. Methods should be atomic. A method shouldn't be longer than 75 lines, i.e., it should fit on the screen without scrolling.
+1. If a method has arguments that don't fit into one line, each argument should be on its own line for readability.
+1. Add ``__init__.py`` for every folder.
+1. F-strings are preferred to other string formatting methods.
+1. Loggers are preferred to print.
+1. Private functions (functions starting with ``_``) shouldn't be called outside their host file.
+1. If a comment spans multiple lines, use ``'''`` instead of ``#``.
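+
+The short sketch below illustrates several of these conventions (docstrings, raising errors instead of asserting, f-strings, and loggers); the class and method names are illustrative only:
+
+```python
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+class DocumentCounter:
+    """Counts documents in a list. Illustrative example only."""
+
+    def count(self, documents: list) -> int:
+        """Returns the number of documents in the given list."""
+        if documents is None:
+            raise ValueError("documents must not be None")
+        logger.info(f"Counting {len(documents)} documents")
+        return len(documents)
+```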
+
+## Unit tests
+Unit tests should be simple and fast.
+Developers should be able to run them frequently while developing without any slowdown.
+```
+pytest
+# If you don't have NVIDIA GPU do:
+# pytest --cpu
+```
+
+## Pull Requests (PR) Guidelines
+
+**Send your PRs to the `main` or `dev` branch**
+
+1) Make sure your PR does one thing. Have a clear answer to "What does this PR do?".
+2) Read the General Principles and style guide above
+3) Make sure you sign your commits. E.g. use ``git commit -sS`` when committing.
+4) Make sure all unit tests finish successfully before sending the PR by running ``pytest`` (or, if your dev box does not have a GPU, ``pytest --cpu``) from the root folder
+5) Send your PR and request a review
+
+The `dev` branch is for active development and may be unstable. Unit tests are expected to pass before merging into `dev` or `main`.
+At every release, `dev` and `main` are synced to be identical.
+
+Signing your commits certifies the Developer Certificate of Origin (DCO). Full text of the DCO:
+
+```
+Developer Certificate of Origin
+Version 1.1
+
+Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
+1 Letterman Drive
+Suite D4700
+San Francisco, CA, 94129
+
+Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
+
+Developer's Certificate of Origin 1.1
+
+By making a contribution to this project, I certify that:
+
+(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
+
+(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
+
+(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
+
+(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
+```
+
+## Whom should you ask for review:
+
+Joseph Jennings (@jjennings) or Ryan Wolf (@rywolf)
+
+They may ask for other reviewers depending on the scope of the change. Your pull requests must pass all checks and peer-review before they can be merged.
+
+
+Thank you for contributing to NeMo Curator!
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 00000000..261eeb9e
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,201 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
diff --git a/README.md b/README.md
new file mode 100644
index 00000000..60911ade
--- /dev/null
+++ b/README.md
@@ -0,0 +1,129 @@
+# NeMo Curator
+
+NeMo Curator is a Python library that consists of a collection of scalable data-mining modules for curating natural language processing (NLP) data for training large language models (LLMs). The modules within NeMo Curator enable NLP researchers to mine high-quality text at scale from massive uncurated web corpora. For a demonstration of how each of the modules in NeMo Curator improves downstream performance, check out the [module ablation](#module-ablation-and-compute-performance).
+
+NeMo Curator is built on [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids) to scale data curation and provide GPU acceleration. The Python interface provides easy methods to expand the functionality of your curation pipeline without worrying about how it will scale. More information can be found in the [usage section](#usage). There are many ways to integrate NeMo Curator in your pipeline. Check out the [installation instructions](#installation) for how to get started using it.
+
+## Features
+We currently support the following data-curation modules. For more details on each module, visit its documentation page in the [NeMo framework user guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/index.html).
+ - [Data download and text extraction](docs/user-guide/Download.rst)
+ - Default implementations of download and extraction of Common Crawl, Wikipedia, and ArXiv data
+ - Users can easily customize the download and extraction and extend to other datasets
+ - [Language identification and separation](docs/user-guide/LanguageIdentificationUnicodeFormatting.rst)
+ - Language identification with [fastText](https://fasttext.cc/docs/en/language-identification.html) and [pycld2](https://pypi.org/project/pycld2/)
+ - [Text reformatting and cleaning](docs/user-guide/LanguageIdentificationUnicodeFormatting.rst)
+ - Fix unicode decoding errors via [ftfy](https://ftfy.readthedocs.io/en/latest/)
+ - [Quality filtering](docs/user-guide/QualityFiltering.rst)
+ - Multilingual heuristic-based filtering
+ - Classifier-based filtering via [fastText](https://fasttext.cc/)
+ - [Document-level deduplication](docs/user-guide/GpuDeduplication.rst)
+ - Both exact and fuzzy deduplication are accelerated using cuDF and Dask.
+ - For fuzzy deduplication, our implementation follows the method described in [Microsoft Turing NLG 530B](https://arxiv.org/abs/2201.11990).
+ - [Multilingual downstream-task decontamination](docs/user-guide/TaskDecontamination.rst)
+ - Our implementation follows the approach of [OpenAI GPT3](https://arxiv.org/pdf/2005.14165.pdf) and [Microsoft Turing NLG 530B](https://arxiv.org/abs/2201.11990)
+ - [Distributed data classification](docs/user-guide/DistributedDataClassification.rst)
+ - Multi-node multi-GPU classifier inference
+ - Allows for sophisticated domain and quality classification
+ - Flexible interface for extending to your own classifier network
+ - [Personal identifiable information (PII) redaction](docs/user-guide/PersonalIdentifiableInformationIdentificationAndRemoval.rst)
+   - Identification tools for removing addresses, credit card numbers, social security numbers, and more.
+
+These modules are designed to be flexible and allow for reordering with few exceptions. The [NeMo Framework Launcher](https://github.com/NVIDIA/NeMo-Megatron-Launcher) includes prebuilt pipelines for you to start with and modify as needed.
+
+## Learn More
+- [Documentation](docs/)
+- [Examples](examples/)
+- [Module Ablation and Compute Performance](#module-ablation-and-compute-performance)
+
+## Installation
+
+NeMo Curator currently requires a GPU with CUDA 12 or above installed in order to be used.
+
+NeMo Curator can be installed manually by cloning the repository and installing as follows:
+```
+pip install --extra-index-url https://pypi.nvidia.com .
+```
+NeMo Curator is also available in the [NeMo Framework Container](https://registry.ngc.nvidia.com/orgs/ea-bignlp/teams/ga-participants/containers/nemofw-training), which can be applied for [here](https://developer.nvidia.com/nemo-framework); it comes preinstalled in the container.
+
+## Usage
+
+### Python Library
+
+```Python
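+# Imports are omitted for brevity; the downloader, modules, filters, and tasks used below all come from the nemo_curator package.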
+# Download your dataset
+dataset = download_common_crawl("/datasets/common_crawl/", "2021-04", "2021-10", url_limit=10)
+# Build your pipeline
+curation_pipeline = Sequential([
+ Modify(UnicodeReformatter()),
+ ScoreFilter(WordCountFilter(min_words=80)),
+ ScoreFilter(FastTextQualityFilter(model_path="model.bin")),
+ TaskDecontamination([Winogrande(), Squad(), TriviaQA()])
+])
+# Curate your dataset
+curated_dataset = curation_pipeline(dataset)
+```
+
+NeMo Curator provides a collection of robust Python modules that can be chained together to construct your entire data curation pipeline. These modules can be run on your local machine or in a distributed compute environment like SLURM with no modifications. NeMo Curator provides simple base classes that you can inherit from to create your own filters, document modifiers, and other extensions without needing to worry about how they scale. The [examples](examples/) directory contains scripts showcasing each of these modules. The data curation section of the [NeMo framework user guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/index.html) provides in-depth documentation on how each of the modules works. If you need more information on modifying NeMo Curator for your use case, the [implementation section](#implementation) provides a good starting point.
+
+### Scripts
+
+We also provide CLI scripts in case those are more convenient. The scripts under `nemo_curator/scripts` map closely to the Python modules. Visit the [documentation](docs) for each of the Python modules for more information about the scripts associated with them.
+
+
+### NeMo Framework Launcher
+[NeMo Megatron Launcher](https://github.com/NVIDIA/NeMo-Megatron-Launcher) is another way to interface with NeMo Curator. The launcher allows
+for easy parameter and cluster configuration and will automatically generate the SLURM batch scripts that wrap around the CLI scripts required to run your pipeline.
+Note: This is not the only way to run NeMo Curator on SLURM. There are example scripts in [`examples/slurm`](examples/slurm/) for running NeMo Curator on SLURM without the launcher.
+
+## Module Ablation and Compute Performance
+
+The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so
+in a scalable manner. In order to assess the quality of the Common Crawl documents curated by the modules in NeMo Curator, we performed a series
+of ablation experiments in which we trained a 357M-parameter GPT-style model on the datasets resulting from the different stages of our data curation
+pipeline implemented in NeMo Curator. The figure below demonstrates that the different data curation modules implemented within NeMo Curator
+lead to improved model zero-shot downstream task performance.
+
+
+*(Figure: zero-shot downstream task performance of the 357M-parameter GPT-style model trained on data from successive stages of the NeMo Curator pipeline.)*
+
+In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 trillion token Red Pajama dataset in 1.8 hours using 64 A100s.
+
+Additionally, using the CPU-based modules the table below shows the time required and resulting data size reduction of each step of processing the [Common Crawl snapshot from November/December of 2020](https://commoncrawl.org/2020/12/nov-dec-2020-crawl-archive-now-available/) using 30 CPU nodes (with hardware similar to the `c5.24xlarge` [Amazon AWS C5 instance](https://aws.amazon.com/ec2/instance-types/c5/)):
+
+<table>
+  <thead>
+    <tr>
+      <th>Dataset</th>
+      <th colspan="2">Download and text extraction</th>
+      <th colspan="2">Text cleaning</th>
+      <th colspan="2">Quality filtering</th>
+    </tr>
+    <tr>
+      <th></th>
+      <th>Time</th>
+      <th>Output Size</th>
+      <th>Time</th>
+      <th>Output Size</th>
+      <th>Time</th>
+      <th>Output Size</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Common Crawl 2020-50</td>
+      <td>36 hrs</td>
+      <td>2.8 TB</td>
+      <td>1 hr</td>
+      <td>2.8 TB</td>
+      <td>0.2 hr</td>
+      <td>0.52 TB</td>
+    </tr>
+  </tbody>
+</table>
+
+## Implementation
+
+As mentioned above, the modules within NeMo Curator enable users to scale data-mining and NLP processing tasks to many nodes within a compute cluster.
+The modules accomplish this using [Dask](https://www.dask.org/) with [cuDF](https://docs.rapids.ai/api/cudf/nightly/user_guide/10min/) (for the GPU-accelerated modules).
+At the core of NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-accelerated exact and fuzzy deduplication.
\ No newline at end of file
diff --git a/SECURITY.md b/SECURITY.md
new file mode 100644
index 00000000..2be787ab
--- /dev/null
+++ b/SECURITY.md
@@ -0,0 +1,24 @@
+# Security
+
+NVIDIA is dedicated to the security and trust of our software products and services, including all source code repositories managed through our organization.
+
+If you need to report a security issue, please use the appropriate contact points outlined below. **Please do not report security vulnerabilities through GitHub.**
+
+## Reporting Potential Security Vulnerability in an NVIDIA Product
+
+To report a potential security vulnerability in any NVIDIA product:
+- Web: [Security Vulnerability Submission Form](https://www.nvidia.com/object/submit-security-vulnerability.html)
+- E-Mail: psirt@nvidia.com
+ - We encourage you to use the following PGP key for secure email communication: [NVIDIA public PGP Key for communication](https://www.nvidia.com/en-us/security/pgp-key)
+ - Please include the following information:
+ - Product/Driver name and version/branch that contains the vulnerability
+ - Type of vulnerability (code execution, denial of service, buffer overflow, etc.)
+ - Instructions to reproduce the vulnerability
+ - Proof-of-concept or exploit code
+ - Potential impact of the vulnerability, including how an attacker could exploit the vulnerability
+
+While NVIDIA currently does not have a bug bounty program, we do offer acknowledgement when an externally reported security issue is addressed under our coordinated vulnerability disclosure policy. Please visit our [Product Security Incident Response Team (PSIRT)](https://www.nvidia.com/en-us/security/psirt-policies/) policies page for more information.
+
+## NVIDIA Product Security
+
+For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security
\ No newline at end of file
diff --git a/config/arxiv_builder.yaml b/config/arxiv_builder.yaml
new file mode 100644
index 00000000..007566d6
--- /dev/null
+++ b/config/arxiv_builder.yaml
@@ -0,0 +1,11 @@
+download_module: nemo_curator.download.arxiv.ArxivDownloader
+download_params: {}
+iterator_module: nemo_curator.download.arxiv.ArxivIterator
+iterator_params:
+ log_frequency: 1000
+extract_module: nemo_curator.download.arxiv.ArxivExtractor
+extract_params: {}
+format:
+ text: str
+ id: str
+ source_id: str
\ No newline at end of file
diff --git a/config/cc_warc_builder.yaml b/config/cc_warc_builder.yaml
new file mode 100644
index 00000000..3e6a8ed9
--- /dev/null
+++ b/config/cc_warc_builder.yaml
@@ -0,0 +1,12 @@
+download_module: nemo_curator.download.commoncrawl.CommonCrawlWARCDownloader
+download_params: {}
+iterator_module: nemo_curator.download.commoncrawl.CommonCrawlWARCIterator
+iterator_params: {}
+extract_module: nemo_curator.download.commoncrawl.CommonCrawlWARCExtractor
+extract_params: {}
+format:
+ text: str
+ language: str
+ url: str
+ warc_id: str
+ source_id: str
\ No newline at end of file
diff --git a/config/fasttext_langid.yaml b/config/fasttext_langid.yaml
new file mode 100644
index 00000000..86b18761
--- /dev/null
+++ b/config/fasttext_langid.yaml
@@ -0,0 +1,5 @@
+input_field: text
+filters:
+ - name: nemo_curator.filters.classifier_filter.FastTextLangId
+ params:
+ model_path:
diff --git a/config/fasttext_quality_filter.yaml b/config/fasttext_quality_filter.yaml
new file mode 100644
index 00000000..05054781
--- /dev/null
+++ b/config/fasttext_quality_filter.yaml
@@ -0,0 +1,12 @@
+input_field: text
+filters:
+ - name: nemo_curator.filters.classifier_filter.FastTextQualityFilter
+ params:
+ # FastText Model file
+ model_path:
+ # Pareto sampling parameter
+ # (Higher alpha values will allow fewer low-quality documents
+ # to pass through)
+ alpha: 3
+ # The label used for high-quality documents
+ label: "__label__hq"
diff --git a/config/heuristic_filter_code.yaml b/config/heuristic_filter_code.yaml
new file mode 100644
index 00000000..7d6b36e4
--- /dev/null
+++ b/config/heuristic_filter_code.yaml
@@ -0,0 +1,20 @@
+input_field: text
+filters:
+ # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
+ # This particular cascade of filters is intended to filter Python code data.
+ # The filter listed at the top will be applied first, and the following filters will be applied in
+ # the order they appear in this file. Each filter can be removed and re-ordered as desired.
+  # Change the filters below based on the programming language of the data.
+  # Code filter implementations are in nemo_curator/filters/code.py
+ - name: nemo_curator.filters.code.PythonCommentToCodeFilter
+ params:
+ min_comment_to_code_ratio: 0.001
+ max_comment_to_code_ratio: 0.85
+ - name: nemo_curator.filters.code.NumberOfLinesOfCodeFilter
+ params:
+ min_lines: 5
+ max_lines: 20000
+ - name: nemo_curator.filters.code.TokenizerFertilityFilter
+ params:
+ path_to_tokenizer:
+ min_char_to_token_ratio: 2
diff --git a/config/heuristic_filter_en.yaml b/config/heuristic_filter_en.yaml
new file mode 100644
index 00000000..4e3bbb79
--- /dev/null
+++ b/config/heuristic_filter_en.yaml
@@ -0,0 +1,105 @@
+input_field: text
+filters:
+ # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
+ # This particular cascade of filters is intended to filter English language data.
+ # The filter listed at the top will be applied first, and the following filters will be applied in
+ # the order they appear in this file. Each filter can be removed and re-ordered as desired.
+ - name: nemo_curator.filters.heuristic_filter.NonAlphaNumericFilter
+ params:
+ max_non_alpha_numeric_to_text_ratio: 0.25
+ - name: nemo_curator.filters.heuristic_filter.SymbolsToWordsFilter
+ params:
+ max_symbol_to_word_ratio: 0.1
+ - name: nemo_curator.filters.heuristic_filter.NumbersFilter
+ params:
+ max_number_to_text_ratio: 0.15
+ - name: nemo_curator.filters.heuristic_filter.UrlsFilter
+ params:
+ max_url_to_text_ratio: 0.2
+ - name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
+ params:
+ max_white_space_ratio: 0.25
+ - name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
+ params:
+ max_parentheses_ratio: 0.1
+ - name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
+ params:
+ remove_if_at_top_or_bottom: True
+ max_boilerplate_string_ratio: 0.4
+ - name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter
+ params:
+ max_repeated_line_fraction: 0.7
+ - name: nemo_curator.filters.heuristic_filter.RepeatedParagraphsFilter
+ params:
+ max_repeated_paragraphs_ratio: 0.7
+ - name: nemo_curator.filters.heuristic_filter.RepeatedLinesByCharFilter
+ params:
+ max_repeated_lines_char_ratio: 0.8
+ - name: nemo_curator.filters.heuristic_filter.RepeatedParagraphsByCharFilter
+ params:
+ max_repeated_paragraphs_char_ratio: 0.8
+ - name: nemo_curator.filters.heuristic_filter.WordCountFilter
+ params:
+ min_words: 50
+ max_words: 100000
+ - name: nemo_curator.filters.heuristic_filter.PunctuationFilter
+ params:
+ max_num_sentences_without_endmark_ratio: 0.85
+ - name: nemo_curator.filters.heuristic_filter.WordsWithoutAlphabetsFilter
+ params:
+ min_words_with_alphabets: 0.8
+ - name: nemo_curator.filters.heuristic_filter.CommonEnglishWordsFilter
+ params:
+ min_num_common_words: 2
+ stop_at_false: True
+ - name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
+ params:
+ max_mean_word_length: 10
+ min_mean_word_length: 3
+ - name: nemo_curator.filters.heuristic_filter.LongWordFilter
+ params:
+ max_word_length: 1000
+ - name: nemo_curator.filters.heuristic_filter.EllipsisFilter
+ params:
+ max_num_lines_ending_with_ellipsis_ratio: 0.3
+ # Top N-Gram filters for N-grams 2, 3, and 4
+ - name: nemo_curator.filters.heuristic_filter.RepeatingTopNGramsFilter
+ params:
+ n: 2
+ max_repeating_ngram_ratio: 0.2
+ - name: nemo_curator.filters.heuristic_filter.RepeatingTopNGramsFilter
+ params:
+ n: 3
+ max_repeating_ngram_ratio: 0.18
+ - name: nemo_curator.filters.heuristic_filter.RepeatingTopNGramsFilter
+ params:
+ n: 4
+ max_repeating_ngram_ratio: 0.16
+ # Duplicate N-gram filters for N-grams 5, 6, 7, 8, 9, and 10
+ - name: nemo_curator.filters.heuristic_filter.RepeatingDuplicateNGramsFilter
+ params:
+ n: 5
+ max_repeating_duplicate_ngram_ratio: 0.15
+ - name: nemo_curator.filters.heuristic_filter.RepeatingDuplicateNGramsFilter
+ params:
+ n: 6
+ max_repeating_duplicate_ngram_ratio: 0.14
+ - name: nemo_curator.filters.heuristic_filter.RepeatingDuplicateNGramsFilter
+ params:
+ n: 7
+ max_repeating_duplicate_ngram_ratio: 0.13
+ - name: nemo_curator.filters.heuristic_filter.RepeatingDuplicateNGramsFilter
+ params:
+ n: 8
+ max_repeating_duplicate_ngram_ratio: 0.12
+ - name: nemo_curator.filters.heuristic_filter.RepeatingDuplicateNGramsFilter
+ params:
+ n: 9
+ max_repeating_duplicate_ngram_ratio: 0.11
+ - name: nemo_curator.filters.heuristic_filter.RepeatingDuplicateNGramsFilter
+ params:
+ n: 10
+ max_repeating_duplicate_ngram_ratio: 0.10
+ - name: nemo_curator.filters.heuristic_filter.BulletsFilter
+ params:
+ max_bullet_lines_ratio: 0.9
\ No newline at end of file
diff --git a/config/heuristic_filter_non-en.yaml b/config/heuristic_filter_non-en.yaml
new file mode 100644
index 00000000..783d0e54
--- /dev/null
+++ b/config/heuristic_filter_non-en.yaml
@@ -0,0 +1,97 @@
+input_field: text
+filters:
+ # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
+ # This particular cascade of filters is intended to filter generic non-English data that use spaces for separating words.
+ # The filter listed at the top will be applied first, and the following filters will be applied in
+ # the order they appear in this file. Each filter can be removed and re-ordered as desired.
+ - name: nemo_curator.filters.heuristic_filter.SymbolsToWordsFilter
+ params:
+ max_symbol_to_word_ratio: 0.1
+ - name: nemo_curator.filters.heuristic_filter.NumbersFilter
+ params:
+ max_number_to_text_ratio: 0.15
+ - name: nemo_curator.filters.heuristic_filter.UrlsFilter
+ params:
+ max_url_to_text_ratio: 0.2
+ - name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
+ params:
+ max_white_space_ratio: 0.25
+ - name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
+ params:
+ max_parentheses_ratio: 0.1
+ - name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
+ params:
+ remove_if_at_top_or_bottom: True
+ max_boilerplate_string_ratio: 0.4
+ - name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter
+ params:
+ max_repeated_line_fraction: 0.7
+ - name: nemo_curator.filters.heuristic_filter.RepeatedParagraphsFilter
+ params:
+ max_repeated_paragraphs_ratio: 0.7
+ - name: nemo_curator.filters.heuristic_filter.RepeatedLinesByCharFilter
+ params:
+ max_repeated_lines_char_ratio: 0.8
+ - name: nemo_curator.filters.heuristic_filter.RepeatedParagraphsByCharFilter
+ params:
+ max_repeated_paragraphs_char_ratio: 0.8
+ - name: nemo_curator.filters.heuristic_filter.WordCountFilter
+ params:
+ min_words: 50
+ max_words: 100000
+ # NOTE: This filter tends to remove many documents and will need to
+ # be tuned per language
+ - name: nemo_curator.filters.heuristic_filter.PunctuationFilter
+ params:
+ max_num_sentences_without_endmark_ratio: 0.85
+ - name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
+ params:
+ max_mean_word_length: 10
+ min_mean_word_length: 3
+ - name: nemo_curator.filters.heuristic_filter.LongWordFilter
+ params:
+ max_word_length: 1000
+ - name: nemo_curator.filters.heuristic_filter.EllipsisFilter
+ params:
+ max_num_lines_ending_with_ellipsis_ratio: 0.3
+ # Top N-Gram filters for N-grams 2, 3, and 4
+ - name: nemo_curator.filters.heuristic_filter.RepeatingTopNGramsFilter
+ params:
+ n: 2
+ max_repeating_ngram_ratio: 0.2
+ - name: nemo_curator.filters.heuristic_filter.RepeatingTopNGramsFilter
+ params:
+ n: 3
+ max_repeating_ngram_ratio: 0.18
+ - name: nemo_curator.filters.heuristic_filter.RepeatingTopNGramsFilter
+ params:
+ n: 4
+ max_repeating_ngram_ratio: 0.16
+ # Duplicate N-gram filters for N-grams 5, 6, 7, 8, 9, and 10
+ - name: nemo_curator.filters.heuristic_filter.RepeatingDuplicateNGramsFilter
+ params:
+ n: 5
+ max_repeating_duplicate_ngram_ratio: 0.15
+ - name: nemo_curator.filters.heuristic_filter.RepeatingDuplicateNGramsFilter
+ params:
+ n: 6
+ max_repeating_duplicate_ngram_ratio: 0.14
+ - name: nemo_curator.filters.heuristic_filter.RepeatingDuplicateNGramsFilter
+ params:
+ n: 7
+ max_repeating_duplicate_ngram_ratio: 0.13
+ - name: nemo_curator.filters.heuristic_filter.RepeatingDuplicateNGramsFilter
+ params:
+ n: 8
+ max_repeating_duplicate_ngram_ratio: 0.12
+ - name: nemo_curator.filters.heuristic_filter.RepeatingDuplicateNGramsFilter
+ params:
+ n: 9
+ max_repeating_duplicate_ngram_ratio: 0.11
+ - name: nemo_curator.filters.heuristic_filter.RepeatingDuplicateNGramsFilter
+ params:
+ n: 10
+ max_repeating_duplicate_ngram_ratio: 0.10
+ - name: nemo_curator.filters.heuristic_filter.BulletsFilter
+ params:
+ max_bullet_lines_ratio: 0.9
\ No newline at end of file
diff --git a/config/lm_tasks.yaml b/config/lm_tasks.yaml
new file mode 100644
index 00000000..3d38ec6f
--- /dev/null
+++ b/config/lm_tasks.yaml
@@ -0,0 +1,48 @@
+tasks:
+ # The Python modules below define language model downstream evaluation
+ # task data. If one of the below tasks is specified, N-grams will
+ # be constructed from the documents that make up the task data
+ # using the script prepare_task_data.
+ # find_matching_ngrams will then search for these N-grams
+ # in the training documents, and remove_matching_ngrams will
+ # split the documents based on matches
+ - name: nemo_curator.tasks.Winogrande
+ params: {}
+ - name: nemo_curator.tasks.Squad
+ params: {}
+ - name: nemo_curator.tasks.TriviaQA
+ params: {}
+ - name: nemo_curator.tasks.Quac
+ params: {}
+ - name: nemo_curator.tasks.WebQA
+ params: {}
+ - name: nemo_curator.tasks.Race
+ params: {}
+ - name: nemo_curator.tasks.Drop
+ params: {}
+ - name: nemo_curator.tasks.WiC
+ params: {}
+ - name: nemo_curator.tasks.PIQA
+ params: {}
+ - name: nemo_curator.tasks.ArcEasy
+ params: {}
+ - name: nemo_curator.tasks.ArcChallenge
+ params: {}
+ - name: nemo_curator.tasks.OpenBookQA
+ params: {}
+ - name: nemo_curator.tasks.BoolQ
+ params: {}
+ - name: nemo_curator.tasks.Copa
+ params: {}
+ - name: nemo_curator.tasks.RTE
+ params: {}
+ - name: nemo_curator.tasks.MultiRC
+ params: {}
+ - name: nemo_curator.tasks.WSC
+ params: {}
+ - name: nemo_curator.tasks.CB
+ params: {}
+ - name: nemo_curator.tasks.ANLI
+ params: {}
+ - name: nemo_curator.tasks.Record
+ params: {}
diff --git a/config/pii_config.yaml b/config/pii_config.yaml
new file mode 100644
index 00000000..725fde30
--- /dev/null
+++ b/config/pii_config.yaml
@@ -0,0 +1,16 @@
+pii_config:
+ language: 'en'
+ supported_entities:
+ - PERSON
+ - ADDRESS
+ anonymize:
+ #type: 'replace'
+ #new_value: ABC
+ action: 'mask'
+ chars_to_mask: 40
+ masking_char: '*'
+
+ #type: 'hash'
+ #hash_type: 'sha256'
+
+ #type: 'redact'
\ No newline at end of file
diff --git a/config/wikipedia_builder.yaml b/config/wikipedia_builder.yaml
new file mode 100644
index 00000000..47831537
--- /dev/null
+++ b/config/wikipedia_builder.yaml
@@ -0,0 +1,15 @@
+download_module: nemo_curator.download.wikipedia.WikipediaDownloader
+download_params: {}
+iterator_module: nemo_curator.download.wikipedia.WikipediaIterator
+iterator_params:
+ language: 'en'
+extract_module: nemo_curator.download.wikipedia.WikipediaExtractor
+extract_params:
+ language: 'en'
+format:
+ text: str
+ title: str
+ id: str
+ url: str
+ language: str
+ source_id: str
\ No newline at end of file
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 00000000..06edad0f
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,3 @@
+# Documentation
+## User Guide
+The NeMo User Guide is the recommended way to view the documentation for each of NeMo Curator's modules.
diff --git a/docs/user-guide/CPUvsGPU.rst b/docs/user-guide/CPUvsGPU.rst
new file mode 100644
index 00000000..c3159b21
--- /dev/null
+++ b/docs/user-guide/CPUvsGPU.rst
@@ -0,0 +1,98 @@
+======================================
+CPU and GPU Modules with Dask
+======================================
+
+NeMo Curator provides GPU-accelerated modules alongside its CPU modules.
+These modules are based on RAPIDS to enable scaling workflows to massive dataset sizes.
+The remaining modules are CPU based and rely on Dask to scale to multi-node clusters.
+When working with these different modules, it's important to understand how to properly set up your Dask cluster and how to manage where your dataset is stored in memory.
+
+-----------------------------------------
+Initializing the Dask Cluster
+-----------------------------------------
+
+NeMo Curator provides a simple function ``get_client`` that can be used to start a local Dask cluster or connect to an existing one.
+All of the ``examples/`` use it to set up a Dask cluster.
+
+.. code-block:: python
+
+ import argparse
+ from nemo_curator.utils.distributed_utils import get_client
+ from nemo_curator.utils.script_utils import add_distributed_args
+
+
+ def main(args):
+ # Set up Dask client
+ client = get_client(args, args.device)
+
+ # Perform some computation...
+
+ def attach_args(parser=argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)):
+ return add_distributed_args(parser)
+
+ if __name__ == "__main__":
+ main(attach_args().parse_args())
+
+In this short example, you can see that ``get_client`` takes an argparse object with options configured in the corresponding ``add_distributed_args``.
+
+* ``--device`` controls what type of Dask cluster to create. "cpu" will create a CPU based local Dask cluster, while "gpu" will create a GPU based local cluster.
+ If "cpu" is specified, the number of processes started with the cluster can be specified with the ``--n-workers`` argument.
+ By default, this argument is set to ``os.cpu_count()``.
+ If "gpu" is specified, one worker is started per GPU.
+ It is possible to run entirely CPU-based workflows on a GPU cluster, though the process count (and therefore the number of parallel tasks) will be limited by the number of GPUs on your machine.
+
+* ``--scheduler-address`` and ``--scheduler-file`` are used for connecting to an existing Dask cluster.
+ Supplying one of these is essential if you are running a Dask cluster on SLURM or Kubernetes.
+  All other arguments are ignored if either of these is passed, as the cluster configuration is done when you create the scheduler and workers on your cluster (see the sketch below).
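+
+For example, a script following the pattern above could connect to an existing Dask cluster as in the minimal sketch below (the scheduler file path is hypothetical; it would be produced when you launch the scheduler on your cluster):
+
+.. code-block:: python
+
+    # Reuse the argument parser and get_client from the example above to attach
+    # to an already-running Dask cluster via its scheduler file.
+    args = attach_args().parse_args(["--scheduler-file", "/path/to/scheduler.json"])
+    client = get_client(args, args.device)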
+
+-----------------------------------------
+CPU Modules
+-----------------------------------------
+
+As mentioned in the ``DocumentDataset`` documentation, the underlying storage format for datasets in NeMo Curator is just a Dask dataframe.
+For the CPU modules, Dask uses pandas dataframes to hold dataframe partitions.
+Most modules in NeMo Curator are CPU based.
+Therefore, the default behavior for reading and writing datasets is to operate on them in CPU memory with a pandas backend.
+The following two function calls are equivalent.
+
+.. code-block:: python
+
+ books = DocumentDataset.read_json(files, add_filename=True)
+ books = DocumentDataset.read_json(files, add_filename=True, backend="pandas")
+
+
+-----------------------------------------
+GPU Modules
+-----------------------------------------
+
+The following NeMo Curator modules are GPU based.
+
+* Exact Deduplication
+* Fuzzy Deduplication
+* Distributed Data Classification
+
+ * Domain Classification
+ * Quality Classification
+
+GPU modules store the ``DocumentDataset`` using a ``cudf`` backend instead of a ``pandas`` one.
+To read a dataset into GPU memory, one could use the following function call.
+
+.. code-block:: python
+
+ gpu_books = DocumentDataset.read_json(files, add_filename=True, backend="cudf")
+
+
+Even if you start a GPU Dask cluster, you can't operate on datasets that use a ``pandas`` backend with the GPU modules.
+The ``DocumentDataset`` must either have been originally read in with a ``cudf`` backend, or it must be transferred to the GPU during the script.
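+
+One possible way to perform that transfer is sketched below. This is only a sketch: it assumes your installed Dask version provides ``DataFrame.to_backend`` and that ``DocumentDataset`` can be constructed directly from the underlying Dask dataframe.
+
+.. code-block:: python
+
+    # Assumed sketch: move a dataset read with the default pandas backend onto
+    # the GPU by converting its underlying Dask dataframe to a cudf backend.
+    cpu_books = DocumentDataset.read_json(files, add_filename=True)
+    gpu_books = DocumentDataset(cpu_books.df.to_backend("cudf"))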
+
+-----------------------------------------
+Dask with SLURM
+-----------------------------------------
+
+We provide an example SLURM script pipeline in ``examples/slurm``.
+This pipeline has a script ``start-slurm.sh`` that provides configuration options similar to what ``get_client`` provides.
+Every SLURM cluster is different, so make sure you understand how your SLURM cluster works so the scripts can be easily adapted.
+``start-slurm.sh`` calls ``container-entrypoint.sh`` which sets up a Dask scheduler and workers across the cluster.
+
+Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` to run on multiple nodes.
+You can easily adapt your own scripts by following the same pattern of adding ``get_client`` with ``add_distributed_args``.
\ No newline at end of file
diff --git a/docs/user-guide/DataCuration.rsts b/docs/user-guide/DataCuration.rsts
new file mode 100644
index 00000000..55083b44
--- /dev/null
+++ b/docs/user-guide/DataCuration.rsts
@@ -0,0 +1,2 @@
+Data Curation
+!!!!!!!!!!!!!
diff --git a/docs/user-guide/DistributedDataClassification.rst b/docs/user-guide/DistributedDataClassification.rst
new file mode 100644
index 00000000..b7a99a20
--- /dev/null
+++ b/docs/user-guide/DistributedDataClassification.rst
@@ -0,0 +1,71 @@
+============================================
+Distributed Data Classification
+============================================
+
+-----------------------------------------
+Background
+-----------------------------------------
+
+When preparing text data to be used in training a large language model (LLM), it is useful to classify
+text documents in various ways, to enhance the LLM's performance by making it able to produce more
+contextually appropriate and accurate language across various subjects. NeMo Curator provides this module to
+help a user run inference with pre-trained models on large amounts of text documents. We achieve
+this by chunking the datasets across multiple computing nodes, each equipped with multiple GPUs, to
+accelerate the classification task in a distributed way. In other words, because the classification of
+a single text document is independent of other documents within a dataset, we can distribute the
+workload across multiple nodes and multiple GPUs to perform parallel processing.
+
+Domain classification and quality classification are two tasks we include as examples within our module.
+Here, we summarize why each is useful for training an LLM.
+
+Domain classification is useful because it helps the LLM understand the context and specific domain of
+the input text. Because different domains have different linguistic characteristics and terminologies,
+an LLM's ability to generate contextually relevant responses can be improved by tailoring training data
+to a specific domain. Overall, this helps provide more accurate and specialized information.
+
+Quality classification is useful for filtering out noisy or low quality data. This allows the model to
+focus on learning from high quality and informative examples, which contributes to the LLM's robustness
+and enhances its ability to generate reliable and meaningful outputs. Additionally, quality
+classification helps mitigate biases and inaccuracies that may arise from poorly curated training data.
+
+-----------------------------------------
+Usage
+-----------------------------------------
+
+NeMo Curator provides a base class ``DistributedDataClassifier`` that can be extended to fit your specific model.
+The only requirement is that the model can fit on a single GPU.
+We have also provided two subclasses that focus on domain and quality classification.
+Let's see how ``DomainClassifier`` works in a small excerpt taken from ``examples/distributed_data_classification_examples/domain_api_example.py``:
+
+.. code-block:: python
+
+ labels = [
+ "Adult",
+ "Arts_and_Entertainment",
+ "Autos_and_Vehicles",
+ ...,
+ "Shopping",
+ "Sports",
+ "Travel_and_Transportation",
+ ]
+
+ model_file_name = "pytorch_model_file.pth"
+
+ files = get_all_files_paths_under("books_dataset/")
+ input_dataset = DocumentDataset.read_json(files, backend="cudf", add_filename=True)
+
+ domain_classifier = DomainClassifier(
+ model_file_name=model_file_name,
+ labels=labels,
+ filter_by=["Games", "Sports"],
+ )
+ result_dataset = domain_classifier(dataset=input_dataset)
+
+ result_dataset.to_json("games_and_sports/", write_to_filename=True)
+
+This module functions very similarly to the ``ScoreFilter`` module.
+The key difference is that it operates on the GPU instead of the CPU.
+Therefore, the Dask cluster must be started as a GPU one.
+And, ``DomainClassifier`` requires ``DocumentDataset`` to be on the GPU (i.e., have ``backend=cudf``).
+It is easy to extend ``DistributedDataClassifier`` to your own model.
+Check out ``nemo_curator.modules.distributed_data_classifier.py`` for reference.
\ No newline at end of file
diff --git a/docs/user-guide/DocumentDataset.rst b/docs/user-guide/DocumentDataset.rst
new file mode 100644
index 00000000..8711227a
--- /dev/null
+++ b/docs/user-guide/DocumentDataset.rst
@@ -0,0 +1,139 @@
+======================================
+Working with DocumentDataset
+======================================
+-----------------------------------------
+Background
+-----------------------------------------
+Text datasets are responsible for storing metadata along with the core text/document.
+``jsonl`` files are common for their ease of processing and inspection.
+``parquet`` files are also a common format.
+In both cases, a single dataset is often represented with multiple underlying files (called shards).
+For example, if you have a large dataset named "books" it is likely you will store it in shards with each shard being named something like ``books_00.jsonl``, ``books_01.jsonl``, ``books_02.jsonl``, etc.
+
+How you store your dataset in memory is just as important as how you store it on disk.
+If you have a large dataset that is too big to fit directly into memory, you will have to somehow distribute it across multiple machines/nodes.
+Furthermore, if curating your dataset takes a long time, it is likely to get interrupted due to some unforeseen failure or another.
+NeMo Curator's ``DocumentDataset`` employs `Dask's distributed dataframes `_ to manage large datasets across multiple nodes and allow for easy restarting of interrupted curation.
+``DocumentDataset`` supports reading and writing to sharded ``jsonl`` and ``parquet`` files both on local disk and from remote sources directly like S3.
+
+-----------------------------------------
+Usage
+-----------------------------------------
+############################
+Reading and Writing
+############################
+``DocumentDataset`` is the standard format for text datasets in NeMo Curator.
+Imagine we have a "books" dataset stored in the following structure:
+::
+
+ books_dataset/
+ books_00.jsonl
+ books_01.jsonl
+ books_02.jsonl
+
+You could read, filter, and write the dataset using the following methods:
+
+.. code-block:: python
+
+ import nemo_curator as nc
+ from nemo_curator.datasets import DocumentDataset
+ from nemo_curator.utils.file_utils import get_all_files_paths_under
+ from nemo_curator.filters import WordCountFilter
+
+ files = get_all_files_paths_under("books_dataset/")
+ books = DocumentDataset.read_json(files, add_filename=True)
+
+ filter_step = nc.ScoreFilter(
+ WordCountFilter(min_words=80),
+ text_field="text",
+ score_field="word_count",
+ )
+
+ long_books = filter_step(books)
+
+ long_books.to_json("long_books/", write_to_filename=True)
+
+Let's walk through this code line by line.
+
+* ``files = get_all_files_paths_under("books_dataset/")`` This retrieves a list of all files in the given directory.
+ In our case, this is equivalent to writing
+
+ .. code-block:: python
+
+ files = ["books_dataset/books_00.jsonl",
+ "books_dataset/books_01.jsonl",
+ "books_dataset/books_02.jsonl"]
+
+* ``books = DocumentDataset.read_json(files, add_filename=True)`` This will read the files listed into memory.
+ The ``add_filename=True`` option preserves the name of the shard (``books_00.jsonl``, ``books_01.jsonl``, etc.) as an additional ``filename`` field.
+ When the dataset is written back to disk, this option (in conjunction with the ``write_to_filename`` option) ensures that documents stay in their original shard.
+ This can be useful for manually inspecting the results of filtering shard by shard.
+* ``filter_step = ...`` This constructs a heuristic filter based on the length of the document, which is then applied with ``long_books = filter_step(books)``.
+ More information is provided in the filtering page of the documentation.
+* ``long_books.to_json("long_books/", write_to_filename=True)`` This writes the filtered dataset to a new directory.
+ As mentioned above, the ``write_to_filename=True`` preserves the sharding of the dataset.
+ If the dataset was not read in with ``add_filename=True``, setting ``write_to_filename=True`` will throw an error.
+
+``DocumentDataset`` is just a wrapper around a `Dask dataframe `_.
+The underlying dataframe can be accessed with the ``DocumentDataset.df`` member variable.
+It is important to understand how Dask handles computation.
+To quote from their `documentation `_:
+
+ Dask is lazily evaluated. The result from a computation isn't computed until you ask for it. Instead, a Dask task graph for the computation is produced.
+
+Because of this, the call to ``DocumentDataset.read_json`` will not execute immediately.
+Instead, tasks that read each shard of the dataset will be placed on the task graph.
+The task graph is only executed when a call to ``DocumentDataset.df.compute()`` is made, or some operation that depends on ``DocumentDataset.df`` calls ``.compute()``.
+This allows us to avoid reading massive datasets into memory.
+In our case, ``long_books.to_json()`` internally calls ``.compute()``, so the task graph will be executed then.
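+
+As a small illustration (reusing the objects from the example above), nothing is read from disk until a result is explicitly requested:
+
+.. code-block:: python
+
+    # Building the pipeline only extends the task graph; no data is read yet.
+    books = DocumentDataset.read_json(files, add_filename=True)
+    long_books = filter_step(books)
+
+    # Explicitly materialize the underlying Dask dataframe.
+    # Only do this when the (filtered) result fits in memory.
+    filtered_df = long_books.df.compute()
+    print(f"{len(filtered_df)} documents kept")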
+
+############################
+Resuming from Interruptions
+############################
+It can be helpful to track which documents in a dataset have already been processed so that long curation jobs can be resumed if they are interrupted.
+NeMo Curator provides a utility for easily tracking which dataset shards have already been processed.
+Consider a modified version of the code above:
+
+.. code-block:: python
+
+ from nemo_curator.utils.file_utils import get_remaining_files
+
+ files = get_remaining_files("books_dataset/", "long_books/", "jsonl")
+ books = DocumentDataset.read_json(files, add_filename=True)
+
+ filter_step = nc.ScoreFilter(
+ WordCountFilter(min_words=80),
+ text_field="text",
+ score_field="word_count",
+ )
+
+ long_books = filter_step(books)
+
+ long_books.to_json("long_books/", write_to_filename=True)
+
+``get_remaining_files`` compares the input directory (``"books_dataset/"``) with the output directory (``"long_books/"``) and returns a list of all the shards in the input directory that have not yet been written to the output directory.
+
+
+
+While Dask provides an easy way to avoid reading too much data into memory, there are times when we may need to call ``persist()`` or a similar operation that forces the dataset into memory.
+In these cases, we recommend processing the input dataset in batches using a simple wrapper function around ``get_remaining_files`` as shown below.
+
+.. code-block:: python
+
+ from nemo_curator.utils.file_utils import get_batched_files
+
+ for files in get_batched_files("books_dataset/", "long_books/", "jsonl", batch_size=64):
+ books = DocumentDataset.read_json(files, add_filename=True)
+
+ filter_step = nc.ScoreFilter(
+ WordCountFilter(min_words=80),
+ text_field="text",
+ score_field="word_count",
+ )
+
+ long_books = filter_step(books)
+
+ long_books.to_json("long_books/", write_to_filename=True)
+
+This will read in 64 shards at a time, process them, and write them back to disk.
+Like ``get_remaining_files``, it only includes files that are in the input directory and not in the output directory.
\ No newline at end of file
diff --git a/docs/user-guide/Download.rst b/docs/user-guide/Download.rst
new file mode 100644
index 00000000..66a34463
--- /dev/null
+++ b/docs/user-guide/Download.rst
@@ -0,0 +1,190 @@
+
+.. _data-curator-download:
+
+======================================
+Downloading and Extracting Text
+======================================
+-----------------------------------------
+Background
+-----------------------------------------
+Publicly hosted text datasets are stored in various locations and formats. Downloading a massive public dataset is usually the first step in data curation,
+and it can be cumbersome due to the dataset's massive size and hosting method.
+Also, massive pretraining text datasets are rarely in a format that can be immediately operated on for further curation and training.
+For example, the Common Crawl stores its data in a compressed web archive format (:code:`.warc.gz`) for its raw crawl data, but formats
+like :code:`.jsonl` are more common for data curation due to their ease of use.
+However, extraction can be by far the most computationally expensive step of the data curation pipeline, so it can be beneficial to do some filtering prior to
+the extraction step to limit the number of documents that undergo this heavy computation.
+
+NeMo Curator provides example utilities for downloading and extracting Common Crawl, ArXiv, and Wikipedia data.
+In addition, it provides a flexible interface to extend the utility to other datasets.
+Our Common Crawl example demonstrates how to process a crawl by downloading the data from S3, doing preliminary language filtering with pyCLD2,
+and extracting the relevant text with jusText to output :code:`.jsonl` files.
+
+-----------------------------------------
+Usage
+-----------------------------------------
+
+``nemo_curator.download`` has a collection of functions for handling the download and extraction of online datasets.
+By "download", we typically mean the transfer of data from a web-hosted data source to local file storage.
+By "extraction", we typically mean the process of converting a data format from its raw form (e.g., ``.warc.gz``) to a standardized format (e.g., ``.jsonl``) and discarding irrelvant data.
+
+* ``download_common_crawl`` will download and extract the compressed web archive files of common crawl snapshots to a target directory.
+ Common Crawl has an S3 bucket and a direct HTTPS endpoint. If you want to use the S3 bucket, ensure you have properly set up your credentials with `s5cmd `_.
+ Otherwise, the HTTPS endpoints will be used with ``wget``. Here is a small example of how to use it:
+
+ .. code-block:: python
+
+ from nemo_curator.download import download_common_crawl
+
+ common_crawl = download_common_crawl("/extracted/output/folder", "2020-50", "2021-04", output_type="jsonl")
+
+ * ``"/extracted/output/folder"`` is the path to on your local filesystem where the final extracted files will be placed.
+ * ``"2020-50"`` is the first common crawl snapshot that will be included in the download.
+ * ``"2021-04"`` is the last common crawl snapshot that will be included in the download.
+ * ``output_type="jsonl"`` is the file format that will be used for storing the data on disk. Currently ``"jsonl"`` and ``"parquet"`` are supported.
+
+ The return value ``common_crawl`` will be in NeMo Curator's standard ``DocumentDataset`` format. Check out the function's docstring for more parameters you can use.
+
+ NeMo Curator's Common Crawl extraction process looks like this under the hood:
+
+ 1. Decode the HTML within the record from binary to text
+ 2. If the HTML can be properly decoded, then with `pyCLD2 `_, perform language detection on the input HTML
+ 3. Finally, extract the relevant text from the HTML with `jusText `_ and write it out as a single string within the 'text' field of a JSON entry within a ``.jsonl`` file (a standalone sketch of steps 2 and 3 follows this list)
+* ``download_wikipedia`` will download and extract the latest Wikipedia dump. Files are downloaded using ``wget``. Wikipedia may download more slowly than the other datasets because Wikipedia limits the number of downloads that can occur per IP address.
+
+ .. code-block:: python
+
+ from nemo_curator.download import download_wikipedia
+
+ wikipedia = download_wikipedia("/extracted/output/folder", dump_date="20240201")
+
+ * ``"/extracted/output/folder"`` is the path to on your local filesystem where the final extracted files will be placed.
+ * ``dump_date="20240201"`` fixes the Wikipedia dump to a specific date. If no date is specified, the latest dump is used.
+
+* ``download_arxiv`` will download and extract the LaTeX versions of ArXiv papers. They are hosted on S3, so ensure you have properly set up your credentials with `s5cmd `_.
+
+ .. code-block:: python
+
+ from nemo_curator.download import download_arxiv
+
+ arxiv = download_arxiv("/extracted/output/folder")
+
+ * ``"/extracted/output/folder"`` is the path to on your local filesystem where the final extracted files will be placed.
+
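+As a rough standalone illustration of the language-detection and text-extraction steps described for Common Crawl above, the snippet below uses pyCLD2 and jusText directly. It is not NeMo Curator's exact implementation (see ``nemo_curator/download/commoncrawl.py`` for that), and it assumes the HTML has already been decoded to text:
+
+.. code-block:: python
+
+    import justext
+    import pycld2 as cld2
+
+    def detect_and_extract(html: str):
+        # Step 2: preliminary language detection on the decoded HTML.
+        is_reliable, _, details = cld2.detect(html)
+        if not is_reliable or details[0][0] != "ENGLISH":
+            return None
+
+        # Step 3: boilerplate removal with jusText, keeping only content paragraphs.
+        paragraphs = justext.justext(html, justext.get_stoplist("English"))
+        return "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)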
+
+All of these functions return a ``DocumentDataset`` of the underlying dataset and metadata that was obtained during extraction. If the dataset has been downloaded and extracted at the path passed to it, it will read from the files there instead of downloading and extracting them again.
+Because each of these datasets is massive (Common Crawl snapshots are on the order of hundreds of terabytes), all of them are sharded across different files.
+They all have a ``url_limit`` parameter that allows you to only download a small number of shards.
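+
+For example, to smoke-test a pipeline without downloading a full snapshot, you could limit the download to a handful of shards (a minimal sketch; the output path is a placeholder):
+
+.. code-block:: python
+
+    from nemo_curator.download import download_common_crawl
+
+    # Only download and extract the first 5 WARC shards of the snapshot.
+    small_crawl = download_common_crawl(
+        "/extracted/output/folder",
+        "2020-50",
+        "2020-50",
+        output_type="jsonl",
+        url_limit=5,
+    )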
+
+-----------------------------------------
+Related Scripts
+-----------------------------------------
+In addition to the Python module described above, NeMo Curator provides several CLI scripts that you may find useful for performing the same function.
+
+The :code:`download_and_extract` script within NeMo Curator is a generic tool that can be used to download and extract from a number of different
+datasets. In general, it can be called as follows in order to download and extract text from the web
+
+.. code-block:: bash
+
+ download_and_extract \
+ --input-url-file= \
+ --builder-config-file= \
+ --output-json-dir=
+
+This utility takes as input a list of URLs that point to files containing prepared, unextracted data (e.g., pre-crawled web pages from Common Crawl), a config file that describes how to download and extract the data, and the output directory where the extracted text will be written in jsonl format (one JSON document per line). For each URL provided in the list of URLs, a corresponding jsonl file will be written to the output directory.
+
+The config file, which must be provided at runtime, should take the following form:
+
+.. code-block:: yaml
+
+ download_module: nemo_curator.download.mydataset.DatasetDownloader
+ download_params: {}
+ iterator_module: nemo_curator.download.mydataset.DatasetIterator
+ iterator_params: {}
+ extract_module: nemo_curator.download.mydataset.DatasetExtractor
+ extract_params: {}
+
+Each pair of lines corresponds to an implementation of the abstract DocumentDownloader, DocumentIterator and DocumentExtractor classes. In this case, the dummy names DatasetDownloader, DatasetIterator, and DatasetExtractor have been provided. For this example, each of these has been defined within the fictitious file :code:`nemo_curator/download/mydataset.py`. NeMo Curator already provides implementations of each of these classes for the Common Crawl, Wikipedia and ArXiv datasets.
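+
+As a rough sketch, such a file might look like the following. The import path and method names below are assumptions based on the roles described above; consult the existing implementations under ``nemo_curator/download/`` for the exact interfaces to override.
+
+.. code-block:: python
+
+    # NOTE: the import path and method names are assumptions for illustration only.
+    from nemo_curator.download.doc_builder import (
+        DocumentDownloader,
+        DocumentExtractor,
+        DocumentIterator,
+    )
+
+    class DatasetDownloader(DocumentDownloader):
+        def download(self, url):
+            # Fetch the remote file and return the path of the local copy.
+            ...
+
+    class DatasetIterator(DocumentIterator):
+        def iterate(self, file_path):
+            # Yield raw records (e.g., one per document) from the downloaded file.
+            ...
+
+    class DatasetExtractor(DocumentExtractor):
+        def extract(self, content):
+            # Turn a raw record into extracted text plus metadata.
+            ...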
+
+###############################
+Common Crawl Example
+###############################
+
+
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Setup
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+If you prefer, the download process can pull WARC files from S3 using `s5cmd `_.
+This utility is preinstalled in the NeMo Framework Container, but you must have the necessary credentials within :code:`~/.aws/config` in order to use it.
+If you would prefer to use this over `wget `_, you may set :code:`aws=True` in the :code:`download_params` as follows
+
+.. code-block:: yaml
+
+ download_module: nemo_curator.download.commoncrawl.CommonCrawlWARCDownloader
+ download_params:
+ aws: True
+ iterator_module: nemo_curator.download.commoncrawl.CommonCrawlWARCIterator
+ iterator_params: {}
+ extract_module: nemo_curator.download.commoncrawl.CommonCrawlWARCExtractor
+ extract_params: {}
+
+
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Downloading and Extracting Common Crawl
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As described in the first section of this document, the first step towards using the :code:`download_and_extract` for Common Crawl will be to create a list of URLs that point to the location of the WARC files hosted by Common Crawl.
+Within NeMo Curator, we provide the utility :code:`get_common_crawl_urls` to obtain these URLs. This utility can be run as follows
+
+.. code-block:: bash
+
+ get_common_crawl_urls \
+ --cc-snapshot-index-file=./url_data/collinfo.json \
+ --starting-snapshot="2020-50" \
+ --ending-snapshot="2020-50" \
+ --output-warc-url-file=./url_data/warc_urls_cc_2020_50.txt
+
+This script pulls the Common Crawl index from `https://index.commoncrawl.org` and stores the index in the file
+specified by the argument :code:`--cc-snapshot-index-file`. It then retrieves all WARC URLs between the
+dates specified by the arguments :code:`--starting-snapshot` and :code:`--ending-snapshot`.
+Finally, it writes all WARC URLs to the text file specified by :code:`--output-warc-url-file`. This file is a simple text file
+with the following format::
+
+ https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/segments/1606141163411.0/warc/CC-MAIN-20201123153826-20201123183826-00000.warc.gz
+ https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/segments/1606141163411.0/warc/CC-MAIN-20201123153826-20201123183826-00001.warc.gz
+ https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/segments/1606141163411.0/warc/CC-MAIN-20201123153826-20201123183826-00002.warc.gz
+ https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/segments/1606141163411.0/warc/CC-MAIN-20201123153826-20201123183826-00003.warc.gz
+ https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/segments/1606141163411.0/warc/CC-MAIN-20201123153826-20201123183826-00004.warc.gz
+ ...
+
+For the CC-MAIN-2020-50 snapshot there are a total of 72,000 compressed WARC files, each between 800 and 900 MB in size.
+
+Now with the prepared list of URLs, we can use the Common Crawl config included in the :code:`config` directory under the root directory of the repository. This config uses the download, data loader and extraction classes defined in the file :code:`nemo_curator/download/commoncrawl.py`.
+With this config and the input list of URLs, the :code:`download_and_extract` utility can be used as follows for downloading and extracting text from Common Crawl
+
+.. code-block:: bash
+
+ download_and_extract \
+ --input-url-file=./url_data/warc_urls_cc_2020_50.txt \
+ --builder-config-file=./config/cc_warc_builder.yaml \
+ --output-json-dir=/datasets/CC-MAIN-2020-50/json
+
+
+As the text is extracted from the WARC records, the prepared documents are written to the directory specified by :code:`--output-json-dir`. Here is an
+example of a single line of an output `.jsonl` file extracted from a WARC record
+
+.. code-block:: json
+
+ {"text": "커뮤니티\n\n어린이 요리 교실은 평소 조리와 제과 제빵에 관심이 있는 초등학생을 대상으로 나이프스킬, 한식, 중식, 양식, 제과, 제빵, 디저트,
+ 생활요리 등 요리 기초부터 시작해 다양한 요리에 대해 배우고, 경험할 수 있도록 구성되었다.\n\n요즘 부모들의 자녀 요리 교육에 대한 관심이 높아지고
+ 있는데, 어린이 요리교실은 자녀들이 어디서 어떻게 요리를 처음 시작할지 막막하고 어려워 고민하는 이들을 위해 만들어졌다.\n\n그 뿐만 아니라 학생들이
+ 식재료를 다루는 과정에서 손으로 만지고 느끼는 것이 감각을 자극하여 두뇌발달에 도움을 주며, 조리를 통해 자신의 감정을 자연스럽게 표현할 수
+ 있고 이를 통해 정서적 안정을 얻을 수 있다. 또한, 다양한 사물을 만져 보면서 차이점을 구별하고 사물의 특징에 대해 인지할 수 있으므로 인지 능력 향상에
+ 도움이 되며, 만지고 느끼고 비교하는 과정에서 감각 기능을 향상시킬 수 있다.\n\n방과 후 시간이 되지 않는 초등학생들을 위해 평일반 뿐만 아니라 주말반도
+ 운영하고 있으며 두 분의 선생님들의 안전적인 지도하에 수업이 진행된다. 한국조리예술학원은 젊은 감각과 학생들과의 소통을 통해 자발적인 교육을 가르친다.
+ 자세한 학원 문의는 한국조리예술학원 홈페이지나 대표 전화, 카카오톡 플러스친구를 통해 가능하다.", "id": "a515a7b6-b6ec-4bed-998b-8be2f86f8eac",
+ "source_id": "https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/segments/1606141163411.0/warc/CC-MAIN-20201123153826-20201123183826-00000.warc.gz",
+ "url": "http://hanjowon.co.kr/web/home.php?mid=70&go=pds.list&pds_type=1&start=20&num=67&s_key1=&s_que=", "language": "KOREAN"}
+
+Once all records within a WARC file have been processed, the WARC file is deleted from disk by default.
+
diff --git a/docs/user-guide/GpuDeduplication.rst b/docs/user-guide/GpuDeduplication.rst
new file mode 100644
index 00000000..d23e8ee7
--- /dev/null
+++ b/docs/user-guide/GpuDeduplication.rst
@@ -0,0 +1,83 @@
+
+.. _data-curator-gpu-deduplication:
+
+#######################################################
+GPU Accelerated Exact and Fuzzy Deduplication
+#######################################################
+
+-----------------------------------------
+Background
+-----------------------------------------
+
+Training on randomly selected documents for many epochs can be sub-optimal for the downstream performance of language models.
+For more information on when this is harmful, please see `Muennighoff et al., 2023 `_ and `Tirumala et al., 2023 `_.
+The exact and fuzzy document-level deduplication module in NeMo Curator aims to reduce the occurrence of duplicate and
+near-duplicate documents in the dataset. Exact deduplication refers to removing identical (i.e., document strings are equal)
+documents from the dataset, while fuzzy deduplication refers to removing near-identical (e.g., an excerpt of a document is used in another document)
+documents from the dataset.
+
+Both functionalities are supported in NeMo Curator and accelerated using `RAPIDS `_.
+Exact deduplication works by hashing each document and only keeping one document per hash.
+Fuzzy deduplication is more involved and follows the method outlined in `Microsoft Turing NLG 530B `_.
+
+-----------------------------------------
+Usage
+-----------------------------------------
+As exact deduplication is a much less involved procedure and requires significantly less compute,
+we typically run exact deduplication before fuzzy deduplication. Also, from our experience in
+deduplicating Common Crawl snapshots, a significant portion of the duplicates are in fact exact duplicates.
+
+When removing near-duplicates within the corpus we perform fuzzy deduplication at the document level in order to remove documents that
+have high Jaccard similarity. Our approach closely resembles the approach described in `Smith et al., 2020 `_. This
+approach can essentially be split into two conceptual stages. The first stage involves computing MinHash signatures on
+documents and then performing Locality Sensitive Hashing (LSH) to find candidate duplicates. Due to the approximate nature of the bucketing via MinHash + LSH
+(`Leskovec et al., 2020 `_), the second stage processes each of the buckets to remove any potential false positives that may have been hashed into them.
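+
+For intuition, the Jaccard similarity between two documents can be computed over their sets of overlapping character n-grams (shingles). The following is only a toy illustration of the similarity measure itself; the shingling scheme and thresholds used by the actual pipeline differ:
+
+.. code-block:: python
+
+    def shingles(text: str, n: int = 5) -> set:
+        # Set of overlapping character n-grams for a document.
+        return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}
+
+    def jaccard_similarity(a: str, b: str) -> float:
+        # |A intersect B| / |A union B| over the documents' shingle sets.
+        sa, sb = shingles(a), shingles(b)
+        return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0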
+
+
+
+Before running either of these modules, users should assign a unique document ID to each document in the corpus.
+This can be accomplished using the :code:`add_id` module within the NeMo Curator:
+
+.. code-block:: bash
+
+ add_id \
+ --input-data-dir= \
+ --log-dir=./log/add_id
+
+By default, this will create a new field named :code:`adlr_id` within each json document which will have the form "doc_prefix-000001".
+If the dataset already has a unique ID this step can be skipped.
+
+**Note**: Fuzzy deduplication only works with numeric IDs or the specific ID format generated by the :code:`add_id` script. If the
+dataset does not contain IDs in this format, it is recommended to convert them to integer-based IDs or to IDs created by the :code:`add_id` script.
+
+Once a unique ID has been added to each document, users can proceed with exact and fuzzy deduplication which roughly require the following
+steps (all scripts are included in the :code:`nemo_curator/scripts/` subdirectory):
+
+* Exact dedup
+ 1. Input: Data directories
+ 2. Output: _exact_duplicates.parquet. List of exact duplicates and the document hash.
+
+* Fuzzy Dedup
+ 1. Minhashes (Compute minhashes)
+ 1. Input: Data Directories
+ 2. Output: minhashes.parquet for each data dir.
+ 2. Buckets (Minhash Buckets/LSH)
+ 1. Input: Minhash directories
+ 2. Output: _buckets.parquet
+ 3. Map Buckets
+ 1. Input: _buckets.parquet + Data Dirs
+ 2. Output: anchor_docs_with_bk.parquet
+ 4. Jaccard Shuffle
+ 1. Input: anchor_docs_with_bk.parquet + Data Dirs
+ 2. Output: shuffled_docs.parquet
+ 5. Jaccard compute
+ 1. Input: shuffled_docs.parquet
+ 2. Output: jaccard_similarity_results.parquet
+ 6. Connected Components
+ 1. Input: jaccard_similarity_results.parquet
+ 2. Output: connected_components.parquet
+
+In addition to the scripts, there are examples in the `examples` directory that showcase using the Python modules
+directly in your own code. They also show how to remove documents from the corpus using the list of duplicate IDs generated by exact or fuzzy
+deduplication.
+
diff --git a/docs/user-guide/LanguageIdentificationUnicodeFormatting.rst b/docs/user-guide/LanguageIdentificationUnicodeFormatting.rst
new file mode 100644
index 00000000..ddd107bf
--- /dev/null
+++ b/docs/user-guide/LanguageIdentificationUnicodeFormatting.rst
@@ -0,0 +1,93 @@
+
+.. _data-curator-languageidentification:
+
+#######################################################
+Language Identification and Unicode Fixing
+#######################################################
+
+-----------------------------------------
+Background
+-----------------------------------------
+Large unlabeled text corpora often contain a variety of languages.
+However, data curation usually includes steps that are language specific (e.g. using language-tuned heuristics for quality filtering)
+and many curators are only interested in curating a monolingual dataset.
+Datasets may also have improperly decoded unicode characters (e.g. "The Mona Lisa doesn't have eyebrows." decoding as "The Mona Lisa doesnâ€™t have eyebrows.").
+
+NeMo Curator provides utilities to identify languages and fix improperly decoded unicode characters.
+The language identification is performed using `fastText `_ and unicode fixing is performed using `ftfy `_.
+Even though a preliminary language identification may have been performed on the unextracted text (as is the case in our Common Crawl pipeline
+using pyCLD2), `fastText `_ is more accurate so it can be used for a second pass.
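+
+To give a sense of what the underlying model produces, here is a minimal standalone sketch using the raw ``fasttext`` Python API (this is not NeMo Curator's interface, and ``lid.176.bin`` must be downloaded separately):
+
+.. code-block:: python
+
+    import fasttext
+
+    # Load the pretrained language identification model.
+    model = fasttext.load_model("lid.176.bin")
+
+    # predict() returns parallel tuples of labels and confidence scores.
+    labels, scores = model.predict("The Mona Lisa doesn't have eyebrows.")
+    print(labels[0], scores[0])  # e.g. "__label__en" with a high confidence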
+
+-----------------------------------------
+Usage
+-----------------------------------------
+
+We provide an example of how to use the language identification and unicode reformatting utility at ``examples/identify_languages_and_fix_unicode.py``.
+At a high level, the module first identifies the languages of the documents and removes any documents for which it has high uncertainty about the language.
+Notably, these lines use one of the ``DocumentModifier`` implementations that NeMo Curator provides:
+
+.. code-block:: python
+
+ cleaner = nc.Modify(UnicodeReformatter())
+ cleaned_data = cleaner(lang_data)
+
+``DocumentModifier`` classes like ``UnicodeReformatter`` are very similar to ``DocumentFilter`` classes.
+They implement a single ``modify_document`` function that takes in a document and outputs a modified document.
+Here is the implementation of the ``UnicodeReformatter`` modifier:
+
+.. code-block:: python
+
+ class UnicodeReformatter(DocumentModifier):
+ def __init__(self):
+ super().__init__()
+
+ def modify_document(self, text: str) -> str:
+ return ftfy.fix_text(text)
+
+Also like the ``DocumentFilter`` functions, ``modify_document`` can be annotated with ``batched`` so that it takes in a pandas Series of documents instead of a single document.
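+
+For example, a batched version of the reformatter might look like the following sketch (the import paths here are assumptions; check the NeMo Curator source for the actual locations of ``DocumentModifier`` and ``batched``):
+
+.. code-block:: python
+
+    import ftfy
+    # NOTE: the import paths below are assumptions for illustration only.
+    from nemo_curator.modifiers import DocumentModifier
+    from nemo_curator.utils.decorators import batched
+
+    class BatchedUnicodeReformatter(DocumentModifier):
+        @batched
+        def modify_document(self, text):
+            # Here 'text' is a pandas Series of documents; return a Series of equal length.
+            return text.apply(ftfy.fix_text)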
+
+-----------------------------------------
+Related Scripts
+-----------------------------------------
+
+To perform the language identification, we can use the config file provided in the `config` directory
+and provide the path to a local copy of the `lid.176.bin` language identification fastText model. Then, with the general purpose
+:code:`filter_documents` tool, we can compute language scores and codes for each document in the corpus as follows
+
+.. code-block:: bash
+
+ filter_documents \
+ --input-data-dir= \
+ --filter-config-file=./config/fasttext_langid.yaml \
+ --log-scores \
+ --log-dir=./log/lang_id
+
+
+This will apply the fastText model, compute the score and obtain the language class, and then write this
+information as additional keys within each json document.
+
+With the language information present within the keys of each json, the :code:`separate_by_metadata` utility will first construct
+a count of the documents by language within the corpus and then, using that information, split each file across all the languages
+within that file. Below is an example run command for :code:`separate_by_metadata`:
+
+.. code-block:: bash
+
+ separate_by_metadata \
+ --input-data-dir= \
+ --input-metadata-field=language \
+ --output-data-dir=