Commit
* Applying SEO Best Pratices (NVIDIA#104)
* Rename CPUvsGPU.rst to cpuvsgpu.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename DataCuration.rsts to datacuration.rsts
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename DistributedDataClassification.rst to distributeddataclassification.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename DocumentDataset.rst to documentdataset.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename Download.rst to download.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename GpuDeduplication.rst to gpudeduplication.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename KubernetesCurator.rst to kubernetescurator.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename QualityFiltering.rst to qualityfiltering.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename TaskDecontamination.rst to taskdecontamination.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Update index.rst Setting all RST files to lowercase names.
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Ignore docs for EOF fixer hook
  Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
---------
Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Shuffle CC result on group before writing out (NVIDIA#110)
  Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Update index.rst (NVIDIA#113) Added links to tutorials
  Signed-off-by: jgerh <163925524+jgerh@users.noreply.github.com>
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* first commit
  Signed-off-by: avinashvem <avem@nvidia.com>
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* mv under modules dir
  Signed-off-by: avinashvem <avem@nvidia.com>
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* first commit
  Signed-off-by: avinashvem <avem@nvidia.com>
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* mv under modules dir
  Signed-off-by: avinashvem <avem@nvidia.com>
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* first commit
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* mv under modules dir
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* embed by cluster saved
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* id map script
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* test commit
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* add id map script
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Cleanup compute_embeddings_crossfit.py
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Cleanup compute_embeddings_crossfit.py
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Pre-commit style fixes
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* clustering_dask_crossfit.py
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Minor clean up to sort_clusters_crossfit.py
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* cleanup semdedup_crossfit
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Remove undo changes
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Remove rename changes
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Fix rename
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Readme formatting
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* add dask to semdedup_crossfit.py
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* README.md updates
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* README.md updates
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* README.md updates
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* README.md updates
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* README.md updates
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* configure max memory using a cli
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Dumb id results to parquet
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Embedding fixes
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* README.md updates
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Working end to end
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Minor yaml fixes
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Undo changes to index.rst
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Update .pre-commit-config.yaml
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Update index.rst
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Update index.rst
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Undo changes to docs/personalidentifiableinformationidentificationandremoval.rst
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Update fuzzy_dedup.py
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Undo changes to docs/personalidentifiableinformationidentificationandremoval.rst
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Update index.rst
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Add end to end script in readme.md
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Add type hints
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Use dask for sort_clusters
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Make sort_clusters work on MNMG scales
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Cleaned up dask shutdown
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Decrease noise in E2E scripts
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Clean up scripts
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Fix scripts/end_to_end_script.sh
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Some more cleanup
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Add copyright
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Fix README.md
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Address reviews
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Make work with a SemDedupConfig
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Make work with SemDedupConfig
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Move to nemo-curator's logger
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Semdedup-extract_dedup_data.py
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Update index.rst
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Applying SEO Best Pratices (NVIDIA#104)
* Rename CPUvsGPU.rst to cpuvsgpu.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename DataCuration.rsts to datacuration.rsts
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename DistributedDataClassification.rst to distributeddataclassification.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename DocumentDataset.rst to documentdataset.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename Download.rst to download.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename GpuDeduplication.rst to gpudeduplication.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename KubernetesCurator.rst to kubernetescurator.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename QualityFiltering.rst to qualityfiltering.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename TaskDecontamination.rst to taskdecontamination.rst
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Update index.rst Setting all RST files to lowercase names.
  Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Ignore docs for EOF fixer hook
  Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
---------
Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Update index.rst
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Fix bad merge
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Update index.rst
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Update index.rst
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Update index.rst
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Update index.rst
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Add Module for embedding+clustering
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Add sorting to clustering
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Refactor Semdup modules
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Refactor Semdup modules
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Refactor Semdup modules
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Fix Readme.md
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Add a environment variable to silence HF warnings
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* dask-cudf fix
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* dask-cudf fix
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* dask-cudf fix
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Make config a flat file based on reviews
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Add docstrings
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Fix argparse and seed function
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Use argparse to read config
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Move around config files
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Move around config files
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Move around config files
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Remove end_to_end_script.sh
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Append Readme
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Address Reviews
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Change config
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Make embedding creation optionally lazy
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* fix docstring
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Address Reviews and docstrings
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Address Reviews and make eps_thresholds a list of values
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Minor import fix
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Empty Commit
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Add modules to __init__ and README.md
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Fix init
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Move comment
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Empty commit to restart CI (which failed due to a download issue)
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Empty commit to restart CI (which failed due to a download issue)
  Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
---------
Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
Signed-off-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: avinashvem <avem@nvidia.com>
Co-authored-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Co-authored-by: avinashvem <avem@nvidia.com>