
Enable Sem-dedup #130

Merged: 101 commits, Jul 5, 2024

Commits
18a2fc0
Applying SEO Best Pratices (#104)
aschilling-nv Jun 12, 2024
f19df32
Shuffle CC result on group before writing out (#110)
ayushdg Jun 13, 2024
42309e6
Update index.rst (#113)
jgerh Jun 13, 2024
33332a8
first commit
avem-nv Jun 17, 2024
c633677
mv under modules dir
avem-nv Jun 17, 2024
d9b8545
first commit
avem-nv Jun 17, 2024
dc135c4
mv under modules dir
avem-nv Jun 17, 2024
968a3eb
first commit
avem-nv Jun 17, 2024
f5c51bb
mv under modules dir
avem-nv Jun 17, 2024
f286678
embed by cluster saved
avem-nv Jun 23, 2024
103c366
id map script
avem-nv Jun 24, 2024
451fa2d
test commit
avem-nv Jun 24, 2024
dec4913
add id map script
avem-nv Jun 24, 2024
bbbe400
Cleanup compute_embeddings_crossfit.py
VibhuJawa Jun 27, 2024
5d56cd0
Cleanup compute_embeddings_crossfit.py
VibhuJawa Jun 27, 2024
9ddf558
Pre-commit style fixes
VibhuJawa Jun 27, 2024
4ebab04
clustering_dask_crossfit.py
VibhuJawa Jun 27, 2024
eeee758
Minor clean up to sort_clusters_crossfit.py
VibhuJawa Jun 27, 2024
79beb61
cleanup semdedup_crossfit
VibhuJawa Jun 27, 2024
e11bbd5
Remove undo changes
VibhuJawa Jun 27, 2024
3179e24
Remove rename changes
VibhuJawa Jun 27, 2024
cbc9960
Fix rename
VibhuJawa Jun 27, 2024
57469cb
Readme formatting
VibhuJawa Jun 27, 2024
f60fc01
add dask to semdedup_crossfit.py
VibhuJawa Jun 27, 2024
c0e36f2
README.md updates
VibhuJawa Jun 27, 2024
61b21fd
README.md updates
VibhuJawa Jun 27, 2024
94b70f0
README.md updates
VibhuJawa Jun 27, 2024
2ba596e
README.md updates
VibhuJawa Jun 27, 2024
d8cbd42
README.md updates
VibhuJawa Jun 27, 2024
11fcf9d
configure max memory using a cli
VibhuJawa Jun 27, 2024
8c0d0ce
Dumb id results to parquet
VibhuJawa Jun 27, 2024
cd3f842
Embedding fixes
VibhuJawa Jun 27, 2024
1c28b83
README.md updates
VibhuJawa Jun 27, 2024
be5a608
Working end to end
VibhuJawa Jun 28, 2024
fd6ff60
Minor yaml fixes
VibhuJawa Jun 28, 2024
b307375
Undo changes to index.rst
VibhuJawa Jun 28, 2024
b30dd52
Update .pre-commit-config.yaml
VibhuJawa Jun 28, 2024
5d5e07c
Update index.rst
VibhuJawa Jun 28, 2024
7d32fb4
Update index.rst
VibhuJawa Jun 28, 2024
d6ead05
Undo changes to docs/personalidentifiableinformationidentificationand…
VibhuJawa Jun 28, 2024
e37a1db
Update fuzzy_dedup.py
VibhuJawa Jun 28, 2024
d6cd233
Undo changes to docs/personalidentifiableinformationidentificationand…
VibhuJawa Jun 28, 2024
dfe7db8
Merge branch 'main' into vjawa/dev_semdedup
VibhuJawa Jun 28, 2024
79167ec
Update index.rst
VibhuJawa Jun 28, 2024
27b5248
Add end to end script in readme.md
VibhuJawa Jun 28, 2024
6c196cf
Add type hints
VibhuJawa Jun 28, 2024
1072d56
Use dask for sort_clusters
VibhuJawa Jun 28, 2024
0258923
Make sort_clusters work on MNMG scales
VibhuJawa Jun 28, 2024
b896c8b
Cleaned up dask shutdown
VibhuJawa Jun 28, 2024
2c03601
Decrease noise in E2E scripts
VibhuJawa Jun 28, 2024
cde12c2
Clean up scripts
VibhuJawa Jun 28, 2024
2e71f65
Fix scripts/end_to_end_script.sh
VibhuJawa Jun 28, 2024
e49573d
Some more cleanup
VibhuJawa Jun 28, 2024
d291e9d
Add copyright
VibhuJawa Jun 28, 2024
81cc71c
Fix README.md
VibhuJawa Jun 28, 2024
5cd14f1
Address reviews
VibhuJawa Jun 28, 2024
e4713b2
Make work with a SemDedupConfig
VibhuJawa Jun 29, 2024
0b5782d
Make work with SemDedupConfig
VibhuJawa Jun 29, 2024
e119880
Move to nemo-curator's logger
VibhuJawa Jun 29, 2024
f961a2b
Semdedup-extract_dedup_data.py
VibhuJawa Jul 1, 2024
cd4dab9
Update index.rst
VibhuJawa Jun 28, 2024
c07411c
Applying SEO Best Pratices (#104)
aschilling-nv Jun 12, 2024
155188e
Update index.rst
VibhuJawa Jun 28, 2024
d096721
Fix bad merge
VibhuJawa Jul 1, 2024
a339e59
Update index.rst
VibhuJawa Jun 28, 2024
d6f2c98
Update index.rst
VibhuJawa Jun 28, 2024
6d6c21c
Update index.rst
VibhuJawa Jun 28, 2024
b761e77
Update index.rst
VibhuJawa Jun 28, 2024
7d5fbe9
Add Module for embedding+clustering
VibhuJawa Jul 2, 2024
9419338
Add sorting to clustering
VibhuJawa Jul 2, 2024
5fdb3b4
Refactor Semdup modules
VibhuJawa Jul 2, 2024
3a6f10c
Refactor Semdup modules
VibhuJawa Jul 3, 2024
9ff7397
Refactor Semdup modules
VibhuJawa Jul 3, 2024
a5b5f17
Fix Readme.md
VibhuJawa Jul 3, 2024
993ba92
Add a environment variable to silence HF warnings
VibhuJawa Jul 3, 2024
d505d8f
Merge in main
VibhuJawa Jul 3, 2024
835c3a0
dask-cudf fix
VibhuJawa Jul 3, 2024
2eba719
dask-cudf fix
VibhuJawa Jul 3, 2024
05d0e88
dask-cudf fix
VibhuJawa Jul 3, 2024
0bde039
Make config a flat file based on reviews
VibhuJawa Jul 3, 2024
ae03905
Add docstrings
VibhuJawa Jul 3, 2024
f957b50
Fix argparse and seed function
VibhuJawa Jul 3, 2024
eaada91
Use argparse to read config
VibhuJawa Jul 3, 2024
07f8290
Move around config files
VibhuJawa Jul 3, 2024
d5997b5
Move around config files
VibhuJawa Jul 3, 2024
94efa3a
Move around config files
VibhuJawa Jul 3, 2024
e9d21e3
Remove end_to_end_script.sh
VibhuJawa Jul 3, 2024
14faf60
Append Readme
VibhuJawa Jul 3, 2024
a304629
Address Reviews
VibhuJawa Jul 3, 2024
e7fa30d
Change config
VibhuJawa Jul 3, 2024
4f46f78
Make embedding creation optionally lazy
VibhuJawa Jul 3, 2024
bd43d5d
fix docstring
VibhuJawa Jul 3, 2024
52480aa
Address Reviews and docstrings
VibhuJawa Jul 5, 2024
16ad760
Address Reviews and make eps_thresholds a list of values
VibhuJawa Jul 5, 2024
584340a
Minor import fix
VibhuJawa Jul 5, 2024
01affbb
Empty Commit
VibhuJawa Jul 5, 2024
eaee1e5
Add modules to __init__ and README.md
VibhuJawa Jul 5, 2024
1c0f706
Fix init
VibhuJawa Jul 5, 2024
12373a7
Move comment
VibhuJawa Jul 5, 2024
da909f3
Empty commit to restart CI (which failed due to a download issue)
VibhuJawa Jul 5, 2024
c2cd97c
Empty commit to restart CI (which failed due to a download issue)
VibhuJawa Jul 5, 2024
3 changes: 2 additions & 1 deletion README.md
@@ -39,8 +39,9 @@ NeMo Curator provides a collection of scalable data-mining modules. Some of the

- [Document-level deduplication](docs/user-guide/gpudeduplication.rst)

-  - Both exact and fuzzy (near-identical) deduplication are accelerated using cuDF and Dask
+  - Exact and fuzzy (near-identical) deduplication are accelerated using cuDF and Dask
   - For fuzzy deduplication, our implementation follows the method described in [Microsoft Turing NLG 530B](https://arxiv.org/abs/2201.11990)
+  - For semantic deduplication, our implementation follows the method described in [SemDeDup](https://arxiv.org/pdf/2303.09540) by Meta AI (FAIR) ([GitHub](https://github.com/facebookresearch/SemDeDup))

- [Multilingual downstream-task decontamination](docs/user-guide/taskdecontamination.rst) following the approach of [OpenAI GPT3](https://arxiv.org/pdf/2005.14165.pdf) and [Microsoft Turing NLG 530B](https://arxiv.org/abs/2201.11990)

32 changes: 32 additions & 0 deletions config/sem_dedup_config.yaml
@@ -0,0 +1,32 @@
# Configuration file for semantic dedup
cache_dir: "semdedup_cache"
num_files: 16
id_col_name: "id"
id_col_type: "int"
input_column: "text"

# Embeddings configuration
embeddings_save_loc: "embeddings"
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
embedding_max_mem_gb: 25

# Clustering configuration
clustering_save_loc: "clustering_results"
n_clusters: 1000
seed: 1234
max_iter: 100
kmeans_with_cos_dist: false

# Semdedup configuration
which_to_keep: "hard"
largest_cluster_size_to_process: 100000
sim_metric: "cosine"

# Extract dedup configuration
eps_thresholds:
- 0.01
- 0.001

# Which threshold to use for extracting deduped data
eps_to_extract: 0.01
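A YAML file like the one above is read by `SemDedupConfig.from_yaml` and mapped onto dataclass fields. As a minimal sketch of that mapping (the class and helper names here are illustrative stand-ins, not NeMo Curator's API, and the dict below simulates what `yaml.safe_load` would return so the example needs no PyYAML):

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class MiniSemDedupConfig:
    # Hypothetical trimmed-down config with a handful of the fields above.
    cache_dir: str
    num_files: int = -1
    eps_thresholds: List[float] = field(default_factory=lambda: [0.01, 0.001])
    eps_to_extract: float = 0.01

    @classmethod
    def from_dict(cls, raw: dict) -> "MiniSemDedupConfig":
        # Keep only keys the dataclass declares; silently drop the rest.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})


# Simulated result of yaml.safe_load on a sem-dedup config file.
raw = {
    "cache_dir": "semdedup_cache",
    "num_files": 16,
    "eps_thresholds": [0.01, 0.001],
    "eps_to_extract": 0.01,
    "embedding_batch_size": 128,  # not declared above, so filtered out
}
cfg = MiniSemDedupConfig.from_dict(raw)
print(cfg.cache_dir, cfg.num_files)  # prints: semdedup_cache 16
```

Filtering on declared fields keeps the loader tolerant of extra keys; the real config class may instead raise on unknown keys.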
1 change: 0 additions & 1 deletion docs/user-guide/index.rst
@@ -46,4 +46,3 @@
personalidentifiableinformationidentificationandremoval.rst
distributeddataclassification.rst
kubernetescurator.rst

84 changes: 84 additions & 0 deletions examples/semdedup_example.py
@@ -0,0 +1,84 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
import os
import time

from nemo_curator.datasets import DocumentDataset
from nemo_curator.log import create_logger
from nemo_curator.modules.config import SemDedupConfig
from nemo_curator.modules.semantic_dedup import SemDedup
from nemo_curator.utils.distributed_utils import get_client, read_data
from nemo_curator.utils.file_utils import (
expand_outdir_and_mkdir,
get_all_files_paths_under,
)
from nemo_curator.utils.script_utils import ArgumentHelper


def silence_hf_warnings():
from transformers.utils import logging

logging.set_verbosity_error()


def main(args):
semdedup_config = SemDedupConfig.from_yaml(args.config_file)
client = get_client(**ArgumentHelper.parse_client_args(args))

silence_hf_warnings()
client.run(silence_hf_warnings)

expand_outdir_and_mkdir(semdedup_config.cache_dir)
logger = create_logger(
rank=0,
name="logger-end-to_end-semdup",
log_file=os.path.join(semdedup_config.cache_dir, "compute_embeddings.log"),
log_level=logging.INFO,
stdout=True,
)
st = time.time()
input_files = get_all_files_paths_under(
root=args.input_data_dir,
)
if semdedup_config.num_files > 0:
input_files = input_files[: semdedup_config.num_files]
logger.info(f"Processing {len(input_files)} files")
ddf = read_data(
input_files=input_files,
file_type=args.input_file_type,
add_filename=False,
backend="cudf",
)
dataset = DocumentDataset(ddf)
semdup = SemDedup(semdedup_config, logger=logger)
dedup_ids = semdup(dataset)
print(dedup_ids.df.head())
logger.info(f"Time taken: {time.time() - st}")
client.cancel(client.futures, force=True)
client.close()


def attach_args():
parser = ArgumentHelper.parse_semdedup_args(add_input_args=True)
return parser


def console_script():
main(attach_args().parse_args())


if __name__ == "__main__":
main(attach_args().parse_args())
8 changes: 6 additions & 2 deletions nemo_curator/log.py
@@ -19,7 +19,7 @@
from nemo_curator.utils.file_utils import expand_outdir_and_mkdir


-def create_logger(rank, log_file, name="logger", log_level=logging.INFO):
+def create_logger(rank, log_file, name="logger", log_level=logging.INFO, stdout=False):
# Create the logger
logger = logging.getLogger(name)
logger.setLevel(log_level)
@@ -36,8 +36,12 @@ def create_logger(rank, log_file, name="logger", log_level=logging.INFO):
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)

-    logger = logging.LoggerAdapter(logger, extra)
+    if stdout:
+        stdout_handler = logging.StreamHandler()
+        stdout_handler.setFormatter(formatter)
+        logger.addHandler(stdout_handler)

+    logger = logging.LoggerAdapter(logger, extra)
return logger


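The `log.py` hunk both adds the `stdout` flag and moves the `LoggerAdapter` wrap to after all handlers are attached, since `LoggerAdapter` has no `addHandler` method. A runnable stdlib-only sketch of the resulting pattern (the formatter string and `extra` payload here are illustrative, not NeMo Curator's exact ones):

```python
import logging
import os
import tempfile


def create_logger(rank, log_file, name="logger", log_level=logging.INFO, stdout=False):
    # Attach every handler to the raw Logger first, then wrap it.
    logger = logging.getLogger(name)
    logger.setLevel(log_level)
    formatter = logging.Formatter(f"rank {rank} | %(asctime)s | %(message)s")

    file_handler = logging.FileHandler(log_file)
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    if stdout:
        # Mirror log records to the console as well as the file.
        stdout_handler = logging.StreamHandler()
        stdout_handler.setFormatter(formatter)
        logger.addHandler(stdout_handler)

    # LoggerAdapter has no addHandler, so this wrap must come last.
    return logging.LoggerAdapter(logger, {"rank": rank})


log_path = os.path.join(tempfile.mkdtemp(), "demo.log")
log = create_logger(rank=0, log_file=log_path, stdout=True)
log.info("hello")
```

Wrapping after handler setup is exactly the ordering the diff enforces; wrapping first would make the `if stdout:` block operate on an object that cannot accept handlers.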
18 changes: 16 additions & 2 deletions nemo_curator/modules/__init__.py
@@ -22,7 +22,7 @@
from nemo_curator.utils.import_utils import gpu_only_import_from

from .add_id import AddId
-from .config import FuzzyDuplicatesConfig
+from .config import FuzzyDuplicatesConfig, SemDedupConfig
from .dataset_ops import blend_datasets, Shuffle
from .exact_dedup import ExactDuplicates
from .filter import Filter, Score, ScoreFilter
@@ -36,10 +36,19 @@
FuzzyDuplicates = gpu_only_import_from(
"nemo_curator.modules.fuzzy_dedup", "FuzzyDuplicates"
)

# Pytorch related imports must come after all imports that require cugraph,
# because of context cleanup issues b/w pytorch and cugraph
# See this issue: https://github.com/rapidsai/cugraph/issues/2718
SemDedup = gpu_only_import_from("nemo_curator.modules.semantic_dedup", "SemDedup")
EmbeddingCreator = gpu_only_import_from(
"nemo_curator.modules.semantic_dedup", "EmbeddingCreator"
)
ClusteringModel = gpu_only_import_from(
"nemo_curator.modules.semantic_dedup", "ClusteringModel"
)
SemanticClusterLevelDedup = gpu_only_import_from(
"nemo_curator.modules.semantic_dedup", "SemanticClusterLevelDedup"
)
from .distributed_data_classifier import DomainClassifier, QualityClassifier

__all__ = [
@@ -59,4 +68,9 @@
"AddId",
"blend_datasets",
"Shuffle",
"SemDedup",
"SemDedupConfig",
"EmbeddingCreator",
"ClusteringModel",
"SemanticClusterLevelDedup",
]
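The new `SemDedup`, `EmbeddingCreator`, `ClusteringModel`, and `SemanticClusterLevelDedup` exports all go through `gpu_only_import_from`, which keeps `nemo_curator.modules` importable on machines without the GPU stack. A sketch of that guarded-import pattern under illustrative names (this is not NeMo Curator's implementation, just the general technique):

```python
import importlib


def optional_import_from(module_path, symbol):
    # Return the requested attribute if the module imports cleanly;
    # otherwise return a placeholder that raises only when actually used.
    try:
        module = importlib.import_module(module_path)
        return getattr(module, symbol)
    except ImportError as err:
        message = f"{symbol} requires {module_path}, which failed to import: {err}"

        class _Unavailable:
            def __init__(self, *args, **kwargs):
                raise ImportError(message)

        return _Unavailable


# A real stdlib module resolves to the genuine class.
OrderedDict = optional_import_from("collections", "OrderedDict")
# A missing module resolves to the deferred-error placeholder.
Missing = optional_import_from("definitely_not_installed_xyz", "Thing")
```

Deferring the error to first use means `import nemo_curator.modules` itself never fails on a CPU-only box; only instantiating a GPU-backed class does.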
70 changes: 69 additions & 1 deletion nemo_curator/modules/config.py
@@ -13,7 +13,8 @@
# limitations under the License.

import warnings
-from dataclasses import dataclass
+from dataclasses import dataclass, field
+from typing import List

import yaml

@@ -98,3 +99,70 @@ def __post_init__(self):
raise ValueError("Jaccard Threshold must be between [0,1]")
if self.buckets_per_shuffle <= 0:
raise ValueError("Buckets per shuffle must be greater than 0")


@dataclass
class SemDedupConfig(BaseConfig):
"""
Configuration for Semantic Deduplication.
Attributes:
cache_dir (str): Directory to store cache.
num_files (int): Number of files. Default is -1, meaning all files.
id_col_name (str): Column name for ID.
id_col_type (str): Column type for ID.
input_column (str): Input column for embeddings.
embeddings_save_loc (str): Location to save embeddings.
embedding_model_name_or_path (str): Model name or path for embeddings.
embedding_batch_size (int): Initial batch size for processing embeddings.
embedding_max_mem_gb (int): Maximum memory in GB for embeddings.
clustering_save_loc (str): Location to save clustering results.
n_clusters (int): Number of clusters.
seed (int): Seed for clustering.
max_iter (int): Maximum iterations for clustering.
kmeans_with_cos_dist (bool): Use KMeans with cosine distance.
which_to_keep (str): Which duplicates to keep.
largest_cluster_size_to_process (int): Largest cluster size to process.
sim_metric (str): Similarity metric for deduplication.
eps_thresholds (List[float]): Epsilon thresholds to calculate if semantically similar or not.
eps_to_extract (float): Epsilon value to extract deduplicated data.
"""

cache_dir: str
num_files: int = -1
id_col_name: str = "id"
id_col_type: str = "str"
input_column: str = "text"

# Embeddings
embeddings_save_loc: str = "embeddings"
embedding_model_name_or_path: str = "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: int = 128
embedding_max_mem_gb: int = 25

# Clustering config
clustering_save_loc: str = "clustering_results"
n_clusters: int = 1000
seed: int = 1234
max_iter: int = 100
kmeans_with_cos_dist: bool = False

# Semdedup config
which_to_keep: str = "hard"
largest_cluster_size_to_process: int = 100000
sim_metric: str = "cosine"

# Extract dedup config
eps_thresholds: List[float] = field(default_factory=lambda: [0.01, 0.001])
eps_to_extract: float = 0.01

def __post_init__(self):
if self.cache_dir is None:
raise ValueError(
"Finding sem-dedup requires a cache directory accessible via all workers to store intermediates"
)

if self.eps_to_extract not in self.eps_thresholds:
raise ValueError(
f"Epsilon to extract {self.eps_to_extract} must be in eps_thresholds {self.eps_thresholds}"
)
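The `__post_init__` hook above enforces the two config invariants: a cache directory must exist, and `eps_to_extract` must be one of the listed `eps_thresholds`. A runnable sketch with a hypothetical trimmed-down class (only these two checks; the real `SemDedupConfig` carries many more fields):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MiniConfig:
    cache_dir: str
    eps_thresholds: List[float] = field(default_factory=lambda: [0.01, 0.001])
    eps_to_extract: float = 0.01

    def __post_init__(self):
        # Dataclasses run this automatically after __init__, so invalid
        # configs are rejected at construction time, not at use time.
        if self.cache_dir is None:
            raise ValueError(
                "Sem-dedup requires a cache directory accessible to all workers"
            )
        if self.eps_to_extract not in self.eps_thresholds:
            raise ValueError(
                f"eps_to_extract {self.eps_to_extract} must be in "
                f"eps_thresholds {self.eps_thresholds}"
            )


ok = MiniConfig(cache_dir="semdedup_cache")  # passes both checks
try:
    MiniConfig(cache_dir="semdedup_cache", eps_to_extract=0.5)
except ValueError as e:
    print("rejected:", e)
```

Validating in `__post_init__` keeps the YAML loader simple: any dict that constructs the dataclass has already been checked.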