Add Jupyter notebook tutorial for single node multilingual dataset #30

Merged 36 commits on May 24, 2024
Commits
ec26f9f
Init commit for tutorial notebook
nicoleeeluo Apr 11, 2024
462a1a3
Fix metadata inference with pandas and dask (#35)
ryantwolf Apr 19, 2024
f297076
Disable PyTorch Compile Multiprocessing (#34)
ryantwolf Apr 22, 2024
dbe7606
Improve speed of AddId module (#36)
ryantwolf Apr 23, 2024
417e874
Make GPU dependencies optional (#27)
ayushdg Apr 23, 2024
6d99292
Fix failing GPU tests with latest pandas bump (#41)
ayushdg Apr 23, 2024
dff70cc
Adds Nemo Curator K8s example (#40)
terrykong Apr 23, 2024
f2b3904
Move common dedup utils and remove unused code (#42)
ayushdg Apr 30, 2024
b192e92
Fix lang id example (#37)
ryantwolf May 3, 2024
909f58d
Add dataset blending tool (#32)
ryantwolf May 3, 2024
0bab063
High level fuzzy duplicates module (#46)
ayushdg May 3, 2024
9849164
Fix indexing in PII Modifier (#55)
ryantwolf May 6, 2024
794a435
Disable string conversion globally (#56)
ryantwolf May 7, 2024
0f5a029
Fix issue #43 (empty files creation) and improve reading/writing spee…
miguelusque May 8, 2024
d4a2f0f
[Tutorials] Add a tutorial for PEFT data curation (#45)
Maghoumi May 10, 2024
8bea00b
Only import PII constants during Curator import (#61)
ayushdg May 13, 2024
c66138a
Deleting links
nicoleeeluo May 15, 2024
148e1d4
Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
nicoleeeluo May 16, 2024
7e08c96
Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
nicoleeeluo May 16, 2024
75f5dd7
Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
nicoleeeluo May 16, 2024
fcd8230
Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
nicoleeeluo May 16, 2024
48af561
Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
nicoleeeluo May 16, 2024
49efc21
Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
nicoleeeluo May 16, 2024
5826eb1
Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
nicoleeeluo May 16, 2024
30abf29
Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
nicoleeeluo May 16, 2024
43eae27
Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
nicoleeeluo May 16, 2024
87eefbd
Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
nicoleeeluo May 16, 2024
262d8e0
Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
nicoleeeluo May 16, 2024
15db6f3
Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
nicoleeeluo May 16, 2024
84587b2
Fixed typo. Update content to latest NeMo Curator version. Added fuz…
nicoleeeluo May 17, 2024
4b024cb
Fixing Style
nicoleeeluo May 20, 2024
0a50fd4
Updating container version
nicoleeeluo May 20, 2024
c119bf8
Merge branch 'main' into main
nicoleeeluo May 20, 2024
2a9052c
Fixing style
nicoleeeluo May 20, 2024
9ab7144
Merge branch 'NVIDIA:main' into main
nicoleeeluo May 24, 2024
11e4eba
Update get_client() according to latest version; Update log path for …
nicoleeeluo May 24, 2024
Empty file modified .pre-commit-config.yaml
100644 → 100755
Empty file.
82 changes: 82 additions & 0 deletions tutorials/single_node_tutorial/config/heuristic_filter_non-en.yaml
@@ -0,0 +1,82 @@
input_field: text
filters:
  # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
  # This particular cascade of filters is intended to filter generic non-English data that use spaces for separating words.
  # The filter listed at the top will be applied first, and the following filters will be applied in
  # the order they appear in this file. Each filter can be removed and re-ordered as desired.
  - name: nemo_curator.filters.heuristic_filter.SymbolsToWordsFilter
    log_score: True
    params:
      max_symbol_to_word_ratio: 0.1
  - name: nemo_curator.filters.heuristic_filter.NumbersFilter
    log_score: True
    params:
      max_number_to_text_ratio: 0.15
  - name: nemo_curator.filters.heuristic_filter.UrlsFilter
    log_score: True
    params:
      max_url_to_text_ratio: 0.2
  - name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
    log_score: True
    params:
      max_white_space_ratio: 0.25
  - name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
    log_score: True
    params:
      max_parentheses_ratio: 0.1
  - name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
    log_score: True
    params:
      remove_if_at_top_or_bottom: True
      max_boilerplate_string_ratio: 0.4
  - name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter
    log_score: True
    params:
      max_repeated_line_fraction: 0.7
  - name: nemo_curator.filters.heuristic_filter.RepeatedParagraphsFilter
    log_score: True
    params:
      max_repeated_paragraphs_ratio: 0.7
  - name: nemo_curator.filters.heuristic_filter.RepeatedLinesByCharFilter
    params:
      max_repeated_lines_char_ratio: 0.8
  - name: nemo_curator.filters.heuristic_filter.RepeatedParagraphsByCharFilter
    log_score: True
    params:
      max_repeated_paragraphs_char_ratio: 0.8
  - name: nemo_curator.filters.heuristic_filter.WordCountFilter
    log_score: True
    params:
      min_words: 50
      max_words: 100000
  # NOTE: This filter tends to remove many documents and will need to
  # be tuned per language
  # - name: nemo_curator.filters.heuristic_filter.PunctuationFilter
  #   params:
  #     max_num_sentences_without_endmark_ratio: 0.85
  # - name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
  #   params:
  #     max_mean_word_length: 10
  #     min_mean_word_length: 3
  # - name: nemo_curator.filters.heuristic_filter.LongWordFilter
  #   params:
  #     max_word_length: 1000
  # - name: nemo_curator.filters.heuristic_filter.EllipsisFilter
  #   params:
  #     max_num_lines_ending_with_ellipsis_ratio: 0.3
  # Top N-Gram filters for N-grams 2, 3, and 4
  - name: nemo_curator.filters.heuristic_filter.RepeatingTopNGramsFilter
    log_score: True
    params:
      n: 2
      max_repeating_ngram_ratio: 0.2
  - name: nemo_curator.filters.heuristic_filter.RepeatingTopNGramsFilter
    log_score: True
    params:
      n: 3
      max_repeating_ngram_ratio: 0.18
  - name: nemo_curator.filters.heuristic_filter.RepeatingTopNGramsFilter
    log_score: True
    params:
      n: 4
      max_repeating_ngram_ratio: 0.16
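
Reviewer note: the comments in the config above say the filters run top to bottom, in file order. The sketch below illustrates that cascade in plain Python for three of the configured thresholds (max_symbol_to_word_ratio, max_white_space_ratio, min_words/max_words). It is not the nemo_curator implementation: the scoring functions, the symbol set, the inclusive comparisons, the stop-at-first-failure behaviour, and the names CHAIN and apply_chain are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical scoring functions. These are illustrative stand-ins,
# NOT the nemo_curator filter implementations.

def symbol_to_word_ratio(text: str) -> float:
    """Count of a few symbol strings divided by the number of whitespace-split words."""
    words = text.split()
    if not words:
        return 0.0
    # Assumed symbol set, chosen for the example only.
    symbols = sum(text.count(s) for s in ("#", "...", "…"))
    return symbols / len(words)

def white_space_ratio(text: str) -> float:
    """Fraction of characters that are whitespace."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)

def word_count(text: str) -> float:
    """Number of whitespace-separated words."""
    return float(len(text.split()))

# Each entry: (name, score function, keep predicate).
# Thresholds are copied from the config; inclusive bounds are an assumption.
Filter = Tuple[str, Callable[[str], float], Callable[[float], bool]]
CHAIN: List[Filter] = [
    ("symbols_to_words", symbol_to_word_ratio, lambda s: s <= 0.1),
    ("white_space", white_space_ratio, lambda s: s <= 0.25),
    ("word_count", word_count, lambda s: 50 <= s <= 100000),
]

def apply_chain(doc: str, chain: List[Filter] = CHAIN) -> Tuple[bool, Dict[str, float]]:
    """Run filters in list order; stop at the first one that rejects the document.

    Returns (kept, scores), where scores holds every score computed so far.
    """
    scores: Dict[str, float] = {}
    for name, score_fn, keep_fn in chain:
        score = score_fn(doc)
        scores[name] = score
        if not keep_fn(score):
            return False, scores
    return True, scores

if __name__ == "__main__":
    long_doc = " ".join(["word"] * 60)
    print(apply_chain(long_doc)[0])       # a 60-word document passes all three checks
    print(apply_chain("### ### hi")[0])   # rejected by the first filter
    print(apply_chain("hello world")[0])  # rejected by the word-count filter
```

Because the chain stops at the first rejection, the order of entries decides which score gets recorded for a rejected document. That is one reason to place cheap, high-rejection filters first when tuning the config per language.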
Binary file added tutorials/single_node_tutorial/image/jaccard.png