[BUG] FastTextLangId doesn't truly return a list #33

zahramahani · 2024-04-17T08:20:57Z

Describe the bug

lambda score: score[1]

IndexError: string index out of range

Steps/Code to reproduce bug

I used the example files to download Common Crawl with download_common_crawl.py. After that, I wanted to separate out the Persian-language documents, so I ran identify_languages_and_fix_unicode.py, but when I run the script the error above occurs.

Here are the variables that I changed in the first file:

start_snapshot = "2023-50"
end_snapshot = "2024-10"
output_directory = "../../../datasets/common_crawl/"

Changes in the second file:

multilingual_data_path = "../../../datasets/common_crawl"
language_separated_output_path = "../../../datasets/common_crawl_lang_separated"
cleaned_data_output_path = "../../../datasets/common_crawl_cleaned"

# Download a fastText language identification model
# and see a list of supported languages here:
# https://fasttext.cc/docs/en/language-identification.html
model_path = "../../../lid.176.bin"
target_language = "FA"
language_field = "language"

Expected behavior

The script runs with no error.

Environment overview

Environment location: local server
Method of NeMo-Curator install: pip install --extra-index-url https://pypi.nvidia.com .

Environment details

OS version : Debian GNU/Linux 11 (bullseye)
Dask version : 2024.1.1
Python version: 3.10.14


ryantwolf · 2024-04-19T22:57:53Z

Thanks for raising this issue! I'm always happy to see people contributing to the project. I think I've discovered the root cause. It appears that when Dask performs type inference on the function, it passes in a string instead of a list. I'll make a PR that explicitly annotates the meta so no type inference is needed.
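To illustrate the failure mode, here is a minimal plain-Python sketch of the lambda from the report receiving a string instead of a list. It assumes the per-row value is a [score, language] pair, and the one-character placeholder string is hypothetical; the actual value Dask passes during type inference is not shown in this thread.

```python
# The lambda from the bug report: it expects a [score, language] pair.
get_language = lambda score: score[1]

# Normal case (assumed shape): a list holding the score and the language code.
real_value = [0.5435, "FA"]
language = get_language(real_value)

# Hypothetical stand-in for a string passed in during type inference.
placeholder = "a"
try:
    get_language(placeholder)
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(language)
print(error_message)
```

Indexing position 1 works on the list but fails on a string that is too short, which matches the IndexError in the original report.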

This bug seems to have revealed another one, though, because the fix mentioned above isn't enough on its own. If you only implement that, you may hit another error: Exception: 'TypeError("\'>=\' not supported between instances of \'str\' and \'float\'")'. This comes from an issue where Dask seems to improperly convert object types to strings. Therefore, in the keep_document call of FastTextLangId, score[0] is not 0.5435 (or whatever the lang id score is) but '[', because score was converted to a string. We'll likely need to disable the string conversion for the filter with dask.config.set({"dataframe.convert-string": False}).
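The second error can be reproduced without Dask. The sketch below assumes the row value is a [score, language] list and uses a hypothetical threshold of 0.3; it only shows what happens once such a list has been turned into its string form, not how Dask performs the conversion.

```python
# Assumed shape of one row's value: [score, language code].
score = [0.5435, "FA"]
threshold = 0.3  # hypothetical minimum score, for illustration only

# With the list intact, the comparison on the score works as intended.
keeps_document = score[0] >= threshold

# If the object value is converted to a string, index 0 is a character.
as_string = str(score)
first_char = as_string[0]

try:
    first_char >= threshold
    comparison_error = None
except TypeError as exc:
    comparison_error = str(exc)

print(first_char)
print(comparison_error)
```

The first character of the string form is '[', and comparing it with a float raises the same TypeError message quoted above.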

Thanks again for raising the issue! I should have a PR up shortly for this.

zahramahani · 2024-04-20T09:12:55Z

I added this line at the start of the main function, and then the next error occurred:

kwargs:    {}
Exception: "KeyError('a')"

Traceback (most recent call last):
  File "../miniconda3/envs/nemo_curator/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'a'

zahramahani · 2024-04-30T09:00:06Z

I changed

language_stats = separate_by_metadata(
        filtered_dataset.df,
        language_separated_output_path,
        metadata_field=language_field[1],
    ).compute()

to

language_stats = separate_by_metadata(
        filtered_dataset.df,
        language_separated_output_path,
        metadata_field=language_field,
    ).compute()

and that solved the problem.
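For anyone hitting the same KeyError, this small sketch shows why the original call failed. The record dictionary is a hypothetical stand-in for the dataset; only language_field and its value come from the configuration above.

```python
language_field = "language"

# Indexing a string returns a single character, not the field name.
wrong_key = language_field[1]

# Hypothetical stand-in for one record of the dataset.
record = {"text": "...", "language": "FA"}

try:
    record[wrong_key]
    lookup_error = None
except KeyError as exc:
    lookup_error = repr(exc)

# Passing the whole string looks up the intended field.
right_value = record[language_field]

print(wrong_key)
print(lookup_error)
```

Because language_field is the string "language", language_field[1] is "a", which is the key named in the traceback.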

zahramahani · 2024-04-30T09:01:08Z

So I am closing the issue.

zahramahani added the bug label Apr 17, 2024

zahramahani changed the title ~~[BUG]~~ [BUG] FastTextLangId doesnt return a list truely Apr 17, 2024

ryantwolf mentioned this issue Apr 19, 2024

Fix lang id example #37

Merged

zahramahani closed this as completed Apr 30, 2024
