[BUG] FastTextLangId doesnt return a list truely #33

Closed
zahramahani opened this issue Apr 17, 2024 · 4 comments
Labels
bug Something isn't working

Comments

zahramahani commented Apr 17, 2024

Describe the bug

lambda score: score[1]

IndexError: string index out of range

Steps/Code to reproduce bug

I used the example files to download Common Crawl with download_common_crawl.py. After that, I decided to separate out the Persian-language data, so I ran identify_languages_and_fix_unicode.py, but running that script produces the error above.

Here are the variables that I changed in the first file:

start_snapshot = "2023-50"
end_snapshot = "2024-10"
output_directory = "../../../datasets/common_crawl/"

Changes in the second file:

multilingual_data_path = "../../../datasets/common_crawl"
language_separated_output_path = "../../../datasets/common_crawl_lang_separated"
cleaned_data_output_path = "../../../datasets/common_crawl_cleaned"

# Download a fastText language identification model
# and see a list of supported languages here:
# https://fasttext.cc/docs/en/language-identification.html
model_path = "../../../lid.176.bin"
target_language = "FA"
language_field = "language"

Expected behavior

The script runs with no error.

Environment overview

  • Environment location: local server
  • Method of NeMo-Curator install: pip install --extra-index-url https://pypi.nvidia.com .

Environment details

  • OS version : Debian GNU/Linux 11 (bullseye)
  • Dask version : 2024.1.1
  • Python version: 3.10.14

zahramahani added the bug label Apr 17, 2024
zahramahani changed the title from [BUG] to [BUG] FastTextLangId doesnt return a list truely Apr 17, 2024

ryantwolf (Collaborator) commented Apr 19, 2024

Thanks for raising this issue! I'm always happy to see people contributing to the project. I think I've discovered the root cause. It appears that when Dask performs type inference on the function, it passes in a string instead of a list. I'll make a PR that explicitly annotates the meta so no type inference is needed.
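
A minimal sketch of the failure mode described above, using plain Python only. The helper names `pick_language` and `try_pick` are hypothetical, and the one-character placeholder string is an assumption standing in for whatever sample value gets passed during type inference; it is not taken from Dask's source.

```python
def pick_language(score):
    """Same logic as the issue's `lambda score: score[1]`."""
    return score[1]


def try_pick(value):
    """Call pick_language and report an IndexError instead of raising it."""
    try:
        return pick_language(value)
    except IndexError as exc:
        return f"IndexError: {exc}"


# The shape the function expects: [confidence, language_code].
real_score = [0.5435, "FA"]

# Assumed placeholder: a short string handed in instead of a list.
placeholder = "a"

print(try_pick(real_score))   # FA
print(try_pick(placeholder))  # IndexError: string index out of range
```

Declaring the output type up front, for example through the `meta` argument that Dask's `apply` accepts, means the function never has to be run on a sample value, which is the direction the fix above takes.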

This bug has seemingly revealed another, though, as the fix above isn't enough on its own. If you only implement that, you may be left with another error: Exception: 'TypeError("\'>=\' not supported between instances of \'str\' and \'float\'")'. This comes from a Dask issue in which object-typed values are improperly converted to strings. As a result, in the keep_document call of FastTextLangId, score[0] is not 0.5435 (or whatever the lang id score is) but '[', because score was converted to a string. We'll likely need to disable the string conversion for the filter with dask.config.set({"dataframe.convert-string": False}).
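
The second error can be reproduced in plain Python, as a rough sketch. The `keep_document` function below is a simplified, hypothetical stand-in (including its `threshold` parameter) and not NeMo Curator's actual implementation; `str()` is used here only to imitate the effect of the list being converted to a string.

```python
def keep_document(score, threshold=0.5):
    """Hypothetical stand-in: keep the document if the confidence is high enough."""
    return score[0] >= threshold


score = [0.5435, "FA"]

# Imitate the unwanted conversion of the list to its string form.
score_as_string = str(score)

print(keep_document(score))   # True
print(score_as_string[0])     # [

try:
    keep_document(score_as_string)
except TypeError as exc:
    # '>=' not supported between instances of 'str' and 'float'
    print(f"TypeError: {exc}")
```

With the conversion disabled, `score` stays a list, so `score[0]` is the float confidence and the comparison is valid.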

Thanks again for raising the issue! I should have a PR up shortly for this.

zahramahani (Author) commented Apr 20, 2024

I added this line at the start of the main function, and then the next error occurred:

kwargs:    {}
Exception: "KeyError('a')"

Traceback (most recent call last):
  File "../miniconda3/envs/nemo_curator/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'a'

zahramahani (Author) commented Apr 30, 2024

I changed

language_stats = separate_by_metadata(
        filtered_dataset.df,
        language_separated_output_path,
        metadata_field=language_field[1],
    ).compute()

to

language_stats = separate_by_metadata(
        filtered_dataset.df,
        language_separated_output_path,
        metadata_field=language_field,
    ).compute()

That solved the problem.
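
For readers wondering where the 'a' in the KeyError came from: with language_field = "language", indexing the string with [1] yields its second character. A small sketch, where the `columns` dict is a hypothetical stand-in for the dataframe's columns rather than a real pandas or Dask object:

```python
language_field = "language"

# Hypothetical stand-in for the dataframe's columns.
columns = {"text": ["..."], "language": ["FA"]}

# Indexing a string returns a single character, not the field name.
print(language_field[1])  # a

try:
    columns[language_field[1]]
except KeyError as exc:
    print(repr(exc))  # KeyError('a')

# Passing the whole field name finds the column.
print(columns[language_field])  # ['FA']
```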

zahramahani (Author) commented

So I am closing the issue.
