[FIX] training and test text from Tolúwaṣẹ lang-id task #15

Closed
ruohoruotsi opened this issue Jun 5, 2019 · 1 comment
@ruohoruotsi

From this commit it appears that either there is overlapping text between training and test (Fix!) or that we did not combine the text from this repo correctly. In either case, we need to rename the text file so that we can trace its origin better.

Since we are going to shuffle all sentences anyway,

  • combine all available text and deduplicate it with sort | uniq
  • call it something reasonable, related to the Tolúwaṣẹ language word-id task!
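
A minimal sketch of the combine-and-deduplicate step, assuming a POSIX shell. The function name `combine_corpus` and the output filename in the example are hypothetical, not names used in this repo. `LC_ALL=C` forces byte-wise comparison, so lines that differ only in diacritics are not collapsed by locale-dependent collation.

```shell
# combine_corpus OUTPUT INPUT...
# Concatenate the INPUT files, sort them, and keep one copy of each line.
combine_corpus() {
    out="$1"
    shift
    # sort -u is equivalent to sort | uniq; LC_ALL=C compares raw bytes.
    LC_ALL=C sort -u -- "$@" > "$out"
}

# Example (the output name is a placeholder, chosen only for illustration):
# combine_corpus yoruba_langid_combined.txt \
#     EngYor_test_corpus.txt 'Yoruba_training_corpus(part).txt'
```

Sorting leaves the output in byte order, which is harmless here because the sentences get shuffled afterwards anyway.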
ruohoruotsi added the bug label Jun 5, 2019
ruohoruotsi self-assigned this Jun 5, 2019
@ruohoruotsi

We did the following analysis on the top level of [Tolúwaṣẹ's repo](https://github.com/Toluwase/Word-Level-Language-Identification-for-Resource-Scarce-):

$ cat Yoruba_training_corpus\(part\).txt | wc
    4708  130914  824905
$ cat EngYor_test_corpus.txt | wc
     616   12077   76069
$ cat EngYor_test_corpus.txt Yoruba_training_corpus\(part\).txt | wc
    5324  142991  900974
$ cat EngYor_test_corpus.txt Yoruba_training_corpus\(part\).txt | sort | uniq -d | wc 
     417    5621   37125
$ cat EngYor_test_corpus.txt Yoruba_training_corpus\(part\).txt | sort | uniq | wc 
    3915  129775  811647

This shows that only 417 distinct lines are repeated, comprising 5,621 words, out of roughly 143k words in the combined text (about 130k after deduplication). Given how small this is, I think it is okay to leave these duplicates in; for simplicity's sake, we can just combine the text and rename it correctly. Closing.
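
Note that `uniq -d` on the concatenated files counts lines repeated inside a single file together with lines shared between the two files. If a train/test-only figure is ever needed, a minimal sketch (assuming a POSIX shell; `shared_lines` is a hypothetical helper name) could look like this:

```shell
# shared_lines FILE_A FILE_B
# Print each line that occurs in both files, once.
shared_lines() {
    # sort -u collapses repeats inside each file first, so any line that
    # uniq -d still reports must be present in both files.
    { LC_ALL=C sort -u -- "$1"; LC_ALL=C sort -u -- "$2"; } \
        | LC_ALL=C sort \
        | LC_ALL=C uniq -d
}

# Example:
# shared_lines EngYor_test_corpus.txt 'Yoruba_training_corpus(part).txt' | wc
```

This stays with the same `sort | uniq -d` idiom as the analysis above and avoids temporary files.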

ruohoruotsi added a commit that referenced this issue Jul 1, 2019