[FIX] training and test text from Tolúwaṣẹ lang-id task #15

ruohoruotsi · 2019-06-05T02:46:15Z

From this commit it appears that either there is overlapping text between training and test (fix!) or we did not combine the text from this repo correctly. In either case, we need to rename the text file so that we can trace its origin better.

Since we are going to shuffle all sentences anyway:

- combine all available text and run it through `sort | uniq`
- call it something reasonable, related to Tolúwaṣẹ's word-level language-id task!
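
The combine-and-deduplicate step could look roughly like the sketch below. `combine_unique` and the output name `combined.txt` are hypothetical (the issue does not settle on a file name), and `LC_ALL=C` is an assumption added so that lines with diacritics are compared byte for byte regardless of the local locale.

```shell
# Hypothetical helper: merge any number of text files into one
# sorted file with exact duplicate lines removed.
# Usage: combine_unique OUTPUT INPUT [INPUT...]
combine_unique() {
    out="$1"
    shift
    # sort -u is equivalent to `sort | uniq`; passing the files to sort
    # directly avoids gluing lines together when a file lacks a final newline.
    LC_ALL=C sort -u -- "$@" > "$out"
}

# Example call (file names taken from the analysis later in this issue,
# output name is a placeholder):
#   combine_unique combined.txt EngYor_test_corpus.txt 'Yoruba_training_corpus(part).txt'
```

Since the result is sorted, the shuffling mentioned above would still have to happen afterwards.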

ruohoruotsi · 2019-07-01T03:09:41Z

We did the following analysis on the top level of [Tolúwaṣẹ's repo](https://github.com/Toluwase/Word-Level-Language-Identification-for-Resource-Scarce-):

$ cat Yoruba_training_corpus\(part\).txt | wc
    4708  130914  824905
$ cat EngYor_test_corpus.txt | wc
     616   12077   76069
$ cat EngYor_test_corpus.txt Yoruba_training_corpus\(part\).txt | wc
    5324  142991  900974
$ cat EngYor_test_corpus.txt Yoruba_training_corpus\(part\).txt | sort | uniq -d | wc 
     417    5621   37125
$ cat EngYor_test_corpus.txt Yoruba_training_corpus\(part\).txt | sort | uniq | wc 
    3915  129775  811647

This shows that only 417 distinct lines are repeated, comprising 5,621 words out of roughly 143k words in the combined text. Given how small this is, I think it is okay for these duplicates to exist. For simplicity's sake we can just combine the text and rename it correctly. Closing.
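
One caveat on the numbers above: `sort | uniq -d` over the concatenated files also reports lines that repeat within a single file, so 417 is an upper bound on the train/test overlap rather than the overlap itself. A minimal sketch for counting only the lines shared between the two files follows; `shared_line_count` is a hypothetical helper name, not something from the repo, and `LC_ALL=C` is an assumption made so that `sort` and `comm` agree on ordering.

```shell
# Hypothetical helper: print the number of distinct lines that occur
# in BOTH files. Usage: shared_line_count FILE_A FILE_B
shared_line_count() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    # Deduplicate each file on its own first, so repeats inside one
    # file are not mistaken for cross-file overlap.
    LC_ALL=C sort -u -- "$1" > "$tmp_a"
    LC_ALL=C sort -u -- "$2" > "$tmp_b"
    # comm -12 suppresses lines unique to either input, leaving the
    # lines common to both; tr strips the padding some wc builds add.
    LC_ALL=C comm -12 "$tmp_a" "$tmp_b" | wc -l | tr -d ' '
    rm -f "$tmp_a" "$tmp_b"
}

# Example call:
#   shared_line_count EngYor_test_corpus.txt 'Yoruba_training_corpus(part).txt'
```

Comparing this count with the `uniq -d` figure would show how many of the repeats are within-file duplicates rather than leakage from training into test.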

ruohoruotsi added the bug label Jun 5, 2019

ruohoruotsi self-assigned this Jun 5, 2019

ruohoruotsi closed this as completed Jul 1, 2019

ruohoruotsi added a commit that referenced this issue Jul 1, 2019

[ADD] combined training and text corpus, after validating the level of overlap between the two corpora for #15 (2005732)
