From this commit it appears that either there is overlapping text between training and test (fix!) or we did not combine the text from this repo correctly. In either case, we need to rename the text file so that we can trace its origin better.
Since we are going to shuffle all sentences anyway:
combine all available text and run sort | uniq
call it something reasonable, related to the Tolúwaṣẹ language word-id task!
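The combine-and-deduplicate step could be sketched roughly as follows. This is only a minimal sketch, assuming plain text with one sentence per line; the function name `combine_unique_lines` and the sample data are hypothetical and do not come from the repo. It approximates `sort | uniq` but sorts by Unicode code point, which may differ from a locale-aware `sort`.

```python
from typing import Iterable, List


def combine_unique_lines(sources: Iterable[Iterable[str]]) -> List[str]:
    """Merge lines from several sources, roughly like `cat files | sort | uniq`."""
    unique = set()
    for lines in sources:
        for line in lines:
            stripped = line.rstrip("\n")
            # Assumption: empty lines are dropped (`sort | uniq` would keep one).
            if stripped:
                unique.add(stripped)
    # Sorted by code point, not by locale collation.
    return sorted(unique)


# Hypothetical sample data standing in for two text files.
train = ["sentence one\n", "sentence two\n"]
extra = ["sentence two\n", "sentence three\n"]
combined = combine_unique_lines([train, extra])
```

The result can then be written to a single file whose name records where the text came from.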
It shows us that we only have 417 repeated lines comprising 5,621 words out of 130k total words. Given how small this is, I think it is okay for these duplicates to remain. For simplicity's sake, we can just combine the text and rename it correctly. Closing.
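A count like the one above could be reproduced with something along these lines. This is a hedged sketch, not the check that was actually run: it assumes a "repeated line" means every extra copy of a line beyond its first occurrence and that words are whitespace-separated. The function name `duplicate_stats` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def duplicate_stats(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (repeated_lines, repeated_words, total_words)."""
    cleaned = [line.strip() for line in lines if line.strip()]
    counts = Counter(cleaned)
    repeated_lines = 0
    repeated_words = 0
    for line, n in counts.items():
        extra = n - 1  # copies beyond the first occurrence
        repeated_lines += extra
        repeated_words += extra * len(line.split())
    total_words = sum(len(line.split()) for line in cleaned)
    return repeated_lines, repeated_words, total_words


# Hypothetical sample: "a b c" appears three times, so two copies are extra.
sample = ["a b c", "d e", "a b c", "a b c", "f"]
stats = duplicate_stats(sample)
```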