Scrape second Yorùbá Bible version #4

Closed
ruohoruotsi opened this issue Dec 12, 2018 · 1 comment
Labels
enhancement New feature or request

Comments

@ruohoruotsi
Member

Currently the Bible corpus comprises only text scraped from the first version. Comparing the first chapter of Genesis in the two versions shows that they differ in a way that perhaps makes them non-redundant for ADR or NMT training.

Ní ìbẹ̀rẹ̀ ohun gbogbo Ọlọ́run dá àwọn ọ̀run àti ayé. 2 Ayé sì wà ní rúdurùdu, ó sì ṣófo, òkùnkùn sì wà lójú ibú omi, Ẹ̀mí Ọlọ́run sì ń rábàbà lójú omi.
Ní ìbẹ̀rẹ̀, nígbà tí Ọlọrun dá ọ̀run ati ayé, 2 ayé rí júujùu, ó sì ṣófo. Ibú omi bo gbogbo ayé, gbogbo rẹ̀ ṣókùnkùn biribiri, ẹ̀mí Ọlọrun sì ń rábàbà lójú omi.
  1. https://www.bible.com/versions/911-ycb-bibeli-mimo-ni-ede-yoruba-de-ni
  2. https://www.bible.com/versions/207-bm-yoruba-bible

Experiments are necessary to avoid adding redundant text to the ADR training corpus, but the second corpus is definitely worth the scraping effort.
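
One rough way to start such an experiment is to measure how much two renderings of the same verse overlap once diacritics are stripped, since that is the form the ADR model would see as input. The sketch below is only an illustration, not the project's method: the function names (`strip_diacritics`, `tokens`, `jaccard`, `sequence_similarity`) are hypothetical, and word-level overlap is an assumed proxy for redundancy.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import List


def strip_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks (tone marks, under-dots)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokens(text: str) -> List[str]:
    """Lowercased, undiacritized word tokens; digits and punctuation dropped."""
    plain = strip_diacritics(text).lower()
    words = ("".join(c for c in word if c.isalpha()) for word in plain.split())
    return [w for w in words if w]


def jaccard(a: str, b: str) -> float:
    """Share of distinct words that the two texts have in common."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def sequence_similarity(a: str, b: str) -> float:
    """Order-aware similarity over word sequences (difflib ratio)."""
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()


if __name__ == "__main__":
    # Opening clauses of Genesis 1:1 as quoted in this issue.
    v1 = "Ní ìbẹ̀rẹ̀ ohun gbogbo Ọlọ́run dá àwọn ọ̀run àti ayé."
    v2 = "Ní ìbẹ̀rẹ̀, nígbà tí Ọlọrun dá ọ̀run ati ayé,"
    print("jaccard:", round(jaccard(v1, v2), 3))
    print("sequence similarity:", round(sequence_similarity(v1, v2), 3))
```

Scores near 1.0 across many verses would suggest the second version adds little new input text, while lower scores would support treating it as additional training data. Any threshold would have to be chosen empirically.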

@ruohoruotsi ruohoruotsi added the enhancement New feature or request label Dec 12, 2018
@ruohoruotsi ruohoruotsi self-assigned this Dec 12, 2018
ruohoruotsi added a commit that referenced this issue Dec 12, 2018
Currently the Bible corpus only comprises scraped text from the first version.  Adding the second version is tracked here: #4
@ruohoruotsi
Member Author

Fixed in bf5fb22. The texts still need to be integrated into the training script for https://github.com/Niger-Volta-LTI/yoruba-adr
