Last update: July 3rd 2020
Content coordination: Rubén Martín rmartin@mozilla.com
Mozilla Voice communities empower the collection of machine-learning based voice technologies -- including software, tools, and data -- that Mozilla stands behind.
- Provide people interested in contributing to Mozilla Voice goals and mission with clear guidelines and expectations on how to set up and run a self-sustaining Mozilla Voice community.
- Unify existing community knowledge previously documented in different places.
- Be the central place to understand the whole voice community journey.
- Communicate what brings value to the project and how communities can support it.
Mozilla Voice communities are governed by Mozilla's code of conduct and etiquette guidelines. We take this very seriously, and no violations are tolerated.
We encourage you to read the Mozilla Community Participation Guidelines before contributing to this project.
For more information on how to report violations of the Community Participation Guidelines, please read our 'How to Report' page.
Mozilla Corporation owns the overall Mozilla Voice project governance and is the ultimate decision maker for its direction and goals. It’s also in charge of the development of some tools and channels described here to support our communities.
The voice communities are self-organized, and you don’t need to ask for permission to participate or mobilize any of these communities in your language. All the data generated by communities is published under open licences.
Some community roles exist formally and informally, and they all should follow the Mozilla leadership shared agreements.
Mozilla Voice has a variety of communities that support the project in different important areas; they are usually grouped by language.
The work done by these communities advances a language from having no presence in Mozilla Voice at all to being able to generate a functional STT model that understands how people speak.
| 📝 | Text corpus: Gathering, validating and processing public domain sentences. |
| 🗣 | Voice corpus: Recording and validating voices to create a public domain dataset. |
| 🌍 | Localization: Adapting the project tools and materials to be understood by a specific audience. |
| 🤖 | Model training (TBD): Using our text and voice datasets to train and optimize STT models in specific languages using machine learning. |
👥 As you can read in this playbook, you will need a multidisciplinary team of committed people to support your language journey.
🔨 Make sure you check the required skills for each section and look for people who can fit.
💬 Check the channels section to learn how to set up your local forums and chat to communicate with other people in your language.
ℹ️ Note: Mozilla’s focus is to optimize this project, tools and communities for the goals and measures of success described in this document. We welcome small and minority language communities, and we understand these goals may seem out of reach. In that case, feel free to share with us how they are different for you. Nevertheless, we welcome all language communities!
Collect or generate a text corpus under a public domain licence that people can read aloud to facilitate their voice donations.
We are a community of text collectors and creators, always looking for sources of text corpora that we can extract and process into short and simple sentences for people to read.
Generate as many sentences as possible in our languages. Having more sentences allows contributors to donate more hours of voice data.
- 5,000 sentences allow 5.5 hrs of voice
- 9,000 sentences allow 10 hrs of voice
- 90,000 sentences allow 100 hrs of voice
- 1,800,000 sentences allow 2000 hrs of voice
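The figures above imply a roughly constant ratio of about 900 sentences per hour of voice (9,000 sentences for 10 hours). As a rough sketch for planning a language's text corpus, the arithmetic can be written as follows. The constant and the function names are assumptions made for illustration; they are derived from the list and are not official project values.

```python
import math

# Assumption: ratio implied by the list above (9,000 sentences ~ 10 hours).
SENTENCES_PER_HOUR = 900


def hours_enabled(sentences: int) -> float:
    """Approximate hours of voice that a number of sentences allows."""
    return sentences / SENTENCES_PER_HOUR


def sentences_needed(hours: float) -> int:
    """Approximate number of sentences needed to allow a target number of hours."""
    return math.ceil(hours * SENTENCES_PER_HOUR)


if __name__ == "__main__":
    print(hours_enabled(9_000))
    print(sentences_needed(2_000))
```

The results are estimates only: the listed figures are rounded, so small inputs such as 5,000 sentences come out slightly above the 5.5 hours quoted in the list.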
Anyone can join this community. Join our discourse forums or our matrix chat, introduce yourself and jump into our sentence tools right away.
We have developed a tool to extract sentences from large sources of public domain text, with a focus on easy-to-read corpora and Wikipedia.
This is the easiest and fastest way to get more than a million sentences as soon as possible for your language.
ℹ️ Please read the tool documentation on how to generate specific rules for your language.
🔨 Skills required to help: Command line usage and git, familiar with regular expressions.
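To give a feel for the kind of regular-expression work this involves, here is a minimal sketch of filtering raw text into short, easy-to-read sentences. It is not the sentence extractor itself and does not use its rule format: the thresholds, the disallowed-character set and the function name are all hypothetical and chosen only for illustration. Refer to the tool documentation for the real rules.

```python
import re

# Hypothetical rules for illustration only; the real extractor has its own
# rule format, described in its documentation.
MIN_WORDS = 3
MAX_WORDS = 14
# Digits and symbols that are hard to read aloud consistently.
DISALLOWED = re.compile(r"[0-9<>{}\[\]@#/\\]")
# Split after sentence-ending punctuation followed by whitespace.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def extract_sentences(text: str) -> list[str]:
    """Split raw text into sentences and keep only short, easy-to-read ones."""
    kept = []
    for sentence in SENTENCE_SPLIT.split(text.strip()):
        words = sentence.split()
        if not MIN_WORDS <= len(words) <= MAX_WORDS:
            continue  # too short or too long to read comfortably
        if DISALLOWED.search(sentence):
            continue  # contains digits or symbols
        kept.append(sentence)
    return kept
```

Each language will need different rules (for example, other sentence-ending punctuation or other word-count limits), which is why language knowledge matters as much as regular expressions here.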
We have also created a sentence collection tool that allows contributors to collect and validate sentences created by the community. You can also use this tool to import and clean up small-to-medium-sized public domain corpora you have found or collected.
ℹ️ Please read the collector how-to before using this tool and check the community guidelines on how to validate sentences.
🔨 Skills required to help: Strong grammar knowledge of the target language you are contributing to.
If you have found an existing public domain corpus bigger than 100K sentences, we have an independent process to handle it, since we understand that manual validation using the sentence collector is not ideal.
ℹ️ Please create a new topic on our discourse, so we can evaluate if your corpus fits the licence and size requirements to run this process.
🔨 Skills required to help: Expertise in processing and cleaning up text, linguistics/language expertise to check the quality of the resulting sentences.
Contributors also develop, maintain and update the sentence extractor and collector code.
- Sentence Extractor: 🐞 Open issues - 🔨 Skills needed: Rust
- Sentence Collector: 🐞 Open issues - 🔨 Skills needed: React, JavaScript (and soon Node.js)
These are some roles you can take as part of this community.
- Text searcher - Find and connect with sources and organizations that have or are willing to donate text corpora under a public domain licence.
- Text processor - Clean up the raw text corpus so it meets our sentence requirements.
- Text creator - Generate your own sentences and release them into the public domain.
- Validator - Help validate and review existing cleaned-up sentences.
- Mobilizer - Help people in the community to get started and keep contributing.
- Developer - Develop, maintain and update the sentence tooling.
- Common Voice discourse category.
- Common Voice matrix chat room.
- Sentence Extractor matrix chat room.
- Common Voice project announcements.
💬 If your language already exists on Common Voice, make sure you check and join the local discourse and matrix room. If that’s not the case, please create a new topic on discourse asking for one to be created.
Donate and validate our voices under a public domain licence to generate a dataset that Speech to Text technologies can use to train models in different languages, democratizing voice technology.
We are a community of voice tech enthusiasts, who want to help collect and generate a large dataset of public domain voices that can be freely used to train Speech to Text technologies.
Collect and validate as many voices as possible in our languages. Having more voices validated allows us to then train more advanced STT models.
- At least 1,000 unique speakers per language.
- 2,000 hours of voice validated to train a near-human general STT model.
- 10,000 hours of voice validated for a very high quality, general, large vocabulary, continuous speech recognition model.
Anyone can join this community. Join our discourse forums or our matrix chat and introduce yourself, then jump into the Common Voice site, get familiar with it and start donating your voice.
🔨 You don’t need any specialized skill to contribute to this community. You only need to be able to speak into a microphone or listen to audio clips.
We have developed a site that allows you to donate your voice by reading sentences collected by the community.
Feel free to create an account to track your progress and add more information about your voice to your profile. Demographic information helps us balance the dataset, giving machine learning researchers and engineers a way to train models that better represent the speakers of the language.
ℹ️ Please read the following community guidelines to know how to produce better voice donations.
The same site allows you to review other people’s voices by listening to recordings donated by the community. Each recording needs at least two positive validations from different people. Feel free to create an account to track your progress, compare with other contributors, set yourself goals or earn award badges.
ℹ️ Please read the following community guidelines to know how to better validate voices.
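The rule that each recording needs at least two positive validations from different people can be sketched as below. This is an illustration of the rule as stated in this playbook, not the site's actual implementation: the vote representation and the function name are assumptions, and since this document does not describe how negative votes are weighed, the sketch simply ignores them.

```python
from typing import Iterable, Tuple

# From the playbook: at least two positive validations from different people.
REQUIRED_POSITIVE_VALIDATIONS = 2


def is_validated(votes: Iterable[Tuple[str, bool]]) -> bool:
    """Return True once enough distinct people have approved a recording.

    Each vote is a hypothetical (reviewer_id, is_positive) pair. Repeated
    positive votes from the same reviewer count only once.
    """
    approvers = {reviewer for reviewer, is_positive in votes if is_positive}
    return len(approvers) >= REQUIRED_POSITIVE_VALIDATIONS
```

Counting distinct reviewers rather than raw votes is what makes the "from different people" part of the rule hold.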
You can help the community by organizing activities and encouraging others to do the same. Use the channels we have at our disposal to engage with other contributors in your language, talk about your ideas to grow the community and collect and validate more voices.
ℹ️ Check a few ideas from the Contribute to Common Voice activity.
⭐️ You can re-use any graphical material we have produced to support the project.
Help other contributors in our discourse and matrix channels by answering their questions about how to use the site or helping document reported issues on GitHub.
The main development of our site is led by our staff team, but anyone can submit pull requests for open issues or minor UI bugs.
ℹ️ Please read the contribution guidelines before submitting any code.
The complete text and voice dataset for languages where we have data is currently generated by the Common Voice staff team.
Currently, we are generating a new version of the datasets two times per year and publishing them on our site.
ℹ️ Note that we are asking for an email to send the link to the dataset (instead of direct download) because we want to have a way to contact everyone who downloaded the data in case we get deletion requests from contributors.
We understand that some people might want more frequent releases, and we are working on a more continuous release model to accommodate these needs.
These are some roles you can take as part of this community.
- Voice donator: Donate your voice.
- Voice validator: Help review other people’s voices.
- Support: Join our community channels to support contributors with issues using our site.
- Mobilizer: Help people in the community to get started and keep contributing.
- Developer: Help by submitting code and fixes to our site.
- Common Voice discourse category.
- Common Voice matrix chat room.
- Common Voice project announcements.
💬 If your language already exists on Common Voice, make sure you check and join the local discourse and matrix room. If that’s not the case, please create a new topic on discourse asking for one to be created.
Adapting the project tools and material to be understood by a specific audience.
We are a community of translators and linguists that localize the original English content into our languages.
🔨 English knowledge and deep understanding of our local language and culture are key for this work.
Localize the project tools into our language, mainly the Common Voice site.
- The Common Voice site is 100% localized in my language.
Anyone can join this community. Join our discourse forums or our matrix chat and introduce yourself, then jump into our localization tool and check the status of your language.
We use Mozilla’s localization tool, Pontoon, to translate the Common Voice strings. Please create an account and check your language in the Common Voice Pontoon section.
ℹ️ Please read how to use Pontoon before you start using the tool. You might need to ask the Mozilla localization team for permissions to validate suggestions.
🔨 Skills required to help: English knowledge, strong knowledge of your language.
These are some roles you can take as part of this community.
- Localizer: Suggest new translations for the pending strings.
- Reviewer: Check and validate existing suggestions and improve their quality.
- Mobilizer: Help people in the community to get started and keep contributing.
- Common Voice discourse category.
- Common Voice matrix chat room.
- Mozilla Localization matrix chat room.
- Common Voice project announcements.
💬 If your language already exists on Common Voice, make sure you check and join the local discourse and matrix room. If that’s not the case, please create a new topic on discourse asking for one to be created.