
Commit

mmcauliffe committed Feb 16, 2023
1 parent e88a65e commit 912f216
Showing 15 changed files with 948 additions and 362 deletions.
18 changes: 18 additions & 0 deletions docs/source/changelog/changelog_2.2.rst
@@ -0,0 +1,18 @@

.. _changelog_2.2:

*************
2.2 Changelog
*************

2.2.1
=====

- Fixed a couple of bugs in training Phonetisaurus models
- Added training of Phonetisaurus models for tokenizer

2.2.0
=====

- Add support for training tokenizers and tokenization
- Migrate most os.path functionality to pathlib
1 change: 1 addition & 0 deletions docs/source/changelog/index.md
@@ -60,6 +60,7 @@ Not tied to 2.1, but in the near-ish term I would like to:
:hidden:
:maxdepth: 1
changelog_2.2.rst
news_2.1.rst
changelog_2.1.rst
news_2.0.rst
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -345,7 +345,7 @@
# "image_dark": "logo-dark.svg",
},
"analytics": {
"google_analytics_id": "UA-73068199-4",
"google_analytics_id": "353930198",
},
# "show_nav_level": 1,
# "navigation_depth": 4,
45 changes: 41 additions & 4 deletions docs/source/first_steps/index.rst
@@ -34,6 +34,11 @@ There are several broad use cases that you might want to use MFA for. Take a lo
#. Use the trained G2P model in :ref:`first_steps_g2p_pretrained` to generate a pronunciation dictionary
#. Use the generated pronunciation dictionary in :ref:`first_steps_align_train_acoustic_model` to generate aligned TextGrids

#. **Use case 5:** You have a :ref:`speech corpus <corpus_structure>` and the language involved is in the list of :xref:`pretrained_acoustic_models`, but the language does not mark word boundaries in its orthography.

#. Follow :ref:`first_steps_tokenize` to tokenize the corpus
#. Use the tokenized transcripts and follow :ref:`first_steps_align_pretrained`

.. _first_steps_align_pretrained:

Aligning a speech corpus with existing pronunciation dictionary and acoustic model
@@ -90,8 +95,8 @@ Depending on your use case, you might have a list of words to run G2P over, or j

.. code-block::
-mfa g2p english_us_arpa ~/mfa_data/my_corpus ~/mfa_data/new_dictionary.txt # If using a corpus
-mfa g2p english_us_arpa ~/mfa_data/my_word_list.txt ~/mfa_data/new_dictionary.txt # If using a word list
+mfa g2p ~/mfa_data/my_corpus english_us_arpa ~/mfa_data/new_dictionary.txt # If using a corpus
+mfa g2p ~/mfa_data/my_word_list.txt english_us_arpa ~/mfa_data/new_dictionary.txt # If using a word list
Running one of the above will output a pronunciation dictionary text file in the format that MFA uses (:ref:`dictionary_format`). I recommend looking over the generated pronunciations and making sure that they look sensible. For languages where the orthography is not transparent, it may be helpful to include :code:`--num_pronunciations 3` so that more pronunciations are generated than just the most likely one. For more details on running G2P, see :ref:`g2p_dictionary_generating`.
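For instance, a quick sketch of asking for three candidate pronunciations per word, using the 2.2 argument order shown above (the paths here are placeholders):

.. code-block::

   # Generate up to three pronunciations per word (paths are illustrative)
   mfa g2p ~/mfa_data/my_word_list.txt english_us_arpa ~/mfa_data/new_dictionary.txt --num_pronunciations 3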

@@ -170,18 +175,50 @@ Once the G2P model is trained, you should see the exported archive in the folder
mfa model save g2p ~/mfa_data/my_g2p_model.zip
-mfa g2p my_g2p_model ~/mfa_data/my_new_word_list.txt ~/mfa_data/my_new_dictionary.txt
+mfa g2p ~/mfa_data/my_new_word_list.txt my_g2p_model ~/mfa_data/my_new_dictionary.txt
# Or
-mfa g2p ~/mfa_data/my_g2p_model.zip ~/mfa_data/my_new_word_list.txt ~/mfa_data/my_new_dictionary.txt
+mfa g2p ~/mfa_data/my_new_word_list.txt ~/mfa_data/my_g2p_model.zip ~/mfa_data/my_new_dictionary.txt
Take a look at :ref:`first_steps_g2p_pretrained` with this new model for a more detailed walk-through of generating a dictionary.

.. note::

Please see :ref:`g2p_model_training_example` for an example using toy data.

.. _first_steps_tokenize:

Tokenize a corpus to add word boundaries
----------------------------------------

For the purposes of this example, we'll also assume that you have done nothing else with MFA other than follow the :ref:`installation` instructions and that you have the :code:`mfa` command working. Finally, we'll assume that your corpus is in Japanese and is stored in the folder :code:`~/mfa_data/my_corpus`; when working with your own data, the corpus path will be the main thing to update.

To tokenize the Japanese text to add spaces, first download the Japanese tokenizer model via:


.. code-block::
mfa model download tokenizer japanese_mfa
Once you have the model downloaded, you can tokenize your corpus via:

.. code-block::
mfa tokenize ~/mfa_data/my_corpus japanese_mfa ~/mfa_data/tokenized_version
You can check the tokenized text in :code:`~/mfa_data/tokenized_version`, verify that it looks good, and copy the files to replace the untokenized files in :code:`~/mfa_data/my_corpus` for use in alignment.
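If you decide to overwrite the originals, a minimal shell sketch might look like the following (it assumes plain-text transcripts and that the tokenized output mirrors the corpus layout; back up the corpus first):

.. code-block::

   # Back up the original corpus, then overwrite it with the tokenized transcripts
   cp -r ~/mfa_data/my_corpus ~/mfa_data/my_corpus_backup
   cp -r ~/mfa_data/tokenized_version/. ~/mfa_data/my_corpus/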

.. warning::

MFA's tokenizer models are nowhere near state of the art, and I recommend using other tokenizers where they make sense:

* Japanese: `nagisa <https://nagisa.readthedocs.io/en/latest/>`_
* Chinese: `spacy-pkuseg <https://github.com/explosion/spacy-pkuseg/blob/master/readme/readme_english.md>`_
* Thai: `sertiscorp/thai-word-segmentation <https://github.com/sertiscorp/thai-word-segmentation>`_

The tokenizers above were used in the initial construction of the MFA training corpora, though the training segmentations for Japanese have begun to diverge from :code:`nagisa`, which breaks phonological words into morphological parses that, for the purposes of acoustic model training and alignment, are better left unsplit (nagisa: :ipa_inline:`使っ て [ts ɨ k a Q t e]` vs MFA: :ipa_inline:`使って [ts ɨ k a tː e]`). The MFA tokenizer models are provided as an easy starting point, since the tokenizers listed above may have extra dependencies and platform restrictions.
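As an illustration of using one of these external tokenizers instead, here is a minimal sketch (not MFA code) that pre-tokenizes plain-text Japanese transcripts with :code:`nagisa`; the :code:`.lab` extension and corpus path are assumptions about your layout, so check the nagisa documentation for the exact API of your installed version:

.. code-block:: python

   from pathlib import Path

   import nagisa  # pip install nagisa

   corpus = Path("~/mfa_data/my_corpus").expanduser()
   for lab_path in corpus.rglob("*.lab"):  # assumes plain-text .lab transcripts
       text = lab_path.read_text(encoding="utf-8").strip()
       words = nagisa.tagging(text).words  # surface tokens as a list of strings
       lab_path.write_text(" ".join(words) + "\n", encoding="utf-8")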

.. toctree::
:maxdepth: 1
:hidden:
2 changes: 1 addition & 1 deletion montreal_forced_aligner/alignment/mixins.py
@@ -551,7 +551,7 @@ def compile_information(self) -> None:
average_logdet_frames += data["logdet_frames"]
average_logdet_sum += data["logdet"] * data["logdet_frames"]

if hasattr(self, "db_engine"):
if hasattr(self, "session"):
csv_path = self.working_directory.joinpath("alignment_log_likelihood.csv")
with mfa_open(csv_path, "w") as f, self.session() as session:
writer = csv.writer(f)
28 changes: 23 additions & 5 deletions montreal_forced_aligner/command_line/train_tokenizer.py
@@ -12,7 +12,10 @@
common_options,
)
from montreal_forced_aligner.config import GLOBAL_CONFIG, MFA_PROFILE_VARIABLE
-from montreal_forced_aligner.tokenization.trainer import TokenizerTrainer
+from montreal_forced_aligner.tokenization.trainer import (
+    PhonetisaurusTokenizerTrainer,
+    TokenizerTrainer,
+)

__all__ = ["train_tokenizer_cli"]

@@ -48,6 +51,12 @@
"most of the data and validating on an unseen subset.",
default=False,
)
+@click.option(
+    "--phonetisaurus",
+    is_flag=True,
+    help="Flag for using Phonetisaurus-style models.",
+    default=False,
+)
@common_options
@click.help_option("-h", "--help")
@click.pass_context
@@ -63,10 +72,19 @@ def train_tokenizer_cli(context, **kwargs) -> None:
config_path = kwargs.get("config_path", None)
corpus_directory = kwargs["corpus_directory"]
output_model_path = kwargs["output_model_path"]
-trainer = TokenizerTrainer(
-    corpus_directory=corpus_directory,
-    **TokenizerTrainer.parse_parameters(config_path, context.params, context.args),
-)
+phonetisaurus = kwargs["phonetisaurus"]
+if phonetisaurus:
+    trainer = PhonetisaurusTokenizerTrainer(
+        corpus_directory=corpus_directory,
+        **PhonetisaurusTokenizerTrainer.parse_parameters(
+            config_path, context.params, context.args
+        ),
+    )
+else:
+    trainer = TokenizerTrainer(
+        corpus_directory=corpus_directory,
+        **TokenizerTrainer.parse_parameters(config_path, context.params, context.args),
+    )

try:
trainer.setup()
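As a usage sketch of the new flag (the exact :code:`mfa train_tokenizer` invocation and paths here are assumptions based on the parameters above, not taken from this commit):

.. code-block::

   # Train a Phonetisaurus-style tokenizer model from a corpus (paths are placeholders)
   mfa train_tokenizer ~/mfa_data/my_corpus ~/mfa_data/my_tokenizer_model.zip --phonetisaurus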
7 changes: 2 additions & 5 deletions montreal_forced_aligner/corpus/multiprocessing.py
@@ -494,11 +494,8 @@ def _no_dictionary_sanitize(self, session):
text.append(w)
if character_text:
character_text.append("<space>")
-if self.bracket_regex.match(w):
-    character_text.append(self.bracketed_word)
-else:
-    for g in w:
-        character_text.append(g)
+for g in w:
+    character_text.append(g)
text = " ".join(text)
character_text = " ".join(character_text)
yield {
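As a toy illustration (not MFA code) of the effect of this change on the character-level text: every word, bracketed or not, is now split into its graphemes, with :code:`<space>` markers between words:

.. code-block:: python

   def character_tokens(words):
       # Mirror of the loop above: graphemes separated by spaces, "<space>" between words
       character_text = []
       for w in words:
           if character_text:
               character_text.append("<space>")
           for g in w:
               character_text.append(g)
       return " ".join(character_text)

   print(character_tokens(["使って", "[laugh]"]))
   # 使 っ て <space> [ l a u g h ]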
8 changes: 3 additions & 5 deletions montreal_forced_aligner/g2p/mixins.py
@@ -1,6 +1,6 @@
"""Mixin module for G2P functionality"""
import typing
-from abc import ABCMeta, abstractmethod
+from abc import ABCMeta
from pathlib import Path
from typing import Dict, List

@@ -36,7 +36,6 @@ def __init__(
self.g2p_threshold = g2p_threshold
self.include_bracketed = include_bracketed

-@abstractmethod
def generate_pronunciations(self) -> Dict[str, List[str]]:
"""
Generate pronunciations
@@ -46,13 +45,12 @@ def generate_pronunciations(self) -> Dict[str, List[str]]:
dict[str, list[str]]
Mappings of keys to their generated pronunciations
"""
-...
+raise NotImplementedError

@property
-@abstractmethod
def words_to_g2p(self) -> List[str]:
"""Words to produce pronunciations"""
-...
+raise NotImplementedError


class G2PTopLevelMixin(MfaWorker, DictionaryMixin, G2PMixin):
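A toy sketch (not MFA code) of the pattern adopted here: dropping :code:`@abstractmethod` in favor of plain methods that raise :code:`NotImplementedError` keeps the mixin freely subclassable and instantiable, while still failing loudly if an unimplemented method is actually called:

.. code-block:: python

   from typing import Dict, List


   class G2PLikeMixin:
       def generate_pronunciations(self) -> Dict[str, List[str]]:
           """Subclasses that generate pronunciations override this."""
           raise NotImplementedError

       @property
       def words_to_g2p(self) -> List[str]:
           """Subclasses override this to supply the word list."""
           raise NotImplementedError


   class Passthrough(G2PLikeMixin):
       pass


   p = Passthrough()  # fine; with ABCMeta + @abstractmethod this would raise TypeError
   # p.generate_pronunciations() would raise NotImplementedError only if called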
