
Commit

mmcauliffe committed Feb 16, 2023
1 parent e88a65e commit 912f216
Showing 15 changed files with 948 additions and 362 deletions.
18 changes: 18 additions & 0 deletions docs/source/changelog/changelog_2.2.rst
@@ -0,0 +1,18 @@

.. _changelog_2.2:

*************
2.2 Changelog
*************

2.2.1
=====

- Fixed a couple of bugs in training Phonetisaurus models
- Added training of Phonetisaurus models for tokenizer

2.2.0
=====

- Add support for training tokenizers and tokenization
- Migrate most os.path functionality to pathlib
1 change: 1 addition & 0 deletions docs/source/changelog/index.md
@@ -60,6 +60,7 @@ Not tied to 2.1, but in the near-ish term I would like to:
:hidden:
:maxdepth: 1
changelog_2.2.rst
news_2.1.rst
changelog_2.1.rst
news_2.0.rst
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -345,7 +345,7 @@
# "image_dark": "logo-dark.svg",
},
"analytics": {
"google_analytics_id": "UA-73068199-4",
"google_analytics_id": "353930198",
},
# "show_nav_level": 1,
# "navigation_depth": 4,
45 changes: 41 additions & 4 deletions docs/source/first_steps/index.rst
@@ -34,6 +34,11 @@ There are several broad use cases that you might want to use MFA for. Take a lo
#. Use the trained G2P model in :ref:`first_steps_g2p_pretrained` to generate a pronunciation dictionary
#. Use the generated pronunciation dictionary in :ref:`first_steps_align_train_acoustic_model` to generate aligned TextGrids

#. **Use case 5:** You have a :ref:`speech corpus <corpus_structure>` and the language involved is in the list of :xref:`pretrained_acoustic_models`, but the language does not mark word boundaries in its orthography.

#. Follow :ref:`first_steps_tokenize` to tokenize the corpus
#. Use the tokenized transcripts and follow :ref:`first_steps_align_pretrained`

.. _first_steps_align_pretrained:

Aligning a speech corpus with existing pronunciation dictionary and acoustic model
@@ -90,8 +95,8 @@ Depending on your use case, you might have a list of words to run G2P over, or j

.. code-block::
-mfa g2p english_us_arpa ~/mfa_data/my_corpus ~/mfa_data/new_dictionary.txt # If using a corpus
-mfa g2p english_us_arpa ~/mfa_data/my_word_list.txt ~/mfa_data/new_dictionary.txt # If using a word list
+mfa g2p ~/mfa_data/my_corpus english_us_arpa ~/mfa_data/new_dictionary.txt # If using a corpus
+mfa g2p ~/mfa_data/my_word_list.txt english_us_arpa ~/mfa_data/new_dictionary.txt # If using a word list
Running one of the above will output a pronunciation dictionary text file in the format that MFA uses (:ref:`dictionary_format`). I recommend looking over the generated pronunciations and making sure that they look sensible. For languages where the orthography is not transparent, it may be helpful to include :code:`--num_pronunciations 3` so that more pronunciations are generated than just the most likely one. For more details on running G2P, see :ref:`g2p_dictionary_generating`.
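For instance, a quick sketch of asking for three candidate pronunciations per word, using the 2.2 argument order shown above (the paths here are placeholders):

.. code-block::

   # Generate up to three pronunciations per word (paths are illustrative)
   mfa g2p ~/mfa_data/my_word_list.txt english_us_arpa ~/mfa_data/new_dictionary.txt --num_pronunciations 3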

@@ -170,18 +175,50 @@ Once the G2P model is trained, you should see the exported archive in the folder
mfa model save g2p ~/mfa_data/my_g2p_model.zip
-mfa g2p my_g2p_model ~/mfa_data/my_new_word_list.txt ~/mfa_data/my_new_dictionary.txt
+mfa g2p ~/mfa_data/my_new_word_list.txt my_g2p_model ~/mfa_data/my_new_dictionary.txt
# Or
-mfa g2p ~/mfa_data/my_g2p_model.zip ~/mfa_data/my_new_word_list.txt ~/mfa_data/my_new_dictionary.txt
+mfa g2p ~/mfa_data/my_new_word_list.txt ~/mfa_data/my_g2p_model.zip ~/mfa_data/my_new_dictionary.txt
Take a look at :ref:`first_steps_g2p_pretrained` with this new model for a more detailed walk-through of generating a dictionary.

.. note::

Please see :ref:`g2p_model_training_example` for an example using toy data.

.. _first_steps_tokenize:

Tokenize a corpus to add word boundaries
----------------------------------------

For the purposes of this example, we'll also assume that you have done nothing else with MFA other than follow the :ref:`installation` instructions and that you have the :code:`mfa` command working. Finally, we'll assume that your corpus is in Japanese and is stored in the folder :code:`~/mfa_data/my_corpus`; when working with your own data, the corpus path will be the main thing to update.

To tokenize the Japanese text to add spaces, first download the Japanese tokenizer model via:


.. code-block::
mfa model download tokenizer japanese_mfa
Once you have the model downloaded, you can tokenize your corpus via:

.. code-block::
mfa tokenize ~/mfa_data/my_corpus japanese_mfa ~/mfa_data/tokenized_version
You can check the tokenized text in :code:`~/mfa_data/tokenized_version`, verify that it looks good, and copy the files to replace the untokenized files in :code:`~/mfa_data/my_corpus` for use in alignment.
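If you decide to overwrite the originals, a minimal shell sketch might look like the following (it assumes plain-text transcripts and that the tokenized output mirrors the corpus layout; back up the corpus first):

.. code-block::

   # Back up the original corpus, then overwrite it with the tokenized transcripts
   cp -r ~/mfa_data/my_corpus ~/mfa_data/my_corpus_backup
   cp -r ~/mfa_data/tokenized_version/. ~/mfa_data/my_corpus/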

.. warning::

MFA's tokenizer models are nowhere near state of the art, and I recommend using other tokenizers where they make sense:

* Japanese: `nagisa <https://nagisa.readthedocs.io/en/latest/>`_
* Chinese: `spacy-pkuseg <https://github.com/explosion/spacy-pkuseg/blob/master/readme/readme_english.md>`_
* Thai: `sertiscorp/thai-word-segmentation <https://github.com/sertiscorp/thai-word-segmentation>`_

The tokenizers above were used in the initial construction of the MFA training corpora, though the training segmentations for Japanese have begun to diverge from :code:`nagisa`, which breaks phonological words into morphological parses that, for the purposes of acoustic model training and alignment, are better left unsplit (nagisa: :ipa_inline:`使っ て [ts ɨ k a Q t e]` vs MFA: :ipa_inline:`使って [ts ɨ k a tː e]`). The MFA tokenizer models are provided as an easy starting point, since the tokenizers listed above may have extra dependencies and platform restrictions.
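As an illustration of using one of these external tokenizers instead, here is a minimal sketch (not MFA code) that pre-tokenizes plain-text Japanese transcripts with :code:`nagisa`; the :code:`.lab` extension and corpus path are assumptions about your layout, so check the nagisa documentation for the exact API of your installed version:

.. code-block:: python

   from pathlib import Path

   import nagisa  # pip install nagisa

   corpus = Path("~/mfa_data/my_corpus").expanduser()
   for lab_path in corpus.rglob("*.lab"):  # assumes plain-text .lab transcripts
       text = lab_path.read_text(encoding="utf-8").strip()
       words = nagisa.tagging(text).words  # surface tokens as a list of strings
       lab_path.write_text(" ".join(words) + "\n", encoding="utf-8")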

.. toctree::
:maxdepth: 1
:hidden:
2 changes: 1 addition & 1 deletion montreal_forced_aligner/alignment/mixins.py
@@ -551,7 +551,7 @@ def compile_information(self) -> None:
average_logdet_frames += data["logdet_frames"]
average_logdet_sum += data["logdet"] * data["logdet_frames"]

if hasattr(self, "db_engine"):
if hasattr(self, "session"):
csv_path = self.working_directory.joinpath("alignment_log_likelihood.csv")
with mfa_open(csv_path, "w") as f, self.session() as session:
writer = csv.writer(f)
28 changes: 23 additions & 5 deletions montreal_forced_aligner/command_line/train_tokenizer.py
@@ -12,7 +12,10 @@
common_options,
)
from montreal_forced_aligner.config import GLOBAL_CONFIG, MFA_PROFILE_VARIABLE
-from montreal_forced_aligner.tokenization.trainer import TokenizerTrainer
+from montreal_forced_aligner.tokenization.trainer import (
+    PhonetisaurusTokenizerTrainer,
+    TokenizerTrainer,
+)

__all__ = ["train_tokenizer_cli"]

@@ -48,6 +51,12 @@
"most of the data and validating on an unseen subset.",
default=False,
)
+@click.option(
+    "--phonetisaurus",
+    is_flag=True,
+    help="Flag for using Phonetisaurus-style models.",
+    default=False,
+)
@common_options
@click.help_option("-h", "--help")
@click.pass_context
@@ -63,10 +72,19 @@ def train_tokenizer_cli(context, **kwargs) -> None:
config_path = kwargs.get("config_path", None)
corpus_directory = kwargs["corpus_directory"]
output_model_path = kwargs["output_model_path"]
-trainer = TokenizerTrainer(
-    corpus_directory=corpus_directory,
-    **TokenizerTrainer.parse_parameters(config_path, context.params, context.args),
-)
+phonetisaurus = kwargs["phonetisaurus"]
+if phonetisaurus:
+    trainer = PhonetisaurusTokenizerTrainer(
+        corpus_directory=corpus_directory,
+        **PhonetisaurusTokenizerTrainer.parse_parameters(
+            config_path, context.params, context.args
+        ),
+    )
+else:
+    trainer = TokenizerTrainer(
+        corpus_directory=corpus_directory,
+        **TokenizerTrainer.parse_parameters(config_path, context.params, context.args),
+    )

try:
trainer.setup()
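As a usage sketch of the new flag (the exact :code:`mfa train_tokenizer` invocation and paths here are assumptions based on the parameters above, not taken from this commit):

.. code-block::

   # Train a Phonetisaurus-style tokenizer model from a corpus (paths are placeholders)
   mfa train_tokenizer ~/mfa_data/my_corpus ~/mfa_data/my_tokenizer_model.zip --phonetisaurus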
7 changes: 2 additions & 5 deletions montreal_forced_aligner/corpus/multiprocessing.py
@@ -494,11 +494,8 @@ def _no_dictionary_sanitize(self, session):
text.append(w)
if character_text:
character_text.append("<space>")
-if self.bracket_regex.match(w):
-    character_text.append(self.bracketed_word)
-else:
-    for g in w:
-        character_text.append(g)
+for g in w:
+    character_text.append(g)
text = " ".join(text)
character_text = " ".join(character_text)
yield {
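As a toy illustration (not MFA code) of the effect of this change on the character-level text: every word, bracketed or not, is now split into its graphemes, with :code:`<space>` markers between words:

.. code-block:: python

   def character_tokens(words):
       # Mirror of the loop above: graphemes separated by spaces, "<space>" between words
       character_text = []
       for w in words:
           if character_text:
               character_text.append("<space>")
           for g in w:
               character_text.append(g)
       return " ".join(character_text)

   print(character_tokens(["使って", "[laugh]"]))
   # 使 っ て <space> [ l a u g h ]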
8 changes: 3 additions & 5 deletions montreal_forced_aligner/g2p/mixins.py
@@ -1,6 +1,6 @@
"""Mixin module for G2P functionality"""
import typing
-from abc import ABCMeta, abstractmethod
+from abc import ABCMeta
from pathlib import Path
from typing import Dict, List

@@ -36,7 +36,6 @@ def __init__(
self.g2p_threshold = g2p_threshold
self.include_bracketed = include_bracketed

-@abstractmethod
def generate_pronunciations(self) -> Dict[str, List[str]]:
"""
Generate pronunciations
@@ -46,13 +45,12 @@ def generate_pronunciations(self) -> Dict[str, List[str]]:
dict[str, list[str]]
Mappings of keys to their generated pronunciations
"""
-...
+raise NotImplementedError

@property
-@abstractmethod
def words_to_g2p(self) -> List[str]:
"""Words to produce pronunciations"""
-...
+raise NotImplementedError


class G2PTopLevelMixin(MfaWorker, DictionaryMixin, G2PMixin):
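A toy sketch (not MFA code) of the pattern adopted here: dropping :code:`@abstractmethod` in favor of plain methods that raise :code:`NotImplementedError` keeps the mixin freely subclassable and instantiable, while still failing loudly if an unimplemented method is actually called:

.. code-block:: python

   from typing import Dict, List


   class G2PLikeMixin:
       def generate_pronunciations(self) -> Dict[str, List[str]]:
           """Subclasses that generate pronunciations override this."""
           raise NotImplementedError

       @property
       def words_to_g2p(self) -> List[str]:
           """Subclasses override this to supply the word list."""
           raise NotImplementedError


   class Passthrough(G2PLikeMixin):
       pass


   p = Passthrough()  # fine; with ABCMeta + @abstractmethod this would raise TypeError
   # p.generate_pronunciations() would raise NotImplementedError only if called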
