Download the materials here!

Thanks to the creators of the Northwestern ALLSSTAR Corpus and Mozilla Common Voice Corpus for the publicly accessible data! Many of the recordings used here come from these two corpora.

Overview

The Montreal Forced Aligner is a forced alignment system with acoustic models built using the Kaldi ASR toolkit. A major highlight of this system is the availability of pretrained acoustic models and grapheme-to-phoneme models for a wide variety of languages, as well as the ability to train acoustic and grapheme-to-phoneme models on any new dataset you might have. It also draws on Kaldi's advanced techniques for training and aligning speech data, with a full suite of training and speaker adaptation algorithms. The basic acoustic model recipe uses the traditional GMM-HMM framework, starting with monophone models, then triphone models (which allow for context sensitivity, read: coarticulation), with some transformations and speaker adaptation along the way. You can read more about the recipe on the MFA Read the Docs page, and for an overview of a standard training recipe, check out this page in the Kaldi tutorial.

As with any forced alignment system, the Montreal Forced Aligner will time-align a transcript to a corresponding audio file at the phone and word levels provided there exist a set of pretrained acoustic models and a lexicon/dictionary of the words in the transcript with their canonical phonetic pronunciation(s). The phone set used in the dictionary must match the phone set in the acoustic models. The orthography used in the dictionary must also match that in the transcript.

Very generally, the procedure is as follows:

  1. Prepare the wav files and their corresponding transcripts (TextGrids or .lab/.txt files).
  2. Obtain a pronunciation dictionary covering the words in the transcripts (download one, generate one with a G2P model, or build one by hand).
  3. Obtain pretrained acoustic models, or train your own on the corpus.
  4. Run the aligner to produce time-aligned TextGrids at the word and phone levels.

You will also need to identify or create an input folder that contains the wav files and TextGrids/transcripts and an output folder for the time-aligned TextGrids to be created. These cannot be the same folder or you will get an error. It will make your life easier if you keep the output directory clean (e.g., empty) on each run. Unless otherwise specified, MFA will not overwrite files in that directory.
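As a concrete preview (with hypothetical Desktop paths; every command here is covered in detail below), a basic run with the pretrained English models looks like this:

conda activate aligner

mfa model download acoustic english

mfa model download dictionary english

mfa align ~/Desktop/input ~/Documents/MFA/pretrained_models/dictionary/english.dict english ~/Desktop/output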

Audio files

The Montreal Forced Aligner is now incredibly robust to audio files of differing formats, sampling rates, and channels. You should not have to do much prep, but note that whatever you feed the system will be converted to a wav file with a sampling rate of 16 kHz and a single (mono) channel unless otherwise specified (see Feature Configuration). For the record, I have not yet tried any file format other than wav, so I'm not yet aware of potential issues that might arise there.

Transcripts

The MFA can take as input either a Praat TextGrid or a .lab or .txt file. I have worked most extensively with the TextGrid input, so I’ll describe those details here. As for .lab and .txt input, I believe this method only works when the transcript is pasted in as a single line. In other words, I don’t think it can handle timestamps for utterance start and end times.

Filenames

  • The filename of the wav file and its corresponding transcript must match exactly, except for the extension (.wav or .TextGrid).
  • If you are doing speaker/channel adaptation, it helps to have the speaker ID as the prefix to the filename and the utterance ID as the suffix. For example:
spkr01_cat1.wav, spkr01_cat1.TextGrid  
spkr01_cat2.wav, spkr01_cat2.TextGrid  
spkr01_cat3.wav, spkr01_cat3.TextGrid  
spkr01_cat4.wav, spkr01_cat4.TextGrid  
spkr02_cat1.wav, spkr02_cat1.TextGrid  
spkr02_cat2.wav, spkr02_cat2.TextGrid  
spkr02_cat3.wav, spkr02_cat3.TextGrid  
spkr02_cat4.wav, spkr02_cat4.TextGrid  

This requirement of the initial speaker prefix definitely holds when training new acoustic models, and it might also be required if you want the speaker adaptation to work during alignment. I'm not sure, but I do know that if you are just doing alignment with, e.g., pretrained acoustic models, then you mostly just need to pay attention to the first point above.

Pronunciation dictionary

NB: I will use the terms lexicon and dictionary interchangeably. The pronunciation lexicon must be a two-column text file with a list of words on the left-hand side and the phonetic pronunciation(s) on the right-hand side. Each word should be separated from its phonetic pronunciation by a tab, and each phone in the phonetic pronunciation should be separated by a space. Many-to-many mappings between words and pronunciations are permitted. In fact, you can even add pronunciation probabilities to the lexicon, but I have not yet tried this!

One important point: the phone set in your lexicon must match that used in the acoustic models and the orthography must match that in the transcripts.
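For concreteness, a few hypothetical entries in an ARPABET-style lexicon might look like this (a tab separates the word from its pronunciation, and a word can be listed more than once):

cat	K AE1 T
dog	D AO1 G
data	D EY1 T AH0
data	D AE1 T AH0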

There are a few options for obtaining a pronunciation lexicon, outlined below. More details about these options are in the sections to come.

  • Download a large-scale preexisting pronunciation dictionary

NB: you must add any missing words in your corpus manually, or train a G2P model to handle these cases. As of writing, there are dictionaries for English, French, and German available for download from the MFA. You can also copy and paste some additional dictionaries into text files from the MFA website. Another option is to scrape pronunciations from Wiktionary using WikiPron (but the transcription system can have high variance within a language).

  • Generate a lexicon using a pretrained grapheme-to-phoneme (G2P) model

You can download and apply a pretrained G2P model from MFA or use another resource, like Epitran or XPF, to generate these for you. These systems automatically convert the words in your corpus to the most likely phonetic pronunciation according to what they have learned (a sketch follows below). Remember that whichever phone set you use here must be the same phone set used in your acoustic model.

MFA does not yet support using pretrained G2P models on Windows.
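As a rough sketch, the workflow with a pretrained G2P model looks like the following (the model name is hypothetical; the mfa g2p call takes the G2P model, the folder or word list containing the new words, and the path for the output lexicon, just like the Romanian example later in this tutorial):

mfa model download g2p french_g2p

mfa g2p ~/Documents/MFA/pretrained_models/g2p/french_g2p.zip ~/Desktop/input ~/Desktop/french_lexicon.txt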

  • Train a G2P model using a pre-existing lexicon to generalize to unseen orthographic forms

MFA does not yet support training G2P models on Windows.

  • Create the pronunciation lexicon by hand using the same phone set as the acoustic models

Acoustic models

Pretrained acoustic models for several languages can be downloaded directly using the command line interface.

You can also train an acoustic model yourself directly on the data you're working on. I can't say for sure how much alignment accuracy varies based on your dataset size, but I do know it's typically not good at all with a very small sample (e.g., ~10 minutes of speech). I don't know the minimum amount of data needed before the alignment becomes acceptable, though. As always, look at your data and check the output just to be safe. Worst comes to worst, you manually clean it up. At least the boundaries are mostly already in place.

Practical

In the next part of the tutorial, we will work through several case studies to explore some functionality of the MFA, including some of the code presented above. I'll also try to highlight some tips and tricks along the way, but note that this version of the MFA is very robust to potential issues and makes good use of warnings and error messages!

Installation

Did it work? A new release was just pushed last night. Apparently Kaldi had a substantial change in the past few days: it fully broke my aligner, but you know, you can’t fully understand something unless you break it at some point or another. Many many thanks to Michael McAuliffe for being so responsive and helpful!

The installation page is here.

MFA is now released through conda, which is a package management system. With a conda install or create command, conda connects to a server and downloads the set of packages needed to run the program properly. After you install Miniconda3, the installation command should be:

conda create -n aligner -c conda-forge montreal-forced-aligner

This creates a little packaged environment for you to interact with on your computer. It is set aside from everything else on your computer, so even if you have your own version of, e.g., Kaldi or numpy installed, the version used by the aligner is kept separate in this environment.

Something that might be useful: the -n flag refers to the name of the environment. If you're unsure whether you want to overwrite an existing (working) version of the aligner with a new update, you can change the -n flag to something else just to check whether the new version works first. In the next activation step, you would then just activate that second environment. All environments are at the same level of hierarchy, so it shouldn't matter which environment you're working in when you create a new one (they don't nest).
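For example, to try out a new release without touching your working environment (the environment name here is just an example):

conda create -n aligner_test -c conda-forge montreal-forced-aligner

conda activate aligner_test

mfa version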

Activating the aligner

Each time you want to use the aligner, make sure that the aligner is activated. You will need to re-activate the aligner every time you open a new shell/command line window. At least on a Mac, you should be able to see which environment you have open in the parentheses before each shell prompt. For example, mine looks like:

(base) Eleanors-iPro:MFA eleanorchodroff$ conda activate aligner
(aligner) Eleanors-iPro:MFA eleanorchodroff$ 

And after the dollar sign, there’s a blinking cursor. Note that I just happen to be in the ~/Documents/MFA folder. You can also run the aligner without a problem from anywhere else on the computer, as long as you’re in the aligner environment on the command line:

(aligner) Eleanors-iPro:Desktop eleanorchodroff$ 

Checking the version

mfa version

Does yours say 2.0.0 (or some variation on 2.0.0 like 2.0.0b3)? You can find the installation instructions here. If you never had to activate the conda environment, then follow the first set of instructions for “All platforms”. If you have an older version that did require conda activation but it is not 2.0.0+, then you should follow the second set of instructions for “Upgrading from non-conda version” (where non-conda here, I believe, refers to the Conda Forge distribution and not just the conda environment).
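If your current installation did come from conda-forge, updating is typically just a conda update inside the activated environment (a sketch; check the installation page for the current recommendation):

conda activate aligner

conda update -c conda-forge montreal-forced-aligner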

A sneak peek

Run the following code to see almost the full range of functionality within the MFA suite. -h is a flag to the command mfa that stands for “help”. Important tangent: commands can take arguments and flags. Arguments are required by the command; flags are optional.

Each listed line is a separate command that you can use. We’ll be covering align, model (model inspect, model download, model save), train, and validate.

mfa -h

If you ever need a reminder about how a specific command works, you can add -h after it. This will give you a detailed overview of the required arguments, their order, and any optional flags you can specify. For example:

mfa align -h

If you just want a basic overview of the arguments, you can simply type the command with no arguments following it:

mfa align

Running the aligner

Some overarching instructions, just for today, in the hopes of minimizing path errors:

Fortunately the folders Desktop and Documents have the same name across Mac and Windows systems. We may have to update these instructions, or you may have to substitute Desktop with some other folder if you have trouble writing new files there.

Example 1: Basic

This example assumes you are using a pre-existing large lexicon and pretrained acoustic models. With reference to our ingredients:

  • wav files: prepped (same name as TextGrids)
  • TextGrids: prepped (same name as wavs)
  • lexicon: large pre-existing lexicon
  • acoustic models: pretrained model
  1. Download acoustic models and dictionary for American English

Any pretrained models you download will be sent to the ~/Documents/MFA/pretrained_models/ folder. Download pretrained acoustic models, dictionaries, or G2P models using the mfa model download command.

mfa model download acoustic english

mfa model download dictionary english

We can even inspect these models to figure out things like the assumed phone set using the mfa model inspect command.

Following mfa model {download|inspect}, you always specify either acoustic, dictionary, or g2p (for a G2P model: see below), followed by the name of the model.

mfa model inspect acoustic english

mfa model inspect dictionary english
  2. Create input and output folders

For this tutorial, let's keep it simple and put these on the Desktop.
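On Mac/Linux, for example, you can create both folders from the shell (Windows users can make them in File Explorer or with the md command):

mkdir ~/Desktop/input ~/Desktop/output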

  3. Place TextGrids and wav files in the input folder

  4. Run the Montreal Forced Aligner with the align command, but make sure to update the arguments!

conda activate aligner

mfa align corpus_directory dictionary acoustic_model output_directory

If you forget what the arguments are, just type the name of the command:

mfa align

The align command takes 4 arguments:

1. where are the wav files and TextGrids? (path to input folder)  
2. where is the dictionary? (path necessary)  
3. where are the acoustic models? (path and .zip are not necessary)
4. where should the output go? (path to output folder)  

Explicit example:

mfa align --clean ~/Desktop/input ~/Documents/MFA/pretrained_models/dictionary/english.dict  english ~/Desktop/output

For non-Mac users, note that the tilde, ~, refers to your home directory. In my case, it's shorthand for /Users/eleanorchodroff/. On a Windows computer, you might need to specify the full path starting from e.g., C:\\.

In the above example, I wrote out full paths for every argument, just to be safe. That is, I started from the root of the computer, and worked my way down to the exact file location. The only exception to this is the “path” to the acoustic model. For whatever reason, the acoustic models must always be in the pretrained_models/acoustic directory, and even though you will physically see them with a .zip extension, you simply refer to them by their shorthand name in the call to the aligner.

I also used an optional “flag”: --clean. This does not need to be present, but I do prefer to use this. If I run the aligner multiple times on an input folder with the same name, it won’t overwrite the model/data, unless I clean out the old model first.

Note that there are several other optional flags you can specify. Use the mfa align -h command to see more.

Example 2: Prep the TextGrids – all-in-one

Note that one of the harder parts of using forced alignment is prepping the TextGrids.

The most straightforward (but potentially riskiest) implementation of the aligner with TextGrid input is to paste the transcript into a TextGrid with a single interval tier. If you would like the aligner to do some speaker/channel adaptation for potentially improved performance, the name of the TextGrid tier should reflect the speaker/channel ID. If you don’t care about that, you can name the tier whatever you want: it still works fine. You can also just paste the transcript into a .txt or .lab file and rename it so it matches the wav file; however, the TextGrid format has more overall flexibility, so we’re going to stick to that format.

create_textgrid_mfa_simple.praat

Example 3: Prep the TextGrids – utterance-specific intervals

If you have utterance-level timestamps, you can also add in intervals for an alignment to minimize the possibility of “derailment”. By “derail”, I mean that the aligner gets thrown off early on in the wav file and never gets back on track. The result can be a pretty terrible “alignment”. By delimiting the temporal span of an utterance, the aligner has a chance to reset at the next utterance, even if the preceding utterance was completely misaligned. Side note: misalignments are more likely to occur if there’s additional noise in the wav file (e.g., coughing, background noise) or if the speech and transcript don’t match at either the word or phone level (e.g., pronunciation of a word does not match the dictionary/lexicon entry).

If the transcript has start and end times for each utterance (3-column text file with start time, end time, text): create_textgrid_mfa_timestamps.praat

If the transcript has only start times for each utterance (2-column text file with start time, text): create_textgrid_mfa_timestamps2.praat

Example 4: Modifying the lexicon 1 – nonwords

Case study: Nonsense words with “illegal” consonant clusters in English. Did the participant produce a schwa or not?

These are all nonsense words, so let’s just create a new dictionary that will be compatible with our English acoustic models: those models use ARPABET with stress marked. If you need a reminder, you can run the mfa model inspect acoustic english command, or just go to the pretrained_models/dictionary directory and look inside english.dict.

I've created a lexicon with two entries for each word: one with the intact consonant cluster and one with schwa epenthesis. Note that without any further specification, the model will assume equal probability over all entries for a given word. We'll let the model decide, based on the acoustics and the assumption that each phone is at least 30 ms long (plus some other constraints), whether or not the participant produced a schwa.
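For example, for a hypothetical nonword like ptak, the two entries might look like this (tab-separated, with AH0 as the epenthetic schwa):

ptak	P T AE1 K
ptak	P AH0 T AE1 K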

mfa align --clean ~/Desktop/input/ ~/Desktop/MFATutorial2021/ex4_english_modify1/CC_list.txt english ~/Desktop/output/

Example 5: Modifying the lexicon 2 – “g-dropping”

We’ve got real words of English, but did the speaker produce ng or n? This is a toy example with the words “walking” and “going”. Let’s modify english.dict to allow for both possibilities of “walking”/“going” and “walkin’”/“goin’” without making any commitment to what the speaker actually said. We’ll let the model decide based on the acoustics. (Note that without any further specification, the model will assume equal probability over all entries for a given word.)
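Concretely, the modified entries might look something like this (standard ARPABET pronunciations plus a "g-dropped" variant ending in N):

walking	W AO1 K IH0 NG
walking	W AO1 K IH0 N
going	G OW1 IH0 NG
going	G OW1 IH0 N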

mfa validate --ignore_acoustics ~/Desktop/input ~/Documents/MFA/pretrained_models/dictionary/english.dict

mfa align --clean ~/Desktop/input ~/Desktop/nonwords.txt english ~/Desktop/output

See also Yuan & Liberman (2008), Yuan & Liberman (2011), Bailey (2016) for similar approaches.

Example 6: Train and align 1 – Spanish

Spanish! We’ve got some data from the ALLSSTAR corpus; can we get a reasonable alignment?

  • wav files: prepped (same name as TextGrids)
  • TextGrids: prepped (same name as wavs)
  • lexicon: large pre-existing lexicon
  • acoustic models: ultimately train

Some glaring issues with the lexicon already: does anyone spot the discrepancy?

Now for an intentional mistake:

mfa model download acoustic spanish

mfa align ~/Desktop/input ~/Desktop/MFATutorial2021/Spanish/talnupf_spanish.txt spanish ~/Desktop/output

We can’t use the pretrained Spanish acoustic model because it assumes a different phone set from the one we have in our pronunciation dictionary! Let’s train our own acoustic model:

mfa train --clean ~/Desktop/input ~/Desktop/MFATutorial2021/Spanish/talnupf_spanish.txt ~/Desktop/output

But look at the output… yikes… this is about 10 minutes of speech. In my brief experience, things start to look a little more reasonable/stable with maybe ~45 minutes, but this is a guess and fully open to investigation.

Example 7: Train and align 2 – Guarani

We’ll be aligning Guarani with an acoustic model trained on about 50 minutes of speech from 32 speakers. The data subset comes from the Mozilla Common Voice Corpus. Training was done beforehand because it still took around 1 hour. After training, I could save the output of the most “trained up” model, sat2_ali (but check out the structure of your ~/Documents/MFA/ folder because the training recipe might vary slightly between versions).

More generally, you should probably just poke around the generated documents from the training procedure in ~/Documents/MFA/. This is the folder we clean out with the --clean flag, and it's where all the docs for the training/alignment go.

For the record, you can save a trained-up acoustic model using:

mfa model save acoustic ~/Documents/MFA/prep_validated_guarani/sat2_ali/acoustic_model.zip --name guarani_cv

Where do we start?

  1. Place the acoustic model (guarani_cv.zip) in the pretrained_models/acoustic directory

  2. Note the path to the pronunciation dictionary guarani_lexicon.txt

  3. Create the input/output folders, populate the input folder with the wav files and TextGrids, and make sure the output folder is clean.

  4. Run the aligner:

mfa align --clean ~/Desktop/input ~/Desktop/MFATutorial2021/ex6_guarani_guarani_lexicon.txt guarani_cv ~/Desktop/output

Extra: Under development

Useful terms

  • path: an address to a folder on your computer
  • type pwd (mac/linux) or CD (windows, with no argument) to see the path of your current working directory. pwd = “print working directory”
  • use cd .. (mac/linux) or CD .. (windows) to move up a folder in the tree
  • use ls (mac/linux) or dir (windows) to list all files in the directory. You can also use ls *.TextGrid to see just the TextGrids
  • very useful: use ctrl-c to cancel any command in the shell and stop all processing. That is, hold down the control key and press the c key.
  • hit the tab key to autocomplete a filename or directory name in the shell
  • at least for Mac terminal, you can use command-mouse_click to move the cursor to a specific point in the line of the terminal (otherwise, you’re stuck using the arrow keys)

Training and using a G2P model (Mac/Linux only)

Let’s say you’re working with some Romanian data and you technically have all the ingredients you need to train an acoustic model and align the files. That is, for every word in your corpus, you have a phonetic pronunciation, and you can use that small-ish Romanian pronunciation dictionary to then train new acoustic models and align the wav files.

Now, you’ve acquired some new data and don’t have the pronunciations for those words. Using your small-ish Romanian pronunciation dictionary, you can train a new G2P model so you can generate reasonable pronunciations for new Romanian words!

To do this, you use the mfa train_g2p function:

mfa train_g2p --clean ~/Desktop/romanian/romanian_lexicon.txt romanian_cv ~/Desktop/romanian/romanian_g2p

You’ll then need to follow that up by generating the pronunciations for your new Romanian words with:

mfa g2p --clean ~/Documents/MFA/pretrained_models/g2p/romanian_cv.zip ~/Desktop/romanian/TextGrids_with_new_words/ ~/Desktop/romanian/new_romanian_lexicon.txt

Multiple speakers in the same file and stereo recordings

You can list multiple speakers in the same TextGrid on separate tiers, and the alignment will be done separately for each, but still be returned to you in a single TextGrid.

If you have a stereo recording, channel 1 is assumed to be the first listed speaker in the TextGrid; channel 2 is the second listed speaker in the TextGrid. I am not positive, but I would assume that if you have only one tier, then the MFA will only use channel 1 and not a mix of the two channels. Worth checking, but you might need to do that separately.

Generating pronunciation probabilities in a speech corpus

If you list multiple pronunciation variants for a word in the dictionary, the model typically assumes that each variant is equally probable. If you have 2 variants listed, then it assumes a probability of 0.5 over each. If you have 4 variants, then a probability of 0.25, etc. It then determines which variant is more likely given the observed acoustic signal.

You can, however, have the model estimate the observed pronunciation probabilities in the dataset and also use those prior probabilities for potentially improved alignment.

You can use the mfa train_dictionary command to determine the observed rate of each pronunciation variant. You can then use that updated pronunciation lexicon (note that it has an extra column for the probability) for potentially improved alignment and training. Be careful with this if your research question is actually about determining the rate of pronunciation variation: you might want to approach the dataset in an unbiased way.

mfa train_dictionary --clean ~/Desktop/romanian/prep_validated/ ~/Desktop/romanian/romanian_lexicon.txt romanian_cv ~/Desktop/romanian/dictionary/
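In the updated lexicon, each entry gains a probability column between the word and its pronunciation, roughly along these lines (hypothetical values; the exact layout can differ across MFA versions, so check the generated file):

walking	0.85	W AO1 K IH0 NG
walking	0.15	W AO1 K IH0 N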

Acknowledgments

Huge thanks to the MFA team for creating such an amazing aligner, and a special thanks to Michael McAuliffe for all of his help and responsiveness.

I also want to thank Emily Ahn (University of Washington) for helping me work out the functionality of MFA 2.0. Finally, I’d like to thank the audiences at Northwestern University (2018), Newcastle University (2019), University of York (2020, 2021), and UT Austin (2021) for working with me on previous “drafts” of this tutorial for MFA 1.0 and 1.1.