Download the materials here!
Thanks to the creators of the Northwestern ALLSSTAR Corpus and Mozilla Common Voice Corpus for the publicly accessible data! Many of the recordings used here come from these two corpora.
The Montreal Forced Aligner is a forced alignment system with acoustic models built using the Kaldi ASR toolkit. A major highlight of this system is the availability of pretrained acoustic models and grapheme-to-phoneme models for a wide variety of languages, as well as the ability to train acoustic and grapheme-to-phoneme models on any new dataset you might have. It also makes use of Kaldi's advanced techniques for training and aligning speech data, with a full suite of training and speaker adaptation algorithms. The basic acoustic model recipe uses the traditional GMM-HMM framework, starting with monophone models, then triphone models (which allow for context sensitivity, read: coarticulation), with some transformations and speaker adaptation along the way. You can read more about the recipe on the MFA Read the Docs page, and for an overview of a standard training recipe, check out this page in the Kaldi tutorial.
As with any forced alignment system, the Montreal Forced Aligner will time-align a transcript to a corresponding audio file at the phone and word levels, provided you have a set of pretrained acoustic models and a lexicon/dictionary of the words in the transcript with their canonical phonetic pronunciation(s). The phone set used in the dictionary must match the phone set in the acoustic models. The orthography used in the dictionary must also match that in the transcript.
Very generally, the procedure is as follows: prepare a transcript for each audio file (a TextGrid or .lab/.txt file), obtain a pronunciation dictionary and acoustic models, and run the aligner. You will also need to identify or create an input folder that contains the wav files and TextGrids/transcripts and an output folder for the time-aligned TextGrids to be created. These cannot be the same folder or you will get an error. It will make your life easier if you keep the output directory clean (e.g., empty) on each run. Unless otherwise specified, MFA will not overwrite files in that directory.
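For example, from a Mac/Linux shell you could create the two folders like this (the Desktop locations are simply the ones used later in this tutorial):

mkdir -p ~/Desktop/input ~/Desktop/output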
The Montreal Forced Aligner is now incredibly robust to audio files of differing formats, sampling rates, and channels. You should not have to do much prep, but note that whatever you feed the system will be converted to a wav file with a sampling rate of 16 kHz and a single (mono) channel unless otherwise specified (see Feature Configuration). For the record, I have not yet tried any file format other than wav, so I'm not yet aware of potential issues that might arise there.
The MFA can take as input either a Praat TextGrid or a .lab or .txt file. I have worked most extensively with the TextGrid input, so I'll describe those details here. As for .lab and .txt input, I believe this method only works when the transcript is pasted in as a single line; in other words, I don't think it can handle timestamps for utterance start and end times.
Two points about file naming:

1. The name of each wav file and its corresponding transcript must fully match except for the extension (.wav or .TextGrid).
2. The file names should begin with a speaker ID prefix, as in the following example:

spkr01_cat1.wav, spkr01_cat1.TextGrid
spkr01_cat2.wav, spkr01_cat2.TextGrid
spkr01_cat3.wav, spkr01_cat3.TextGrid
spkr01_cat4.wav, spkr01_cat4.TextGrid
spkr02_cat1.wav, spkr02_cat1.TextGrid
spkr02_cat2.wav, spkr02_cat2.TextGrid
spkr02_cat3.wav, spkr02_cat3.TextGrid
spkr02_cat4.wav, spkr02_cat4.TextGrid
This requirement of the initial speaker prefix is at least true when training new acoustic models, but it might also be required if you want the speaker adaptation to work during alignment. Not sure, but I do know that if you are just doing alignment with e.g., pretrained acoustic models, then you mostly just need to pay attention to Point 1.
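If you want a quick sanity check on Point 1, a small shell loop along these lines (Mac/Linux; ~/Desktop/input is just the input folder used later in this tutorial) will flag any wav file that lacks a matching TextGrid:

cd ~/Desktop/input
for f in *.wav; do
  [ -e "${f%.wav}.TextGrid" ] || echo "missing TextGrid for $f"
done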
NB: I will use the terms lexicon and dictionary interchangeably. The pronunciation lexicon must be a two-column text file with the words in the left-hand column and the phonetic pronunciation(s) in the right-hand column. Each word should be separated from its phonetic pronunciation by a tab, and each phone in the phonetic pronunciation should be separated by a space. Many-to-many mappings between words and pronunciations are permitted: a word can have multiple pronunciations (each on its own line), and multiple words can share a pronunciation. In fact, you can even add pronunciation probabilities to the lexicon, but I have not yet tried this!
One important point: the phone set in your lexicon must match that used in the acoustic models and the orthography must match that in the transcripts.
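For illustration, the first few lines of an English lexicon in ARPABET might look like the following, with a tab between each word and its pronunciation and a space between phones (these are CMUdict-style example entries, not necessarily identical to those in english.dict):

cat	K AE1 T
cats	K AE1 T S
read	R IY1 D
read	R EH1 D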
There are a few options for obtaining a pronunciation lexicon, outlined below. More details about these options are in the sections to come.
NB: you must add any missing words in your corpus manually, or train a G2P model to handle these cases. As of writing, there are dictionaries for English, French, and German available for download from the MFA. You can also copy and paste some additional dictionaries from the MFA website into text files. Another option is to scrape pronunciations from Wiktionary using WikiPron (but the transcription conventions there can vary considerably within a language).
You can download and apply a pretrained G2P model from MFA or use another resource, like Epitran or XPF, to generate these for you. These systems automatically convert the words in your corpus to the most likely phonetic pronunciation according to what the model has learned. Remember that whichever phone set you use here must be the same phone set used in your acoustic model.
MFA does not yet support using pretrained G2P models on Windows.
MFA does not yet support training G2P models on Windows.
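As a rough sketch of that workflow (the model name english_g2p is my assumption here; check the MFA documentation for the names of the currently available G2P models):

mfa model download g2p english_g2p
mfa g2p ~/Documents/MFA/pretrained_models/g2p/english_g2p.zip ~/Desktop/input ~/Desktop/new_words.dict

The second command follows the same argument order used in the Romanian example later in this tutorial: the G2P model, the folder of transcripts containing the new words, and the pronunciation lexicon to be generated.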
Pretrained acoustic models for several languages can be downloaded directly using the command line interface.
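For example, the English acoustic model used later in this tutorial is downloaded with:

mfa model download acoustic english

I believe that leaving off the model name (i.e., just mfa model download acoustic) will print the list of available models instead.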
You can also train an acoustic model yourself directly on the data you're working with. I can't say for sure how much alignment accuracy varies with dataset size, but I do know it's typically not good at all with a very small sample (e.g., ~10 minutes of speech). I don't know what the minimum amount of data is before the alignment becomes acceptable, though. As always, look at your data and check the output just to be safe. Worst comes to worst, you clean it up manually; at least the boundaries are mostly already in place.
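The general form of the training call is sketched below (placeholder paths on the Desktop, with my_lexicon.txt standing in for your pronunciation dictionary); the Spanish case study later in the tutorial uses the same pattern:

mfa train --clean ~/Desktop/input ~/Desktop/my_lexicon.txt ~/Desktop/output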
In the next part of the tutorial, we will work through several case studies to explore some functionality of the MFA, including some of the code presented above. I'll also try to highlight some tips and tricks along the way, but note that this version of the MFA is very robust to potential issues and makes good use of warnings and error messages!
Did it work? A new release was just pushed last night. Apparently Kaldi had a substantial change in the past few days: it fully broke my aligner, but you know, you can’t fully understand something unless you break it at some point or another. Many many thanks to Michael McAuliffe for being so responsive and helpful!
The installation page is here.
MFA is now released through conda, which is a package management system: with a conda install or create command, your computer connects to a server and downloads the set of packages needed to run the program properly. After you install Miniconda3, the installation command should be:
conda create -n aligner -c conda-forge montreal-forced-aligner
This creates a little packaged environment for you to interact with on your computer. It is kept separate from everything else on your computer, so even if you already have your own version of, e.g., Kaldi or numpy installed, the version used by the aligner is kept inside this environment.
Something that might be useful: the -n flag refers to the name of the environment. If you're unsure whether you want to overwrite an existing (working) version of the aligner with a new update, you can change the -n flag to something else just to check whether the new version works first. In the next activation step, you would then just activate that second environment. All environments are at the same level of hierarchy, so it shouldn't matter which environment you're working in when you create a new one (they don't nest).
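For example, to try out a new release without touching a working aligner environment, you could create and activate a second, throwaway environment (aligner_test is just a made-up name):

# create a separate environment for testing a new release
conda create -n aligner_test -c conda-forge montreal-forced-aligner

# use the test environment instead of the usual one
conda activate aligner_test

# list all conda environments on your computer
conda env list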
Each time you want to use the aligner, make sure that the aligner is activated. You will need to re-activate the aligner every time you open a new shell/command line window. At least on a Mac, you should be able to see which environment you have open in the parentheses before each shell prompt. For example, mine looks like:
(base) Eleanors-iPro:MFA eleanorchodroff$ conda activate aligner
(aligner) Eleanors-iPro:MFA eleanorchodroff$
And after the dollar sign, there's a blinking cursor. Note that I just happen to be in the ~/Documents/MFA folder. You can also run the aligner without a problem from anywhere else on the computer, as long as you're in the aligner environment on the command line:
(aligner) Eleanors-iPro:Desktop eleanorchodroff$
mfa version
Does yours say 2.0.0 (or some variation on 2.0.0 like 2.0.0b3)? You can find the installation instructions here. If you never had to activate the conda environment, then follow the first set of instructions for “All platforms”. If you have an older version that did require conda activation but it is not 2.0.0+, then you should follow the second set of instructions for “Upgrading from non-conda version” (where non-conda here, I believe, refers to the Conda Forge distribution and not just the conda environment).
Run the following code to see almost the full range of functionality within the MFA suite. -h
is a flag to the command mfa
that stands for “help”. Important tangent: commands can take arguments and flags. Arguments are required by the command; flags are optional.
Each listed line is a separate command that you can use. We’ll be covering align, model (model inspect, model download, model save), train, and validate.
mfa -h
If you ever need a reminder about how a specific command works, you can add -h
after it. This will give you a detailed overview of the required arguments, their order, and any optional flags you can specify. For example:
mfa align -h
If you just want a basic overview of the arguments, then you can just type the command with no arguments following it:
mfa align
Some overarching instructions, just for today, in the hopes of minimizing path errors: keep the downloaded tutorial materials in the Documents/MFA folder, and put the input and output folders on the Desktop. Fortunately, the folders Desktop and Documents have the same names across Mac and Windows systems. We may have to update these instructions, or you may have to substitute Desktop with some other folder if you have trouble writing new files there.
This example assumes you are using a pre-existing large lexicon and pretrained acoustic models. With reference to our ingredients:

- wav files: prepped (same name as the TextGrids)
- TextGrids: prepped (same name as the wav files)

Any pretrained models you download will be sent to the ~/Documents/MFA/pretrained_models/ folder. Download pretrained acoustic models, dictionaries, or g2p models using the mfa model download option:
mfa model download acoustic english
mfa model download dictionary english
We can even inspect these models to figure out things like what the assumed phone set is, using the mfa model inspect function. Following mfa model {download|inspect}, you always specify either acoustic, dictionary, or g2p (for a G2P model: see below), followed by the name of the model.
mfa model inspect acoustic english
mfa model inspect dictionary english
- Create the input and output folders. For this tutorial, let's keep it simple and put these on the Desktop.
- Place the TextGrids and wav files in the input folder.
- Run the Montreal Forced Aligner with the align command, but make sure to update the arguments!
conda activate aligner
mfa align corpus_directory dictionary acoustic_model output_directory
If you forget what the arguments are, just type the name of the command:
mfa align
The align
command takes 4 arguments:
1. where are the wav files and TextGrids? (path to input folder)
2. where is the dictionary? (path necessary)
3. where are the acoustic models? (path and .zip are not necessary)
4. where should the output go? (path to output folder)
Explicit example (make sure to remove backslashes):
mfa align --clean ~/Desktop/input ~/Documents/MFA/pretrained_models/dictionary/english.dict english ~/Desktop/output
For non-Mac users, note that the tilde, ~, refers to your home directory. In my case, it's shorthand for /Users/eleanorchodroff/. On a Windows computer, you might need to specify the full path starting from, e.g., C:\.
In the above example, I wrote out full paths for every argument, just to be safe. That is, I started from the root of the computer, and worked my way down to the exact file location. The only exception to this is the “path” to the acoustic model. For whatever reason, the acoustic models must always be in the pretrained_models/acoustic directory, and even though you will physically see them with a .zip
extension, you simply refer to them by their shorthand name in the call to the aligner.
I also used an optional “flag”: --clean. This does not need to be present, but I do prefer to use this. If I run the aligner multiple times on an input folder with the same name, it won't overwrite the model/data, unless I clean out the old model first.
Note that there are several other optional flags you can specify. Use the mfa align -h
command to see more.
Note that one of the harder parts of using forced alignment is prepping the TextGrids.
The most straightforward (but potentially riskiest) implementation of the aligner with TextGrid
input is to paste the transcript into a TextGrid
with a single interval tier. If you would like the aligner to do some speaker/channel adaptation for potentially improved performance, the name of the TextGrid
tier should reflect the speaker/channel ID. If you don’t care about that, you can name the tier whatever you want: it still works fine. You can also just paste the transcript into a .txt
or .lab
file and rename it so it matches the wav
file; however, the TextGrid
format has more overall flexibility, so we’re going to stick to that format.
If you have utterance-level timestamps, you can also add in utterance intervals to minimize the possibility of “derailment”. By “derail”, I mean that the aligner gets thrown off early on in the wav
file and never gets back on track. The result can be a pretty terrible “alignment”. By delimiting the temporal span of an utterance, the aligner has a chance to reset at the next utterance, even if the preceding utterance was completely misaligned. Side note: misalignments are more likely to occur if there’s additional noise in the wav
file (e.g., coughing, background noise) or if the speech and transcript don’t match at either the word or phone level (e.g., pronunciation of a word does not match the dictionary/lexicon entry).
If the transcript has start and end times for each utterance (3-column text file with start time, end time, text): create_textgrid_mfa_timestamps.praat
If the transcript has start and end times for each utterance (2-column text file with start time, text): create_textgrid_mfa_timestamps2.praat
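For reference, here is a minimal sketch of a single-tier TextGrid with two utterance intervals in Praat's long text format (the times, tier name, and utterance text are made up; the tier name would be the speaker ID if you want speaker adaptation):

File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 10.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "spkr01"
        xmin = 0
        xmax = 10.5
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 4.2
            text = "this is the first utterance"
        intervals [2]:
            xmin = 4.2
            xmax = 10.5
            text = "and this is the second utterance"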
Case study: Nonsense words with “illegal” consonant clusters in English. Did the participant produce a schwa or not?
These are all nonsense words, so let’s just create a new dictionary that will be compatible with our English acoustic models: those models use ARPABET with stress marked. If you need a reminder, you can run the mfa model inspect acoustic english
command, or just go to the pretrained_models/dictionary
directory and look inside english.dict
.
I've created a lexicon with two entries for each word: one with the intact consonant cluster and one with schwa epenthesis. Note that without any further specification, the model will assume equal probability over all entries for a given word. We'll let the model decide, based on the acoustics and some modeling assumptions (e.g., that each phone is at least 30 ms long), whether or not the participant produced a schwa.
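As a hypothetical illustration (these are not the actual contents of CC_list.txt), the two entries for a nonsense word like bnick could look like this, with the intact cluster on one line and the schwa-epenthesized variant on the other:

bnick	B N IH1 K
bnick	B AH0 N IH1 K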
mfa align --clean ~/Desktop/input/ ~/Desktop/MFATutorial2021/ex4_english_modify1/CC_list.txt english ~/Desktop/output/
We've got real words of English, but did the speaker produce ng or n? This is a toy example with the words “walking” and “going”. Let's modify english.dict to allow for both possibilities, “walking”/“going” and “walkin’”/“goin’”, without making any commitment to what the speaker actually said. We'll let the model decide based on the acoustics. (Note that without any further specification, the model will assume equal probability over all entries for a given word.)
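Concretely, the modified entries might look something like this, with ARPABET transcriptions along the lines of the standard english.dict entries (the exact symbols there may differ):

walking	W AO1 K IH0 NG
walking	W AO1 K IH0 N
going	G OW1 IH0 NG
going	G OW1 IH0 N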
mfa validate --ignore_acoustics ~/Desktop/input ~/Documents/MFA/pretrained_models/dictionary/english.dict
mfa align --clean ~/Desktop/input ~/Desktop/nonwords.txt english ~/Desktop/output
See also Yuan & Liberman (2008), Yuan & Liberman (2011), Bailey (2016) for similar approaches.
Spanish! We’ve got some data from the ALLSSTAR corpus; can we get a reasonable alignment?
- wav files: prepped (same name as the TextGrids)
- TextGrids: prepped (same name as the wav files)

Some glaring issues with the lexicon already: does anyone spot the discrepancy?
Now for an intentional mistake:
mfa model download acoustic spanish
mfa align ~/Desktop/input ~/Desktop/MFATutorial2021/Spanish/talnupf_spanish.txt spanish ~/Desktop/output
We can’t use the pretrained Spanish acoustic model because it assumes a different phone set from the one we have in our pronunciation dictionary! Let’s train our own acoustic model:
mfa train --clean ~/Desktop/input ~/Desktop/MFATutorial2021/Spanish/talnupf_spanish.txt ~/Desktop/output
But look at the output… yikes… this is about 10 minutes of speech. In my brief experience, things start to look a little more reasonable/stable with maybe ~45 minutes, but this is a guess and fully open to investigation.
We'll be aligning Guarani with an acoustic model trained on about 50 minutes of speech from 32 speakers. The data subset comes from the Mozilla Common Voice Corpus. Training was done beforehand because it still took around 1 hour. After training, I could save the output of the most “trained-up” model, sat2_ali (but check out the structure of your ~/Documents/MFA/ folder to see everything the training procedure generates).
More generally, you should probably just poke around the documents generated by the training procedure in ~/Documents/MFA/.
For the record, you can save a trained-up acoustic model using:
mfa model save acoustic ~/Documents/MFA/prep_validated_guarani/sat2_ali/acoustic_model.zip --name guarani_cv
Where do we start?
- Place the acoustic model (guarani_cv.zip) in the pretrained_models/acoustic directory
- Note the path to the pronunciation dictionary guarani_lexicon.txt
- Create the input/output folders, populate the input folder with the wav files and TextGrids, and make sure the output folder is clean
- Run the aligner:
mfa align --clean ~/Desktop/input ~/Desktop/MFATutorial2021/ex6_guarani/guarani_lexicon.txt guarani_cv ~/Desktop/output
Some useful command-line tips:

- pwd (mac/linux) or CD (windows, with no argument) to see the path of your current working directory. pwd = “print working directory”
- cd .. (mac/linux) or CD .. (windows) to move up a folder in the tree
- ls (mac/linux) or dir (windows) to list all files in the directory. You can also use ls *.TextGrid to see just the TextGrids
- ctrl-c to cancel any command in the shell and stop all processing. That is, hold down the control key and press the c key.
- tab key to autocomplete a filename or directory name in the shell

Let's say you're working with some Romanian data and you technically have all the ingredients you need to train an acoustic model and align the files. That is, for every word in your corpus, you have a phonetic pronunciation, and you can use that small-ish Romanian pronunciation dictionary to then train new acoustic models and align the wav files.
Now, you’ve acquired some new data and don’t have the pronunciations for those words. Using your small-ish Romanian pronunciation dictionary, you can train a new G2P model so you can generate reasonable pronunciations for new Romanian words!
To do this, you use the mfa train_g2p
function:
mfa train_g2p --clean ~/Desktop/romanian/romanian_lexicon.txt romanian_cv ~/Desktop/romanian/romanian_g2p
You’ll then need to follow that up by generating the pronunciations for your new Romanian words with:
mfa g2p --clean ~/Documents/MFA/pretrained_models/g2p/romanian_cv.zip ~/Desktop/romanian/TextGrids_with_new_words/ ~/Desktop/romanian/new_romanian_lexicon.txt
You can list multiple speakers in the same TextGrid
on separate tiers, and the alignment will be done separately for each, but still be returned to you in a single TextGrid.
If you have a stereo recording, channel 1 is assumed to be the first listed speaker in the TextGrid; channel 2 is the second listed speaker in the TextGrid. I am not positive, but I would assume that if you have only one tier, then the MFA will only use channel 1 and not a mix of the two channels. Worth checking, but you might need to do that mixing separately.
If you list multiple pronunciation variants for a word in the dictionary, the model typically assumes that each variant is equally probable. If you have 2 variants listed, then it assumes a probability of 0.5 over each. If you have 4 variants, then a probability of 0.25, etc. It then determines which variant is more likely given the observed acoustic signal.
You can, however, have the model estimate the observed pronunciation probabilities in the dataset and then use those as prior probabilities for potentially improved recognition.
You can use the mfa train_dictionary command to determine the observed rate of each pronunciation variant. You can then use that updated pronunciation lexicon (note that it has an extra column for the probability) for potentially improved alignment and training. Be careful with this if your research question is actually about determining the rate of pronunciation variation: you might want to approach the dataset in an unbiased way.
mfa train_dictionary --clean ~/Desktop/romanian/prep_validated/ ~/Desktop/romanian/romanian_lexicon.txt romanian_cv ~/Desktop/romanian/dictionary/
Huge thanks to the MFA team for creating such an amazing aligner, and a special thanks to Michael McAuliffe for being so responsive and helpful throughout.
I also want to thank Emily Ahn (University of Washington) for helping me work out the functionality of MFA 2.0. Finally, I’d like to thank the audiences at Northwestern University (2018), Newcastle University (2019), University of York (2020, 2021), and UT Austin (2021) for working with me on previous “drafts” of this tutorial for MFA 1.0 and 1.1.