3 Montreal Forced Aligner
3.1 Overview
The Montreal Forced Aligner is a forced alignment system with acoustic models built using the Kaldi ASR toolkit. A major highlight of this system is the availability of pretrained acoustic models and grapheme-to-phoneme models for a wide variety of languages, as well as the ability to train acoustic and grapheme-to-phoneme models on any new dataset you might have. It also uses advanced Kaldi techniques for training and aligning speech data, with a full suite of training and speaker adaptation algorithms. The basic acoustic model recipe uses the traditional GMM-HMM framework, starting with monophone models, then triphone models (which allow for context sensitivity, read: coarticulation), with some transformations and speaker adaptation along the way. You can check out more regarding the recipes in the MFA user guide and reference guide; for an overview of a standard training recipe, check out this Kaldi tutorial.
As a forced alignment system, the Montreal Forced Aligner will time-align a transcript to a corresponding audio file at the phone and word levels, provided there exists a set of pretrained acoustic models and a pronunciation dictionary (a.k.a. lexicon) listing the words in the transcript with their canonical phonetic pronunciation(s).
The current MFA download contains a suite of `mfa` commands that allow you to do everything from basic forced alignment to grapheme-to-phoneme conversion to automatic speech recognition. In this tutorial, we’ll be focusing mostly on those commands relating to forced alignment and a side of grapheme-to-phoneme conversion.
Very generally, the procedure for forced alignment is as follows:
- Prep audio file(s)
- Prep transcript(s) (Praat `TextGrid`s or `.lab`/`.txt` files)
- Obtain a pronunciation dictionary
- Obtain an acoustic model
- Create an input folder that contains the audio files and transcripts
- Create an empty output folder
- Run the `mfa align` command
Useful links:
- MFA installation page
- MFA Github page for posting issues and finding solutions
- Acoustic model master list
- Pronunciation dictionary master list
- G2P model master list
- Epitran G2P
- XPF Corpus G2P
- Blog post by Michael McAuliffe about necessary training data
- VoxCommunis (Common Voice) acoustic models and dictionaries
3.2 Installation
You can find the streamlined installation instructions on the main MFA installation page. The majority of users will be following the “All platforms” instructions.
As listed there, you will need to download Miniconda or Anaconda. If you’re trying to decide between the two, I’d probably recommend Miniconda: compared to Anaconda, it is smaller and has a slightly higher success rate for installation. Conda is a package and environment management system. If you’re familiar with R, the package management system is similar in concept to the R CRAN server. Once installed, you can import sets of packages via the `conda` commands, similar to how you might run `install.packages` in R. The environment manager aspect of conda is fairly unique, and essentially creates a small environment on your computer for the collection of imported packages (here the MFA suite) to run in. The primary advantage is that you don’t have to worry about conflicting versions of the same code package on your computer; the environment will be a bubble on your computer with the correct package versions for the MFA suite.
Once you’ve installed Miniconda/Anaconda, you’ll run the following line of code in the command line. For Mac users, the command line will be the terminal or a downloaded equivalent. For Windows users, I believe you will run this directly in the Miniconda console.
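conda create -n aligner -c conda-forge montreal-forced-aligner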
The `-n` flag here refers to the name of the environment. If you want to test a new version of the aligner, but aren’t ready to overwrite your old working version of the aligner, you can create a new environment using a different name after the `-n` flag. See 3.7 for more on conda environments.
If it’s worked, you should see something like:
# To activate this environment, use
#
# $ conda activate aligner
#
# To deactivate an active environment, use
#
# $ conda deactivate
And with that, the final step is activating the aligner. You will need to re-activate the aligner every time you open a new shell/command line window. You should be able to see which environment you have open in the parentheses before each shell prompt. For example, mine looks like:
(base) Eleanors-iPro:Documents eleanorchodroff$ conda activate aligner
(aligner) Eleanors-iPro:Documents eleanorchodroff$
You can run the aligner from any location on the computer as long as you’re in the aligner environment on the command line.
NB: The Montreal Forced Aligner is now located in the `Documents/MFA` folder. All acoustic models, dictionaries, temporary alignment or training files, etc. will be stored in this folder. If you are ever confused or curious about what is happening with the aligner, just poke around this folder.
3.3 Running the aligner
To run the aligner, we’ll need to follow the procedure listed above. In this section, I will give a quick example of how this works using a hypothetical example of aligning American English. In the following sections, I’ll go into more detail about each of these steps.
For the first steps of prepping the audio files and prepping the transcripts, I’ll assume we’re working with a `wav` file and a Praat `TextGrid` that has a single interval tier with utterances transcribed in English. Each interval in the `TextGrid` corresponds to an utterance in the `wav` file. A good rule of thumb is to keep utterances under 30 seconds; accuracy can improve with smaller windows, but you can still get lucky in getting the job done even with bigger windows. The `wav` file and `TextGrid` must have the same name.
After we have the audio files and transcript `TextGrid`s, we’ll place them in an input folder. For the sake of this tutorial, this input folder will be called `Documents/input_english/`. We will also need to create an empty output folder, which I will call `Documents/output_english/`.
We can then obtain the acoustic model from the internet using the following command. (You can also use `mfa models download acoustic english_us_arpa`.)
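mfa model download acoustic english_us_arpa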
And we can obtain the dictionary from the internet using this command. (You can also use `mfa models download dictionary english_us_arpa`.)
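mfa model download dictionary english_us_arpa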
The dictionary and acoustic model can now be found in their respective `Documents/MFA/pretrained_models/` folders.
Once we have the dictionary, the acoustic model, the `wav` files and `TextGrid`s in their input folder, and an empty output folder, we’re ready to run the aligner!
The alignment command is called `align` and takes 4 arguments:
- path to the input folder
- name of the dictionary (in the `Documents/MFA/pretrained_models/dictionary/` folder) or path to the dictionary if it’s elsewhere on the computer
- name of the acoustic model (in the `Documents/MFA/pretrained_models/acoustic/` folder) or path to the acoustic model if it’s elsewhere on the computer
- path to the output folder
I always add the optional `--clean` flag as this “cleans” out any old temporary files created in a previous run. (If you’re anything like me, you might find yourself re-running the aligner a few times, and with the same input filename. The aligner won’t actually re-run properly unless you clear out the old files.)
Important: You will need to scroll across to see the whole line of code.
mfa align --clean /Users/Eleanor/Documents/input_english/ english_us_arpa english_us_arpa /Users/Eleanor/Documents/output_english/
And if everything worked appropriately, your aligned `TextGrid`s should now be in the `Documents/output_english/` folder!
3.4 File preparation
3.4.1 Audio files
The Montreal Forced Aligner is incredibly robust to audio files of differing formats, sampling rates, and channel counts. You should not have to do much prep, but note that whatever you feed the system will be converted to a `wav` file with a sampling rate of 16 kHz and a single (mono) channel unless otherwise specified (see Feature Configuration). For the record, I have not yet tried any file format other than `wav`, so I’m not yet aware of potential issues that might arise there.
3.4.2 Transcripts
The MFA can take as input either a Praat `TextGrid` or a `.lab` or `.txt` file. I have worked most extensively with the `TextGrid` input, so I’ll describe those details here. As for `.lab` and `.txt` input, I believe this method only works when the transcript is pasted in as a single line. In other words, I don’t think it can handle time stamps for utterance start and end times.
As described above, the Praat `TextGrid` should contain a single tier with intervals that are preferably on the shorter side. I’ve noticed that the biggest gains you can make in alignment accuracy tend to come from simply shortening the input window. Within the `TextGrid`, each interval should correspond to an utterance in the `wav` file, and all the words in the utterance should ideally have an entry in the pronunciation dictionary. You can (but do not have to) specify the speaker ID using the tier name, which then allows the aligner to perform speaker adaptation in training or alignment. A few more options for indicating speaker/channel IDs are provided below in 3.4.3.
3.4.3 Filenames
The filename of the `wav` file and its corresponding transcript must be identical except for the extension (`.wav` or `.TextGrid`). If you have multiple speakers/recording channels in the alignment or training, you can either specify this ID in the tier name of the `TextGrid`, or add an optional argument to the `mfa align` command to use the first n characters of the filename.
If you go forward with this, it helps to have the speaker ID as the prefix of the filename. For example:
spkr01_utt1.wav, spkr01_utt1.TextGrid
spkr01_utt2.wav, spkr01_utt2.TextGrid
spkr02_utt1.wav, spkr02_utt1.TextGrid
spkr02_utt2.wav, spkr02_utt2.TextGrid
spkr03_utt1.wav, spkr03_utt1.TextGrid
spkr03_utt2.wav, spkr03_utt2.TextGrid
spkr04_utt1.wav, spkr04_utt1.TextGrid
spkr04_utt2.wav, spkr04_utt2.TextGrid
In this case, we can then tell the aligner to use the first 6 characters as the speaker information. The `-s` flag stands for speaker characters.
mfa align -s 6 --clean /Users/Eleanor/input_english/ english_us_arpa english_us_arpa /Users/Eleanor/output_english/
Alternatively, you can use the long form of the flag, `--speaker_characters`:
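mfa align --speaker_characters 6 --clean /Users/Eleanor/input_english/ english_us_arpa english_us_arpa /Users/Eleanor/output_english/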
3.4.4 Input and output folders
I would recommend creating a special input folder that houses a copy of your audio files and `TextGrid` transcripts. In case something goes wrong, you won’t be messing up the raw data. You can place this folder basically anywhere on your computer, and you can call it whatever you want.
You will also need to create an empty output folder. I recommend making sure this is empty each time you run the aligner as the aligner does not overwrite any existing files. You can place this folder basically anywhere on your computer, and you can call this whatever you want; however, you may not re-use the input folder as the output folder.
3.5 Obtaining acoustic models
3.5.1 Download an acoustic model
Pretrained acoustic models for several languages can be downloaded directly using the command line interface. This is the master list of acoustic models available for download.
3.5.2 Train an acoustic model
You can also train an acoustic model yourself directly on the data you’re working on. You do need a fair amount of data to get reasonable alignments out. I would probably recommend a bare minimum of 30 minutes of actual speech, and even then, only if you’re lucky will this produce something reasonable. In my experience, more stability seems to come at around 2 hours of actual speech at a minimum. This is just a heuristic, and the more you have, the better. For more research on this, and a recommendation for even more data, see this blog post.
The `mfa train` command takes 3 arguments:
- path to the input folder with the audio files and utterance-level transcripts (`TextGrid`s)
- name of the pronunciation dictionary in the `pretrained_models` folder or path to the pronunciation dictionary
- path to the output acoustic model (zip file)
mfa train --clean ~/Documents/input_spanish ~/Documents/talnupf_spanish.txt ~/Documents/spanish_model.zip
Once again, I’m using the `--clean` flag just in case I need to clean out an old folder in `Documents/MFA`.
Once you obtain the acoustic model, you can use it normally to align the files. (It seems that the MFA is generating alignments during the acoustic model training, but in the version I currently have, these are not being saved. Easy solution: just run the `align` command.)
mfa align --clean ~/Documents/input_spanish ~/Documents/talnupf_spanish.txt ~/Documents/spanish_model.zip ~/Documents/output_spanish
Or if you place the dictionary and acoustic model in their respective `pretrained_models` folders, you can refer to them by name (assuming here they were saved as `talnupf_spanish` and `spanish_model`):
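mfa align --clean ~/Documents/input_spanish talnupf_spanish spanish_model ~/Documents/output_spanish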
3.5.3 Adapt an existing acoustic model
If an acoustic model already exists for your language, it might be worth simply using it directly on your data, or you could try adapting the model to your data. Adapting an acoustic model requires that you create a dictionary for the new data with the same phone set as the existing model. The `adapt` command takes 4 arguments:
- path to the corpus
- path to the new dictionary
- path to the existing acoustic model
- path to the output model
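For example, a sketch with hypothetical paths (only `english_us_arpa` is a real model name here):

mfa adapt --clean ~/Documents/input_corpus ~/Documents/my_new_dictionary.txt english_us_arpa ~/Documents/english_adapted_model.zip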
3.6 Obtaining dictionaries
The pronunciation dictionary must be a two-column text file with a list of words on the left-hand side and the phonetic pronunciation(s) on the right-hand side. Each word should be separated from its phonetic pronunciation by a tab, and each phone in the phonetic pronunciation should be separated by a space. Many-to-many mappings between words and pronunciations are permitted. In fact, you can even add pronunciation probabilities to the dictionary, but I have not yet tried this!
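For example, a few hypothetical entries from an English ARPABET-style dictionary, with a tab between word and pronunciation (note that their/there share a pronunciation and read has two):

their	DH EH1 R
there	DH EH1 R
read	R IY1 D
read	R EH1 D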
One important point: the phone set in your dictionary must match that used in the acoustic models and the orthography must match that in the transcripts.
There are a few options for obtaining a pronunciation dictionary:
3.6.1 Download a dictionary
This is the master list of pre-existing pronunciation dictionaries available through the MFA. Click on the dictionary of interest, and if you scroll to the bottom of the page, it will tell you the name to type in the `mfa model download dictionary` command.
NB: you must manually add any words in your corpus that are missing from the dictionary, or train a G2P model to handle these cases.
3.6.2 Generate a dictionary using a G2P model
Grapheme-to-phoneme (G2P) models automatically convert the orthographic words in your corpus to the most likely phonetic pronunciation. How exactly it does this depends a lot on the type of model and its training data.
The MFA has a handful of pretrained G2P models that you can download. This is the master list of G2P models available for download.
You’ll then use the `mfa g2p` command to generate the phonetic transcriptions from the submitted orthographic forms and the G2P model. The `mfa g2p` command takes 3 arguments:
- name of the G2P model (in `Documents/MFA/pretrained_models/g2p/`)
- path to the `TextGrid`s or transcripts in your corpus
- path to where the new pronunciation dictionary should go
mfa g2p --clean bulgarian_mfa ~/Desktop/bulgarian/TextGrids_with_new_words/ ~/Desktop/new_bulgarian_dictionary.txt
You can also use an external resource like Epitran or XPF to generate a dictionary for you. These are both rule-based G2P systems built by linguists; these systems work with varying degrees of success.
In all cases, you’re best off checking the output. Remember that the phone set in the dictionary must entirely match the phone set in the acoustic model.
3.6.3 Train a G2P model
You can also train a G2P model on an existing pronunciation dictionary. Once you’ve trained the G2P model, you’ll need to jump back up to the generate dictionary instructions just above. This might be useful in cases when you have many unlisted orthographic forms that are still in need of a phonetic transcription.
The `mfa train_g2p` command has 2 arguments:
- path to the training data (a pronunciation lexicon)
- path to where the trained G2P model should go
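For example, with hypothetical paths for the training lexicon and output model:

mfa train_g2p ~/Documents/my_dictionary.txt ~/Documents/my_g2p_model.zip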
3.6.4 Create the dictionary by hand
This one is self-explanatory, but once again, make sure to use the same phone set as the acoustic models. See below regarding the `validate` command to cross-check the dictionary phones against the acoustic model phones. This will ensure you don’t have a typo in your dictionary somewhere, where you’ve included a phone that doesn’t actually exist in the acoustic model set.
3.7 Tips and tricks
3.7.1 Validate a corpus
You can check for problems with your corpus using the `validate` command. Such problems might include words in your `TextGrid`s that are not listed in your dictionary, phones in your dictionary that do not have corresponding acoustic models, as well as general problems with your sound files (like a missing waveform – this has happened to me when an audio file got corrupted during download). Note that checking for audio file problems takes a lot longer, so I frequently add the optional `--ignore_acoustics` flag.
The three arguments required by the validate command are:
- path to the corpus
- name of the dictionary
- name of the acoustic model
If you don’t yet have an acoustic model (i.e., you’re running the validate command on the data you’re using to train an acoustic model), just put in any acoustic model as a placeholder and ignore any corresponding warnings about phones in your dictionary that are not in the acoustic model.
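For example, reusing the hypothetical English setup from 3.3:

mfa validate --ignore_acoustics ~/Documents/input_english/ english_us_arpa english_us_arpa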
3.7.2 Inspect a model or dictionary
- You can inspect the details of any local acoustic model or dictionary using the `mfa model inspect` commands. You can get information about how to reference it, the phone set, whether it performs speaker adaptation, among other things. I particularly like using this to get the phone set of the model:
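mfa model inspect acoustic english_us_arpa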
3.7.3 List models and dictionaries
- You can get a list of the local acoustic models and dictionaries in your `Documents/MFA/pretrained_models/` folders using the `mfa model list` commands:
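mfa model list acoustic
mfa model list dictionary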
3.7.4 Get help (with the argument structure!)
- Add `-h` after any command to get a help screen that will also display its argument structure. For example:
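mfa align -h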
3.7.5 Handle conda environments
If you have created multiple conda environments, you can list all of these using the following command:
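conda env list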
You can delete an environment with the following command, where ENV_NAME is the name of the environment, like `aligner`. Make sure that you have deactivated the environment before deleting it.
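conda deactivate
conda env remove -n ENV_NAME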
3.7.6 Extra
For a while, all phones were represented by a minimum of 3 HMM states, each of which has a duration of 10 ms. This meant that phones had a minimum duration of 30 ms. In MFA, this has since been updated so that phones can have varying numbers of HMM states, including just 1 state. The minimum duration is thus now 10 ms. See here for more.
The duration granularity will always be 10 ms. (Without further hand alignment, you can’t draw conclusions about any difference in duration less than this value.)
- Use the `--clean` flag each time you run the `align` command
- Don’t forget to activate the aligner
- Make sure the output folder is empty each time you run the `align` command
- The input and output folders must be different folders
- Many users have trouble reading/writing files to the Desktop folder. If you’re having issues using the Desktop, just switch to a different folder like Documents, or Google how to change the read/write permissions on your Desktop folder
And finally, there is even more to the MFA than just what I’ve put here! It’s also frequently updated and enhanced. Definitely poke around the MFA user guide and Github repo to learn more and find special functions that might make your workflow even easier.