3 Montreal Forced Aligner
3.1 Overview
The Montreal Forced Aligner is a forced alignment system with acoustic models built using the Kaldi ASR toolkit. A major highlight of this system is the availability of pretrained acoustic models and grapheme-to-phoneme models for a wide variety of languages, as well as the ability to train acoustic and grapheme-to-phoneme models on any new dataset you might have. It also uses advanced Kaldi techniques for training and aligning speech data, with a full suite of training and speaker adaptation algorithms. The basic acoustic model recipe uses the traditional GMM-HMM framework, starting with monophone models, then triphone models (which allow for context sensitivity, read: coarticulation), with some transformations and speaker adaptation along the way. You can check out more regarding the recipes in the MFA user guide and reference guide; for an overview of a standard training recipe, check out this Kaldi tutorial.
As a forced alignment system, the Montreal Forced Aligner will time-align a transcript to a corresponding audio file at the phone and word levels, provided there exists a set of pretrained acoustic models and a pronunciation dictionary (a.k.a. lexicon) listing the words in the transcript with their canonical phonetic pronunciation(s).
The current MFA download contains a suite of `mfa` commands that allow you to do everything from basic forced alignment to grapheme-to-phoneme conversion to automatic speech recognition. In this tutorial, we’ll be focusing mostly on those commands relating to forced alignment and a side of grapheme-to-phoneme conversion.
Very generally, the procedure for forced alignment is as follows:
- Prep audio file(s)
- Prep transcript(s) (Praat `TextGrid`s or `.lab`/`.txt` files)
- Obtain a pronunciation dictionary
- Obtain an acoustic model
- Create an input folder that contains the audio files and transcripts
- Create an empty output folder
- Run the `mfa align` command
Useful links:
- MFA installation page
- MFA Github page for posting issues and finding solutions
- Acoustic model master list
- Pronunciation dictionary master list
- G2P model master list
- Epitran G2P
- XPF Corpus G2P
- Blog post by Michael McAuliffe about necessary training data
- VoxCommunis (Common Voice) acoustic models and dictionaries
3.2 Installation
You can find the streamlined installation instructions on the main MFA installation page. The majority of users will be following the “All platforms” instructions.
As listed there, you will need to download Miniconda or Anaconda. If you’re trying to decide between the two, I’d probably recommend Miniconda: compared to Anaconda, it is smaller and has a slightly higher success rate for installation. Conda is a package and environment management system. If you’re familiar with R, the package management system is similar in concept to the R CRAN server. Once installed, you can import sets of packages via the `conda` commands, similar to how you might run `install.packages` in R. The environment manager aspect of conda is fairly unique, and essentially creates a small environment on your computer for the collection of imported packages (here the MFA suite) to run in. The primary advantage is that you don’t have to worry about conflicting versions of the same code package on your computer; the environment will be a bubble on your computer with the correct package versions for the MFA suite.
Once you’ve installed Miniconda/Anaconda, you’ll run the following line of code in the command line. For Mac users, the command line will be the terminal or a downloaded equivalent. For Windows users, I believe you will run this directly in the Miniconda console.
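conda create -n aligner -c conda-forge montreal-forced-aligner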
The `-n` flag here refers to the name of the environment. If you want to test a new version of the aligner, but aren’t ready to overwrite your old working version of the aligner, you can create a new environment using a different name after the `-n` flag. See 3.7 for more on conda environments.
If it’s worked, you should see something like:
# To activate this environment, use
#
# $ conda activate aligner
#
# To deactivate an active environment, use
#
# $ conda deactivate
And with that, the final step is activating the aligner. You will need to re-activate the aligner every time you open a new shell/command line window. You should be able to see which environment you have open in the parentheses before each shell prompt. For example, mine looks like:
(base) Eleanors-iPro:Documents eleanorchodroff$ conda activate aligner
(aligner) Eleanors-iPro:Documents eleanorchodroff$
You can run the aligner from any location on the computer as long as you’re in the aligner environment on the command line.
NB: The Montreal Forced Aligner is now located in the `Documents/MFA` folder. All acoustic models, dictionaries, temporary alignment or training files, etc. will be stored in this folder. If you are ever confused or curious about what is happening with the aligner, just poke around this folder.
3.3 Running the aligner
To run the aligner, we’ll need to follow the procedure listed above. In this section, I will give a quick example of how this works using a hypothetical example of aligning American English. In the following sections, I’ll go into more detail about each of these steps.
For the first steps of prepping the audio files and prepping the transcripts, I’ll assume we’re working with a `wav` file and a Praat `TextGrid` that has a single interval tier with utterances transcribed in English. Each interval in the `TextGrid` corresponds to an utterance in the `wav` file. A good rule of thumb is to keep utterances under 30 seconds; accuracy can improve with smaller windows, but you can still get lucky in getting the job done even with bigger windows. The `wav` file and `TextGrid` must have the same name.
After we have the audio files and transcript `TextGrid`s, we’ll place them in an input folder. For the sake of this tutorial, this input folder will be called `Documents/input_english/`. We will also need to create an empty output folder, which I will call `Documents/output_english/`.
We can then obtain the acoustic model from the internet using the following command. (You can also use `mfa models download acoustic english_us_arpa`.)
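mfa model download acoustic english_us_arpa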
And we can obtain the dictionary from the internet using this command. (You can also use `mfa models download dictionary english_us_arpa`.)
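mfa model download dictionary english_us_arpa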
The dictionary and acoustic model can now be found in their respective `Documents/MFA/pretrained_models/` folders.
Once we have the dictionary, the acoustic model, the `wav` files and `TextGrid`s in their input folder, and an empty output folder, we’re ready to run the aligner!
The alignment command is called `align` and takes 4 arguments:
- path to the input folder
- name of the dictionary (in the `Documents/MFA/pretrained_models/dictionary/` folder) or path to the dictionary if it’s elsewhere on the computer
- name of the acoustic model (in the `Documents/MFA/pretrained_models/acoustic/` folder) or path to the acoustic model if it’s elsewhere on the computer
- path to the output folder
I always add the optional `--clean` flag as this “cleans” out any old temporary files created in a previous run. (If you’re anything like me, you might find yourself re-running the aligner a few times, and with the same input filename. The aligner won’t actually re-run properly unless you clear out the old files.)
Important: You will need to scroll across to see the whole line of code.
mfa align --clean /Users/Eleanor/Documents/input_english/ english_us_arpa english_us_arpa /Users/Eleanor/Documents/output_english/
And if everything worked appropriately, your aligned `TextGrid`s should now be in the `Documents/output_english/` folder!
3.4 File preparation
3.4.1 Audio files
The Montreal Forced Aligner is incredibly robust to audio files of differing formats, sampling rates, and channel counts. You should not have to do much prep, but note that whatever you feed the system will be converted to a `wav` file with a sampling rate of 16 kHz and a single (mono) channel unless otherwise specified (see Feature Configuration). For the record, I have not yet tried any file format other than `wav`, so I’m not yet aware of potential issues that might arise there.
3.4.2 Transcripts
The MFA can take as input either a Praat `TextGrid` or a `.lab` or `.txt` file. I have worked most extensively with the `TextGrid` input, so I’ll describe those details here. As for `.lab` and `.txt` input, I believe this method only works when the transcript is pasted in as a single line. In other words, I don’t think it can handle time stamps for utterance start and end times.
As described above, the Praat `TextGrid` should contain a single tier with intervals that are preferably on the shorter side. I’ve noticed that the biggest gains you can make in alignment accuracy tend to come from simply shortening the input window. Within the `TextGrid`, each interval should correspond to an utterance in the `wav` file, and all the words in the utterance should ideally have an entry in the pronunciation dictionary. You can (but do not have to) specify the speaker ID using the tier name, which then allows the aligner to perform speaker adaptation in training or alignment. A few more options for indicating speaker/channel IDs are provided below in 3.4.3.
3.4.3 Filenames
The filename of the `wav` file and its corresponding transcript must be identical except for the extension (`.wav` or `.TextGrid`). If you have multiple speakers/recording channels in the alignment or training, you can either specify this ID in the tier name of the `TextGrid`, or add an optional argument to the `mfa align` command to use the first n characters of the filename.
If you go forward with this, it helps to have the speaker ID as the prefix of the filename. For example:
spkr01_utt1.wav, spkr01_utt1.TextGrid
spkr01_utt2.wav, spkr01_utt2.TextGrid
spkr02_utt1.wav, spkr02_utt1.TextGrid
spkr02_utt2.wav, spkr02_utt2.TextGrid
spkr03_utt1.wav, spkr03_utt1.TextGrid
spkr03_utt2.wav, spkr03_utt2.TextGrid
spkr04_utt1.wav, spkr04_utt1.TextGrid
spkr04_utt2.wav, spkr04_utt2.TextGrid
In this case, we can then tell the aligner to use the first 6 characters as the speaker information. The `-s` flag stands for speaker characters.
mfa align -s 6 --clean /Users/Eleanor/input_english/ english_us_arpa english_us_arpa /Users/Eleanor/output_english/
Alternatively, you can use the long form of the flag, `--speaker_characters`:
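mfa align --speaker_characters 6 --clean /Users/Eleanor/input_english/ english_us_arpa english_us_arpa /Users/Eleanor/output_english/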
3.4.4 Input and output folders
I would recommend creating a special input folder that houses a copy of your audio files and `TextGrid` transcripts. In case something goes wrong, you won’t be messing up the raw data. You can place this folder basically anywhere on your computer, and you can call it whatever you want.
You will also need to create an empty output folder. I recommend making sure this is empty each time you run the aligner as the aligner does not overwrite any existing files. You can place this folder basically anywhere on your computer, and you can call this whatever you want; however, you may not re-use the input folder as the output folder.
3.5 Obtaining acoustic models
3.5.1 Download an acoustic model
Pretrained acoustic models for several languages can be downloaded directly using the command line interface. This is the master list of acoustic models available for download.
3.5.2 Train an acoustic model
You can also train an acoustic model yourself directly on the data you’re working on. You do need a fair amount of data to get reasonable alignments out. I would probably recommend a bare minimum of 30 minutes of actual speech, and even then, only if you’re lucky will this produce something reasonable. In my experience, more stability seems to come at around 2 hours of actual speech at a minimum. This is just a heuristic, and the more you have, the better. For more research on this, and a recommendation for even more data, see this blog post.
The `mfa train` command takes 3 arguments:
- path to the input folder with the audio files and utterance-level transcripts (`TextGrid`s)
- name of the pronunciation dictionary in the `pretrained_models` folder or path to the pronunciation dictionary
- path to the output acoustic model (zip file)
mfa train --clean ~/Documents/input_spanish ~/Documents/talnupf_spanish.txt ~/Documents/spanish_model.zip
Once again, I’m using the `--clean` flag just in case I need to clean out an old folder in `Documents/MFA`.
Once you obtain the acoustic model, you can use it normally to align the files. (It seems that the MFA is generating alignments during the acoustic model training, but in the version I currently have, these are not being saved. Easy solution: just run the `align` command.)
mfa align --clean ~/Documents/input_spanish ~/Documents/talnupf_spanish.txt ~/Documents/spanish_model.zip ~/Documents/output_spanish
Or if you place the dictionary and acoustic model in their respective `pretrained_models` folders, you can refer to them by name (assuming here they were saved as `talnupf_spanish` and `spanish_model`):
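mfa align --clean ~/Documents/input_spanish talnupf_spanish spanish_model ~/Documents/output_spanish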
3.5.3 Adapt an existing acoustic model
If an acoustic model already exists for your language, it might be worth simply using it directly on your data, or you could try adapting the model to your data. Adapting an acoustic model requires that you create a dictionary for the new data with the same phone set as the existing model. The `adapt` command takes 4 arguments:
- path to the corpus
- path to the new dictionary
- path to the existing acoustic model
- path to the output model
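For example, a sketch with hypothetical paths (only `english_us_arpa` is a real model name here):

mfa adapt --clean ~/Documents/input_corpus ~/Documents/my_new_dictionary.txt english_us_arpa ~/Documents/english_adapted_model.zip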
3.6 Obtaining dictionaries
The pronunciation dictionary must be a two-column text file with a list of words on the left-hand side and the phonetic pronunciation(s) on the right-hand side. Each word should be separated from its phonetic pronunciation by a tab, and each phone in the phonetic pronunciation should be separated by a space. Many-to-many mappings between words and pronunciations are permitted. In fact, you can even add pronunciation probabilities to the dictionary, but I have not yet tried this!
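For example, a few hypothetical entries from an English ARPABET-style dictionary, with a tab between word and pronunciation (note that their/there share a pronunciation and read has two):

their	DH EH1 R
there	DH EH1 R
read	R IY1 D
read	R EH1 D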
One important point: the phone set in your dictionary must match that used in the acoustic models and the orthography must match that in the transcripts.
There are a few options for obtaining a pronunciation dictionary:
3.6.1 Download a dictionary
This is the master list of pre-existing pronunciation dictionaries available through the MFA. Click on the dictionary of interest, and if you scroll to the bottom of the page, it will tell you the name to type in the `mfa model download dictionary` command.
NB: you must manually add any words in your corpus that are missing from the dictionary, or train a G2P model to handle these cases.
3.6.2 Generate a dictionary using a G2P model
Grapheme-to-phoneme (G2P) models automatically convert the orthographic words in your corpus to the most likely phonetic pronunciation. How exactly it does this depends a lot on the type of model and its training data.
The MFA has a handful of pretrained G2P models that you can download. This is the master list of G2P models available for download.
You’ll then use the `mfa g2p` command to generate the phonetic transcriptions from the submitted orthographic forms and the G2P model. The `mfa g2p` command takes 3 arguments:
- name of the G2P model (in `Documents/MFA/pretrained_models/g2p/`)
- path to the `TextGrid`s or transcripts in your corpus
- path to where the new pronunciation dictionary should go
mfa g2p --clean bulgarian_mfa ~/Desktop/bulgarian/TextGrids_with_new_words/ ~/Desktop/new_bulgarian_dictionary.txt
You can also use an external resource like Epitran or XPF to generate a dictionary for you. These are both rule-based G2P systems built by linguists; these systems work with varying degrees of success.
In all cases, you’re best off checking the output. Remember that the phone set in the dictionary must entirely match the phone set in the acoustic model.
3.6.3 Train a G2P model
You can also train a G2P model on an existing pronunciation dictionary. Once you’ve trained the G2P model, you’ll need to jump back up to the generate dictionary instructions just above. This might be useful in cases when you have many unlisted orthographic forms that are still in need of a phonetic transcription.
The `mfa train_g2p` command has 2 arguments:
- path to the training data (a pronunciation lexicon)
- path to where the trained G2P model should go
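For example, with hypothetical paths for the training lexicon and output model:

mfa train_g2p ~/Documents/my_dictionary.txt ~/Documents/my_g2p_model.zip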
3.6.4 Create the dictionary by hand
This one is self-explanatory, but once again, make sure to use the same phone set as the acoustic models. See below regarding the `validate` command to cross-check the dictionary phones against the acoustic model phones. This will ensure you don’t have a typo in your dictionary somewhere, where you’ve included a phone that doesn’t actually exist in the acoustic model set.
3.7 Tips and tricks
3.7.1 Validate a corpus
You can check for problems with your corpus using the `validate` command. Such problems might include words in your `TextGrid`s that are not listed in your dictionary, phones in your dictionary that do not have corresponding acoustic models, as well as general problems with your sound files (like a missing waveform – this has happened to me when an audio file got corrupted during download). Note that checking for audio file problems takes a lot longer, so I frequently add the optional `--ignore_acoustics` flag.
The three arguments required by the validate command are:
- path to the corpus
- name of the dictionary
- name of the acoustic model
If you don’t yet have an acoustic model (i.e., you’re running the validate command on the data you’re using to train an acoustic model), just put in any acoustic model as a placeholder and ignore any corresponding warnings about phones in your dictionary that are not in the acoustic model.
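For example, reusing the hypothetical English setup from 3.3:

mfa validate --ignore_acoustics ~/Documents/input_english/ english_us_arpa english_us_arpa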
3.7.2 Inspect a model or dictionary
- You can inspect the details of any local acoustic model or dictionary using the `mfa model inspect` commands. You can get information about how to reference it, the phone set, whether it performs speaker adaptation, among other things. I particularly like using this to get the phone set of the model:
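mfa model inspect acoustic english_us_arpa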
3.7.3 List models and dictionaries
- You can get a list of the local acoustic models and dictionaries in your `Documents/MFA/pretrained_models/` folders using the `mfa model list` commands:
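mfa model list acoustic
mfa model list dictionary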
3.7.4 Get help (with the argument structure!)
- Add `-h` after any command to get a help screen that will also display its argument structure. For example:
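mfa align -h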
3.7.5 Handle conda environments
If you have created multiple conda environments, you can list all of these using the following command:
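conda env list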
You can delete an environment with the following command, where ENV_NAME is the name of the environment, like `aligner`. Make sure that you have deactivated the environment before deleting it.
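conda deactivate
conda env remove -n ENV_NAME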
3.7.6 Extra
For a while, all phones were represented by a minimum of 3 HMM states, each of which has a duration of 10 ms. This meant that phones had a minimum duration of 30 ms. In MFA, this has since been updated so that phones can have varying numbers of HMM states, including just 1 state. The minimum duration is thus now 10 ms. See here for more.
The duration granularity will always be 10 ms. (Without further hand alignment, you can’t draw conclusions about any difference in duration less than this value.)
- Use the `--clean` flag each time you run the `align` command
- Don’t forget to activate the aligner
- Make sure the output folder is empty each time you run the `align` command
- The input and output folders must be different folders
- Many users have trouble reading/writing files to the Desktop folder. If you’re having issues using the Desktop, just switch to a different folder like Documents, or Google how to change the read/write permissions on your Desktop folder
And finally, there is even more to the MFA than just what I’ve put here! It’s also frequently updated and enhanced. Definitely poke around the MFA user guide and Github repo to learn more and find special functions that might make your workflow even easier.