1 Introduction
Corpus phonetics has become an increasingly popular research method in linguistic analysis. With advances in speech technology and computational power, large-scale processing of speech data has become a viable technique. A fair number of researchers have exploited these methods, yet they remain out of reach for many. Mark Liberman (2009) observed that there has been “surprisingly little change in style and scale of [phonetic] research” since 1966, implying that the field still relies on small samples of speech data. While “big data” phonetics is not the be-all and end-all of phonetic research, larger samples support more statistically sound conclusions about phonetic values in an individual or population. At the same time, corpus research is not synonymous with big data. Rather, corpus phonetics describes a method of processing speech data whose advantages come primarily from its computational power (hence its connection to big data) and its efficiency. The methods and tools developed for corpus phonetics draw primarily on engineering algorithms from automatic speech recognition (ASR), along with simple programming for data manipulation. This tutorial aims to bring some of these tools to the non-engineer, and specifically to the speech scientist.
Acoustic analysis programs such as Praat, MATLAB, and R (see the tuneR and multitaper packages) are already capable of large-scale phonetic measurement through their respective scripting languages. While the tutorial covers some phonetic processing in Praat, its primary aim is to introduce tools that supplement these programs. These tools build on concepts and algorithms from automatic speech recognition that allow phonetic boundaries to be aligned automatically to the speech signal.
In particular, the tutorial currently covers AutoVOT, various tools from the Kaldi Automatic Speech Recognition Toolkit, and the Penn Forced Aligner. Note that the Penn Forced Aligner is no longer maintained, so this section will soon be replaced or supplemented with current forced alignment systems such as FAVE and the Montreal Forced Aligner. “Forced alignment” is the automatic synchronization of a sequence of phones with an audio file. The process relies on “acoustic models” (see below) of the sounds of a language, along with a pronunciation lexicon that provides a canonical mapping from orthographic words to sequences of phones. Forced alignment greatly speeds up data processing and phonetic measurement. AutoVOT is an automatic voice onset time (VOT) measurement tool that marks the burst release and the vocalic onset of a word-initial, prevocalic stop consonant. Kaldi is an automatic speech recognition toolkit that provides the infrastructure to build “personalized” acoustic models and forced alignment systems. Acoustic models are statistical representations of the acoustic properties of each phoneme. The “personalized” component means that the system can model any corpus of speech, be it British English, Southern American English, Hungarian, or Korean. Kaldi also includes many other speech processing algorithms that may be of use to the speech scientist. This tutorial covers acoustic model training and forced alignment in Kaldi, but the toolkit as a whole offers considerably more potential for phonetic research.
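To make the role of the pronunciation lexicon concrete, here is a minimal sketch of the lexicon step only: expanding an orthographic transcript into the canonical phone sequence that an aligner would then match against the audio. The toy lexicon (in ARPAbet, the phone set of the CMU Pronouncing Dictionary) and the function name `transcript_to_phones` are hypothetical illustrations, not part of any aligner's actual interface; the acoustic-model step that locates each phone's time boundaries is not shown. Words missing from the lexicon are collected rather than guessed, since aligners generally need a pronunciation for every word in the transcript.

```python
import re

# Hypothetical toy lexicon in ARPAbet notation. Real lexicons contain many
# thousands of entries, and some words list more than one pronunciation.
LEXICON = {
    "THE": [["DH", "AH0"], ["DH", "IY0"]],
    "CAT": [["K", "AE1", "T"]],
    "SAT": [["S", "AE1", "T"]],
}


def transcript_to_phones(transcript, lexicon):
    """Expand an orthographic transcript into a canonical phone sequence.

    Returns (phones, oov_words). Only the first listed pronunciation of
    each word is used; out-of-vocabulary words are collected, not guessed.
    """
    phones = []
    oov_words = []
    # Uppercase the text and keep runs of letters and apostrophes (e.g. "DON'T").
    for word in re.findall(r"[A-Z']+", transcript.upper()):
        pronunciations = lexicon.get(word)
        if pronunciations is None:
            oov_words.append(word)
        else:
            phones.extend(pronunciations[0])
    return phones, oov_words


phones, oov_words = transcript_to_phones("The cat sat on the mat.", LEXICON)
print(phones)     # ['DH', 'AH0', 'K', 'AE1', 'T', 'S', 'AE1', 'T', 'DH', 'AH0']
print(oov_words)  # ['ON', 'MAT']
```

In practice, the out-of-vocabulary list is what you would check before running an aligner: each missing word needs an entry added to the lexicon.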
Finally, the tutorial assumes basic familiarity with Praat, as well as a Mac operating system, primarily for the Unix shell (bash) available in the Terminal application. If you are using a PC, I recommend installing Cygwin to run bash/Unix commands. For AutoVOT and the Penn Forced Aligner, most of the necessary Unix commands are provided in the tutorial itself. While I try to provide as many of the commands as possible, Kaldi requires more fluency in shell scripting. If you have not used the Terminal application before, I recommend reviewing some basic Unix commands online (Google is every programmer’s best friend). For a list of the most useful commands, see http://www.tutorialspoint.com/unix/unix-useful-commands.htm. For more detail on the argument structure of Unix commands, see https://kb.iu.edu/d/afsk.
Each section covers the prerequisites for installing the program in question, as well as a standard recipe for running it. As a rule of thumb, install all prerequisites before installing the program itself.