Background

The phonetic realization of vowels can vary considerably from speaker to speaker, from dialect to dialect, and even from generation to generation. In particular, substantial evidence has indicated variability in the vowel formants. Vowel formants are acoustic concentrations of energy in certain frequency bands that reflect resonances of the vocal tract. The lowest two formants, F1 and F2, strongly indicate vowel category identity and relate to tongue body height (F1) and backness (F2) during vowel production. At a high level, male speakers typically have lower formants than female speakers, and adult speakers have lower formant values than child speakers. Even within the same sex and dialect, formant variability is pervasive. In some cases, a speaker’s formant for vowel category X can have the same value as a second speaker’s formant for vowel category Y. In other words, no one-to-one mapping exists between the observed acoustics and the intended linguistic category. This is well known in the literature as the “lack-of-invariance” problem: the realization of any given vowel or speech segment is highly variable. Despite this variability, listeners quickly adapt to the speech patterns of novel talkers, and in many cases this adaptation is seamless.

How exactly listeners adapt has been a matter of intense debate. One early proposal is that listeners adapt to talkers using linguistic knowledge (Ladefoged & Broadbent, 1957; henceforth LB1957). Specifically, listeners may have a template of vowel categories in the F1 × F2 space based on their experience with the given language. If a talker has an exceptionally low F1 value, then the template of vowel categories could be shifted accordingly (see Figure 1; Nearey & Assmann, 2008; McMurray & Jongman, 2012; Chodroff & Wilson, 2020). In trying to make sense of the acoustic signal, then, a listener might know that “a vowel sound does not depend on the absolute values of its formant frequencies, but on the relationship between the formant frequencies for that vowel and the formant frequencies of other vowels pronounced by that speaker” (Ladefoged & Broadbent, 1957).

Figure 1. Sliding template model of talker-specific vowel adaptation.

Evidence for such an interpretation comes from the fact that listeners’ perception of a given vowel sound changes depending on the preceding speech context. LB1957 presented listeners with six versions of the sentence “Please say what this word is” in which the F1 or F2 values had been shifted up or down. The context sentence was followed by one of four target words with the frame b-vowel-t, and listeners could report hearing “bit”, “bet”, “bat”, or “but”. For an ambiguous “bit”–“bet” stimulus, when the context F1 was shifted upwards, listeners were more likely to hear “bit” than “bet”, and when it was shifted downwards, listeners were more likely to hear “bet” than “bit”. Recall that the vowel in “bit” has a relatively lower F1 than the vowel in “bet”. According to the “sliding template” interpretation, the same F1 in the target stimulus will be interpreted as low in a raised F1 context (leading to “bit” responses) and as high in a lowered F1 context (leading to “bet” responses).
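
To make the sliding template logic concrete, here is a minimal sketch in Python. It is only a toy illustration, not the model from LB1957 or from any of the papers cited above: the template means (400 Hz for “bit”, 500 Hz for “bet”), the 100 Hz context shift, and the nearest-mean decision rule are all assumptions chosen for illustration. Only the 450 Hz target F1 comes from the stimulus described here.

```python
# Toy sliding-template classifier along F1 only.
# ASSUMPTION: these template means (Hz) are illustrative, not measured values.
TEMPLATE_F1 = {"bit": 400.0, "bet": 500.0}


def classify_f1(target_f1, context_shift_hz, template=TEMPLATE_F1):
    """Return the word whose shifted template mean is closest to target_f1.

    context_shift_hz is how far the listener slides the whole template
    to match the talker heard in the context sentence (positive = raised F1).
    """
    shifted = {word: mean + context_shift_hz for word, mean in template.items()}
    return min(shifted, key=lambda word: abs(target_f1 - shifted[word]))


if __name__ == "__main__":
    target = 450.0  # F1 of the target word in the demo above
    # ASSUMPTION: a +/-100 Hz shift stands in for the raised/lowered contexts.
    for label, shift in [("raised F1 context", 100.0), ("lowered F1 context", -100.0)]:
        print(label, "->", classify_f1(target, shift))
```

With these made-up numbers, the same 450 Hz target comes out as “bit” when the template slides up and as “bet” when it slides down. With no shift at all, the target is equidistant from both means, which is one way to think about why the neutral context is the interesting case.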

Listen to the relevant target word: F1 = 450 Hz

Listen to the raised F1 context

Listen to the neutral F1 context

Listen to the lowered F1 context

A near-categorical “flip” in perception was reported in LB1957 with the above target word: participants heard “bit” after the raised F1 context and “bet” after the neutral and lowered F1 contexts. Many, but perhaps not all of you, might also experience such a categorical change in perception, at least between the raised and lowered F1 contexts. (It sadly does not work for me… more on that soon.)

Listen to the target word after the raised F1 context: “sentence 1”

Listen to the target word after the neutral F1 context: “sentence 2”

Listen to the target word after the lowered F1 context: “sentence 3”

This “flip” can also happen with F2 and vowel contrasts along the front–back dimension, but we’ll be focusing here on the F1 height effects. For a more thorough overview of LB1957, you can listen to Peter Ladefoged himself describing the experiment procedure and materials at the following link. Huge thanks to Alice Turk, Bob Ladd, Simon King, and the good folks at the University of Edinburgh Linguistics and English Language Department for their recent digitization efforts.

Listen to Peter Ladefoged present and describe the original experiment

Several studies (and by several, I might mean dozens to hundreds of studies) have followed up on this effect to investigate how exactly this perceptual adaptation occurs. Almost all of these studies replicate the direction of this type of contextual adaptation; however, very few, if any, have replicated the magnitude of the effect reported in LB1957. With respect to the sentences demonstrated above, LB1957 reported that out of 60 participants tested, 97% heard the target word in sentence 1 as “bit” and 95% heard the target word in sentence 3 as “bet”. It is important to note that this study was conducted at the University of Edinburgh in Scotland and, based on reasonable inferences, in a large lecture hall setting. As far as we can tell, all 60 participants were tested simultaneously.

As both of us (the two US authors) did not share the perception of the word in sentence 2 as “bet”, we were perplexed by these reported numbers. (We classified the target words for all three context sentences as “bit”.) Of course, the follow-up studies used different stimuli, were run in different countries and on different languages, and so forth. We therefore decided to replicate the original study as closely as possible in both the US and the UK. The discrepancy between our experience with these stimuli and the reported numbers could have been due to any number of differences, including regional differences (UK vs US), generational differences (1957 vs 2022), or environment differences (lecture hall vs headphones); alternatively, LB1957 may simply have had anomalous results. Anomalous results do not necessarily mean that LB1957 had done anything wrong; it’s just that those findings would not really generalise to a broader group of people.
We predicted that while we would also find a similar direction of effects, we would not observe the same magnitude (i.e., close to 100%).

Methods

In the UK, we replicated the study in a lecture hall with 47 participants. The following analysis includes data from 35 participants who reported English as their first language and who grew up in the UK, defined as having spent the majority of the time before age 12 in the UK.

In the US, we have not yet been able to conduct the replication to the same degree as in the UK. However, we did replicate the relevant trials in a small classroom setting with 11 participants who reported English as their first language and who grew up in the US. One of the 11 US participants completed the experiment with headphones at a computer. We will use these results for a comparison in the meantime.

Results and Discussion

We’ve simplified the results here somewhat for digestion. We will just report the percentage “bit” response for each of the three sentences demonstrated above, and this can be compared to the corresponding “bet” response (100% minus the “bit” percentage). No participants in the 2022 studies in the UK or US reported hearing “bat” or “but” on these particular trials (again, just those three demonstrated above). Two of the 60 LB1957 participants did report hearing “bat” in the lowered F1 context. In all other cases, the complement of the percent “bit” response can be directly interpreted as the percent “bet” response.

Table 1. Percentage ‘bit’ response for the target word after the three context sentences: raised F1, neutral F1, and lowered F1 in the UK 1957 (n = 60), UK 2022 (n = 35), and US 2022 (n = 11) experiments.

Context       UK 1957   UK 2022   US 2022
raised F1     97%       91%       91%
neutral F1    8%        49%       55%
lowered F1    2%        3%        45%

As you can see in Table 1, the vast majority of participants in all three experiments reported hearing “bit” for this target word following a raised F1 context. Moving to the neutral and lowered F1 contexts, things start to get interesting. In the neutral F1 context, we observe a generational divide: both US and UK 2022 participants are pretty split, with about half the participants hearing the word as “bit” and half as “bet”. However, the 1957 participants mostly reported “bet” for this trial. In the lowered F1 context, we observe a regional divide: both UK 1957 and UK 2022 participants are very likely to hear the target word as “bet”, whereas the US 2022 participants are still split down the middle, just like for the neutral F1 context.
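
Because the sample sizes differ so much (60, 35, and 11), it can help to turn the rounded percentages in Table 1 back into approximate head counts. The sketch below does this in Python. The percentages and sample sizes are copied from Table 1; the function names are my own, and the counts are reconstructions from rounded percentages, not the raw data.

```python
# Percent "bit" responses and sample sizes, copied from Table 1.
TABLE_1 = {
    "UK 1957": {"n": 60, "raised": 97, "neutral": 8, "lowered": 2},
    "UK 2022": {"n": 35, "raised": 91, "neutral": 49, "lowered": 3},
    "US 2022": {"n": 11, "raised": 91, "neutral": 55, "lowered": 45},
}
CONTEXTS = ("raised", "neutral", "lowered")


def approx_count(percent, n):
    """Approximate number of participants behind a rounded percentage."""
    return round(percent * n / 100)


def bit_counts(table=TABLE_1):
    """Approximate number of "bit" responders per experiment and context."""
    return {
        exp: {ctx: approx_count(row[ctx], row["n"]) for ctx in CONTEXTS}
        for exp, row in table.items()
    }


def percent_bet(percent_bit, n, n_other=0):
    """Percent "bet" = 100 minus percent "bit" minus any other responses.

    n_other counts participants who answered neither "bit" nor "bet"
    (the two "bat" responders in the LB1957 lowered F1 context).
    """
    return 100 - percent_bit - 100 * n_other / n


if __name__ == "__main__":
    for exp, counts in bit_counts().items():
        print(exp, counts)
    # LB1957, lowered F1 context: 2% "bit" and two "bat" responses out of 60.
    print(round(percent_bet(2, 60, n_other=2)))
```

In the US 2022 data, the difference between the neutral and lowered contexts (55% vs 45%) works out to roughly six versus five people, so it should be read with caution. The last line recovers the 95% “bet” figure quoted earlier for sentence 3 in LB1957.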

So what is happening? Our best guess, which is speculative, is that the contrast between [ɪ] and [ɛ] differs between the US and the UK, and possibly also between generations in the UK. The generational difference is going to be a bit harder to explore than the regional one. The following is pure exploration, and a bit of hand waving. I’m hoping to follow this up with more detailed exploration of the vowel systems for each of the two groups. Looking at several reports of US English vowel formants for male speakers, it appears that the median “ih” F1 is around 450 Hz and the median “eh” F1 is around 550 Hz (see Clopper et al., 2005). For UK English, and glossing over so many dialectal differences, the median “ih” and “eh” F1s might be shifted down by about 50 Hz relative to the US English medians (again, this is very hand wavy, and needs to be verified): the median “ih” F1 for UK male speakers tends to be around 400 Hz, and the median “eh” F1 around 500 Hz (see Ferragne & Pellegrino, 2010). In other words, the target word F1 of 450 Hz might plausibly fall within the “eh” range for UK English listeners (it lies midway between the two UK medians), but it is on the low end of that range for US English listeners, where it coincides with the median “ih” F1. That region of the vowel space is still more likely to be an “ih” vowel in US English.
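
The back-of-the-envelope comparison above can be written out as a short Python sketch. It uses only the rough medians quoted in the previous paragraph, which are themselves hand-wavy, and it places the 450 Hz target on a line running from the “ih” median (0.0) to the “eh” median (1.0). The function name and the linear scale are assumptions for illustration. The sketch ignores F2, within-category variance, and dialect differences, so it is a sanity check on the arithmetic and nothing more.

```python
# Rough median F1 values (Hz) for male speakers, as quoted in the text above.
# ASSUMPTION: these are approximate and still need to be verified.
MEDIAN_F1_HZ = {
    "US": {"ih": 450.0, "eh": 550.0},
    "UK": {"ih": 400.0, "eh": 500.0},
}
TARGET_F1_HZ = 450.0  # F1 of the target word


def relative_position(target, ih, eh):
    """Position of target between the two medians: 0.0 = "ih", 1.0 = "eh"."""
    return (target - ih) / (eh - ih)


if __name__ == "__main__":
    for region, medians in MEDIAN_F1_HZ.items():
        pos = relative_position(TARGET_F1_HZ, medians["ih"], medians["eh"])
        print(f"{region}: {pos:.2f} of the way from 'ih' to 'eh'")
```

On these numbers, the target sits exactly on the US “ih” median (0.0) but halfway to “eh” on the UK scale (0.5), which is at least in the right direction to explain why US listeners hold on to “bit” more than UK listeners do.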