ASA Lay Language Papers
162nd Acoustical Society of America Meeting


What causes the confusion patterns in speech perception?

Feipeng Li -- fli12@jhmi.edu
Center for Language and Speech Processing
Johns Hopkins University
Baltimore, MD 21205

Jont B. Allen
ECE Department
University of Illinois at Urbana Champaign
Urbana, IL 61801

Popular version of paper 2aSC1
Presented Tuesday morning, November 1, 2011
162nd ASA Meeting, San Diego, Calif.


In a classic study of perceptual confusions among English consonants [1], Miller and Nicely showed that stops, fricatives, and nasals are often confused with other consonants in the same category. For instance, at 0 dB SNR in white noise, /ka/ is mistakenly heard as /pa/ or /ta/ with a probability of more than 50%, but it is seldom confused with fricatives or nasals. What is the underlying physical mechanism behind these confusion patterns?
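A confusion pattern of this kind is usually summarized as a table of response probabilities: for each spoken consonant, the fraction of trials on which each response was given. The following is a minimal sketch of that bookkeeping. The function name `confusion_probabilities` and the toy trial list are hypothetical illustrations, not data or code from the study.

```python
from collections import Counter, defaultdict


def confusion_probabilities(trials):
    """Turn (spoken, heard) pairs into {spoken: {heard: probability}}."""
    counts = defaultdict(Counter)
    for spoken, heard in trials:
        counts[spoken][heard] += 1

    probs = {}
    for spoken, row in counts.items():
        total = sum(row.values())
        # Normalize each row so the responses to one spoken sound sum to 1.
        probs[spoken] = {heard: n / total for heard, n in row.items()}
    return probs


# Hypothetical toy responses, made up for illustration only.
toy_trials = [("ka", "ka")] * 4 + [("ka", "pa")] * 3 + [("ka", "ta")] * 3
toy_probs = confusion_probabilities(toy_trials)
```

Each row of the resulting table describes how one spoken sound is distributed over the responses, which is the form in which within-category confusions such as /ka/ heard as /pa/ or /ta/ become visible.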

To analyze how speech sounds are encoded in the human auditory system, we have developed a method named 3D Deep Search (3DDS) [2] to identify the perceptual cues of consonant sounds in natural speech. Given a speech sound, 3DDS first calculates an AI-gram, which predicts the audibility of speech components to the central auditory system by simulating peripheral auditory processing. To measure the contribution of individual time-frequency components to sound identification, we then systematically remove speech components, present the modified sound to human listeners, and assess the contribution of the removed components from the change in recognition score.
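The remove-and-measure idea can be sketched in a few lines. This is a simplified illustration and not the 3DDS implementation: it assumes the sound is represented as a small grid of time-frequency values, and that recognition scores (which in practice come from listening experiments) are supplied by hand. The names `remove_region`, `score_drops`, and `most_important`, and all numbers, are hypothetical.

```python
def remove_region(grid, freq_rows, time_cols):
    """Return a copy of grid with one rectangular time-frequency region zeroed.

    grid is a list of rows, one row per frequency band, one column per time frame.
    """
    rows, cols = set(freq_rows), set(time_cols)
    return [
        [0.0 if (i in rows and j in cols) else value for j, value in enumerate(row)]
        for i, row in enumerate(grid)
    ]


def score_drops(baseline_score, scores_after_removal):
    """Map each removed region to the drop in recognition score it causes."""
    return {
        region: baseline_score - score
        for region, score in scores_after_removal.items()
    }


def most_important(drops):
    """The region whose removal hurts recognition the most."""
    return max(drops, key=drops.get)


# Hypothetical toy values, made up for illustration only.
toy_grid = [[0.2, 0.9, 0.1], [0.1, 0.8, 0.3]]
modified = remove_region(toy_grid, freq_rows=[0], time_cols=[1])
toy_drops = score_drops(0.95, {"mid_burst": 0.40, "high_burst": 0.90})
```

A region whose removal produces a large score drop is a candidate perceptual cue; a region whose removal leaves the score almost unchanged contributes little to identification.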



Figure 1. A /ka/ pronounced by talker f103 contains a defining cue around 1.4 kHz (solid box) and a conflicting cue above 2.8 kHz (dashed box) that causes the /ka/→/ta/ confusion.

Using the 3DDS method, we have identified the perceptual cues of six stop consonants /b, d, g, p, t, k/ and eight fricatives /f, T, s, S, v, D, z, Z/. We found that many speech sounds contain conflicting cues that are characteristic of confusable sounds, a consequence of the physical limitations of the articulatory organs. For example, a /ka/ (see Fig. 1), defined by a mid-frequency burst from 1 to 2 kHz, may also contain a high-frequency burst above 4 kHz that is indicative of /ta/, or vice versa. Similarly, a /ga/ (see Fig. 2), defined by a mid-frequency burst from 1.2 to 2.4 kHz, also contains a high-frequency burst above 2.8 kHz that causes human listeners to hear /da/.


Figure 2. A /ga/ pronounced by talker f103 contains a defining cue around 1.4 kHz (solid box) and a conflicting cue above 2.8 kHz (dashed box) that causes the /ga/→/da/ confusion.

To demonstrate the impact of the conflicting cue on the identification of consonant sounds, we manipulate the dominant cue and the conflicting cue of /ka/ and /ga/ by switching them on and off before presenting the sounds to human listeners. The results indicate that the conflicting cue has little effect on consonant identification when the dominant cue of the target sound is audible. In fact, most subjects report that they cannot hear any difference between the original sound and the sound with the conflicting cue removed. Once the dominant cue has been removed, the conflicting cue determines what the subjects hear. When the conflicting cue is turned off, half of the subjects report /ka/ and the other half report /pa/ (it seems that this sound also has a conflicting cue for /pa/); the subjects are guessing what they hear. When the conflicting cue is turned on, about 30% of the subjects report hearing /ta/. Similar results are observed for /ga/. The fact that /ka/ and /ga/ morph into /ta/ and /da/, respectively, once the dominant cues are removed suggests that conflicting cues play a significant role in forming the confusion patterns of speech perception.
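The on/off manipulation amounts to a two-by-two design: each of the two cues is either present or removed, giving four versions of every sound. A minimal sketch of how such conditions could be enumerated is shown here, assuming the sound is a small grid of time-frequency values and each cue is a set of grid cells. The names `make_conditions` and `apply_condition`, the cell positions, and the grid values are hypothetical, not taken from the experiment.

```python
from itertools import product


def make_conditions():
    """All four on/off combinations of the dominant and conflicting cues."""
    return [
        {"dominant": dominant, "conflicting": conflicting}
        for dominant, conflicting in product((True, False), repeat=2)
    ]


def apply_condition(grid, dominant_cells, conflicting_cells, condition):
    """Zero the cells of every cue that is switched off in this condition."""
    off = set()
    if not condition["dominant"]:
        off |= set(dominant_cells)
    if not condition["conflicting"]:
        off |= set(conflicting_cells)
    return [
        [0.0 if (i, j) in off else value for j, value in enumerate(row)]
        for i, row in enumerate(grid)
    ]


# Hypothetical toy grid and cue locations, made up for illustration only.
toy_grid = [[1.0, 1.0], [1.0, 1.0]]
conditions = make_conditions()
stimuli = [
    apply_condition(toy_grid, [(0, 0)], [(1, 1)], condition)
    for condition in conditions
]
```

Comparing listener responses across the four versions separates the effect of the dominant cue from that of the conflicting cue.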

References

[1] G. A. Miller and P. E. Nicely (1955), “An analysis of perceptual confusions among some English consonants,” J. Acoust. Soc. Am., 27(2), pp. 338-352.

[2] F. Li, A. Menon, and J. B. Allen (2010), “A psychoacoustic method to find the perceptual cues of stop consonants in natural speech,” J. Acoust. Soc. Am., 127(4), pp. 2599-2610.
