4pSC11 – The role of familiarity in audiovisual speech perception

Chao-Yang Lee – leec1@ohio.edu
Margaret Harrison – mh806711@ohio.edu
Ohio University
Grover Center W225
Athens, OH 45701

Seth Wiener – sethw1@cmu.edu
Carnegie Mellon University
160 Baker Hall, 5000 Forbes Avenue
Pittsburgh, PA 15213

Popular version of paper 4pSC11, “The role of familiarity in audiovisual speech perception”
Presented Thursday afternoon, December 7, 2017, 1:00-4:00 PM, Studios Foyer
174th ASA Meeting, New Orleans

When we listen to someone talk, we hear not only the content of the spoken message, but also the speaker’s voice carrying the message. Although understanding content does not require identifying a specific speaker’s voice, familiarity with a speaker has been shown to facilitate speech perception (Nygaard & Pisoni, 1998) and spoken word recognition (Lee & Zhang, 2017).

Because we often communicate with a visible speaker, what we hear is also affected by what we see. This is famously demonstrated by the McGurk effect (McGurk & MacDonald, 1976). For example, an auditory “ba” paired with a visual “ga” usually elicits a perceived “da” that is not present in the auditory or the visual input.

Since familiarity with a speaker’s voice affects auditory perception, does familiarity with a speaker’s face similarly affect audiovisual perception? Walker, Bruce, and O’Malley (1995) found that familiarity with a speaker reduced the occurrence of the McGurk effect. This finding supports the “unity” assumption of intersensory integration (Welch & Warren, 1980), but challenges the proposal that processing facial speech is independent of processing facial identity (Bruce & Young, 1986; Green, Kuhl, Meltzoff, & Stevens, 1991).

In this study, we explored audiovisual speech perception by investigating how familiarity with a speaker affects the perception of English fricatives “s” and “sh”. These two sounds are useful because they contrast visibly in lip rounding. In particular, the lips are usually protruded for “sh” but not “s”, meaning listeners can potentially identify the contrast based on visual information.

Listeners were asked to watch/listen to stimuli that were audio-only, visual-only, audiovisual-congruent, or audiovisual-incongruent (e.g., audio “save” paired with visual “shave”). The listeners’ task was to identify whether the first sound of the stimuli was “s” or “sh”. We tested two groups of native English listeners – one familiar with the speaker who produced the stimuli and one unfamiliar with the speaker.

The results showed that listeners familiar with the speaker identified the fricatives faster in all conditions (Figure 1) and more accurately in the visual-only condition (Figure 2). That is, listeners familiar with the speaker were more efficient in identifying the fricatives overall, and were more accurate when visual input was the only source of information.
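
Figure 2 reports accuracy as d′, a sensitivity index from signal detection theory. As a minimal sketch of how such a value can be computed, the snippet below treats "sh" as the target category and derives d′ from counts of correct and incorrect responses. The function name, the choice of target, the log-linear correction, and the example counts are all assumptions for illustration; they are not the analysis reported in the paper.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Return d' = z(hit rate) - z(false-alarm rate).

    Hypothetical helper: here a "hit" is an "sh" response to an "sh"
    stimulus, and a "false alarm" is an "sh" response to an "s" stimulus.
    """
    # Log-linear correction (assumed, not from the source): adding 0.5 to
    # each count keeps both rates strictly between 0 and 1, so the inverse
    # normal CDF below is always defined.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(fa_rate)


# Made-up counts for one listener in one condition (not data from the study).
print(round(d_prime(hits=45, misses=5, false_alarms=5, correct_rejections=45), 2))
```

A d′ of 0 means the listener cannot tell the two fricatives apart; larger values indicate better discrimination regardless of any bias toward one response.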

We also examined whether visual familiarity affects the occurrence of the McGurk effect. Listeners were asked to identify syllable-initial stops (“b”, “d”, “g”) from stimuli that were audiovisual-congruent or incongruent (e.g., audio “ba” paired with visual “ga”). A blended (McGurk) response was indicated by a “da” response to an auditory “ba” paired with a visual “ga”.

Contrary to the “s”-“sh” findings reported earlier, the results from our identification task showed no difference between the familiar and unfamiliar listeners in the proportion of McGurk responses. This finding did not replicate Walker, Bruce, and O’Malley (1995).

In sum, familiarity with a speaker facilitated the speed of identifying fricatives from audiovisual stimuli. Familiarity also improved the accuracy of fricative identification when visual input was the only source of information. Although we did not find an effect of familiarity on the McGurk responses, our findings from the fricative task suggest that processing audiovisual speech is affected by speaker identity.

Figure 1 – Reaction time of fricative identification from stimuli that were audio-only, visual-only, audiovisual-congruent, or audiovisual-incongruent. Error bars indicate 95% confidence intervals.

Figure 2 – Accuracy of fricative identification (d’) from stimuli that were audio-only, visual-only, audiovisual-congruent, or audiovisual-incongruent (e.g., audio “save” paired with visual “shave”). Error bars indicate 95% confidence intervals.

Figure 3 – Proportion of McGurk response (“da” response to audio “ba” paired with visual “ga”).

Video 1 – Example of an audiovisual-incongruent stimulus (audio “save” paired with visual “shave”).

Video 2 – Example of an audiovisual-incongruent stimulus (audio “ba” paired with visual “ga”).


Bruce, V., & Young, A. (1986). Understanding face recognition. British Journal of Psychology, 77, 305-327.

Green, K. P., Kuhl, P. K., Meltzoff, A. N., & Stevens, E. B. (1991). Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50, 524-536.

Lee, C.-Y., & Zhang, Y. (in press). Processing lexical and speaker information in repetition and semantic/associative priming. Journal of Psycholinguistic Research.

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.

Nygaard, L. C., & Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception & Psychophysics, 60, 355-376.

Walker, S., Bruce, V., & O’Malley, C. (1995). Facial identity and facial speech processing: Familiar faces and voices in the McGurk effect. Perception & Psychophysics, 57, 1124-1133.

Welch, R. B., & Warren, D. H. (1980). Immediate perceptual response to intersensory discrepancy. Psychological Bulletin, 88, 638-667.

4aSC2 – Effects of language and music experience on speech perception

T. Christina Zhao — zhaotc@uw.edu
Patricia K. Kuhl — pkkuhl@uw.edu
Institute for Learning & Brain Sciences
University of Washington, BOX 357988
Seattle, WA, 98195

Popular version of paper 4aSC2, “Top-down linguistic categories dominate over bottom-up acoustics in lexical tone processing”
Presented Thursday morning, May 21st, 2015, 8:00 AM, Ballroom 2
169th ASA Meeting, Pittsburgh

Speech perception involves constant interplay between top-down and bottom-up processing. For example, to process phonemes (e.g., to tell ‘b’ from ‘p’), the listener must accurately process the acoustic information in the speech signal (i.e., a bottom-up strategy) and efficiently assign these sounds to a category (i.e., a top-down strategy). Listeners’ performance in speech perception tasks is influenced by their experience with either processing strategy. Here, we use lexical tone processing as a window to examine how extensive experience with both strategies influences speech perception.

Lexical tones are contrastive pitch contour patterns at the word level. That is, a small difference in the pitch contour can result in different word meaning. Native speakers of a tonal language thus have extensive experience in using the top-down strategy to assign highly variable pitch contours into lexical tone categories. This top-down influence is reflected by the reduced sensitivity to acoustic differences within a phonemic category compared to across categories (Halle, Chang, & Best, 2004). On the other hand, individuals with extensive music training early in life exhibit enhanced sensitivities to pitch differences not only in music, but also in speech, reflecting stronger bottom-up influence. Such bottom-up influence is reflected by the enhanced sensitivity in detecting differences between lexical tones when the listeners are non-tonal language speakers (Wong, Skoe, Russo, Dees, & Kraus, 2007).
How does extensive experience with both strategies influence lexical tone processing? To address this question, native Mandarin speakers with extensive music training (N=17) completed a music pitch discrimination task and a lexical tone discrimination task. We compared their performance with that of individuals with extensive experience in only one of the processing strategies, namely Mandarin nonmusicians (N=20) and English musicians (N=20), using data from Zhao & Kuhl (2015).

Despite the enhanced performance in the music pitch discrimination task in Mandarin musicians, their performance in the lexical tone discrimination task is similar to the performance of the Mandarin nonmusicians, and different from the English musicians’ performance (Fig. 1, ‘Sensitivity across lexical tone continuum by group’).
That is, they exhibited reduced sensitivity within phonemic categories (i.e., at either end of the line) compared to across categories (i.e., the middle of the line), and their overall performance is lower than that of the English musicians. This result strongly suggests a dominant effect of top-down influence in processing lexical tones. Yet further analyses revealed that Mandarin musicians and Mandarin nonmusicians may still rely on different underlying mechanisms when performing the lexical tone discrimination task. In the Mandarin musicians, music pitch discrimination scores are correlated with lexical tone discrimination scores, suggesting a contribution of the bottom-up strategy to their lexical tone discrimination performance (Fig. 2, ‘Music pitch and lexical tone discrimination’, purple). This relation is similar to that in the English musicians (Fig. 2, peach) but very different from that in the Mandarin nonmusicians (Fig. 2, yellow). Specifically, for Mandarin nonmusicians, the music pitch discrimination scores do not correlate with the lexical tone discrimination scores, suggesting independent processes.
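
The relation between music pitch and lexical tone discrimination scores described above can be quantified with a correlation coefficient. The sketch below computes a Pearson correlation for one group of listeners. The source does not state which correlation statistic was used, so Pearson is an assumption, and the function name and all scores are hypothetical values chosen only to show the calculation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for five listeners in one group (NOT data from the study).
music_pitch = [0.61, 0.70, 0.74, 0.82, 0.90]
lexical_tone = [1.1, 1.4, 1.3, 1.8, 2.0]
r = pearson_r(music_pitch, lexical_tone)
print(f"r = {r:.2f}")
```

A value of r near 1 would correspond to the pattern described for the musician groups, where higher music pitch scores go with higher lexical tone scores; a value near 0 would correspond to the independence described for the Mandarin nonmusicians.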


Halle, P. A., Chang, Y. C., & Best, C. T. (2004). Identification and discrimination of Mandarin Chinese tones by Mandarin Chinese vs. French listeners. Journal of Phonetics, 32(3), 395-421. doi: 10.1016/s0095-4470(03)00016-0
Wong, P. C. M., Skoe, E., Russo, N. M., Dees, T., & Kraus, N. (2007). Musical experience shapes human brainstem encoding of linguistic pitch patterns. Nat. Neurosci., 10(4), 420-422. doi: 10.1038/nn1872
Zhao, T. C., & Kuhl, P. K. (2015). Effect of musical experience on learning lexical tone categories. The Journal of the Acoustical Society of America, 137(3), 1452-1463. doi: 10.1121/1.4913457