1aSC2 – The McGurk Illusion

Kristin J. Van Engen – kvanengen@wustl.edu
Washington University in St. Louis
1 Brookings Dr.
Saint Louis, MO 63130

Popular version of paper 1aSC2, “The McGurk illusion”
Presented Tuesday morning, June 8, 2021
180th ASA Meeting, Acoustics in Focus

In 1976, Harry McGurk and John MacDonald published their now-famous article, “Hearing Lips and Seeing Voices.” The study was a remarkable demonstration of how what we see affects what we hear: when the audio for the syllable “ba” was presented to listeners with the video of a face saying “ga”, listeners consistently reported hearing “da”.

That original paper has been cited approximately 7500 times to date, and in the subsequent 45 years, the “McGurk effect” has been used in countless studies of audiovisual processing in humans. It is typically assumed that people who are more susceptible to the illusion are also better at integrating auditory and visual information. This assumption has led to the use of susceptibility to the McGurk illusion as a measure of an individual’s ability to process audiovisual speech.

However, when it comes to understanding real-world multisensory speech perception, there are several reasons to think that McGurk-style stimuli are poorly suited to the task. Most problematic is the fact that McGurk stimuli rely on audiovisual incongruence that never occurs in real-life audiovisual speech perception. Furthermore, recent studies show that susceptibility to the effect does not actually correlate with performance on audiovisual speech perception tasks such as understanding sentences in noisy conditions. This presentation reviews these issues, arguing that, while the McGurk effect is a fascinating illusion, it is the wrong tool for understanding the combined use of auditory and visual information during speech perception.

4pSC11 – The role of familiarity in audiovisual speech perception

Chao-Yang Lee – leec1@ohio.edu
Margaret Harrison – mh806711@ohio.edu
Ohio University
Grover Center W225
Athens, OH 45701

Seth Wiener – sethw1@cmu.edu
Carnegie Mellon University
160 Baker Hall, 5000 Forbes Avenue
Pittsburgh, PA 15213

Popular version of paper 4pSC11, “The role of familiarity in audiovisual speech perception”
Presented Thursday afternoon, December 7, 2017, 1:00-4:00 PM, Studios Foyer
174th ASA Meeting, New Orleans

When we listen to someone talk, we hear not only the content of the spoken message, but also the speaker’s voice carrying the message. Although understanding content does not require identifying a specific speaker’s voice, familiarity with a speaker has been shown to facilitate speech perception (Nygaard & Pisoni, 1998) and spoken word recognition (Lee & Zhang, 2017).

Because we often communicate with a visible speaker, what we hear is also affected by what we see. This is famously demonstrated by the McGurk effect (McGurk & MacDonald, 1976). For example, an auditory “ba” paired with a visual “ga” usually elicits a perceived “da” that is not present in the auditory or the visual input.

Since familiarity with a speaker’s voice affects auditory perception, does familiarity with a speaker’s face similarly affect audiovisual perception? Walker, Bruce, and O’Malley (1995) found that familiarity with a speaker reduced the occurrence of the McGurk effect. This finding supports the “unity” assumption of intersensory integration (Welch & Warren, 1980), but challenges the proposal that processing facial speech is independent of processing facial identity (Bruce & Young, 1986; Green, Kuhl, Meltzoff, & Stevens, 1991).

In this study, we explored audiovisual speech perception by investigating how familiarity with a speaker affects the perception of English fricatives “s” and “sh”. These two sounds are useful because they contrast visibly in lip rounding. In particular, the lips are usually protruded for “sh” but not “s”, meaning listeners can potentially identify the contrast based on visual information.

Listeners were asked to watch/listen to stimuli that were audio-only, visual-only, audiovisual-congruent, or audiovisual-incongruent (e.g., audio “save” paired with visual “shave”). The listeners’ task was to identify whether the first sound of the stimuli was “s” or “sh”. We tested two groups of native English listeners – one familiar with the speaker who produced the stimuli and one unfamiliar with the speaker.

The results showed that listeners familiar with the speaker identified the fricatives faster in all conditions (Figure 1) and more accurately in the visual-only condition (Figure 2). That is, listeners familiar with the speaker were more efficient in identifying the fricatives overall, and were more accurate when visual input was the only source of information.
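Figure 2 reports accuracy as d’, the sensitivity index from signal detection theory, which is the difference between the z-transformed hit rate and the z-transformed false-alarm rate. The sketch below shows one common way to compute it; it is not the authors' analysis code. Treating “sh” as the signal category, the log-linear correction for extreme rates, and the function name `d_prime` are all assumptions made for illustration.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Return d' = z(hit rate) - z(false-alarm rate).

    Assumed (hypothetical) coding for the "s"/"sh" task:
      hit               = "sh" presented, "sh" reported
      miss              = "sh" presented, "s" reported
      false alarm       = "s" presented, "sh" reported
      correct rejection = "s" presented, "s" reported
    """
    # Log-linear correction (an assumption, not taken from the paper):
    # add 0.5 to each count and 1 to each total so that rates of exactly
    # 0 or 1 do not produce infinite z-scores.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)

    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


if __name__ == "__main__":
    # Made-up counts for one listener in one condition.
    print(round(d_prime(45, 5, 5, 45), 2))
```

A d’ of zero means the listener could not tell the two fricatives apart, and larger values mean better discrimination regardless of any bias toward answering “s” or “sh”.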

We also examined whether visual familiarity affects the occurrence of the McGurk effect. Listeners were asked to identify syllable-initial stops (“b”, “d”, “g”) from stimuli that were audiovisual-congruent or incongruent (e.g., audio “ba” paired with visual “ga”). A blended (McGurk) response was indicated by a “da” response to an auditory “ba” paired with a visual “ga”.
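The scoring rule just described can be sketched as a simple proportion: among trials that pair an auditory “ba” with a visual “ga”, count how many drew a “da” response. The code below is a minimal illustration under assumptions; the trial representation as `(audio, visual, response)` tuples and the function name `mcgurk_proportion` are hypothetical and not taken from the study.

```python
def mcgurk_proportion(trials):
    """Proportion of blended ("da") responses on audio-"ba"/visual-"ga" trials.

    `trials` is an iterable of (audio, visual, response) string tuples,
    a hypothetical data layout chosen for this sketch.
    Returns None when there are no audio-"ba"/visual-"ga" trials.
    """
    # Keep only the responses to the incongruent pairing of interest.
    responses = [resp for audio, visual, resp in trials
                 if audio == "ba" and visual == "ga"]
    if not responses:
        return None
    # A "da" response counts as a McGurk (blended) response.
    return sum(resp == "da" for resp in responses) / len(responses)


if __name__ == "__main__":
    # Made-up trials for one listener.
    example = [
        ("ba", "ga", "da"),
        ("ba", "ga", "ba"),
        ("ba", "ba", "ba"),  # congruent trial, ignored by the measure
        ("ba", "ga", "da"),
        ("ba", "ga", "ga"),
    ]
    print(mcgurk_proportion(example))
```

Computing this value per listener gives a per-group distribution that can then be compared between the familiar and unfamiliar listeners.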

In contrast to the “s”-“sh” findings reported above, the results from this identification task showed no difference between the familiar and unfamiliar listeners in the proportion of McGurk responses. This result does not replicate the finding of Walker, Bruce, and O’Malley (1995).

In sum, familiarity with a speaker facilitated the speed of identifying fricatives from audiovisual stimuli. Familiarity also improved the accuracy of fricative identification when visual input was the only source of information. Although we did not find an effect of familiarity on the McGurk responses, our findings from the fricative task suggest that processing audiovisual speech is affected by speaker identity.

Figure 1 – Reaction time of fricative identification from stimuli that were audio-only, visual-only, audiovisual-congruent, or audiovisual-incongruent. Error bars indicate 95% confidence intervals.

Figure 2 – Accuracy of fricative identification (d’) from stimuli that were audio-only, visual-only, audiovisual-congruent, or audiovisual-incongruent (e.g., audio “save” paired with visual “shave”). Error bars indicate 95% confidence intervals.

Figure 3 – Proportion of McGurk responses (“da” response to audio “ba” paired with visual “ga”).

Video 1 – Example of an audiovisual-incongruent stimulus (audio “save” paired with visual “shave”).

Video 2 – Example of an audiovisual-incongruent stimulus (audio “ba” paired with visual “ga”).

References:

Bruce, V., & Young, A. (1986). Understanding face recognition. British Journal of Psychology, 77, 305-327.

Green, K. P., Kuhl, P. K., Meltzoff, A. N., & Stevens, E. B. (1991). Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50, 524-536.

Lee, C.-Y., & Zhang, Y. (in press). Processing lexical and speaker information in repetition and semantic/associative priming. Journal of Psycholinguistic Research.

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.

Nygaard, L. C., & Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception & Psychophysics, 60, 355-376.

Walker, S., Bruce, V., & O’Malley, C. (1995). Facial identity and facial speech processing: Familiar faces and voices in the McGurk effect. Perception & Psychophysics, 57, 1124-1133.

Welch, R. B., & Warren, D. H. (1980). Immediate perceptual response to intersensory discrepancy. Psychological Bulletin, 88, 638-667.