The role of familiarity in audiovisual speech perception

Chao-Yang Lee – leec1@ohio.edu
Margaret Harrison – mh806711@ohio.edu
Ohio University
Grover Center W225
Athens, OH 45701

Seth Wiener – sethw1@cmu.edu
Carnegie Mellon University
160 Baker Hall, 5000 Forbes Avenue
Pittsburgh, PA 15213

 

Popular version of paper 4pSC11, “The role of familiarity in audiovisual speech perception”

Presented Thursday afternoon, December 7, 2017, 1:00-4:00 PM, Studios Foyer

174th ASA Meeting, New Orleans

 

When we listen to someone talk, we hear not only the content of the spoken message, but also the speaker’s voice carrying the message. Although understanding content does not require identifying a specific speaker’s voice, familiarity with a speaker has been shown to facilitate speech perception (Nygaard & Pisoni, 1998) and spoken word recognition (Lee & Zhang, in press).

Because we often communicate with a visible speaker, what we hear is also affected by what we see. This is famously demonstrated by the McGurk effect (McGurk & MacDonald, 1976). For example, an auditory “ba” paired with a visual “ga” usually elicits a perceived “da” that is not present in the auditory or the visual input.

Since familiarity with a speaker’s voice affects auditory perception, does familiarity with a speaker’s face similarly affect audiovisual perception? Walker, Bruce, and O’Malley (1995) found that familiarity with a speaker reduced the occurrence of the McGurk effect. This finding supports the “unity” assumption of intersensory integration (Welch & Warren, 1980), but challenges the proposal that processing facial speech is independent of processing facial identity (Bruce & Young, 1986; Green, Kuhl, Meltzoff, & Stevens, 1991).

In this study, we explored audiovisual speech perception by investigating how familiarity with a speaker affects the perception of English fricatives “s” and “sh”. These two sounds are useful because they contrast visibly in lip rounding. In particular, the lips are usually protruded for “sh” but not “s”, meaning listeners can potentially identify the contrast based on visual information.

Listeners were asked to watch and/or listen to stimuli that were audio-only, visual-only, audiovisual-congruent, or audiovisual-incongruent (e.g., audio “save” paired with visual “shave”). The listeners’ task was to identify whether the first sound of each stimulus was “s” or “sh”. We tested two groups of native English listeners: one familiar with the speaker who produced the stimuli and one unfamiliar with the speaker.

The results showed that listeners familiar with the speaker identified the fricatives faster in all conditions (Figure 1) and more accurately in the visual-only condition (Figure 2). That is, listeners familiar with the speaker were more efficient in identifying the fricatives overall, and were more accurate when visual input was the only source of information.
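Accuracy in Figure 2 is reported as d’, a sensitivity measure from signal detection theory that separates a listener’s ability to tell two categories apart from any bias toward one response. The sketch below shows one common way to compute it. It is an illustration under stated assumptions, not the analysis used in the study: treating “sh” as the signal category, the function name, the log-linear correction, and the example counts are all hypothetical.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Assumption for illustration: "sh" is the signal category, so a hit is
    an "sh" response to an "sh" stimulus and a false alarm is an "sh"
    response to an "s" stimulus.
    """
    # Log-linear correction (add 0.5 to each count and 1 to each total)
    # keeps both rates strictly between 0 and 1 so the z-scores stay finite.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)

    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Made-up counts for one listener in one condition (not data from the study).
example = d_prime(hits=45, misses=5, false_alarms=10, correct_rejections=40)
print(round(example, 2))
```

A d’ of 0 means the listener could not distinguish “s” from “sh” at all, and larger values mean better discrimination.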

We also examined whether visual familiarity affects the occurrence of the McGurk effect. Listeners were asked to identify syllable-initial stops (“b”, “d”, “g”) from stimuli that were audiovisual-congruent or incongruent (e.g., audio “ba” paired with visual “ga”). A blended (McGurk) response was indicated by a “da” response to an auditory “ba” paired with a visual “ga”.

Contrary to the “s”-“sh” findings reported earlier, the results from our identification task showed no difference between the familiar and unfamiliar listeners in the proportion of McGurk responses (Figure 3). We therefore did not replicate the finding of Walker, Bruce, and O’Malley (1995).

In sum, familiarity with a speaker sped up fricative identification in all stimulus conditions. Familiarity also improved the accuracy of fricative identification when visual input was the only source of information. Although we did not find an effect of familiarity on the McGurk responses, our findings from the fricative task suggest that processing audiovisual speech is affected by speaker identity.

Figure 1. Reaction time of fricative identification from stimuli that were audio-only, visual-only, audiovisual-congruent, or audiovisual-incongruent. Error bars indicate 95% confidence intervals.

 

Figure 2. Accuracy of fricative identification (d’) from stimuli that were audio-only, visual-only, audiovisual-congruent, or audiovisual-incongruent (e.g., audio “save” paired with visual “shave”). Error bars indicate 95% confidence intervals.

Figure 3. Proportion of McGurk responses (“da” response to audio “ba” paired with visual “ga”).

 

Video 1. Example of an audiovisual-incongruent stimulus (audio “save” paired with visual “shave”).

 

Video 2. Example of an audiovisual-incongruent stimulus (audio “ba” paired with visual “ga”).

 

References:

 

Bruce, V., & Young, A. (1986). Understanding face recognition. British Journal of Psychology, 77, 305-327.

Green, K. P., Kuhl, P. K., Meltzoff, A. N., & Stevens, E. B. (1991). Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50, 524-536.

Lee, C.-Y., & Zhang, Y. (in press). Processing lexical and speaker information in repetition and semantic/associative priming. Journal of Psycholinguistic Research.

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.

Nygaard, L. C., & Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception & Psychophysics, 60, 355-376.

Walker, S., Bruce, V., & O’Malley, C. (1995). Facial identity and facial speech processing: Familiar faces and voices in the McGurk effect. Perception & Psychophysics, 57, 1124-1133.

Welch, R. B., & Warren, D. H. (1980). Immediate perceptual response to intersensory discrepancy. Psychological Bulletin, 88, 638-667.

 

 
