The science of baby speech sounds: men and women may experience them differently

M. Fernanda Alonso Arteche – maria.alonsoarteche@mail.mcgill.ca
Instagram: @laneurotransmisora

School of Communication Science and Disorders, McGill University, Center for Research on Brain, Language, and Music (CRBLM), Montreal, QC, H3A 0G4, Canada

Instagram: @babylabmcgill

Popular version of 2pSCa – Implicit and explicit responses to infant sounds: a cross-sectional study among parents and non-parents
Presented at the 186th ASA Meeting
Read the abstract at https://eppro02.ativ.me/web/index.php?page=IntHtml&project=ASASPRING24&id=3672643

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

Imagine hearing a baby coo and instantly feeling a surge of positivity. Surprisingly, how we react to the simple sounds of a baby speaking might depend on whether we are women or men, and whether we are parents. Our lab’s research delves into this phenomenon, revealing intriguing differences in how adults perceive baby vocalizations, with a particular focus on mothers, fathers, and non-parents.

Using a method that measures reaction time to sounds, we compared adults’ responses to vowel sounds produced by a baby and by an adult, as well as meows produced by a cat and by a kitten. We found that women, including mothers, tend to respond positively only to baby speech sounds. On the other hand, men, especially fathers, showed a more neutral reaction to all sounds. This suggests that the way we process human speech sounds, particularly those of infants, may vary significantly between genders. While previous studies report that both men and women generally show a positive response to baby faces, our findings indicate that their speech sounds might affect us differently.

Moreover, mothers rated babies and their sounds highly, expressing a strong liking for babies, their cuteness, and the cuteness of their sounds. Fathers, although less responsive in the reaction task, still gave high ratings for their liking of babies, babies’ cuteness, and the appeal of their sounds. This contrast between implicit (subconscious) reactions and explicit (conscious) opinions highlights an interesting complexity in parental instincts and perceptions. Implicit measures, such as those used in our study, tap into automatic and unconscious responses that individuals might not be fully aware of or may not express when asked directly. These methods offer a more direct window into the underlying feelings that might be obscured by social expectations or personal biases.

This research builds on earlier studies conducted in our lab, where we found that infants prefer to listen to the vocalizations of other infants, a factor that might be important for their development. We wanted to see if adults, especially parents, show similar patterns because their reactions may also play a role in how they interact with and nurture children. Since adults are the primary caregivers, understanding these natural inclinations could be key to supporting children’s development more effectively.

The implications of this study are not just academic; they touch on everyday experiences of families and can influence how we think about communication within families. Understanding these differences is a step towards appreciating the diverse ways people connect with and respond to the youngest members of our society.

Why is it easier to understand people we know?

Emma Holmes – emma.holmes@ucl.ac.uk
X (Twitter): @Emma_Holmes_90

University College London (UCL), Department of Speech Hearing and Phonetic Sciences, London, Greater London, WC1N 1PF, United Kingdom

Popular version of 4aPP4 – How does voice familiarity affect speech intelligibility?
Presented at the 186th ASA Meeting
Read the abstract at https://eppro02.ativ.me/web/index.php?page=IntHtml&project=ASASPRING24&id=3674814

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

It’s much easier to understand what others are saying if you’re listening to a close friend or family member, compared to a stranger. If you practice listening to the voices of people you’ve never met before, you might become better at understanding them too.

Many people struggle to understand what others are saying in noisy restaurants or cafés. This can become much more challenging as people get older. It’s often one of the first changes that people notice in their hearing. Yet, research shows that these situations are much easier if people are listening to someone they know very well.

In our research, we ask people to visit the lab with a friend or partner. We record their voices while they read sentences aloud. We then invite the volunteers back for a listening test. During the test, they hear sentences and click words on a screen to show what they heard. This is made more difficult by playing a second sentence at the same time, which the volunteers are told to ignore. This is like having a conversation when there are other people talking around you. Our volunteers listen to many sentences over the course of the experiment. Sometimes, the sentence is one recorded from their friend or partner. Other times, it’s one recorded from someone they’ve never met. Our studies have shown that people are best at understanding the sentences spoken by their friend or partner.

In one study, we manipulated the sentence recordings, to change the sound of the voices. The voices still sounded natural. Yet, volunteers could no longer recognize them as their friend or partner. We found that participants were still better at understanding the sentences, even though they didn’t recognize the voice.

In other studies, we’ve investigated how people learn to become familiar with new voices. Each volunteer learns the names of three new people. They’ve never met these people, but we play them lots of recordings of their voices. This is like when you listen to a new podcast or radio show. We’ve found that people become very good at understanding these people. In other words, we can train people to become familiar with new voices.

In new work that hasn’t yet been published, we found that voice familiarization training benefits both older and younger people. So, it may help older people who find it very difficult to listen in noisy places. Many environments contain background noise—from office parties to hospitals and train stations. Ultimately, we hope that we can familiarize people with voices they hear in their daily lives, to make it easier to listen in noisy places.

1aPP – The Role of Talker/Vowel Change in Consonant Recognition with Hearing Loss

Ali Abavisani – aliabavi@illinois.edu
Jont B. Allen – jontalle@illinois.edu
Dept. of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
405 N Mathews Ave
Urbana, IL, 61801

Popular version of paper 1aPP
Presented Monday, May 13, 2019
177th ASA Meeting, Louisville, KY

Hearing loss can have a serious impact on the social life of the people who experience it. Its effects become more complicated in environments such as restaurants, where the background noise is similar to speech. Although hearing aids of various designs are intended to address these issues, users complain about hearing aid performance in social situations, where the aids are needed most. Part of the problem lies in the nature of hearing aids, whose design and fitting process does not make use of speech. If we could incorporate speech sounds under real-life conditions into the fitting process, it may be possible to address most of the shortcomings that irritate users.

There have been many studies of the features that are important for identifying speech sounds such as isolated consonant + vowel (CV) phones (i.e., meaningless speech sounds). Most of these studies ran experiments on normal-hearing listeners to identify the effects of different speech features on correct recognition. It turned out that manipulating speech sounds, for example by replacing a vowel or by amplifying or attenuating certain parts of the sound in the time-frequency domain, leads normal-hearing listeners to identify new speech sounds. One goal of the current study is to investigate whether listeners with hearing loss respond to such manipulations in a similar way.

We designed a speech-based test that audiologists may use to determine which speech phones are susceptible to error for each individual with hearing loss. The design includes a perceptual measure that corresponds to speech understanding in background noise, where the noise is similar to speech. The perceptual measure identifies the noise level at which the speech sound is recognized by an average normal-hearing listener with at least 90% accuracy. The speech sounds in the test combine 14 consonants {p, t, k, f, s, S, b, d, g, v, z, Z, m, n} and four vowels {A, ae, I, E} to cover the different features that are present in speech. All the test sounds were pre-evaluated to make sure that normal-hearing listeners can recognize them in the noise conditions of the experiments. Two sets of sounds, named T$_1$ and T$_2$, which have the same consonant-vowel combinations but different talkers, were presented to the listeners at their most comfortable level of hearing (independent of their specific hearing loss). The two speech sets had distinct perceptual measures. When two sounds with a similar perceptual measure, and with the same consonant but a different vowel, are presented to a listener with hearing loss, the listener’s response can show how their particular hearing function may cause errors in understanding that speech sound, and why it led them to recognize a specific sound instead of the one presented. Presenting sounds from the two sets also provides a means of comparing the effect of the perceptual measure (which is based on normal-hearing listeners) on listeners with hearing loss. When the recognition score for a particular listener increases as a result of a change in the presented speech sounds, it indicates how the hearing aid fitting process should proceed for that particular (listener, speech sound) pair.
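
As a rough illustration of how a perceptual measure of this kind could be read off from normal-hearing data, the Python sketch below finds the noisiest condition (lowest signal-to-noise ratio) in which average recognition stays at or above 90%. The function name, the data layout, and the scores are hypothetical and are not taken from the study.

```python
def snr_threshold(scores_by_snr, criterion=0.90):
    """Return the lowest SNR (dB) at which accuracy is still >= criterion.

    scores_by_snr maps an SNR in dB to the average proportion correct
    for normal-hearing listeners (a hypothetical data layout).
    Returns None if even the easiest condition misses the criterion.
    """
    threshold = None
    # Walk from the easiest condition (highest SNR) toward the hardest.
    for snr in sorted(scores_by_snr, reverse=True):
        if scores_by_snr[snr] >= criterion:
            threshold = snr
        else:
            break  # accuracy has dropped below the criterion
    return threshold


# Made-up scores for two recordings of the same consonant-vowel sound.
token_t1 = {12: 1.00, 6: 0.97, 0: 0.93, -6: 0.91, -12: 0.62}
token_t2 = {12: 0.99, 6: 0.95, 0: 0.78, -6: 0.55, -12: 0.30}

print(snr_threshold(token_t1))  # -6
print(snr_threshold(token_t2))  # 6
```

In this made-up example the first recording tolerates 12 dB more noise than the second before normal-hearing accuracy falls below 90%, which is the sense in which one recording of a sound can have a "better" perceptual measure than another.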

While the study shows that improvement or degradation of the speech sounds is listener dependent, on average 85% of sounds improved when we replaced the CV with the same CV having a better perceptual measure. Additionally, among CVs with a similar perceptual measure, on average 28% improved when we replaced the vowel with {A}, 28% with {E}, 25% with {ae}, and 19% with {I}.

The confusion pattern in each case provides insight into how these changes affect phone recognition in each ear. We propose prescribing hearing aid amplification tailored to individual ears, based on the confusion pattern, the response to a change in perceptual measure, and the response to a change in vowel.

These tests are directed at fine-tuning hearing aid insertion gain, with the ultimate goals of improving speech perception and of identifying precisely when, and for which consonants, the ear with hearing loss needs treatment to enhance speech recognition.

3pSC87 – What the f***? Making sense of expletives in The Wire

Erica Gold – e.gold@hud.ac.uk
Dan McIntyre – d.mcintyre@hud.ac.uk

University of Huddersfield
Queensgate
Huddersfield, HD1 3DH
United Kingdom

Popular version of 3pSC87 – What the f***: An acoustic-pragmatic analysis of meaning in The Wire
Presented Wednesday afternoon, November 30, 2016
172nd ASA Meeting, Honolulu

The Wire - expletives

In Season one of HBO’s acclaimed crime drama The Wire, Detectives Jimmy McNulty and ‘Bunk’ Moreland are investigating old homicide cases, including the murder of a young woman shot dead in her apartment. McNulty and Bunk visit the scene of the crime to try and figure out exactly how the woman was killed. What makes the scene unusual dramatically is that, engrossed in their investigation, the two detectives communicate with each other using only the word “fuck” and its variants (e.g. motherfucker, fuckity fuck). Somehow, using only this vocabulary, McNulty and Bunk are able to communicate in a meaningful way. The scene is absorbing, engaging and even funny, and it leads to a fascinating question for linguists: how is the viewer able to understand what McNulty and Bunk mean when they communicate using such a restricted set of words?

bunk mcnulty

To investigate this, we first looked at what other linguists have discovered about the word fuck. What is clear is that it’s a hugely versatile word that can be used to express a range of attitudes and emotions. On the basis of this research, we came up with a classification scheme which we then used to categorise all the variants of fuck in the scene. Some seemed to convey disbelief and some were used as insults. Some indicated surprise or realization while others functioned to intensify the following word. And some were idiomatic set phrases (e.g. Fuckin’ A!). Our next step was to see whether there was anything in the acoustic properties of the characters’ speech that would allow us to explain why we interpreted the fucks in the way that we did.

The entire conversation between Bunk and McNulty lasts around three minutes and contains a total of 37 fuck productions (i.e. variations of fuck). Due to the variation in the fucks produced, the one clear and consistent segment for each word was the <u> in fuck. Consequently, this became the focus of our study. The <u> in fuck is the same sound you find in the word strut or duck and is represented as /ʌ/ in the International Phonetic Alphabet. When analysing vowel sounds, such as <u>, we can look at a number of aspects of its production.

In this study, we looked at the quality of the vowel by measuring the first three formants. In phonetics, the term formant refers to acoustic resonances of sound in the vocal tract. The first two formants can tell us if the production sounds more like “fuck” than “feck” or “fack,” and the third formant gives us information about the voice quality. We also looked at the duration of the <u> being produced: “fuuuuuck” versus “fuck.”

After measuring each instance, we ran statistical tests to see if there was any relationship between the way in which it was said and how we categorised its range of meanings. Our results showed that if we accounted for the differences in the vocal tract shapes of the actors playing Bunk and McNulty, the quality of the vowels is relatively consistent. That is, we get a lot of <u> sounds, rather than “eh,” “oo” or “ih.”
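
The text does not say how differences between the two actors’ vocal tracts were accounted for. One common option in phonetics is Lobanov-style normalization, which z-scores each speaker’s formant values against that speaker’s own mean and standard deviation. The Python sketch below illustrates that idea only; the function name and the formant values are made up, and this is not presented as the method used in the study.

```python
from statistics import mean, stdev


def lobanov_normalize(formants_by_speaker):
    """z-score each speaker's formant values against that speaker's own
    mean and sample standard deviation (an assumed normalization)."""
    normalized = {}
    for speaker, values in formants_by_speaker.items():
        mu = mean(values)
        sd = stdev(values)
        normalized[speaker] = [(v - mu) / sd for v in values]
    return normalized


# Made-up first-formant values in Hz for two hypothetical speakers.
f1_hz = {
    "speaker_a": [600.0, 650.0, 700.0],
    "speaker_b": [500.0, 550.0, 600.0],
}

z = lobanov_normalize(f1_hz)
print(z["speaker_a"])  # [-1.0, 0.0, 1.0]
print(z["speaker_b"])  # [-1.0, 0.0, 1.0]
```

After normalization the two hypothetical speakers’ values line up even though their raw frequencies differ by 100 Hz, which is the kind of comparison across talkers that such a step is meant to allow.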

The productions of fucks that were associated with the category of realization were found to be very similar to those associated with disbelief. However, disbelief and realization did contrast with those that were used as insults, idiomatic phrases, or functional words. Therefore, it may be more appropriate to classify the meaning into fewer categories – those that signify disbelief or realization, and those that are idiomatic, insults, or functional. It is important to remember, however, that the latter group of three meanings is represented by fewer examples in the scene. Our initial results show that these two broad groups may be distinguished through the length of the vowel – short <u> is more associated with an insult, function, or idiomatic use rather than disbelief or surprise (for which the vowel tends to be longer). In the future, we would also like to analyse the intonation of the productions. See if you can hear the difference between these samples:

Example 1: realization/surprise

Example 2: general expletive which falls under the functional/idiomatic/insult category

Our results shed new light on what for linguists is an old problem: how do we make sense of what people say when speakers so very rarely say exactly what they mean? Experts in pragmatics (the study of how meaning is affected by context) have suggested that we infer meaning when people break conversational norms. In the example from The Wire, it’s clear that the characters are breaking normal communicative conventions. But pragmatic methods of analysis don’t get us very far in explaining how we are able to infer such a range of meaning from such limited vocabulary. Our results confirm that the answer to this question is that meaning is not just conveyed at the lexical and pragmatic level, but at the phonetic level too. It’s not just what we say that’s important, it’s how we fucking say it!

*all photos are from HBO.com

1aNS4 – Musical mind control: Human speech takes on characteristics of background music

Ryan Podlubny – ryan.podlubny@pg.canterbury.ac.nz
Department of Linguistics, University of Canterbury
20 Kirkwood Avenue, Upper Riccarton
Christchurch, NZ, 8041

Popular version of paper 1aNS4, “Musical mind control: Acoustic convergence to background music in speech production.”
Presented Monday morning, November 28, 2016
172nd ASA Meeting, Honolulu

People often adjust their speech to resemble that of their conversation partners – a phenomenon known as speech convergence. Broadly defined, convergence describes automatic synchronization to some external source, much like running to the beat of music playing at the gym without intentionally choosing to do so. Through a variety of studies a general trend has emerged where we find people automatically synchronizing to various aspects of their environment [1,2,3]. With specific regard to language use, convergence effects have also been observed in many linguistic domains such as sentence-formation [4], word-formation [5], and vowel production [6] (where differences in vowel production are well associated with perceived accentedness [7,8]). This prevalence in linguistics raises many interesting questions about the extent to which speakers converge. This research uses a speech-in-noise paradigm to explore whether or not speakers also converge to non-linguistic signals in the environment: Specifically, will a speaker’s rhythm, pitch, or intensity (which is closely related to loudness) be influenced by fluctuations in background music such that the speech echoes specific characteristics of that background music (for example, if the tempo of background music slows down, will that influence those listening to unconsciously decrease their speech rate)?

In this experiment participants read passages aloud while hearing music through headphones. Background music was composed by the experimenter to be relatively stable with regard to pitch, tempo/rhythm, and intensity, so we could manipulate and test only one of these dimensions at a time, within each test-condition. We imposed these manipulations gradually and consistently toward a target, which can be seen in Figure 1, and would similarly return to the level at which they started after reaching that target. We played the participants music with no experimental changes in between all manipulated sessions. (Examples of what participants heard in headphones are available as sound files 1 and 2.)

Fig. 1: Using software designed for digital signal processing (analyzing and altering sound), manipulations were applied in a linear fashion (in a straight line) toward a target – this can be seen above as the blue line, which first rises and then falls. NOTE: After manipulations reach their target (the target is seen above as a dashed, vertical red line), the degree of manipulation would then return to the level at which it started in a similar linear fashion. Graphic captured while using Praat [9] to increase and then decrease the perceived loudness of the background music.
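
The rise-and-return shape described in the caption can be sketched as a simple triangular ramp. The Python below is only an illustration: the function name, the number of steps, and the 6 dB target are hypothetical, and the actual manipulations were made in Praat rather than with this code.

```python
def triangular_ramp(n_steps, peak_index, target):
    """Rise linearly from 0 to `target` at `peak_index`,
    then fall linearly back to 0 at the final step."""
    if not 0 < peak_index < n_steps - 1:
        raise ValueError("peak_index must lie strictly inside the ramp")
    ramp = []
    for i in range(n_steps):
        if i <= peak_index:
            # Rising portion: straight line from 0 up to the target.
            ramp.append(target * i / peak_index)
        else:
            # Falling portion: straight line from the target back to 0.
            ramp.append(target * (n_steps - 1 - i) / (n_steps - 1 - peak_index))
    return ramp


# Hypothetical example: a 6 dB change that peaks halfway through 9 steps.
print(triangular_ramp(9, 4, 6.0))
# [0.0, 1.5, 3.0, 4.5, 6.0, 4.5, 3.0, 1.5, 0.0]
```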

Data from 15 native speakers of New Zealand English were analyzed using statistical tests that allow effects to vary somewhat for each participant. We observed significant convergence in both the pitch and intensity conditions; analysis of the tempo condition, however, has not yet been conducted. Interestingly, these effects appear to differ systematically based on a person’s previous musical training. While non-musicians demonstrate the predicted effect and follow the manipulations, musicians appear to invert the effect and reliably alter aspects of their pitch and intensity in the opposite direction of the manipulation (see Figure 2). Sociolinguistic research indicates that under certain conditions speakers will emphasize characteristics of their speech to distinguish themselves socially from conversation partners or groups, as opposed to converging with them [6]. It seems plausible then that, given a relatively heightened ability to recognize low-level variations of sound, musicians may on some cognitive level be more aware of the variation in their sound environment, and as a result similarly resist the more typical effect. However, more work is required to better understand this phenomenon.

Fig. 2: The above plots measure pitch on the y-axis (up and down on the left edge), and indicate the portions of background music that have been manipulated on the x-axis (across the bottom). The blue lines show that speakers generally lower their pitch as an un-manipulated condition progresses. However, the red lines show that when global pitch is lowered during a test-condition, such lowering is relatively more dramatic for non-musicians (left plot) and that the effect is reversed by those with musical training (right plot). NOTE: A follow-up model further accounts for the relatedness of Pitch and Intensity and shows much the same effect.

This work indicates that speakers are not only influenced by human speech partners in production, but also, to some degree, by noise within the immediate speech environment, which suggests that environmental noise may constantly be influencing certain aspects of our speech production in very specific and predictable ways. Human listeners are rather talented when it comes to recognizing subtle cues in speech [10], especially compared to computers and algorithms that can’t yet match this ability. Some language scientists argue these changes in speech occur to make understanding easier for those listening [11]. That is why work like this is likely to resonate in both academia and the private sector, as a better understanding of how speech will change in different environments contributes to the development of more effective aids for the hearing impaired, as well as improvements to many devices used in global communications.

Sound-file 1.
An example of what participants heard as a control condition (no experimental manipulation) in between test-conditions. 

Sound-file 2.
An example of what participants heard as a test condition (Pitch manipulation, which drops 200 cents/one full step).
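
For readers unfamiliar with cents: 1200 cents make one octave (a doubling of frequency), so a drop of 200 cents multiplies frequency by 2 to the power of -200/1200. The short Python sketch below works through that arithmetic; the function names and the 440 Hz starting note are illustrative choices, not details of the experiment.

```python
def cents_to_ratio(cents):
    """Convert a pitch interval in cents to a frequency ratio
    (1200 cents = one octave = a factor of 2)."""
    return 2.0 ** (cents / 1200.0)


def shift_frequency(freq_hz, cents):
    """Return the frequency reached by shifting freq_hz by the given cents."""
    return freq_hz * cents_to_ratio(cents)


# A 200-cent drop lowers frequency by roughly 11 percent.
print(round(cents_to_ratio(-200), 4))          # 0.8909
# Illustrative starting note: 440 Hz lowered by 200 cents.
print(round(shift_frequency(440.0, -200), 1))  # 392.0
```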

References

1. Hill, A. R., Adams, J. M., Parker, B. E., & Rochester, D. F. (1988). Short-term entrainment of ventilation to the walking cycle in humans. Journal of Applied Physiology, 65(2), 570-578.
2. Will, U., & Berg, E. (2007). Brain wave synchronization and entrainment to periodic acoustic stimuli. Neuroscience Letters, 424(1), 55-60.
3. McClintock, M. K. (1971). Menstrual synchrony and suppression. Nature, 229, 244-245.
4. Branigan, H. P., Pickering, M. J., McLean, J. F., & Cleland, A. A. (2007). Syntactic alignment and participant role in dialogue. Cognition, 104(2), 163-197.
5. Beckner, C., Rácz, P., Hay, J., Brandstetter, J., & Bartneck, C. (2015). Participants Conform to Humans but Not to Humanoid Robots in an English Past Tense Formation Task. Journal of Language and Social Psychology, 0261927X15584682. Retrieved from: http://jls.sagepub.com.ezproxy.canterbury.ac.nz/content/early/2015/05/06/0261927X15584682.
6. Babel, M. (2012). Evidence for phonetic and social selectivity in spontaneous phonetic imitation. Journal of Phonetics, 40(1), 177-189.
7. Major, R. C. (1987). English voiceless stop production by speakers of Brazilian Portuguese. Journal of Phonetics, 15, 197-202.
8. Rekart, D. M. (1985). Evaluation of foreign accent using synthetic speech. Ph.D. dissertation, Louisiana State University.
9. Boersma, P., & Weenink, D. (2014). Praat: Doing phonetics by computer (Version 5.4.04) [Computer program]. Retrieved from www.praat.org.
10. Hay, J., Podlubny, R., Drager, K., & McAuliffe, M. (under review). Car-talk: Location-specific speech production and perception.
11. Lane, H., & Tranel, B. (1971). The Lombard sign and the role of hearing in speech. Journal of Speech, Language, and Hearing Research, 14(4), 677-709.