Listen to the Music: We Rely on Musical Genre to Determine Singers’ Accents

Maddy Walter – maddyw37@student.ubc.ca

The University of British Columbia, Department of Linguistics, Vancouver, British Columbia, V6T 1Z4, Canada

Additional authors:
Sydney Norris, Sabrina Luk, Marcell Maitinsky, Md Jahurul Islam, and Bryan Gick

Popular version of 3pPP6 – The Role of Genre Association in Sung Dialect Categorization
Presented at the 187th ASA Meeting
Read the abstract at https://eppro01.ativ.me//web/index.php?page=Session&project=ASAFALL24&id=3771321

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–


Have you ever listened to a song and later been surprised to hear the artist speak with a different accent than the one you heard in the song? Take country singer Keith Urban’s song “What About Me” for instance; when listening, you might assume that he has a Southern American (US) English accent. However, in his interviews, he speaks with an Australian English accent. So why did you think he sounded Southern?

Research suggests that specific accents or dialects are associated with musical genres [2], that singers adjust their accents based on genre [4], and that foreign accents are more difficult to recognize in songs than in speech [5]. However, when listeners perceive an accent in a song, it is unclear which type of information they rely on: the acoustic speech signal or information about the musical genre. Our previous research investigated this question for Country and Reggae music and found that genre recognition may play a larger role in dialect perception than the actual sound of the voice [9].

Our current study explores American Blues and Folk music, genres that allow for easier separation of vocals from instrumentals, with more refined stimulus manipulation. Blues is strongly associated with African American English [3], while Folk can be associated with a variety of dialects (British, American, etc.) [1]. Participants listened to manipulated clips of sung and “spoken” lines taken from songs in both genres, which were transcribed for participants (see Figure 1). AI applications were used to remove the instrumentals from both sung and spoken clips, while “spoken” clips also underwent rhythm and pitch normalization so that they sounded spoken rather than sung. After hearing each sung or spoken line, participants were asked to identify the dialect they heard from six options [7, 8] (see Figure 2).

Figure 1: Participant view of a transcript from a Folk song clip.
Figure 2: Participant view of six dialect options after hearing a clip.

Participants were much more confident and accurate in categorizing accents for clips in the Sung condition, regardless of genre. The proportion of uncertainty (“Not Sure” responses) in the Spoken condition was consistent across genres (see “D” in Figure 3), suggesting that participants were more certain of dialect when musical cues were present. Dialect categories followed genre expectations, as can be seen from the increase in identifying African American English for Blues in the Sung condition (see “A”). Removing uncertainty by adding genre cues did not increase the likelihood of “Irish English” or “British English” being chosen for Blues, though it did for Folk (see “B” and “C” in Figure 3), in line with genre-based expectations.

Figure 3: Participant dialect responses.

These findings enhance our understanding of the relationship between musical genre and accent. Referring again to the example of Keith Urban, the singer’s stylistic accent change may not be the only culprit for our interpretation of a Southern drawl. Rather, we may have assumed we were listening to a musician with a Southern American English accent when we heard the first banjo-like twang or tuned in to iHeartCountry Radio. When we listen to a song and perceive a singer’s accent, we are not only listening to the sounds of their speech, but are also shaping our perception from our expectations of dialect based on the musical genre.

References:

  1. Carrigan, H. L. (2004). Review of Lornell, K., The NPR Curious Listener’s Guide to American Folk Music. Library Journal, 129(19), 63.
  2. Coupland, N. (2011). Voice, place and genre in popular song performance. Journal of Sociolinguistics, 15(5), 573–602. https://doi.org/10.1111/j.1467-9841.2011.00514.x.
  3. De Timmerman, R., et al. (2024). The globalization of local indexicalities through music: African‐American English and the blues. Journal of Sociolinguistics, 28(1), 3–25. https://doi.org/10.1111/josl.12616.
  4. Gibson, A. M. (2019). Sociophonetics of popular music: insights from corpus analysis and speech perception experiments [Doctoral dissertation, University of Canterbury]. http://dx.doi.org/10.26021/4007.
  5. Mageau, M., Mekik, C., Sokalski, A., & Toivonen, I. (2019). Detecting foreign accents in song. Phonetica, 76(6), 429–447. https://doi.org/10.1159/000500187.
  6. RStudio. (2020). RStudio: Integrated Development for R. RStudio, PBC, Boston, MA. http://www.rstudio.com/.
  7. Stoet, G. (2010). PsyToolkit – A software package for programming psychological experiments using Linux. Behavior Research Methods, 42(4), 1096-1104.
  8. Stoet, G. (2017). PsyToolkit: A novel web-based method for running online questionnaires and reaction-time experiments. Teaching of Psychology, 44(1), 24-31.
  9. Walter, M., Bengtson, G., Maitinsky, M., Islam, M. J., & Gick, B. (2023). Dialect perception in song versus speech. The Journal of the Acoustical Society of America, 154(4_supplement), A161. https://doi.org/10.1121/10.0023131.

The science of baby speech sounds: men and women may experience them differently

M. Fernanda Alonso Arteche – maria.alonsoarteche@mail.mcgill.ca
Instagram: @laneurotransmisora

School of Communication Science and Disorders, McGill University, Center for Research on Brain, Language, and Music (CRBLM), Montreal, QC, H3A 0G4, Canada

Instagram: @babylabmcgill

Popular version of 2pSCa – Implicit and explicit responses to infant sounds: a cross-sectional study among parents and non-parents
Presented at the 186th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0027179

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

Imagine hearing a baby coo and instantly feeling a surge of positivity. Surprisingly, how we react to the simple sounds of a baby speaking might depend on whether we are women or men, and whether we are parents. Our lab’s research delves into this phenomenon, revealing intriguing differences in how adults perceive baby vocalizations, with a particular focus on mothers, fathers, and non-parents.

Using a method that measures reaction time to sounds, we compared adults’ responses to vowel sounds produced by a baby and by an adult, as well as meows produced by a cat and by a kitten. We found that women, including mothers, tend to respond positively only to baby speech sounds. On the other hand, men, especially fathers, showed a more neutral reaction to all sounds. This suggests that the way we process human speech sounds, particularly those of infants, may vary significantly between genders. While previous studies report that both men and women generally show a positive response to baby faces, our findings indicate that their speech sounds might affect us differently.

Moreover, mothers rated babies and their sounds highly, expressing a strong liking for babies, their cuteness, and the cuteness of their sounds. Fathers, although less responsive in the reaction task, still gave high ratings for their liking of babies, babies’ cuteness, and the appeal of their sounds. This contrast between implicit (subconscious) reactions and explicit (conscious) opinions highlights an interesting complexity in parental instincts and perceptions. Implicit measures, such as those used in our study, tap into automatic and unconscious responses that individuals might not be fully aware of or may not express when asked directly. These methods offer a more direct window into underlying feelings that might otherwise be obscured by social expectations or personal biases.

This research builds on earlier studies conducted in our lab, where we found that infants prefer to listen to the vocalizations of other infants, a factor that might be important for their development. We wanted to see if adults, especially parents, show similar patterns because their reactions may also play a role in how they interact with and nurture children. Since adults are the primary caregivers, understanding these natural inclinations could be key to supporting children’s development more effectively.

The implications of this study are not just academic; they touch on everyday experiences of families and can influence how we think about communication within families. Understanding these differences is a step towards appreciating the diverse ways people connect with and respond to the youngest members of our society.

Why is it easier to understand people we know?

Emma Holmes – emma.holmes@ucl.ac.uk
X (Twitter): @Emma_Holmes_90

University College London (UCL), Department of Speech Hearing and Phonetic Sciences, London, Greater London, WC1N 1PF, United Kingdom

Popular version of 4aPP4 – How does voice familiarity affect speech intelligibility?
Presented at the 186th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0027437

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

It’s much easier to understand what others are saying if you’re listening to a close friend or family member than to a stranger. If you practice listening to the voices of people you’ve never met, you might become better at understanding them too.

Many people struggle to understand what others are saying in noisy restaurants or cafés. This can become much more challenging as people get older. It’s often one of the first changes that people notice in their hearing. Yet, research shows that these situations are much easier if people are listening to someone they know very well.

In our research, we ask people to visit the lab with a friend or partner. We record their voices while they read sentences aloud. We then invite the volunteers back for a listening test. During the test, they hear sentences and click words on a screen to show what they heard. This is made more difficult by playing a second sentence at the same time, which the volunteers are told to ignore. This is like having a conversation when there are other people talking around you. Our volunteers listen to many sentences over the course of the experiment. Sometimes, the sentence is one recorded from their friend or partner. Other times, it’s one recorded from someone they’ve never met. Our studies have shown that people are best at understanding the sentences spoken by their friend or partner.

In one study, we manipulated the sentence recordings, to change the sound of the voices. The voices still sounded natural. Yet, volunteers could no longer recognize them as their friend or partner. We found that participants were still better at understanding the sentences, even though they didn’t recognize the voice.

In other studies, we’ve investigated how people learn to become familiar with new voices. Each volunteer learns the names of three new people. They’ve never met these people, but we play them lots of recordings of their voices. This is like when you listen to a new podcast or radio show. We’ve found that people become very good at understanding these people. In other words, we can train people to become familiar with new voices.

In new work that hasn’t yet been published, we found that voice familiarization training benefits both older and younger people. So, it may help older people who find it very difficult to listen in noisy places. Many environments contain background noise—from office parties to hospitals and train stations. Ultimately, we hope that we can familiarize people with voices they hear in their daily lives, to make it easier to listen in noisy places.

2pSC14 – Improving the Accuracy of Automatic Detection of Emotions From Speech

Reza Asadi and Harriet Fell

Popular version of poster 2pSC14 “Improving the accuracy of speech emotion recognition using acoustic landmarks and Teager energy operator features.”
Presented Tuesday afternoon, May 19, 2015, 1:00 pm – 5:00 pm, Ballroom 2
169th ASA Meeting, Pittsburgh

“You know, I can feel the fear that you carry around and I wish there was… something I could do to help you let go of it because if you could, I don’t think you’d feel so alone anymore.”
— Samantha, a computer operating system in the movie “Her”

Introduction
Computers that can recognize human emotions could react appropriately to a user’s needs and provide more human-like interactions. Emotion recognition could also be used as a diagnostic tool for medical purposes, in onboard driving systems that keep the driver alert if stress is detected, in similar systems in aircraft cockpits, and in electronic tutoring and interaction with virtual agents or robots. But is it really possible for computers to detect the emotions of their users?

Over the past fifteen years, computer and speech scientists have worked on the automatic detection of emotion in speech. To interpret emotions from speech, the machine gathers acoustic information in the form of sound signals, extracts relevant features from the signals, and finds patterns that relate that acoustic information to the emotional state of the speaker. In this study, new combinations of acoustic feature sets were used to improve the performance of emotion recognition from speech. A comparison of feature sets for detecting different emotions is also provided.

Methodology
Three sets of acoustic features were selected for this study: Mel-Frequency Cepstral Coefficients, Teager Energy Operator features and Landmark features.

Mel-Frequency Cepstral Coefficients:
To produce vocal sounds, the vocal cords vibrate and produce periodic pulses, which result in a glottal wave. The vocal tract, starting at the vocal cords and ending at the mouth and nose, acts as a filter on the glottal wave. The cepstrum is a signal analysis tool that is useful for separating the source from the filter in acoustic waves. Since the vocal tract acts as a filter on the glottal wave, we can use the cepstrum to extract information related only to the vocal tract.

The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. Using mel frequencies in cepstral analysis approximates the human auditory system’s response more closely than using linearly spaced frequency bands. If we map the power spectrum of the original speech signal onto the mel scale and then perform cepstral analysis, we get Mel-Frequency Cepstral Coefficients (MFCCs). Previous studies have used MFCCs for speaker and speech recognition, and they have also been used to detect emotions.
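As a rough illustration of that pipeline (windowed frame → power spectrum → triangular mel filterbank → log → DCT), here is a minimal single-frame MFCC sketch in Python. The frame length, filter count, and number of coefficients are common textbook defaults, not values taken from this study:

```python
import numpy as np
from scipy.fft import dct, rfft

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    # Power spectrum of one Hamming-windowed frame
    power = np.abs(rfft(frame * np.hamming(len(frame)))) ** 2
    # Triangular filter centers equally spaced on the mel scale, 0 Hz to Nyquist
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((len(frame) + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    # Log filterbank energies, then DCT to decorrelate into cepstral coefficients
    log_energies = np.log(fbank @ power + 1e-10)
    return dct(log_energies, norm='ortho')[:n_coeffs]

# Illustrative input: a 32 ms frame of a 440 Hz tone sampled at 16 kHz
sr = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(512) / sr)
coeffs = mfcc(frame, sr)
```

In practice a full utterance is split into overlapping frames and this computation is repeated per frame, yielding a sequence of coefficient vectors.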

Teager Energy Operator features:
Another approach to modeling speech production is to focus on the pattern of airflow in the vocal tract. When a person speaks in an emotional state such as panic or anger, physiological changes like muscle tension alter the airflow pattern, and this can be used to detect stress in speech. Because the airflow is difficult to model mathematically, Teager proposed the Teager Energy Operator (TEO), which computes the energy of the vortex-flow interaction at each instant in time. Previous studies show that TEO-related features contain information that can be used to determine stress in speech.
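Although the airflow model behind it is complex, the operator itself is simple to compute from three consecutive samples: Ψ[x(n)] = x(n)² − x(n−1)·x(n+1). A minimal sketch (the test tone is illustrative, not a speech sample from the study):

```python
import numpy as np

def teager_energy(x):
    """Teager Energy Operator: psi[x(n)] = x(n)^2 - x(n-1) * x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone x(n) = sin(omega * n), the operator returns the constant
# sin(omega)^2, so it tracks both the amplitude and frequency of the signal.
tone = np.sin(0.2 * np.arange(200))
teo = teager_energy(tone)
```

For speech, statistics of this energy profile (means, variances, correlations across frequency bands) are what typically serve as classifier features.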

Acoustic landmarks:
Acoustic landmarks are locations in the speech signal where important and easily perceptible speech properties are rapidly changing. Previous studies show that the number of landmarks in each syllable might reflect underlying cognitive, mental, emotional, and developmental states of the speaker.

Figure 1 – Spectrogram (top) and acoustic landmarks (bottom) detected in a neutral speech sample

Sound File 1 – A speech sample with neutral emotion


Figure 2 – Spectrogram (top) and acoustic landmarks (bottom) detected in an anger speech sample

Sound File 2 – A speech sample with anger emotion

 

Classification:
The data used in this study came from the Linguistic Data Consortium’s Emotional Prosody Speech and Transcripts corpus. In this database, four actresses and three actors, all in their mid-20s, read a series of semantically neutral utterances (four-syllable dates and numbers) in fourteen emotional states. A description of each emotional state was given to the participants so they could articulate it in the proper emotional context. The acoustic features described above were extracted from the speech samples in this database and used to train and test Support Vector Machine classifiers, with the goal of detecting emotions from speech. The target emotions were anger, fear, disgust, sadness, joy, and neutral.
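The training-and-evaluation step can be sketched with scikit-learn’s support vector classifier. The feature matrix below is synthetic stand-in data (random vectors with one cluster per emotion), not the LDC recordings, and the dimensions are illustrative:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical stand-in: 60 clips per emotion, each described by a
# 20-dimensional acoustic feature vector (e.g., MFCC + TEO + landmark counts).
emotions = ["anger", "fear", "disgust", "sadness", "joy", "neutral"]
X = np.vstack([rng.normal(loc=i, scale=1.0, size=(60, 20))
               for i in range(len(emotions))])
y = np.repeat(emotions, 60)

# RBF-kernel SVM evaluated with 5-fold cross-validation
clf = SVC(kernel="rbf", gamma="scale")
scores = cross_val_score(clf, X, y, cv=5)
mean_acc = scores.mean()
```

Cross-validation like this, rather than a single train/test split, is the standard way to estimate per-emotion detection accuracy on a small corpus.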

Results
The results of this study show an average detection accuracy of approximately 91% among these six emotions. This is 9% better than a previous study conducted at CMU on the same data set.

Specifically, TEO features improved the detection of anger and fear, while landmark features improved the results for sadness and joy. The classifier had the highest accuracy, 92%, in detecting anger and the lowest, 87%, in detecting joy.