2aMU5 – Do people find vocal fry in popular music expressive?

Mackenzie Parrott – mackenzie.lanae@gmail.com
John Nix – john.nix@utsa.edu

Popular version of paper 2aMU5, “Listener Ratings of Singer Expressivity in Musical Performance.”
Presented Tuesday, May 24, 2016, 10:20-10:35 am, Salon B/C, ASA meeting, Salt Lake City

Vocal fry is the lowest register of the human voice.  Its distinct sound is characterized by a low rumble interspersed with uneven popping and crackling.  The use of fry as a vocal mannerism is becoming increasingly common in American speech, fueling discussion about the implications of its use and how listeners perceive the speaker [1].  Previous studies have suggested that listeners find vocal fry to be generally unpleasant in women’s speech, but associate it with positive characteristics in men’s speech [2].

As it has become more prevalent, fry has perhaps not surprisingly found its place in many commercial song styles as well.  Many singers are implementing fry as a stylistic device at the onset or offset of a sung tone.  This can be found very readily in popular musical styles, presumably to impact and amplify the emotion that the performer is attempting to convey.

Researchers at the University of Texas at San Antonio conducted a survey to analyze whether listeners’ ratings of a singer’s expressivity in musical samples in two contemporary commercial styles (pop and country) were affected by the presence of vocal fry, and to see if there was a difference in listener ratings according to the singer’s gender.  A male and a female singer recorded musical samples for the study in a noise reduction booth.  As can be seen in the table below, the singers were asked to sing most of the musical selections twice, once using vocal fry at phrase onsets, and once without fry, while maintaining the same vocal quality, tempo, dynamics, and stylization.  Some samples were presented more than one time in the survey portion of the study to test listener reliability.

Song                          Singer Gender   Vocal Mode
(Hit Me) Baby One More Time   Female          Fry Only
If I Die Young                Female          With and Without Fry
National Anthem               Female          With and Without Fry
Thinking Out Loud             Male            Without Fry Only
Amarillo By Morning           Male            With and Without Fry
National Anthem               Male            With and Without Fry

Across all listener ratings of all the songs, the recordings that included vocal fry were rated as only slightly more expressive than the recordings that contained no vocal fry.  The pattern differed, however, between the two singers.  Listeners rated the samples in which the female singer used vocal fry higher (i.e., more expressive) than those without fry, which was surprising considering the negative association with women using vocal fry in speech.  Conversely, listeners rated the male singer’s samples without fry as more expressive than those with fry.  Part of this preference pattern may also reflect the individual singers: the male singer was much more experienced with pop styles than the female singer, who is primarily classically trained.  The overall expressivity ratings for the male singer were higher than those of the female singer by a statistically significant margin.

Listener ratings also differed between age groups.  Among younger listeners the gap widened in both directions: they showed a stronger preference for the female singer’s performances with fry over those without, and for the male singer’s performances without fry over those with.  Presumably younger listeners are more attuned to the stylistic norms of current pop singers.  However, this could also imply a gender bias in younger listeners.  The older listener groups gave the performers lower mean expressivity ratings than the younger listener groups did.  Since most of the songs that we sampled are fairly recent in production, this may indicate a generational trend in preference.  Perhaps listeners rate the style of vocal production that is most similar to what they listened to during their young adult years as the most expressive style of singing.  These findings have raised many questions for further studies about vocal fry in pop and country music.

 

1. Anderson, R.C., Klofstad, C.A., Mayew, W.J., Venkatachalam, M. “Vocal Fry May Undermine the Success of Young Women in the Labor Market.” PLoS ONE, 2014. 9(5): e97506. doi:10.1371/journal.pone.0097506.
2. Yuasa, I.P. “Creaky Voice: A New Feminine Voice Quality for Young Urban-Oriented Upwardly Mobile American Women.” American Speech, 2010. 85(3): 315-337.

4aSC2 – Effects of language and music experience on speech perception

T. Christina Zhao — zhaotc@uw.edu
Patricia K. Kuhl — pkkuhl@uw.edu
Institute for Learning & Brain Sciences
University of Washington, BOX 357988
Seattle, WA, 98195

Popular version of paper 4aSC2, “Top-down linguistic categories dominate over bottom-up acoustics in lexical tone processing”
Presented Thursday morning, May 21st, 2015, 8:00 AM, Ballroom 2
169th ASA Meeting, Pittsburgh

Speech perception involves constant interplay between top-down and bottom-up processing. For example, to distinguish phonemes (e.g. ‘b’ from ‘p’), the listener must accurately process the acoustic information in the speech signal (i.e. bottom-up strategy) and assign these sounds efficiently to a category (i.e. top-down strategy). Listeners’ performance in speech perception tasks is influenced by their experience in either processing strategy. Here, we use lexical tone processing as a window to examine how extensive experience in both strategies influences speech perception.

Lexical tones are contrastive pitch contour patterns at the word level. That is, a small difference in the pitch contour can result in a different word meaning. Native speakers of a tonal language thus have extensive experience in using the top-down strategy to assign highly variable pitch contours to lexical tone categories. This top-down influence is reflected by reduced sensitivity to acoustic differences within a phonemic category compared to across categories (Halle, Chang, & Best, 2004). On the other hand, individuals with extensive music training early in life exhibit enhanced sensitivity to pitch differences not only in music, but also in speech, reflecting stronger bottom-up influence. Such bottom-up influence is reflected by enhanced sensitivity in detecting differences between lexical tones when the listeners are non-tonal language speakers (Wong, Skoe, Russo, Dees, & Kraus, 2007).

How does extensive experience in both strategies influence lexical tone processing? To address this question, native Mandarin speakers with extensive music training (N=17) completed a music pitch discrimination task and a lexical tone discrimination task. We compared their performance with that of individuals with extensive experience in only one of the processing strategies (i.e. Mandarin nonmusicians (N=20) and English musicians (N=20), data from Zhao & Kuhl (2015)).

Despite the enhanced performance in the music pitch discrimination task in Mandarin musicians, their performance in the lexical tone discrimination task is similar to the performance of the Mandarin nonmusicians, and different from the English musicians’ performance (Fig. 1, ‘Sensitivity across lexical tone continuum by group’).
[Figure 1: Sensitivity across lexical tone continuum by group]
That is, they exhibited reduced sensitivity within phonemic categories (i.e. on either end of the line) compared to across categories (i.e. the middle of the line), and their overall performance is lower than that of the English musicians. This result strongly suggests a dominant effect of the top-down influence in processing lexical tone. Yet further analyses revealed that Mandarin musicians and Mandarin nonmusicians may still rely on different underlying mechanisms when performing the lexical tone discrimination task. In the Mandarin musicians, music pitch discrimination scores are correlated with lexical tone discrimination scores, suggesting a contribution of the bottom-up strategy to their lexical tone discrimination performance (Fig. 2, ‘Music pitch and lexical tone discrimination’, purple). This relation is similar to that of the English musicians (Fig. 2, peach) but very different from that of the Mandarin nonmusicians (Fig. 2, yellow). Specifically, for Mandarin nonmusicians, the music pitch discrimination scores do not correlate with the lexical tone discrimination scores, suggesting independent processes.

[Figure 2: Music pitch and lexical tone discrimination]

Halle, P. A., Chang, Y. C., & Best, C. T. (2004). Identification and discrimination of Mandarin Chinese tones by Mandarin Chinese vs. French listeners. Journal of Phonetics, 32(3), 395-421. doi: 10.1016/s0095-4470(03)00016-0
Wong, P. C. M., Skoe, E., Russo, N. M., Dees, T., & Kraus, N. (2007). Musical experience shapes human brainstem encoding of linguistic pitch patterns. Nature Neuroscience, 10(4), 420-422. doi: 10.1038/nn1872
Zhao, T. C., & Kuhl, P. K. (2015). Effect of musical experience on learning lexical tone categories. The Journal of the Acoustical Society of America, 137(3), 1452-1463. doi: 10.1121/1.4913457

3aSPb5 – Improving Headphone Spatialization: Fixing a problem you’ve learned to accept

Muhammad Haris Usmani – usmani@cmu.edu
Ramón Cepeda Jr. – rcepeda@andrew.cmu.edu
Thomas M. Sullivan – tms@ece.cmu.edu
Bhiksha Raj – bhiksha@cs.cmu.edu
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213

Popular version of paper 3aSPb5, “Improving headphone spatialization for stereo music”
Presented Wednesday morning, May 20, 2015, 10:15 AM, Brigade room
169th ASA Meeting, Pittsburgh

The days of grabbing a drink, brushing dust from your favorite record and playing it in the listening room of the house are long gone. Today, with the portability technology has enabled, almost everybody listens to music on headphones. However, most commercially produced stereo music is mixed and mastered for playback on loudspeakers, which presents a problem for the growing number of headphone listeners. When a legacy stereo mix is played on headphones, all instruments or voices in that piece get placed in between the listener’s ears, inside of their head. This is not only unnatural and fatiguing for the listener, but also detrimental to the original placement of the instruments in that musical piece. It disturbs the spatialization of the music and makes the sound image appear as three isolated lobes inside of the listener’s head [1], see Figure 1.

[Figure 1]

Hard-panned instruments separate into the left and right lobes, while instruments placed at center stage are heard in the center of the head. However, because hearing is a dynamic process that adapts to and settles on the perceived sound, we have come to accept that headphones sound this way [2].

In order to improve the spatialization of headphones, the listener’s ears must be deceived into thinking that they are listening to the music inside of a listening room. When playing music in a room, the sound travels through the air, reverberates inside the room, and interacts with the listener’s head and torso before reaching the ears [3]. These interactions add the necessary psychoacoustic cues for perception of an externalized stereo soundstage presented in front of the listener. If this listening room is a typical music studio, the soundstage perceived is close to what the artist intended. Our work tries to place the headphone listener into the sound engineer’s seat inside a music studio to improve the spatialization of music. For the sake of compatibility across different headphones, we try to make minimal changes to the mastering equalization curve of the music.

Since there is a compromise between sound quality and the spatialization that can be presented, we developed three different systems that strike this compromise at different levels. We label these as Type-I, Type-II, and Type-0. Type-I focuses on improving spatialization at the cost of losing some sound quality, Type-II improves spatialization while ensuring that the sound quality is not degraded too much, and Type-0 focuses on refining conventional listening by making the sound image more homogeneous. Since sound quality is key in music, we will skip over Type-I and focus on the other two systems.

Type-II consists of a head-related transfer function (HRTF) model [4], room reverberation (synthesized reverb [5]), and a spectral correction block. HRTFs embody all the complex spatialization cues that exist due to the relative positions of the listener and the source [6]. In our case, a general HRTF model is used which is configured to place the listener at the “sweet spot” in the studio (right and left speakers placed at an angle of 30° from the listener’s head). The spectral correction attempts to keep the original mastering equalization curve as intact as possible.
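One of the cues an HRTF encodes is the interaural time difference (ITD): sound from a speaker off to one side reaches the nearer ear slightly earlier. As a rough illustration only, the sketch below uses Woodworth's spherical-head approximation, ITD = (a/c)(θ + sin θ), to estimate that delay for a speaker at 30°. This is not the HRTF model used in the work described here; the function name, the head radius (0.0875 m) and the speed of sound (343 m/s) are assumptions chosen for the example.

```python
import math


def woodworth_itd(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """Estimate the interaural time difference (seconds) for a distant source.

    Uses Woodworth's spherical-head approximation, intended for azimuths
    between 0 and 90 degrees. Head radius and speed of sound are assumed
    typical values, not parameters taken from the article.
    """
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound) * (theta + math.sin(theta))


if __name__ == "__main__":
    itd = woodworth_itd(30.0)  # speaker 30 degrees off center
    print(f"ITD at 30 degrees: {itd * 1e6:.0f} microseconds")
    print(f"Equivalent delay at 44.1 kHz: {itd * 44100:.1f} samples")
```

With these assumed values the delay comes out at roughly a quarter of a millisecond, which gives a sense of the time scale involved in placing a virtual speaker.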

Type-0 is made up of a side-content crossfeed block and a spectral correction block. Some headphone amps allow crossfeed between the left and right channels to model the fact that, when listening to music through loudspeakers, each ear hears the music from both speakers, with a delay attached to the sound originating from the speaker that is farther away. A shortcoming of conventional crossfeed is that the delay we can apply is limited (to avoid comb filtering) [7]. Side-content crossfeed resolves this by crossfeeding only the content that is unique to each channel, allowing us to use larger delays. In this system, the side-content is extracted by using a stereo-to-3 upmixer, which is implemented as a novel extension to Nikunen et al.’s upmixer [8].
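To make the idea of conventional crossfeed concrete, here is a minimal sketch that mixes a delayed, attenuated copy of each channel into the opposite one. It illustrates only the basic technique, not the side-content crossfeed or the upmixer described above; the function name and the default delay (0.3 ms) and gain (-6 dB) are assumptions chosen for the example.

```python
import numpy as np


def crossfeed(left, right, sample_rate, delay_ms=0.3, gain_db=-6.0):
    """Conventional crossfeed: add a delayed, quieter copy of the opposite channel.

    delay_ms and gain_db are illustrative defaults, not values from the article.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    gain = 10.0 ** (gain_db / 20.0)  # convert decibels to a linear factor

    def delayed(x):
        # Shift the signal later in time by padding zeros at the start,
        # keeping the original length. The delay is capped at the signal length.
        d = min(int(round(delay_ms * 1e-3 * sample_rate)), len(x))
        return np.concatenate([np.zeros(d), x[: len(x) - d]])

    out_left = left + gain * delayed(right)
    out_right = right + gain * delayed(left)
    return out_left, out_right


if __name__ == "__main__":
    # A click that is hard-panned to the left channel.
    sr = 44100
    left = np.zeros(64)
    left[0] = 1.0
    right = np.zeros(64)
    out_left, out_right = crossfeed(left, right, sr)
    print("first nonzero sample in right output:", int(np.flatnonzero(out_right)[0]))
```

Because the crossfed copy is summed with the direct signal, any content common to both channels is added to a delayed version of itself, which is where the comb filtering mentioned above comes from and why longer delays are a problem for this simple scheme.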

These systems were put to the test by conducting a subjective evaluation with 28 participants, all between 18 and 29 years of age. The participants were introduced to the metrics being measured at the beginning of the evaluation. Since the first part of the evaluation included specific spatial metrics that are somewhat complicated for untrained listeners to grasp, we used a collection of descriptions, diagrams, and/or music excerpts representing each metric to provide in-evaluation training for the listeners. The results of the first part of the evaluation suggest that this method worked well.

From the results, we were able to conclude that Type-II externalized the sounds while performing at a level analogous to the original source in the other metrics, and that Type-0 improved sound quality and comfort by compromising stereo width when compared to the original source, which is what we expected. We also observed strong content-dependence in the results, suggesting that a different spatialization setting must be used for music that has been produced differently. Overall, two of the three proposed systems in this work were preferred at least as often as the legacy stereo mix.


References

[1] G-Sonique, “Monitor MSX5 – Headphone monitoring system,” G-Sonique, 2011. [Online]. Available: http://www.g-sonique.com/msx5headphonemonitoring.html.
[2] S. Mushendwa, “Enhancing Headphone Music Sound Quality,” Aalborg University – Institute of Media Technology and Engineering Science, 2009.
[3] C. J. C. H. K. K. Y. J. L. Yong Guk Kim, “An Integrated Approach of 3D Sound Rendering,” Springer-Verlag Berlin Heidelberg, vol. II, no. PCM 2010, p. 682–693, 2010.
[4] D. Rocchesso, “3D with Headphones,” in DAFX: Digital Audio Effects, Chichester, John Wiley & Sons, 2002, pp. 154-157.
[5] P. E. Roos, “Samplicity’s Bricasti M7 Impulse Response Library v1.1,” Samplicity, [Online]. Available: http://www.samplicity.com/bricasti-m7-impulse-responses/.
[6] R. O. Duda, “3-D Audio for HCI,” Department of Electrical Engineering, San Jose State University, 2000. [Online]. Available: http://interface.cipic.ucdavis.edu/sound/tutorial/. [Accessed 15 4 2015].
[7] J. Meier, “A DIY Headphone Amplifier With Natural Crossfeed,” 2000. [Online]. Available: http://headwize.com/?page_id=654.
[8] J. Nikunen, T. Virtanen and M. Vilermo, “Multichannel Audio Upmixing by Time-Frequency Filtering Using Non-Negative Tensor Factorization,” Journal of the AES, vol. 60, no. 10, pp. 794-806, October 2012.

5aMU1 – Understanding timbral effects of multi-resonator/generator systems of wind instruments in the context of western and non-western music

Popular version of poster 5aMU1
Presented Friday morning, May 22, 2015, 8:35 AM – 8:55 AM, Kings 4
169th ASA Meeting, Pittsburgh

In this paper the relationship between musical instruments and the rooms they are performed in was investigated. A musical instrument is typically characterized as a system that consists of a tone generator combined with a resonator. A saxophone, for example, has a reed as a tone generator and a conically shaped resonator whose effective length can be changed with keys to produce different musical notes. Often neglected is the fact that, for all wind instruments, a second resonator is coupled to the tone generator: the vocal cavity. We use our vocal cavity every day when we speak to form characteristic formants, local enhancements in frequency that shape vowels. This is achieved by varying the diameter of the vocal tract at specific positions along its axis. In contrast to the resonator of a wind instrument, the vocal tract is fixed in length by the distance between the vocal cords and the lips. Consequently, the vocal tract cannot be used to change the fundamental frequency over a larger melodic range. For our voice, the change in frequency is controlled via the tension of the vocal cords. The instrument’s resonator, however, is not an adequate device to control the timbre (harmonic spectrum) of an instrument because it can only be varied in length but not in width. Therefore, the player’s adjustment of the vocal tract is necessary to control the timbre of the instrument. While some instruments possess additional mechanisms to control timbre, e.g., the embouchure, which controls the tone generator directly using the lip muscles, others, like the recorder, rely on changes in the wind supply provided by the lungs and changes of the vocal tract.

The role of the vocal tract has not been addressed systematically in the literature and learning guides for two obvious reasons. Firstly, there is no known systematic approach to quantifying the internal body movements that shape the vocal tract. Each performer has to figure out the best vocal tract configurations in an intuitive manner. For the resonator system, by contrast, the changes are described through the musical notes, and in cases where multiple ways exist to produce the same note, additional signs exist to demonstrate how to finger this note (e.g., by providing a specific key combination). Secondly, in Western classical music culture the vocal tract adjustments predominantly have a correctional function to balance out the harmonic spectrum to make the instrument sound as even as possible across the register.

[Figure: PVC-didgeridoo adapter for soprano saxophone]

In non-Western cultures, the role of the oral cavity can be much more important to convey musical meaning. The didgeridoo, for example, has a fixed resonator with no keyholes and consequently it can only produce a single pitched drone. The musical parameter space is then defined by modulating the overtone spectrum above the tone by changing the vocal tract dimensions and creating vocal sounds on top of the buzzing lips on the didgeridoo edge. Mouthpieces of Western brass instruments have a cup behind the rim with a very narrow opening to the resonator, the throat. The didgeridoo does not have a cup, and the rim is the edge of the resonator with a ring of beeswax. While the narrow throat of a Western mouthpiece mutes additional sounds produced with the voice, didgeridoos are very open from end to end and carry the voice much better.
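As a rough check on why a fixed resonator yields a single drone, an idealized cylindrical pipe that is closed at the lip end and open at the far end resonates at odd multiples of c/(4L). The sketch below applies that textbook formula to a 1.4 m pipe, the length quoted in this article for the PVC didgeridoo adapter. It ignores the fold in the pipe, end corrections, and the attached saxophone, so the numbers are only indicative; the function name and the speed of sound (343 m/s) are assumptions.

```python
def closed_pipe_resonances(length_m, count=3, speed_of_sound=343.0):
    """Resonance frequencies (Hz) of an ideal cylinder closed at one end.

    Such a pipe supports only odd harmonics: f_n = (2n - 1) * c / (4 * L).
    """
    return [
        (2 * n - 1) * speed_of_sound / (4.0 * length_m)
        for n in range(1, count + 1)
    ]


if __name__ == "__main__":
    for n, freq in enumerate(closed_pipe_resonances(1.4), start=1):
        print(f"resonance {n}: {freq:.1f} Hz")
```

Under these assumptions the lowest resonance falls at about 61 Hz. Since the length is fixed, that drone pitch cannot be changed by fingering, which is why the player shapes the overtones with the vocal tract instead.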

The room a musical instrument is performed in acts as a third resonator, which also affects the timbre of the instrument. In our case, the room was simulated using a computer model with early reflections and late reverberation.

[Figure: Tone generators for soprano saxophone, from left to right: Chinese bawu, soprano saxophone, bassoon reed, cornetto]

In general, it is difficult to assess the effect of a mouthpiece and resonator individually, because both vary across instruments. The trumpet, for example, has a narrow cylindrical bore with a brass mouthpiece, while the saxophone has a wide conical bore with a reed-based mouthpiece. To mitigate this effect, several tone generators were adapted for a soprano saxophone, including a brass mouthpiece from a cornetto, a bassoon mouthpiece, and a didgeridoo adapter made from a 140 cm folded PVC pipe that can be attached to the saxophone as well. It turns out that exchanging tone generators changes the timbre of the saxophone significantly. The cornetto mouthpiece gives the instrument a much mellower tone. Similar to the baroque cornetto, the instrument sounds better in a bright room with a lot of high frequencies, while the saxophone is at home in a 19th-century concert hall with a steeper roll-off at high frequencies.
