Queen Mary University of London, Mile End Road, London, England, E1 4NS, United Kingdom
Popular version of paper 3aSP1, "Artificial intelligence in music production: controversy and opportunity," presented at the 183rd ASA Meeting.
Music production
In music production, one typically works with many sources. They all need to be heard simultaneously, yet each may be created in a different way, in a different environment and with different attributes. The mix should keep every source distinct while blending them into a clean, coherent whole. Achieving this is labour intensive and requires a professional engineer. Modern production systems help, but they are incredibly complex and still require manual manipulation. As the technology has grown, it has become more capable but not simpler for the user.
Intelligent music production
Intelligent systems could analyse all the incoming signals and determine how they should be modified and combined. This has the potential to revolutionise music production, in effect putting a robot sound engineer inside every recording device, mixing console or audio workstation. Could this be achieved? The question gets to the heart of what is art and what is science, what the role of the music producer is, and why we prefer one mix over another.
Figure 1 Caption: The architecture of an automatic mixing system. [Image courtesy of the author]
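As a rough, purely illustrative sketch of the analysis-then-control architecture in Figure 1 (not the actual system we built), the code below measures the level of each incoming track, derives a gain for each, and sums the results into a mix; all names and values are hypothetical.

```python
# Minimal sketch of an automatic mixing loop (illustrative only):
# analyse each track, derive a control (here, a simple gain), then sum.
import numpy as np

def rms(track):
    """Root-mean-square level of a mono track (linear, not dB)."""
    return np.sqrt(np.mean(track ** 2) + 1e-12)

def auto_mix(tracks, target_rms=0.1):
    """Loudness-balance all tracks to a common RMS target and sum them."""
    gains = [target_rms / rms(t) for t in tracks]        # analysis -> control
    mix = sum(g * t for g, t in zip(gains, tracks))      # processing -> summation
    return mix / max(1.0, np.max(np.abs(mix)))           # protect against clipping

# Hypothetical usage: three random "tracks" standing in for vocals, guitar, drums
sr = 44100
tracks = [0.05 * np.random.randn(sr), 0.3 * np.random.randn(sr), 0.15 * np.random.randn(sr)]
mix = auto_mix(tracks)
```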
Perception of mixing
But we understand little about how people perceive audio mixes. Almost all studies have been restricted to laboratory conditions, such as measuring the perceived level of a tone in the presence of background noise. This tells us very little about real-world cases. It does not tell us how well one can hear the lead vocals when guitar, bass and drums are playing.
Best practices
Nor do we know why one production sounds dull while another makes you laugh and cry, even when both are of the same piece of music and made by competent sound engineers. So we needed to establish what good production is, how to translate it into rules, and how to exploit those rules within algorithms. We needed to step back and explore more fundamental questions, filling gaps in our understanding of production and perception.
Knowledge engineering
We used an approach that incorporates one of the earliest machine learning methods, knowledge engineering. It's so old school that it has gone out of fashion. It assumes experts have already figured things out; they are experts, after all. So let's capture best practices as a set of rules and processes. But this is no easy task. Most sound engineers can't say exactly what they did. Ask a famous producer what he or she did on a hit song and you often get an answer like "I turned the knob up to 11 to make it sound phat." How do you turn that into a mathematical equation? Or worse, they say it was magic and can't be put into words.
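As a toy illustration of what "best practices as a set of rules" can look like once written down, the sketch below encodes one commonly cited rule of thumb (keep the lead vocal a few dB above the accompaniment); the 3 dB margin and the function names are hypothetical, not a rule we claim professionals follow exactly.

```python
# A hypothetical knowledge-engineering rule: keep the lead vocal a fixed
# margin (in dB) above the loudest accompaniment track.
import numpy as np

def level_db(track):
    """RMS level of a mono track in dB (relative to full scale)."""
    return 20 * np.log10(np.sqrt(np.mean(track ** 2)) + 1e-12)

def vocal_margin_gain(vocal, accompaniment, margin_db=3.0):
    """Gain (in dB) to apply to the vocal so it sits margin_db above
    the loudest accompaniment track."""
    loudest = max(level_db(t) for t in accompaniment)
    return (loudest + margin_db) - level_db(vocal)
```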
We systematically tested all the assumptions about best practices and supplemented them with listening tests that helped us understand how people perceive complex sound mixtures. We also curated multitrack audio, with detailed information about how it was recorded, multiple mixes and evaluations of those mixes.
This enabled us to develop intelligent systems that automate much of the music production process.
Video Caption: An automatic mixing system based on a technology we developed.
Transformational impact
I gave a talk about this once in a room that had panel windows all around. These talks are usually half full. But this time it was packed, and I could see faces outside pressed up against the windows. They all wanted to find out about this idea of automatic mixing. It’s a unique opportunity for academic research to have transformational impact on an entire industry. It addresses the fact that music production technologies are often not fit for purpose. Intelligent systems open up new opportunities. Amateur musicians can create high quality mixes of their content, small venues can put on live events without needing a professional engineer, time and preparation for soundchecks could be drastically reduced, and large venues and broadcasters could significantly cut manpower costs.
Taking away creativity
It's controversial. We entered an automatic mix in a student recording competition as a sort of Turing test. Technically we cheated, because the mixes were supposed to be made by students, not by an 'artificial intelligence' (AI) created by a student. Afterwards I asked the judges what they thought of the mix. The first two were surprised and curious when I told them how it was done. The third judge offered useful comments while he thought it was a student mix. But when I told him it was an 'automatic mix', he suddenly switched and said it was rubbish and that he could tell all along.
Mixing is a creative process where stylistic decisions are made. Is this taking away creativity? Is it taking away jobs? Such questions come up time and time again with new technologies, going back to the 19th century protests of the Luddites, textile workers who feared that the skills and craft they had spent years acquiring would be wasted as machines replaced their role in industry.
Not about replacing sound engineers
These are valid concerns, but it's important to see other perspectives. A tremendous amount of music production work is purely technical, and audio quality would improve if those technical problems were addressed. As the graffiti artist Banksy said, "All artists are willing to suffer for their work. But why are so few prepared to learn to draw?"
Creativity still requires technical skills. To achieve something wonderful when mixing music, you first have to achieve something pretty good and address issues with masking, microphone placement, level balancing and so on.
Video Caption: Time offset (comb filtering) correction, a technical problem in music production solved by an intelligent system.
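As an illustration of the kind of correction shown in the video, the sketch below estimates the delay between two microphone signals by cross-correlation and shifts one to line up with the other; this is a standard textbook approach with illustrative variable names, not necessarily the exact method used in our system.

```python
# Sketch: estimate and correct the time offset between two microphones that
# captured the same source, so that summing them avoids comb filtering.
import numpy as np

def estimate_offset(ref, delayed):
    """Lag (in samples) at which `delayed` best lines up with `ref`."""
    corr = np.correlate(delayed, ref, mode="full")
    return int(np.argmax(corr)) - (len(ref) - 1)

def align(ref, delayed):
    """Shift `delayed` back by the estimated lag (circular shift, for brevity)."""
    return np.roll(delayed, -estimate_offset(ref, delayed))

# Hypothetical check: the same click arrives 120 samples later at mic 2
ref = np.zeros(2000); ref[500] = 1.0
mic2 = np.roll(ref, 120)
print(estimate_offset(ref, mic2))   # -> 120
```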
The real benefit is not replacing sound engineers. It's dealing with all those situations when a talented engineer is not available: the band practising in the garage, the small restaurant venue that provides no sound support, or game audio, where dozens of sounds need to be mixed and there is no miniature sound engineer living inside the games console.
Ryan Podlubny – ryan.podlubny@pg.canterbury.ac.nz Department of Linguistics, University of Canterbury 20 Kirkwood Avenue, Upper Riccarton Christchurch, NZ, 8041
Popular version of paper 1aNS4, "Musical mind control: Acoustic convergence to background music in speech production," presented Monday morning, November 28, 2016, at the 172nd ASA Meeting, Honolulu.
People often adjust their speech to resemble that of their conversation partners – a phenomenon known as speech convergence. Broadly defined, convergence describes automatic synchronization to some external source, much like running to the beat of music playing at the gym without intentionally choosing to do so. Across a variety of studies a general trend has emerged in which people automatically synchronize to various aspects of their environment [1, 2, 3]. With specific regard to language use, convergence effects have also been observed in many linguistic domains, such as sentence formation [4], word formation [5] and vowel production [6] (where differences in vowel production are closely associated with perceived accentedness [7, 8]). This prevalence raises interesting questions about the extent to which speakers converge. This research uses a speech-in-noise paradigm to explore whether speakers also converge to non-linguistic signals in the environment. Specifically, will a speaker's rhythm, pitch or intensity (which is closely related to loudness) be influenced by fluctuations in background music, such that the speech echoes specific characteristics of that music? For example, if the tempo of the background music slows down, will that lead those listening to unconsciously decrease their speech rate?
In this experiment participants read passages aloud while hearing music through headphones. The background music was composed by the experimenter to be relatively stable in pitch, tempo/rhythm and intensity, so that we could manipulate and test only one of these dimensions at a time within each test condition. We imposed these manipulations gradually and consistently toward a target, as shown in Figure 1, and then returned them to the level at which they started after reaching that target. Between the manipulated sessions, participants heard music with no experimental changes. (Examples of what participants heard through the headphones are available as Sound-files 1 and 2.)
Fig. 1: Using software designed for digital signal processing (analyzing and altering sound), manipulations were applied in a linear fashion (in a straight line) toward a target – shown above as the blue line, which first rises and then falls. NOTE: After a manipulation reaches its target (the dashed, vertical red line), the degree of manipulation returns to its starting level in a similarly linear fashion. Graphic captured while using Praat [9] to increase and then decrease the perceived loudness of the background music.
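To make the shape of the manipulation concrete, here is a minimal sketch of a comparable linear ramp applied to loudness in code; the study itself used Praat, so this numpy version, with its made-up 6 dB target, is only an illustration.

```python
# Sketch: ramp the level of a signal linearly up to a target gain (in dB)
# and back down again, mirroring the manipulation shape in Fig. 1.
import numpy as np

def ramped_gain(signal, target_db=6.0):
    n = len(signal)
    half = n // 2
    up = np.linspace(0.0, target_db, half)           # linear rise to the target
    down = np.linspace(target_db, 0.0, n - half)     # linear return to the start
    gain_db = np.concatenate([up, down])
    return signal * (10.0 ** (gain_db / 20.0))       # dB -> linear amplitude

# Hypothetical usage on one second of noise at 44.1 kHz
x = np.random.randn(44100)
y = ramped_gain(x, target_db=6.0)
```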
Data from 15 native speakers of New Zealand English were analyzed using statistical tests that allow effects to vary somewhat for each participant, and we observed significant convergence in both the pitch and intensity conditions. Analysis of the tempo condition has not yet been conducted. Interestingly, these effects appear to differ systematically based on a person's previous musical training. While non-musicians demonstrate the predicted effect and follow the manipulations, musicians appear to invert the effect and reliably alter aspects of their pitch and intensity in the opposite direction of the manipulation (see Figure 2). Sociolinguistic research indicates that under certain conditions speakers will emphasize characteristics of their speech to distinguish themselves socially from conversation partners or groups, as opposed to converging with them [6]. It seems plausible, then, that given a relatively heightened ability to recognize low-level variations in sound, musicians may on some cognitive level be more aware of the variation in their sound environment and, as a result, resist the more typical effect. However, more work is required to better understand this phenomenon.
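For readers curious about the phrase "statistical tests that allow effects to vary somewhat for each participant", this describes mixed-effects models with by-participant random effects. The sketch below shows what such a model can look like in Python; the data are synthetic and the column names hypothetical, so it only illustrates the general form of this kind of analysis, not the actual analysis reported here.

```python
# A sketch of a mixed-effects model with a random intercept and slope per
# participant, fit to made-up data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for p in range(15):                                  # 15 hypothetical speakers
    baseline = 120 + rng.normal(0, 8)                # each speaker's own pitch level
    effect = -2.0 + rng.normal(0, 1.0)               # each speaker's own response
    for manipulated in (0, 1):
        for _ in range(20):
            rows.append({"participant": p,
                         "manipulation": manipulated,
                         "pitch": baseline + effect * manipulated + rng.normal(0, 3)})
df = pd.DataFrame(rows)

# Fixed effect of the manipulation, with effects allowed to vary per participant.
model = smf.mixedlm("pitch ~ manipulation", df,
                    groups=df["participant"], re_formula="~manipulation")
print(model.fit().summary())
```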
Fig. 2: The plots above show pitch on the y-axis (left edge) and indicate the portions of background music that were manipulated on the x-axis (bottom). The blue lines show that speakers generally lower their pitch as an unmanipulated condition progresses. The red lines show that when global pitch is lowered during a test condition, this lowering is relatively more dramatic for non-musicians (left plot), while the effect is reversed for those with musical training (right plot). NOTE: A follow-up model that further accounts for the relatedness of pitch and intensity shows much the same effect.
This work indicates that speakers are influenced not only by human speech partners but also, to some degree, by noise within the immediate speech environment, which suggests that environmental noise may constantly be influencing certain aspects of our speech production in specific and predictable ways. Human listeners are remarkably good at recognizing subtle cues in speech [10], especially compared with computers and algorithms, which cannot yet match this ability. Some language scientists argue that these changes in speech occur to make understanding easier for those listening [11]. That is why work like this is likely to resonate in both academia and the private sector: a better understanding of how speech changes in different environments can contribute to more effective aids for the hearing impaired, as well as improvements to many devices used in global communications.
Sound-file 1. An example of what participants heard as a control condition (no experimental manipulation) in between test-conditions.
Sound-file 2. An example of what participants heard as a test condition (Pitch manipulation, which drops 200 cents/one full step).
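For reference, cents convert to a frequency ratio as 2 raised to the power of cents/1200, so a 200-cent (whole-step) drop lowers the frequency by roughly 11 percent; a quick check:

```python
# A 200-cent drop corresponds to dividing the frequency by 2**(200/1200).
cents = 200
ratio = 2 ** (cents / 1200)                                           # ≈ 1.122 (one whole step)
print(f"Frequency is lowered by about {100 * (1 - 1 / ratio):.1f}%")  # ≈ 10.9%
```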
References
1. Hill, A. R., Adams, J. M., Parker, B. E., & Rochester, D. F. (1988). Short-term entrainment of ventilation to the walking cycle in humans. Journal of Applied Physiology, 65(2), 570-578.
2. Will, U., & Berg, E. (2007). Brain wave synchronization and entrainment to periodic acoustic stimuli. Neuroscience Letters, 424(1), 55-60.
3. McClintock, M. K. (1971). Menstrual synchrony and suppression. Nature, 229, 244-245.
4. Branigan, H. P., Pickering, M. J., McLean, J. F., & Cleland, A. A. (2007). Syntactic alignment and participant role in dialogue. Cognition, 104(2), 163-197.
5. Beckner, C., Rácz, P., Hay, J., Brandstetter, J., & Bartneck, C. (2015). Participants conform to humans but not to humanoid robots in an English past tense formation task. Journal of Language and Social Psychology, 0261927X15584682. Retrieved from: http://jls.sagepub.com.ezproxy.canterbury.ac.nz/content/early/2015/05/06/0261927X15584682.
6. Babel, M. (2012). Evidence for phonetic and social selectivity in spontaneous phonetic imitation. Journal of Phonetics, 40(1), 177-189.
7. Major, R. C. (1987). English voiceless stop production by speakers of Brazilian Portuguese. Journal of Phonetics, 15, 197-202.
8. Rekart, D. M. (1985). Evaluation of foreign accent using synthetic speech. Ph.D. dissertation, Louisiana State University.
9. Boersma, P., & Weenink, D. (2014). Praat: Doing phonetics by computer (Version 5.4.04) [Computer program]. Retrieved from www.praat.org.
10. Hay, J., Podlubny, R., Drager, K., & McAuliffe, M. (under review). Car-talk: Location-specific speech production and perception.
11. Lane, H., & Tranel, B. (1971). The Lombard sign and the role of hearing in speech. Journal of Speech, Language, and Hearing Research, 14(4), 677-709.
Ki-Hong Kim — kim.kihong@surugadai.ac.jp Faculty of Media & Information Resources, Surugadai University 698 Azu, Hanno-shi, Saitama-ken, Japan 357-8555
Mikiko Kubo — kubmik.0914@gmail.com Hitachi Solutions, Ltd. 4-12-7 Shinagawa-ku, Tokyo, Japan 140-0002
Shin-ichiro Iwamiya – iwamiya@design.kyushu-u.ac.jp Faculty of Design, Kyushu University 4-9-1 Shiobaru, Minami-ku, Fukuoka, Japan 815-8540
Popular version of paper 1pMU4, "Optimal insertion timing of symbolic music to induce laughter in video content," presented Monday afternoon, November 28, 2016, at the 172nd ASA Meeting, Honolulu.
A study of optimal insertion timing of symbolic music to induce laughter in videos
In television variety shows and comedy programs, various sound effects and music are combined with humorous scenes to induce more pronounced laughter from viewers and listeners [1]. The aim of our study was to clarify the optimal insertion timing of symbolic music for inducing laughter in video content. Symbolic music is music associated with a particular meaning, such as something funny; it acts as a sort of "punch line" that emphasizes the humorous nature of a scene.
Fig. 1 Sequence of video and audio tracks in the video editing timeline
We conducted a series of rating experiments to explore the best timing for inserting such music into humorous video content. We also examined the affective impressions evoked by the audiovisual content. The experimental stimuli were four short videos, created by combining two video clips (V1 & V2) with four music clips (M1, M2, M3 & M4).
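As a concrete (and purely hypothetical) illustration of what "insertion timing" means in editing terms, the sketch below dubs a music clip onto a video's audio track at a chosen pause after the end of a humorous scene; the 0.5 second value and all names are illustrative, not taken from our editing sessions.

```python
# Sketch: mix a music clip into a video's audio track at a chosen delay
# after the end of a scene (times in seconds, audio as mono numpy arrays).
import numpy as np

def insert_music(track, music, scene_end, pause, sr=48000):
    out = track.copy()
    start = int((scene_end + pause) * sr)       # insertion point in samples
    end = min(len(out), start + len(music))
    out[start:end] += music[:end - start]       # dub the music onto the track
    return out

# Hypothetical usage: insert the clip 0.5 s after a scene that ends at 12.0 s
sr = 48000
track = np.zeros(20 * sr)                       # 20 s of (silent) programme audio
music = 0.1 * np.random.randn(2 * sr)           # 2 s stand-in for a music clip
mixed = insert_music(track, music, scene_end=12.0, pause=0.5, sr=sr)
```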
The rating experiments clarified that the insertion timing of symbolic music contributes to how much laughter a video induces. For a purely comical scene (V1), the optimal insertion time for a high funniness rating was the shortest, at 0-0.5 seconds after the scene. For a tragicomic scene, a humorous accident (V2), the optimal insertion time was longer, at 0.5-1 seconds after the scene; i.e., a short pause before the music was effective in increasing funniness.
Fig. 2 Subjective funniness ratings for each insertion timing of symbolic music, for each video scene.
Furthermore, the rating experiments showed that the optimal timing was also associated with the highest impressiveness of the videos, the highest evaluations, the highest congruence between the moving pictures and the sound, and the greatest induced laughter. All of the correlation coefficients are very high, as shown in the table summarizing these results (Table 1).
Table 1 Correlation coefficients between the optimal timing of symbolic music and the affective impressions of the audiovisual content.
                  funniness   impressiveness   congruence   evaluation
best timing         .95**         .90**           .90**        .98**
funniness            –            .94**           .92**        .97**
impressiveness      .94**          –              .92**        .95**
congruence          .92**         .92**            –           .94**
evaluation          .97**         .95**           .94**          –

** p < .01
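For readers who want to see the procedure behind Table 1, the sketch below computes a Pearson correlation coefficient (and p-value) between two rating scales; the numbers in it are made up for illustration and are not our data.

```python
# Sketch: Pearson correlation between two rating scales, as in Table 1.
from scipy.stats import pearsonr

# Made-up mean ratings for eight stimuli on two of the scales
funniness      = [3.1, 4.2, 2.8, 4.9, 3.6, 4.4, 2.5, 4.0]
impressiveness = [3.0, 4.0, 2.9, 4.7, 3.4, 4.6, 2.6, 3.9]

r, p = pearsonr(funniness, impressiveness)
print(f"r = {r:.2f}, p = {p:.3f}")
```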
In television variety shows and comedy programs, when symbolic music is dubbed over the video as a punch line just after a humorous scene, inserting a short pause of between half a second and a full second is very effective at emphasizing the humor of the scene and increasing the impression it makes on viewer-listeners.
References
1. Kim, K. H., et al. Effectiveness of Sound Effects and Music to Induce Laugh in Comical Entertainment Television Show. The 13th International Conference on Music Perception and Cognition, 2014. CD-ROM.
2. Kim, K. H., et al. Effects of Music and Sound Effects to Increase Laughter in Television Programs. Media & Information Resources, 2014. 21(2): 15-28. (In Japanese with English abstract.)
University of Utah 201 Presidents Cir Salt Lake City, UT
Popular version of paper 4pMU4, "How well can a human mimic the sound of a trumpet?" presented Thursday, May 26, 2:00 pm, Solitude room, at the 171st ASA Meeting, Salt Lake City.
Man-made musical instruments are sometimes designed or played to mimic the human voice, and likewise vocalists try to mimic the sounds of man-made instruments. If flutes and strings accompany a singer, a “brassy” voice is likely to produce mismatches in timbre (tone color or sound quality). Likewise, a “fluty” voice may not be ideal for a brass accompaniment. Thus, singers are looking for ways to color their voice with variable timbre.
Acoustically, brass instruments are close cousins of the human voice. It was discovered prehistorically that sending sound over long distances (to locate, be located, or warn of danger) is made easier when a vibrating sound source is connected to a horn. It is not known which came first – blowing hollow animal horns or sea shells with pursed and vibrating lips, or cupping the hands to extend the airway for vocalization. In both cases, however, airflow-induced vibration of soft tissue (vocal folds or lips) is enhanced by a tube that resonates the frequencies and radiates them (sends them out) to the listener.
Around 1840, theatrical singing by males went through a revolution. Men wanted to portray more masculinity and raw emotion with their vocal timbre. The "do di petto", Italian for a "C in chest voice", was introduced by the operatic tenor Gilbert Duprez in 1837 and soon became a phenomenon. A heroic voice in opera took on more of a brass-like quality than a flute-like quality. Similarly, in the early to mid-twentieth century (1920-1950), female singers were driven by the desire to sing with a richer timbre, one that matched brass and percussion instruments rather than strings or flutes. Ethel Merman became an icon in this revolution. This led to the theatre belt sound produced by females today, which has much in common with a trumpet sound.
Fig 1. Mouth opening to head-size ratio for Ethel Merman and the corresponding frequency spectrum for the sound "aw", with a fundamental frequency fo (pitch) at 547 Hz and a second harmonic frequency 2fo at 1094 Hz.
The length of an uncoiled trumpet horn is about 2 meters (including the full length of the valves), whereas the length of a human airway above the glottis (the space between the vocal cords) is only about 17 cm (Fig. 2). The vibrating lips and the vibrating vocal cords can produce similar pitch ranges, but the resonators have vastly different natural frequencies due to the more than 10:1 ratio in airway length. So, we ask, how can the voice produce a brass-like timbre in a “call” or “belt”?
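To see why the two resonators behave so differently, each air column can be treated, very roughly, as a quarter-wave resonator whose lowest natural frequency is about c/(4L). The sketch below applies that textbook approximation; it ignores the trumpet's mouthpiece and bell and the shape of the vocal tract, so the numbers are only order-of-magnitude.

```python
# Very rough quarter-wave estimate of the lowest resonance of an air column:
# f1 ≈ c / (4 L), ignoring bells, mouthpieces and tract shape.
SPEED_OF_SOUND = 343.0  # m/s at room temperature

def quarter_wave_f1(length_m):
    return SPEED_OF_SOUND / (4.0 * length_m)

print(f"17 cm vocal tract: ~{quarter_wave_f1(0.17):.0f} Hz")   # ≈ 500 Hz
print(f"2 m trumpet tube:  ~{quarter_wave_f1(2.0):.0f} Hz")    # ≈ 43 Hz
```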
One structural similarity between the human instrument and the brass instrument is the shape of the airway directly above the glottis, a short and narrow tube formed by the epiglottis. It corresponds to the mouthpiece of brass instruments. This mouthpiece plays a major role in shaping the sound quality. A second structural similarity is created when a singer uses a wide mouth opening, simulating the bell of the trumpet. With these two structural similarities, the spectrum of tones produced by the two instruments can be quite similar, despite the huge difference in the overall length of the instrument.
Fig 2. Human airway and trumpet (not drawn to scale).
Acoustically, the call or belt-like quality is achieved by strengthening the second harmonic frequency 2fo in relation to the fundamental frequency fo. In the human instrument, this can be done by choosing a bright vowel like /ᴂ/ that puts an airway resonance near the second harmonic. The fundamental frequency will then have significantly less energy than the second harmonic.
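A small numerical sketch of that tuning idea: list the first few harmonics of the fundamental and see which one falls nearest a chosen airway resonance (the 1000 Hz resonance used here is a made-up stand-in for a bright-vowel resonance, not a measured formant).

```python
# Sketch: which harmonic of fo falls closest to a given airway resonance?
fo = 547.0                       # fundamental from Fig. 1 (Hz)
resonance = 1000.0               # hypothetical "bright vowel" resonance (Hz)

harmonics = [n * fo for n in range(1, 6)]
nearest = min(harmonics, key=lambda f: abs(f - resonance))
print([round(h) for h in harmonics])                         # 547, 1094, 1641, ...
print(f"{resonance:.0f} Hz resonance is nearest {nearest:.0f} Hz, i.e. 2fo")
```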
Why does that resonance adjustment produce a brass-like timbre? To understand this, we first recognize that, in brass-instrument playing, the tones produced by the lips are entrained (synchronized) to the resonance frequencies of the tube. Thus, the tones heard from the trumpet are the resonance tones. These resonance tones form a harmonic series, but the fundamental tone in this series is missing. It is known as the pedal tone. Thus, by design, the trumpet has a strong second harmonic frequency with a missing fundamental frequency.
Perceptually, an imaginary fundamental frequency may be produced by our auditory system when a series of higher harmonics (equally spaced overtones) is heard. Thus, the fundamental (pedal tone) may be perceptually present to some degree, but the highly dominant second harmonic determines the note that is played.
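The effect can be demonstrated by synthesizing a tone that contains harmonics 2fo, 3fo, 4fo and 5fo but no energy at fo; most listeners still hear the pitch as fo. Here is a minimal sketch of such a stimulus, written to a WAV file so it can be auditioned (file name and parameter values are illustrative).

```python
# Sketch: a tone containing harmonics 2fo-5fo but no energy at fo itself.
# Most listeners still hear the pitch as fo (the "missing fundamental").
import numpy as np
from scipy.io import wavfile

sr, dur, fo = 44100, 2.0, 110.0
t = np.linspace(0, dur, int(sr * dur), endpoint=False)
tone = sum(np.sin(2 * np.pi * n * fo * t) for n in range(2, 6))   # harmonics 2-5 only
tone /= np.max(np.abs(tone))
wavfile.write("missing_fundamental.wav", sr, (0.5 * tone * 32767).astype(np.int16))
```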
In belting and loud calling, the fundamental is not eliminated, but suppressed relative to the second harmonic. The timbre of belt is related to the timbre of a trumpet due to this lack of energy in the fundamental frequency. There is a limit, however, in how high the pitch can be raised with this timbre. As pitch goes up, the first resonance of the airway has to be raised higher and higher to maintain the strong second harmonic. This requires ever more mouth opening, literally creating a trumpet bell (Fig. 3).
Fig 3. Mouth opening to head-size ratio for Idina Menzel and corresponding frequency spectrum for a belt sound with a fundamental frequency (pitch) at 545 Hz.
Note the strong second harmonic frequency 2fo in the spectrum of frequencies produced by Idina Menzel, a current musical theatre singer.
One final comment about the perceived pitch of a belt sound is in order. Pitch perception is not only related to the fundamental frequency, but the entire spectrum of frequencies. The strong second harmonic influences pitch perception. The belt timbre on a D5 (587 Hz) results in a higher pitch perception for most people than a classical soprano sound on the same note. This adds to the excitement of the sound.