3pMUa4 – Acoustical analysis of 3D-printed snare drums

Chris Jasinski – jasinski@hartford.edu
University of Hartford
200 Bloomfield Ave
West Hartford, CT 06117

Popular version of paper 3pMUa4
Presented Wednesday afternoon, December 09, 2020
179th ASA Meeting, Acoustics Virtually Everywhere

For many years, 3D printing (or additive manufacturing) has been a growing field with applications ranging from desktop trinkets to prototypes for replacements of human organs. Now, Klapel Percussion Instruments has designed its first line of 3D-printed snare drums.

Snare drums are commonly used in drum sets, orchestras, and marching bands. They are traditionally made with wood or metal shells, metal rims, plastic (mylar) skins, and metal connective hardware including bolts, lugs, and fasteners. For the first phase of Klapel prototypes, the shell and rim are replaced with a proprietary carbon fiber composite; future iterations are intended to replace the remaining hardware with 3D-printed parts as well. The shell and rim are produced layer by layer until the final shape is formed. Even with high-quality printers, these layers remain visible in the final texture of 3D-printed objects, appearing as horizontal lines and vertical seams where each layer starts and finishes.


3D-printed snare drum and detail of finished texture.

Klapel Percussion Instruments contacted the University of Hartford Acoustics Program to assess whether a 3D-printed shell and rim change the fundamental vibrational and acoustical characteristics of the drum. To test this, undergraduate students developed a repeatable drum-striking device. The machine relies on gravity and a nearly frictionless bearing to strike a snare drum from a consistent height above the playing surface. With the striking force held constant, the sound produced by the drum was recorded in the University of Hartford's anechoic chamber (a laboratory designed to eliminate all sound reflections, or 'echoes', shown in the photo of the striking machine below). The recordings were then analyzed for their frequency content.


Snare drum striking machine inside Paul S. Veneklasen Research Foundation Anechoic Chamber at University of Hartford.
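For readers curious about what the frequency analysis involves, the sketch below shows one common way to estimate the spectrum of a single recorded strike. It is a minimal illustration, not the actual processing chain used in this study; the file name and mono recording format are assumptions.

```python
# Minimal sketch: estimate the frequency content of one recorded drum strike.
# Assumes a mono WAV file of a single strike (the filename is hypothetical).
import numpy as np
from scipy.io import wavfile
from scipy.signal import get_window

fs, x = wavfile.read("maple_strike.wav")    # hypothetical anechoic recording
x = x.astype(float)
x /= np.max(np.abs(x)) + 1e-12              # normalize

# Window the strike and take an FFT to get the magnitude spectrum.
w = get_window("hann", len(x))
spectrum = np.abs(np.fft.rfft(x * w))
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)

# Report the most prominent peaks below 1000 Hz, the range where the two shells matched.
band = freqs < 1000
top = np.argsort(spectrum[band])[-5:][::-1]
for i in top:
    print(f"{freqs[band][i]:7.1f} Hz   level {20 * np.log10(spectrum[band][i]):6.1f} dB")
```

Comparing peak lists like this for the maple and carbon fiber shells is, in essence, the comparison shown in the frequency analysis figure further below.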

Along with the acoustical testing, the drum shell (the largest single component of a snare drum) underwent ‘modal analysis’, where 30 points are marked on each shell and struck with a calibrated force-sensing hammer. The resulting vibration of the drum is measured with an accelerometer. The fundamental shapes (or ‘modes’) of vibration can then be visualized using processing software.


Vibrational mode shapes for maple drum shell [left] and 3D-printed shell [right].
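The mode shapes above come from measurements like those described in the modal analysis paragraph: a force-hammer input and an accelerometer response at each of the 30 points. A standard way to turn one strike into a frequency response function is the H1 estimator; the sketch below is a minimal illustration with placeholder signals, not the lab's actual processing software.

```python
# Minimal sketch: H1 frequency-response estimate for one hammer-strike point.
# 'force' and 'accel' stand in for time-aligned recordings (placeholder data).
import numpy as np
from scipy.signal import csd, welch

fs = 25600                              # assumed sampling rate of the data acquisition system
rng = np.random.default_rng(0)
force = rng.standard_normal(fs)         # placeholder for the force-hammer signal
accel = rng.standard_normal(fs)         # placeholder for the accelerometer signal

nperseg = 4096
f, Sxx = welch(force, fs=fs, nperseg=nperseg)         # force auto-spectrum
_, Sxy = csd(force, accel, fs=fs, nperseg=nperseg)    # force/response cross-spectrum

H1 = Sxy / Sxx                          # H1 estimator of the frequency response function
# Peaks in |H1| across the 30 measurement points indicate the shell's natural
# frequencies; the relative amplitude and phase at each point give the mode shape.
peak = f[np.argmax(np.abs(H1))]
print(f"Strongest response near {peak:.1f} Hz")
```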

Ultimately, the vibrational and acoustical analyses led to the same conclusion: the fundamental shapes of vibration and the primary frequency content of the snare drum are unaffected by the 3D-printing process. The most prominent audible frequencies and vibrational shapes are identical for the maple wood shell and the carbon fiber 3D-printed shell, as seen in the visualized modes of vibration. This means that 3D-printed drum technology is a viable alternative to more traditional manufacturing techniques for drums.

There are, however, measurable variations that affect the more subtle characteristics of the drum at higher, less prominent frequencies and for more complex vibration shapes. These differences are noticeable above 1000 Hz in the frequency analysis comparison.


Frequency analysis at two striking locations for maple (wood) and carbon fiber (3D-printed) drum.

Future testing, including subjective listening tests, will aim to identify how these smaller variations impact listeners and performers. Those results can help determine how well acoustical metrics predict listener impressions.

2pSCb4 – The Science of Voice Acting

Colette Feehan – cmfeehan@iu.edu
Indiana University

Popular version of paper 2pSCb4
Presented Tuesday afternoon, December 8, 2020
179th ASA meeting, Virtually Everywhere

Many people do not realize that the "children" they hear in animation are actually voiced by adults [1]. There are several reasons for this: children cannot work long hours, they are difficult to direct, and their voices change as they grow. Using an adult who can simulate a child voice bypasses these issues, but, surprisingly, not all voice actors (VAs) can create a believable child voice.

Studying what VAs do can tell us about how the vocal tract works. They can speak intelligibly while contorting their mouths in unnatural ways. A few previous studies [2-10] have looked at the acoustics of VAs, that is, the sounds they produce, such as changes in pitch, voice quality (how raspy or breathy a voice sounds), and the regional dialects they use. This study uses 3D ultrasound and acoustic data from three professional and three amateur VAs to start answering the question: what do voice actors do with their vocal tracts to sound like a child? There are multiple strategies for making the vocal tract sound smaller, and different actors combine different strategies to create their child-like voices.

Looking at both the acoustics (the sounds they produce) and the ultrasound imaging of their vocal tracts, the strategies identified so far include gesture fronting and raising, and hyoid bone raising.

Gesture fronting and raising refers to the position of the tongue within the mouth while you speak. If you think about the location of your tongue when repeating "ta ka ta ka…", you will notice that it touches the roof of your mouth in different places to make each of those consonant sounds: farther forward in the mouth for "ta" and farther back for "ka". The same is true for vowels. Figure 1, which comes from analyzing the recordings of their speech, shows that the position of the tongue for the adult versus the child voice is quite different for the [i] and [ɑ] sounds for this subject. Given this information, we can then look at the ultrasound and see that the tongue positions are indeed farther forward (right) or higher in the mouth for the child voice (see Figure 2).
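Tongue-position differences like the ones in Figure 1 are usually tracked acoustically through vowel formants, the resonances of the vocal tract. The sketch below shows one way such measurements can be made; it assumes the praat-parselmouth Python package and a hypothetical recording file, and is not necessarily the pipeline used in this study.

```python
# Minimal sketch: estimate the first two formants (F1, F2) at a vowel midpoint.
# Assumes the praat-parselmouth package; the filename is hypothetical.
import parselmouth

snd = parselmouth.Sound("adult_voice_i.wav")   # hypothetical recording of an [i] vowel
formants = snd.to_formant_burg()               # Burg-method formant tracking (Praat defaults)

t_mid = snd.duration / 2                       # sample at the vowel midpoint
f1 = formants.get_value_at_time(1, t_mid)
f2 = formants.get_value_at_time(2, t_mid)
print(f"F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz")
# A fronted or raised tongue gesture (as in the child voice) typically shows up
# as a lower F1 and/or higher F2 for a vowel like [i].
```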

The hyoid bone is a small bone above the larynx in your neck. This bone interrupts the ultrasound signal and prevents an image from showing up, but looking at the location of this hyoid “shadow” can still give us information. If the hyoid shadow is raised and fronted, as seen in Figure 3, it might be the case that the actor is shortening their vocal tract by contracting muscles in their throat.

Figure 4 shows that, for this VA, the hyoid bone shadow was higher throughout the entire utterance while doing a child voice, meaning that the actor might be physically shortening the whole vocal tract for the entire time they are speaking.

Data from VAs can help us find alternative articulations for speech sounds, which could help people with speech impediments and could also be used to help trans individuals sound closer to their identity.

References

  1. Holliday, C. “Emotion Capture: Vocal Performances by Children in the Computer-Animated Film”. Alphaville: Journal of Film and Screen Media 3 (Summer 2012). Web. ISSN: 2009-4078.
  2. Starr, R. L. (2015). Sweet voice: The role of voice quality in a Japanese feminine style. Language in Society, 44(01), 1-34.
  3. Teshigawara, M. (2003). Voices in Japanese animation: a phonetic study of vocal stereotypes of heroes and villains in Japanese culture. Dissertation.
  4. Teshigawara, M. (2004). Vocally expressed emotions and stereotypes in Japanese animation: Voice qualities of the bad guys compared to those of the good guys. Journal of the Phonetic Society of Japan, 8(1), 60-76.
  5. Teshigawara, M., & Murano, E. Z. (2004). Articulatory correlates of voice qualities of good guys and bad guys in Japanese anime: An MRI study. In Proceedings of INTERSPEECH (pp. 1249-1252).
  6. Teshigawara, M., Amir, N., Amir, O., Wlosko, E., & Avivi, M. (2007). Effects of random splicing on listeners’ perceptions. In 16th International Congress of Phonetic Sciences (ICPhS).
  7. Teshigawara, M. (2009). Vocal expressions of emotions and personalities in Japanese anime. In Izdebski, K. (ed.), Emotions of the Human Voice, Vol. III Culture and Perception. San Diego: Plural Publishing, 275-287.
  8. Teshigawara, K. (2011). Voice-based person perception: two dimensions and their phonetic properties. ICPhS XVII, 1974-1977.
  9. Uchida, T. (2007). Effects of F0 range and contours in speech upon the image of speakers’ personality. Proc. 19th ICA, Madrid. http://www.seaacustica.es/WEB_ICA_07/fchrs/papers/cas-03-024.pdf
  10. Lippi-Green, R. (2011). English with an Accent: Language, Ideology and Discrimination in the United States. Retrieved from https://ebookcentral.proquest.com

5aPPb2 – Using a virtual restaurant to test hearing aid settings

Gregory M Ellis – gregory.ellis@northwestern.edu
Pamela Souza – p-souza@northwestern.edu

Northwestern University
Frances Searle Building
2240 Campus Drive
Evanston, IL 60201

Popular version of paper 5aPPb2
Presented Friday morning, December 11th, 2020
179th ASA Meeting, Acoustics Virtually Everywhere

True scientific discoveries require a series of tightly controlled experiments conducted in lab settings. These kinds of studies tell us how to implement and improve technologies we use every day—technologies like fingerprint scanners, face recognition, and voice recognition. One of the downsides of these tightly controlled environments, however, is that the real world is anything but tightly controlled. Dust may be on your fingerprint, the lighting may make it difficult for face recognition software to work, or the background may be noisy, making your voice impossible to pick up. Can we account for these scenarios in the lab when we’re performing experiments? Can we bring the real world—or parts of it—into a lab setting?

In our line of research, we believe we can. While the technologies listed above are interesting in their own right, our research focuses on hearing aid processing. Our lab generally asks: which factors affect speech understanding for a person with a hearing aid, and to what extent? The project I’m presenting at this conference looks specifically at environmental and hearing aid processing factors. Environmental factors include the loudness of background noises and echoes. Processing factors involve the software within the hearing aid that attempts to reduce or eliminate background noise, and amplification strategies that make relatively quiet parts of speech louder so they’re easier to hear. We are using computer simulations to look at both the environmental and the processing factors, and we can examine their effects on a listener by seeing how speech intelligibility changes with those factors.

The room simulation is first. We built a very simple virtual environment pictured below:


The virtual room used in our experiments. The red dot represents the listener. The green dot represents the speaker. The blue dots represent other people in the restaurant having their own conversations and making noise.

We can simulate the properties of the sounds in that room using a model that has been shown to be a good approximation of real recordings made in rooms. After passing the speech from the speaker and all of the competing talkers through this room model, we have a realistic simulation of the sounds in the room.

If you’re wearing headphones while you read this article, you can listen to an example here:

A woman speaking the sentence “Ten pins were set in order.” You should be able to hear other people talking to your right, all of whom are quieter than the woman in front. All of the sound has a slight echo to it. Note that this will not work if you aren’t wearing headphones!
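For readers who want to experiment, the sketch below sets up a comparable (much simplified) room simulation using the image-source method. It assumes the pyroomacoustics Python package, and the room size, absorption, and source positions are made-up values rather than the ones used in our experiment.

```python
# Minimal sketch: a talker and one competing talker in a small "restaurant",
# simulated with the image-source method (pyroomacoustics assumed).
import numpy as np
import pyroomacoustics as pra

fs = 16000
room = pra.ShoeBox([8.0, 6.0, 3.0], fs=fs,
                   materials=pra.Material(0.3),   # wall absorption (made-up value)
                   max_order=12)                  # number of image-source reflections

target = np.random.randn(2 * fs)          # placeholder for the target sentence
masker = 0.5 * np.random.randn(2 * fs)    # placeholder for a competing talker

room.add_source([4.0, 3.0, 1.5], signal=target)   # talker in front of the listener
room.add_source([6.5, 1.5, 1.5], signal=masker)   # competing talker off to the right
room.add_microphone([4.0, 4.0, 1.5])              # the listener's position

room.simulate()
reverberant = room.mic_array.signals[0]           # the signal "heard" at the listener
```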

We then take this simulation and pass it through a hearing aid simulator. This imposes the processing you might expect in a widely-available hearing aid. Here’s an example of what that would sound like:

Same sentence as the restaurant simulation, but this is processed through a simulated hearing aid. You should notice a slightly different pitch to the sentence and the environment. This is because the simulated hearing loss is more extreme at higher pitches.
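The hearing aid simulator itself is more elaborate than we can show here, but one of its key ingredients, amplification that boosts quiet parts of speech relative to loud parts, can be sketched as a simple single-band compressor. Real hearing aids use many frequency bands and gains prescribed for an individual's hearing loss; the values below are illustrative only.

```python
# Minimal sketch: single-band dynamic range compression, the basic amplification
# idea in a hearing aid (real devices use many bands and prescribed gains).
import numpy as np

def compress(x, fs, threshold_db=-40.0, ratio=3.0, attack=0.005, release=0.050):
    """Reduce the level of loud parts so quiet parts become relatively louder."""
    # Envelope follower with separate attack/release smoothing
    env = np.zeros_like(x)
    a_att = np.exp(-1.0 / (attack * fs))
    a_rel = np.exp(-1.0 / (release * fs))
    level = 0.0
    for n, sample in enumerate(np.abs(x)):
        coeff = a_att if sample > level else a_rel
        level = coeff * level + (1 - coeff) * sample
        env[n] = level
    level_db = 20 * np.log10(env + 1e-9)
    # Above the threshold, level changes are reduced by the compression ratio
    gain_db = np.where(level_db > threshold_db,
                       (threshold_db - level_db) * (1 - 1 / ratio), 0.0)
    return x * 10 ** (gain_db / 20)

# Example: a tone that is quiet for 0.5 s and then loud
fs = 16000
t = np.arange(fs) / fs
speechlike = np.sin(2 * np.pi * 200 * t) * (0.05 + 0.95 * (t > 0.5))
y = compress(speechlike, fs)
```

In practice a makeup gain is also applied after compression so that the overall loudness stays comfortable.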

Based on the results from hundreds of sentences, we can build a better understanding of how the environmental factors and the hearing aid processing interact. We found that, for listeners with hearing impairment, there is an interaction between noise level and processing strategy, though more data will need to be collected before we can draw any solid conclusions. While these results are a promising first step, there are many more factors to look at—different amounts of echo, different amounts of noise, different types of processing strategies… and none of these factors says anything about the person listening to the sentences, either. Does age, attention span, or degree of hearing loss affect their ability to perform the task? Ongoing and future research will be able to answer these questions.

This work is important because it shows that we can account for some environmental factors in tightly-controlled research. The method works well and produces results that we would expect to see. If you want results from the lab to be relatable to the real world, try to bring the real world into the lab!

1aSCa3 – Training effects on speech prosody production by Cantonese-speaking children with autism spectrum disorder

Si Chen – sarah.chen@polyu.edu.hk
Bei Li
Fang Zhou
Angel Wing Shan Chan
Tempo Po Yi Tang
Eunjin Chun
Phoebe Choi
Chakling Ng
Fiona Cheng
Xinrui Gou

Department of Chinese and Bilingual Studies
The Hong Kong Polytechnic University
11 Yuk Choi Road, Hung Hom, Hong Kong, China

Popular version of paper 1aSCa3
Presented Monday, December 07, 2020, 9:30 AM – 10:15 AM EST
179th ASA Meeting, Acoustics Virtually Everywhere

Speakers can utilize prosodic variations to express their intentions, states, and emotions. Specifically, the relatively new information of an utterance, namely the focus, is often associated with an expanded range of prosodic cues. The main types of focus include broad, narrow, and contrastive focus. Broad focus involves focus on a whole sentence (A: What did you say? B: [Emily ate an apple]FOCUS), whereas narrow focus emphasizes one constituent asked about in the question (A: What did Emily eat? B: Emily ate an [apple]FOCUS). Contrastive focus rejects alternative statements (A: Did Emily eat an orange? B: (No,) Emily ate an [apple]FOCUS).

Children with autism spectrum disorder (ASD) have been reported to show difficulties in using speech prosody to mark focus. The research presented here tests whether speech training and sung speech training can improve the use of speech prosody to mark focus. Fifteen Cantonese-speaking children with ASD completed pre- and post-training speech production tasks and received either speech or sung speech training. In the pre- and post-training speech production tasks, we designed games to measure participants’ ability to mark focus in conversations. In the training tasks, we aimed to improve the mapping between acoustic cues and information structure categories through a series of exercises. The conversations used in sung speech training were set to melodies that imitated the changes in acoustic cues found in speech.

Training.mp4: An example of congruous and incongruous conversation pairs in sung speech training

Both training methods consisted of three phases. In the first phase, participants listened attentively to congruous conversation pairs in a designed game. In the second phase, participants were told that the odd-numbered trials were incongruous (the focus type that the question elicited did not match that of the answer) and the even-numbered trials were congruous, and they needed to attend to the differences between the odd and even trials. In the third phase, all trials were presented in random order, and participants judged whether each pair was congruous or not. Instant feedback was provided after each response.

We extracted acoustic cues from the children’s speech before and after training and performed statistical analyses. Our pilot results suggest that both speech and sung speech training may have improved the use of prosodic cues such as intensity and f0 in marking focus across various focus positions (see the mean f0 figure below). However, children with ASD may still have difficulties integrating all of the prosodic cues across focus conditions.

Mean f0 of narrow focus in the initial position before and after training
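The acoustic cues reported above (f0 and intensity) can be extracted with standard phonetics tools. The sketch below shows one way to do so; it assumes the praat-parselmouth Python package and a hypothetical recording of one answer, and is not necessarily the exact pipeline we used.

```python
# Minimal sketch: extract mean f0 and mean intensity from one recorded answer.
# Assumes the praat-parselmouth package; the filename is hypothetical.
import numpy as np
import parselmouth

snd = parselmouth.Sound("child_answer_posttraining.wav")

pitch = snd.to_pitch()
f0 = pitch.selected_array['frequency']
f0 = f0[f0 > 0]                       # keep voiced frames only
intensity = snd.to_intensity()

print(f"mean f0: {np.mean(f0):.1f} Hz")
print(f"mean intensity: {np.mean(intensity.values):.1f} dB")
```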

1aBAa4 – Deep Learning the Sound of Light to Guide Surgeries

Muyinatu Bell — mledijubell@jhu.edu

Johns Hopkins University
3400 N. Charles St.
Baltimore, MD 21218

Popular version of paper 1aBAa4
Presented Monday morning, December 7, 2020
179th ASA Meeting, Acoustics Virtually Everywhere

Injuries to major blood vessels and nerves during surgical procedures such as neurosurgery, spinal fusion surgery, hysterectomies, and biopsies can lead to severe complications for the patient, like paralysis, or even death. Adding to the difficulty for surgeons is that in many cases, these bodily structures are not visible from their immediate viewpoint.

Photoacoustic imaging is a technique that has great potential to aid surgeons by utilizing acoustic responses to light transmission to make images of blood vessels and nerves. However, confusing artifacts that appear in the photoacoustic images, caused by acoustic reflections from bone and other highly reflective structures, challenge this technique and violate the assumptions needed to form accurate images.

Demonstration of ideal image formation (also known as beamforming) vs. beamforming that yields artifacts, distortions, incorrect localization, and acoustic reflections.

This paper summarizes novel methods developed by the Photoacoustic and Ultrasonic Systems Engineering (PULSE) Lab at Johns Hopkins University to eliminate surgical complications by creating more informative images for surgical guidance.

The overall goal of the proposed approach is to learn the unique shape-to-depth relationship of data from point-like photoacoustic sources – such as needle and catheter tips or the tips of surgical tools – in order to provide a deep learning-based replacement for image formation that can more clearly guide surgeons. Accurately determining the proximity of these point-like tips to anatomical landmarks that appear in photoacoustic images – like major blood vessels and nerves – is a critical feature of the entire photoacoustic technology for surgical guidance. Convolutional neural networks (CNNs) – a class of deep neural networks most commonly applied to analyzing visual imagery – were trained, tested, and implemented to achieve the end goal of producing clear and interpretable photoacoustic images.

After training networks on photoacoustic computer simulations, CNNs that achieved greater than 90% source classification accuracy were transferred to real photoacoustic data. These networks were trained to output the locations of both sources and artifacts, as well as classifications of the detected wavefronts. These outputs were then displayed in an image format called CNN-based images, which show each detected point source location – such as a needle or catheter tip – and its location error, as illustrated below.

The well-trained CNN (top) inputs recorded sensor data from three experimental point-like sources (bottom left). This data produces an artifact that appears as a fourth source with traditional beamforming (bottom middle). The CNN-based image (bottom right) more clearly shows the true source locations and eliminates the artifact.
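To give a concrete sense of the kind of network involved, the sketch below sets up an off-the-shelf Faster R-CNN detector with two object classes (source and reflection artifact). It uses torchvision's ResNet-50 FPN backbone as a stand-in; the networks in this work used VGG16 and ResNet101 backbones and were trained on simulated photoacoustic channel data, so this is an illustration of the approach rather than the actual implementation.

```python
# Minimal sketch: a Faster R-CNN detector with two object classes
# ("source" and "reflection artifact"), as a stand-in for the networks described here.
import torch
import torchvision

# num_classes includes background: background + source + artifact = 3.
# weights=None builds an untrained model; the training loop is omitted in this sketch.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None, num_classes=3)
model.eval()

# Photoacoustic channel data represented as a 3-channel image tensor (placeholder values).
channel_data = torch.rand(3, 256, 256)

with torch.no_grad():
    detections = model([channel_data])[0]

# Each detection gives a bounding box (whose center approximates the wavefront/source
# location), a class label (source vs. artifact), and a confidence score.
for box, label, score in zip(detections["boxes"], detections["labels"], detections["scores"]):
    print(label.item(), score.item(), box.tolist())
```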

Overall, classification rates ranged from 92% to 99.62% for simulated data. The network that used a ResNet101 backbone achieved both the highest classification rate (99.62%) and the lowest misclassification rate (0.28%). A similar result was achieved with experimental water bath, phantom, ex vivo, and in vivo tissue data when using the Faster R-CNN architecture with the plain VGG16 convolutional neural network.

This success demonstrates two major breakthroughs for the field of deep learning applied to photoacoustic image formation. First, computer simulations of acoustic wave propagation can be used to successfully train deep neural networks, meaning that extensive experiments are not necessary to generate the thousands of training examples needed for the proposed task. Second, these networks transfer well to real experimental data that were not included during training, meaning that CNN-based images can potentially be incorporated into future products that use the photoacoustic process to minimize errors during surgeries and interventions.

4aMUa1 – Songs: Lyrics on the melody or melody of the lyrics?

Archi Banerjee – archibanerjee7@iitkgp.ac.in
Priyadarshi Patnaik – bapi@hss.iitkgp.ac.in
Rekhi Centre of Excellence for the Science of Happiness
Indian Institute of Technology Kharagpur, 721301, INDIA

Shankha Sanyal – ssanyal.ling@jadavpuruniversity.in
Souparno Roy – thesouparnoroy@gmail.com
Sir C. V. Raman Centre for Physics and Music
Jadavpur University, Kolkata: 700032, INDIA

Popular version of paper 4aMUa1 “Lyrics on the melody or melody of the lyrics?”
Presented Thursday morning, December 10, 2020
179th ASA Meeting, Acoustics Virtually Everywhere
Read the article in Proceedings of Meetings on Acoustics

Musicians often say, “When a marriage happens between a lyric and a melody, only then is a true song born!” But which impacts the audience more in a song, the melody or the lyrics? The answer to this question is still unknown. What happens when the melody is hummed on its own, without the lyrics? How does that affect the acoustical waveform of the original song? Does the emotional appraisal remain the same in both cases? The present work attempts to answer these questions using songs from different genres of Indian music. In Indian Classical Music (ICM), the basic building blocks are Ragas; a bandish is a song composed in a particular Raga. An experienced female professional singer recorded two pairs of contrasting-emotion (happy-sad) Raga bandishes from ICM and one pair of Bengali contemporary songs of opposite emotions (happy-sad). For each piece she was asked to both sing (with the proper, meaningful lyrics) and hum (without any lyrics or meaningful words) the song, keeping the melodic structure, pitch, and tempo the same. The chosen audio clips are:

Genre                        Chosen song                          Primary emotion   Tempo
Indian Classical Music       Raga Multani vilambit bandish        Sad               ~45 bpm
                             Raga Hamsadhwani vilambit bandish    Happy             ~50 bpm
                             Raga Multani drut bandish            Sad               ~90 bpm
                             Raga Hamsadhwani drut bandish        Happy             ~110 bpm
Bengali Contemporary Music   O tota pakhi re                      Sad               ~50 bpm
                             Ami cheye cheye dekhi                Happy             ~130 bpm
Audio 1(a,b): Sample audios of (a) humming and (b) song versions of the same melodic part

Figure 1(a,b): Sample acoustical waveforms and pitch contours of (a) humming and (b) song versions of the same melodic part

Next, using different humming-song pairs from the chosen songs as stimuli, electroencephalogram (EEG) recordings were taken from 5 musically untrained participants who understand the languages of the recordings, Hindi (for the bandishes) and Bengali. Both music and EEG signals have highly complex structures, but their inherent geometry features self-similarity, or structural repetition. A chaos-based nonlinear fractal technique (detrended fluctuation analysis, or DFA) was applied both to the acoustical waveforms and to the corresponding EEG signals. The change in self-similarity was calculated for each humming-song pair to study the impact of lyrics at both the acoustical and neurological levels.
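DFA itself is straightforward to compute: integrate the mean-removed signal, remove a linear trend in windows of increasing size, and measure how the residual fluctuation grows with window size. The sketch below is a minimal illustration of the scaling exponent used in this study, applied here to random noise as a placeholder rather than to our actual audio or EEG data.

```python
# Minimal sketch: detrended fluctuation analysis (DFA) of a 1-D signal,
# as applied here to audio waveforms and EEG channels.
import numpy as np

def dfa_alpha(x, scales=None):
    """Return the DFA scaling exponent (self-similarity measure) of signal x."""
    x = np.asarray(x, dtype=float)
    y = np.cumsum(x - x.mean())                 # integrated (profile) signal
    if scales is None:
        scales = np.unique(np.logspace(np.log10(16), np.log10(len(x) // 4), 20).astype(int))
    fluctuations = []
    for n in scales:
        n_windows = len(y) // n
        segments = y[:n_windows * n].reshape(n_windows, n)
        t = np.arange(n)
        # Remove a linear trend from each window and measure the residual RMS
        coeffs = np.polyfit(t, segments.T, 1)
        trends = np.outer(coeffs[0], t) + coeffs[1][:, None]
        fluctuations.append(np.sqrt(np.mean((segments - trends) ** 2)))
    # The scaling exponent is the slope of log F(n) against log n
    alpha, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return alpha

# Example: uncorrelated white noise should give an exponent near 0.5
print(dfa_alpha(np.random.randn(20000)))
```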


Figure 2(a,b): Variation in DFA scaling exponent in acoustical signals of humming-song pairs taken from songs of (a) Indian Classical music and (b) Bengali contemporary music

Acoustical analysis revealed that in songs where the lyrics are highly projected or emphasized (the slow-tempo vilambit bandishes and the Bengali contemporary songs), the DFA scaling exponent, or self-similarity, decreases from the humming to the song version even though the melodic pattern remains the same. The sudden, spontaneous fluctuations in the pitch and intensity of the song versions, introduced by the consonants, rhythmic variations, and pauses between the words of the lyrics, may account for this lower self-similarity.


Figure 3(a,b): Average of differences in DFA scaling exponent in (a) Frontal electrodes (F3, F4, F7, F8 & FZ) and (b) Occipital (O1 & O2), Parietal (P3 & P4) and Temporal (T3 & T4) electrodes for different Humming-Song pairs of Bengali contemporary songs

EEG analysis revealed that, for both genres of music, in songs with highly projected lyrics the self-similarity in the frontal-lobe electrodes increases from the humming to the song version of the same melody, whereas in the occipital, temporal, and parietal electrodes the DFA scaling exponent increases from humming to song for the slow-tempo songs and decreases for the high-tempo songs.

Combining the acoustic and EEG results, the impact of lyrics was found to be significantly higher in lower-tempo songs than in higher-tempo songs, at both the acoustical and neuro-cognitive levels. This pilot study in the context of Indian music is an attempt to quantitatively analyze the contribution of lyrics in songs of different genres and emotional content, in both the acoustical and neuro-cognitive domains, with the help of a single scaling exponent.