3pMUa4 – Acoustical analysis of 3D-printed snare drums

Chris Jasinski – jasinski@hartford.edu
University of Hartford
200 Bloomfield Ave
West Hartford, CT 06117

Popular version of paper 3pMUa4
Presented Wednesday afternoon, December 09, 2020
179th ASA Meeting, Acoustics Virtually Everywhere

For many years, 3D printing (or additive manufacturing) has been a growing field with applications ranging from desktop trinkets to prototypes for replacements of human organs. Now, Klapel Percussion Instruments has designed its first line of 3D-printed snare drums.

Snare drums are commonly used in drum sets, orchestras, and marching bands. They are traditionally made with wood or metal shells, metal rims, plastic (mylar) skins, and metal connective hardware including bolts, lugs, and fasteners. For the first phase of Klapel prototypes, the shell and rim are replaced with 3D-printed versions made from a proprietary carbon fiber composite; future iterations aim to replace the metal hardware with 3D-printed parts as well. The shell and rim are printed layer by layer until the final shape is formed. Even with high-quality printers, these layers remain visible in the finished texture of 3D-printed objects, appearing as horizontal lines and vertical seams where each layer starts and finishes.

3D-printed snare drum and detail of finished texture.

Klapel Percussion Instruments contacted the University of Hartford Acoustics Program to assess whether a 3D-printed shell and rim change the fundamental vibrational and acoustical characteristics of the drum. To test this, undergraduate students developed a repeatable drum striking device. The machine relies on gravity and a nearly zero-friction bearing to strike a snare drum from a consistent height above the playing surface. With the striking force held constant, the sound produced by the drum was recorded in the University of Hartford’s anechoic chamber (a laboratory designed to eliminate all sound reflections or ‘echoes’, shown in the photo of the striking machine below). The recordings were then analyzed for their frequency content.

Snare drum striking machine inside Paul S. Veneklasen Research Foundation Anechoic Chamber at University of Hartford.
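
For readers who want to try the frequency-analysis step themselves, here is a minimal Python sketch of the kind of spectrum estimate described above, assuming a mono WAV recording of a single strike. The filename strike.wav and the 8192-sample analysis window are illustrative assumptions, not the lab's actual settings.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

fs, x = wavfile.read("strike.wav")       # sample rate (Hz) and raw samples
x = x.astype(float)
x /= np.max(np.abs(x)) or 1.0            # normalize (guard against silence)

# Welch's method averages windowed FFTs for a smoother spectrum estimate
freqs, psd = welch(x, fs=fs, nperseg=8192)

# Report the five most prominent spectral peaks below 1 kHz,
# where the drum's fundamental and first overtones lie
low = freqs < 1000
order = np.argsort(psd[low])[::-1][:5]
for f, p in zip(freqs[low][order], psd[low][order]):
    print(f"{f:7.1f} Hz   {10 * np.log10(p / psd.max()):6.1f} dB re max")
```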

Along with the acoustical testing, the drum shell (the largest single component of a snare drum) underwent ‘modal analysis’, where 30 points are marked on each shell and struck with a calibrated force-sensing hammer. The resulting vibration of the drum is measured with an accelerometer. The fundamental shapes (or ‘modes’) of vibration can then be visualized using processing software.
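
The core calculation behind modal analysis is a frequency response function (FRF): the measured acceleration divided by the measured hammer force, estimated in the frequency domain. Below is a hedged sketch of a single-point H1 estimate using SciPy; the array names, sample rate, and window length are assumptions, and dedicated modal-analysis software adds curve fitting and mode-shape animation on top of this step.

```python
import numpy as np
from scipy.signal import csd, welch

def frf_h1(force, accel, fs, nperseg=4096):
    """H1 FRF estimate: cross-spectrum of force and response over the force auto-spectrum."""
    f, Pxy = csd(force, accel, fs=fs, nperseg=nperseg)
    _, Pxx = welch(force, fs=fs, nperseg=nperseg)
    return f, Pxy / Pxx              # complex accelerance, (m/s^2)/N

# Peaks in |H| mark resonances; combining the FRFs from all 30 measurement
# points (magnitude and phase) yields the visualized mode shapes.
# f, H = frf_h1(force, accel, fs=25600)
# print(f[np.argmax(np.abs(H))], "Hz  <- strongest resonance at this point")
```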

Vibrational mode shapes for maple drum shell [left] and 3D-printed shell [right].

Ultimately, the vibrational and acoustical analyses led to the same conclusion. The fundamental shapes of vibration and the primary frequency content of the snare drum are unaffected by the 3D-printing process. The most prominent audible frequencies and vibrational shapes are identical for the maple wood shell and the carbon fiber 3D-printed shell, as seen in the visualized modes of vibration. This means that 3D-printed drum technology is a viable alternative to more traditional drum manufacturing techniques.

There are substantial, measurable variations that impact the more subtle characteristics of the drum at higher, less prominent frequencies, and for more complex vibration shapes. These are noticeable above 1000 Hz in the frequency analysis comparison.

Frequency analysis at two striking locations for maple (wood) and carbon fiber (3D-printed) drum.

Future testing, including subjective listening tests, will aim to identify how these smaller variations impact listeners and performers. The results of the future tests can help determine how acoustical metrics can predict listener impressions.

2pSCb4 – The Science of Voice Acting

Colette Feehan – cmfeehan@iu.edu
Indiana University

Popular version of paper 2pSCb4
Presented Tuesday afternoon, December 8, 2020
179th ASA Meeting, Acoustics Virtually Everywhere

Many people do not realize that the “children” they hear in animation are actually voiced by adults [1]. There are several reasons for this: children cannot work long hours, they are difficult to direct, and their voices change as they grow. Using an adult who can simulate a child voice bypasses these issues, but surprisingly, not all voice actors (VAs) can create a believable child voice.

Studying what VAs do can tell us about how the vocal tract works. They can speak intelligibly while contorting their mouths in unnatural ways. A few previous studies [2-10] have looked at the acoustics of VAs, that is, just the sounds they produce, such as changes in pitch, voice quality (how raspy or breathy a voice sounds), and the regional dialects they use. This study uses 3D ultrasound and acoustic data from 3 professional and 3 amateur VAs to start answering the question: what do voice actors do with their vocal tracts to sound like a child? There are multiple strategies for making the vocal tract sound smaller, and different actors combine them in different ways to create their child-like voices.

Looking at both the acoustics (the sounds they produce) and the ultrasound imaging of their vocal tracts, the strategies identified so far include gesture fronting and raising, and hyoid bone raising.

Gesture fronting and raising refers to the position of the tongue within the mouth while you speak. If you think about the location of your tongue when repeating “ta ka ta ka…”, you will notice that your tongue touches the roof of your mouth in a different place for each of those consonant sounds: farther forward in the mouth for “ta” and farther back for “ka”. The same is true for vowels. Figure 1, which comes from analyzing the recordings of their speech, shows that the position of the tongue for the adult versus the child voice is quite different for the [i] and [ɑ] sounds for this subject. Given this information, we can then look at the ultrasound and see that the tongue positions are indeed farther forward (right) or higher in the mouth for the child voice, as shown in Figure 2.
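
As an aside for readers curious how the acoustic half of such a comparison can be measured, the sketch below estimates the first two formants (F1 and F2, the vocal tract resonances that distinguish vowels) of an adult and a child-voice vowel using Praat through the parselmouth Python package. The filenames and the 0.5-second measurement time are hypothetical; this is not the study's actual analysis script.

```python
import parselmouth

def f1_f2_at(wav_path, t=0.5):
    """Return the first two formant frequencies (Hz) at time t seconds."""
    snd = parselmouth.Sound(wav_path)
    formants = snd.to_formant_burg()        # Burg-method formant tracking
    return (formants.get_value_at_time(1, t),
            formants.get_value_at_time(2, t))

for label, path in [("adult [i]", "adult_i.wav"), ("child [i]", "child_i.wav")]:
    f1, f2 = f1_f2_at(path)
    print(f"{label}: F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz")
```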

The hyoid bone is a small bone above the larynx in your neck. This bone interrupts the ultrasound signal and prevents an image from showing up, but looking at the location of this hyoid “shadow” can still give us information. If the hyoid shadow is raised and fronted, as seen in Figure 3, it might be the case that the actor is shortening their vocal tract by contracting muscles in their throat.

Figure 4 shows that, for this VA, the hyoid bone shadow was higher throughout the entire utterance while doing a child voice, meaning that the actor may be physically shortening the entire vocal tract the whole time they are speaking.

Data from VAs can help identify alternative ways of producing speech sounds, which could help people with speech impediments and could also help trans individuals sound closer to their identity.

References

  1. Holliday, C. (2012). Emotion capture: Vocal performances by children in the computer-animated film. Alphaville: Journal of Film and Screen Media, 3 (Summer 2012). ISSN: 2009-4078.
  2. Starr, R. L. (2015). Sweet voice: The role of voice quality in a Japanese feminine style. Language in Society, 44(1), 1-34.
  3. Teshigawara, M. (2003). Voices in Japanese animation: A phonetic study of vocal stereotypes of heroes and villains in Japanese culture. Doctoral dissertation.
  4. Teshigawara, M. (2004). Vocally expressed emotions and stereotypes in Japanese animation: Voice qualities of the bad guys compared to those of the good guys. Journal of the Phonetic Society of Japan, 8(1), 60-76.
  5. Teshigawara, M., & Murano, E. Z. (2004). Articulatory correlates of voice qualities of good guys and bad guys in Japanese anime: An MRI study. In Proceedings of INTERSPEECH (pp. 1249-1252).
  6. Teshigawara, M., Amir, N., Amir, O., Wlosko, E., & Avivi, M. (2007). Effects of random splicing on listeners’ perceptions. In Proceedings of the 16th International Congress of Phonetic Sciences (ICPhS).
  7. Teshigawara, M. (2009). Vocal expressions of emotions and personalities in Japanese anime. In K. Izdebski (Ed.), Emotions of the Human Voice, Vol. III: Culture and Perception (pp. 275-287). San Diego: Plural Publishing.
  8. Teshigawara, K. (2011). Voice-based person perception: Two dimensions and their phonetic properties. ICPhS XVII, 1974-1977.
  9. Uchida, T. (2007). Effects of F0 range and contours in speech upon the image of speakers’ personality. Proceedings of the 19th ICA, Madrid. http://www.seaacustica.es/WEB_ICA_07/fchrs/papers/cas-03-024.pdf
  10. Lippi-Green, R. (2011). English with an Accent: Language, Ideology and Discrimination in the United States. Retrieved from https://ebookcentral.proquest.com

5aPPb2 – Using a virtual restaurant to test hearing aid settings

Gregory M Ellis – gregory.ellis@northwestern.edu
Pamela Souza – p-souza@northwestern.edu

Northwestern University
Frances Searle Building
2240 Campus Drive
Evanston, IL 60201

Popular version of paper 5aPPb2
Presented Friday morning, December 11th, 2020
179th ASA Meeting, Acoustics Virtually Everywhere

True scientific discoveries require a series of tightly controlled experiments conducted in lab settings. These kinds of studies tell us how to implement and improve technologies we use every day—technologies like fingerprint scanners, face recognition, and voice recognition. One of the downsides of these tightly controlled environments, however, is that the real world is anything but tightly controlled. Dust may be on your fingerprint, the light may make it difficult for the face recognition software to work, or the background may be noisy, making your voice impossible to pick up. Can we account for these scenarios in the lab when we’re performing experiments? Can we bring the real world—or parts of it—into a lab setting?

In our line of research, we believe we can. While the technologies listed above are interesting in their own right, our research focuses on hearing aid processing. Our lab generally asks: which factors affect speech understanding for a person wearing a hearing aid, and to what extent? The project I’m presenting at this conference looks specifically at environmental and hearing aid processing factors. Environmental factors include the loudness of background noise and the amount of echo. Processing factors involve the software within the hearing aid: algorithms that attempt to reduce or eliminate background noise, and amplification strategies that make relatively quiet parts of speech louder so they’re easier to hear. We are using computer simulations to look at both sets of factors, and we examine their effects on a listener by measuring how they affect speech intelligibility.

The room simulation is first. We built a very simple virtual environment pictured below:

The virtual room used in our experiments. The red dot represents the listener. The green dot represents the speaker. The blue dots represent other people in the restaurant having their own conversations and making noise.

We simulate the properties of the sounds in that room using a model that has been shown to be a good approximation of real recordings made in rooms. After passing the speech from the target speaker and all of the competing talkers through this room model, we have a realistic simulation of the sounds a listener would hear in that room.
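
One freely available way to build this kind of image-source room simulation is the pyroomacoustics Python package. The sketch below is an illustration under assumed room dimensions, surface absorption, and talker positions; it is not the exact model used in our study.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
speech = np.random.randn(2 * fs)      # placeholder for the target talker's speech
babble = np.random.randn(2 * fs)      # placeholder for one competing talker

# A 10 m x 8 m x 3 m "restaurant" with moderately absorptive surfaces
room = pra.ShoeBox([10, 8, 3], fs=fs,
                   materials=pra.Material(0.35), max_order=10)

room.add_source([2.0, 4.0, 1.5], signal=speech)   # talker in front of the listener
room.add_source([7.0, 2.0, 1.5], signal=babble)   # noise source off to the side
room.add_microphone([3.0, 4.0, 1.5])              # the listener's position

room.simulate()                            # convolves each source with its room impulse response
mixture = room.mic_array.signals[0]        # reverberant speech-plus-noise at the listener
```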

If you’re wearing headphones while you read this article, you can listen to an example here:

A woman speaking the sentence “Ten pins were set in order.” You should be able to hear other people talking to your right, all of whom are quieter than the woman in front. All of the sound has a slight echo to it. Note that this will not work if you aren’t wearing headphones!

We then take this simulation and pass it through a hearing aid simulator. This imposes the processing you might expect in a widely-available hearing aid. Here’s an example of what that would sound like:

Same sentence as the restaurant simulation, but this is processed through a simulated hearing aid. You should notice a slightly different pitch to the sentence and the environment. This is because the simulated hearing loss is more extreme at higher pitches.
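
To give a concrete sense of the "amplification strategies" mentioned earlier, here is a deliberately simplified sketch of band-split compression amplification, which boosts quiet sounds more than loud ones and applies more gain at high frequencies. The gains, thresholds, and bands are illustrative assumptions; the hearing aid simulator used in this study is considerably more sophisticated.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def compress_band(x, fs, band, gain_db, ratio=3.0, thresh_db=-40.0):
    """Band-pass x, then apply gain that shrinks as the signal level rises (compression)."""
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    y = sosfilt(sos, x)
    level_db = 20 * np.log10(np.abs(y) + 1e-8)          # crude instantaneous level
    excess = np.maximum(level_db - thresh_db, 0.0)       # dB above compression threshold
    gain = gain_db - excess * (1.0 - 1.0 / ratio)        # output rises 1/ratio dB per input dB
    return y * 10 ** (gain / 20.0)

def simulate_aid(x, fs):
    """Three-band compression amplification; assumes fs of at least 16 kHz."""
    bands = [(100, 1000), (1000, 3000), (3000, 6000)]
    gains_db = [10.0, 20.0, 30.0]        # more gain at high frequencies, as for a sloping loss
    return sum(compress_band(x, fs, b, g) for b, g in zip(bands, gains_db))
```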

Based on the results of hundreds of sentences, we gain a better understanding of how the environmental factors and the hearing aid processing interact. We found that, for listeners with hearing impairment, there is an interaction between noise level and processing strategy, though more data will need to be collected before we can draw any solid conclusions. While these results are a promising first step, there are many more factors to look at—different amounts of echo, different amounts of noise, different types of processing strategies… and none of these factors tells us anything about the person listening to the sentences. Does age, attention span, or degree of hearing loss affect their ability to perform the task? Ongoing and future research will be able to answer these questions.
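
For readers curious how such an interaction can be tested statistically, the sketch below runs a two-way ANOVA with an interaction term using the statsmodels package. The file name and column names are hypothetical, and a full analysis of repeated measures from the same listeners would call for a mixed-effects model rather than this simple version.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical table: one row per scored sentence, with columns
# listener, noise_level, processing, intelligibility (proportion of words correct)
data = pd.read_csv("scores.csv")

# Two-way ANOVA with an interaction between noise level and processing strategy
model = smf.ols("intelligibility ~ C(noise_level) * C(processing)", data=data).fit()
print(anova_lm(model, typ=2))    # the C(noise_level):C(processing) row tests the interaction
```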

This work is important because it shows that we can account for some environmental factors in tightly-controlled research. The method works well and produces results that we would expect to see. If you want results from the lab to be relatable to the real world, try to bring the real world into the lab!

1aSCa3 – Training effects on speech prosody production by Cantonese-speaking children with autism spectrum disorder

Si Chen – sarah.chen@polyu.edu.hk
Bei Li
Fang Zhou
Angel Wing Shan Chan
Tempo Po Yi Tang
Eunjin Chun
Phoebe Choi
Chakling Ng
Fiona Cheng
Xinrui Gou

Department of Chinese and Bilingual Studies
The Hong Kong Polytechnic University
11 Yuk Choi Road, Hung Hom, Hong Kong, China

Popular version of paper 1aSCa3
Presented Monday, December 07, 2020, 9:30 AM – 10:15 AM EST
179th ASA Meeting, Acoustics Virtually Everywhere

Speakers can utilize prosodic variations to express their intentions, states, and emotions. Specifically, the relatively new information of an utterance, namely the focus, is often associated with an expanded range of prosodic cues. The main types of focus are broad, narrow, and contrastive focus. Broad focus covers a whole sentence (A: What did you say? B: [Emily ate an apple]FOCUS), whereas narrow focus emphasizes the one constituent asked about in the question (A: What did Emily eat? B: Emily ate an [apple]FOCUS). Contrastive focus rejects alternative statements (A: Did Emily eat an orange? B: (No,) Emily ate an [apple]FOCUS).

Children with autism spectrum disorder (ASD) have been reported to show difficulties in using speech prosody to mark focus. The research presented here tests whether speech training and sung speech training can improve the use of speech prosody to mark focus. Fifteen Cantonese-speaking children with ASD completed pre- and post-training speech production tasks and received either speech or sung speech training. In the pre- and post-training speech production tasks, we designed games to measure participants’ ability to mark focus in conversations. The training aimed to strengthen the mapping between acoustic cues and information-structure categories through a series of tasks. The conversations used in sung speech training were set to melodies that imitated the change of acoustic cues in speech.

Training.mp4: An example of congruous and incongruous conversation pairs in sung speech training.

Both training methods consisted of three phases. In the first phase, participants listened attentively to congruous conversation pairs in a designed game. In the second phase, participants were told that the odd-numbered trials were incongruous (the focus type that the question elicited did not match that of the answer) and the even-numbered trials were congruous, and they needed to attend to the differences between the odd and even trials. In the third phase, all trials were presented in random order, and participants had to judge whether each pair was congruous or not. Instant feedback was provided after each response.

We extracted acoustic cues from the ASD children’s speech before and after training and performed statistical analyses. Our pilot results showed that both speech and sung speech training may have improved the use of prosodic cues such as intensity and f0 in marking focus across various focus positions (see the figure below). However, children with ASD may still have difficulties integrating all of the prosodic cues across focus conditions.

Mean f0 of narrow focus in the initial position before and after training.
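
As an illustration of the kind of cue extraction involved, the sketch below computes per-utterance mean f0 and mean intensity with Praat via the parselmouth package and compares pre- and post-training values with a paired t-test. The file names, the number of utterances, and the simple t-test are assumptions; the study's statistical models are more elaborate.

```python
import numpy as np
import parselmouth
from scipy.stats import ttest_rel

def mean_f0_and_intensity(wav_path):
    snd = parselmouth.Sound(wav_path)
    f0 = snd.to_pitch().selected_array["frequency"]
    f0 = f0[f0 > 0]                                  # keep voiced frames only
    intensity = snd.to_intensity().values.flatten()  # intensity contour in dB
    return np.mean(f0), np.mean(intensity)

pre = [mean_f0_and_intensity(f"pre_{i}.wav") for i in range(1, 11)]
post = [mean_f0_and_intensity(f"post_{i}.wav") for i in range(1, 11)]

pre_f0 = [m for m, _ in pre]
post_f0 = [m for m, _ in post]
print(ttest_rel(post_f0, pre_f0))   # did mean f0 on focused words change after training?
```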

1aBAa4 – Deep Learning the Sound of Light to Guide Surgeries

Muyinatu Bell – mledijubell@jhu.edu

Johns Hopkins University
3400 N. Charles St.
Baltimore, MD 21218

Popular version of paper 1aBAa4
Presented Monday morning, December 7, 2020
179th ASA Meeting, Acoustics Virtually Everywhere

Injuries to major blood vessels and nerves during surgical procedures such as neurosurgery, spinal fusion surgery, hysterectomies, and biopsies can lead to severe complications for the patient, like paralysis or even death. Adding to the difficulty is that, in many cases, these structures are not visible from the surgeon’s immediate viewpoint.

Photoacoustic imaging is a technique with great potential to aid surgeons: it uses the acoustic responses generated by light transmission to make images of blood vessels and nerves. However, confusing artifacts appear in photoacoustic images when sound reflects off bone and other highly reflective structures; these reflections violate the assumptions used to form the images and limit the technique’s accuracy.

Demonstration of ideal image formation (also known as beamforming) vs. beamforming that yields artifacts, distortions, incorrect localization, and acoustic reflections.
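
For context, conventional image formation of the kind referred to above is typically some variant of delay-and-sum beamforming, which assumes that every recorded wavefront traveled directly from the tissue to the transducer; reflected waves break that assumption and show up as artifacts. Below is a minimal, unoptimized sketch of photoacoustic delay-and-sum under assumed array and sampling parameters; it is illustrative, not the authors' implementation.

```python
import numpy as np

def das_beamform(channel_data, fs, c, pitch, depths_m):
    """channel_data: (n_samples, n_elements) raw data; returns a (n_depths, n_elements) image."""
    n_samples, n_elements = channel_data.shape
    x_elem = (np.arange(n_elements) - n_elements / 2) * pitch    # element x-positions (m)
    elem_idx = np.arange(n_elements)
    image = np.zeros((len(depths_m), n_elements))
    for iz, z in enumerate(depths_m):
        for ix in range(n_elements):
            # one-way time of flight from an assumed source at (x_elem[ix], z) to every element
            dist = np.sqrt((x_elem - x_elem[ix]) ** 2 + z ** 2)
            samp = np.round(dist / c * fs).astype(int)
            valid = samp < n_samples
            image[iz, ix] = channel_data[samp[valid], elem_idx[valid]].sum()
    return image

# Example call with made-up parameters: 40 MHz sampling, 1540 m/s sound speed, 0.3 mm pitch
# image = das_beamform(channel_data, fs=40e6, c=1540.0, pitch=3e-4,
#                      depths_m=np.linspace(0.005, 0.05, 200))
```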

This paper summarizes novel methods developed by the Photoacoustic and Ultrasonic Systems Engineering (PULSE) Lab at Johns Hopkins University to eliminate surgical complications by creating more informative images for surgical guidance.

The overall goal of the proposed approach is to learn the unique shape-to-depth relationship of data from point-like photoacoustic sources – such as needle and catheter tips or the tips of surgical tools – in order to provide a deep learning-based replacement for image formation that can more clearly guide surgeons. Accurately determining the proximity of these point-like tips to anatomical landmarks that appear in photoacoustic images – like major blood vessels and nerves – is a critical feature of the entire photoacoustic technology for surgical guidance. Convolutional neural networks (CNNs) – a class of deep neural networks most commonly applied to analyzing visual imagery – were trained, tested, and implemented to achieve the end goal of producing clear and interpretable photoacoustic images.

After training networks on photoacoustic computer simulations, CNNs that achieved greater than 90% source classification accuracy were transferred to real photoacoustic data. These networks were trained to output the locations of both sources and artifacts, as well as classifications of the detected wavefronts. These outputs were then displayed in an image format called a CNN-based image, which shows each detected point source location – such as a needle or catheter tip – along with its location error, as illustrated below.

The well-trained CNN (top) receives recorded sensor data from three experimental point-like sources (bottom left). These data produce an artifact that appears as a fourth source with traditional beamforming (bottom middle). The CNN-based image (bottom right) shows the true source locations more clearly and eliminates the artifact.
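
As a rough illustration (not the PULSE Lab's actual network), the sketch below defines a small PyTorch CNN that takes a patch of raw channel data and classifies the wavefront it contains as a true source or a reflection artifact; the networks described in the paper additionally localize each detected wavefront.

```python
import torch
import torch.nn as nn

class WavefrontClassifier(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, n_classes)   # class 0: source, class 1: artifact

    def forward(self, x):                 # x: (batch, 1, depth_samples, channels)
        return self.classifier(self.features(x).flatten(1))

# Trained on simulated channel data, then applied to real data:
model = WavefrontClassifier()
patch = torch.randn(1, 1, 128, 128)       # placeholder channel-data patch
logits = model(patch)                     # raw scores for source vs. artifact
```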

Overall, classification rates ranged from 92% to 99.62% for simulated data. The network that used ResNet101 achieved both the best classification performance (99.62%) and the lowest misclassification rate (0.28%). Similar results were achieved with experimental water bath, phantom, ex vivo, and in vivo tissue data when using the Faster R-CNN architecture with a plain VGG16 convolutional neural network.
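
For readers familiar with deep learning frameworks, a Faster R-CNN detector with a plain VGG16 backbone can be assembled in torchvision roughly as sketched below; the anchor sizes, class count, and other hyperparameters here are illustrative rather than the values used in the paper.

```python
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

backbone = torchvision.models.vgg16(weights=None).features
backbone.out_channels = 512                      # FasterRCNN requires this attribute

model = FasterRCNN(
    backbone,
    num_classes=3,                               # background, point source, artifact
    rpn_anchor_generator=AnchorGenerator(sizes=((32, 64, 128),),
                                         aspect_ratios=((0.5, 1.0, 2.0),)),
    box_roi_pool=MultiScaleRoIAlign(featmap_names=["0"], output_size=7,
                                    sampling_ratio=2),
)
model.eval()                                     # ready for inference on channel data images
```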

This success demonstrates two major breakthroughs for the field of deep learning applied to photoacoustic image formation. First, computer simulations of acoustic wave propagation can be used to successfully train deep neural networks, meaning that extensive experiments are not necessary to generate the thousands of training examples needed for the proposed task. Second, these networks transfer well to real experimental data that were not included during training, meaning that CNN-based images can potentially be incorporated into future products that use the photoacoustic process to minimize errors during surgeries and interventions.