1aSC31 – Shape changing artificial ear inspired by bats enriches speech signals – Anupam K Gupta

1aSC31 – Shape changing artificial ear inspired by bats enriches speech signals – Anupam K Gupta

Shape changing artificial ear inspired by bats enriches speech signals

Anupam K Gupta1,2 , Jin-Ping Han ,2, Philip Caspers1, Xiaodong Cui2, Rolf Müller1

1 Dept. of Mechanical Engineering, Virginia Tech, Blacksburg, VA, USA
2 IBM T. J. Watson Research Center, Yorktown, NY, USA

Contact: Jin-Ping Han – hanjp@us.ibm.com

Popular version of paper 1aSC31, “Horseshoe bat inspired reception dynamics embed dynamic features into speech signals.”
Presented Monday morning, Novemeber 28, 2016
172nd ASA Meeting, Honolulu


Have you ever had difficulty understanding what someone was saying to you while walking down a busy big city street, or in a crowded restaurant? Even if that person was right next to you? Words can become difficult to make out when they get jumbled with the ambient noise – cars honking, other voices – making it hard for our ears to pick up what we want to hear. But this is not so for bats. Their ears can move and change shape to precisely pick out specific sounds in their environment.

This biosonar capability inspired our artificial ear research and improving the accuracy of automatic speech recognition (ASR) systems and speaker localization. We asked if could we enrich a speech signal with direction-dependent, dynamic features by using bat-inspired reception dynamics?

Horseshoe bats, for example, are found throughout Africa, Europe and Asia, and so-named for the shape of their noses, can change the shape of their outer ears to help extract additional information about the environment from incoming ultrasonic echoes. Their sophisticated biosonar systems emit ultrasonic pulses and listen to the incoming echoes that reflect back after hitting surrounding objects by changing their ear shape (something other mammals cannot do). This allows them to learn about the environment, helping them navigate and hunt in their home of dense forests.

While probing the environment, horseshoe bats change their ear shape to modulate the incoming echoes, increasing the information content embedded in the echoes. We believe that this shape change is one of the reasons bats’ sonar exhibit such high performance compared to technical sonar systems of similar size.

To test this, we first built a robotic bat head that mimics the ear shape changes we observed in horseshoe bats.


Figure 1: Horseshoe bat inspired robotic set-up used to record speech signal



We then recorded speech signals to explore if using shape change, inspired by the bats, could embed direction-dependent dynamic features into speech signals. The potential applications of this could range from improving hearing aid accuracy to helping a machine more-accurately hear – and learn from – sounds in real-world environments.
We compiled a digital dataset of 11 US English speakers from open source speech collections provided by Carnegie Mellon University. The human acoustic utterances were shifted to the ultrasonic domain so our robot could understand and play back the sounds into microphones, while the biomimetic bat head actively moved its ears. The signals at the base of the ears were then translated back to the speech domain to extract the original signal.
This pilot study, performed at IBM Research in collaboration with Virginia Tech, showed that the ear shape change was, in fact, able to significantly modulate the signal and concluded that these changes, like in horseshoe bats, embed dynamic patterns into speech signals.

The dynamically enriched data we explored improved the accuracy of speech recognition. Compared to a traditional system for hearing and recognizing speech in noisy environments, adding structural movement to a complex outer shape surrounding a microphone, mimicking an ear, significantly improved its performance and access to directional information. In the future, this might improve performance in devices operating in difficult hearing scenarios like a busy street in a metropolitan center.


Figure 2: Example of speech signal recorded without and with the dynamic ear. Top row: speech signal without the dynamic ear, Bottom row: speech signal with the dynamic ear



3pSC87 – What the f***? Making sense of expletives in The Wire – Erica Gold

3pSC87 – What the f***? Making sense of expletives in The Wire – Erica Gold

What the f***? Making sense of expletives in The Wire
Erica Gold – e.gold@hud.ac.uk
Dan McIntyre – d.mcintyre@hud.ac.uk

University of Huddersfield
Huddersfield, HD1 3DH
United Kingdom

Popular version of paper 3pSC87, “ What the f***? Making sense of expletives in ‘The Wire'”

Presented Wednesday afternoon, November 30, 2016
172nd ASA Meeting, Honolulu


In Season one of HBO’s acclaimed crime drama The Wire, Detectives Jimmy McNulty and ‘Bunk’ Moreland are investigating old homicide cases, including the murder of a young woman shot dead in her apartment. McNulty and Bunk visit the scene of the crime to try and figure out exactly how the woman was killed. What makes the scene unusual dramatically is that, engrossed in their investigation, the two detectives communicate with each other using only the word, “fuck” and its variants (e.g. motherfucker, fuckity fuck, etc.). Somehow, using only this vocabulary, McNulty and Bunk are able to communicate in a meaningful way. The scene is absorbing, engaging and even funny, and it leads to a fascinating question for linguists: how is the viewer able to understand what McNulty and Bunk mean when they communicate using such a restricted set of words?


To investigate this, we first looked at what other linguists have discovered about the word fuck. What is clear is that it’s a hugely versatile word that can be used to express a range of attitudes and emotions. On the basis of this research, we came up with a classification scheme which we then used to categorise all the variants of fuck in the scene. Some seemed to convey disbelief and some were used as insults. Some indicated surprise or realization while others functioned to intensify the following word. And some were idiomatic set phrases (e.g. Fuckin’ A!). Our next step was to see whether there was anything in the acoustic properties of the characters’ speech that would allow us to explain why we interpreted the fucks in the way that we did.

The entire conversation between Bunk and McNulty lasts around three minutes and contains a total of 37 fuck productions (i.e. variations of fuck). Due to the variation in the fucks produced, the one clear and consistent segment for each word was the <u> in fuck. Consequently, this became the focus of our study. The <u> in fuck is the same sound you find in the word strut or duck and is represented as /ᴧ/ in the International Phonetic Alphabet. When analysing vowel sounds, such as <u>, we can look at a number of aspects of its production.

In this study, we looked at the quality of the vowel by measuring the first three formants. In phonetics, the term formant refers to acoustic resonances of sound in the vocal tract. The first two formants can tell us if the production sounds more like, “fuck” rather than, “feck” or “fack,” and the third formant gives us information about the voice quality. We also looked at the duration of the <u> being produced, “fuuuuuck” versus “ fuck.”

After measuring each instance, we ran statistical tests to see if there was any relationship between the way in which it was said, and how we categorised its range of meanings. Our results showed that if we accounted for the differences in the vocal tract shapes of the actors playing Bunk and McNulty, the quality of the vowels are relatively consistent. That is, we get a lot of <u> sounds, rather than “eh,” “oo” or “ih.”

The productions of fucks that were associated with the category of realization were found to be very similar to those associated with disbelief. However, disbelief and realization did contrast with those that were used as insults, idiomatic phrases, or functional words. Therefore, it may be more appropriate to classify the meaning into fewer categories – those that signify disbelief or realization, and those that are idiomatic, insults, or functional. It is important to remember, however, that the latter group of three meanings are represented by fewer examples in the scene. Our initial results show that these two broad groups may be distinguished through the length of the vowel – short <u> is more associated with an insult, function, or idiomatic use rather than disbelief or surprise (for which the vowel tends to be longer). In the future, we would also like to analyse the intonation of the productions. See if you can hear the difference between these samples:

Example 1: realization/surprise

Example 2: general expletive which falls under the functional/idiomatic/insult category


Our results shed new light on what for linguists is an old problem: how do we make sense of what people say when speakers so very rarely say exactly what they mean? Experts in pragmatics (the study of how meaning is affected by context) have suggested that we infer meaning when people break conversational norms. In the example from The Wire, it’s clear that the characters are breaking normal communicative conventions. But pragmatic methods of analysis don’t get us very far in explaining how we are able to infer such a range of meaning from such limited vocabulary. Our results confirm that the answer to this question is that meaning is not just conveyed at the lexical and pragmatic level, but at the phonetic level too. It’s not just what we say that’s important, it’s how we fucking say it!

*all photos are from HBO.com



4pEA7 – Acoustic Cloaking Using the Principles of Active Noise Cancellation – Jordan Cheer

4pEA7 – Acoustic Cloaking Using the Principles of Active Noise Cancellation – Jordan Cheer

Acoustic Cloaking Using the Principles of Active Noise Cancellation

Jordan Cheer – j.cheer@soton.ac.uk
Institute of Sound and Vibration Research
University of Southampton
Southampton, UK 

Popular version of paper 4pEA7, “Cancellation, reproduction and cloaking using sound field control”
Presented Thursday morning, December 1, 2016
172nd ASA Meeting, Honolulu

Loudspeakers are synonymous with audio reproduction and are widely used to play sounds people want to hear. Loudspeakers have also been used for the opposite purpose, to attenuate noise that people may not want to hear. Active noise cancellation technology is an example of this, which combines loudspeakers, microphones and digital signal processing to adaptively control unwanted noise sources [1].

More recently, the scientific community has focused attention on controlling and manipulating sound fields to acoustically cloak objects, with the aim of rendering objects acoustically invisible. A new class of engineered materials called metamaterials have already demonstrated this ability [2]. However, acoustic cloaking has also been demonstrated using methods based on both sound field reproduction and active noise cancellation [3]. Despite its demonstration there has been limited research exploring the physical links between acoustic cloaking, active noise cancellation and sound field reproduction. Therefore, we began exploring these links with the aim of developing active acoustic cloaking systems that build on the advanced knowledge of implementing both audio reproduction and active noise cancellation systems.

Acoustic cloaking attempts to control the sound scattered from a solid object. Using a numerical computer simulation, we therefore investigated the physical limits on active acoustic cloaking in the presence of a rigid scattering sphere. The scattering sphere, shown in Figure 1, was surrounded by an array of sources (loudspeakers) used to control the sound field, shown by the black dots surrounding the sphere in the figure. In the first instance we investigated the effect of the scattering sphere on a simple sound field.

Looking at a horizontal slice through the simulated sound field without a scattering object, shown in the second figure, modifications by the presence of the scattering sphere are obvious in comparison to the same slice when the object is present, seen in third figure. Scattering from the sphere distorts the sound field, rendering it acoustically visible.


Figure 1 – The geometry of the rigid scattering sphere and the array of sources, or loudspeakers used to control the sound field (black dots).


Figure 2 – The sound field due to an acoustic plane wave in the free field (without scattering).

Figure 3 – The sound field produced when an acoustic plane wave is incident on the rigid scattering sphere.



To understand the physical limitations on controlling this sound field, and thus implementing an active acoustic cloak, we investigated the ability of the array of loudspeakers surrounding the scattering sphere to achieve acoustic cloaking [4]. In comparison to active noise cancellation, rather than attempting to cancel the total sound field, we only attempted to control the scattered component of the sound field and thus render the sphere acoustically invisible.


With active acoustic cloaking, the sound field appears undisturbed, where the scattered component has been significantly attenuated and results in a field, shown in the fourth figure, that is indistinguishable from the object-less simulation of the Figure 2.


Figure 4 – The sound field produced when active acoustic cloaking is used to attempt to cancel the sound field scattered by a rigid scattering sphere and thus render the scattering sphere  acoustically ‘invisible’.


Our results indicate active acoustic attenuation can be achieved using an array of loudspeakers surrounding a sphere that would otherwise scatter sound detectably. In this and related work[4], further investigations showed that the performance of active acoustic cloaking is most effective when the loudspeakers are in close proximity to the object being cloaked. This may lead to design concepts involving acoustic sources embedded in objects for acoustic cloaking or control of the scattered sound field.


Future work will attempt to demonstrate the performance of active acoustic cloaking experimentally and overcome significant challenges of not only controlling the scattered sound field, but detecting it using an array of microphones.


[1]   P. Nelson and S. J. Elliott, Active Control of Sound, 436 (Academic Press, London) (1992).

[2]   L. Zigoneanu, B.I. Popa, and S.A. Cummer, “Three-dimensional broadband omnidirectional acoustic ground cloak”. Nat. Mater, 13(4), 352-355, (2014).

[3]   E. Friot and C. Bordier, “Real-time active suppression of scattered acoustic radiation”, J. Sound Vib., 278, 563–580 (2004).

[4]   J. Cheer, “Active control of scattered acoustic fields: Cancellation, reproduction and cloaking”, J. Acoust. Soc. Am., 140 (3), 1502-1512 (2016).



4aEA1 – Aero-Acoustic Noise and Control Lab – Seoryong Park

4aEA1 – Aero-Acoustic Noise and Control Lab – Seoryong Park

Aero-Acoustic Noise and Control Lab – Seoryong Park – tjfyd11@snu.ac.kr

School of Mechanical and Aerospace Eng., Seoul National University
301-1214, 1 Gwanak-ro, Gwanak-gu, Seoul 151-742, Republic of Korea

Popular version of paper 4aEA1, “Integrated simulation model for prediction of acoustic environment of launch vehicle”
Presented Thursday morning, December 1, 2016
172nd ASA Meeting, Honolulu

Literally speaking, a “sound” refers to a pressure fluctuation of the air. This means, for example, the sound of a bus passing means our ear senses the pressure fluctuation or pressure variation the bus created. During our daily lives, there are rarely significant pressure fluctuations in the air above common noises, but in special cases it happens. Windows are commonly featured in movies breaking from someone screaming loudly or in high pitches in the movie. This is usually exaggerated, but not out of the realm of what is physically possible.
The pressure fluctuations in the air caused by sound can cause engineering problems for loud structures such as rockets, especially given that the pressure nature of the sounds waves that means louder sounds result from larger pressure fluctuations and can cause more damage. Rocket launches are particularly loud and the resulting pressure change in the air can affect the surface of the launched vehicle as the form of the force shown as Figure 1.

Figure 1. The Magnitude of Acoustic Loads on the Launch Vehicle

As the vehicle is launched (Figure. 2),it reaches volumes over 180dB, which corresponds to about 20,000 Pascals in pressure change. This pressure change is about 20% of atmospheric pressure, which is considered very large. Because of the pressure change during launching, communication equipment and antenna panel can incur damage, causing the malfunctioning of the fairing, the protective cone covering the satellite. In the engineering field, the load created by the launching noise is called acoustic load, and many studies are in progress related to acoustic load.

Studies focused on the relationship between a launching vehicle and its acoustic load is categorized, to rocket engineers, under “prediction and control.” Prediction is divided into two aspects: internal acoustic load; and external acoustic load. Internal acoustic load refers to sound delivered from outside to inside, while external acoustic load is the noise directly from the jet fire. There are two ways to predict the external acoustic load, namely an empirical method and numerical method. The empirical method was developed by NASA in 1972 and uses the collected information from various studies. The numerical method employs mathematical formulas related to noise and electric wave calculated using computer modeling. As computers become more powerful, this method continues to gain favor. However, because numerical methods require so much calculation time, they often require the use of dedicated computing centers. Our team instead focused on using the more efficient and faster empirical method. fig-3-external-acoustic-loads-prediction-result-%28spectrum%29

Figure 3 shows the results of our calculations, depicting the expected sound spectrum. We can consider various physics principles involved during a lift-off, such as sound reflection, diffraction and impingement that could affect the original empirical method results.
Meanwhile, our team used a statistical energy analysis method to predict the internal acoustic load caused by the predicted external acoustic load. This method is used often to predict internal noise environments. It is used to predict the internal noise of a launching vehicle as well as aircraft and automobile noise. Our research team used a program called, VA One SEA, for predicting these noise effects, shown as figure. 4.

fig-4-modeling-of-the-payloads-and-forcing-of-the-external-acoustic-loads EMB00002e70bbf9
Figure 4. Modeling of the Payloads and Forcing of the External Acoustic Loads
After predicting internal acoustic load, we decreased the acoustic load to conduct an internal noise control study. A common way to do this is by sticking noise-reducing material to the structure. However, the extra weight from the noise-reducing material can cause decreased performance. To overcome this side effect, we also conducted a study about active noise control, which is in progress. Active noise control refers to reducing the noise by making antiphase waves of the sound for cancelling. Figure 5 shows the experimental results of applied SISO Noise Control, showing the reduction of noise is significant, especially for low frequencies.

Figure 5. Experimental Results of SISO Active Noise Control
Our research team applied the acoustic load prediction method and control method to the Korean launching vehicle, KSR-111. Through this application, we developed an improved empirical prediction method that is more accurate than previous methods, and we found usefulness of the noise control as we established the best algorithm for our experimental facilities and the active noise control area.


2pSC – How do narration experts provide expressive storytelling in Japanese fairy tales? -Takashi Saito

2pSC – How do narration experts provide expressive storytelling in Japanese fairy tales? -Takashi Saito

How do narration experts provide expressive storytelling in Japanese fairy tales?

Takashi Saito – saito@sc.shonan-it.ac.jp
Shonan Institute of Technology
1-1-25 Tsujido-Nishikaigan,
Fujisawa, Kanagawa, JAPAN

Popular version of paper 2pSC, “Prosodic analysis of storytelling speech in Japanese fairy tale”
Presented Tuesday afternoon, November 29, 2016
172nd ASA Meeting, Honolulu

Recent advances in speech synthesis technologies bring us relatively high quality synthetic speech, as smartphones today often provide it with speech message output. The acoustic sound quality especially seems to sometimes be particularly close to that of human voices. Prosodic aspects, or the patterns of rhythm and intonation, however, still have large room for improvement. The overall speech messages generated by speech synthesis systems sound somewhat awkward and monotonous. In other words, those messages lack expressiveness of speech compared with human speech. One of the reasons for this is that most systems use a one-sentence speech synthesis scheme in which each sentence in the message is generated independently, connected just to construct the message. The lack of expressiveness might hinder widening the range of applications for speech synthesis. Storytelling is a typical application to expect speech synthesis to be capable of having a control mechanism beyond just one sentence to provide really vivid and expressive storytelling. This work attempts to investigate the actual storytelling strategies of human narration experts for the purpose of ultimately reflecting them on the expressiveness of speech synthesis.

A Japanese popular fairy tale titled, “The Inch-High Samurai,” in its English translation was the storytelling material in this study. It is a short story taking about six minutes to tell verbally. The story consists of four elements typically found in simple fairy tales: introduction, build-up, climax, and ending. These common features suit the story well for observing prosodic changes in the story’s flow. The story was told by six narration experts (four female and two male narrators) and were recorded. First, we were interested in what they were thinking while telling the story, so we interviewed them on their actual reading strategies after the recording. We found they usually did not adopt fixed reading techniques for each sentence, but tried to go into the world of the story, and make a clear image of characters appearing in the story, as would an actor. They also reported paying attention to the following aspects of the scenes associated with the story elements: In the introduction, featuring the birth of the little Samurai character, they started to speak slowly and gently in effort to grasp the hearts of listeners. In the story’s climax, depicting the extermination of the devil character, they tried to express a tense feeling through a quick rhythm and tempo. Finally, in the ending, they gradually changed their reading styles to make the audience understand that the happy ending is coming soon.

For all six speakers a baseline speech segmentation was conducted for words, and accentual phrases in a semi-automatic way. We then used a multi-layered prosodic tagging method, performed manually, to provide information on various changes of “story states” relevant to impersonation, emotional involvement and scene flow control. Figure 1 shows an example of the labeled speech data. Wavesurfer [1] software served as our speech visualization and labelling tool. The example utterance contains a part of the storyteller’s speech (containing the phrase “oniwa bikkuridesu” meaning, “the devil was surprised,” and devil’s part, “ta ta tasukekuree,” meaning, “please help me!”) and is shown in the top label pane for characters (chrlab). The second top label pane (evelab) shows event labels such as scene changes and emotional involvement (desire, joy, fear, etc…). In this example, a “fear” event is attached to the devil’s utterance part. The dynamic pitch movement can be observed in the pitch contour pane located at the bottom of the figure.


How are the events of scene change or emotional involvement provided by human narrators manifested in speech data? Prosodic parameters of speed, measured in speech rate or mora/sec; pitch, measured in Hz; power, measured in dB; and preceding pause length, measured in seconds, are investigated for all the breath groups in the speech data. Breath group refers to a speech segment which is uttered consecutively without pausing. Figure 2, 3 and 4 show these parameters at a scene-change event (Figure 2), desire event (Figure 3), and fear event (Figure 4). The axis on the left of the figures shows the ratio of the parameter to its average value. Each event has its own distinct tendency on prosodic parameters, also seen in the figures, which seems to be fairly common to all speakers. For instance, the differences between the scene-change event and the desire event are the amount of preceding pause and the degree of the contributions from the other three parameters. The fear event shows a quite different tendency from other events, but it is common to all speakers though the degree of the parameter movement differs between speakers. Figure 5 shows how to expresses character differences, when the reader impersonates the story’s characters, with the three parameters. In short, speed and pitch are changed dynamically for impersonation, and this is a common tendency of all speakers.

Based on findings obtained from these human narrations, we are designing a framework of mapping story events through scene changes and emotional involvement to prosodic parameters. Simultaneously, it is necessary to build additional databases to ensure and reinforce story event description and mapping framework.

[1] Wavesurfer: http://www.speech.kth.se/wavesurfer/