November, 2016 | Acoustics.org

3pSC87 – What the f***? Making sense of expletives in The Wire

Erica Gold – e.gold@hud.ac.uk
Dan McIntyre – d.mcintyre@hud.ac.uk

University of Huddersfield
Queensgate
Huddersfield, HD1 3DH
United Kingdom

Popular version of 3pSC87 – What the f***: An acoustic-pragmatic analysis of meaning in The Wire
Presented Wednesday afternoon, November 30, 2016
172nd ASA Meeting, Honolulu

In season one of HBO’s acclaimed crime drama The Wire, Detectives Jimmy McNulty and ‘Bunk’ Moreland are investigating old homicide cases, including the murder of a young woman shot dead in her apartment. McNulty and Bunk visit the scene of the crime to try to figure out exactly how the woman was killed. What makes the scene dramatically unusual is that, engrossed in their investigation, the two detectives communicate with each other using only the word “fuck” and its variants (e.g. motherfucker, fuckity fuck, etc.). Somehow, using only this vocabulary, McNulty and Bunk are able to communicate in a meaningful way. The scene is absorbing, engaging and even funny, and it leads to a fascinating question for linguists: how is the viewer able to understand what McNulty and Bunk mean when they communicate using such a restricted set of words?

To investigate this, we first looked at what other linguists have discovered about the word fuck. What is clear is that it’s a hugely versatile word that can be used to express a range of attitudes and emotions. On the basis of this research, we came up with a classification scheme which we then used to categorise all the variants of fuck in the scene. Some seemed to convey disbelief and some were used as insults. Some indicated surprise or realization while others functioned to intensify the following word. And some were idiomatic set phrases (e.g. Fuckin’ A!). Our next step was to see whether there was anything in the acoustic properties of the characters’ speech that would allow us to explain why we interpreted the fucks in the way that we did.

The entire conversation between Bunk and McNulty lasts around three minutes and contains a total of 37 fuck productions (i.e. variations of fuck). Because the variants differ so much, the one clear and consistent segment in each word was the vowel in fuck. Consequently, this became the focus of our study. The vowel in fuck is the same sound you find in the words strut and duck, and is represented as /ʌ/ in the International Phonetic Alphabet. When analysing a vowel sound such as /ʌ/, we can look at a number of aspects of its production.

In this study, we looked at the quality of the vowel by measuring the first three formants. In phonetics, the term formant refers to an acoustic resonance of the vocal tract. The first two formants can tell us if the production sounds more like “fuck” than “feck” or “fack,” and the third formant gives us information about the voice quality. We also looked at the duration of the vowel being produced: “fuuuuuck” versus “fuck.”

After measuring each instance, we ran statistical tests to see if there was any relationship between the way in which it was said and how we categorised its range of meanings. Our results showed that, once we accounted for the differences in the vocal tract shapes of the actors playing Bunk and McNulty, the quality of the vowels is relatively consistent. That is, we get a lot of /ʌ/ sounds, rather than “eh,” “oo” or “ih.”
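One common way to account for differences between speakers' vocal tracts is Lobanov normalization, which rescales each speaker's formant values against that speaker's own mean and spread. The summary above does not say which method was used, so the sketch below is only an illustration of the general idea. The function name and the F1 values are hypothetical and are not the study's data.

```python
from statistics import mean, pstdev


def lobanov_normalize(formants_by_speaker):
    """Z-score each speaker's formant values against that speaker's own
    mean and (population) standard deviation."""
    normalized = {}
    for speaker, values in formants_by_speaker.items():
        mu = mean(values)
        sigma = pstdev(values)
        if sigma == 0:
            raise ValueError(f"no variation in values for {speaker}")
        normalized[speaker] = [(v - mu) / sigma for v in values]
    return normalized


# Hypothetical first-formant (F1) values in Hz, invented for illustration.
# speaker_b produces the same pattern as speaker_a, just 100 Hz lower.
f1_hz = {
    "speaker_a": [600.0, 650.0, 700.0],
    "speaker_b": [500.0, 550.0, 600.0],
}

normalized = lobanov_normalize(f1_hz)
```

After normalization the two hypothetical speakers have identical values, so any remaining differences between tokens would reflect how the vowel was said rather than who said it.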

The productions of fuck that were associated with the category of realization were found to be very similar to those associated with disbelief. However, disbelief and realization did contrast with those that were used as insults, idiomatic phrases, or functional words. Therefore, it may be more appropriate to classify the meaning into fewer categories – those that signify disbelief or realization, and those that are idiomatic, insults, or functional. It is important to remember, however, that the latter group of three meanings is represented by fewer examples in the scene. Our initial results show that these two broad groups may be distinguished through the length of the vowel: a short vowel is more associated with an insult, function, or idiomatic use than with disbelief or surprise (for which the vowel tends to be longer). In the future, we would also like to analyse the intonation of the productions. See if you can hear the difference between these samples:

Example 1: realization/surprise

Example 2: general expletive which falls under the functional/idiomatic/insult category

Our results shed new light on what for linguists is an old problem: how do we make sense of what people say when speakers so very rarely say exactly what they mean? Experts in pragmatics (the study of how meaning is affected by context) have suggested that we infer meaning when people break conversational norms. In the example from The Wire, it’s clear that the characters are breaking normal communicative conventions. But pragmatic methods of analysis don’t get us very far in explaining how we are able to infer such a range of meaning from such limited vocabulary. Our results confirm that the answer to this question is that meaning is not just conveyed at the lexical and pragmatic level, but at the phonetic level too. It’s not just what we say that’s important, it’s how we fucking say it!

*all photos are from HBO.com

1aSC31 – Shape changing artificial ear inspired by bats enriches speech signals

Anupam K Gupta¹,², Jin-Ping Han², Philip Caspers¹, Xiaodong Cui², Rolf Müller¹

¹ Dept. of Mechanical Engineering, Virginia Tech, Blacksburg, VA, USA
² IBM T. J. Watson Research Center, Yorktown, NY, USA

Contact: Jin-Ping Han – hanjp@us.ibm.com

Popular version of paper 1aSC31, “Horseshoe bat inspired reception dynamics embed dynamic features into speech signals.”
Presented Monday morning, November 28, 2016
172nd ASA Meeting, Honolulu

Have you ever had difficulty understanding what someone was saying to you while walking down a busy big city street, or in a crowded restaurant? Even if that person was right next to you? Words can become difficult to make out when they get jumbled with the ambient noise – cars honking, other voices – making it hard for our ears to pick up what we want to hear. But this is not so for bats. Their ears can move and change shape to precisely pick out specific sounds in their environment.

This biosonar capability inspired our artificial ear research, which aims to improve the accuracy of automatic speech recognition (ASR) systems and speaker localization. We asked: could we enrich a speech signal with direction-dependent, dynamic features by using bat-inspired reception dynamics?

Horseshoe bats, for example, which are found throughout Africa, Europe and Asia and are so named for the shape of their noses, can change the shape of their outer ears to help extract additional information about the environment from incoming ultrasonic echoes. Their sophisticated biosonar systems emit ultrasonic pulses and then listen to the echoes that reflect back from surrounding objects, changing their ear shape as they do so (something other mammals cannot do). This allows them to learn about the environment, helping them navigate and hunt in their home of dense forests.

While probing the environment, horseshoe bats change their ear shape to modulate the incoming echoes, increasing the information content embedded in the echoes. We believe that this shape change is one of the reasons bats’ sonar exhibits such high performance compared to technical sonar systems of similar size.

To test this, we first built a robotic bat head that mimics the ear shape changes we observed in horseshoe bats.

Figure 1: Horseshoe bat inspired robotic set-up used to record speech signal

We then recorded speech signals to explore whether shape change inspired by the bats could embed direction-dependent dynamic features into speech signals. The potential applications range from improving hearing aid accuracy to helping a machine more accurately hear – and learn from – sounds in real-world environments.

We compiled a digital dataset of 11 US English speakers from open source speech collections provided by Carnegie Mellon University. The human acoustic utterances were shifted to the ultrasonic domain so our robot could understand the sounds and play them back into microphones while the biomimetic bat head actively moved its ears. The signals at the base of the ears were then translated back to the speech domain to extract the original signal.

This pilot study, performed at IBM Research in collaboration with Virginia Tech, showed that the ear shape change was, in fact, able to significantly modulate the signal. We concluded that these changes, as in horseshoe bats, embed dynamic patterns into speech signals.
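The summary does not say how the speech was shifted into the ultrasonic range. One simple possibility is time compression: the same samples are played back at a much higher rate, which multiplies every frequency by the speed-up factor, and slowing the recording down again reverses the shift. The sketch below illustrates only that assumed approach. The sample rate, the speed-up factor, and the function names are all hypothetical.

```python
import math


def make_tone(freq_hz, sample_rate_hz, n_samples):
    """Generate a sine tone as a list of samples."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]


def estimate_frequency(samples, sample_rate_hz):
    """Estimate a tone's frequency from its upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate_hz / len(samples)


SPEECH_RATE_HZ = 16_000  # hypothetical recording rate
SPEED_UP = 20            # hypothetical time-compression factor

# One second of a 3 kHz tone standing in for a speech component.
samples = make_tone(3_000, SPEECH_RATE_HZ, SPEECH_RATE_HZ)

# Same samples, two playback clocks: only the assumed sample rate changes.
speech_hz = estimate_frequency(samples, SPEECH_RATE_HZ)
ultrasonic_hz = estimate_frequency(samples, SPEECH_RATE_HZ * SPEED_UP)
```

With these assumed numbers the 3 kHz component lands near 60 kHz, well above the range of human hearing, and the whole utterance becomes 20 times shorter.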

The dynamically enriched data we explored improved the accuracy of speech recognition. Compared to a traditional system for hearing and recognizing speech in noisy environments, adding structural movement to a complex outer shape surrounding a microphone, mimicking an ear, significantly improved its performance and access to directional information. In the future, this might improve performance in devices operating in difficult hearing scenarios like a busy street in a metropolitan center.

Figure 2: Example of speech signal recorded without and with the dynamic ear. Top row: speech signal without the dynamic ear, Bottom row: speech signal with the dynamic ear

4pEA7 – Acoustic Cloaking Using the Principles of Active Noise Cancellation

Jordan Cheer – j.cheer@soton.ac.uk
Institute of Sound and Vibration Research
University of Southampton
Southampton, UK

Popular version of paper 4pEA7, “Cancellation, reproduction and cloaking using sound field control”
Presented Thursday morning, December 1, 2016
172nd ASA Meeting, Honolulu

Loudspeakers are synonymous with audio reproduction and are widely used to play sounds people want to hear. Loudspeakers have also been used for the opposite purpose, to attenuate noise that people may not want to hear. Active noise cancellation technology is an example of this, which combines loudspeakers, microphones and digital signal processing to adaptively control unwanted noise sources [1].

More recently, the scientific community has focused attention on controlling and manipulating sound fields to acoustically cloak objects, with the aim of rendering them acoustically invisible. A new class of engineered materials called metamaterials has already demonstrated this ability [2]. However, acoustic cloaking has also been demonstrated using methods based on both sound field reproduction and active noise cancellation [3]. Despite these demonstrations, there has been limited research exploring the physical links between acoustic cloaking, active noise cancellation and sound field reproduction. We therefore began exploring these links, with the aim of developing active acoustic cloaking systems that build on existing advanced knowledge of implementing both audio reproduction and active noise cancellation systems.

Acoustic cloaking attempts to control the sound scattered from a solid object. Using a numerical computer simulation, we therefore investigated the physical limits on active acoustic cloaking in the presence of a rigid scattering sphere. The scattering sphere, shown in Figure 1, was surrounded by an array of sources (loudspeakers) used to control the sound field, shown by the black dots surrounding the sphere in the figure. In the first instance we investigated the effect of the scattering sphere on a simple sound field.

Comparing a horizontal slice through the simulated sound field without a scattering object, shown in Figure 2, with the same slice when the sphere is present, shown in Figure 3, makes the modifications caused by the scattering sphere obvious. Scattering from the sphere distorts the sound field, rendering the sphere acoustically visible.

Figure 1 – The geometry of the rigid scattering sphere and the array of sources, or loudspeakers, used to control the sound field (black dots).
Figure 2 – The sound field due to an acoustic plane wave in the free field (without scattering).
Figure 3 – The sound field produced when an acoustic plane wave is incident on the rigid scattering sphere.
Figure 4 – The sound field produced when active acoustic cloaking is used to attempt to cancel the sound field scattered by a rigid scattering sphere and thus render the scattering sphere acoustically ‘invisible’.

To understand the physical limitations on controlling this sound field, and thus implementing an active acoustic cloak, we investigated the ability of the array of loudspeakers surrounding the scattering sphere to achieve acoustic cloaking [4]. In comparison to active noise cancellation, rather than attempting to cancel the total sound field, we only attempted to control the scattered component of the sound field and thus render the sphere acoustically invisible.
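The idea of cancelling only the scattered component can be illustrated at a single point and a single frequency. If the scattered pressure is taken as 1, an ideal control signal is exactly -1, and any gain or phase error in the control signal leaves a residual. The sketch below is a simplified illustration of that principle, not the simulation used in the paper. The function name and the error values are hypothetical.

```python
import cmath
import math


def residual_level_db(gain_error, phase_error_deg):
    """Level of the remaining scattered field, in dB relative to the
    uncontrolled scattered field (taken as amplitude 1).

    The control field is -gain_error * exp(j * phase_error), so a gain
    error of 1.0 and a phase error of 0 degrees give perfect cancellation.
    """
    control = -gain_error * cmath.exp(1j * math.radians(phase_error_deg))
    residual = abs(1 + control)
    if residual == 0:
        return float("-inf")
    return 20 * math.log10(residual)


no_control = residual_level_db(0.0, 0.0)      # control switched off
gain_only = residual_level_db(0.9, 0.0)       # control 10% too weak
phase_only = residual_level_db(1.0, 10.0)     # control 10 degrees late
perfect = residual_level_db(1.0, 0.0)         # ideal control signal
```

In this simplified picture a 10% amplitude error limits the reduction of the scattered component to 20 dB, and a 10 degree phase error limits it to roughly 15 dB, which hints at why accurate control of the secondary sources matters.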

With active acoustic cloaking, the scattered component is significantly attenuated and the sound field appears undisturbed. The resulting field, shown in Figure 4, is indistinguishable from the object-free simulation in Figure 2.

Our results indicate that active acoustic attenuation can be achieved using an array of loudspeakers surrounding a sphere that would otherwise scatter sound detectably. In this and related work [4], further investigations showed that active acoustic cloaking is most effective when the loudspeakers are in close proximity to the object being cloaked. This may lead to design concepts in which acoustic sources are embedded in objects for acoustic cloaking or control of the scattered sound field.

Future work will attempt to demonstrate the performance of active acoustic cloaking experimentally and overcome significant challenges of not only controlling the scattered sound field, but detecting it using an array of microphones.

[1] P. Nelson and S. J. Elliott, Active Control of Sound, 436 (Academic Press, London) (1992).

[2] L. Zigoneanu, B.I. Popa, and S.A. Cummer, “Three-dimensional broadband omnidirectional acoustic ground cloak”, Nat. Mater., 13(4), 352-355 (2014).

[3] E. Friot and C. Bordier, “Real-time active suppression of scattered acoustic radiation”, J. Sound Vib., 278, 563–580 (2004).

[4] J. Cheer, “Active control of scattered acoustic fields: Cancellation, reproduction and cloaking”, J. Acoust. Soc. Am., 140 (3), 1502-1512 (2016).

1aNS4 – Musical mind control: Human speech takes on characteristics of background music

Ryan Podlubny – ryan.podlubny@pg.canterbury.ac.nz
Department of Linguistics, University of Canterbury
20 Kirkwood Avenue, Upper Riccarton
Christchurch, NZ, 8041

Popular version of paper 1aNS4, “Musical mind control: Acoustic convergence to background music in speech production.”
Presented Monday morning, November 28, 2016
172nd ASA Meeting, Honolulu

People often adjust their speech to resemble that of their conversation partners – a phenomenon known as speech convergence. Broadly defined, convergence describes automatic synchronization to some external source, much like running to the beat of music playing at the gym without intentionally choosing to do so. Across a variety of studies, a general trend has emerged in which people automatically synchronize to various aspects of their environment¹,²,³. With specific regard to language use, convergence effects have also been observed in many linguistic domains such as sentence-formation⁴, word-formation⁵, and vowel production⁶ (where differences in vowel production are well associated with perceived accentedness⁷,⁸). This prevalence in linguistics raises many interesting questions about the extent to which speakers converge. This research uses a speech-in-noise paradigm to explore whether or not speakers also converge to non-linguistic signals in the environment. Specifically, will a speaker’s rhythm, pitch, or intensity (which is closely related to loudness) be influenced by fluctuations in background music, such that the speech echoes specific characteristics of that music? For example, if the tempo of the background music slows down, will listeners unconsciously decrease their speech rate?

In this experiment, participants read passages aloud while hearing music through headphones. Background music was composed by the experimenter to be relatively stable with regard to pitch, tempo/rhythm, and intensity, so that we could manipulate and test only one of these dimensions at a time within each test-condition. We imposed these manipulations gradually and consistently toward a target, as can be seen in Figure 1, and then returned them in the same way to the level at which they started. Between all manipulated sessions, we played participants music with no experimental changes. (Examples of what participants heard in headphones are available as sound-files 1 and 2.)

Fig. 1: Using software designed for digital signal processing (analyzing and altering sound), manipulations were applied in a linear fashion (in a straight line) toward a target – this can be seen above as the blue line, which first rises and then falls. NOTE: After manipulations reach their target (the target is seen above as a dashed, vertical red line), the degree of manipulation would then return to the level at which it started in a similar linear fashion. Graphic captured while using Praat⁹ to increase and then decrease the perceived loudness of the background music.
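The rise-and-fall shape described in the caption can be sketched as a simple triangular ramp. This is only an illustration of a linear manipulation toward a target and back. The function name, the number of steps, and the position of the peak are hypothetical, and the 200-cent target is borrowed from the pitch example in Sound-file 2.

```python
def triangular_ramp(n_steps, peak_index, target):
    """Return n_steps values that rise linearly from 0 to target at
    peak_index, then fall linearly back to 0 at the final step."""
    if not 0 < peak_index < n_steps - 1:
        raise ValueError("peak_index must lie strictly inside the ramp")
    last = n_steps - 1
    ramp = []
    for i in range(n_steps):
        if i <= peak_index:
            ramp.append(target * i / peak_index)
        else:
            ramp.append(target * (last - i) / (last - peak_index))
    return ramp


# Hypothetical manipulation: pitch drops by 200 cents at the midpoint of
# nine equally spaced steps, then returns to its starting level.
ramp_cents = triangular_ramp(n_steps=9, peak_index=4, target=-200.0)
```

Each value in the list is the amount of manipulation applied at that step, so the first and last steps are unaltered and the middle step carries the full target.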

Data from 15 native speakers of New Zealand English were analyzed using statistical tests that allow effects to vary somewhat for each participant. We observed significant convergence in both the pitch and intensity conditions. Analysis of the tempo condition, however, has not yet been conducted. Interestingly, these effects appear to differ systematically based on a person’s previous musical training. While non-musicians demonstrate the predicted effect and follow the manipulations, musicians appear to invert the effect and reliably alter aspects of their pitch and intensity in the opposite direction of the manipulation (see Figure 2). Sociolinguistic research indicates that under certain conditions speakers will emphasize characteristics of their speech to distinguish themselves socially from conversation partners or groups, as opposed to converging with them⁶. It seems plausible then that, given a relatively heightened ability to recognize low-level variations of sound, musicians may on some cognitive level be more aware of the variation in their sound environment, and as a result similarly resist the more typical effect. However, more work is required to better understand this phenomenon.

Fig. 2: The above plots measure pitch on the y-axis (up and down on the left edge), and indicate the portions of background music that have been manipulated on the x-axis (across the bottom). The blue lines show that speakers generally lower their pitch as an un-manipulated condition progresses. However, the red lines show that when global pitch is lowered during a test-condition, such lowering is relatively more dramatic for non-musicians (left plot) and that the effect is reversed by those with musical training (right plot). NOTE: A follow-up model further accounts for the relatedness of Pitch and Intensity and shows much the same effect.

This work indicates that speakers are not only influenced by human speech partners in production, but also, to some degree, by noise within the immediate speech environment, which suggests that environmental noise may constantly be influencing certain aspects of our speech production in very specific and predictable ways. Human listeners are rather talented when it comes to recognizing subtle cues in speech¹⁰, especially compared to computers and algorithms that can’t yet match this ability. Some language scientists argue these changes in speech occur to make understanding easier for those listening¹¹. That is why work like this is likely to resonate in both academia and the private sector, as a better understanding of how speech will change in different environments contributes to the development of more effective aids for the hearing impaired, as well as improvements to many devices used in global communications.

Sound-file 1.
An example of what participants heard as a control condition (no experimental manipulation) in between test-conditions.

Sound-file 2.
An example of what participants heard as a test condition (Pitch manipulation, which drops 200 cents/one full step).
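For readers unfamiliar with cents, the unit divides an octave into 1200 equal steps on a logarithmic scale, so a shift in cents corresponds to a fixed frequency ratio. The short sketch below shows the conversion. The 220 Hz starting frequency is a hypothetical example and is not taken from the study's stimuli.

```python
def cents_to_ratio(cents):
    """Convert a pitch shift in cents to a frequency ratio
    (1200 cents = one octave = a ratio of 2)."""
    return 2.0 ** (cents / 1200.0)


def shift_frequency(freq_hz, cents):
    """Return freq_hz shifted by the given number of cents."""
    return freq_hz * cents_to_ratio(cents)


ratio = cents_to_ratio(-200)             # about 0.891
lowered = shift_frequency(220.0, -200)   # about 196 Hz
```

A drop of 200 cents therefore lowers every frequency by roughly 11%, whatever the starting pitch.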

References

1. Hill, A. R., Adams, J. M., Parker, B. E., & Rochester, D. F. (1988). Short-term entrainment of ventilation to the walking cycle in humans. Journal of Applied Physiology, 65(2), 570-578.
2. Will, U., & Berg, E. (2007). Brain wave synchronization and entrainment to periodic acoustic stimuli. Neuroscience letters, 424(1), 55-60.
3. McClintock, M. K. (1971). Menstrual synchrony and suppression. Nature, Vol 229, 244-245.
4. Branigan, H. P., Pickering, M. J., McLean, J. F., & Cleland, A. A. (2007). Syntactic alignment and participant role in dialogue. Cognition, 104(2), 163-197.
5. Beckner, C., Rácz, P., Hay, J., Brandstetter, J., & Bartneck, C. (2015). Participants Conform to Humans but Not to Humanoid Robots in an English Past Tense Formation Task. Journal of Language and Social Psychology, 0261927X15584682. Retrieved from: http://jls.sagepub.com.ezproxy.canterbury.ac.nz/content/early/2015/05/06/0261927X15584682.
6. Babel, M. (2012). Evidence for phonetic and social selectivity in spontaneous phonetic imitation. Journal of Phonetics, 40(1), 177-189.
7. Major, R. C. (1987). English voiceless stop production by speakers of Brazilian Portuguese. Journal of Phonetics, 15, 197-202.
8. Rekart, D. M. (1985). Evaluation of foreign accent using synthetic speech. Ph.D. dissertation, Louisiana State University.
9. Boersma, P., & Weenink, D. (2014). Praat: Doing phonetics by computer (Version 5.4.04) [Computer program]. Retrieved from www.praat.org.
10. Hay, J., Podlubny, R., Drager, K., & McAuliffe, M. (under review). Car-talk: Location-specific speech production and perception.
11. Lane, H., & Tranel, B. (1971). The Lombard sign and the role of hearing in speech. Journal of Speech, Language, and Hearing Research, 14(4), 677-709.

1pMU4 – When To Cue the Music

Ki-Hong Kim – kim.kihong@surugadai.ac.jp
Faculty of Media & Information Resources, Surugadai University
698 Azu, Hanno-shi, Saitama-ken, Japan 357-8555

Mikiko Kubo – kubmik.0914@gmail.com
Hitachi Solutions, Ltd.
4-12-7 Shinagawa-ku, Tokyo, Japan 140-0002

Shin-ichiro Iwamiya – iwamiya@design.kyushu-u.ac.jp
Faculty of Design, Kyushu University
4-9-1 Shiobaru, Minami-ku, Fukuoka, Japan 815-8540

Popular version of paper 1pMU4, “Optimal insertion timing of symbolic music to induce laughter in video content.”
Presented Monday afternoon, November 28, 2016
172nd ASA Meeting, Honolulu

A study of optimal insertion timing of symbolic music to induce laughter in videos

In television variety shows or comedy programs, various sound effects and music are combined with humorous scenes to induce more pronounced laughter from viewers or listeners [1]. The aim of our study was to clarify the optimum insertion timing of symbolic music to induce laughter in video content. Symbolic music is music associated with a special meaning, such as something funny, and is used as a sort of “punch line” to emphasize the humorous nature of a scene.

Fig. 1 Sequence of video and audio tracks in the video editing timeline

We conducted a series of rating experiments to explore the best timing for insertion of such music into humorous video content. We also examined viewers’ affective responses to the audiovisual content. The experimental stimuli were four short videos, created by combining two video scenes (V1 & V2) with four music clips (M1, M2, M3 & M4).

The rating experiments clarified that the insertion timing of symbolic music contributed to how much laughter the videos induced. In the case of a purely comical scene (V1), the optimal insertion time for a high funniness rating was the shortest, at 0-0.5 seconds. In the case of a tragicomic scene, a humorous accident (V2), the optimal insertion time was longer, at 0.5-1 seconds after the scene; i.e., a short pause before the music was effective in increasing funniness.

Fig. 2 Subjective evaluation value for the funniness in each insertion timing of symbolic music for each video scene.

Furthermore, the rating experiments showed that the optimal timing was associated with the highest impressiveness of the videos, the highest evaluations, the highest congruence between moving pictures and sounds, and inducement of maximum laughter. All of the correlation coefficients are very high, as seen in Table 1.

Table 1 Correlation coefficients between the optimal timing for symbolic music and the affective ratings of the audiovisual content.
	funniness	impressiveness	congruence	evaluation
best timing	.95**	.90**	.90**	.98**
funniness	–	.94**	.92**	.97**
impressiveness	.94**	–	.92**	.95**
congruence	.92**	.92**	–	.94**
evaluation	.97**	.95**	.94**	–

** p< .01
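The values in Table 1 are correlation coefficients, which range from -1 to 1 and approach 1 when two sets of ratings rise and fall together. Assuming they are ordinary Pearson coefficients, which the summary does not state, the calculation can be sketched as follows. The ratings in the example are invented for illustration and are not the study's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical mean ratings for five insertion timings (illustration only).
funniness = [2.0, 3.0, 4.0, 5.0, 6.0]
congruence = [1.5, 3.5, 3.0, 5.5, 6.0]

r = pearson_r(funniness, congruence)  # about 0.94
```

A coefficient this close to 1 means that timings rated as funnier were also rated as more congruent, which is the kind of relationship the table reports.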

In television variety shows or comedy programs, when symbolic music is dubbed over the video as a punch line just after a humorous scene, inserting a short pause of between half a second and a full second is very effective at emphasizing the humor of the scene and increasing the impression it makes on viewer-listeners.

1. Kim, K.H., et al., Effectiveness of Sound Effects and Music to Induce Laugh in Comical Entertainment Television Show. The 13th International Conference on Music Perception and Cognition, 2014. CD-ROM.
2. Kim, K.H., et al., Effects of Music and Sound Effects to Increase Laughter in Television Programs. Media & Information Resources, 2014. 21(2): 15-28. (in Japanese with English abstract).
