1pAA6 – Listening for solutions to a speech intelligibility problem

Anthony Hoover, FASA – thoover@mchinc.com
McKay Conant Hoover, Inc.
Acoustics & Media Systems Consultants
5655 Lindero Canyon Road, Suite 325
Westlake Village, CA 91362

Popular version of paper 1pAA6, “Listening for solutions to a speech intelligibility problem”
Presented Monday afternoon, May 23, 2016, 2:45 in Salon E
171st ASA Meeting in Salt Lake City, UT

Loudspeakers for sound reinforcement systems are designed to project their sound in specific directions. Sound system designers take advantage of the "directivity" characteristics of these loudspeakers, aiming their sound uniformly throughout seating areas while avoiding walls, ceilings, and other surfaces from which undesirable reflections could reduce clarity and fidelity.

Many high-quality sound reinforcement loudspeaker systems incorporate horn loudspeakers that provide very good control, but these are relatively large and conspicuous.   In recent years, “steerable column arrays” have become available, which are tall but narrow, allowing them to better blend into the architectural design.  These are well suited to the frequency range of speech, and to some degree their sound output can be steered up or down using electronic signal processing.

Figure 1. Steerable column arrays

Figure 1 illustrates the steering technique, with six individual loudspeakers in a vertical array. Each loudspeaker generates an ever-expanding sphere of sound (in this figure, simplified to show only the horizontal diameter of each sphere), propagating outward at the speed of sound, which is roughly 1 foot per millisecond. In the "not steered" column, all of the loudspeakers output their sound at the same time, and the combined wavefront spreads horizontally as an ever-expanding cylinder of sound. In the "steered downward" column, the electronic signal to each successively lower loudspeaker is slightly delayed; the top loudspeaker outputs its sound first, and each lower loudspeaker in turn outputs its sound just a little later, so that the sound energy is steered slightly downward. This steering allows for some flexibility in positioning the loudspeaker column. However, these systems offer only limited vertical control, and left-to-right projection is not well controlled.
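To make the timing concrete, here is a minimal sketch, not taken from the paper, of how the per-loudspeaker delays for downward steering could be computed for a uniformly spaced vertical array; the spacing and steering angle below are purely illustrative.

```python
import math

SPEED_OF_SOUND_FT_PER_MS = 1.13  # roughly 1 foot per millisecond, as noted above

def steering_delays_ms(num_speakers, spacing_ft, steer_angle_deg):
    """Delay, in milliseconds, applied to each loudspeaker from top to bottom.

    Each successively lower loudspeaker fires later by the time sound needs to
    travel the extra path spacing * sin(angle), so the combined wavefront
    leaves the array tilted downward by the steering angle.
    """
    extra_path_ft = spacing_ft * math.sin(math.radians(steer_angle_deg))
    return [i * extra_path_ft / SPEED_OF_SOUND_FT_PER_MS for i in range(num_speakers)]

# Illustrative example: six loudspeakers spaced 0.5 ft apart, steered 10 degrees down
print([round(d, 2) for d in steering_delays_ms(6, 0.5, 10)])
# -> [0.0, 0.08, 0.15, 0.23, 0.31, 0.38]  (milliseconds)
```

Commercial steerable arrays handle this internally with digital signal processing; the point of the sketch is simply that the required delays are on the order of tenths of a millisecond for a column a few feet tall.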

Steerable column arrays have reasonably resolved speech reinforcement issues in many large, acoustically problematic spaces. Such arrays were appropriate selections for a large worship space, with a balcony and a huge dome, that had undergone a comprehensive renovation. Unfortunately, in this case, problems with speech intelligibility persisted, even after multiple adjustments by reputable technicians, who had used their instrumentation to identify several sidewall surfaces that appeared to be reflecting sound and causing problematic echoes. They recommended additional sound-absorptive treatment that could adversely affect the visual aesthetics and negatively impact the popular classical music concerts.

Upon visiting the space, as requested, to investigate potential acoustical treatments, we found speech difficult to understand in various areas of the main floor. While playing a click track (imagine a "pop" every 5 seconds) through the sound system and listening to the results around the main floor, we heard strong echoes emanating from the direction of the surfaces that had been recommended for sound-absorptive treatment.

Near those surfaces, additional column loudspeakers had been installed to augment coverage of the balcony seating area. These balcony loudspeakers were time-delayed (in accordance with common practice, to accommodate the speed of sound) so that they would not produce their sound until the sound from the main loudspeakers had arrived at the balcony. With proper time delay, listeners on the balcony would hear sound from both the main and balcony loudspeakers at approximately the same time, and thereby avoid what would otherwise seem like an echo from the main loudspeakers.
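As a rough, hypothetical illustration of that common practice (the distances and the few-millisecond safety margin below are assumptions, not values from this installation), the fill delay is simply the extra path length from the main loudspeakers divided by the speed of sound:

```python
SPEED_OF_SOUND_FT_PER_MS = 1.13  # speed of sound, roughly 1 foot per millisecond

def fill_delay_ms(main_to_listener_ft, fill_to_listener_ft, margin_ms=10):
    """Delay for a fill (e.g., balcony) loudspeaker so its sound arrives with,
    or just after, the sound already traveling from the main loudspeakers."""
    extra_path_ft = main_to_listener_ft - fill_to_listener_ft
    return extra_path_ft / SPEED_OF_SOUND_FT_PER_MS + margin_ms

# Hypothetical geometry: main array 90 ft from a balcony seat, fill loudspeaker 20 ft away
print(round(fill_delay_ms(90, 20)))  # -> about 72 ms of delay
```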

With more listening, it became clear that the echo was not due to reflections from the walls at all, but rather from the delayed balcony loudspeakers’ sound inadvertently spraying back to the main seating area.  These loudspeakers cannot be steered in a multifaceted manner that would both cover the balcony and avoid the main floor.

We simply turned off the balcony loudspeakers, and the echo disappeared.  More importantly, speech intelligibility improved significantly throughout the main floor. Intelligibility throughout the balcony remained acceptable, albeit not quite as good as with the balcony loudspeakers operating.

The general plan is to remove the balcony loudspeakers and relocate them to the same wall as the main loudspeakers, but steer them to cover the balcony.

Adding sound-absorptive treatment on the side walls would not have solved the problem, and would have squandered funds while impacting the visual aesthetics and classical music programming.  Listening for solutions proved to be more effective than interpreting test results from sophisticated instrumentation.

5aSCb17 – Pronunciation differences: Gender and ethnicity in Southern English

Wendy Herd – wherd@english.msstate.edu
Devan Torrence – dct74@msstate.edu
Joy Carino – carinoj16@themsms.org

Linguistics Research Laboratory
English Department
Mississippi State University
Mississippi State, MS 39762

Popular version of paper 5aSCb17, “Prevoicing differences in Southern English: Gender and ethnicity effects”
Presented Friday morning, May 27, 10:05 – 12:00 in Salon F
171st ASA Meeting, Salt Lake City

We often notice differences in pronunciation between ourselves and other speakers. More noticeable differences, like the Southern drawl or the New York City pronunciation yuge instead of huge, are even used overtly when we guess where a given speaker is from. Our speech also varies in more subtle ways.

If you hold your hand in front of your mouth when saying tot and dot aloud, you will be able to feel a difference in the onset of vocal fold vibration. Tot begins with a sound that lacks vocal fold vibration, so a large rush of air can be felt on the hand at the beginning of the word. No such rush of air can be felt at the beginning of dot because it begins with a sound with vocal fold vibration. A similar difference can be felt when comparing [p] of pot to [b] of bot and [k] of cot to [ɡ] of got. This difference between [t] and [d] is very noticeable, but the timing of our vocal fold vibration also varies each time we pronounce a different version of [t] or [d].

Our study focuses not on the large difference between sounds like [t] and [d], but on how speakers produce the smaller differences between different pronunciations of [d]. For example, an English [d] might be pronounced with no vocal fold vibration before the [d], as shown in Figure 1(a), or with vocal fold vibration before the [d], as shown in Figure 1(b). As can be heard in the accompanying sound files, the difference between these two [d] pronunciations is less noticeable for English speakers than the difference between [t] and [d].

Figure 1. Spectrogram of (a) dot with no vocal fold vibration before [d] and (b) dot with vocal fold vibration before [d]. (Only the first half of dot is shown.)
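For readers curious how a difference like the one in Figure 1 can be measured, the following is a minimal sketch, not the authors' analysis code, of checking for low-frequency voicing energy just before a stop burst; the file name, the hand-marked burst time, and the energy threshold are all illustrative assumptions.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, audio = wavfile.read("dot.wav")      # hypothetical recording of "dot"
if audio.ndim > 1:                         # keep one channel if the file is stereo
    audio = audio[:, 0]
audio = audio.astype(float)
burst_time_s = 0.35                        # hand-marked time of the [d] burst release

freqs, times, power = spectrogram(audio, fs=rate, nperseg=512, noverlap=384)

# Voicing shows up as low-frequency energy; compare the 80 ms just before the
# burst against the average low-frequency energy over the whole recording.
low_band = freqs < 400
pre_burst = (times > burst_time_s - 0.08) & (times < burst_time_s)
prevoice_energy = power[low_band][:, pre_burst].mean()
reference_energy = power[low_band].mean()

print("prevoiced" if prevoice_energy > 2 * reference_energy else "not prevoiced")
```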

We compared the pronunciations of 40 native speakers of English from Mississippi to see if some speakers were more likely to vibrate their vocal folds before [b, d, ɡ] rather than shortly after those sounds. These speakers included equal numbers of African American participants (10 women, 10 men) and Caucasian American participants (10 women, 10 men).

Previous research found that men were more likely to vibrate their vocal folds before [b, d, ɡ] than women, but we found no such gender differences [1]. Men and women from Mississippi employed vocal fold vibration similarly. Instead, we found a clear effect of ethnicity. African American participants produced vocal fold vibration before initial [b, d, ɡ] 87% of the time while Caucasian American participants produced vocal fold vibration before these sounds just 37% of the time. This striking difference, which can be seen in Figure 2, is consistent with a previous smaller study that found ethnicity effects in vocal fold vibration among young adults from Florida [1, 2]. It is also consistent with descriptions of regional variation in vocal fold vibration [3].

Figure 2. Percentage of pronunciations produced with vocal fold vibration before [b, d, ɡ] displayed by ethnicity and gender.

The results suggest that these pronunciation differences are due to dialect variation. African American speakers from Mississippi appear to systematically use vocal fold vibration before [b, d, ɡ] to differentiate them from [p, t, k], but the Caucasian American speakers are using the cue differently and less frequently. Future research in the perception of these sounds could shed light on how speakers of different dialects vary in the way they interpret this cue. For example, if African American speakers are using this cue to differentiate [d] from [t], but Caucasian American speakers are using the same cue to add emphasis or to convey emotion, it is possible that listeners sometimes use these cues to (mis)interpret the speech of others without ever realizing it. We are currently attempting to replicate these results in other regions.

Each accompanying sound file contains two repetitions of the same word. The first repetition does not include vocal fold vibration before the initial sound, and the second repetition does include vocal fold vibration before the initial sound.

  1. Ryalls, J., Zipprer, A., & Baldauff, P. (1997). A preliminary investigation of the effects of gender and race on voice onset time. Journal of Speech, Language, and Hearing Research, 40(3), 642-645.
  2. Ryalls, J., Simon, M., & Thomason, J. (2004). Voice onset time production in older Caucasian- and African-Americans. Journal of Multilingual Communication Disorders, 2(1), 61-67.
  3. Jacewicz, E., Fox, R.A., & Lyle, S. (2009). Variation in stop consonant voicing in two regional varieties of American English. Language Variation and Change, 39(3), 313-334.

1aAA4 – Optimizing the signal to noise ratio in classrooms using passive acoustics

Peter D’Antonio – pdantonio@rpginc.com

RPG Diffusor Systems, Inc.
651 Commerce Dr
Upper Marlboro, MD 20774

Popular version of paper 1aAA4 “Optimizing the signal to noise ratio in classrooms using passive acoustics”
Presented Monday, May 23, 10:20 AM – 5:00 PM, Salon I
171st ASA Meeting, Salt Lake City

The 2012 Programme for International Student Assessment (PISA) carried out an international comparison of student performance in reading comprehension, mathematics, and natural science among roughly half a million 15-year-olds. The US ranked 36th out of 64 countries, as shown in Figure 1.

Figure 1. PISA study results

What is the problem? Existing acoustical designs and products have not evolved to incorporate the current state of the art, and the result is schools that fail to meet their intended goals. Learning areas are only beginning to include adjustable-intensity, adjustable-color lighting, which has been shown to increase reading speeds, reduce testing errors, and reduce hyperactivity; meanwhile, existing acoustical designs are limited to conventional absorption-only materials, such as thin fabric-wrapped panels and acoustical ceiling tiles, which cannot address all of the speech intelligibility and music appreciation challenges.

What is the solution? Adopt modern products and designs for core and ancillary learning spaces that utilize binary, ternary, quaternary, and other transitional hybrid surfaces, which simultaneously scatter the consonant-carrying high-frequency early reflections and absorb mid and low frequencies to passively improve the signal-to-noise ratio; adopt the recommendations of ANSI S12.60 to control reverberation, background noise, and noise intrusion; and integrate lighting that adjusts to the task at hand.

Let's begin by considering how we hear and understand what is being said when information is delivered via the spoken word. We often hear people say, "I can hear what he or she is saying, but I cannot understand what is being said." The understanding of speech is referred to as speech intelligibility. How do we interpret speech? The ear/brain processor can fill in a substantial amount of missing information in music, but it requires more detailed information for understanding speech. The speech power is delivered in the vowels (a, e, i, o, u, and sometimes y), which lie predominantly in the frequency range of 250 Hz to 500 Hz. The speech intelligibility is delivered in the consonants (b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, v, w), which occur in the 2,000 Hz to 6,000 Hz frequency range. People who suffer from noise-induced hearing loss typically have a 4,000 Hz notch, which causes severe degradation of speech intelligibility. This raises the question: why would we want to use exclusively absorption on the entire ceiling of a speech room, and thin fabric-wrapped panels on a significant proportion of the wall area, when these porous materials absorb the important consonant frequencies and prevent them from fusing with the direct sound to make it louder and more intelligible?

Treating the ceiling exclusively with absorbing material may excessively reduce the high-frequency consonant sounds and result in the masking of high-frequency consonants by low-frequency vowel sounds, thereby reducing the signal-to-noise ratio (SNR).

The signal has two contributions: the direct, line-of-sight sound and the early reflections arriving from the walls, ceiling, floor, and the people and items in the room. Our auditory system, the ears and brain, has a unique ability called temporal fusion, which combines, or fuses, these two contributions into one apparently louder and more intelligible signal. The goal, then, is to utilize these passive early reflections as efficiently as possible to increase the signal. The denominator of the SNR consists of external noise intrusion, occupant noise, HVAC noise, and reverberation. These ideas are summarized in Figure 2.

Figure 2. Signal-to-noise ratio
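The benefit of temporal fusion can be sketched with a few illustrative numbers (assumed levels, not measurements from any particular classroom): within the fusion window, the direct sound and the early reflections add on an energy basis, raising the signal while the noise term is unchanged.

```python
import math

def sum_levels_db(levels_db):
    """Energy-sum a list of sound pressure levels given in dB."""
    return 10 * math.log10(sum(10 ** (L / 10) for L in levels_db))

# Illustrative levels, not measured data
direct = 60.0                      # direct, line-of-sight speech level (dB)
early_reflections = [54.0, 52.0]   # early reflections from ceiling and walls (dB)
noise = 45.0                       # noise intrusion + occupants + HVAC + reverberation (dB)

signal = sum_levels_db([direct] + early_reflections)
print(round(signal, 1), "dB signal ->", round(signal - noise, 1), "dB SNR")
# The direct sound alone would give a 15 dB SNR; the fused early
# reflections add about 1.5 dB here.
```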

In Figure 3, we illustrate a concept model for an improved speech environment, whether it is a classroom, a lecture hall, or a meeting/conference room: essentially, any room in which information is being conveyed.

The design includes a reflective front, because the vertical and horizontal divergence of the consonants is roughly 120 degrees; if a speaker turns away from the audience, the consonants must reflect from the front wall and the ceiling overhead. The perimeter of the ceiling is absorptive to control the reverberation (noise). The center of the ceiling is diffusive to provide early reflections that increase the signal and its coverage in the room. The middle third of the walls utilizes novel binary, ternary, quaternary, and other transitional diffsorptive (diffusive/absorptive) panels, which scatter the information above 1 kHz (the signal) and absorb the sound below 1 kHz (the reverberation, i.e., the noise). This design suggests that the current exclusive use of acoustical ceiling tile and traditional fabric-wrapped panels is counterproductive to improving the SNR, speech intelligibility, and coverage.

Figure 3. Concept model for a classroom with a high SNR

2aSP5 – Using Automatic Speech Recognition to Identify Dementia in Early Stages

Roozbeh Sadeghian, J. David Schaffer, and Stephen A. Zahorian
Rsadegh1@binghamton.edu
SUNY at Binghamton
Binghamton, NY

Popular version of paper 2aSP5, “Using automatic speech recognition to identify dementia in early stages”
Presented Tuesday morning, November 3, 2015, 10:15 AM, City Terrace room
170th ASA Meeting, Jacksonville, Fl

The clinical diagnosis of Alzheimer's disease (AD) and other dementias is very challenging, especially in the early stages. It is widely believed to be underdiagnosed, at least partially because of the lack of a reliable non-invasive diagnostic test. Additionally, recruitment for clinical trials of experimental dementia therapies might be improved with a highly specific test. Although there is much active research into new biomarkers for AD, most of these methods are expensive and/or invasive, such as brain imaging (often with radioactive tracers) or the collection of blood or spinal fluid samples followed by expensive laboratory procedures.

There are good indications that dementias can be characterized by several aphasias (defects in the use of speech). This seems plausible since speech production involves many brain regions, and thus a disease that affects particular regions involved in speech processing might leave detectable fingerprints in the speech. Computerized analysis of speech signals and computational linguistics (analysis of word patterns) have progressed to the point where an automatic speech analysis system could be within reach as a tool for detection of dementia. The long-term goal is an inexpensive, short-duration, non-invasive test for dementia; one that can be administered in an office or home by clinicians with minimal training.

If a pilot study (cross sectional design: only one sample from each subject) indicates that suitable combinations of features derived from a voice sample can strongly indicate disease, then the research will move to a longitudinal design (many samples collected over time) where sizable cohorts will be followed so that early indicators might be discovered.

A simple procedure for acquiring speech samples is to ask subjects to describe a picture (see Figure 1). Some such samples are available on the web (DementiaBank), but they were collected long ago and the audio quality is often poor. We used 140 of these older samples, but also collected 71 new samples with good-quality audio. Roughly half of the samples had a clinical diagnosis of probable AD, and the others were demographically similar and cognitively normal (NL).

Figure 1. The pictures used for recording samples: (a) the famous "cookie theft" picture used for the older samples and (b) the picture used for the newly recorded samples

One hundred twenty-eight features were automatically extracted from the speech signals, including pauses and pitch variation (indicating emotion); word-use features were extracted from manually prepared transcripts. In addition, we had the results of a popular cognitive test, the mini-mental state exam (MMSE), for all subjects. While widely used as an indicator of cognitive difficulties, the MMSE is not sufficiently diagnostic for dementia by itself. We searched for patterns with and without the MMSE, which gives the possibility of a clinical test that combines speech with the MMSE. Multiple patterns were found using an advanced pattern-discovery approach (genetic algorithms with support vector machines). The performances of two example patterns are shown in Figure 2. The training samples (red circles) were used to discover the patterns, so we expect them to perform well. The validation samples (blue) were not used for learning, only to test the discovered patterns. If we say that a subject will be declared AD when the test score is > 0.5 (the red line in Figure 2), we can see some errors: in the left panel there is one false positive (an NL case with a high test score, blue triangle) and several false negatives (AD cases with low scores, red circles).

Figure 2. Two discovered diagnostic patterns (left: with MMSE; right: without MMSE). The normal subjects are to the left in each plot (low scores) and the AD subjects to the right (high scores). No perfect pattern has yet been discovered.
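To make the scoring and thresholding step concrete, here is a heavily simplified sketch of training a support vector machine and declaring AD above the 0.5 score line. The genetic-algorithm feature selection used in the study is omitted, and the feature matrix below is synthetic placeholder data rather than real speech features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(211, 128))            # 211 speech samples x 128 features (synthetic)
y = rng.integers(0, 2, size=211)           # 1 = probable AD, 0 = cognitively normal (synthetic)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
model.fit(X_train, y_train)

scores = model.predict_proba(X_val)[:, 1]  # test score for each validation subject
declared_ad = scores > 0.5                 # the "red line" threshold in Figure 2
print("false positives:", int((declared_ad & (y_val == 0)).sum()))
print("false negatives:", int((~declared_ad & (y_val == 1)).sum()))
```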

As mentioned above, manually prepared transcripts were used for these results, since automatic speaker-independent speech recognition is very challenging for small, highly variable data sets. To be viable, the test should be completely automatic. Accordingly, the main emphasis of the research presented at this conference is the design of an automatic speech-to-text system and an automatic pause recognizer, taking into account the special features of the type of speech used for this test of dementia.

2pSCb11 – Effect of Menstrual Cycle Hormone Variations on Dichotic Listening Results

Richard Morris – Richard.morris@cci.fsu.edu
Alissa Smith

Florida State University
Tallahassee, Florida

Popular version of poster presentation 2pSCb11, “Effect of menstrual phase on dichotic listening”
Presented Tuesday afternoon, November 3, 2015, 3:30 PM, Grand Ballroom 8
170th ASA Meeting, Jacksonville, Fl

How speech is processed by the brain has long been of interest to researchers and clinicians. One method to evaluate how the two sides of the brain work when hearing speech is called a dichotic listening task. In a dichotic listening task two words are presented simultaneously to a participant’s left and right ears via headphones. One word is presented to the left ear and a different one to the right ear. These words are spoken at the same pitch and loudness levels. The listener then indicates what word was heard. If the listener regularly reports hearing the words presented to one ear, then there is an ear advantage. Since most language processing occurs in the left hemisphere of the brain, most listeners attend more closely to the right ear. The regular selection of the word presented to the right ear is termed a right ear advantage (REA).
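The summary does not give a scoring formula, but a common way to quantify an ear advantage from such reports is a laterality index; the sketch below uses that convention with made-up counts.

```python
def laterality_index(right_ear_correct, left_ear_correct):
    """Positive values indicate a right ear advantage (REA),
    negative values a left ear advantage."""
    total = right_ear_correct + left_ear_correct
    return 100.0 * (right_ear_correct - left_ear_correct) / total

# Example: a listener reports the right-ear word 21 times and the
# left-ear word 15 times out of 36 trials.
print(round(laterality_index(21, 15), 1))  # -> 16.7, a modest REA
```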

Previous researchers reported different responses from males and females to dichotic presentation of words. Those investigators found that males more consistently heard the word presented to the right ear and demonstrated a stronger REA. The female listeners in those studies exhibited more variability as to the ear of the word that was heard. Further research seemed to indicate that women exhibit different lateralization of speech processing at different phases of their menstrual cycle. In addition, data from recent studies indicate that the degree to which women can focus on the input to one ear or the other varies with their menstrual cycle.

However, the previous studies used a small number of participants. The purpose of the present study was to complete a dichotic listening study with a larger sample of female participants. In addition, the previous studies focused on women who did not take oral contraceptives, because women taking oral contraceptives were assumed to have smaller shifts in the lateralization of speech processing. Although this assumption is reasonable, it needs to be tested. For this study, it was hypothesized that the women would exhibit a greater REA during the days that they menstruate than during other days of their menstrual cycle. This hypothesis was based on the previous research reports. In addition, it was hypothesized that the women taking oral contraceptives would exhibit smaller fluctuations in the lateralization of their speech processing.

Participants in the study were 64 females, 19-25 years of age. Among the women, 41 were taking oral contraceptives (OC) and 23 were not. The participants listened to the sound files during nine sessions that occurred once per week. All of the women were in good general health and had no speech, language, or hearing deficits.

The dichotic listening task was executed using the Alvin software package for speech perception research. The sound file consisted of consonant-vowel syllables composed of the six plosive consonants /b/, /d/, /g/, /p/, /t/, and /k/ paired with the vowel "ah". The listeners heard the syllables over stereo headphones. Each listener set the loudness of the syllables to a comfortable level.
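As a small illustration, not the actual Alvin configuration, of how such dichotic trials can be enumerated, each of the six plosive-plus-"ah" syllables can be paired with every different syllable in the opposite ear and the resulting list randomized for each presentation:

```python
import itertools
import random

plosives = ["b", "d", "g", "p", "t", "k"]
syllables = [p + "ah" for p in plosives]

# Every ordered left-ear/right-ear pairing of two different syllables: 6 x 5 = 30 trials
pairs = list(itertools.permutations(syllables, 2))
random.shuffle(pairs)                     # a fresh randomization for each presentation

for left, right in pairs[:3]:             # show the first few trials
    print(f"left ear: {left:4s}   right ear: {right}")
```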

At the beginning of the listening session, each participant wrote down the date of the start of her most recent menstrual period on a participant sheet identified by her participant number. Then the participants heard the recorded syllables and indicated the consonant heard by striking that key on the computer keyboard. Each listening session consisted of three presentations of the syllables, with a different randomization of the syllables for each presentation. In the first presentation, the stimuli were presented in a non-forced condition, in which the listener indicated the plosive that she heard most clearly. After the first presentation, the experimental files were presented in what is referred to as a forced-left or forced-right condition. In these two conditions the participant was directed to focus on the signal in the left or the right ear. The sequence of focus on the signal to the left ear or to the right ear was counterbalanced over the sessions.

The statistical analyses of the listeners’ responses revealed that no significant differences occurred between the women using oral contraceptives and those who did not. In addition, correlations between the day of the women’s menstrual cycle and their responses were consistently low. However, some patterns did emerge for the women’s responses across the experimental sessions as opposed to the days of their menstrual cycle. The participants in both groups exhibited a higher REA and lower percentage of errors for the final sessions in comparison to earlier sessions.

The results from the current subjects differ from those previously reported. Possibly the larger sample size of the current study, the additional month of data collection, or the data recording method affected the results. The larger sample size might have better represented how most women respond to dichotic listening tasks. The additional month of data collection may have allowed the women to learn how to respond to the task and then respond in a more consistent manner; a short data collection period may conflate learning to respond to a novel task with a hormonally dependent response. Finally, previous studies had the experimenter record the subjects' responses, a method of data recording that may have added bias to the data collection. Further studies with large data sets and multiple months of data collection are needed to determine any effects of sex and oral contraceptive use on the REA.