Man-made musical instruments are sometimes designed or played to mimic the human voice, and likewise vocalists try to mimic the sounds of man-made instruments.  If flutes and strings accompany a singer, a “brassy” voice is likely to produce mismatches in timbre (tone color or sound quality).  Likewise, a “fluty” voice may not be ideal for a brass accompaniment.  Thus, singers are looking for ways to color their voice with variable timbre.

Acoustically, brass instruments are close cousins of the human voice.  It was discovered prehistorically that sending sound over long distances (to locate, be located, or warn of danger) is made easier when a vibrating sound source is connected to a horn.  It is not known which came first – blowing hollow animal horns or sea shells with pursed and vibrating lips, or cupping the hands to extend the airway for vocalization. In both cases, however, airflow-induced vibration of soft tissue (vocal folds or lips) is enhanced by a tube that resonates the frequencies and radiates them (sends them out) to the listener.

Around 1840, theatrical singing by males went through a revolution.  Men wanted to portray more masculinity and raw emotion with vocal timbre. The “Do di Petto”, Italian for “C in chest voice”, was introduced by operatic tenor Gilbert Duprez in 1837 and soon became a phenomenon.  A heroic voice in opera took on more of a brass-like quality than a flute-like quality.  Similarly, in the early to mid-twentieth century (1920-1950), female singers were driven by the desire to sing with a richer timbre, one that matched brass and percussion instruments rather than strings or flutes.  Ethel Merman became an icon in this revolution. This led to the theatre belt sound produced by females today, which has much in common with a trumpet sound.


Fig 1. Mouth opening to head-size ratio for Ethel Merman and corresponding frequency spectrum for the sound “aw” with a fundamental frequency fo (pitch) at 547 Hz and a second harmonic frequency 2 fo at 1094 Hz.

The length of an uncoiled trumpet horn is about 2 meters (including the full length of the valves), whereas the length of a human airway above the glottis (the space between the vocal cords) is only about 17 cm (Fig. 2). The vibrating lips and the vibrating vocal cords can produce similar pitch ranges, but the resonators have vastly different natural frequencies due to the more than 10:1 ratio in airway length.  So, we ask, how can the voice produce a brass-like timbre in a “call” or “belt”?
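To get a feel for how far apart the natural frequencies of the two resonators are, the sketch below treats each airway as an idealized uniform tube, closed at the sound source and open at the far end. This is a strong simplification: the speed of sound (343 m/s) and the uniform-tube model are assumptions made here, and a real trumpet's mouthpiece and bell, like a real vocal tract's changing shape, move the resonances away from these values.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed value for air near 20 °C

def quarter_wave_resonances(length_m, count=3, c=SPEED_OF_SOUND_M_PER_S):
    """Resonances (Hz) of an idealized uniform tube, closed at one end and open at the other."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]

# Lengths quoted in the text: about 17 cm for the airway, about 2 m for the trumpet.
vocal_tract = quarter_wave_resonances(0.17)
trumpet = quarter_wave_resonances(2.0)
length_ratio = 2.0 / 0.17

print("idealized airway resonances (Hz): ", [round(f) for f in vocal_tract])
print("idealized trumpet resonances (Hz):", [round(f) for f in trumpet])
print("length ratio:", round(length_ratio, 1))
```

Under these assumptions the lowest airway resonance falls near 500 Hz, while the lowest resonance of the long tube falls near 43 Hz, which is the kind of gap the length ratio suggests.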

One structural similarity between the human instrument and the brass instrument is the shape of the airway directly above the glottis, a short and narrow tube formed by the epiglottis.  It corresponds to the mouthpiece of brass instruments.  This mouthpiece plays a major role in shaping the sound quality.  A second structural similarity is created when a singer uses a wide mouth opening, simulating the bell of the trumpet.  With these two structural similarities, the spectrum of tones produced by the two instruments can be quite similar, despite the huge difference in the overall length of the instrument.


Fig 2. Human airway and trumpet (not drawn to scale).

Acoustically, the call or belt-like quality is achieved by strengthening the second harmonic frequency 2fo in relation to the fundamental frequency fo.  In the human instrument, this can be done by choosing a bright vowel like /æ/ that puts an airway resonance near the second harmonic.  The fundamental frequency will then have significantly less energy than the second harmonic.

Why does that resonance adjustment produce a brass-like timbre?  To understand this, we first recognize that, in brass-instrument playing, the tones produced by the lips are entrained (synchronized) to the resonance frequencies of the tube.  Thus, the tones heard from the trumpet are the resonance tones. These resonance tones form a harmonic series, but the fundamental tone in this series is missing.  It is known as the pedal tone.  Thus, by design, the trumpet has a strong second harmonic frequency with a missing fundamental frequency.

Perceptually, an imaginary fundamental frequency may be produced by our auditory system when a series of higher harmonics (equally spaced overtones) is heard.  Thus, the fundamental (pedal tone) may be perceptually present to some degree, but the highly dominant second harmonic determines the note that is played.

In belting and loud calling, the fundamental is not eliminated, but suppressed relative to the second harmonic.  The timbre of belt is related to the timbre of a trumpet due to this lack of energy in the fundamental frequency.  There is a limit, however, in how high the pitch can be raised with this timbre.  As pitch goes up, the first resonance of the airway has to be raised higher and higher to maintain the strong second harmonic.  This requires ever more mouth opening, literally creating a trumpet bell (Fig. 3).
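The relationship between pitch and the resonance a belter must reach can be written as simple arithmetic. The sketch below assumes, as a simplification of the description above, that the first airway resonance is placed exactly on the second harmonic; the function names are illustrative only.

```python
def harmonics(fo_hz, count=4):
    """First few harmonics (Hz) of a tone with fundamental frequency fo_hz."""
    return [n * fo_hz for n in range(1, count + 1)]

def belt_resonance_target(fo_hz):
    """First-resonance target (Hz), assuming it sits exactly on the second harmonic 2fo."""
    return 2 * fo_hz

# Pitches mentioned in the text: 545 Hz and 547 Hz (figures), 587 Hz (D5).
for fo in (545, 547, 587):
    print(f"fo = {fo} Hz -> harmonics {harmonics(fo)} -> resonance target {belt_resonance_target(fo)} Hz")
```

Each step up in pitch raises the target by twice as much, which is why the mouth must open ever wider as the belt goes higher.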


Fig 3. Mouth opening to head-size ratio for Idina Menzel and corresponding frequency spectrum for a belt sound with a fundamental frequency (pitch) at 545 Hz.

Note the strong second harmonic frequency 2fo in the spectrum of frequencies produced by Idina Menzel, a current musical theatre singer.

One final comment about the perceived pitch of a belt sound is in order.  Pitch perception is not only related to the fundamental frequency, but the entire spectrum of frequencies.  The strong second harmonic influences pitch perception. The belt timbre on a D5 (587 Hz) results in a higher pitch perception for most people than a classical soprano sound on the same note. This adds to the excitement of the sound.

Charisma is the set of leadership characteristics, such as vision, emotions, and dominance, used by leaders to share beliefs, persuade listeners and achieve goals. Politicians use voice to convey charisma and appeal to voters to gain social positions of power. “Charismatic voice” refers to the ensemble of vocal acoustic patterns used by speakers to convey personality traits and arouse specific emotional states in listeners. The ability to manipulate charismatic voice results from speakers’ universal and learned strategies to use specific vocal parameters (such as vocal pitch, loudness, phonation types, pauses, pitch contours, etc.) to convey their biological features and their social image (see Ohala, 1994; Signorello, 2014a, 2014b; Puts et al., 2006). Listeners’ perception of the physical, psychological and social characteristics of the leader is influenced by universal ways to emotionally respond to vocalizations (see Ohala, 1994; Signorello, 2014a, 2014b) combined with specific, culturally mediated habits of manifesting emotional response in public (Matsumoto, 1990; Signorello, 2014a).

Politicians manipulate vocal acoustic patterns (adapting them to the culture, language, social status, educational background and gender of the voters) to convey specific types of leadership that fulfill everyone’s expectation of what charisma is. But what happens to leaders’ voices when they use different languages to address voters? This study investigates speeches of bilingual politicians to identify the vocal acoustic differences of leaders speaking in different languages. It also investigates how these acoustic differences can influence listeners’ perception of the type of leadership and the emotional states aroused by leaders’ voices.

We selected vocal samples from two bilingual American-English/American-Spanish politicians who participated in the 2016 United States presidential primaries: Jeb Bush and Marco Rubio. We chose words with similar vocal characteristics in terms of average vocal pitch, vocal pitch range, and loudness range. We asked listeners to rate the type of charismatic leadership perceived and to assess the emotional states aroused by those voices. Finally, we asked participants how the different vocal patterns would affect their voting preference.

Preliminary statistical analyses show that English words like “terrorism” (voice sample 1) and “security” (voice sample 2), characterized by mid vocal pitch frequencies, wide vocal pitch ranges, and wide loudness ranges, convey an intimidating, arrogant, selfish, aggressive, witty, overbearing, lazy, dishonest, and dull type of charismatic leadership. Listeners from different language and cultural backgrounds also reported these vocal stimuli triggered emotional states like contempt, annoyance, discomfort, irritation, anxiety, anger, boredom, disappointment, and disgust. The listeners who were interviewed considered themselves politically liberal and they responded that they would probably vote for a politician with the vocal characteristics listed above.

Speaker Jeb Bush. Mid vocal pitch frequencies (126 Hz), wide vocal pitch ranges (97 Hz), and wide loudness ranges (35 dB)

Speaker Marco Rubio. Mid vocal pitch frequencies (178 Hz), wide vocal pitch ranges (127 Hz), and wide loudness ranges (30 dB)

Results also show that Spanish words like “terrorismo” (voice sample 3) and “ilegal” (voice sample 4), characterized by an average of mid-low vocal pitch frequencies, mid vocal pitch ranges, and narrow loudness ranges, convey a personable, relatable, kind, caring, humble, enthusiastic, witty, stubborn, extroverted, understanding, but also weak and insecure type of charismatic leadership. Listeners from different language and cultural backgrounds also reported these vocal stimuli triggered emotional states like happiness, amusement, relief, and enjoyment. The listeners who were interviewed considered themselves politically liberal and they responded that they would probably vote for a politician with the vocal characteristics listed above.

Speaker Jeb Bush. Mid-low vocal pitch frequencies (95 Hz), mid vocal pitch ranges (40 Hz), and narrow loudness ranges (17 dB) 

Speaker Marco Rubio. Mid vocal pitch frequencies (146 Hz), wide vocal pitch ranges (75 Hz), and wide loudness ranges (25 dB)

Voice is a very dynamic non-verbal behavior used by politicians to persuade the audience and manipulate voting preference. The results of this study show how acoustic differences in voice convey different types of leadership and arouse different emotional states in listeners. The voice samples studied show how speakers Jeb Bush and Marco Rubio adapt their vocal delivery to audiences of different backgrounds. The two politicians voluntarily manipulate their voice parameters while speaking in order to appear as if they were endowed with different leadership qualities. The vocal pattern used in English conveys the threatening and dark side of their charisma, inducing the arousal of negative emotions, which triggers a positive voting preference in listeners. The vocal pattern used in Spanish conveys the charming and caring side of their charisma, inducing the arousal of positive emotions, which triggers a negative voting preference in listeners.

The manipulation of voice arouses emotional states that will induce voters to consider a certain type of leadership as more appealing. Experiencing emotions helps voters to assess the effectiveness of a political leader. If the emotional arousal matches voters’ expectation of how a charismatic leader should make them feel, then voters would help the charismatic speaker to become their leader.

Older adults seeking hearing help often report having an especially hard time understanding women’s voices. However, this anecdotal observation doesn’t always agree with the findings from scientific studies. For example, Ferguson (2012) found that male and female talkers were equally intelligible for older adults with hearing loss. Moreover, several studies have found that young people with normal hearing actually understand women’s voices better than men’s voices (e.g. Bradlow et al., 1996; Ferguson, 2004). In contrast, Larsby et al. (2015) found that, when listening in background noise, groups of listeners with and without hearing loss were better at understanding a man’s voice than a woman’s voice. The Larsby et al. data suggest that female speech might be more affected by distortion like background noise than male speech is, which could explain why women’s voices may be harder to understand for some people.

We were interested to see if another type of distortion, speeding up the speech, would have an equal effect on the intelligibility of men and women. Speech that has been sped up (or time-compressed) has been shown to be less intelligible than unprocessed speech (e.g. Gordon-Salant & Friedman, 2011), but no studies have explored whether time compression causes an equal loss of intelligibility for male and female talkers. If an increase in playback speed causes women’s speech to be less intelligible than men’s, it could reveal another possible reason why so many older adults with hearing loss report difficulty understanding women’s voices. To this end, our study tested whether the intelligibility of time-compressed speech decreases for female talkers more than it does for male talkers.

Using 32 listeners with normal hearing, we measured how much the intelligibility of two men and two women went down when the playback speed of their speech was increased by 50%. These four talkers were selected based on their nearly equivalent conversational speaking rates. We used digital recordings of each talker and made two different versions of each sentence they spoke: a normal-speed version and a fast version. The software we used allowed us to speed up the recordings without making them sound high-pitched.
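The arithmetic behind a 50% increase in playback speed can be sketched as follows. The 3-second sentence is a hypothetical value chosen only for illustration, and the pitch calculation describes what naive speed-up (simply playing the samples faster) would do, which is the side effect the software used in the study avoided.

```python
import math

def compressed_duration_s(original_s, speedup_percent):
    """Duration after increasing playback speed by the given percentage."""
    rate = 1.0 + speedup_percent / 100.0
    return original_s / rate

def naive_pitch_shift_semitones(speedup_percent):
    """Pitch rise (semitones) if the recording were simply played faster."""
    rate = 1.0 + speedup_percent / 100.0
    return 12.0 * math.log2(rate)

sentence_s = 3.0  # hypothetical sentence length, not a value from the study
fast_s = compressed_duration_s(sentence_s, 50)
shift = naive_pitch_shift_semitones(50)

print(f"{sentence_s:.1f} s sentence becomes {fast_s:.1f} s at 50% faster playback")
print(f"naive speed-up would raise pitch by about {shift:.1f} semitones")
```

A sentence sped up by 50% lasts two thirds of its original duration; without pitch-preserving processing, voices would also sound roughly a musical fifth higher.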

Audio sample 1: A sentence at its original speed.

Audio sample 2: The same sentence sped up to 50% faster than its original speed.

All of the sentences were presented to the listeners in background noise. We found that the men and women were essentially equally intelligible when listeners heard the sentences at their original speed. Speeding up the sentences made all of the talkers harder to understand, but the effect was much greater for the female talkers than the male talkers. In other words, there was a significant interaction between talker gender and playback speed. The results suggest that time-compression has a greater negative effect on the intelligibility of female speech than it does on male speech.


Figure 1: Overall percent correct key-word identification performance for male and female talkers in unprocessed and time-compressed conditions. Error bars indicate 95% confidence intervals.

These results confirm the negative effects of time-compression on speech intelligibility and imply that audiologists should counsel the communication partners of their patients to avoid speaking excessively fast, especially if the patient complains of difficulty understanding women’s voices. This counsel may be even more important for the communication partners of patients who experience particular difficulty understanding speech in noise.


Older adults are the fastest-growing segment of the population. Some voice, speech and breathing disorders occur more frequently as individuals age. For example, lung capacity diminishes in older age due to loss of lung elasticity, which places an upper limit on utterance duration. Further, decreased lung and diaphragm elasticity and muscle strength can occur, and the rib cage can stiffen, leading to reductions in lung pressure and the volume of air that can be expelled by the lungs (‘expiratory volume’). In the laryngeal system, tissues can break down and cartilages can harden, causing more voice breaks, increased hoarseness or harshness, reduced loudness, and pitch changes.

Our study attempted to identify the normal speech and respiratory changes that accompany aging in healthy individuals. Specifically, we examined how long individuals could speak in a single breath group using a series of speeches from six individuals (three females and three males) over the course of many years (between 18 and 49 years). All speakers had been previously recorded in similar environments giving long, monologue speeches. All but one speaker gave their addresses at a podium using a microphone, and most were longer than 30 minutes each. The speakers’ ages ranged between 43 (51 on average) and 98 (84 on average) years. Samples of five minutes in length were extracted from each recording. Subsequently, for each subject, three raters identified the durations of exhalations during speech in these samples.

Two figures illustrate how the breath groups changed with age for one of the women (Figure 1) and one of the men (Figure 2). We found a change in speech breathing, which might be caused by a less flexible rib cage and the loss of vital capacity and expiratory volume. In males especially, it may also have been caused by poor closure of the vocal folds, resulting in more air leakage during speech. Specifically, we found a decreased breath group duration for all male subjects after 70 years, with overall durations averaging between 1 and 3.5 seconds. Importantly, the point of change appeared to occur between 60 and 65. For females, this change occurred at a later time, between 60 and 70 years, with durations averaging between 1.5 and 3.5 seconds.


Figure 1. For one of the female talkers, the speech breath groups were measured and plotted to correspond with age. The length of the speech breath groups begins to decrease at about 68 years of age.


Figure 2. For one of the male talkers, the speech breath groups were measured and plotted to correspond with age. The length of the speech breath groups begins to decrease at about 66 years of age.

The study results indicate decreases in speech breath group duration for most individuals as their age increased (especially from 65 years onwards), consistent with the age-related decline in expiratory volume reported in other studies. Typically, the speech breath group duration of the six subjects decreased from ages 65 to 70 years onwards. There was some variation between individuals in the point at which the durations started to decrease. The decreases indicate that, as they aged, speakers could not sustain the same number of words in a breath group and needed to inhale more frequently while speaking.

Future studies involving more participants may further our understanding of normal age-related changes vs. pathology, but such a corpus of recordings must first be constrained on the basis of communicative intent, venues, knowledge of vocal coaching, and related information.

Loudspeakers for sound reinforcement systems are designed to project their sound in specific directions. Sound system designers take advantage of the “directivity” characteristics of these loudspeakers, aiming their sound uniformly throughout seating areas, while avoiding walls and ceilings and other surfaces from which undesirable reflections could reduce clarity and fidelity.

Many high-quality sound reinforcement loudspeaker systems incorporate horn loudspeakers that provide very good control, but these are relatively large and conspicuous.   In recent years, “steerable column arrays” have become available, which are tall but narrow, allowing them to better blend into the architectural design.  These are well suited to the frequency range of speech, and to some degree their sound output can be steered up or down using electronic signal processing.


Figure 1. Steerable column arrays

Figure 1 illustrates the steering technique, with six individual loudspeakers in a vertical array.  Each loudspeaker generates an ever-expanding sphere of sound (in this figure, simplified to show only the horizontal diameter of each sphere), propagating outward at the speed of sound, which is roughly 1 foot per millisecond.  In the “not steered” column, all of the loudspeakers are outputting their sound at the same time, with a combined wavefront spreading horizontally, as an ever-expanding cylinder of sound.  In the “steered downward” column, the electronic signal to each successively lower loudspeaker is slightly delayed; the top loudspeaker outputs its sound first, while each lower loudspeaker in turn outputs its sound just a little later, so that the sound energy is generally steered slightly downward. This steering allows for some flexibility in positioning the loudspeaker column.  However, these systems only offer some vertical control; left-to-right projection is not well controlled.
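The delay pattern described above can be estimated with a short calculation. The sketch below assumes evenly spaced loudspeakers and a speed of sound of 343 m/s; the 0.15 m spacing and 10-degree downward angle are hypothetical values chosen for illustration, not settings from the installation discussed here. Each loudspeaker below the top one is delayed by the time sound takes to travel the extra distance spacing × sin(angle).

```python
import math

SPEED_OF_SOUND_M_PER_S = 343.0  # assumed: air near 20 °C (about 1.1 ft per ms)

def steering_delays_ms(num_drivers, spacing_m, down_angle_deg, c=SPEED_OF_SOUND_M_PER_S):
    """Delay (ms) for each loudspeaker, listed top to bottom, to tilt the wavefront downward."""
    step_s = spacing_m * math.sin(math.radians(down_angle_deg)) / c
    return [1000.0 * n * step_s for n in range(num_drivers)]

# Hypothetical six-loudspeaker column, 0.15 m spacing, steered 10 degrees downward.
delays = steering_delays_ms(num_drivers=6, spacing_m=0.15, down_angle_deg=10.0)
for index, delay in enumerate(delays):
    print(f"loudspeaker {index} (0 = top): {delay:.3f} ms")
```

With these example values the delays are fractions of a millisecond, which is why steering is done with electronic signal processing rather than physical repositioning. Setting the angle to zero gives equal (zero) delays, matching the “not steered” column.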

Steerable column arrays have reasonably resolved speech reinforcement issues in many large, acoustically-problematic spaces. Such arrays were appropriate selections for a large worship space, with a balcony and a huge dome, that had undergone a comprehensive renovation.  Unfortunately, in this case, problems with speech intelligibility persisted, even after multiple adjustments by reputable technicians, who had used their instrumentation to identify several sidewall surfaces that appeared to be reflecting sound and causing problematic echoes. They recommended additional sound absorptive treatment that could adversely affect visual aesthetics and negatively impact the popular classical music concerts.

When we visited the space, as requested, to investigate potential acoustical treatments, we found that speech was difficult to understand in various areas on the main floor.  While playing a click track (imagine a “pop” every 5 seconds) through the sound system, and listening to the results around the main floor, we heard strong echoes emanating from the direction of the surfaces that had been recommended for sound-absorptive treatment.

Near those surfaces, additional column loudspeakers had been installed to augment coverage of the balcony seating area.  These balcony loudspeakers were time-delayed (in accordance with common practice, to accommodate the speed of sound) so that they would not produce their sound until the sound from the main loudspeakers had arrived at the balcony. With proper time delay, listeners on the balcony would hear sound from both main and balcony loudspeakers at approximately the same time, and thereby avoid what would otherwise seem like an echo from the main loudspeakers.
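The size of such a delay follows directly from the speed of sound. The sketch below is a minimal illustration under assumed conditions: the distances are hypothetical, the speed of sound is taken as 343 m/s, and any extra offset a designer might add in practice is ignored.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed: air near 20 °C

def travel_time_ms(distance_m, c=SPEED_OF_SOUND_M_PER_S):
    """Time (ms) for sound to travel the given distance."""
    return 1000.0 * distance_m / c

def fill_delay_ms(main_to_listener_m, fill_to_listener_m):
    """Delay (ms) for a nearby fill loudspeaker so its sound arrives with the main loudspeaker's."""
    return travel_time_ms(main_to_listener_m) - travel_time_ms(fill_to_listener_m)

# Hypothetical balcony seat: 30 m from the main loudspeaker, 5 m from the balcony loudspeaker.
balcony_delay = fill_delay_ms(main_to_listener_m=30.0, fill_to_listener_m=5.0)
print(f"balcony loudspeaker delay: {balcony_delay:.1f} ms")
```

For these example distances the balcony loudspeaker would be held back by roughly 73 ms, the time the main loudspeaker's sound needs to cover the extra 25 m.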

With more listening, it became clear that the echo was not due to reflections from the walls at all, but rather from the delayed balcony loudspeakers’ sound inadvertently spraying back to the main seating area.  These loudspeakers cannot be steered in a multifaceted manner that would both cover the balcony and avoid the main floor.

We simply turned off the balcony loudspeakers, and the echo disappeared.  More importantly, speech intelligibility improved significantly throughout the main floor. Intelligibility throughout the balcony remained acceptable, albeit not quite as good as with the balcony loudspeakers operating.

The general plan is to remove the balcony loudspeakers and relocate them to the same wall as the main loudspeakers, but steer them to cover the balcony.

Adding sound-absorptive treatment on the side walls would not have solved the problem, and would have squandered funds while impacting the visual aesthetics and classical music programming.  Listening for solutions proved to be more effective than interpreting test results from sophisticated instrumentation.