High fidelity on the line: please say “aah”
Prof. Sten Ternström
Dept of Speech, Music and Hearing
School of Computer Science and Communication
Kungliga Tekniska Högskolan
Popular version of paper 3aMUa7
Hi-Fi voice: observations on the distribution of energy in the singing voice spectrum above 5 kHz
Presented Wednesday, July 2, 2008
If you have ever tried it, you will know that the “treble” tone control on your hi-fi regulates the strength of the highest frequencies, those above 5,000 Hz. Turning it up makes the sound brighter; turning it down makes it duller. Curiously, although the sounds of the human voice have been studied for decades, researchers have hardly looked at those highest parts of the voice spectrum, until now.
In male speech, the vocal folds vibrate only some 100 times per second; in female speech, about twice that. But these vibrations generate harmonic overtones, which carry acoustic energy at higher frequencies as well. The overtones, or harmonics, are fairly strong up to about 4000 Hz, or 4 kHz for short. Going above 5 kHz, they diminish rapidly, and carry only about one ten-thousandth of the total energy of the sound, or even less. Still, the high harmonics can be heard and measured as high as 16 kHz, especially in song. Speech and song also contain fricative consonant sounds such as “s”, “f” and “sh”. These consist of high-frequency noise rather than periodic vibrations.
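To put those figures on the decibel scale that audio engineers use, here is a minimal Python sketch. The function names are illustrative only, and the fundamental frequencies of 100 Hz and 200 Hz are the round numbers from the paragraph above, not measured values.

```python
import math

def ratio_to_db(power_ratio):
    """Convert a power (energy) ratio to decibels."""
    return 10.0 * math.log10(power_ratio)

def harmonics_below(f0_hz, limit_hz):
    """Count the harmonics of f0_hz that lie at or below limit_hz."""
    return int(limit_hz // f0_hz)

# "One ten-thousandth of the total energy" expressed as a level in dB.
treble_share_db = ratio_to_db(1e-4)
print(f"Energy share above 5 kHz: about {treble_share_db:.0f} dB re. total")

# How many harmonics fit into the "fairly strong" range up to 4 kHz.
print("Male voice (100 Hz):", harmonics_below(100, 4000), "harmonics")
print("Female voice (200 Hz):", harmonics_below(200, 4000), "harmonics")
```

One ten-thousandth of the energy corresponds to a level 40 dB below the total, which is why the treble region is so easily overlooked.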
Because the voice signal is so weak above 5 kHz, it has been largely ignored by researchers. Long ago, when telephony was being developed, engineers soon discovered that speech can be transmitted cost-effectively yet intelligibly using only frequencies in the range 300-3500 Hz; and consequently, the structure of voice signals in that low range has been studied intensely. But the sound quality does suffer from this constraint, and everyone knows what is meant by “telephone” sound. Indeed, after a hundred years, that nasal, tinny timbre has become iconic.
Best heard with young ears, the treble region of the voice is still regarded mostly as a subtle curiosity; and acoustically, it’s admittedly rather messy. At low frequencies, below 5 kHz, things stay fairly simple, because the sound waves are much longer than any dimension of the vocal tract. This means that the relatively slow pressure changes in the air travel neatly and only lengthwise from the vocal folds to the lips. The wave propagation inside the vocal tract can be decently described using the textbook maths for, say, a round pipe whose diameter varies along its length. But the higher the frequency, the shorter the sound waves. Above 5 kHz, the waves become short enough that the sounds can start resonating sideways inside the mouth: between the cheeks, tongue, teeth and palate.
These so-called cross-modes make things much more complicated. The vocal tract conducts high-frequency sounds as if it were a tiny but elaborate cave with intricate acoustics, rather than just a pipe with a few resonances. To make things worse, this little room changes shape continuously as we speak. And the slightest displacement of tongue, larynx or jaw can make a huge difference to details in the high part of the sound spectrum. Perhaps this does not matter very much – our sense of hearing shrewdly concentrates on the low frequencies below 5 kHz, while resolving much less spectral detail at high frequencies (which, incidentally, has been exploited for compression of audio signals).
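The wavelength argument can be checked with a few lines of Python. This is only a rough sketch under stated assumptions: a speed of sound of 350 m/s for the warm, humid air in the vocal tract, and a hypothetical mouth width of 3.5 cm treated as two rigid parallel walls. Neither number comes from the study, and a real mouth is far less regular.

```python
SPEED_OF_SOUND = 350.0  # m/s, assumed value for warm, humid air

def wavelength_cm(freq_hz, c=SPEED_OF_SOUND):
    """Wavelength in centimetres of a sound wave at freq_hz."""
    return 100.0 * c / freq_hz

def first_cross_mode_hz(width_m, c=SPEED_OF_SOUND):
    """Lowest sideways resonance between two rigid parallel walls.

    It occurs when half a wavelength fits across the gap: f = c / (2 * width).
    """
    return c / (2.0 * width_m)

for f in (500, 2000, 5000, 10000):
    print(f"{f:>6} Hz -> wavelength {wavelength_cm(f):.1f} cm")

# Hypothetical 3.5 cm gap between the cheeks (illustrative, not measured).
print(f"First cross-mode: about {first_cross_mode_hz(0.035):.0f} Hz")
```

With these assumed values, the wavelength at 5 kHz is about 7 cm, so half a wavelength fits across a gap of a few centimetres, and that is where sideways resonances can begin.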
So what’s the big deal? Can’t we just go on ignoring these high frequencies? Well, as recently as 2000, a new “wide-band” standard for telephony was defined, extending up to 7 kHz. That’s not perfect, but it’s a big improvement on the old “telephone sound.” Hopefully, your cellphone calls will sound much better in a few years, and internet telephony already does. Audio engineers, music producers and broadcasters invariably crank up the treble for the vocals, because doing so is said to increase “crispness,” “intimacy,” and “openness.” According to this fairly recent aesthetic, especially in popular music, live voices now sound almost a bit dull and faded, because, sitting in the fifteenth row, we are not close up like the microphone, and auditorium acoustics tend to penalize the highest frequencies. Medical experts who attend to our voices now know that well-functioning vocal folds generally produce stronger high frequencies. There are also signs that micro-fluctuations at high frequencies could be important for natural-sounding voice synthesis. So, it is high time to describe the treble part of the voice signal in greater detail.
This study is a sort of reconnaissance mission into fairly unknown territory, and it is the first in a new three-year project dedicated to the treble region of the voice. Eight people, mostly singers, were recorded as they performed different speech-like and song-like tasks with their voices, and various aspects of the high-frequency range of vowel sounds were analyzed. Reassuringly, it was found that the acoustics of the vocal tract behave as expected from room acoustics, in that the resonances in the vocal tract at higher frequencies become very profuse, and very sensitive to small shape changes. The relative amount of energy at 5-10 kHz and 10-20 kHz for different vowel sounds was measured, and some gross features of the high spectrum were described. Among these features were a pronounced dip at 4-5 kHz, a cluster of resonances at 5-10 kHz, and a lesser trough at 10-13 kHz. In a few loud sung tones, harmonics were observed all the way up to 20 kHz. Listening tests showed that very small level changes of less than one decibel in the high region can be discernible for long sung vowels, which may have a fair amount of high-frequency harmonics. In running speech, the fricative consonants dominate at high frequencies, and the rapid modulations mean that small changes of the treble control are harder to hear.
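To give a feel for what “relative amount of energy” in a band means, the following Python sketch computes band levels for a made-up harmonic tone. It is not the analysis method used in the study. The tone is an assumption chosen for simplicity: a 200 Hz fundamental whose harmonic amplitudes fall off as 1/n², with no vocal tract resonances at all. The function names are likewise illustrative.

```python
import math

def harmonic_tone(f0_hz, max_hz):
    """Hypothetical tone: list of (frequency, amplitude), amplitude = 1/n^2."""
    count = int(max_hz // f0_hz)
    return [(n * f0_hz, 1.0 / n**2) for n in range(1, count + 1)]

def band_level_db(components, lo_hz, hi_hz):
    """Energy in the band lo_hz < f <= hi_hz, in dB relative to total energy."""
    total = sum(a * a for _, a in components)
    band = sum(a * a for f, a in components if lo_hz < f <= hi_hz)
    return 10.0 * math.log10(band / total)

def db_to_amplitude_ratio(level_db):
    """Amplitude ratio that corresponds to a level change in dB."""
    return 10.0 ** (level_db / 20.0)

tone = harmonic_tone(200.0, 20000.0)
print(f"5-10 kHz:  {band_level_db(tone, 5000, 10000):.1f} dB re. total")
print(f"10-20 kHz: {band_level_db(tone, 10000, 20000):.1f} dB re. total")

# A 1 dB level change, the order of magnitude probed in the listening tests.
print(f"1 dB = amplitude factor {db_to_amplitude_ratio(1.0):.3f}")
```

Even in this idealised tone, both treble bands lie far below the total energy, and a one-decibel change corresponds to an amplitude change of only about 12 percent.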
These findings will show the way into more detailed studies on the high-frequency region of voice signals in speech and song. Once such studies have unfolded, the results will find applications in speech communication, voice health care, and the performing arts. So, the next time you play your favorite singer, or an audio book, try turning down the treble for a while. Chances are you’ll soon want to turn it up again.
[Supported by VR, the Swedish Research Council, contract 2007-4460.]