151st ASA Meeting, Providence, RI


"But Operator, I Need to Repeat This Again?"

Harsha M. Sathyendra - hsathyen@cnel.ufl.edu
Ismail Uysal - uysal@ufl.edu
John G. Harris - harris@cnel.ufl.edu
Department of Electrical & Computer Engineering
Computational Neuro Engineering Lab (CNEL)
University of Florida
Gainesville, FL 32611

Popular version of paper 3aSC5
Presented Wednesday Morning, June 7, 2006

Introduction

The telephone bandwidth standards in use today were established more than 60 years ago. With the introduction of 3G wireless networks, we can send pictures and movies to our friends, get the latest news, download and listen to our favorite music files, and play video games, all on our cell phones. However, because of the existing telecommunications infrastructure, only part of the useful frequency information (300-3400 Hz) is passed along telephone lines. The rest is lost, which is why telephone speech is at times hard to understand, especially for consonant pairs such as b vs. d and n vs. m. This band-limited speech also sounds more robotic and less natural than in-person conversational speech, which has most of its frequency content in the range of 50-8000 Hz.
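To make the band limitation concrete, here is a minimal sketch of what an idealized telephone channel does to a signal. It assumes a perfect "brick-wall" filter over 300-3400 Hz and a made-up test signal of three pure tones; real telephone channels and real speech are far less tidy, and the function names are illustrative only.

```python
import math

SAMPLE_RATE = 16000            # Hz; high enough to represent content up to 8000 Hz
TELEPHONE_BAND = (300, 3400)   # Hz; the passband quoted in the text


def band_limit(components, band=TELEPHONE_BAND):
    """Idealized brick-wall channel: keep only (frequency, amplitude) pairs in band."""
    low, high = band
    return [(f, a) for f, a in components if low <= f <= high]


def synthesize(components, n_samples=1600, sample_rate=SAMPLE_RATE):
    """Sum of sine tones, one per (frequency in Hz, amplitude) pair."""
    return [
        sum(a * math.sin(2 * math.pi * f * i / sample_rate) for f, a in components)
        for i in range(n_samples)
    ]


def tone_amplitude(signal, freq_hz, sample_rate=SAMPLE_RATE):
    """Amplitude of a single frequency component, measured with a one-bin DFT."""
    n = len(signal)
    re = sum(x * math.cos(2 * math.pi * freq_hz * i / sample_rate)
             for i, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq_hz * i / sample_rate)
             for i, x in enumerate(signal))
    return 2 * math.hypot(re, im) / n


# Hypothetical test signal: one tone below, one inside, and one above the band.
tones = [(100, 1.0), (1000, 1.0), (6000, 1.0)]
wideband = synthesize(tones)
narrowband = synthesize(band_limit(tones))

for freq, _ in tones:
    print(freq,
          round(tone_amplitude(wideband, freq), 3),
          round(tone_amplitude(narrowband, freq), 3))
```

In this toy setting only the 1000 Hz tone survives the channel; the 100 Hz and 6000 Hz tones stand in for the low- and high-frequency speech content that the telephone discards.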

Methodology

The simple but impractical way to restore this lost information is to replace the entire telephone communications infrastructure around the globe with a system that allows a higher bandwidth. However, this would cost billions of dollars and require an enormous amount of physical labor. Nonetheless, the promise of “true” voice conversation over the telephone has led many researchers to look for alternative solutions, commonly called bandwidth extension (BWE).

Recent developments in statistical signal processing have equipped engineers with a wide array of tools for applications such as pattern recognition and estimation. Our algorithm employs such techniques to extend both the low and the high frequencies that are missing from the narrowband telephone signal. In other words, using only the information contained within 300-3400 Hz, we estimate the frequency content that is most likely to be present in the desired wideband version, 50-8000 Hz.
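The idea of guessing missing frequency content from what is present can be illustrated with a very simple pattern-matching scheme. The sketch below is not the algorithm from the paper, whose details are not given here; it is a nearest-neighbor lookup, one of the simplest estimation techniques, and every feature value in it is invented for illustration.

```python
import math

# Hypothetical toy "training set". Each entry pairs narrowband features
# (three band energies in dB) with missing-band features
# (low-band energy, high-band energy in dB). All numbers are invented.
TRAINING_PAIRS = [
    ((60.0, 55.0, 40.0), (58.0, 20.0)),  # vowel-like frame: strong lows, weak highs
    ((35.0, 40.0, 45.0), (15.0, 50.0)),  # fricative-like frame: weak lows, strong highs
    ((20.0, 18.0, 15.0), (10.0, 10.0)),  # near-silent frame
]


def distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_missing_bands(narrow_features, pairs=TRAINING_PAIRS):
    """Return the missing-band features paired with the closest narrowband example."""
    best = min(pairs, key=lambda pair: distance(pair[0], narrow_features))
    return best[1]


# A new narrowband frame that resembles the fricative-like training example.
new_frame = (36.0, 41.0, 44.0)
estimate = estimate_missing_bands(new_frame)
print(estimate)
```

A practical system would learn from many hours of wideband speech and use a richer statistical model, but the principle is the same: the narrowband signal carries clues about what the missing bands most likely contained.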

Results

It is commonly known that a person’s voice sounds less natural and more artificial over the telephone. The loss of low frequencies is mostly responsible for this problem. The loss of high frequencies is more difficult to notice; however, it degrades speech intelligibility. A simple example is the following sentence:

“My cousin failed in college.”

If you listen to this sentence in a telephone conversation, the word “failed” sounds like the word “sailed”. This is not the case in in-person conversations, which indicates that whatever information distinguishes these two words is lost in telephone speech. To isolate the increase in intelligibility, three different versions of the sentence “My cousin failed in college.” are provided below.

WB (original version); NB (telephone version); RWB (BWE algorithm output)

Because the words “sailed” and “failed” are easily confused, the following is a series of utterances of these two words.

WB (original version); NB (telephone version); RWB (BWE algorithm output)

The next sentence, “How do you redefine it?”, is taken from the TIMIT database.

WB (original version); NB (telephone version); RWB (BWE algorithm output)

As can be heard in the first sentence, the information missing from the narrowband signal is restored in the reconstructed wideband signal, and the word “failed” is easier to recognize. The same trend holds for the consecutive utterances “failed”, “sailed”, “sailed”, “failed.” Similarly, the reconstructed (RWB) version of the TIMIT sentence sounds closer to the wideband (WB) version than the narrowband (NB) version does.

A similar approach is used to isolate the increase in naturalness: here, information is added only at low frequencies. Below are the three different versions of the following sentences:

“Object -- a village crossroads.”

WB (original version); NB (telephone version); RWB (BWE algorithm output)

“Junior, what on earth's the matter with you?”

WB (original version); NB (telephone version); RWB (BWE algorithm output)

The narrowband (NB) versions sound more artificial and lack the naturalness that is present in both the wideband (WB) and the reconstructed (RWB) versions.

The effect of adding both high- and low-frequency information can be heard in the sentences below:

“Not without good reason has the anatomical been called jocular journalese.”

WB (original version); NB (telephone version); RWB (BWE algorithm output)

“Scientific progress comes from the development of new techniques.”

WB (original version); NB (telephone version); RWB (BWE algorithm output)

Discussion

We also conducted a subjective listening test among 64 college students, who on average spend 30 minutes a day on their cell phones. The test consisted of 10 trials where each trial had 2 different versions of the same utterance. For each trial the students were asked to choose which sound file they preferred on the cell-phone. On average, 93.1% of the test subjects preferred reconstructed (RWB) speech over narrowband (NB) telephone speech.
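For a sense of scale, the reported percentage can be turned into approximate counts. This is a rough back-of-the-envelope sketch: it assumes that every one of the 64 students completed all 10 trials and that the 93.1% figure can be read as a fraction of all individual judgments, neither of which is stated explicitly above.

```python
SUBJECTS = 64             # students in the listening test
TRIALS_PER_SUBJECT = 10   # paired comparisons per student
RWB_PREFERENCE = 0.931    # reported average preference for RWB over NB

# Assumption: every student completed every trial.
total_judgments = SUBJECTS * TRIALS_PER_SUBJECT

# Assumption: the reported average applies to the pooled judgments.
rwb_judgments = round(RWB_PREFERENCE * total_judgments)
nb_judgments = total_judgments - rwb_judgments

print(total_judgments, rwb_judgments, nb_judgments)  # prints: 640 596 44
```

Under these assumptions, roughly 596 of 640 judgments favored the reconstructed speech, and 596/640 is 93.125%, which agrees with the reported 93.1%.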

Another way to gauge the performance of our algorithm is to use spectrograms, which show how frequency content changes over time. The figure below shows the spectrograms of the wideband, narrowband, and reconstructed wideband signals. As you can see, relevant frequency information has been added in the frequency ranges missing from narrowband telephone speech.
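For readers curious about how a spectrogram is computed, the following is a minimal sketch: the signal is cut into short frames, each frame is windowed, and a discrete Fourier transform gives the strength of each frequency in that frame. The frame length, window, and test signal are illustrative choices, not the settings used for the figure.

```python
import cmath
import math


def spectrogram(signal, frame_len, hop):
    """Return one list of DFT magnitudes (bins 0..frame_len/2) per frame."""
    # Hann window reduces leakage between neighboring frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        magnitudes = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / frame_len)
                        for i, x in enumerate(frame))
            magnitudes.append(abs(coeff))
        frames.append(magnitudes)
    return frames


def peak_frequency(magnitudes, sample_rate, frame_len):
    """Frequency in Hz of the strongest bin in one spectrogram frame."""
    k = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return k * sample_rate / frame_len


# Hypothetical test signal: a 500 Hz tone followed by a 2000 Hz tone.
SAMPLE_RATE = 8000
FRAME_LEN = 80  # 10 ms frames, so bins are 100 Hz apart
test_signal = [
    math.sin(2 * math.pi * (500 if i < 400 else 2000) * i / SAMPLE_RATE)
    for i in range(800)
]

spec = spectrogram(test_signal, FRAME_LEN, hop=FRAME_LEN)
print([peak_frequency(frame, SAMPLE_RATE, FRAME_LEN) for frame in spec])
```

Each row of `spec` corresponds to one vertical slice of a spectrogram image. In the test signal the strongest frequency jumps from 500 Hz to 2000 Hz halfway through, which is the kind of change over time that a spectrogram makes visible.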

We have also tested our algorithm with different speakers, genders, accents, and languages, and in all cases we were able to reconstruct a wider-band signal. Future work will concentrate on the effects of noise and how to compensate for it, because cell-phone conversations often take place in noisy environments.

 

