Real-Time Processing of Audio to Improve Comprehension in Networks

Long Term Goals

Process transmitted speech in real time to improve comprehension over legacy communication networks. A secondary goal is to improve these exchanges when at least one leg of the call traverses a low-fidelity codec, particularly ITU-T Recommendation G.711.

The fundamental question we are asking is: What makes speech more intelligible?

Background for Long Term Goals

Communication networks that use ITU-T Recommendation G.711 limit transmitted audio to a frequency band of 300 Hz to 3,400 Hz. While this band is sufficient to carry most of the voice energy of human speech, the high-frequency components (those above 3,400 Hz) are lost. The loss of this high-frequency spectrum can hinder comprehension of the transmitted speech, particularly when the speech includes phonemes that share low-frequency spectra but differ in their high-frequency components. For example, the spoken words “fail” and “sale” share a common low-frequency spectrum (300 Hz – 3,400 Hz) but differ above it. Loss of the high-frequency spectrum can therefore cause “fail” and “sale” to become indistinguishable to the listener. In a call center environment, the difference between a ‘fail’ and a ‘sale’ can be substantial for the company.
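The band limitation described above is easy to simulate. The following sketch (illustrative only, not a proposal deliverable) applies a brick-wall FFT filter that zeroes spectral content outside the 300 Hz – 3,400 Hz band, showing how a component above the cutoff, such as fricative energy, simply disappears:

```python
import numpy as np

def bandlimit(signal, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Zero out spectral content outside [low_hz, high_hz] (brick-wall FFT filter)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

# Demo: a tone inside the band survives; one above 3.4 kHz is removed.
rate = 16000
t = np.arange(rate) / rate               # one second of audio
in_band = np.sin(2 * np.pi * 1000 * t)   # 1 kHz component (within the band)
out_band = np.sin(2 * np.pi * 5000 * t)  # 5 kHz component (lost by a G.711-style band)
filtered = bandlimit(in_band + out_band, rate)
```

A real narrow-band channel involves sampling at 8 kHz and companding, not an ideal brick-wall filter, but the spectral effect on high-frequency cues is the same: they are discarded.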

Incremental improvements in comprehension in a call center environment can have significant impacts.[1] Besides turning a ‘fail’ into a ‘sale,’ not having to repeat information, or being able to assist a caller the first time they say something to an agent, can trim precious seconds from an interaction. In a call center scenario, it can take 12 seconds to ask for and receive a credit card number; avoiding a single repeat can trim roughly 2% from the time of a typical call. Over thousands of calls, this is a significant savings. The caller’s experience, and thus their satisfaction with the interaction, can also improve significantly if they are not repeatedly asked to repeat themselves.

A number of real-time speech analytics tools are available from NICE, Verint, CallMiner, and others. These tools indicate to the call center employee whether the caller is mentioning competitors, using words that suggest they are at risk of leaving, growing unhappy with the employee, and so on. Likewise, real-time emotion analytics are available from Beyond Verbal, Affectiva, and Augsburg’s EmoVoice program. Many of these provide an indication of whether the caller is happy, sad, angry, upset, or bored. There is even a W3C effort to standardize how emotional states are annotated.

[1] For convenience, we generically refer to the party interacting with a call center agent as a ‘caller,’ even if the agent called the party.

Intermediate Term Objectives

We will construct a model to test our hypothesis that processing transmitted speech can improve comprehension, and we will explore other methods of improving the comprehension of transmitted speech. Most approaches to coding speech focus on fidelity: making the received speech sound as close to the transmitted speech as possible. We suspect this effort may be misplaced, particularly over narrow-band transmission facilities such as G.711.

Our hypothesis is that one or more signal processing mechanisms will make speech more intelligible. Conventional voice transmission and processing systems do not use these methods because, even though they may make speech more intelligible, they may also make it sound unnatural. The two avenues we propose are frequency shifting and dynamic equalization.
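The frequency-shifting avenue can be sketched concretely. The idea, similar to the frequency-lowering used in some hearing aids, is to copy energy from above the codec cutoff down into the passband before the channel discards it, so that high-frequency cues (such as the /f/ vs. /s/ distinction) survive in a relocated form. The following is a minimal FFT-based sketch; the cutoff, shift, and gain parameters are purely illustrative:

```python
import numpy as np

def transpose_highs(signal, sample_rate, cutoff_hz=3400.0, shift_hz=2400.0, gain=0.5):
    """Copy spectral energy from above cutoff_hz down into the passband,
    then zero everything above the cutoff to simulate the narrow-band channel."""
    spectrum = np.fft.rfft(signal)
    bin_hz = sample_rate / len(signal)
    start = int(cutoff_hz / bin_hz)
    shift = int(shift_hz / bin_hz)
    assert shift < start, "shifted band must land inside the spectrum"
    # Mix the high band down by `shift` bins, at reduced gain.
    spectrum[start - shift : len(spectrum) - shift] += gain * spectrum[start:]
    spectrum[start:] = 0.0  # the channel discards everything above the cutoff
    return np.fft.irfft(spectrum, n=len(signal))

# Demo: a 5 kHz cue, which G.711-style band-limiting would discard entirely,
# is relocated to 2.6 kHz where it can pass through the channel.
rate = 16000
t = np.arange(rate) / rate
shifted = transpose_highs(np.sin(2 * np.pi * 5000 * t), rate)
```

Whether such relocated cues actually aid a human listener, or merely sound unnatural, is exactly what the human-subject testing in Phase 2 is intended to determine.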

The first task will be to analyze actual call center interactions to see what processing will be most fruitful. We expect a data set of at least 20 hours of interactions (300 calls at 4 minutes per call) in which to look for patterns of misunderstanding. One challenge is that this is a very small data set. Small data sets may carry biases, such as agents who are naturally adept at understanding marginal speech or, conversely, agents who have trouble understanding clear speech. Larger data sets would allow us to generalize, but would take correspondingly longer to examine. For now, the limited data set should suffice for an initial cut at the kinds of processing we need to do.

This project maps neatly onto the distinction between phonetics (how speech sounds are physically produced and transmitted) and phonology (how those sounds are interpreted as linguistic units). We will therefore work with the linguistics department on models of how sounds are distorted and how we can restore the features a human listener needs to properly interpret the most likely intention of the source sounds.

Schedule of Major Steps:

Phase 1

Process working data sets of recorded speech to determine common audio situations where misunderstandings or escalations occur. This includes listening to and annotating the data set, identifying situations where there was a misunderstanding, as well as situations where a misunderstanding would be expected but the agent or caller got it right. [4 weeks] {Expectation is this data will be provided by Ontario Systems}
Analyze hypotheses for improving the comprehension of a call center interaction. [6 weeks]
Write initial summary report for phase 1 [1 week]

Phase 2

Work with psycholinguistics and signal processing colleagues to develop appropriate algorithms (noise injection, dynamic EQ, or other DSP methods) to improve comprehension. [3 weeks]
Pick out non-PII snippets for testing with human subjects [1 week]
Get Georgetown IRB approval for human subject testing [3 weeks wall time]
Limited testing of algorithms (small scale) [4 weeks]

Write report summarizing results of phase 2. [2 weeks]


Comprehendable Voice Project