Comprehendible Voice

Long Term Goals

This project is part of the PSTN Transition program in the S2ERC. The PSTN Transition refers to the transition from the legacy, time-division multiplexed (TDM), SS7-signaled voice network to the new, IP packet-switched, SIP-signaled multimedia network. We will call the legacy network the PSTN and the new network the all-IP network.

Ultimately, we would like to see a telecommunications network that is more secure, better preserves privacy, and is more usable than the network we have today.

Background for Long Term Goals

One of the promises of the all-IP telecommunications network is a better experience for users of the network. In addition, basing communications on packet technology makes general-purpose computing available for solving communications problems.

The legacy public switched telephone network (PSTN) encodes voice at 8,000 samples per second with 8 bits (256 quantization levels) per sample, as specified by ITU-T Recommendation G.711.[1] This limits the transmitted frequency band to roughly 300 Hz to 3,400 Hz. While most of the voice energy lies in this band, a lot of aural information is lost. Most people can hear sounds from 20 Hz to 20,000 Hz. For example, 300 Hz is just above the D above middle C on the piano. That is, there is a lot of low-frequency voice information (baritone and bass) that never gets transmitted. At the other end, 3,400 Hz is just above the highest G# on the piano, which is still well within human hearing.

More importantly, the loss of high-frequency components is believed to adversely affect human speech comprehension. For example, the English consonants ‘f’ and ‘s’, which would seem to be very different sounds, share essentially the same spectrum below 3,400 Hz. In other words, the sentences “I am failing” and “I am sailing” are nearly indistinguishable over a legacy voice connection. The reason listeners rarely confuse the two is that our brains take the context into consideration: if I were talking about my vacation at the lake, you would perceive me saying ‘sailing,’ even though the audio itself did not distinguish it from ‘failing.’
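The arithmetic behind these figures can be checked with a short sketch. It assumes G.711's 8 bits per sample (256 quantization levels) and, for the piano comparison, a standard 88-key piano in equal temperament with A4 tuned to 440 Hz. The function names are illustrative only.

```python
# Sketch only: checks the G.711 numbers and the piano-note comparisons.
# Assumptions: 8 bits per sample, 88-key piano, equal temperament, A4 = 440 Hz.

SAMPLE_RATE_HZ = 8000      # G.711 sampling rate
BITS_PER_SAMPLE = 8        # 2**8 = 256 quantization levels

NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]


def g711_bit_rate(sample_rate_hz=SAMPLE_RATE_HZ, bits=BITS_PER_SAMPLE):
    """Bit rate of the uncompressed G.711 stream, in bits per second."""
    return sample_rate_hz * bits


def nyquist_hz(sample_rate_hz=SAMPLE_RATE_HZ):
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2


def piano_key_frequency(key):
    """Frequency of piano key 1..88, where key 49 is A4 = 440 Hz."""
    return 440.0 * 2 ** ((key - 49) / 12)


def piano_key_name(key):
    """Scientific pitch name of piano key 1..88 (key 1 is A0, key 40 is C4)."""
    return f"{NOTE_NAMES[(key - 1) % 12]}{(key + 8) // 12}"


def highest_key_below(limit_hz):
    """Name and frequency of the highest piano key strictly below limit_hz."""
    key = max(k for k in range(1, 89) if piano_key_frequency(k) < limit_hz)
    return piano_key_name(key), piano_key_frequency(key)


if __name__ == "__main__":
    print("Bit rate (bit/s):", g711_bit_rate())
    print("Nyquist limit (Hz):", nyquist_hz())
    print("Highest key below 300 Hz:", highest_key_below(300.0))
    print("Highest key below 3,400 Hz:", highest_key_below(3400.0))
```

Under these assumptions the sketch gives a 64 kbit/s stream, a 4,000 Hz Nyquist limit, and D4 (about 293.7 Hz) and G#7 (about 3,322 Hz) as the piano keys just below the two band edges.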

The spectrographic analysis of phonemes is important, as phonemes are the critical components of speech comprehension.[2]

The advent of the all-IP network brings new coding schemes which, combined with more available bandwidth, offer much higher-fidelity voice transmission. High-quality voice codecs, such as the IETF Opus codec[3] and AMR-WB,[4] capture audio well above and below the 300 Hz to 3,400 Hz band of the legacy PSTN.
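To give a rough sense of scale, the sketch below compares nominal audio bands. The G.711 band is the 300 Hz to 3,400 Hz figure used above. The AMR-WB band (50 Hz to 7,000 Hz at 16 kHz sampling) and the Opus fullband figure (up to 20 kHz at 48 kHz sampling) are nominal values as we understand them and should be checked against [3] and [4]. The 20 Hz lower edge used for Opus is our assumption, taken from the lower limit of human hearing rather than from the specification. The class and function names are illustrative.

```python
# Sketch only: nominal audio bands, not measured codec responses.
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecBand:
    name: str
    sample_rate_hz: int
    low_hz: float
    high_hz: float

    @property
    def bandwidth_hz(self) -> float:
        """Width of the nominal audio band in Hz."""
        return self.high_hz - self.low_hz


G711 = CodecBand("G.711 (PSTN channel)", 8_000, 300.0, 3_400.0)
AMR_WB = CodecBand("AMR-WB", 16_000, 50.0, 7_000.0)
# Lower edge of 20 Hz is an assumption (limit of hearing), not a spec value.
OPUS_FB = CodecBand("Opus (fullband mode)", 48_000, 20.0, 20_000.0)


def bandwidth_ratio(codec, reference=G711):
    """How many times wider the codec's nominal band is than the reference."""
    return codec.bandwidth_hz / reference.bandwidth_hz


if __name__ == "__main__":
    for codec in (G711, AMR_WB, OPUS_FB):
        print(f"{codec.name}: {codec.low_hz:.0f} to {codec.high_hz:.0f} Hz, "
              f"{bandwidth_ratio(codec):.2f}x the G.711 band")
```

On these nominal figures, AMR-WB carries a band a little more than twice as wide as the legacy channel, and Opus in fullband mode more than six times as wide.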

We expect more consumers to have access to high-quality codecs. Cellular LTE-based voice uses AMR-WB, and Skype and WebRTC-based VoIP clients use Opus. Both technologies are gaining rapid adoption. In contrast, many enterprises have legacy voice equipment. Even if the enterprise uses VoIP, it is likely to be using older codecs such as G.711 or even lower-quality codecs. As a result, when the two sides are connected today, the high-quality voice is transcoded to the legacy voice codec.

To date, people constructing transcoders have focused on faithfully reconstructing the original signal, so that the voice interaction sounds as natural as possible. However, such reconstruction, particularly with such severe truncation of the high-frequency components of speech, virtually guarantees reduced comprehension of the transcoded voice.
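The effect of this truncation can be illustrated with a small sketch. It measures what fraction of a frame's energy falls inside the 300 Hz to 3,400 Hz band, using a naive discrete Fourier transform. Several things here are assumptions made for illustration: the band edges are treated as an ideal brick-wall filter, which real equipment only approximates, the test signal is two pure tones rather than speech, and the 6,000 Hz tone is only a stand-in for high-frequency consonant energy.

```python
# Sketch only: ideal brick-wall band, synthetic tones instead of speech.
import cmath
import math


def dft(samples):
    """Naive O(n^2) discrete Fourier transform of a real-valued frame."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def band_energy_fraction(samples, sample_rate_hz, low_hz=300.0, high_hz=3400.0):
    """Fraction of the frame's energy that lies within [low_hz, high_hz]."""
    n = len(samples)
    spectrum = dft(samples)
    total = 0.0
    kept = 0.0
    # Positive-frequency bins only; the DC and Nyquist bins are skipped.
    for k in range(1, n // 2):
        freq_hz = k * sample_rate_hz / n
        energy = abs(spectrum[k]) ** 2
        total += energy
        if low_hz <= freq_hz <= high_hz:
            kept += energy
    return kept / total if total else 0.0


def tone(freq_hz, sample_rate_hz, n, amplitude=1.0):
    """One frame of a sine tone."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate_hz)
        for t in range(n)
    ]


if __name__ == "__main__":
    rate = 16_000          # wideband sampling rate
    frame = 160            # 10 ms frame, so each DFT bin is 100 Hz wide
    in_band = tone(1_000, rate, frame)
    out_of_band = tone(6_000, rate, frame)
    mixed = [a + b for a, b in zip(in_band, out_of_band)]
    print("Energy kept, 1 kHz tone only:", band_energy_fraction(in_band, rate))
    print("Energy kept, 6 kHz tone only:", band_energy_fraction(out_of_band, rate))
    print("Energy kept, both tones:", band_energy_fraction(mixed, rate))
```

With two equal-amplitude tones, about half of the energy lies outside the band and would not survive the narrowband channel, however faithfully the in-band portion is reconstructed.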

There is some experience with manipulating the spectra within the constraints of the legacy codec. For example, in the 1990s, AT&T experimented with True Voice, where the bass components (100 Hz – 300 Hz) were boosted into the 300+ Hz band. It sounded better, but never took off.[5]

One opportunity for call centers and other enterprises is that we can relax the requirement for natural-sounding speech. What is needed is a method of manipulating the speech so that, when it is transcoded to a low-quality codec, the enterprise user’s comprehension of the original speech is better than if only the low-frequency components of the speech were transmitted.

Intermediate Term Objectives

We want to understand the characteristics of existing voice codecs and of human speech comprehension, and to develop methods of improving the comprehension of speech carried over low-quality codecs.

Schedule of Major Steps:

Catalog and profile high-definition voice codecs [8 weeks]
Study audio attributes of voice and comprehension [4 weeks]

Dependencies:

None.

 

[1] ITU-T, Pulse Code Modulation (PCM) of Voice Frequencies, ITU-T Recommendation G.711, 1988.

[2] Liberman, A. M., Cooper, F. S., Shankweiler, D. P., and Studdert-Kennedy, M., Perception of the Speech Code, Psychological Review, v. 74, n. 6, pp. 431 – 461, November 1967.

[3] Valin, JM., Vos, K., and Terriberry, T., Definition of the Opus Audio Codec, IETF RFC 6716, September 2012.

[4] ITU-T, Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB), ITU-T Recommendation G.722.2, July 2003.

[5] Isenberg, D., Rise of the Stupid Network, retrieved from http://www.hyperorg.com/misc/stupidnet.html on 6 March 2014.