NEW YORK – IBM unveiled new speech recognition technology on Tuesday that can comprehend the nuances of spoken English, translate it on the fly and even generate subtitles for foreign-language television programs.
Historically, speech technology required users to limit their speech to a fixed set of phrases in order to interact with a device. With its Embedded ViaVoice 4.4 software package, introduced on Tuesday, IBM hopes to let users speak commands using phrasing that is natural to them.
In a demonstration today at IBM's headquarters here, for example, users changed a simulated radio station by speaking any of the following phrases: "Play 92.3," "Tune to 92.3," or "Tune the radio to 92.3."
Though speech recognition is already built into products like Microsoft's Office XP, many users still prefer to use their keyboards.
Speech recognition can be trained to recognize a particular user's voice. But interpreting sounds from a variety of speakers can be even more challenging, unless a limited library of sounds, or phonemes, is used.
Although speech recognition by a computer is still far from perfect, the future is bright, according to David Nahamoo, a manager in the human language technologies department at IBM Research.
"At IBM, we have this superhuman speech recognition [initiative in which] the goal is to get performance comparable to humans in the next five years," Nahamoo said.
Understanding more than speech
Translating a command such as "Play 92.3" requires the device to understand the basic context of the command, a feature known in Embedded ViaVoice 4.4 as Free-Form Command.
For Free-Form Command to work successfully, the system must recognize two things: First, that the user is referring to the radio, even if he doesn't use the actual term "radio".
Second, the software has to be programmed to understand that the term "play" is also a command to tune the radio to the desired station.
But IBM can also make the process simpler by limiting the context of the speech to a relatively narrow domain, such as the commands, nouns and phrases associated with just a car's dashboard, according to Nahamoo. By limiting the domain, the system can make assumptions or inferences about what the user would like to accomplish, he said.
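As a rough illustration of how a limited domain lets several phrasings resolve to one action, here is a minimal Python sketch. It is not IBM's implementation: the regular-expression patterns, the `parse_command` function and the `("tune", frequency)` result format are all hypothetical, chosen only to cover the three phrases from the demonstration.

```python
import re

# Hypothetical patterns for one narrow domain (a car radio).
# Each pattern captures a station frequency such as "92.3".
TUNE_PATTERNS = [
    re.compile(r"^play (\d+(?:\.\d+)?)$"),
    re.compile(r"^tune (?:the radio )?to (\d+(?:\.\d+)?)$"),
]


def parse_command(utterance):
    """Return ("tune", frequency) if the utterance matches, else None."""
    # Normalize case and drop trailing sentence punctuation.
    text = utterance.lower().strip().rstrip(".!")
    for pattern in TUNE_PATTERNS:
        match = pattern.match(text)
        if match:
            # Because the domain is limited to the radio, "play" and
            # "tune" can both be treated as the same tuning command.
            return ("tune", float(match.group(1)))
    return None


for phrase in ["Play 92.3", "Tune to 92.3", "Tune the radio to 92.3."]:
    print(phrase, "->", parse_command(phrase))
```

A real recognizer works on audio and statistical models rather than on typed text and hand-written patterns; the sketch only shows why a restricted vocabulary makes the final interpretation step tractable.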
IBM partner VoiceBox Technologies implemented ViaVoice in its VoiceBox Navigation system, found in Scion automobiles.
With this system, the driver can control XM Satellite Radio via conversational speech. Specifically, the driver can change stations, raise or lower the volume and control other basic functions.
The user can also search XM content by speaking phrases such as "Who is this artist?" Mike Kennewick, chief executive of VoiceBox, explained.
The system must then determine the context dynamically, recognizing not only that a particular song is playing, but also that the driver wants to know the artist who recorded the song.
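The idea of resolving a question like "Who is this artist?" against what is currently playing can be sketched in a few lines of Python. This is an assumption-laden toy, not VoiceBox's algorithm: the `NowPlaying` record, the `answer_query` function and the keyword checks are hypothetical stand-ins for whatever the real system uses.

```python
from dataclasses import dataclass


@dataclass
class NowPlaying:
    """Hypothetical snapshot of the receiver's current state."""
    title: str
    artist: str
    channel: int


def answer_query(utterance, now_playing):
    """Answer a spoken question using the playback state as context."""
    text = utterance.lower().strip().rstrip("?")
    if now_playing is None:
        return "Nothing is playing right now."
    # "this" has no meaning on its own; the playback state supplies it.
    if "artist" in text or "who sings" in text:
        return f"This is {now_playing.artist}."
    if "song" in text:
        return f"This is {now_playing.title}."
    return "Sorry, I did not understand that."


state = NowPlaying(title="Example Song", artist="Example Artist", channel=20)
print(answer_query("Who is this artist?", state))
```

The point of the sketch is the second argument: the same words produce a different answer depending on the state of the device at the moment they are spoken.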
"Algorithms exist that can determine this context on the fly, so you don't have to use predetermined sentence structure," Kennewick explained. "[It's accomplished by] tying speech content to some contextual cues by using environmental information," such as a particular song playing on the XM receiver, he said.
The Free-Form Command functionality was also demonstrated on a simulated GPS navigation system, which the user could control by speech rather than by navigating menus by touch, a boon to drivers who prefer to keep their eyes focused on the road.
From English to Mandarin Chinese, on the fly
Speech technology can be used to control computers and devices, but it can also be used to communicate with their flesh-and-blood counterparts.
MASTOR, the Multilingual Automatic Speech-to-Speech Translator application, another IBM research project demonstrated today, dynamically translates English speech to Mandarin Chinese speech.
For example, the user can speak English into a microphone, and the system will translate the sentence into Mandarin Chinese and speak it aloud.
The goal of the system, according to Nahamoo, is for someone to be able to "have a conversation with someone who is Chinese, [even if] I don't know Chinese and he doesn't know English."
MASTOR's translations are based on statistical analysis of the language, in which the source sentence is first decomposed into a set of conceptual ideas. Then, the translated sentence is constructed in the target language, based upon these conceptual ideas.
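The two-stage shape described here, sentence to concepts and then concepts to a target-language sentence, can be illustrated with a deliberately tiny Python sketch. It swaps MASTOR's statistical models for hand-written lookup tables, so it shows only the structure of the pipeline; the concept labels, the tables and the function names are all hypothetical.

```python
# Stage 1 table: English words to language-neutral concepts.
# The concept labels are invented for this illustration.
WORD_TO_CONCEPT = {
    "where": "QUERY_LOCATION",
    "hospital": "HOSPITAL",
    "hotel": "HOTEL",
}

# Stage 2 table: Mandarin surface forms for place concepts.
CONCEPT_TO_MANDARIN = {
    "HOSPITAL": "医院",
    "HOTEL": "酒店",
}


def extract_concepts(sentence):
    """Reduce an English sentence to the concepts it mentions."""
    words = sentence.lower().replace("?", "").split()
    return [WORD_TO_CONCEPT[w] for w in words if w in WORD_TO_CONCEPT]


def generate_mandarin(concepts):
    """Build a Mandarin sentence from concepts, or None if unsupported."""
    if "QUERY_LOCATION" in concepts:
        places = [c for c in concepts if c in CONCEPT_TO_MANDARIN]
        if places:
            # Mandarin word order: <place> + "is where?"
            return CONCEPT_TO_MANDARIN[places[0]] + "在哪里？"
    return None


def translate(sentence):
    return generate_mandarin(extract_concepts(sentence))


print(translate("Where is the hospital?"))
```

Note that the output is assembled in Mandarin word order from the concepts rather than translated word for word, which is the advantage of going through an intermediate representation.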
IBM's current MASTOR prototype is a PC application that runs on Windows XP and Windows CE, which also means it can be run on a PDA.
The software development kit (SDK) is available now, but no final products exist yet for consumers to purchase. A product will probably not be available to consumers for at least another six months, an IBM representative said.
Want a real-time perspective on the Middle East, from someone who lives and understands the native culture? As globalization continues across the world, it's become a virtual necessity to be up-to-date on news reported in other countries. Tales, another project demonstrated by IBM, hopes to accomplish that goal.
Tales is a server-based system that perpetually monitors Arabic television stations, dynamically transcribing and translating any words spoken into English subtitles.
That means that a user can watch Al Jazeera, an Arabic news station, with subtitles dynamically created by the Tales system displayed below the video, IBM officials explained. Videos can then also be viewed via a web browser, with all transcriptions indexed and searchable.
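One common way to make timed transcriptions searchable is an inverted index that maps each word to the clips and timestamps where it was spoken. The Python sketch below shows that general technique under stated assumptions; it is not a description of how Tales is built, and the segment format `(video_id, start_seconds, text)`, the function names and the sample data are hypothetical.

```python
from collections import defaultdict


def build_index(segments):
    """Map each word to the (video_id, start_seconds) pairs where it occurs."""
    index = defaultdict(list)
    for video_id, start_seconds, text in segments:
        # Use a set so a word repeated within one segment is listed once.
        for word in set(text.lower().split()):
            index[word].append((video_id, start_seconds))
    return index


def search(index, word):
    """Return sorted (video_id, start_seconds) hits for a single word."""
    return sorted(index.get(word.lower(), []))


# Made-up subtitle segments; punctuation is omitted to keep tokenizing trivial.
segments = [
    ("clip-1", 0, "Officials met in the capital today"),
    ("clip-1", 30, "The talks continue tomorrow"),
    ("clip-2", 12, "Talks in the capital ended"),
]
index = build_index(segments)
print(search(index, "capital"))
```

Each hit carries a timestamp, so a viewer searching for a word could jump straight to the moment in the video where it was spoken.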
According to Salim Roukos, the project lead for Tales, translations of speech require quite a bit of processing time, meaning that real-time translations are impossible. For now, all video processed through Tales is delayed by about four minutes, with an accuracy rate of between 60 and 70 percent.
The accuracy rate could be increased to 80 percent, Roukos added, if the delay were also increased. However, most users of the system felt timeliness was more important than accuracy, especially considering the subject matter was breaking news. By comparison, a human translator can achieve an accuracy rate of 95 percent, he estimated.
Tales is up and running, and users can subscribe to the "Iraq" package, which includes Al Jazeera and other Arabic-language news stations, for an undisclosed price.
Don't expect to tune it in during lunchtime, however; Roukos hinted that the price will be in the hundreds of thousands of dollars.
Computers to rival a human
In addition to improvements in accuracy, Nahamoo explained that IBM will work diligently to index and make searchable other forms of content besides text. The Tales project is a tangible step in this direction. "We search on text today, but how about searching on speech and visual content?" Nahamoo said.
If the accuracy of speech recognition nears that of human performance, will we inevitably encounter more interactive voice response systems, such as automated voice mail? People may cringe at the thought, Nahamoo acknowledged, but there's an upside.
"Machines are not judgmental," Nahamoo said. "Some people feel like they are being judged when they phone call centers. That doesn't happen with a machine."
Copyright © 2006 Ziff Davis Media Inc. All Rights Reserved. Reproduction in whole or in part in any form or medium without express written permission of Ziff Davis Media Inc. is prohibited.