IBM Strives for 'Superhuman' Speech Tech

NEW YORK – IBM unveiled new speech recognition technology on Tuesday that can comprehend the nuances of spoken English, translate it on the fly and even create subtitles dynamically for foreign-language television programs.

Historically, speech technology required the user to limit his speech to a fixed set of phrases in order to interact with a device. With IBM's Embedded ViaVoice 4.4 software package, introduced on Tuesday, the company hopes to allow users to speak commands using phrasing that is natural to them.

In a demonstration today at IBM's headquarters here, for example, users changed a simulated radio station by speaking any of the following phrases: "Play 92.3," "Tune to 92.3," or "Tune the radio to 92.3."

Though speech recognition is already built into products like Microsoft's Office XP, many users still prefer to use their keyboards.

Speech recognition can be trained to recognize a particular user's voice. But interpreting sounds from a variety of speakers can be even more challenging, unless a limited library of sounds, or phonemes, is used.

Though speech recognition by a computer is still far from perfect, the future is bright, according to David Nahamoo, a manager in the human language technologies department at IBM Research.

"At IBM, we have this superhuman speech recognition [initiative in which] the goal is to get performance comparable to humans in the next five years," Nahamoo said.

Understanding more than speech

Translating a command such as "Play 92.3" requires the device to understand the basic context of the command, a feature known in Embedded ViaVoice 4.4 as Free-Form Command.

For Free-Form Command to work successfully, the system must recognize two things. First, it must recognize that the user is referring to the radio, even if he doesn't use the actual term "radio."

Second, the software has to be programmed to understand that the term "play" is also a command to tune the radio to the desired station.

But IBM can also make the process simpler by limiting the context of the speech to something relatively simple, such as the commands, nouns and phrases associated with just a car's dashboard, according to Nahamoo. By limiting the domain, the system can make assumptions or inferences about what the user would like to accomplish, he said.
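
As a rough illustration of why a limited domain helps, the sketch below maps the three phrasings from the demonstration to a single radio command. It is not IBM's implementation: the verb table, the `interpret` function and the `tune_radio` intent name are hypothetical, and the input is assumed to be text that a recognizer has already produced.

```python
import re

# Hypothetical verb-to-intent table for a car-dashboard domain.
# Because the domain is limited, "play" can safely be read as a radio command.
DASHBOARD_VERBS = {
    "play": "tune_radio",
    "tune": "tune_radio",
}

# A station frequency written with one decimal place, such as 92.3.
FREQUENCY = re.compile(r"\b(\d{2,3}\.\d)\b")


def interpret(utterance):
    """Return (intent, frequency) for a recognized utterance, or None."""
    text = utterance.lower()
    match = FREQUENCY.search(text)
    if match is None:
        return None
    # Look for the first word that the dashboard domain treats as a command.
    for word in re.findall(r"[a-z]+", text):
        intent = DASHBOARD_VERBS.get(word)
        if intent is not None:
            return (intent, float(match.group(1)))
    return None


for phrase in ["Play 92.3", "Tune to 92.3", "Tune the radio to 92.3."]:
    print(phrase, "->", interpret(phrase))
```

All three phrases resolve to the same intent and frequency, even though only one of them mentions the radio by name.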

IBM partner VoiceBox Technologies implemented ViaVoice in its VoiceBox Navigation system, found in Scion automobiles.

With this system, the driver can control XM Satellite Radio via conversational speech. Specifically, the driver can change stations, raise or lower the volume and control other basic functions.

The user can also search XM content by speaking phrases such as "Who is this artist?" Mike Kennewick, chief executive of VoiceBox, explained.

The system must then determine the context dynamically, recognizing not only that a particular song is playing, but also that the driver wants to know the artist who recorded the song.
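
A minimal sketch of that idea, assuming the device exposes what is currently playing, might look like the following. The `NowPlaying` record and the `answer` function are hypothetical names chosen for illustration, the song data is a placeholder, and nothing here reflects VoiceBox's actual algorithms.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class NowPlaying:
    """Environmental information: what the receiver is playing right now."""
    title: str
    artist: str


def answer(utterance: str, now_playing: Optional[NowPlaying]) -> Optional[str]:
    """Resolve a question such as "Who is this artist?" against the current song."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    if "artist" in words:
        if now_playing is None:
            return "Nothing is playing."
        # "this" is resolved by context: it refers to the song now playing.
        return now_playing.artist
    return None  # Not a question this sketch understands.


# Placeholder data, not a real XM listing.
song = NowPlaying(title="Example Song", artist="Example Artist")
print(answer("Who is this artist?", song))
```

The utterance alone never names a song; the answer comes from pairing the spoken words with the state of the receiver.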

"Algorithms exist that can determine this context on the fly, so you don't have to use predetermined sentence structure," Kennewick explained. "[It's accomplished by] tying speech content to some contextual cues by using environmental information," such as a particular song playing on the XM receiver, he said.

The Free-Form Command functionality was also demonstrated on a simulated GPS navigation system, which the user could control by speech rather than by navigating menus by touch, a boon to drivers who prefer to keep their eyes on the road.

From English to Mandarin Chinese, on the fly

Speech technology can be used to control computers and devices, but it can also be used to communicate with their flesh-and-blood counterparts.

MASTOR, the Multilingual Automatic Speech-to-Speech Translator application, another IBM research project demonstrated today, dynamically translates English speech to Mandarin Chinese speech.

For example, the user can speak English into a microphone, and the system will translate the sentence into Mandarin Chinese and speak the result aloud.

The goal of the system, according to Nahamoo, is for someone to be able to "have a conversation with someone who is Chinese, [even if] I don't know Chinese and he doesn't know English."

MASTOR's translations are based on statistical analysis of the language, in which the source sentence is first decomposed into a set of concepts. The translated sentence is then constructed in the target language from those concepts.

IBM's current MASTOR prototype is a PC application that runs on Windows XP and Windows CE, which also means it can be run on a PDA.

The software development kit (SDK) is available now, but no final products exist yet for consumers to purchase. A product will probably not be available to consumers for at least another six months, an IBM representative said.

Translating television

Want a real-time perspective on the Middle East, from someone who lives and understands the native culture? As globalization continues across the world, it's become a virtual necessity to stay up to date on news reported in other countries. Tales, another project demonstrated by IBM, aims to meet that need.

Tales is a server-based system that perpetually monitors Arabic television stations, dynamically transcribing and translating any words spoken into English subtitles.

That means that a user can watch Al Jazeera, an Arabic news station, with subtitles dynamically created by the Tales system displayed below the video, IBM officials explained. Videos can also be viewed via a web browser, with all transcriptions indexed and searchable.
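
One common way to make transcripts searchable is an inverted index that maps each word to the points in the video where it was spoken. The sketch below shows that general technique only; IBM did not describe how Tales builds its index, and the function names and sample segments are invented for illustration.

```python
from collections import defaultdict


def build_index(segments):
    """Map each lowercase word to the sorted start times (in seconds) where it occurs.

    `segments` is a list of (start_seconds, subtitle_text) pairs.
    """
    index = defaultdict(set)
    for start_seconds, text in segments:
        for raw_word in text.lower().split():
            word = raw_word.strip(".,!?\"'")
            if word:
                index[word].add(start_seconds)
    return {word: sorted(times) for word, times in index.items()}


def search(index, word):
    """Return the start times of every segment containing `word`."""
    return index.get(word.lower(), [])


# Invented sample subtitles, not actual Tales output.
segments = [
    (0, "Officials met today."),
    (12, "The officials left."),
]
index = build_index(segments)
print(search(index, "officials"))
```

A viewer searching for a word could then jump straight to each moment in the broadcast where it was translated.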

According to Salim Roukos, the project lead for Tales, translations of speech require quite a bit of processing time, meaning that real-time translations are impossible. For now, all video processed through Tales is delayed by about four minutes, with an accuracy rate of between 60 and 70 percent.

The accuracy rate could be increased to 80 percent, Roukos added, if the delay were also increased. However, most users of the system felt timeliness was more important than accuracy, especially considering the subject matter was breaking news. By comparison, a human translator can achieve a 95 percent translation rate, he estimated.

Tales is up and running, and users can subscribe to the "Iraq" package, which includes Al Jazeera and other Arabic-language news stations, for an undisclosed price.

Don't expect to tune it in during lunchtime, however; Roukos hinted that the price will be in the hundreds of thousands of dollars.

Computers to rival a human

In addition to improvements in accuracy, Nahamoo explained that IBM will work diligently to index and make searchable other forms of content besides text. The Tales project is a tangible step in this direction. "We search on text today, but how about searching on speech and visual content?" Nahamoo said.

If the accuracy of speech recognition nears that of human performance, will we inevitably encounter more interactive voice response systems, such as automated voice mail? People may cringe at the thought, Nahamoo acknowledged, but there's an upside.

"Machines are not judgmental," Nahamoo said. "Some people feel like they are being judged when they phone call centers. That doesn't happen with a machine."
