November 13, 2003

Voice Authentication Trips Up the Experts

By ANNE EISENBERG

It's not easy to recognize speakers solely by their voices. A voice on an audiotape might be that of Osama bin Laden, but it might also be that of a skilled impostor.

It turns out, though, that computers can help humans in the tricky task of speaker recognition, using their huge memories, pattern matching and fast processing to search a database and pair the sound of a voice with its owner.

Computer-based applications of speaker recognition are gradually expanding, and already include intelligence gathering and telephone transactions, in which sellers reduce the risk of fraud by making sure that the voice on the line is in fact that of the credit card's owner.

But the underlying technology is still far from foolproof, so a small band of researchers is working to refine its accuracy and consistency.

Some of the researchers say they are finding the task harder than they had expected. "In principle, this is a very simple problem," said George Doddington, a speaker recognition expert in Orinda, Calif., who is a consultant to the federal government. "It's a binary decision - is this person who he claims to be or not?"

But accuracy has been hampered by many difficulties, from the differences in telephone connections and microphones to the inherent variability of the human voice.

"Voice characteristics vary with your age, your metabolic state, your emotional state and all the ways you can say 'yeah,' " Dr. Doddington said.

"You'd think we could exploit the differences for recognition," he said. "But people's voices are different at different times."

Adding to the difficulty are the varied circumstances in which voices are recorded. People talking in a studio sound different from people talking on a car phone. To do a good job, authentication programs must account for all these sources of variation.

To spur research in the technology, the National Institute of Standards and Technology in Gaithersburg, Md., enlists many scientists in annual evaluations to see how their computer programs stack up in matching unidentified voices with the actual speakers.

The task set for the competition is a tough one, said Alvin F. Martin, a mathematician at NIST and one of the organizers of the event, which uses recordings of telephone conversations. The contestants' programs have to determine if two speech excerpts are from the same speaker. "But they don't know in advance what words they will be dealing with," Dr. Martin said.

This is a far harder task for programs to handle than word-for-word matching of passwords or code numbers, a standard approach of traditional authentication software. "Instead, they are matching speech on different things at different times," he said.

Dr. Martin said that the technology had improved strikingly in the last few years, particularly in mining other characteristics of voice beyond the physical ones related to a speaker's vocal apparatus. "We are learning to take advantage of many other kinds of information from the speech signal," he said, "like word combinations that speakers typically use."
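As a rough illustration of how "word combinations that speakers typically use" could serve as a cue, the sketch below compares transcripts by their adjacent word pairs. It is not the method used by any of the researchers quoted here. The function names, the cosine-similarity measure and the sample transcripts are all assumptions made up for this example.

```python
from collections import Counter
import math


def bigram_counts(text):
    """Count adjacent word pairs in a lowercased, whitespace-split transcript."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented transcripts, for illustration only.
known = "you know what i mean you know it is what it is"
sample_same = "it is what it is you know"
sample_other = "to be perfectly frank the figures speak for themselves"

profile = bigram_counts(known)
score_same = cosine_similarity(profile, bigram_counts(sample_same))
score_other = cosine_similarity(profile, bigram_counts(sample_other))

# A higher score means the sample shares more habitual word pairs with the profile.
print(score_same > score_other)
```

A real system would presumably combine a cue like this with acoustic evidence rather than rely on it alone, since a few seconds of speech yield very few word pairs.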

Douglas Reynolds, a senior staff member at the Massachusetts Institute of Technology's Lincoln Laboratory in Lexington, Mass., is among the researchers who have worked on extending the traditional range of acoustic information analyzed, adding characteristics like pitch, pauses and pronunciation style. Information like this should prove highly useful in applications like audio mining, in which computers search tapes to identify particular speakers.

"If you have archived meeting minutes or news broadcasts and you want to know who is speaking, you want to squeeze as much information as you can from the speech signal, because you can't get more," Dr. Reynolds said.

At I.B.M.'s Thomas J. Watson Research Center in Hawthorne, N.Y., Ganesh Ramaswamy and his group of researchers are using multiple sources of information from a conversation to develop their technology, which they call conversational biometrics.

"We look not just at the voice," Dr. Ramaswamy said, "but at what you say and how you say it."

The I.B.M. technology is intended for use in authenticating transactions like gaining access to credit card account information over the telephone. I.B.M. enrolls people in the program by asking them to read from a magazine for 30 seconds. "Any magazine is fine," Dr. Ramaswamy said. "When people speak this long, you get enough of an idea of the frequency content of the various sounds in their voices." The program also creates a model of other details like pronunciation.

This might suffice to authenticate a voice in simple cases, he said. "If someone is calling from a home phone and the voices match along with the phone numbers, that might be the end of it."

But the system has programming to deal with more complicated situations; it asks questions of the speaker and decides whether the answers are adequate. "The acoustic verification runs in parallel with the speech recognition," Dr. Ramaswamy said. "It will ask a lot more questions for a $1,000 transaction than for a $10 one."
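The idea that scrutiny scales with the stakes can be sketched in a few lines. The version below is hypothetical and is not I.B.M.'s implementation. The dollar tiers, the question counts, the 0.5 cutoff for a weak voice match and the function name are all invented for illustration.

```python
# Hypothetical policy: (upper dollar limit, number of questions to ask).
TIERS = [(50, 1), (500, 2), (5000, 4)]
QUESTIONS_ABOVE_TOP_TIER = 6
WEAK_VOICE_MATCH = 0.5   # assumed cutoff on a 0-to-1 acoustic match score
EXTRA_IF_WEAK = 2        # assumed penalty when the voice match is weak


def questions_to_ask(amount_dollars, voice_match_score):
    """Return how many verification questions to ask for a transaction."""
    for limit, count in TIERS:
        if amount_dollars <= limit:
            base = count
            break
    else:
        # Amount exceeds every tier, so use the strictest setting.
        base = QUESTIONS_ABOVE_TOP_TIER
    if voice_match_score < WEAK_VOICE_MATCH:
        base += EXTRA_IF_WEAK
    return base


small = questions_to_ask(10, 0.9)
large = questions_to_ask(1000, 0.9)
large_weak_voice = questions_to_ask(1000, 0.3)
print(small, large, large_weak_voice)
```

Under these made-up settings a $1,000 request draws more questions than a $10 one, and a poor acoustic match adds more still.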

Applications of voice verification research are gradually showing up commercially. A recent survey of voice-based biometrics by Judith Markowitz, a consultant based in Chicago, listed more than 50 companies providing goods and services.

The use of voice in biometrics may turn out to have a significant advantage, said Joel S. Lisker, senior vice chairman of a lobbying and consulting firm in Washington. "For other biometrics like face or dynamic signature, you have to go someplace to do it, like a bank," he said. "Here you can do the enrollment in comfort at home or at your desk - a huge plus."

Dr. Doddington hopes that whatever comes, future vendors of voice authentication systems will be wary of making facile comparisons to fingerprints, lest they offer false assurances.

"Fingerprints are physical," he said. "Speech is a completely different animal. It's something you do as opposed to what you are. It's a performance."