By ANNE EISENBERG
It’s not easy to recognize speakers solely by their voices.
A voice on an audiotape might be that of Osama bin Laden, but
it might also be that of a skilled imposter.
It turns out, though, that computers can help humans in the
tricky task of speaker recognition, using their huge memories, pattern matching
and fast processing to search a database and pair the sound of a voice with its
owner.
Computer-based applications of speaker recognition are
gradually expanding, and already include intelligence gathering and telephone
transactions, in which sellers reduce the risk of fraud by making sure that the
voice on the line is in fact that of the credit card's owner. But the underlying
technology is still far from foolproof, so a small band of researchers is
working to refine its accuracy and consistency.
Some of the researchers say they are finding the task harder
than they had expected. "In principle, this is a very simple problem,"
said George Doddington, a speaker recognition expert in Orinda, Calif., who is
a consultant to the federal government. "It's a binary decision - is this
person who he claims to be or not?"
But accuracy has been hampered by many difficulties, from
the differences in telephone connections and microphones to the inherent
variability of the human voice.
"Voice characteristics vary with your age, your
metabolic state, your emotional state and all the ways you can say
'yeah,'" Dr. Doddington said.
"You'd think we could exploit the differences for
recognition," he said. "But people's voices are different at
different times."
Adding to the difficulty are the varied circumstances in which voices are recorded. People talking in a studio sound different from people talking on a car phone. To do a good job, authentication programs must account for all these sources of variation.
To spur research in the technology, the National
Institute of Standards and Technology in Gaithersburg, Md., enlists many
scientists in annual evaluations to see how their computer programs stack up in
matching unidentified voices with the actual speakers.
The task set for the competition is a tough one, said Alvin
F. Martin, a mathematician at NIST and one of the organizers of the event,
which uses recordings of telephone conversations. The contestants' programs
have to determine if two speech excerpts are from the same speaker. "But
they don't know in advance what words they will be dealing with," Dr.
Martin said.
This is a far harder task for programs to handle than
word-for-word matching of passwords or code numbers, a standard approach of
traditional authentication software. "Instead, they are matching speech on
different things at different times," he said.
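The same-speaker decision Dr. Martin describes can be pictured with a small sketch. It is not the method used by any contestant or by NIST; it assumes, purely for illustration, that each speech excerpt has already been boiled down to a short list of numbers (a hypothetical "voice profile"), and that two profiles are compared with cosine similarity against an arbitrary threshold of 0.8.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero profile carries no information
    return dot / (norm_a * norm_b)


def same_speaker(profile_a, profile_b, threshold=0.8):
    """Binary decision: do two excerpts appear to come from one speaker?

    The profiles and the 0.8 threshold are illustrative assumptions,
    not values taken from any real evaluation.
    """
    return cosine_similarity(profile_a, profile_b) >= threshold
```

Because the profiles summarize how a voice sounds rather than which words were spoken, the comparison does not depend on both excerpts containing the same text. Raising the threshold makes the program more cautious about declaring a match; lowering it does the opposite.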
Dr. Martin said that the technology had improved strikingly
in the last few years, particularly in mining other characteristics of voice
beyond the physical ones related to a speaker's vocal apparatus. "We are
learning to take advantage of many other kinds of information from the speech
signal," he said, "like word combinations that speakers typically
use."
Douglas Reynolds, a senior staff member at the Massachusetts
Institute of Technology's Lincoln Laboratory in Lexington, Mass., is among the
researchers who have worked on extending the traditional range of acoustic
information analyzed, adding characteristics like pitch, pauses and
pronunciation style. Information like this should prove highly useful in
applications like audio mining, in which computers search tapes to identify
particular speakers.
"If you have archived meeting minutes or news
broadcasts and you want to know who is speaking, you want to squeeze as much
information as you can from the speech signal, because you can't get
more," Dr. Reynolds said.
At I.B.M.'s Thomas J. Watson Research Center in Hawthorne, N.Y.,
Ganesh Ramaswamy and his group of researchers are using multiple sources of
information from a conversation to develop their technology, which they call
conversational biometrics.
"We look not just at the voice," Dr. Ramaswamy
said, "but at what you say and how you say it."
The I.B.M. technology is intended for use in authenticating
transactions like gaining access to credit card account information over the
telephone. I.B.M. enrolls people in the program by asking them to read from a
magazine for 30 seconds. "Any magazine is fine," Dr. Ramaswamy said.
"When people speak this long, you get enough of an idea of the frequency
content of the various sounds in their voices." The program also creates a
model of other details like pronunciation.
This might suffice to authenticate a voice in simple cases,
he said. "If someone is calling from a home phone and the voices match
along with the phone numbers, that might be the end of it."
But the system has programming to deal with more complicated
situations; it asks questions of the speaker and decides whether the answers
are adequate. "The acoustic verification runs in parallel with the speech
recognition," Dr. Ramaswamy said. "It will ask a lot more questions
for a $1,000 transaction than for a $10 one."
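The idea of scaling the questioning to the size of the transaction can be sketched in a few lines. This is not I.B.M.'s code; the function name, the dollar tiers, the score cutoffs and the question counts are all hypothetical, chosen only to show how a voice score, a phone-number match and a dollar amount might be combined.

```python
def questions_required(amount_dollars, voice_score, phone_matches):
    """Return how many follow-up questions to ask a caller.

    voice_score is an assumed 0-to-1 measure of how well the caller's
    voice matches the enrolled model; phone_matches says whether the call
    comes from a number on file. All cutoffs here are illustrative.
    """
    # Larger transactions start with more questions.
    if amount_dollars < 100:
        base = 1
    elif amount_dollars < 1000:
        base = 2
    else:
        base = 4

    if voice_score >= 0.9 and phone_matches:
        base -= 1  # strong evidence on both counts: ask less
    elif voice_score < 0.5:
        base += 2  # weak voice match: ask more

    return max(base, 0)
```

Under these made-up settings, a $10 request from a known phone with a strong voice match needs no further questions, while a $1,000 request from the same caller still draws three.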
Applications of voice verification research are gradually
showing up commercially. A recent survey of voice-based biometrics by Judith
Markowitz, a consultant based in Chicago, listed more than 50 companies
providing goods and services.
The use of voice in biometrics may turn out to have a
significant advantage, said Joel S. Lisker, senior vice chairman of a lobbying
and consulting firm in Washington. "For other biometrics like face or
dynamic signature, you have to go someplace to do it, like a bank," he
said. "Here you can do the enrollment in comfort at home or at your desk -
a huge plus."
Dr. Doddington hopes that, whatever comes, future vendors of voice authentication systems will be wary of making facile comparisons to fingerprints, lest they offer false assurances.
"Fingerprints are physical," he said. "Speech is a completely different animal. It's something you do as opposed to what you are. It's a performance."