Dr. Rita R. Colwell
Director
National Science Foundation
Knowledge Discovery and Dissemination Meeting
Herndon, Virginia
September 4, 2002
See also slide
presentation.
If you're interested in reproducing any of the slides,
please contact
The Office of Legislative and Public Affairs: (703)
292-8070.
Thank you, Gary, and good morning, everyone. It's a
pleasure to be here to open the second day of this
NSF Kickoff Workshop for our program on Knowledge
Discovery and Dissemination, better known as "KDD."
I understand that there was much stimulating interaction
yesterday and I know that today promises even more.
I'm glad to be part of it.
This meeting is an extremely timely opportunity for
two communities to interact--the information technology
researchers and the intelligence analysts--an opportunity
to connect NSF grantees doing cutting-edge research
with those in other agencies who need the eventual
applications of your research.
As context for this interaction, I would like to speak
about why the National Science Foundation is supporting
KDD.
This research program seeks to harness information
technology to improve our ability to synthesize and
use information culled from many different sources.
This fundamental research, an unclassified program
like everything NSF supports, holds great potential
to contribute significantly to our national security,
while very much in keeping with our overall research
goals.
Almost a year ago, September 11 thrust a new reality
upon us. As we look back, with inevitable hindsight,
at who knew what and when, we realize that we vitally
need better tools to assemble a comprehensive picture
of threats that may face us.
One obstacle to synthesis is information overload.
A second challenge, as described by intelligence scholar
Gregory Treverton, is structural divides--"distinctions
between intelligence and law enforcement, between
foreign and domestic, and between public and private." [1]
KDD offers some promise in surmounting both
challenges posed by information--overload and synthesis.
In the past year, the National Science Foundation--responsible
for supporting science and engineering research across
the entire range of disciplines--has funded a number
of very specific efforts related to the attacks of
September 11 and their aftermath. The NSF Act of 1950,
in fact, expressly authorizes us to support science
and engineering related to national security.
Many of these efforts build upon our existing investments
in fundamental research. Right after the 9/11 attacks,
for example, we supported the use of small experimental
robots to search the WTC site for remains. We also
sponsored engineering studies into what caused the
buildings to collapse. Other grantees are studying
the geographic dimensions of terrorism and the short
and longer-term societal responses to 9/11. Still
others have sequenced the anthrax genome and developed
sensors for detecting bioterrorism.
While not supporting classified research, NSF contributes
to homeland security in a number of ways. We may bring
together key researchers from any number of fields
with other federal agencies--such as at this meeting.
NSF may also contribute by sponsoring a workshop on
a critical topic, such as the one we held on chemical
and biological sensors.
NSF's role, from information technology to engineering,
and from social science to bioscience, is to support
the fundamental research needed by other agencies
and industry for applications.
A few numbers will help to make my case about what
NSF can offer. We currently account for about half
of the Federal non-medical support for fundamental
research at U.S. colleges and universities.
Each year 50,000 reviewers--the brightest minds in
science and engineering--competitively review the
32,000 funding requests we receive. We are able to
fund only around a third of these, or about 10,000
new awards annually. In any case, we have access to
communities with expertise on a wide spectrum of areas
related to homeland security.
NSF is also about people--building the future science
and technology workforce. Since 1952, we've supported
36,000 graduate research fellows across the disciplines.
More broadly, we calculate that we directly support
nearly 200,000 people each year--teachers, students,
researchers, post-doctorates and trainees.
NSF highlights support for research at the intersections
of disciplines. The ideas and technologies of life
science, physical science and information science
are merging. Increasingly, it is at these frontiers,
where disciplines converge, that new knowledge is
being generated to meet the complex challenges we
face as a society.
In the past few years we have made it a deliberate
part of our strategy to demarcate areas of converging
discovery for special investment. These areas are
information technology, nanotechnology, biocomplexity,
mathematics, and the study of how we learn.
We lead the Federal investment in information technology,
a joint effort among Federal agencies. We also lead
the National Nanotechnology Initiative, a coalition
of organizations from government, academe and the
private sector. Because we encompass all the disciplines
of science and engineering, we naturally seek the
synergy of partnering with other agencies.
Not constrained by a narrow mission, we can be flexible
about responding to emerging needs. In fact, our founding
act of 1950 directs us to support unclassified research
and education through support from other federal departments
and agencies. KDD is just a current example of that.
We consider it critical to nurture new research communities
focused on emerging challenges. Many of these challenges
have dual payoffs.
Take the Incorporated Research Institutions for Seismology--the
worldwide network for monitoring natural seismic activity
and earthquakes that has been equally valuable for
monitoring nuclear tests, surreptitious or not.
Another example: the proposed National Ecological Observatory
Network, which will provide real-time monitoring of
complex ecological systems. NEON promises much more
detailed understanding of how the environment works.
It could also be used to track the health of our environment,
for monitoring invasive species or diseases such as
West Nile virus.
One more example: NSF is supporting the development
of a new research area, computational epidemiology.
The sheer scale and complexity of epidemiological problems
today--and the large data sets they engender--call
for powerful computational tools and mathematical
analysis. We're supporting groups that will study
specific topics, such as data mining and epidemiology.
Tutorials will also bring epidemiologists and biologists
together with computer and mathematical scientists
to learn about each other's fields.
There is another dimension to how we support research.
We believe that high-risk research with the payoff
of discovery needs the time and resources to flourish.
I have often spoken about the need to increase both
the size and duration of NSF awards. Our average grant
currently runs three years. However, a recent survey
of our principal investigators showed that five-year
grants would be more effective. Also, the survey suggested
that larger grants would encourage more innovative
ideas and greater collaboration with other researchers.
You can find the detailed report of the survey on
our website, but I cite it to assure you that those
of us at NSF consider its findings very important.
With this context on NSF's mission and style of working,
I'd like to focus on the KDD program. We all know
the term "data mining"--combing through a huge data
set for hidden insights, sort of like searching for
a needle in a haystack.
KDD aims beyond this: not only to discover vital bits
of information from many types of sources, but also
to perform rudimentary synthesis and to share the information
with those who need it. We expect the results to be
valuable in both the intelligence and law enforcement
arenas. KDD augments research by NSF grantees that
is already underway.
Researchers are given the resources to accelerate their
work. This support can enable them to take on new students,
collaborate with other faculty, buy new tools, or
even collaborate with each other, as we hope will
emerge from this meeting.
I'll turn now to some specific KDD projects, with graphics
provided by several of you here today, and I thank
each of you.
I plan to note just a few research highlights because
the real experts are here and they'll be presenting
their work in detail later on, so please save questions
on the projects for them. Another caveat is that this
is all very much work-in-progress.
[Slide up: Speaker
differences: Feedback]
The first project is "talk printing"--aimed at enabling
machines to automatically recognize a person by the
way he or she talks. This is Elizabeth Shriberg's
work, and she is from SRI International. Talk printing
goes beyond current approaches that tend not to differentiate
between speakers with similar vocal tracts. Instead,
the new method looks for identifying clues in word
sequence, intonation, pausing, and interruption behavior,
to name just a few.
Let's listen to a few examples of how different speakers
use unique "feedback" in conversation--little phrases
like "uh-huh" and "right" that we use to show we're
listening. We'll hear four examples, as shown on the
slide:
[NOTE: actual speech samples are not available]
The first is low in pitch and flat.
The next is a different speaker with a similar style,
but he uses another feedback word--"right."
Here's a higher pitch range.
The last speaker also rises in pitch but draws out
the phrase longer.
How a speaker closes a conversation is also very individual.
[change to next
slide: speaker differences/conversation closings]
They can all use the same phrase--"It was good talking
to you"--but sound very different. Let's listen to
a slow example.
Now a fast one.
Here's one that goes up and down in intonation.
Now a speaker with high energy and pitch range.
Talk printing will let a computer automatically distinguish
between these different habitual patterns of speech,
and I'm sure Elizabeth will explain the nuts and bolts
of the method in her presentation today.
I'll just add that the technique offers interesting
features for intelligence gathering, law enforcement,
and speech technology. For example, it can distinguish
between a casual chat and a conversation planning an
event. It can also suggest who is dominating a conversation,
or when someone is departing from their usual speaking
style--like disguising their voice.
[CMU Informedia:
screen capture/display of results on sightings of
Bin Laden couriers]
Here's another KDD example, the Informedia Digital
Video Library, provided by Howard Wactlar of Carnegie
Mellon University. In this case, the idea is to render
the vast amounts of available, open-source multimedia
data streams useful for intelligence analysts.
This research expands the ability to discover and track
relationships from video sources, using extracted
textual and visual information. Here, for example,
the analyst queries a large video database--broadcast
radio and television and surveillance video--for identifications
of Bin Laden couriers. The selected video samples
can actually be played. At the same time, a map is
shown that plots courier sightings at corresponding
times and places. Eventually the display will provide
material in multiple languages.
[2nd
CMU Informedia: relationships among Al-Qaeda terrorists]
Here's a second, simulated example showing the capabilities
of Informedia. This display illustrates relationships
among five Al-Qaeda members, again culled from visual
media reports. We see dots plotted between the individuals.
[closeup]
The more dots or "hits" on a line between two people--such
as between Atta and Zawahiri--the more frequently
they occur together in a news story.
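The counting behind such a display can be illustrated with a minimal sketch. It assumes each news story has already been reduced to a list of the names it mentions; the function name, the placeholder names, and the sample data are hypothetical and are not taken from Informedia itself.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(stories):
    """Count how many stories mention each unordered pair of names."""
    counts = Counter()
    for names in stories:
        # Deduplicate and sort so each pair is counted once per story
        # and always appears in the same order.
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    return counts


# Hypothetical data: each inner list holds the names found in one story.
stories = [
    ["Person A", "Person B"],
    ["Person A", "Person B", "Person C"],
    ["Person B", "Person C"],
]
counts = cooccurrence_counts(stories)
```

Each count corresponds to the number of dots that would be drawn on the line between two people.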
Different colors show time--when the news stories appeared.
The analyst can specify the period sampled--making
news stories of a specific time-period appear in a
certain color. For example, if you look closely here,
blue dots denote older reports, and pinker dots show
more recent stories.
[Roukos: the difficulty
of automatically detecting the first time a topic
is reported: graph of blue dots]
Another KDD example comes from Salim Roukos [Pr: Saleem
Roo-cohs] of IBM, who is developing technology to
automate the extraction of text with meaning--not
just the extraction of individual words. The problem
is how to automatically detect the first reporting
of a topic or event in the media. Present technology
using a word search alone does not work.
This graph depicts all the stories that ran on a newswire
service over a given time period, perhaps on a given
day. Each blue dot is a story.
[same graph with
diagonal line]
If we ask for all the news stories that are the first
reports on new topics, the current method draws
this diagonal line through the data and tells us to
look at the stories to the right of the line.
[same graph with
diagonal line and red dots]
Now we see the first reports highlighted in red. The
current method, based on searching for words, did
not work--it failed to separate old topics, blue dots,
from new topics, the red dots.
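Why a purely word-based test struggles can be seen in a small sketch. The talk does not spell out the current method, so this is an assumed baseline: a story is flagged as a first report when its word overlap (Jaccard similarity) with every earlier story falls below a threshold. The function names, the threshold value, and the sample stories are all hypothetical.

```python
def words(text):
    """Reduce a story to its set of lowercase words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two stories have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_first_reports(stories, threshold=0.2):
    """Return indices of stories that overlap little with all earlier ones."""
    seen = []
    flagged = []
    for i, text in enumerate(stories):
        current = words(text)
        best = max((jaccard(current, earlier) for earlier in seen), default=0.0)
        if best < threshold:
            flagged.append(i)
        seen.append(current)
    return flagged


# Hypothetical newswire: story 2 covers the same event as story 0
# but shares no words with it.
stories = [
    "earthquake strikes coastal city",
    "coastal city earthquake damage grows",
    "tremor hits seaside town",
    "election results announced",
]
flagged = flag_first_reports(stories)
```

In this toy data the reworded follow-up (story 2) is flagged as new alongside the two genuine first reports, which is the kind of error that a representation based on meaning rather than words is intended to avoid.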
[second Roukos
graphic: English/Arabic text correlations]
Computers need to develop a more semantic way to represent
text--to sense not just the words but what they mean
in their context. This graphic suggests how this could
be done. We see that news stories in English and Arabic
contain several similar phrases, color-coded for similarity,
strengthening the assumption that the same event is
being reported in both. Eventually, computer inference
of statistical patterns will automate the extraction
of knowledge in several languages.
One more example: Here is work by Hsinchun Chen, of
the University of Arizona, that is rooted in earlier
NSF programs called digital libraries and digital
government. It shows how KDD builds on previous NSF
support. In this case, police data, at left, show
links between people, places and entities in a criminal
network. The same data network has been automatically
adjusted--at right--to show particular relationships:
subgroups and central criminal figures.
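One simple way to surface a central figure in such a network is to count each person's distinct links. The talk does not say which measure this project uses, so the sketch below is only an assumed illustration using degree centrality; the function name and the link data are hypothetical.

```python
from collections import defaultdict


def degree_centrality(links):
    """Count the distinct neighbors of each node in an undirected network."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {node: len(adjacent) for node, adjacent in neighbors.items()}


# Hypothetical links between anonymized people in a network.
links = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P4", "P5")]
degrees = degree_centrality(links)
most_central = max(degrees, key=degrees.get)
```

Real analyses would likely weigh the type and strength of each link as well, which this count ignores.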
[one person's criminal
associations]
This graphic shows how an analyst has zeroed in on
one entity in the network--such as a person--to view
crime associations of that person.
Under KDD, Chen is obtaining the data from two police
departments--in Tucson and Phoenix--and scrubbing
them of references to an identifiable person, while
retaining the integrity of relationships between the
database objects.
The idea is to create large law-enforcement databases
that can be used for intelligence analysis research.
Privacy is preserved but a research resource is available
that reflects real-world patterns of criminal activity.
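The idea of removing identities while keeping relationships intact can be sketched as follows. This is not a description of the project's actual procedure: it assumes a keyed hash (HMAC-SHA256) that maps each name to a stable opaque token, and the key, the record layout, and the sample names are all hypothetical.

```python
import hashlib
import hmac


def pseudonym(name, secret_key):
    """Map a name to a stable opaque token using a keyed hash."""
    digest = hmac.new(secret_key, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return "P-" + digest[:12]


def scrub(records, secret_key):
    """Replace both person fields of (person, person, relation) records."""
    return [
        (pseudonym(a, secret_key), pseudonym(b, secret_key), relation)
        for a, b, relation in records
    ]


# Hypothetical key and records; a real key would be kept by the data owner.
key = b"hypothetical-secret"
records = [
    ("Alice Example", "Bob Example", "co-arrested"),
    ("Bob Example", "Carol Example", "shared address"),
]
scrubbed = scrub(records, key)
```

Because the same name always yields the same token, the link through the second person survives scrubbing even though no name remains. Replacing names alone does not guarantee privacy, since other fields can still identify a person.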
[slide off]
You'll be hearing much more detail soon about these
and other intriguing projects-in-progress, so I'll
sum up now with the general observation that cutting-edge
science and technology must be integrated into homeland
security efforts.
KDD is a superb example of this, because having the
right information at the right time--in the hands
of those who need it--is a critical capability for
foiling terrorist plots.
We all bear the responsibility to make our nation and
our world more secure. I think it is a privilege that
in many cases, the work that scientists and engineers
already do, and want to do, can be harnessed to meet
a current and pressing national need.
In KDD we see once again how fundamental research pays
off in unexpected ways--in this case, for the well-being
of our nation. I look forward to hearing what emerges
from your meeting, and I now welcome questions and
comments you might have.
[1] Gregory F. Treverton, Government Executive, Sept.
2002, p. 64