National Science Foundation
CISE - The Directorate for Computer and Information Science and Engineering


CISE Distinguished Lecture Series

How to Crawl the Web

Hector Garcia-Molina

12/12/2000
0:00 pm - 3:00 pm

4201 Arlington Boulevard
NSF 110
Arlington, VA 22230

A crawler collects large numbers of web pages, to be used for building an index or for data mining. Crawlers consume significant network and computing resources, both at the visited web servers and at the site(s) collecting the pages, and thus it is critical to make them efficient and well behaved. In this talk I will discuss how to build a "good" crawler, addressing questions such as:

How can a crawler gather "important" pages only?

How can a crawler efficiently keep its collection "fresh"?

How can a crawler be parallelized?

I will also summarize results from an experiment conducted on more than half a million web pages over 4 months, to estimate how web pages evolve over time.

Hector Garcia-Molina (http://www-db.stanford.edu/people/hector.html) is the Leonard Bosack and Sandra Lerner Professor in the Departments of Computer Science and Electrical Engineering at Stanford University, Stanford, California. From August 1994 to December 1997 he was the Director of the Computer Systems Laboratory at Stanford. From 1979 to 1991 he was on the faculty of the Computer Science Department at Princeton University, Princeton, New Jersey. His research interests include distributed computing systems and database systems. He received a BS in electrical engineering from the Instituto Tecnologico de Monterrey, Mexico, in 1974. From Stanford University he received an MS in electrical engineering in 1975 and a PhD in computer science in 1979. Garcia-Molina is a Fellow of the ACM, received the 1999 ACM SIGMOD Innovations Award, and is a member of the President's Information Technology Advisory Committee (PITAC).

 

 

 
