Re: How do web search engines work?

Date: Fri Sep 22 10:15:40 2000
Posted By: David Ehnebuske, Sr. Technical Staff Member, Software, IBM Corporation
Area of science: Computer Science
ID: 967834878.Cs

Message:

Mycroft,

Your question about how search engines work is an interesting one. Your understanding of what happens when you ask for a search is essentially correct: When you make a request, the engine looks through a database, which is in effect a really big index of web pages, for entries that match your query. The list that comes back contains summary information and a link for each of the entries that "best" match your query.
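To make the "really big index" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how any particular search engine works: the three pages in PAGES are made up, the names build_index and search are hypothetical, and "best" is simplified to "contains the most query words".

```python
from collections import defaultdict

# Hypothetical mini "web": page URL -> page text (made up for illustration).
PAGES = {
    "http://example.com/a": "search engines crawl the web",
    "http://example.com/b": "web pages link to other web pages",
    "http://example.com/c": "engines index pages",
}

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs ranked by how many of the query words they contain."""
    scores = defaultdict(int)
    for word in query.lower().split():
        # .get() avoids creating empty entries for words that were never indexed.
        for url in index.get(word, ()):
            scores[url] += 1
    # Best match first; ties are broken alphabetically so results are stable.
    return sorted(scores, key=lambda url: (-scores[url], url))

INDEX = build_index(PAGES)
print(search(INDEX, "web pages"))
```

The point of the design is that a query never reads the pages themselves; it only looks up each query word in the index, which is why the answer can come back quickly.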

Search engine sites develop their databases using two never-ending processes: "Web crawling" and "pruning". Crawling consists of reading a list of web pages one by one with computers. Pages read in this way are processed to make new database entries. They are also processed to extract links to pages that haven't yet been crawled, which are put on the list of web pages still needing to be crawled. Pruning consists of throwing away entries that are obsolete either because the pages they index are no longer there or because their content has changed.
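The two processes can be sketched roughly as follows. This is a toy under loud assumptions: FAKE_WEB is a made-up, in-memory stand-in for the real web, and fetch, crawl, and prune are hypothetical names chosen for this example. A real crawler would download pages over the network and would run both processes continuously rather than once.

```python
from collections import deque

# Hypothetical stand-in for the real web: URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("home page", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", ["http://example.com/", "http://example.com/gone"]),
}

def fetch(url):
    """Pretend to download a page; returns None if the page does not exist."""
    return FAKE_WEB.get(url)

def crawl(seed_urls):
    """Read pages one by one, recording entries and queueing newly found links."""
    to_crawl = deque(seed_urls)   # the list of pages still needing to be crawled
    seen = set(seed_urls)         # every URL ever put on that list
    database = {}                 # URL -> text seen at crawl time
    while to_crawl:
        url = to_crawl.popleft()
        page = fetch(url)
        if page is None:
            continue              # dead link: nothing to index
        text, links = page
        database[url] = text
        for link in links:
            if link not in seen:  # only queue pages not yet listed
                seen.add(link)
                to_crawl.append(link)
    return database

def prune(database):
    """Throw away entries whose page has vanished or whose content has changed."""
    for url in list(database):
        page = fetch(url)
        if page is None or page[0] != database[url]:
            del database[url]

db = crawl(["http://example.com/"])
print(sorted(db))
```

In this sketch a changed page is simply dropped by prune; it would get a fresh entry the next time the crawler reads it.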

Large as these indexes are, they are not anything like as huge as complete copies of the web. There are basically two reasons for this.

First, no search engine indexes anything close to all of the web. Since no one knows with any degree of accuracy how big the web actually is, no one really knows what fraction of the web is indexed by the big search sites. I've heard estimates ranging from less than 1% to somewhere around 20%. One thing is certain, though: As the web expands, the fraction that even the biggest search engines cover goes down. The main reason is that the overall size of the web is growing exponentially while the average lifetime of a web page is going down. Both of these make it really tough on search engines; they have more to crawl and what they do manage to crawl quickly becomes obsolete. Since the amount of time they have to do the work remains the same, the fraction of the web they can keep track of goes down.

The second reason the database isn't the same size as the web is that search sites use all kinds of tricks to keep the amount of data they have to store to an absolute minimum. Some are the tricks people have used for years to keep the size of indexes in books to a small fraction of the size of the book itself: for example, ignoring graphics and words that are not useful in queries. They also use really clever computer-type tricks, such as assigning numbers to the normal but still useful words that they do index. These numbers take much less space to store than the words they stand for.
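Both tricks, skipping words that are not useful in queries and storing numbers in place of words, can be sketched in a few lines of Python. The STOP_WORDS list and the encode function are hypothetical and chosen only for this example; each search site picks its own word list and numbering scheme.

```python
# Hypothetical list of words too common to be useful in queries.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def encode(text, word_ids):
    """Replace each useful word with a small integer, assigning new numbers as needed."""
    encoded = []
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue                         # not worth indexing at all
        if word not in word_ids:
            word_ids[word] = len(word_ids)   # next unused number
        encoded.append(word_ids[word])
    return encoded

WORD_IDS = {}
FIRST = encode("the history of the web", WORD_IDS)
SECOND = encode("a web of history and search", WORD_IDS)
print(FIRST)     # [0, 1]
print(SECOND)    # [1, 0, 2]
print(WORD_IDS)  # {'history': 0, 'web': 1, 'search': 2}
```

Each word is spelled out only once, in the WORD_IDS table; every later occurrence costs just one small number, which is where the space saving comes from.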

I hope this helps answer the question. If you need more, I'd bet a carefully formed query to any of the big search engines would find pages of good references!
