Google’s Secret Revealed: How Does a Google Search Query Work?

Aug 13

08:58

2007

Hari Qumar

Process behind the working of Google search engine

You must have wondered at times how Google is able to give the correct or required result in such a short span of time. I myself used to ponder over it for a long time.

Let me explain the process behind the search results delivered by Google.

A Google search query normally lasts less than half a second, yet it involves a number of different steps that must be completed before results can be delivered to a person seeking information.

How the Google spider crawls: In Google, web crawling (the downloading of web pages) is done by several distributed crawlers (called Googlebots). A URLserver sends lists of URLs to be fetched to these crawlers. The web pages fetched by the Googlebots are then sent to the storeserver, which compresses them and stores them in a repository (a kind of storage facility). Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed (analysed) out of a web page.
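To make the storeserver idea a little more concrete, here is a minimal sketch in Python. It is only my own toy model, not Google's code: the class name Repository, its methods, and the choice of zlib for compression are all assumptions made for illustration.

```python
import zlib


class Repository:
    """Toy stand-in for the storeserver's repository (hypothetical design)."""

    def __init__(self):
        self.doc_ids = {}  # URL -> docID
        self.pages = {}    # docID -> compressed page bytes

    def doc_id_for(self, url):
        # Assign the next integer docID the first time a URL is seen.
        if url not in self.doc_ids:
            self.doc_ids[url] = len(self.doc_ids)
        return self.doc_ids[url]

    def store(self, url, html):
        # Compress the fetched page and file it under its docID.
        doc_id = self.doc_id_for(url)
        self.pages[doc_id] = zlib.compress(html.encode("utf-8"))
        return doc_id

    def load(self, doc_id):
        # Uncompress a stored page, as the indexer would do later.
        return zlib.decompress(self.pages[doc_id]).decode("utf-8")
```

Calling store() with a URL and its HTML returns the docID, and load() with that docID gives the original HTML back.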

The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses (analyses) them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization.
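A hit can be pictured as a small record. The sketch below is an assumption-laden simplification: the names Hit and extract_hits are mine, the text is treated as plain words with no HTML, and every word gets the same made-up font size.

```python
import re
from collections import namedtuple

# One word occurrence: the word, its position in the document,
# an approximate font size, and whether it was capitalized.
Hit = namedtuple("Hit", ["word", "position", "font_size", "capitalized"])


def extract_hits(text, font_size=3):
    """Turn plain text into a list of hits (font_size is an assumed constant)."""
    hits = []
    for position, token in enumerate(re.findall(r"[A-Za-z0-9]+", text)):
        hits.append(Hit(token.lower(), position, font_size, token[0].isupper()))
    return hits
```

For example, extract_hits("Google crawls the Web") yields four hits, with "google" at position 0 marked as capitalized.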

The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses (analyses) out all the links in every web page and stores all the important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
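Here is one way the barrels could be pictured. This is only a sketch under my own assumptions: the number of barrels, the rule of picking a barrel by wordID modulo NUM_BARRELS, and the function name build_forward_barrels are all hypothetical. Documents are visited in docID order, so each barrel ends up ordered by docID, which is one reading of "partially sorted".

```python
from collections import defaultdict

NUM_BARRELS = 4  # assumed value, chosen only for the example


def build_forward_barrels(docs, lexicon):
    """docs: {doc_id: [word, ...]}; lexicon: {word: word_id}, filled as we go.

    Returns a list of barrels, each mapping doc_id -> word_id -> positions.
    """
    barrels = [defaultdict(lambda: defaultdict(list)) for _ in range(NUM_BARRELS)]
    for doc_id in sorted(docs):
        for position, word in enumerate(docs[doc_id]):
            # Give each new word the next free wordID.
            word_id = lexicon.setdefault(word, len(lexicon))
            # Assumed rule: the wordID decides which barrel gets the hit.
            barrel = barrels[word_id % NUM_BARRELS]
            barrel[doc_id][word_id].append(position)
    return barrels
```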

The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links that are pairs of docIDs.
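The URLresolver step can be sketched with Python's standard urljoin function. The anchors file is modelled here as a simple list of (source URL, link target, link text) tuples, which is my own assumption, as is the function name resolve_anchors.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors, doc_ids):
    """anchors: list of (source_url, href, anchor_text) tuples.

    doc_ids maps URL -> docID and is extended for URLs not seen before.
    Returns (links, anchor_text): docID pairs, and link text per target docID.
    """
    links = []
    anchor_text = {}
    for source_url, href, text in anchors:
        # Convert a relative URL into an absolute one.
        target_url = urljoin(source_url, href)
        src = doc_ids.setdefault(source_url, len(doc_ids))
        dst = doc_ids.setdefault(target_url, len(doc_ids))
        links.append((src, dst))
        # The anchor text is stored with the page the link points to.
        anchor_text.setdefault(dst, []).append(text)
    return links, anchor_text
```

Note that the link text is filed under the docID of the target page, not the page that contains the link.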

The links database is used to compute PageRanks for all the documents. The pages that get into the barrels first are the most likely to be included in the search results. The question is, how does Google decide which pages go into the barrels first? The answer (as some of you have probably already guessed) is based on the fact that the docID is an integer.
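To show how a database of docID pairs can be turned into ranks, here is a small sketch of the usual iterative PageRank calculation. The damping factor of 0.85, the fixed number of iterations, and the handling of pages with no outgoing links are assumptions of this example, not details taken from Google's system.

```python
def pagerank(links, num_docs, damping=0.85, iterations=50):
    """links: list of (source_doc_id, target_doc_id) pairs, docIDs in 0..num_docs-1."""
    out_links = {doc_id: [] for doc_id in range(num_docs)}
    for src, dst in links:
        out_links[src].append(dst)

    ranks = [1.0 / num_docs] * num_docs
    for _ in range(iterations):
        new_ranks = [(1.0 - damping) / num_docs] * num_docs
        for src, targets in out_links.items():
            if targets:
                # Each page shares its rank equally among the pages it links to.
                share = damping * ranks[src] / len(targets)
                for dst in targets:
                    new_ranks[dst] += share
            else:
                # Assumed rule: a page with no links spreads its rank over all pages.
                share = damping * ranks[src] / num_docs
                for dst in range(num_docs):
                    new_ranks[dst] += share
        ranks = new_ranks
    return ranks
```

With the links (0, 1), (1, 2), (2, 0) and (0, 2), page 2 ends up ranked above page 1, because it receives links from two pages instead of one.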
