
Thursday, December 15, 2005

March 2004

How bots work
Explore the key components of a search engine
by Salman Siddiqui

Each day, the Internet adds some 60 terabytes of data to its already massive store, increasing the number of web pages available, which, according to some estimates, has crossed the 10-billion mark. With this huge amount of information online, finding relevant information in the shortest possible time becomes a key challenge. This is where search engines play a critical role.
Spiders

A search engine makes use of specialized software robots (or bots) called spiders to scan through thousands of pages in a second. This process of scanning is called ‘crawling’: as it crawls, the spider program copies the keywords, titles, phrases and other descriptive information about the web page into a search index.

According to a famous search-engine programmer, Tim Bray, it doesn’t take rocket science to implement a spider program. Tim provides seven simple steps at his site to build one such simple bot (a code sketch follows the list):

1. In the beginning, when the search engine has no pages in its index, the spider program must use popular websites as a starting point and crawl through each link within those sites. So, in the first step, select a URL from one such well-known site.
2. Crawl through that URL and get the headers/data from the site’s servers.
3. The data received from the site may include HTML, graphics, video and audio files, etc. Store the files you want to index and discard the rest; typically you will need only the HTML, PDF and other document files. If you cannot index the type of data received from the URL, return to step one.
4. Update your search-engine index with what you just fetched from the site.
5. Extract the hyperlinks from the files you have just fetched.
6. Add the URLs from these hyperlinks to the list of links through which the program needs to crawl for indexing.
7. Go to step one.
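Here is a minimal sketch of those seven steps in Python (one of the two languages named below as the usual choice for spiders). The seed URL, the in-memory index, the page limit and the politeness delay are all assumptions made for illustration; a production spider would be far more careful about parsing, queuing and exclusion rules.

    # A minimal spider following the seven steps above, standard library only.
    import time
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collects the href target of every <a> tag (step 5)."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=10):
        index = {}              # url -> page text (stand-in for a real index)
        queue = [seed_url]      # step 1: start from a well-known URL
        seen = set(queue)
        while queue and len(index) < max_pages:
            url = queue.pop(0)
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    # step 3: keep only content types we can index
                    if "text/html" not in resp.headers.get("Content-Type", ""):
                        continue
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue        # unreachable page: move on
            index[url] = html   # step 4: update the index with the fetch
            parser = LinkExtractor()
            parser.feed(html)   # step 5: extract the hyperlinks
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)  # step 6: grow the crawl list
            time.sleep(1)       # be polite to servers between requests
        return index

    if __name__ == "__main__":
        pages = crawl("https://example.com/")
        print("Indexed", len(pages), "pages")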
Although the most popular programming languages for coding a spider program are Perl and Python, one can also be written in Java or C, though that requires more time and effort.

There are usually two types of ‘crawls’ associated with spider programs: fast crawl and deep crawl. In a fast crawl, the spider bot visits sites that are updated frequently and looks only for new content there. In a deep crawl, the program not only checks for new material on a site but also revisits every link within it, in a random order. A number of spider programs from the same search engine can also run in parallel at the same time.
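The fast/deep distinction might be dispatched like this in code; the Site record, the update flag and the print statements are illustrative assumptions, not how any real engine schedules its crawls.

    # Sketch: pick a crawl strategy per site based on how often it changes.
    import random
    from dataclasses import dataclass

    @dataclass
    class Site:
        url: str
        updates_often: bool
        links: list

    def fast_crawl(site):
        # Visit frequently-updated sites and look only for new content.
        print("fast crawl:", site.url, "- checking for new content")

    def deep_crawl(site):
        # Revisit every link within the site, in a random order.
        for link in random.sample(site.links, len(site.links)):
            print("deep crawl:", site.url, "->", link)

    sites = [
        Site("news.example.com", updates_often=True, links=["/a", "/b"]),
        Site("static.example.org", updates_often=False, links=["/x", "/y", "/z"]),
    ]
    for site in sites:
        (fast_crawl if site.updates_often else deep_crawl)(site)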
Meta Tags

Since meta tags allow a website owner to specify the keywords and phrases that describe the information contained in a website, they are sometimes thought of as ‘data about the data’. These keywords are used by search engines to store information about websites in their index databases.

The tags are generated either by hand, as Yahoo! did by employing a team of editors, or automatically, as Google does by using thousands of servers at its Googleplex compound to compute website rankings according to its Page Rank algorithm.
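As a concrete illustration, here is a small sketch of how an indexer could pull that ‘data about the data’ out of a page, using Python’s standard html.parser module; the sample page and its keywords are invented.

    # Sketch: extract name/content pairs from a page's meta tags.
    from html.parser import HTMLParser

    class MetaTagReader(HTMLParser):
        def __init__(self):
            super().__init__()
            self.meta = {}
        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                attrs = dict(attrs)
                name, content = attrs.get("name"), attrs.get("content")
                if name and content:
                    self.meta[name.lower()] = content

    page = """<html><head>
    <meta name="keywords" content="search engines, spiders, crawling">
    <meta name="description" content="How search engine bots work">
    </head><body>...</body></html>"""

    reader = MetaTagReader()
    reader.feed(page)
    print(reader.meta["keywords"])   # -> search engines, spiders, crawling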
Site Rankings

Different search engines employ different strategies to rank websites in response to the keywords searched for by a user, and these differences in strategy ultimately determine the success or failure of a particular search brand.

Most search-engine algorithms take into account the frequency of the keyword found in a particular website: the higher the number of exact matches found, the higher the rank. Other factors that influence the ranking include whether the keyword appears in the title or a subtitle, how deep it resides within a document, and so on. Many popular search engines, such as AltaVista, are based on these considerations.

But a search engine that takes into account only the frequency of keywords can be easily manipulated. A smart website programmer might be tempted to optimize his website by (mis)using meta tags to generate more traffic and increase his site’s ranking in a search engine. This can be done by including, in the meta tags of web pages, popular keywords that are completely unrelated to the content of the website. Ideally, the spider programs of a search engine should not rely on a site’s meta tags alone to gather information, and should include checks to match the meta tags against the actual content of the site.

However, even these tweaks to the spider bots may not be enough to thwart artificial methods of page-hit inflation, since one can just as easily publish thousands, if not millions, of dummy web pages whose content does match the popular keywords. Hence other factors, such as the frequency of updates to a web page, should also be considered in determining the proper rank of a site.
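A toy version of frequency-based ranking might look like the sketch below; the weight given to a title match and the sample pages are arbitrary assumptions, chosen only to show how easily such a score is computed (and hence gamed).

    # Sketch: rank pages by exact keyword matches, with a title bonus.
    def score(page_title, page_body, keyword):
        keyword = keyword.lower()
        freq = page_body.lower().count(keyword)      # raw match frequency
        title_bonus = 5 if keyword in page_title.lower() else 0
        return freq + title_bonus

    pages = {
        "Spider basics": "spiders crawl pages; spiders follow links",
        "Cooking tips": "a spider appears once in this text",
    }
    ranked = sorted(pages, key=lambda t: score(t, pages[t], "spider"),
                    reverse=True)
    print(ranked)   # the page actually about spiders ranks first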
Google’s Page Rank

What set Google apart from other current (and popular) search engines, and continues to do so, is its unique “Page Rank” algorithm, which ranks websites on the basis of their popularity on the Internet. According to this rule, the greater the popularity of a web page, the higher its ranking. The popularity of a page is judged by taking into account the number of references, or links, pointing to the page. This follows from the premise that the more a page is cited or referred to by other web pages, the more likely it is to contain relevant information. The referrals to the page are calculated using the huge web-page index that is continuously updated and maintained by the 10,000 Google web servers.

But must the most popular site also contain the information most useful to the user? Not necessarily, especially if you consider the fact that no search engine contains an index of all the web pages of the ever-expanding Internet, only a fraction of them. Google has the largest Internet index of all the search engines, with about 4.28 billion web pages and 800 million images fully indexed. Even this huge index, however, falls short of the Internet by about 6 billion pages. Despite this, Google is probably the best bet when it comes to searching the Net…at least for now.

In Google, a devious site administrator may artificially improve his site’s rankings by publishing thousands of junk pages that contain links to his site. This can be regularly observed in the ‘Google bombing’ phenomenon, where bloggers make a humorous political statement by intentionally creating many links to a given site so that it comes up first in response to a derogatory keyword. For example, if you type the keywords ‘miserable failure’ into Google and press the ‘I’m Feeling Lucky’ button, the biography of George Bush loads up. And if you search for the keywords ‘Daniel Pearl Videotape’, a web page loads up denouncing your actions and urging you not to look for the tape, as doing so serves the objectives of terrorists.
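The core idea can be captured in a few lines. The sketch below is the standard power-iteration formulation of Page Rank, not Google’s production code; the three-page ‘web’ is invented, and the 0.85 damping factor is the value suggested in the original Page Rank paper.

    # Sketch: each page's rank is fed by the ranks of the pages linking to it.
    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - damping) / len(pages) for p in pages}
            for page, outgoing in links.items():
                share = rank[page] / len(outgoing) if outgoing else 0
                for target in outgoing:
                    new_rank[target] += damping * share
            rank = new_rank
        return rank

    # A tiny web: everyone links to C, so C should rank highest.
    links = {"A": ["C"], "B": ["C"], "C": ["A"]}
    for page, r in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
        print(page, round(r, 3))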
Robot exclusion protocols

Now, if you want a particular web page, or an entire website, not to be indexed or visited by a search engine spider, you can use robot exclusion protocols, by means of which a website administrator can indicate which parts of a website may not be visited by a search engine. One way of doing this is to place a file named robots.txt at the root of the web server where your site is hosted (e.g. nameofyoursite.com/robots.txt). To shut all spiders out of the entire site, the robots.txt file typically contains the following instructions:

    User-agent: *
    Disallow: /

Also, through the use of a robots meta tag such as <meta name="robots" content="noindex">, the HTML author of a web page can specify to a spider bot that the page is not to be indexed.
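On the spider’s side, honoring these rules is straightforward; here is a sketch using Python’s standard urllib.robotparser module, with a placeholder site standing in for a real one.

    # Sketch: a well-behaved spider checks robots.txt before fetching.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()   # fetch and parse the exclusion rules

    # With "User-agent: *" / "Disallow: /" in effect, nothing may be fetched.
    print(rp.can_fetch("MySpider", "https://www.example.com/private/page.html"))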
Industry trends

Over 550 million search requests are entered worldwide every day, of which 245 million come from the US alone. As of December 2003, 35 percent of all search requests were handled by Google, 27 percent by Yahoo!, 15 percent by Microsoft, and 16 percent by AOL-Time Warner-owned sites. With US$2 billion in paid-placement revenues in 2003, the search industry is projecting revenues as high as US$7 billion by 2007.

Even though Google currently reigns supreme, the question is: how long can it sustain its market dominance? With US$7 billion involved, every wannabe search engine wants a share of the pie, brewing stiff competition for the coveted position of search-engine favorite in the near future. The field has been opened to recent upstarts like Mooter and Dipsie, which have developed search engines along new lines. Dipsie, which is planning to launch this coming summer, will index over 10 billion documents through its unique search-engine spiders, and claims that it will give Google a run for its money. Mooter, on the other hand, creates clusters that group the results of the keywords searched, and then attempts to tailor the results to the interests of the user.

None of these, however, promises a revolutionary change in the way we search the Web today, with the exception of Microsoft Research’s unreleased AskMSR program. Microsoft is planning to include a built-in search engine as part of its future operating system, Longhorn, which is scheduled for release in 2005. The search engine is based on the principle that when we search the Web we are in fact looking for answers. These answers may not necessarily be on the Web alone; they could very well be residing in files, such as Word documents and Excel sheets, on our own systems. Microsoft’s objective is to develop a search engine that looks for relevant information not only on the Web but also in the files that reside on our systems. For this purpose it is developing an entirely new file-management system for its upcoming OS, called WinFS, which would allow one to link a piece of information, such as an individual’s contact details in Outlook, with the author of a Word document. It is also incorporating natural-language processing capabilities and other advanced artificial-intelligence methods to revolutionize the very nature of present-day search-engine technology. With a 97 percent share of the PC OS market, it won’t be surprising if the next generation of search engines ends up coming from Microsoft as well.

But whatever the case, one thing is for sure: the users of tomorrow will have better options to choose from and more relevant information at their fingertips.
