Insight into Search Engines' Functions and Methods
Before the advent of search engines, the web was a strange place. Content was growing at a startling pace, with no means to filter information or direct people to content they were unaware of. Users typically relied on forums, IRC or email to find links. Any service that could make sense of this mass of information, make it coherent, and allow users to find the content they wished for, was poised to grow spectacularly.
What was needed was a front door to the internet, a structure that allowed users to “surf”. Within a decade of their introduction, search engines changed the entire face of the web. Before search engines, the web was mostly used for academic purposes and communication, primarily through electronic mail. Search engines allowed users to create content, market products, establish e-commerce and offer a wide variety of services. The technologies behind search engines have evolved with time, and are constantly mutating. The innovation, takeovers and competition among search engines were intense at the beginning of this century, because the winner stood to replace Microsoft as the chief technology service provider. No guesses for who won, but the struggle is still on. A whole range of technologies is now driving a new generation and breed of search engines.
The internet is a network of different servers and computers, linked by a range of cable types. These servers may belong to individuals, or be owned by companies. Stored on these servers are documents written in a markup language (HTML and its variants) that are interpreted by browsers. These documents invariably contain hyperlinks (text that leads viewers to another page when clicked). This network of hyperlinks forms the World Wide Web, known simply as the web. While the internet is the hardware, the web is the information stored on this hardware. Search engines keep track of developments and changes on the web, and allow users to track down the documents they want to access. This is a never-ending process, because the web changes with every passing moment.
The history of internet search
HotBot and AltaVista were among the early search engines. Both Yahoo! and MSN have had popular email and messenger services, both augmented by search. Yahoo! had the additional benefit of a larger database of web sites stored in a directory structure. AskJeeves was also a very popular web service, and one of the first search engines to let users look for information as if they were talking to an actual person. AltaVista was one of the first search engines to reach a wide audience, and it grew in popularity because of the number of pages it had indexed. HotBot and Yahoo! later dethroned AltaVista. While this race had been going on since 1996, Google entered the scene only in 1998, with a small index, a minimalist interface and vision.
There is a whole range of highly specialized search engines developed in recent times, but essentially, the status quo has been maintained since 2000, with the top three search engines being Google, Yahoo! and MSN, in that order. Another important player was Inktomi, which only provided back-end search technology used by both Yahoo! and MSN. Inktomi never had its own web portal for providing search services to users; it allowed its indexes and search technologies to be used by other search service providers. Before Google, Inktomi-based search providers were the most used, displacing HotBot and AltaVista as the leading search services. Inktomi also had a large database of businesses, and listed ads from these businesses next to the search results. It pioneered the pay-per-click model, which was later perfected by Google with AdWords. Yahoo! switched to Google’s databases and back to Inktomi within the first five years of the millennium, and acquired Inktomi in 2002. MSN has been using its own search technology since 2004.
With the web growing, search engines started expanding their services. Google was the market leader in innovation, buying out promising startups or developing fresh services in its own labs. AskJeeves.com was the first to make money by asking sites to pay to be indexed. Provisions for multimedia content were made by early players such as Inktomi, but never really took off on a large scale. Search engines now make money by displaying advertisements related to a search query. In most engines, these ads occupy a separate space, but the distinction is not always clear. Multimedia search, including video and images, is offered by the leading search providers. Locations, maps and people are other common search services, but these are currently available only in a few countries.
Spiders crawling the web
One of the fundamental tasks of a search engine is to discover what documents exist on the web at all. This was initially done by looking at known file servers, but the process is much more complicated now. Search engines use programs or scripts that automatically traverse the web by following hyperlinks. These programs or scripts are known as robots (colloquially called bots) or spiders. Spiders go to a particular web page, save it on their servers and follow all the links on that page. Going over the web like this is called “crawling”, and it is a continuous process. There are a number of approaches a spider can take when crawling the web. A wide (breadth-first) crawl focuses on as many different URLs as possible, from as many different domains as possible. A depth crawl first goes deep into a single site. In practice, crawling is a compromise between these approaches. There are also random crawls, which do not give preference to either approach. Spiders continuously track the creation of new content, and hence have a never-ending job. Search engine crawlers typically have a huge queue of URLs waiting to be crawled. Spiders will never be able to crawl the entire web, and no search engine has crawled more than 20 per cent of it. The number of pages crawled by a spider is no measure of the quality of the search engine. Google had a very small cache when it started out, but still gained popularity because of the relevance of its results. Google uses a spider called Googlebot. Due to the different approaches a spider can take, some search engines use multiple spiders. Yahoo! uses Slurp or Yahooseeker. MSN uses Msnbot, Msrbot, Msiecrawler and Lanshanbot.
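To make the idea concrete, here is a minimal sketch of a wide (breadth-first) crawler in Python. It is illustrative only: the seed URL is a placeholder, and a real spider would also add politeness delays, robots.txt checks, parallel fetching and storage on document servers.

```python
# Minimal breadth-first crawler sketch (illustrative, not production code).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href target of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=20):
    queue = deque([seed_url])   # frontier of URLs waiting to be crawled
    seen = {seed_url}           # URLs already discovered, to avoid repeats
    fetched = []                # pages actually downloaded
    while queue and len(fetched) < max_pages:
        url = queue.popleft()   # FIFO order gives a wide (breadth-first) crawl
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue            # unreachable or unreadable pages are skipped
        fetched.append(url)
        # A real engine would store the raw page on a document server here.
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return fetched

if __name__ == "__main__":
    print(crawl("https://example.com"))  # placeholder seed URL
```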
A spider often visits the same page several times over. This is essential to keep the search engine up to date with the web site. A news web site or a web log tends to change rapidly, so spiders have to keep crawling it over and over again. Static pages are not crawled as frequently. Most sites strive hard to be dynamic and offer new content on every visit, which drains a considerable share of a spider’s time away from new web pages. It takes time, and a lot of effort on the part of the web designer, to get a page crawled, which is why many search engines accept payment to crawl a site and update it frequently.
Site owners have a love-hate relationship with spiders. While getting their pages crawled is important for hits, a number of spiders can easily drain the resources of a small web site with limited bandwidth. The rate at which spiders make requests can take a site down and deny service to legitimate visitors. Spiders also use a multi-threaded approach to cache web pages, akin to a download manager making a number of requests to the same site in parallel. Many web site owners therefore prefer to keep some spiders away from their content, so webmasters use a robots.txt file to deny some or all spiders access to their pages. This gives a webmaster the liberty to withhold data from being crawled. A robots.txt file can allow only some crawlers and disallow others, and can also be used to shield private content on the web. However, this approach is not always safe, as aggressive crawlers may ignore the robots.txt file and crawl the pages anyway.
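The sketch below shows how a well-behaved crawler might honour such rules, using Python's standard robots.txt parser. The rules themselves are a made-up example, not taken from any real site.

```python
# Sketch of a crawler consulting robots.txt before fetching pages.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one bot is banned entirely, everyone else
# is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("BadBot", "/index.html"))         # False: banned everywhere
print(rp.can_fetch("Googlebot", "/index.html"))      # True: allowed
print(rp.can_fetch("Googlebot", "/private/a.html"))  # False: private area
```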
The flooded server farms
Search engines require a large amount of storage space. An index of all the web pages crawled by the search engine’s spiders has to be stored. Initially, this index was kept on a single server. Google pioneered a technique of spreading the data across an array of machines, which allowed faster access. This was also the cheaper option when Google started up in a garage. The technique was soon picked up by other search engines.
The index is constantly updated, and several technologies are needed to retrieve data from it. The most common method is hashing, which allows retrieval from incredibly large databases such as those used by search engines. Hashing is the process of creating a relatively small numerical value from a text string, which makes retrieval from the index much faster and easier.
For a simplistic explanation of how hashing works, consider a user looking for a particular Linux distribution. In the database of the search engine, there will be a number of distributions such as Fedora, Mandriva, Sabayon, OpenSUSE and so on. If a user searches for “Fedora”, the index would have to match the alphanumeric values of all the distributions, each of a different length. A much faster approach is to assign the numbers 1, 2, 3, 4 and so on to each of the distributions. When a search is made, the search engine only has to match short numerical values instead of the much longer alphanumeric ones. This is just an illustration; real hash functions typically produce fixed-length values (often 32 bits or longer), but the gain in search speed in very large databases is the same.
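As a toy illustration of this idea, the sketch below hashes the distribution names from the example into a small, fixed number of index buckets. The bucket count and the use of CRC32 are arbitrary choices made for the sketch, not how any particular engine does it.

```python
# Toy example: hash variable-length terms to small, fixed-size bucket numbers.
import zlib

terms = ["Fedora", "Mandriva", "Sabayon", "OpenSUSE"]
NUM_BUCKETS = 8  # hypothetical number of index buckets

def bucket_for(term):
    # CRC32 turns a string of any length into a 32-bit value;
    # the modulo maps that value onto one of the buckets.
    return zlib.crc32(term.lower().encode("utf-8")) % NUM_BUCKETS

index = {}
for term in terms:
    index.setdefault(bucket_for(term), []).append(term)

query = "Fedora"
print(bucket_for(query), index.get(bucket_for(query)))  # bucket and its contents
```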
When a spider crawls a page, it reads the page in many ways, interpreting and storing a number of aspects of it for later use. The spider looks at all the words in the page, their proximity to each other, whether they appear in bold or in headers, and the title of the page, and stores the entire page in a cache. If it encounters a non-HTML document such as a PDF, a spreadsheet or a Word file, it stores the same content in XHTML format. The copy of the web page kept on the crawler’s servers, called the cache, preserves the state of the page as it was when crawled, which is useful when the live page has since been updated and lost the content that was relevant to the search. The qualitative information about the page is stored on the servers of the search engine, often after compression, and is used later to display the search results.
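Here is a rough sketch of this kind of per-page analysis: it records each word's position and whether it appeared in the title, a header or bold text. The class and field names are invented for illustration; real indexers extract far more signals than this.

```python
# Sketch of extracting word positions and contexts from an HTML page.
from html.parser import HTMLParser

class PageAnalyzer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags
        self.position = 0      # running word position in the page
        self.postings = {}     # word -> list of (position, context)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if "title" in self.stack:
            context = "title"
        elif "b" in self.stack or "strong" in self.stack:
            context = "bold"
        elif any(t in self.stack for t in ("h1", "h2", "h3")):
            context = "header"
        else:
            context = "body"
        for word in data.lower().split():
            self.postings.setdefault(word, []).append((self.position, context))
            self.position += 1

page = ("<html><head><title>Fedora Linux</title></head>"
        "<body><h1>Fedora</h1><p>A <b>Linux</b> distribution.</p></body></html>")
analyzer = PageAnalyzer()
analyzer.feed(page)
print(analyzer.postings["linux"])  # positions and contexts of the word "linux"
```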
There are three kinds of servers that every major search engine employs. The first kind is the spidering server, which indexes the web. Servers across the globe store parts of the index, called “shards”. The index functions as an essential whole, but is physically located in different geographical areas. The second kind, the document servers, store the cache of the crawled pages. The third kind directs traffic, resolves search queries, implements search algorithms and interfaces with the user. Typically, an x86 architecture is used, with a Linux-based OS. A typical search engine may have servers in as many as twenty different locations around the world.
What are you really looking for anyway?
Languages are very complex, and context plays an important role in communication. Words such as “executive”, “silver” or “orange” have different meanings and can be interpreted in different ways. This is a common problem for users, who frequently get search results in stark contrast to their expectations. Content creators may have used different words, so a user may have to try several variations of the same search. Those who search for “car” will miss out on an entire section of pages that use the word “automobile” instead. Some search engines, such as Google, address this problem with the tilde character (~), placed before a search term to also match similar terms.
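A crude way to picture this is simple synonym expansion, sketched below. The synonym table is hand-made for illustration; real engines derive such relationships from large amounts of usage data rather than a fixed list.

```python
# Naive synonym expansion: each query term becomes a set of acceptable words.
SYNONYMS = {
    "car": {"car", "automobile", "auto"},
    "film": {"film", "movie"},
}

def expand(query):
    expanded = []
    for word in query.lower().split():
        expanded.append(SYNONYMS.get(word, {word}))
    return expanded

print(expand("car review"))  # [{'car', 'automobile', 'auto'}, {'review'}]
```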
Apart from the highly interpretive nature of search, there are a number of factors in a page that can affect the relevancy of its content. Therefore, search engines use a number of approaches to determine the relevancy of each search, and assign a level of significance to each page accordingly.
This process is very complex, and involves a number of algorithms. Basically, a page is given a number of points to begin with (often called cash), purely for its existence. These points are then distributed among all the links on the page. A page that has a high number of incoming links therefore automatically accumulates a lot of points. In a breadth crawl, where a spider goes to as many different sites as possible, the most popular pages get crawled early because many pages link to them. Google predominantly calculates relevancy by the number of incoming links to a page, a measure it calls PageRank. This is a relatively safe way of calculating relevancy, as it stops content creators from simply manipulating their own pages to boost relevancy. However, it has often been abused by large numbers of people “bombing” a site with links to boost its PageRank (called googlebombing). Other search engines use a proprietary mixture of considerations: the number of times a searched word appears in a document, whether or not it appears in bold text, and how many times it appears in links, in sub-headings and in the title. Other considerations include the neighboring pages, the frequency of site updates and even whether or not the owner has paid the search engine. An important consideration is the geographic origin of the content and the location of the user making the search query.
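The sketch below runs a simplified PageRank-style iteration on a tiny, made-up link graph to show the “points flow along links” idea. The damping factor and iteration count are conventional textbook choices, not Google's actual parameters.

```python
# Simplified PageRank-style iteration on a hypothetical four-page link graph.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping = 0.85
pages = list(links)
rank = {page: 1.0 / len(pages) for page in pages}  # every page starts equal

for _ in range(50):
    new_rank = {page: (1 - damping) / len(pages) for page in pages}
    for page, outlinks in links.items():
        share = rank[page] / len(outlinks)   # points split over outgoing links
        for target in outlinks:
            new_rank[target] += damping * share
    rank = new_rank

# C gathers the most rank because three of the four pages link to it.
print({page: round(score, 3) for page, score in rank.items()})
```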
Typically, pages created in the same country as the searcher get a higher relevance. An Indian looking for “news” would likely have no interest in the news web sites of Belarus. Relevancy is also calculated using the meta-tags in a page. Meta-tags exist specifically for search engines: the content creator includes keywords in the document that explain what it is about. Keyword spamming is when a creator stuffs in a lot of keywords to get hits, most of which have no relevance to the content of the page. Search engines automatically match meta-tags with the content of the page, and judge the relevancy accordingly. Some search engines, such as Google and Yahoo!, penalize keyword spammers by reducing their relevancy.
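The sketch below hints at how such a check might work: meta keywords that never appear in the page body are flagged as possible keyword spam. The page text, keyword list and matching rule are all invented for illustration.

```python
# Sketch: flag meta keywords that are not supported by the page body.
import re

page_body = "Fedora is a Linux distribution sponsored by Red Hat."
meta_keywords = ["linux", "fedora", "free ipod", "cheap flights"]

body_words = set(re.findall(r"[a-z]+", page_body.lower()))

def keyword_matches(keyword):
    # A keyword counts as supported only if all of its words appear in the body.
    return all(word in body_words for word in keyword.split())

supported = [kw for kw in meta_keywords if keyword_matches(kw)]
suspect = [kw for kw in meta_keywords if not keyword_matches(kw)]
print("supported by content:", supported)
print("possible keyword spam:", suspect)
```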
Many web site creators try hard to optimize their sites so that they become more relevant for search returns. This includes artificially boosting the number of incoming links, using the right keywords in the meta-tags, formatting the page properly, and a number of other methods. Many of these are transparent, but a lot are clandestine. Optimizing a web site for search engines is a very lucrative business, and a third party often works on the site, sitting between the creator and the search engine. Due to the aggressive nature of search engine optimization, all search engines continuously change their methods of relevance ranking, and keep the exact methods used a secret.
Deep magic
As the web moves away from the domain of personal computers into portable devices such as PMPs and mobile phones, search engines will also spread to these technologies. Geo-tagging will take greater hold, which means you can search while on the move. Products and services in the real world will be more easily integrated with their virtual alternatives.
Google cannot afford to move away from its minimalistic design. Search, however, does not end at Google. There are a whole bunch of interesting search engines waiting to be found. Swicki (www.eurekster.com) lets you build your own search engine. Chacha (www.chacha.com) answers questions the way good old Jeeves used to, and even replies over SMS. Gnod (www.gnod.net) has tools for finding movies, music or books similar to what you already like, with results displayed in a spatial map of similarity. Clusty (http://clusty.com/) breaks results down into clusters, which let you refine your searches.
In recent years, a new breed of search engines has made its presence felt. These are meta-search engines, which do not have indexes or databases of their own, but work off the indexes and databases of other search engines. When a query is entered in the search form of one of these engines, it retrieves results from the databases of Google, Yahoo!, Ask and others. Mamma (www.mamma.com) and Dogpile (www.dogpile.com) are two very popular examples of meta-search engines.
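As a closing illustration, here is a toy sketch of the meta-search idea: the same query is sent to several engines and the results are merged, with extra weight for pages that rank high or appear in more than one engine. The two engine functions are stand-ins; a real meta-search engine would query the actual services and parse their results.

```python
# Toy meta-search: merge ranked result lists from several stand-in engines.
def search_engine_a(query):
    return ["http://example.com/a1", "http://example.com/shared"]

def search_engine_b(query):
    return ["http://example.com/shared", "http://example.com/b2"]

def meta_search(query, engines):
    scores = {}
    for engine in engines:
        for position, url in enumerate(engine(query)):
            # Earlier positions earn more credit; a URL returned by several
            # engines accumulates credit from each of them.
            scores[url] = scores.get(url, 0.0) + 1.0 / (position + 1)
    return sorted(scores, key=scores.get, reverse=True)

print(meta_search("linux", [search_engine_a, search_engine_b]))
```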