Overview
– What is a Web search engine
– History
– Popular Web search engines
– How Web search engines work
– Problems
What is a Web search engine
Interface to search for information on the WWW
– Web pages, images & other files
Other data from…
– Newsgroups, databases, open directories
Difference from web directories
History
"Archie"
– "Archive" without the "v"
– Created in 1990 by Alan Emtage
– Downloaded directory listings of files located on public FTP sites and built a searchable database from them
– No indexing of file contents
History
Wandex (1993)
– First "real" web search engine
– Now defunct
– Index collected by the "World Wide Web Wanderer", developed by Matthew Gray at MIT
Aliweb (1993)
– Still running
History
JumpStation (early 1994)
– Titles only
WebCrawler (1994)
– First "full text" crawler-based search engine
– Every word on any web page is searchable
– Became the standard for most later search engines
Lycos, AltaVista, Northern Light, …
Popular Search Engines
Google (~2001)
– Success based on PageRank & link popularity
– Minimalistic user interface
– More than 150 other criteria to determine relevancy
Yahoo! Search
– Acquired Inktomi and Overture (AlltheWeb and AltaVista)
– 2004: Own search engine
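To make the "PageRank & link popularity" point above concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not Google's production system: the function name `pagerank`, the damping factor of 0.85, the iteration count and the four-page link graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate link popularity for a tiny link graph.

    `links` maps each page to the list of pages it links to.
    The damping factor and iteration count are illustrative assumptions.
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small base score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page "c" is linked to by three other pages.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most incoming links ends up ranked highest, and the page nobody links to ends up lowest, which is the intuition behind link popularity.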
Microsoft
– Used results from other engines until 2004
– 2004: Began listing its own search results
– Web crawler: msnbot
– 2006: New platform called "Live Search"; "MSN Search" retired
Baidu
– Most popular search engine in China
– Interface very similar to Google's
How Web search engines work
Work in the following order
– 1. Web crawling
– 2. Indexing
– 3. Searching
Storing information about many web pages
– Retrieved by a Web crawler
Content of each page analyzed to determine how to index it
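The crawling step above can be sketched as a breadth-first walk over links. To keep the example runnable offline, it crawls a hypothetical in-memory "web" (`FAKE_WEB`) instead of making HTTP requests; the names `crawl`, `LinkExtractor` and the page contents are assumptions made for illustration, and a real crawler would also handle robots.txt, politeness delays and URL normalization.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the Web: URL -> HTML source.
FAKE_WEB = {
    "/home": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/home">Home</a>',
    "/news": '<a href="/home">Home</a> <a href="/missing">Gone</a>',
}


def crawl(start, fetch):
    """Breadth-first crawl; `fetch(url)` returns HTML, or None if unavailable."""
    seen = {start}
    queue = deque([start])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # dead link: nothing to store or follow
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # visit each URL at most once
                seen.add(link)
                queue.append(link)
    return pages


pages = crawl("/home", FAKE_WEB.get)
print(sorted(pages))
```

The retrieved pages are what the indexing step then analyzes.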
Data about web pages stored in an index database
Different ways of storing
– Storing all or part of the source page (Google)
– Storing every word (AltaVista)
Keyword query
– Engine examines the index & lists the best-matching pages
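One common way to store "every word" so that keyword lookups are fast is an inverted index, which maps each word to the pages containing it. The sketch below is a minimal illustration under that assumption; it does not claim to reproduce how any of the engines named above store their data, and the documents, `tokenize`, `build_index` and `lookup` are hypothetical names.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def lookup(index, keyword):
    """Return the sorted ids of documents containing the keyword."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical page contents, keyed by document id.
DOCS = {
    1: "Web crawlers retrieve web pages",
    2: "The index stores every word of a page",
    3: "Search engines rank pages by relevance",
}

index = build_index(DOCS)
print(lookup(index, "pages"))
print(lookup(index, "index"))
```

Note that this sketch treats "page" and "pages" as different words; real engines typically normalize such variants before indexing.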
Boolean operators
– AND, OR and NOT to specify the search query
Usefulness depends on relevance of results
– Results ranked by most engines
Some Web search engines supported by ad revenue
– Paid listings ranked higher
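Boolean operators map naturally onto set operations over an index: AND is intersection, OR is union, and NOT is set difference. The sketch below assumes a tiny hand-written index and a simplified query syntax that is evaluated strictly left to right, with NOT used as a binary "and not"; real query languages add precedence, parentheses and ranking, none of which are shown here. `INDEX`, `postings` and `evaluate` are hypothetical names.

```python
# Hypothetical index: word -> ids of the pages containing it.
INDEX = {
    "web": {1, 2, 4},
    "search": {2, 3, 4},
    "engine": {2, 4},
    "directory": {3},
}


def postings(term):
    """Return the set of page ids for a term (empty if the term is unknown)."""
    return INDEX.get(term.lower(), set())


def evaluate(query):
    """Evaluate 'term OP term OP term ...' strictly left to right."""
    tokens = query.split()
    result = postings(tokens[0])
    i = 1
    while i + 1 < len(tokens):
        operator, term = tokens[i].upper(), tokens[i + 1]
        docs = postings(term)
        if operator == "AND":
            result = result & docs  # pages containing both
        elif operator == "OR":
            result = result | docs  # pages containing either
        elif operator == "NOT":
            result = result - docs  # pages containing the left side only
        else:
            raise ValueError(f"unknown operator: {tokens[i]}")
        i += 2
    return sorted(result)


print(evaluate("web AND search"))
print(evaluate("web OR directory"))
print(evaluate("search NOT engine"))
```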
Problems
Web is growing faster than present technology can index
Pages must be re-indexed when changed
Many dynamically generated websites not indexable
– Invisible web
Relevancy
– Search engine may return a list of unwanted, irrelevant sites, electronic spam or pop-ups