1 Search Engines: The players and the field. The mechanics of a typical search. The search engine wars. Statistics from search engine logs. The architecture of a search engine. The query engine.

2 Results & ads returned ranked

3 Result for phrase query

4 Terms for Web Search
Corpus: the publicly accessible Web, static + dynamic
Goal: retrieve high-quality results relevant to the user’s need (not docs!)
Need
  - Informational: want to learn about something
  - Navigational: want to go to that page
  - Transactional: want to do something (web-mediated)
      - Access a service
      - Downloads
      - Shop
  - Gray areas
      - Find a good hub
      - Exploratory search: “see what’s there”
Example queries: Low hemoglobin; United Airlines; Tampere weather; Mars surface images; Nikon CoolPix; Car rental Finland; Abortion morality

5 Search Engines as Info Gatekeepers
Search engines are becoming the primary entry point for discovering web pages. Ranking of web pages influences which pages users will view. Exclusion of a site from search engines will cut off the site from its intended audience. The privacy policy of a search engine is important.
Introna & Nissenbaum: Defining the Web: The Politics of Search Engines
Hindman et al: Googlearchy: How a few Heavily-Linked Sites Dominate Politics on the Web

6 Search Engine Wars The battle for domination of the web search space! The competition is good news for users! Crucial: advertising is combined with search results! What if one search engine manages to dominate the space?

7 Yahoo! Synonymous with the dot-com boom, probably the best known brand on the web. Started off as a web directory service in 1994, acquired Inktomi search engine technology in 2003. Has very strong advertising and e-commerce partners.

8 Lycos! One of the pioneers of the field. Introduced innovations that inspired the creation of Google.

9 Google The verb “google” has become synonymous with searching for information on the web. Has raised the bar on search quality. Has been the most popular search engine in the last few years. Had a very successful IPO in August 2004. Is innovative and dynamic. Has restored the glamour that CS lost in the dot-com bust.

10 Ask Jeeves Specializes in natural language question answering. Search driven by Teoma. Tries to differ…

11 Bing (was: Live Search (was: MSN Search)) Successful third reincarnation of previous attempts. Microsoft was synonymous with PC software. Pyrrhic victory in the browser wars with Netscape. “Stop searching, start deciding”: turned Google into a copycat!

12 Cuil The latest kid on the block Claims to have indexed 120B pages! So far, it does not rank!

13 How do you decide which is best? How do you measure similarity in ranking?
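One possible way to make the ranking-similarity question concrete, offered as a sketch rather than as the answer the slide has in mind: compare the top results of two engines with an overlap score and with Kendall's tau. The function names `overlap_at_k` and `kendall_tau` are illustrative, and the code assumes each ranking is a list of distinct result identifiers (for example URLs).

```python
from itertools import combinations


def overlap_at_k(ranking_a, ranking_b, k):
    """Fraction of the top-k results that the two rankings share (0.0 to 1.0)."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau over the items present in both rankings.

    +1.0 means the shared items appear in the same order,
    -1.0 means they appear in exactly reversed order.
    """
    position_b = {item: i for i, item in enumerate(ranking_b)}
    # Shared items, kept in the order ranking_a lists them.
    common = [item for item in ranking_a if item in position_b]
    pairs = list(combinations(common, 2))
    if not pairs:
        raise ValueError("need at least two shared items to compare order")
    # A pair (x, y) has x ahead of y in ranking_a; it is concordant
    # when ranking_b also puts x ahead of y.
    concordant = sum(1 for x, y in pairs if position_b[x] < position_b[y])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)
```

Overlap ignores order and only asks whether the engines surface the same pages; Kendall's tau ignores pages that only one engine returns and only asks whether the shared pages are ordered alike. Using both gives a fuller picture than either alone.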

14 The most popular search keywords
AltaVista (1998) | AlltheWeb (2002) | Excite (2001)
sex              | free             |
applet           | sex              |
porno            | download         | pictures
mp3              | software         | new
chat             | uk               | nude

15 Newer features: suggest

16 Newer features: Trends

17 Web search Users
Ill-defined queries
  - Short length
  - Imprecise terms
  - Sub-optimal syntax (80% of queries without operator)
  - Low effort in defining queries
Wide variance in
  - Needs
  - Expectations
  - Knowledge
  - Bandwidth
Specific behavior
  - 85% look over one result screen only (mostly above the fold)
  - 78% of queries are not modified (1 query/session)
  - Follow links: “the scent of information”...

18 Architecture of a Search Engine
[Figure: architecture diagram with components The Web, Web spider, Indexer, Indexes, Ad indexes, Search, User; parts labeled A, B, C, D, E]
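The indexer and search components named on this slide can be illustrated with a toy inverted index. This is a minimal sketch under simplifying assumptions, not how any production engine works: the function names `build_index` and `search` are made up for illustration, pages are plain strings split on whitespace, the query is a conjunction (AND) of terms, and there is no ranking and no ad index.

```python
from collections import defaultdict


def build_index(pages):
    """Indexer: map each lowercased term to the set of URLs containing it.

    `pages` is a dict of URL -> page text, standing in for what the
    web spider would have fetched.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Query engine: return the sorted URLs that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    # Copy the first posting set so intersecting does not mutate the index.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)
```

The design point is the split of work: the index is built ahead of time from crawled pages, so answering a query only needs set lookups and intersections instead of a scan over the pages.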

19 Crawling the Web
  - Mode of crawl: BFS
  - Frequency of crawl: important
  - robots.txt gives explicit directions on what not to crawl
  - Parallel machines crawl all the time
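The BFS mode of crawl and the robots.txt check can be sketched as follows. This is an illustration under stated assumptions, not a real crawler: the function name `bfs_crawl` is hypothetical, the "web" is an in-memory dict of URL to outgoing links instead of live HTTP fetches, a single robots.txt text is applied to every URL, and there is no politeness delay or parallelism. Only the standard-library `urllib.robotparser.RobotFileParser` is used for the exclusion rules.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def bfs_crawl(seed, links, robots_txt, max_pages=100):
    """Visit pages breadth-first from `seed`, skipping URLs robots.txt disallows.

    `links` maps a URL to the list of URLs it links to (a stand-in for
    fetching and parsing the page). Returns the URLs in visit order.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    visited = []
    seen = {seed}          # URLs already queued, to avoid revisiting
    queue = deque([seed])  # FIFO queue gives breadth-first order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # disallowed: neither fetched nor expanded
        visited.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited
```

Because a disallowed page is never fetched, links that appear only on that page are never discovered, which is the practical effect of a robots.txt exclusion on what ends up in the index.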

20 Rate of web content change
720K pages from 270 popular sites, sampled daily from Feb 17 to Jun 14, 1999 [Cho00]
Mathematically, what does this seem to be? What does this suggest for crawling policy?
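The slide leaves both questions open. One common reading, stated here as an assumption and not as the slide's answer, is to model changes to a page as a Poisson process with rate λ changes per day, so that the probability a page has changed within t days of the last crawl is 1 − e^(−λt). Under that assumption a recrawl interval can be derived from a tolerated staleness probability. The function names below are illustrative.

```python
import math


def prob_changed(rate_per_day, days):
    """P(page changed at least once within `days`) under a Poisson change model."""
    return 1.0 - math.exp(-rate_per_day * days)


def recrawl_interval(rate_per_day, max_stale_prob):
    """Longest gap (in days) between crawls that keeps the chance of the
    stored copy being stale at or below `max_stale_prob`.

    Obtained by solving 1 - exp(-rate * t) = max_stale_prob for t.
    """
    if not 0.0 < max_stale_prob < 1.0:
        raise ValueError("max_stale_prob must be strictly between 0 and 1")
    return -math.log(1.0 - max_stale_prob) / rate_per_day
```

Under this model the interval is inversely proportional to the change rate, so a page that changes ten times as often needs to be revisited ten times as often for the same staleness target. That is one way to see why a single crawl frequency for all pages is a poor fit.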

21 Diversity
Languages/Encodings
  - Hundreds of languages; W3C encodings: 55 (Jul01) [W3C01]
  - Home pages (1997): English 82%, Next 15: 13% [Babe97]
  - Google (mid 2001): English: 53%, JGCFSKRIP: 30%
Document & query topic
Popular Query Topics (from 1 million Google queries, Apr 2000)
  1.8%  Regional: Europe         |  7.2%   Business
  …     …                        |  …      …
  2.3%  Business: Industries     |  7.3%   Recreation
  3.2%  Computers: Internet      |  8%     Adult
  3.4%  Computers: Software      |  8.7%   Society
  4.4%  Adult: Image Galleries   |  10.3%  Regional
  5.3%  Regional: North America  |  13.8%  Computers
  6.1%  Arts: Music              |  14.6%  Arts

