Web Search Jan Pedersen Chief Scientist, Search and Marketplace Yahoo! Inc.

Presentation transcript:

Web Search Jan Pedersen Chief Scientist, Search and Marketplace Yahoo! Inc.

Agenda
A Short History
Internet Search Fundamentals
  – Web Pages
  – Indexing
Ranking and Evaluation
Third Generation Technologies

A Short History

Precursors
Information Retrieval (IR) Systems
  – online catalogs, and News
Limited scale, homogeneous text
  – recall focus
  – empirical: driven by results on evaluation collections
  – free text queries shown to win over Boolean
Specialized Internet access
  – Gopher, Wais, Archie: FTP archives and special databases
Never achieved critical mass

First Generation Systems
1993: Mosaic opens the WWW
  – 1993 Architext/Excite (Stanford/Kleiner Perkins)
  – 1994 Webcrawler (full text indexing)
  – 1994 Yahoo! (human edited directory)
  – 1994 Lycos (400K indexed pages)
  – 1994 Infoseek (subscription service)
Power systems
  – 1995 AltaVista (DEC Labs, advanced query syntax, large index)
  – 1996 Inktomi (massively distributed solution)

Second Generation Systems
Relevance matters
  – 1998 Direct Hit (clickthrough based re-ranking)
  – 1998 Google (link authority based re-ranking)
Size matters
  – 1999 FAST/AllTheWeb (scalable architecture)
The user matters
  – 1996 Ask Jeeves (question answering)
Money matters
  – 1997 Goto/Overture (pay-for-performance search)

Third Generation Systems
Market consolidation
  – 2002 Yahoo! purchases Inktomi
  – 2003 Overture purchases AV and FAST/AllTheWeb
  – 2003 MSN announces intention to build a Search Engine
  – 2004 Google IPO
Search matures
  – $2B market projected to grow to $6B by 2005
  – required capital investment limits new players (Gigablast?)
  – traffic focused in a few sites: Yahoo!, MSN, Google, AOL
  – consumer use driven by brand marketing

Web Search Fundamentals

WWW Size
How many pages are in the WWW?
  – Lawrence and Giles, 1999: 800M pages with most pages not indexed
  – Dynamically generated pages imply effective size is infinite
How many sites are registered?
  – Churn due to SPAM

Typical Crawl/Build Architecture
[Diagram. Labelled components: Seed List, URL DB, Grab, Discovery, Internet, Pagefiles, Filtered Pagefiles, Index Pagefiles, Anchor Text DB, Connectivity DB, Duplicates DB, Alias DB, Index. Stage labels: Crawl, Build.]
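
The crawl half of the architecture above can be sketched roughly as follows. This is a minimal illustration, not the system the slide depicts: the function `crawl`, the `fetch` callback, and the toy in-memory `TOY_WEB` are all hypothetical, and the Python sets and dicts only stand in for the Seed List, URL DB, Duplicates DB, and Connectivity DB named in the diagram.

```python
from collections import deque

def crawl(seed_list, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns (content, outlinks) or None."""
    url_db = set(seed_list)        # every URL discovered so far (URL DB)
    frontier = deque(seed_list)    # URLs still waiting to be grabbed
    seen_content = set()           # stands in for the Duplicates DB
    pagefiles = {}                 # url -> content kept for the index build
    connectivity = {}              # url -> outlinks (Connectivity DB)
    while frontier and len(pagefiles) < max_pages:
        url = frontier.popleft()
        result = fetch(url)        # the "Grab" step
        if result is None:         # unreachable page
            continue
        content, outlinks = result
        connectivity[url] = list(outlinks)
        for link in outlinks:      # the "Discovery" step feeds the URL DB
            if link not in url_db:
                url_db.add(link)
                frontier.append(link)
        if content in seen_content:  # filter exact duplicates
            continue
        seen_content.add(content)
        pagefiles[url] = content
    return pagefiles, connectivity

# A toy in-memory "internet" so the sketch runs without network access.
TOY_WEB = {
    "a": ("page about travel", ["b", "c"]),
    "b": ("page about cobras", ["a", "d"]),
    "c": ("page about travel", []),          # exact duplicate of "a"
    "d": ("page about football", ["missing"]),
}

pagefiles, connectivity = crawl(["a"], TOY_WEB.get)
```

A real crawler would add politeness delays, robots.txt handling, near-duplicate detection, and persistent storage, none of which is modelled here.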

Relative Size
From SearchEngineShowdown
Google claims 3B
Fast claims 2.5B
AV claims 1B

Freshness
From Search Engine Showdown
Note hybrid indices; subindices with differing update rates

Query Serving Architecture
Index divided into segments, each served by a node
Each row of nodes replicated for query load
Query integrator distributes query and merges results
Front end creates an HTML page with the query results
[Diagram: the query “travel” enters a Load Balancer, which feeds front ends FE 1 … FE 8 and query integrators QI 1 … QI 8; these sit above a grid of nodes, Node 1,1 … Node 4,N.]
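
The scatter and merge role of the query integrator can be sketched as below. This is an assumption-laden illustration: `search_segment`, `query_integrator`, and the toy `segments` data are hypothetical, each segment is a plain dict rather than a remote node, and replicated rows and the load balancer are not modelled.

```python
import heapq

def search_segment(segment, query, k):
    """Return the top-k (score, doc_id) hits from one index segment.

    A segment is modelled as {doc_id: {term: score}}; a real node would
    evaluate the query against its own slice of the index.
    """
    hits = [(scores[query], doc_id)
            for doc_id, scores in segment.items() if query in scores]
    return heapq.nlargest(k, hits)

def query_integrator(segments, query, k=10):
    """Send the query to every segment, then merge the partial results."""
    partial = [search_segment(segment, query, k) for segment in segments]
    return heapq.nlargest(k, (hit for hits in partial for hit in hits))

segments = [
    {"d1": {"travel": 0.9}, "d2": {"travel": 0.2}},
    {"d3": {"travel": 0.5}, "d4": {"cobra": 0.7}},
]
top = query_integrator(segments, "travel", k=2)
```

Because each segment returns only its own top k, the integrator merges at most k hits per segment, which keeps the merge cost independent of index size.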

Query Evaluation
Index has two tables:
  – term to posting
  – document ID to document data
Postings record term occurrences
  – may include positions
Ranking employs postings
  – to score documents
Display employs document info
  – fetched for top scoring documents
[Diagram: the query “travel” enters the Query Evaluator; ranking reads the Terms → Posting table, display reads the Doc ID → Doc Data table.]
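
A minimal sketch of the two-table index described above, under stated assumptions: `build_index`, `evaluate`, and the sample `docs` are hypothetical names, and raw term frequency is used only as a placeholder score, not as the ranking function of any real engine.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: (url, text)} -> (postings, doc_data)."""
    postings = defaultdict(dict)   # term -> {doc_id: [positions]}
    doc_data = {}                  # doc_id -> data needed for display
    for doc_id, (url, text) in docs.items():
        doc_data[doc_id] = {"url": url, "text": text}
        for pos, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(pos)
    return postings, doc_data

def evaluate(query, postings, doc_data, k=10):
    """Rank with the postings table, then fetch display data for the top k."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, positions in postings.get(term, {}).items():
            scores[doc_id] += len(positions)   # placeholder: term frequency
    top = sorted(scores, key=lambda d: (-scores[d], d))[:k]
    # Document data is touched only for the documents that will be shown.
    return [(doc_data[d]["url"], scores[d]) for d in top]

docs = {
    1: ("http://example.com/1", "travel deals and travel tips"),
    2: ("http://example.com/2", "cobra facts"),
    3: ("http://example.com/3", "cheap travel"),
}
postings, doc_data = build_index(docs)
results = evaluate("travel", postings, doc_data)
```

Keeping the document table separate means scoring never reads page text; only the handful of top-scoring documents incur that lookup.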

Scale
Indices typically cover billions of pages
  – terabytes of data
Tens of millions of queries served every day
  – translates to hundreds of queries per second
Users require rapid response
  – query must be evaluated in under 300 msecs
Data Centers typically employ thousands of machines
  – Individual component failures are common
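
The step from daily query volume to queries per second is simple arithmetic, shown here as a small check. The figure of 30 million queries per day is an assumed value within the slide's "tens of millions", and `average_qps` is a hypothetical helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def average_qps(queries_per_day):
    """Average queries per second for a given daily query volume."""
    return queries_per_day / SECONDS_PER_DAY

# 30 million queries/day is an assumed figure, not taken from the slides.
print(round(average_qps(30_000_000)))  # prints 347
```

This is an average over the whole day; peak load would be higher, which is one reason rows of nodes are replicated.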

Search Results Page
Blended results
  – multiple sources
Relevance ranked
Assisted search
  – Spell correction
Specialized indices
  – via Tabs
Sponsored listing
  – monetization
Localization
  – Country language experience

Relevance Evaluation

Relevance is Everything
The Search Paradigm: 2.4 words, a few clicks, and you’re done
  – only possible if results are very relevant
Relevance is ‘speed’
  – time from task initiation to resolution
  – important factors: location of useful result, UI clutter, latency
Relevance is relative
  – context dependent, e.g. ‘football’ in the UK vs the US
  – task dependent, e.g. ‘mafia’ when shopping vs researching

Relevance is Hard to Measure
Poorly defined, subjective notion
  – depends on task, user context, etc.
Analysts have focused on easier-to-measure surrogates
  – index size, traffic, speed
  – anecdotal relevance tests, e.g. vanity queries
Requires survey methodology
  – averaged over queries
  – averaged over users

Survey Methodologies
Internal expert assessments
  – assessments typically not replicated
  – models absolute notion of relevance
External consumer assessments
  – assessments heavily replicated
  – models statistical notion of relevance
A/B surveys
  – compare whole result sets
  – visual relevance plays a large role
Url surveys
  – judge relevance of particular url for query

Ranking
Given 2.4 query terms, search 2B documents and return 10 highly relevant in 300 msecs
  – Problem queries: Travel (matches 32M documents), John Ellis (which one), Cobra (medical or animal)
Query types
  – Navigational (known item retrieval)
  – Informational
Ingredients
  – Keyword match (title, abstract, body)
  – Anchor Text (referring text)
  – Quality (link connectivity)
  – User Feedback (clickrate analysis)

The Components of Relevance
First Generation:
  – Keyword matching: title and abstract worth more
Second Generation:
  – Computed document authority: based on link analysis
  – Anchor text matching: webmaster voting
Development Cycle: Tune Ranking, Evaluate Metrics
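
One well-known form of link analysis is PageRank, from the Brin and Page paper listed in the bibliography. The sketch below is a simplified power-iteration version for illustration only; the slides do not say which authority computation any particular engine uses, and `pagerank`, its parameters, and the toy `links` graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]} -> {page: authority score}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

links = {"a": ["c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
```

In this toy graph "c" is linked from two pages and ends up with the highest score, while "b", which nothing links to, ends up with the lowest.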

SPAM
Manipulation of content purely to influence ranking
  – Dictionary SPAM
  – Link sharing
  – Domain hi-jacking
  – Link farms
Robotic use of search results
  – Meta-search engines
  – Search Engine optimizers
  – Fraud

Third Generation Technologies

Handling Ambiguity Results for query: Cobra

Impression Tracking
Incoherent urls are those that receive high rank for a large diversity of queries. Many incoherent urls indicate SPAM or a bug (as in this case).
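
Detecting incoherent urls from an impression log could look roughly like this. The log format, the function `incoherent_urls`, and both thresholds (`top_rank`, `min_distinct_queries`) are hypothetical choices made for illustration; the slides do not specify them.

```python
from collections import defaultdict

def incoherent_urls(impressions, top_rank=3, min_distinct_queries=3):
    """impressions: iterable of (query, url, rank), with rank 1 at the top.

    Returns {url: number of distinct queries} for urls shown at or above
    `top_rank` for at least `min_distinct_queries` different queries.
    """
    queries_by_url = defaultdict(set)
    for query, url, rank in impressions:
        if rank <= top_rank:
            queries_by_url[url].add(query)
    return {url: len(queries)
            for url, queries in queries_by_url.items()
            if len(queries) >= min_distinct_queries}

log = [
    ("travel", "spam.example", 1),
    ("cobra", "spam.example", 2),
    ("football", "spam.example", 1),
    ("travel", "trips.example", 2),
    ("travel deals", "trips.example", 1),
    ("mafia", "spam.example", 9),      # low rank, so not counted
]
flagged = incoherent_urls(log)
```

Counting distinct queries is a crude proxy for "diversity"; a fuller version might measure how unrelated the queries are to one another.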

Clickrate Relevance Metric
Average highest rank clicked perceptibly increased with the release of a new rank function.
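
One possible reading of "average highest rank clicked" is sketched below. It assumes that positions are numbered from 1 at the top of the result list, that the "highest rank" in a session is the clicked position closest to the top, and that sessions without clicks are ignored; the slides confirm none of these, and `average_highest_rank_clicked` is a hypothetical name.

```python
def average_highest_rank_clicked(sessions):
    """sessions: list of lists of clicked positions (1 = top result).

    Returns the mean, over sessions with at least one click, of the
    clicked position closest to the top, or None if nothing was clicked.
    """
    best = [min(clicks) for clicks in sessions if clicks]
    return sum(best) / len(best) if best else None

sessions = [[3, 1], [2], [], [5, 4]]
metric = average_highest_rank_clicked(sessions)
```

Comparing this value before and after a ranking change gives a click-based signal that needs no human relevance judgments.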

User Interface
Ranked result lists
  – Document summaries are critical: hit highlighting, dynamic abstracts, url
  – No recent innovation: graphical presentations not well fit to the task
Blending
  – Predefined segmentation, e.g. paid listing
  – Intermixed with results from other sources, e.g. News

Future Trends
Question Answering
  – WWW as language model: enables simple methods, e.g. Dumais et al. (SIGIR 2002)
New contexts
  – Ubiquitous Searching: toolbars, desktop, phone
  – Implicit Searching: computed links
New Tasks
  – E.g. Local/Country Search

Bibliography
Modeling the Internet and the Web: Probabilistic Methods and Algorithms, by Pierre Baldi, Paolo Frasconi, and Padhraic Smyth. John Wiley & Sons; May 28, 2003.
Mining the Web: Analysis of Hypertext and Semi Structured Data, by Soumen Chakrabarti. Morgan Kaufmann; August 15, 2002.
The Anatomy of a Large-scale Hypertextual Web Search Engine, by S. Brin and L. Page. 7th International WWW Conference, Brisbane, Australia; April.