Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace.

Presentation transcript:


Agenda
  A Short History
  Internet Search Fundamentals
    –Web Pages
    –Indexing
  Ranking and Evaluation
  Third Generation Technologies

A Short History

Precursors
  Information Retrieval (IR) Systems
    –online catalogs, and News
      Limited scale, homogeneous text
    –recall focus
    –empirical
      Driven by results on evaluation collections
    –free text queries shown to win over Boolean
  Specialized Internet access
    –Gopher, WAIS, Archie
      FTP archives and special databases
      Never achieved critical mass

First Generation Systems
  1993: Mosaic opens the WWW
    –1993 Architext/Excite (Stanford/Kleiner Perkins)
    –1994 Webcrawler (full text Indexing)
    –1994 Yahoo! (human edited Directory)
    –1994 Lycos (400K indexed pages)
    –1994 Infoseek (subscription service)
  Power systems
    –1994 AltaVista (DEC Labs, advanced query syntax, large index)
    –1996 Inktomi (massively distributed solution)

Second Generation Systems
  Relevance matters
    –1998 Direct Hit (clickthrough based re-ranking)
    –1998 Google (link authority based re-ranking)
  Size matters
    –1999 FAST/AllTheWeb (scalable architecture)
  The user matters
    –1996 Ask Jeeves (question answering)
  Money matters
    –1997 Goto/Overture (pay-for-performance search)

Third Generation Systems
  Market consolidation
    –2002 Yahoo! purchases Inktomi
    –2003 Overture purchases AV and FAST/AllTheWeb
    –2003 MSN announces intention to build a Search Engine
  Search matures
    –$2B market projected to grow to $6B by 2005
    –required capital investment limits new players
      Gigablast?
    –traffic focused in a few sites
      Yahoo!, MSN, Google, AOL
    –consumer use driven by Brand marketing

Web Search Fundamentals

Web Fundamentals
  [Diagram labels: User, Browser, Web Server, URL, HTTP Request, Page Serving, HTML Page, Page Rendering, Hyper Links]

Definitions
  URLs refer to WWW content
    –referential integrity is not guaranteed
    –roughly 10% of URLs go 404 every month
  HTTP requests fetch content from a server
    –stateless protocol
    –cookies provide partial state
  Web servers generate HTML pages
    –can be static or dynamic (output of a program)
    –markup tags determine page rendering
  HTML pages contain hyperlinks
    –link consists of a URL and anchor text

URLs
  URL Definition
    –fragment is not considered part of the URL
    –params are considered part of the path
    –params are not frequently used
  Examples
    –http://ad.doubleclick.net/jump;sz=120x60;ptile=6;ord=
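
The two conventions on this slide (the fragment is dropped, params stay with the path) can be sketched with Python's standard urllib.parse module. The helper name url_key and the example.com URL are hypothetical and chosen only for illustration.

```python
from urllib.parse import urldefrag, urlsplit

def url_key(url):
    """Return (host, path, query) for a URL, ignoring the fragment.

    urlsplit leaves any ';params' inside the path, which matches the
    convention that params are considered part of the path.
    """
    without_fragment, _fragment = urldefrag(url)
    parts = urlsplit(without_fragment)
    return parts.hostname, parts.path, parts.query

# Hypothetical URL with params, a query and a fragment.
example = "http://www.example.com/jump;sz=120x60?q=test#top"
print(url_key(example))
```

Two URLs that differ only in their fragment produce the same key, so a crawler using such a key would treat them as one page.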

Dynamic URLs
  URLs with Dynamic Components
    –Path (including params) and host are not dynamic
      If you change the path and/or host you will get a 404 or similar error
    –Query is dynamic
      If you change the query part, you will get a valid page back
      source of potentially infinite number of pages
  Examples
    –changing the query returns a valid 200 page, even if test is not a valid query term
    –changing the path returns a 404 error page
  Not all URLs Follow this Convention

Dynamic Content
  Content Depends on External (to URL) Factors
    –Cookies
    –IP
    –Referrer
    –User-Agent
  Examples
    –http://forum.doom9.org/forumdisplay.php?s=af9ddb31710c7b314b75262c1031d8af&forumid=65
  Dynamic URLs and Dynamic Content are Orthogonal
    –static URLs can refer to dynamic content
    –dynamic URLs can refer to static content

HTML Sample Andreas S. WEIGEND, PhD Andreas S. WEIGEND, Ph.D. Chief Scientist, Amazon.com "Sophisticated algorithms have always been a big part of creating the Amazon.com customer experience." (Jeff Bezos, Founder and CEO of Amazon.com) Amazon.com might be the world's largest laboratory to study human behavior and decision making. It for sure is a place with very smart people, with a healthy attitude towards data, measurement, and modeling. I am responsible for research in machine learning and computational marketing. Applications range from real-time predictions of customer intent and satisfaction, to personalization and long-term optimization of pricing and promotions. [<a href=" onclick="window.open(this.href);return false;">Job openings. ] I'm also the point person for academic relations. Schedule Summer 2003

Rendered Page

WWW Size
  How many pages are in the WWW?
    –Lawrence and Giles, 1999: 800M pages with most pages not indexed
    –Dynamically generated pages imply effective size is infinite
  How many sites are registered?
    –Churn due to SPAM

Crawling
  Search Engine robot
    –visits every page that will be indexed
    –traversal behavior depends on crawl policy
  Index parameterized by size and freshness
    –freshness is time since last revisit if page has changed
  Batch vs Incremental
    –Batch crawl has several, distinct, batch processing stages
      discover, grab, index
      AV discovery phase takes 10 days, grab another 10, etc.
      sharp freshness curve
    –Incremental crawl
      crawler constantly operates, intermixing discovery with grab
      mild drop-off in freshness
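
The incremental style, where discovery and grab are intermixed in one loop, can be sketched roughly as below. This is a toy under stated assumptions: the web is an in-memory dictionary (TOY_WEB), fetch_links is a hypothetical callback standing in for downloading and parsing a page, and revisit scheduling for freshness is left out.

```python
from collections import deque

def incremental_crawl(seeds, fetch_links, max_pages):
    """Breadth-first crawl that discovers and grabs in a single loop."""
    frontier = deque(seeds)
    seen = set(seeds)
    grabbed = []
    while frontier and len(grabbed) < max_pages:
        url = frontier.popleft()
        grabbed.append(url)               # "grab": fetch and store the page
        for link in fetch_links(url):     # "discovery": follow its outlinks
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return grabbed

# Hypothetical four-page web used only for illustration.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
print(incremental_crawl(["a"], lambda u: TOY_WEB.get(u, []), max_pages=10))
```

A batch crawler would instead run the discovery loop to completion before starting the grab stage, which is why its freshness curve is sharper.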

Typical Crawl/Build Architecture
  [Diagram labels: Crawl, Seed List, URL DB, Discovery, Grab, Internet, Pagefiles, Filtered Pagefiles, Index Pagefiles, Anchor Text DB, Connectivity DB, Duplicates DB, Alias DB, Index, Build]

Relative Size
  From Search Engine Showdown
    –Google claims 3B
    –Fast claims 2.5B
    –AV claims 1B

Freshness
  From Search Engine Showdown
  Note hybrid indices; subindices with differing update rates

Query Language
  Free text with implicit AND and implicit proximity
    –Syntax-free input
  Explicit Boolean
    –AND (+)
    –OR (|)
    –AND NOT (-)
  Explicit Phrasing (“”)
  Filters
    –domain:, filetype:
    –host:, title:
    –link:, image:
    –url:, anchor:
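
A rough sketch of how such a query string might be split into its parts. The function parse_query and its output layout are hypothetical, the OR operator (|) is not handled, and real engines certainly parse more carefully than this.

```python
import re

# Filter prefixes listed on the slide.
FILTERS = {"domain", "filetype", "host", "title", "link", "image", "url", "anchor"}

def parse_query(query):
    """Split a query into required terms, excluded terms, phrases and filters."""
    parsed = {"must": [], "must_not": [], "phrases": [], "filters": []}
    # A token is either a double-quoted phrase or a run of non-space characters.
    for token in re.findall(r'"[^"]*"|\S+', query):
        if token.startswith('"'):
            parsed["phrases"].append(token.strip('"'))
        elif token.startswith("-") and len(token) > 1:
            parsed["must_not"].append(token[1:])       # AND NOT
        else:
            name, sep, value = token.partition(":")
            if sep and name in FILTERS and value:
                parsed["filters"].append((name, value))
            else:
                term = token.lstrip("+")               # explicit or implicit AND
                if term:
                    parsed["must"].append(term)
    return parsed

print(parse_query('travel "new york" -hotel host:example.com'))
```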

Query Serving Architecture
  Index divided into segments each served by a node
  Each row of nodes replicated for query load
  Query integrator distributes query and merges results
  Front end creates an HTML page with the query results
  [Diagram: the query “travel” passes through a Load Balancer to front ends FE 1, FE 2 … FE 8 and query integrators QI 1, QI 2 … QI 8, which address a grid of index nodes, Node 1,1 … Node 4,N]
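
The distribute-and-merge role of the query integrator can be illustrated with a small sketch. Each segment here is a plain dictionary from term to (score, doc_id) pairs; the names search_segment and query_integrator and the scores are made up for the example, and the calls are sequential rather than parallel.

```python
import heapq

def search_segment(segment, term, k):
    """One index node: return its local top-k (score, doc_id) pairs."""
    return heapq.nlargest(k, segment.get(term, []))

def query_integrator(segments, term, k):
    """Send the query to every segment and merge the partial results."""
    partial = []
    for segment in segments:
        partial.extend(search_segment(segment, term, k))
    return heapq.nlargest(k, partial)

# Three hypothetical index segments.
segments = [
    {"travel": [(0.9, "doc1"), (0.2, "doc2")]},
    {"travel": [(0.7, "doc3")]},
    {"travel": [(0.95, "doc4"), (0.1, "doc5")]},
]
print(query_integrator(segments, "travel", k=3))
```

Because every node returns only its own top k, the integrator merges at most k results per segment regardless of how many documents match.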

Query Evaluation
  Index has two tables:
    –term to posting
    –document ID to document data
  Postings record term occurrences
    –may include positions
  Ranking employs posting
    –to score documents
  Display employs document info
    –fetched for top scoring documents
  [Diagram: the query “travel” enters the Query Evaluator; ranking uses the Terms → Posting table, display uses the Doc ID → Doc Data table]
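
A toy version of the two tables and of the rank-then-display flow is sketched below. The scoring rule (total number of term occurrences) is an assumption made only to keep the example short, and the function names are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Build the two tables: term -> postings and doc ID -> doc data."""
    postings = defaultdict(dict)      # term -> {doc_id: [positions]}
    doc_data = {}                     # doc_id -> data needed for display
    for doc_id, (title, body) in docs.items():
        doc_data[doc_id] = {"title": title}
        for position, term in enumerate(body.lower().split()):
            postings[term].setdefault(doc_id, []).append(position)
    return postings, doc_data

def evaluate(query, postings, doc_data, k=10):
    """Rank using postings only, then fetch doc data for the top k."""
    terms = query.lower().split()
    if not terms:
        return []
    # Implicit AND: keep documents that contain every query term.
    matches = set(postings.get(terms[0], {}))
    for term in terms[1:]:
        matches &= set(postings.get(term, {}))
    # Toy score (an assumption): total number of term occurrences.
    ranked = sorted(
        matches,
        key=lambda d: (-sum(len(postings[t][d]) for t in terms), d),
    )
    return [doc_data[d]["title"] for d in ranked[:k]]

DOCS = {
    1: ("Travel deals", "cheap travel deals for travel lovers"),
    2: ("Cobra facts", "the cobra is a snake"),
    3: ("Travel guide", "a travel guide"),
}
postings, doc_data = build_index(DOCS)
print(evaluate("travel", postings, doc_data))
```

Document data is touched only for the final top k, which mirrors the split between the ranking and display steps on the slide.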

Scale
  Indices typically cover billions of pages
    –terabytes of data
  Tens of millions of queries served every day
    –translates to hundreds of queries per second
  Users require rapid response
    –query must be evaluated in under 300 msecs
  Data Centers typically employ thousands of machines
    –Individual component failures are common
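
The step from daily volume to queries per second is simple arithmetic. The figure of 30 million queries per day below is an assumed value inside the "tens of millions" range, not a number from the slides, and it gives an average rate only (peak load would be higher).

```python
SECONDS_PER_DAY = 24 * 60 * 60

def queries_per_second(queries_per_day):
    """Average query rate implied by a daily query volume."""
    return queries_per_day / SECONDS_PER_DAY

# Assumed daily volume, for illustration only.
print(round(queries_per_second(30_000_000)))
```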

Search Results Page
  Blended results
    –multiple sources
  Relevance ranked
  Assisted search
    –Spell correction
  Specialized indices
    –via Tabs
  Sponsored listing
    –monetization
  Localization
    –Country language experience

Relevance Evaluation

Relevance is Everything
  The Search Paradigm: 2.4 words, a few clicks, and you’re done
    –only possible if results are very relevant
  Relevance is ‘speed’
    –time from task initiation to resolution
    –important factors:
      Location of useful result
      UI Clutter
      latency
  Relevance is relative
    –context dependent
      e.g. ‘football’ in the UK vs the US
    –task dependent
      e.g. ‘mafia’ when shopping vs researching

Relevance is Hard to Measure
  Poorly defined, subjective notion
    –depends on task, user context, etc.
  Analysts have Focused on Easier-to-Measure Surrogates
    –index size, traffic, speed
    –anecdotal relevance tests
      e.g. Vanity queries
  Requires Survey Methodology
    –averaged over queries
    –averaged over users

Survey Methodologies
  Internal expert assessments
    –assessments typically not replicated
    –models absolute notion of relevance
  External consumer assessments
    –assessments heavily replicated
    –models statistical notion of relevance
  A/B surveys
    –compare whole result sets
    –visual relevance plays a large role
  URL surveys
    –judge relevance of particular URL for query

A/B Test Design
  Strategy:
    –Compare two ranking algorithms by asking panelists to compare pairs of search results
  Queries:
    –1000 semi-random queries, filtered for family-friendly, understandability
      Users can select from a list of 20 queries
  URLs:
    –Top 10 search results from 2 algorithms
  Voting:
    –5 point scale, 7 replications
    –Each user rates 6 queries, one of which is a control query
      Control query has AV results on one side, random URLs on the other
      Reject voters who take less than 10 seconds to vote

Query selection screen

Rating screen

A/B Test Scoring
  Test ran until we had 400 decisive votes
    –Margin of error = 5%
  Compute:
    –Majority Vote: count of queries where more than half of the users said one engine was “somewhat better” or “much better”
    –Total Vote: count of users that rated a result set “somewhat better” or “much better” for each engine
  Compare percentages
    –test if one system ‘out votes’ the other
    –determine if the difference is statistically significant
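
The 5% margin follows from the 1/sqrt(n) rule of thumb quoted with the result tables (1/sqrt(400) = 5%). The sketch below shows that rule together with a per-query majority vote. Collapsing the 5 point scale to the labels "A", "B" and "same" is an assumption made for brevity, and the function names are hypothetical.

```python
import math
from collections import Counter

def margin_of_error(n_votes):
    """Rule-of-thumb error bar used on the slides: 1 / sqrt(n)."""
    return 1.0 / math.sqrt(n_votes)

def majority_winner(votes):
    """Return 'A' or 'B' if more than half of the votes favor that engine."""
    counts = Counter(votes)
    for engine in ("A", "B"):
        if counts[engine] > len(votes) / 2:
            return engine
    return "same"

def majority_vote_shares(votes_by_query):
    """Fraction of queries won by each engine under the majority rule."""
    winners = Counter(majority_winner(v) for v in votes_by_query.values())
    total = len(votes_by_query)
    return {label: winners[label] / total for label in ("A", "B", "same")}

# Hypothetical votes: 7 replications per query.
votes = {
    "q1": ["A", "A", "A", "A", "B", "B", "same"],
    "q2": ["B", "B", "B", "B", "B", "A", "same"],
    "q3": ["A", "A", "A", "B", "B", "B", "same"],
    "q4": ["A", "A", "A", "A", "A", "A", "A"],
}
print(margin_of_error(400), majority_vote_shares(votes))
```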

Results
  Test One: AV vs SE1 (error bar = 1/sqrt(400) = 5%)
               Queries with winner      All accepted votes
               Majority   Unanimous     “a little better”   “much better”
    AltaVista  37.6%      6.1%          24.4%               11.0%
    SE1        37.3%      8.0%          22.2%               10.7%
    Same       25.1%      2.6%          31.7%
  Control Votes (error bar = 1/sqrt(160) = 7.9%)
               Queries with winner      All accepted votes
               Majority   Unanimous     “a little better”   “much better”
    Good       98.1%      51.5%         24.4%               59.1%
    Bad        0%                       4.7%                1.7%
    Same       1.9%       0.6%          10.1%

Results
  Test Two: AV vs SE2 (with UI issue)
               Queries with winner      All accepted votes
               Majority   Unanimous     “a little better”   “much better”
    AltaVista  58.5%      13.4%         26.5%               16.3%
    SE2        28.1%      4.6%          21.8%               8.9%
    Same       13.4%      0.9%          26.4%
  Test Three: SE1 vs SE2
               Queries with winner      All accepted votes
               Majority   Unanimous     “a little better”   “much better”
    SE1        35.4%      4.7%          28.2%               13.2%
    SE2        40.6%      4.1%          29.0%               15.6%
    Same       24.0%      1.9%          13.8%

Ranking
  Given 2.4 query terms, search 2B documents and return 10 highly relevant in 300 msecs
    –Problem queries:
      Travel (matches 32M documents)
      John Ellis (which one)
      Cobra (medical or animal)
  Query types
    –Navigational (known item retrieval)
    –Informational
  Ingredients
    –Keyword match (title, abstract, body)
    –Anchor Text (referring text)
    –Quality (link connectivity)
    –User Feedback (clickrate analysis)

The Components of Relevance
  First Generation:
    –Keyword matching
      Title and abstract worth more
  Second Generation:
    –Computed document authority
      Based on link analysis
    –Anchor text matching
      Webmaster voting
  Development Cycle: Tune Ranking, Evaluate Metrics

Connectivity

Connectivity Goals
  An indicator of authority
    –As measured by static links
    –Each link is a ‘vote’ in favor of a site
    –Webmasters are the voters
  Not all links are equal
    –Links from authoritative sites are worth more
      Introduces an interesting circularity
    –Votes from sites with many links are discounted
      Use your vote wisely
    –Discount navigational links
      Not all links are editorial
    –Account for link SPAM

Connectivity Network
  [Figure: example link graph containing nodes A and B]
  What is the authority score for nodes A and B?
  Inlink computes:
    –A = 3
    –B = 2
  Page Rank computes:
    –A = .225
    –B = .295

Definitions
  Connectivity Graph
    –Nodes are pages (or hosts)
    –Directed edges are links
    –Graph edges can be represented as a transition matrix, A
      The ith row of A represents the links out from node i
  Authority score
    –Score associated with each node
    –Some function of inlinks to node and outlinks from node
      Simplest authority score is inlink count
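
As a minimal sketch of these definitions, the code below builds a row-normalized transition matrix from an adjacency list and computes the inlink count. The three-node graph is hypothetical, and the code assumes each link appears once in its source's list.

```python
def transition_matrix(links, nodes):
    """Row i holds node i's outlinks, scaled so that the row sums to 1.0."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for src, targets in links.items():
        for dst in targets:
            matrix[index[src]][index[dst]] = 1.0 / len(targets)
    return matrix

def inlink_counts(links, nodes):
    """Simplest authority score: the number of links pointing at a node."""
    counts = {node: 0 for node in nodes}
    for targets in links.values():
        for dst in targets:
            counts[dst] += 1
    return counts

# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
NODES = ["a", "b", "c"]
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(transition_matrix(LINKS, NODES))
print(inlink_counts(LINKS, NODES))
```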

Page Rank (Without Random Jump)
  Contribution averaged over all outlinks
  Node score is the sum of contributions
  Fixed point equation
    –If A is normalized
      Each row sums to 1.0
  [Figure: example network with scores A (.25) and B (.3); edge labels .1, 1/2, .1]

Page Rank Implications
  A is a stochastic matrix
    –r(i) can be interpreted as a probability
      Suppose a surfer takes an outlink at random
      r(i) is the long run probability of landing at a particular node
    –Solution to fixed point equation is the principal eigenvector
      principal eigenvalue is 1.0
  Solution can be found by iteration
    –Start with random initial value for r
    –Iterate multiplication by A
      Contribution of smaller eigenvalues will drop out
    –Final value is a good estimate of the fixed point solution
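
The iteration can be sketched in a few lines. The sketch assumes the standard reading of the fixed point, where each node's score is the sum over its in-linking nodes of their score divided by their outlink count. It starts from a uniform vector rather than a random one so that the output is reproducible, and it assumes every node has at least one outlink.

```python
def pagerank_no_jump(links, iterations=100):
    """Power iteration for Page Rank without the random jump.

    `links` maps every node to its non-empty list of outlink targets.
    """
    nodes = sorted(links)
    r = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        nxt = {node: 0.0 for node in nodes}
        for src, targets in links.items():
            # Each outlink carries an equal share of the source's score.
            for dst in targets:
                nxt[dst] += r[src] / len(targets)
        r = nxt
    return r

# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
print(pagerank_no_jump({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```

For this graph the scores settle near a = 0.4, b = 0.2, c = 0.4, which is the long run probability of a random surfer being at each node.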

Page Rank (with random jump)
  What’s the score for a node with no in-links?
  Revised equation
  Fixed point equation
  Probability interpretation
    –As before with ε chance of jumping randomly
  [Figure: example network with scores A (.225) and B (.293); edge labels .1, 1/2, .1; ε = 0.1]
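
A sketch of the iteration with a random jump, assuming the usual form in which the surfer jumps to a uniformly chosen node with the jump probability (0.1 on the slide) and otherwise follows a random outlink. The exact equation on the slide is not preserved in this transcript, so this form is an assumption; the graph is hypothetical.

```python
def pagerank_with_jump(links, jump=0.1, iterations=100):
    """Power iteration where `jump` is the random jump probability.

    `links` maps every node to its non-empty list of outlink targets.
    """
    nodes = sorted(links)
    n = len(nodes)
    r = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives an equal share of the jump mass.
        nxt = {node: jump / n for node in nodes}
        for src, targets in links.items():
            for dst in targets:
                nxt[dst] += (1.0 - jump) * r[src] / len(targets)
        r = nxt
    return r

# Hypothetical graph; node d has no in-links.
LINKS_WITH_ORPHAN = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
print(pagerank_with_jump(LINKS_WITH_ORPHAN))
```

Under this assumption a node with no in-links, such as d here, keeps a score of jump / n instead of dropping to zero.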

Eigenrank
  Separates internal from external links
    –Internal transition matrix I
    –External transition matrix E
  Introduces a new parameter
    –ε is the random jump probability
    –δ is the probability of taking an internal link
    –(1 - ε - δ) is the probability of taking an external link

Eigenrank
  Revised equation
  Fixed point equation
  Probability interpretation
    –ε chance of random jump
    –δ chance of internal link
    –(1 - ε - δ) chance of external link
  [Figure: example network with scores A (.2) and B (.202); edge labels .1, 1/2, .1; ε = 0.1, δ = 0.1]
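
The probability interpretation suggests an iteration like the one below. This is only a guess at the shape of the computation, since the slide's equation is not preserved: the parameter names jump and internal_prob are hypothetical, and a node lacking internal or external links simply hands that share of its mass to the random jump, which is one of several possible treatments.

```python
def eigenrank(internal, external, nodes, jump=0.1, internal_prob=0.1, iterations=100):
    """Random surfer with separate probabilities for internal and external links."""
    n = len(nodes)
    external_prob = 1.0 - jump - internal_prob
    r = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        nxt = {node: 0.0 for node in nodes}
        jump_mass = 0.0
        for node in nodes:
            jump_mass += jump * r[node]
            for share, targets in (
                (internal_prob, internal.get(node, [])),
                (external_prob, external.get(node, [])),
            ):
                if targets:
                    for dst in targets:
                        nxt[dst] += share * r[node] / len(targets)
                else:
                    # Assumption: missing link types feed the random jump.
                    jump_mass += share * r[node]
        for node in nodes:
            nxt[node] += jump_mass / n
        r = nxt
    return r

# Hypothetical hosts: pages x1 and x2 on one host, y1 on another.
NODES_ER = ["x1", "x2", "y1"]
INTERNAL = {"x1": ["x2"], "x2": ["x1"]}
EXTERNAL = {"x1": ["y1"], "y1": ["x1"]}
print(eigenrank(INTERNAL, EXTERNAL, NODES_ER))
```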

Computational Issues
  Nodes with no outlinks
    –Transition matrix with zero row
      Internal or external
    –Leave out of computation(?)
    –Redistribute mass to random jump(?)
  Currently mass is redistributed
    –Complex formula that prefers external links

Kleinberg
  Two scores
    –Authority score, a
    –Hub score, h
  Fixed Point equations
    –Authority
    –Hub
    –Principal eigenvectors are solutions
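
A small sketch of Kleinberg's hub and authority iteration, using the standard formulation in which a node's authority is the sum of the hub scores of the nodes linking to it and its hub score is the sum of the authority scores of the nodes it links to. Normalizing each score vector to sum to 1 is a choice made here for readability; the graph is hypothetical.

```python
def hits(links, iterations=50):
    """Iterate authority and hub scores on a dict of node -> outlink targets."""
    nodes = sorted(set(links) | {d for targets in links.values() for d in targets})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 0.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores over in-links.
        auth = {node: 0.0 for node in nodes}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: sum of authority scores over outlinks.
        new_hub = {node: 0.0 for node in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += auth[dst]
        auth_total = sum(auth.values()) or 1.0
        hub_total = sum(new_hub.values()) or 1.0
        auth = {node: score / auth_total for node, score in auth.items()}
        hub = {node: score / hub_total for node, score in new_hub.items()}
    return auth, hub

# Hypothetical graph: h1 links to x and y, h2 links to x.
authority, hub_score = hits({"h1": ["x", "y"], "h2": ["x"]})
print(authority, hub_score)
```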

SPAM
  Manipulation of content purely to influence ranking
    –Dictionary SPAM
    –Link sharing
    –Domain hi-jacking
    –Link farms
  Robotic use of search results
    –Meta-search engines
    –Search Engine optimizers
    –Fraud

Third Generation Technologies

Handling Ambiguity Results for query: Cobra

Impression Tracking Incoherent urls are those that receive high rank for a large diversity of queries. Many incoherent urls indicate SPAM or a bug (as in this case).
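
One way to surface candidates is to count, for each URL, how many distinct queries it ranks highly for. The log format (query, url, rank), the top_n cutoff and the function name are assumptions for this sketch; the slides do not say how the diversity was measured.

```python
from collections import defaultdict

def query_diversity(impressions, top_n=10):
    """Count the distinct queries for which each URL appears in the top_n."""
    queries_by_url = defaultdict(set)
    for query, url, rank in impressions:
        if rank <= top_n:
            queries_by_url[url].add(query)
    return {url: len(queries) for url, queries in queries_by_url.items()}

# Hypothetical impression log: (query, url, rank).
IMPRESSIONS = [
    ("travel", "http://a.example/", 1),
    ("cobra", "http://a.example/", 2),
    ("mafia", "http://a.example/", 3),
    ("travel", "http://b.example/", 2),
    ("travel", "http://b.example/", 4),
    ("cobra", "http://c.example/", 25),
]
print(query_diversity(IMPRESSIONS))
```

URLs with an unusually high count relative to the rest would then be reviewed as possible SPAM or as symptoms of a ranking bug.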

Clickrate Relevance Metric Average highest rank clicked perceptibly increased with the release of a new rank function.
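
A sketch of how such a metric might be computed from click logs. It assumes that "highest rank clicked" means the best (numerically smallest) result position clicked in a query session and that sessions without clicks are skipped; both are interpretations, not details from the slides.

```python
def average_highest_rank_clicked(sessions):
    """Mean of the best clicked position over all sessions that have clicks.

    `sessions` is a list of lists of clicked result positions (1 = top).
    """
    best_positions = [min(clicks) for clicks in sessions if clicks]
    if not best_positions:
        return None
    return sum(best_positions) / len(best_positions)

# Hypothetical sessions; the third one has no clicks.
print(average_highest_rank_clicked([[3, 1, 5], [2], [], [4, 7]]))
```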

User Interface
  Ranked result lists
    –Document summaries are critical
      Hit highlighting
      Dynamic abstracts
      url
    –No recent innovation
      Graphical presentations not well fit to the task
  Blending
    –Predefined segmentation
      e.g. Paid listing
    –Intermixed with results from other sources
      e.g. News

Future Trends
  Question Answering
    –WWW as language model
      Enables simple methods
      e.g. Dumais et al. (SIGIR 2002)
  New contexts
    –Ubiquitous Searching
      Toolbars, desktop, phone
    –Implicit Searching
      Computed links
  New Tasks
    –E.g. Local/Country Search

Bibliography
  Modeling the Internet and the Web: Probabilistic Methods and Algorithms, by Pierre Baldi, Paolo Frasconi, and Padhraic Smyth. John Wiley & Sons; May 28, 2003.
  Mining the Web: Analysis of Hypertext and Semi Structured Data, by Soumen Chakrabarti. Morgan Kaufmann; August 15, 2002.
  The Anatomy of a Large-scale Hypertextual Web Search Engine, by S. Brin and L. Page. 7th International WWW Conference, Brisbane, Australia; April.