Jan Pedersen 10 September 2007

Slides:

Advertisements

Similar presentations

Information Retrieval in Practice

Advertisements

Temporal Query Log Profiling to Improve Web Search Ranking Alexander Kotov (UIUC) Pranam Kolari, Yi Chang (Yahoo!) Lei Duan (Microsoft)

The Inside Story Christine Reilly CSCI 6175 September 27, 2011.

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:

Natural Language Processing WEB SEARCH ENGINES August, 2002.

1 Learning User Interaction Models for Predicting Web Search Result Preferences Eugene Agichtein Eric Brill Susan Dumais Robert Ragno Microsoft Research.

Context-aware Query Suggestion by Mining Click-through and Session Data Authors: H. Cao et.al KDD 08 Presented by Shize Su 1.

Information Retrieval in Practice

1 The Four Dimensions of Search Engine Quality Jan Pedersen Chief Scientist, Yahoo! Search 19 September 2005.

6/16/20151 Recent Results in Automatic Web Resource Discovery Soumen Chakrabartiv Presentation by Cui Tao.

1 How to Crawl the Web Looksmart.com12/13/2002 Junghoo “John” Cho UCLA.

Sigir’99 Inside Internet Search Engines: Fundamentals Jan Pedersen and William Chang.

1 CS 430: Information Discovery Lecture 21 Web Search 3.

1 Crawling the Web Discovery and Maintenance of Large-Scale Web Data Junghoo Cho Stanford University.

Search Quality Jan Pedersen 10 September Outline  The Search Landscape  A Framework for Quality –RCFP  Search Engine Architecture  Detailed.

ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.

Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.

1 Crawling the Web Discovery and Maintenance of Large-Scale Web Data Junghoo Cho Stanford University.

Overview of Search Engines

Finding Advertising Keywords on Web Pages Scott Wen-tau YihJoshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University.

Search Engine Optimization By Tom Fallenstein. Introduction Why you want high rankings Why you want high rankings Keywords Keywords Tools to help choose.

Improving web image search results using query-relative classifiers Josip Krapacy Moray Allanyy Jakob Verbeeky Fr´ed´eric Jurieyy.

Search Engine Optimization: Understanding the Engines & Building Successful Sites Zohaib Ahmed Google Analytics Individual Qualified March 2012.

Web Search Jan Pedersen Chief Scientist, Search and Marketplace Yahoo! Inc.

Search Engine Marketing Shelly Brown Director of Web Services Southwest Baptist University.

Yahoo! Acquires Inktomi March 19 th, Yahoo!

Improving Web Search Ranking by Incorporating User Behavior Information Eugene Agichtein Eric Brill Susan Dumais Microsoft Research.

Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.

Fall 2006 Davison/LinCSE 197/BIS 197: Search Engine Strategies 2-1 How Search Engines Work Today we show how a search engine works  What happens when.

Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.

WHAT IS A SEARCH ENGINE A search engine is not a physical engine, instead its an electronic code or a software programme that searches and indexes millions.

When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.

윤언근 DataMining lab.  The Web has grown exponentially in size but this growth has not been isolated to good-quality pages.  spamming and.

Implicit User Feedback Hongning Wang Explicit relevance feedback 2 Updated query Feedback Judgments: d 1 + d 2 - d 3 + … d k -... Query User judgment.

Search Result Interface Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.

YZUCSE SYSLAB A Study of Web Search Engine Bias and its Assessment Ing-Xiang Chen and Cheng-Zen Yang Dept. of Computer Science and Engineering Yuan Ze.

Detecting Dominant Locations from Search Queries Lee Wang, Chuang Wang, Xing Xie, Josh Forman, Yansheng Lu, Wei-Ying Ma, Ying Li SIGIR 2005.

Web Search Algorithms By Matt Richard and Kyle Krueger.

Question Answering over Implicitly Structured Web Content

CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.

Searching the World Wide Web: Meta Crawlers vs. Single Search Engines By: Voris Tejada.

Post-Ranking query suggestion by diversifying search Chao Wang.

Date: 2012/11/29 Author: Chen Wang, Keping Bi, Yunhua Hu, Hang Li, Guihong Cao Source: WSDM’12 Advisor: Jia-ling, Koh Speaker: Shun-Chen, Cheng.

Sigir’99 Inside Internet Search Engines: Spidering and Indexing Jan Pedersen and William Chang.

1 CS 430: Information Discovery Lecture 18 Web Search Engines: Google.

CP3024 Lecture 12 Search Engines. What is the main WWW problem?  With an estimated 800 million web pages finding the one you want is difficult!

Why Decision Engine Bing Demos Search Interaction model Data-driven Research Problems Q & A.

Predicting Short-Term Interests Using Activity-Based Search Context CIKM’10 Advisor: Jia Ling, Koh Speaker: Yu Cheng, Hsieh.

The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages , April.

Search Engine and Optimization 1. Introduction to Web Search Engines 2.

Web Analytics Fundamentals Presented by Tejaswi, Chandrika, Sunil.

Seminar on seminar on Presented By L.Nageswara Rao 09MA1A0546. Under the guidance of Ms.Y.Sushma(M.Tech) asst.prof.

Jan 27, Digital Preservation Seminar1 Effective Page Refresh Policies for Web Crawlers Written By: Junghoo Cho & Hector Garcia-Molina Presenter:

Web Spam Taxonomy Zoltán Gyöngyi, Hector Garcia-Molina Stanford Digital Library Technologies Project, 2004 presented by Lorenzo Marcon 1/25.

Search Engine Marketing Science Writers Conference 2009.

SEARCH ENGINE by: by: B.Anudeep B.Anudeep Y5CS016 Y5CS016.

Information Retrieval in Practice

User Modeling for Personal Assistant

Education 499-R01 Search Basics.

Using Search Tools on the Internet

Search Engines and Search techniques

Search Engine Architecture

Advanced Topics in Concurrency and Reactive Programming: Case Study – Google Cluster Majeed Kassis.

Search Market and Technologies

Search Engine Architecture

How to Crawl the Web Peking University 12/24/2003 Junghoo “John” Cho

The Four Dimensions of Search Engine Quality

Information Retrieval

Web Search Engines.

Presentation transcript:

Jan Pedersen 10 September 2007 Search Quality Jan Pedersen 10 September 2007

A Framework for Quality Search Engine Architecture Detailed Issues Outline The Search Landscape A Framework for Quality RCFP Search Engine Architecture Detailed Issues

Search Landscape 2007 Three major “Mainframes” >800M searches daily Google,Yahoo, and MSN >800M searches daily 60% international 106 machines $20B in Paid Search Revenues Large indices Billions of documents Petabytes of data Source: Search Engine Watch: US web search share, July 2006 Key Drivers: Scale, Quality, Distribution

World Internet Usage

Search Results page Text Input Search Assists Related Results Search Ads Ranked Results

Search Engine Architecture WebMap Query Serving Stream Computation Meta data Index Crawling WWW Load Replication Snapshot Index Indexing Index Bandwidth Bottleneck Index CPU Bottleneck

Query Serving Architecture 3DNS F5 BigIP FE1 QI1 WI1,1 WI1,2 WI1,3 WI1,48 WI2,1 WI2,2 WI2,3 WI2,48 WI4,1 WI4,2 WI4,3 WI4,48 WI3,1 WI3,2 WI3,3 WI3,48 QI2 QI8 FE2 FE32 Sunnyvale L3 NYC L3 “travel” … Rectangular Array Each row is a replicate Each column is an index segment Results are merged across segments Each node evaluates the query against its segment. Latency is determined by the performance of a single node

The Factory Floor

User Satisfaction What’s the Goal? Understand user intent Problems: Ambiguity and Context Generate relevant matches Problems: Scale and accuracy Present useful information Problems: Ranking and Presentation

Evaluation Graded Relevance score Editorial Assessment Session/Task fulfillment? Behavioral measures?

Clickrate Relevance Metric Average highest rank clicked perceptibly increased with the release of a new rank function.

Quality Dimensions Ranking Comprehensiveness Freshness Presentation Ability to rank hits by relevance Comprehensiveness Index size and composition Freshness Recency of indexed data Presentation Titles and Abstracts

Comprehensiveness Problem: Issues: Selection Problem Make accessible all useful Web pages Issues: Web has an infinite number of pages Finite resources available Bandwidth Disk capacity Selection Problem Which pages to visit Crawl Policy Which pages to index Index Selection Policy

Moore’s Law and Index Size ~150M in 1998 ~5B in 2005 33x increase Moore would predict 25x What about 2010? 40B? Source: Search Engine Watch 1994 Yahoo (directory) and Lycos (index) go public 1995 Infoseek and Excite go public 1997 Alta Vista launches 100M index 1998 Inktomi and Google launched 1999 All The Web launched 2003 Yahoo purchases Inktomi and Overture 2004 Google goes public 2005 Msft launches MS Live

Impossible to achieve exactly Divide and Conquer Freshness Problem: Ensure that what is indexed correctly reflects current state of the web Impossible to achieve exactly Revisit vs Discovery Divide and Conquer A few pages change continually Most pages are relatively static

Changing documents in daily crawl for 32-day period

Freshness Source: Search Engine Showdown

Ranking Problem: Issues: Given a well-formed query, place the most relevant pages in the first few positions Issues: Scale: Many candidate matches Response in < 100 msecs Evaluation: Editorial User Behavior

Query Dependent features Ranking Framework Regression problem Estimate editorial relevance given ranking features Query Dependent features Term overlap between query and Meta-data Content Query Independent Features Quality (e.g. Page Rank) Spamminess

Machine Learned Ranking Goal: Automatically construct a ranking function Input: Large number training examples Features that predict relevance Relevance metrics Output: Ranking function Enables rapid experimental cycle Scientific investigation of Modifications to existing features New feature

Ranking Features A0 - A4 anchor text score per term W0 - W4 term weights L0 - L4 first occurrence location (encodes hostname and title match) SP spam index: logistic regression of 85 spam filter variables (against relevance scores) F0 - F4 term occurrence frequency within document DCLN document length (tokens) ER Eigenrank HB Extra-host unique inlink count ERHB ER*HB A0W0 etc. A0*W0 QA Site factor – logistic regression of 5 site link and url count ratios SPN Proximity FF family friendly rating UD url depth

Implements (Tree 0) A0w0 < 22.3 Y N L0 < 18+1 R=0.0015 F0 < 2 + 1 R = -0.2368 F2 < 1 + 1 R=-0.0039 R=-0.0199 R=-0.1185 R=-0.1604 R=-0.0790

Presentation Spelling Correction Also Try Short cuts Titles and Abstracts

Eye Tracking Studies Golden Triangle Quick scan Longer scan Top left corner Quick scan For candidate Longer scan For relevance

Comparison to State-of-the-art

Search is a hard problem Conclusions Search is a hard problem Solutions are approximate Measurement is difficult Search quality can be decomposed in separate but related problems Ranking Comprehensiveness Freshness Presentation