Search Quality. Jan Pedersen, 10 September 2007.

Presentation transcript:

Search Quality Jan Pedersen 10 September 2007

2 Outline
• The Search Landscape
• A Framework for Quality
  - RCFP (Ranking, Comprehensiveness, Freshness, Presentation)
• Search Engine Architecture
• Detailed Issues

3 Search Landscape 2007
• Key Drivers: Scale, Quality, Distribution
• Three major “Mainframes”
  - Google, Yahoo, and MSN
• >800M searches daily
  - 60% international
  - 10^6 machines
• $20B in Paid Search Revenues
• Large indices
  - Billions of documents
  - Petabytes of data
[Figure: US web search share, July 2006. Source: Search Engine Watch]

4 World Internet Usage

5 Search Results Page
[Figure: annotated search results page. Labels: Text Input, Related Results, Search Assists, Ranked Results, Search Ads]

6 Search Engine Architecture
[Figure: architecture diagram. Labels: WWW, Crawling, Snapshot, Meta data, WebMap, Indexing, Index, Query Serving, Bandwidth Bottleneck, CPU Bottleneck, Stream Computation, Load Replication]

7 Query Serving Architecture
• Rectangular Array
  - Each row is a replica
  - Each column is an index segment
• Results are merged across segments
  - Each node evaluates the query against its segment
• Latency is determined by the performance of a single node
[Figure: the query “travel” routed through 3DNS and F5 BigIP to nodes FE 1 ... FE 32, QI 1 ... QI 8, and a grid WI 1,1 ... WI 4,48. Labels: Sunnyvale L3, NYC L3]
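The rectangular array above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the `serve_query` function, the toy in-memory segments, and the term-count score are all hypothetical and are not taken from the system in the slides.

```python
import heapq
import random

class Node:
    """One cell of the array, holding one index segment (toy in-memory index)."""
    def __init__(self, docs):
        self.docs = docs  # doc_id -> document text

    def evaluate(self, query, k):
        # Hypothetical score: how many times the query term occurs in the document.
        scored = [(text.split().count(query), doc_id)
                  for doc_id, text in self.docs.items()]
        hits = [(score, doc_id) for score, doc_id in scored if score > 0]
        return heapq.nlargest(k, hits)

def build_array(segments, replicas):
    # Each row is a replica of the whole index; each column is one segment.
    return [[Node(segment) for segment in segments] for _ in range(replicas)]

def serve_query(array, query, k=3, rng=random):
    row = rng.choice(array)  # any replica can answer; picking one spreads load
    partial = [node.evaluate(query, k) for node in row]  # one hit list per segment
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [doc_id for _, doc_id in merged]

segments = [
    {"d1": "travel deals travel", "d2": "cooking tips"},
    {"d3": "travel guide", "d4": "travel travel travel insurance"},
]
array = build_array(segments, replicas=4)
print(serve_query(array, "travel"))
```

The sketch evaluates segments one after another for simplicity. In a deployment where the segments of a row are queried in parallel, the merge has to wait for the slowest node, which is one way to read the slide's point about latency.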

8

9 What’s the Goal?
• User Satisfaction
  - Understand user intent
    - Problems: Ambiguity and Context
  - Generate relevant matches
    - Problems: Scale and accuracy
  - Present useful information
    - Problems: Ranking and Presentation

10 Evaluation
• Graded Relevance score
• Editorial Assessment
• Session/Task fulfillment?
• Behavioral measures?

11 Clickrate Relevance Metric
Average highest rank clicked perceptibly increased with the release of a new rank function.
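One possible way to compute such a metric from click logs is sketched below. The slide does not define the metric precisely, so this assumes that position 1 is the "highest" rank, that a session's highest rank clicked is therefore its smallest clicked position, and that sessions without clicks are ignored. The function name and the sample data are made up for illustration.

```python
def average_highest_rank_clicked(sessions):
    """sessions: one list of clicked result positions per query (1 = top result).

    Assumption: 'highest rank' means the best, i.e. smallest, position clicked.
    Sessions with no clicks are skipped.
    """
    best = [min(clicks) for clicks in sessions if clicks]
    if not best:
        return None
    return sum(best) / len(best)

# Made-up click logs for two hypothetical rank functions.
old_ranker = [[3, 5], [2], [], [4, 1]]
new_ranker = [[1, 2], [2], [1], []]
print(average_highest_rank_clicked(old_ranker))
print(average_highest_rank_clicked(new_ranker))
```

Under this reading, a value closer to 1 means users are finding something worth clicking nearer the top of the page.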

12 Quality Dimensions
• Ranking
  - Ability to rank hits by relevance
• Comprehensiveness
  - Index size and composition
• Freshness
  - Recency of indexed data
• Presentation
  - Titles and Abstracts

13 Comprehensiveness
• Problem:
  - Make accessible all useful Web pages
• Issues:
  - Web has an infinite number of pages
  - Finite resources available
    - Bandwidth
    - Disk capacity
• Selection Problem
  - Which pages to visit: Crawl Policy
  - Which pages to index: Index Selection Policy
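The selection problem above can be illustrated with a budgeted, best-first crawl. This is a minimal sketch and not the crawl policy of any real engine: the `crawl` function, the link graph, and the usefulness scores are hypothetical, and the `links` dictionary stands in for fetching and parsing pages.

```python
import heapq

def crawl(seeds, links, score, budget):
    """Visit at most `budget` pages, highest estimated usefulness first.

    links: url -> list of outlinks (a stand-in for fetching and parsing).
    score: callable url -> hypothetical usefulness estimate.
    """
    frontier = [(-score(url), url) for url in seeds]  # negate for a max-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                heapq.heappush(frontier, (-score(out), out))
    return visited

links = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
scores = {"a": 1.0, "b": 0.2, "c": 0.9, "d": 0.8, "e": 0.1}
print(crawl(["a"], links, scores.get, budget=3))
```

The budget models the finite bandwidth and disk capacity named on the slide. An index selection policy could apply the same idea a second time, keeping only the best-scoring subset of the visited pages.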

14 Moore’s Law and Index Size
• ~150M in 1998
• ~5B in 2005
  - 33x increase
  - Moore would predict 25x
• What about 2010?
  - 40B?
Source: Search Engine Watch
• 1994 Yahoo (directory) and Lycos (index) go public
• 1995 Infoseek and Excite go public
• 1997 Alta Vista launches 100M index
• 1998 Inktomi and Google launched
• 1999 All The Web launched
• 2003 Yahoo purchases Inktomi and Overture
• 2004 Google goes public
• 2005 Msft launches MS Live
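The arithmetic behind the 33x and 25x figures can be checked as follows. The 18-month doubling period is an assumption, since the slide does not state one, but it reproduces the slide's 25x. The 2010 projection simply extends that same assumption and is not a figure from the slide.

```python
def growth_factor(years, doubling_period_years):
    """Growth after `years` if capacity doubles every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)

index_1998 = 150e6  # ~150M documents
index_2005 = 5e9    # ~5B documents

observed = index_2005 / index_1998       # observed index growth, 1998 to 2005
moore = growth_factor(2005 - 1998, 1.5)  # assumed 18-month doubling period
projected_2010 = index_2005 * growth_factor(2010 - 2005, 1.5)

print(round(observed, 1), round(moore, 1), round(projected_2010 / 1e9, 1))
```

Under this assumption the projection for 2010 comes out near 50B documents, the same order of magnitude as the slide's tentative 40B.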

15 Freshness
• Problem:
  - Ensure that what is indexed correctly reflects current state of the web
• Impossible to achieve exactly
  - Revisit vs Discovery
• Divide and Conquer
  - A few pages change continually
  - Most pages are relatively static
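One way to exploit the split between continually changing and mostly static pages is an adaptive revisit interval. The policy below is a hypothetical sketch, not the scheduler described in the talk: `next_interval`, the halving and doubling rule, and the 1-day and 64-day bounds are all assumptions.

```python
def next_interval(current_days, changed, min_days=1, max_days=64):
    """Hypothetical adaptive revisit policy.

    Halve the revisit interval when the page changed since the last visit,
    double it when the page did not, and clamp to [min_days, max_days].
    """
    proposed = current_days / 2 if changed else current_days * 2
    return max(min_days, min(max_days, proposed))

def simulate(start_days, observations):
    """Apply the policy over a sequence of changed/unchanged observations."""
    interval = start_days
    for changed in observations:
        interval = next_interval(interval, changed)
    return interval

print(simulate(8, [True, True, True, True]))      # page that keeps changing
print(simulate(8, [False, False, False, False]))  # page that stays static
```

Pages that keep changing drift toward the minimum interval, while static pages drift toward the maximum, which frees crawl capacity for discovering new pages.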

16 Changing documents in daily crawl for 32-day period

17 Freshness
Source: Search Engine Showdown

18 Ranking
• Problem:
  - Given a well-formed query, place the most relevant pages in the first few positions
• Issues:
  - Scale:
    - Many candidate matches
    - Response in < 100 msecs
  - Evaluation:
    - Editorial
    - User Behavior

19 Ranking Framework
• Regression problem
  - Estimate editorial relevance given ranking features
• Query Dependent Features
  - Term overlap between query and
    - Meta-data
    - Content
• Query Independent Features
  - Quality (e.g. PageRank)
  - Spamminess

20 Machine Learned Ranking
• Goal: Automatically construct a ranking function
  - Input:
    - Large number of training examples
    - Features that predict relevance
    - Relevance metrics
  - Output:
    - Ranking function
• Enables rapid experimental cycle
  - Scientific investigation of
    - Modifications to existing features
    - New features
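The input/output contract above (labeled examples in, ranking function out) can be illustrated with a very small learner. The sketch below fits a one-split regression tree (a stump) to editorial grades and ranks candidates by predicted grade. It is a stand-in chosen for brevity, not the learner used in the talk, and the feature names `anchor` and `spam`, the grades, and the candidate pages are made up.

```python
def fit_stump(rows, targets):
    """Fit a one-split regression tree by minimizing squared error.

    rows: list of feature dicts; targets: editorial relevance grades.
    Returns a function mapping a feature dict to a predicted grade.
    """
    best = None
    for feature in rows[0]:
        for threshold in sorted({row[feature] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[feature] < threshold]
            right = [t for row, t in zip(rows, targets) if row[feature] >= threshold]
            if not left or not right:
                continue  # a split must put examples on both sides
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            error = (sum((t - lmean) ** 2 for t in left)
                     + sum((t - rmean) ** 2 for t in right))
            if best is None or error < best[0]:
                best = (error, feature, threshold, lmean, rmean)
    _, feature, threshold, lmean, rmean = best
    return lambda row: lmean if row[feature] < threshold else rmean

# Made-up training examples: ranking features plus an editorial grade each.
rows = [
    {"anchor": 30.0, "spam": 0.1},
    {"anchor": 25.0, "spam": 0.2},
    {"anchor": 5.0, "spam": 0.1},
    {"anchor": 2.0, "spam": 0.9},
]
grades = [4, 3, 1, 0]

model = fit_stump(rows, grades)
candidates = {"p1": {"anchor": 3.0, "spam": 0.5},
              "p2": {"anchor": 28.0, "spam": 0.3}}
ranking = sorted(candidates, key=lambda doc: model(candidates[doc]), reverse=True)
print(ranking)
```

Because the ranking function is produced entirely from the examples and features, trying a new or modified feature only requires adding it to the feature dicts and refitting, which is the rapid experimental cycle the slide refers to.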

21 Ranking Features
• A0 - A4: anchor text score per term
• W0 - W4: term weights
• L0 - L4: first occurrence location (encodes hostname and title match)
• SP: spam index: logistic regression of 85 spam filter variables (against relevance scores)
• F0 - F4: term occurrence frequency within document
• DCLN: document length (tokens)
• ER: Eigenrank
• HB: Extra-host unique inlink count
• ERHB: ER*HB
• A0W0 etc.: A0*W0
• QA: Site factor: logistic regression of 5 site link and url count ratios
• SPN: Proximity
• FF: family friendly rating
• UD: url depth

22 Implements (Tree 0)
[Figure: decision tree (Tree 0). Split tests include A0W0 < 22.3, L0 < 18+1, L1 < 18+1, W0 < 856, F0 < 2+1, and a test on F2; leaves carry response values R=; branches are labeled Y/N]

23 Presentation
• Spelling Correction
• Also Try
• Shortcuts
• Titles and Abstracts

24 Eye Tracking Studies
• Golden Triangle
  - Top left corner
• Quick scan
  - For candidate
• Longer scan
  - For relevance

25 Comparison to State-of-the-art

26 Conclusions
• Search is a hard problem
  - Solutions are approximate
  - Measurement is difficult
• Search quality can be decomposed into separate but related problems
  - Ranking
  - Comprehensiveness
  - Freshness
  - Presentation