The Four Dimensions of Search Engine Quality

Slides:



Advertisements
Similar presentations
Date: 2013/1/17 Author: Yang Liu, Ruihua Song, Yu Chen, Jian-Yun Nie and Ji-Rong Wen Source: SIGIR12 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Adaptive.
Advertisements

© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Bringing Order to the Web: Automatically Categorizing Search Results Hao Chen SIMS, UC Berkeley Susan Dumais Adaptive Systems & Interactions Microsoft.
Natural Language Processing WEB SEARCH ENGINES August, 2002.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
Context-aware Query Suggestion by Mining Click-through and Session Data Authors: H. Cao et.al KDD 08 Presented by Shize Su 1.
1 The Four Dimensions of Search Engine Quality Jan Pedersen Chief Scientist, Yahoo! Search 19 September 2005.
Automatic Discovery and Classification of search interface to the Hidden Web Dean Lee and Richard Sia Dec 2 nd 2003.
Sigir’99 Inside Internet Search Engines: Fundamentals Jan Pedersen and William Chang.
Search Quality Jan Pedersen 10 September Outline  The Search Landscape  A Framework for Quality –RCFP  Search Engine Architecture  Detailed.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
1 Intelligent Crawling Junghoo Cho Hector Garcia-Molina Stanford InfoLab.
Search Engine Optimization: Understanding the Engines & Building Successful Sites Zohaib Ahmed Google Analytics Individual Qualified March 2012.
Web Search Jan Pedersen Chief Scientist, Search and Marketplace Yahoo! Inc.
Yahoo! Acquires Inktomi March 19 th, Yahoo!
Understanding and Predicting Graded Search Satisfaction Tang Yuk Yu 1.
Master Thesis Defense Jan Fiedler 04/17/98
Improving Web Search Ranking by Incorporating User Behavior Information Eugene Agichtein Eric Brill Susan Dumais Microsoft Research.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
WHAT IS A SEARCH ENGINE A search engine is not a physical engine, instead its an electronic code or a software programme that searches and indexes millions.
CIKM’09 Date:2010/8/24 Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen 1.
Personalized Search Cheng Cheng (cc2999) Department of Computer Science Columbia University A Large Scale Evaluation and Analysis of Personalized Search.
« Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee » Proceedings of the 30th annual international ACM SIGIR, Amsterdam 2007) A.
Internet Information Retrieval Sun Wu. Course Goal To learn the basic concepts and techniques of internet search engines –How to use and evaluate search.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
윤언근 DataMining lab.  The Web has grown exponentially in size but this growth has not been isolated to good-quality pages.  spamming and.
Quantitative Comparisons of Search Engine Results Mike Thlwall School of Computing and Information Technology, University of Wolverhampton ( 伍爾弗漢普頓 UK)
Detecting Dominant Locations from Search Queries Lee Wang, Chuang Wang, Xing Xie, Josh Forman, Yansheng Lu, Wei-Ying Ma, Ying Li SIGIR 2005.
Question Answering over Implicitly Structured Web Content
Autumn Web Information retrieval (Web IR) Handout #1:Web characteristics Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Google’s Deep-Web Crawl By Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy August 30, 2008 Speaker : Sahana Chiwane.
1 FollowMyLink Individual APT Presentation Third Talk February 2006.
Qi Guo Emory University Ryen White, Susan Dumais, Jue Wang, Blake Anderson Microsoft Presented by Tetsuya Sakai, Microsoft Research.
CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.
Meet the web: First impressions How big is the web and how do you measure it? How many people use the web? How many use search engines? What is the shape.
COMAD 2008Chakrabarti Bridging the Structured-Unstructured Gap Born in New York in 1934, Sagan was a noted astronomer whose lifelong passion was searching.
Post-Ranking query suggestion by diversifying search Chao Wang.
Date: 2012/11/29 Author: Chen Wang, Keping Bi, Yunhua Hu, Hang Li, Guihong Cao Source: WSDM’12 Advisor: Jia-ling, Koh Speaker: Shun-Chen, Cheng.
Sigir’99 Inside Internet Search Engines: Spidering and Indexing Jan Pedersen and William Chang.
1 CS 430: Information Discovery Lecture 18 Web Search Engines: Google.
11 Why tune relevance Because we want to find the one single best item, among a large group of possible candidates….
CP3024 Lecture 12 Search Engines. What is the main WWW problem?  With an estimated 800 million web pages finding the one you want is difficult!
Yahoo! BOSS Open up Yahoo!’s Search data via web services Developer & Custom Tracks Big Goal – If you’re in a vertical and you perform a search, you should.
Learning to Rank: From Pairwise Approach to Listwise Approach Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li Presenter: Davidson Date:
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
SEARCH ENGINES The World Wide Web contains a wealth of information, so much so that without search facilities it could be impossible to find what you were.
Search Engine and Optimization 1. Introduction to Web Search Engines 2.
Jan 27, Digital Preservation Seminar1 Effective Page Refresh Policies for Web Crawlers Written By: Junghoo Cho & Hector Garcia-Molina Presenter:
1 Efficient Crawling Through URL Ordering Junghoo Cho Hector Garcia-Molina Lawrence Page Stanford InfoLab.
Information Retrieval in Practice
User Modeling for Personal Assistant
Evaluation Anisio Lacerda.
Improving Search Relevance for Short Queries in Community Question Answering Date: 2014/09/25 Author : Haocheng Wu, Wei Wu, Ming Zhou, Enhong Chen, Lei.
Search Engine Architecture
Advanced Topics in Concurrency and Reactive Programming: Case Study – Google Cluster Majeed Kassis.
Search Market and Technologies
How to Crawl the Web Peking University 12/24/2003 Junghoo “John” Cho
Improvements to Search
IST 516 Fall 2011 Dongwon Lee, Ph.D.
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Information Retrieval
Anatomy of a search engine
IST 497E Information Retrieval and Organization
Characterization of Search Engine Caches
ISWC 2013 Entity Recommendations in Web Search
Jan Pedersen 10 September 2007
Presentation transcript:

The Four Dimensions of Search Engine Quality Jan Pedersen Chief Scientist, Yahoo! Search 19 September 2005

Outline The Search Landscape A Framework for Quality RCFP Search Engine Architecture Detailed Issues

Search Landscape 2005 Four major “Mainframes” >450M searches daily Google,Yahoo, MSN, and ASK >450M searches daily 60% international Thousands of machines $8+B in Paid Search Revenues Large indices Billions of documents Terrabytes of data Excellent relevance For some tasks Source: Search Engine Watch

What’s the Goal? User Satisfaction Understand user intent Problems: Ambiguity and Context Generate Relevant matches Problems: Scale and accuracy Present Useful information Problems: Ranking and Presentation

Quality Dimensions Ranking Comprehensiveness Freshness Presentation Ability to rank hits by relevance Comprehensiveness Index size and composition Freshness Recency of indexed data Presentation Titles and Abstracts

Search Engine Architecture Web Map Meta data WWW Crawl Query Serving Web Index Snapshot Indexer Comprehensiveness and Freshness Ranking and Presentation

Comprehensiveness Problem: Issues: Selection Problem Make accessible all useful Web pages Issues: Web has an infinite number of pages Finite resources available Bandwidth Disk capacity Selection Problem Which pages to visit Crawl Policy Which pages to index Index Selection Policy

Crawl Policy Pages found by following links Basic iteration: Framework From an initial root set Basic iteration: Visit pages and extract links Prioritize next pages to visit (or revisit) Framework Visit pages most likely to be viewed most likely to contain links to pages that will be viewed Prioritization by Query-independent Quality

Freshness Problem: Impossible to achieve exactly Divide and Conquer Ensure that what is indexed correctly reflects current state of the web Impossible to achieve exactly Revisit vs Discovery Divide and Conquer A few pages change continually Most pages are relatively static

Changing documents in daily crawl for 32-day period

Freshness Source: Search Engine Showdown

Ranking Problem: Issues: Given a well-formed query, place the most relevant pages in the first few positions Issues: Scale: Many candidate matches Response in < 100 msecs Evaluation: Editorial User Behavior

Query Serving Architecture 3DNS F5 BigIP FE1 QI1 WI1,1 WI1,2 WI1,3 WI1,48 WI2,1 WI2,2 WI2,3 WI2,48 WI4,1 WI4,2 WI4,3 WI4,48 WI3,1 WI3,2 WI3,3 WI3,48 QI2 QI8 FE2 FE32 Sunnyvale L3 NYC L3 “travel” … Rectangular Array Each row is a replicate Each column is an index segment Results are merged across segments Each node evaluates the query against its segment. Latency is determined by the performance of a single node

Editorial Relevance Users grade relevance Search Engines are scored in aggregate over a query sample

Clickrate Relevance Metric Average highest rank clicked perceptibly increased with the release of a new rank function.

Ranking Framework Categorization problem Query Dependent features Estimate the probability of relevance given ranking features Query Dependent features Term overlap between query and Meta-data Content Query Independent Features Quality (e.g. Page Rank) Spamminess

Handling Ambiguity Results for query: Cobra

Presentation Spelling Correction Also Try Short cuts Titles and Abstracts

Conclusions Search is a hard problem Solutions are approximate Measurement is difficult Search quality can be decomposed in separate but related problems Ranking Comprehensiveness Freshness Presentation