Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.

Presentation transcript:

Slide 1: Inside Internet Search Engines: Search
  Jan Pedersen and William Chang

Slide 2: Basic Architectures: Search
  [Diagram. Components: Web, Spider, Index, SE (search engine), Log, Browser. Annotations: Spam, Freshness, Quality results, 20M queries/day, 800M pages?, 24x7]

Slide 3: Query Language
  Augmented vector space
  Relevance-scored results
  Tf, idf weighting
  Boolean constraints: +, -
  Phrases: “”
  Fields: e.g. title:
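The tf, idf weighting named on this slide can be sketched as follows. This is a minimal illustration, not the weighting any particular engine used: raw term frequency times idf = log(N / df) is one common textbook variant, and the function name `tf_idf_scores` and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """Score each document by the sum of tf * idf over the query terms.

    docs maps a document id to its list of tokens. Raw term frequency and
    idf = log(N / df) are an assumed textbook variant, not the slides' formula.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                score += tf[term] * math.log(n_docs / df[term])
        scores[doc_id] = score
    return scores

# Hypothetical toy collection.
docs = {
    "d1": "information retrieval on the web".split(),
    "d2": "web search engines".split(),
    "d3": "retrieval of information from text".split(),
}
scores = tf_idf_scores(["web", "search"], docs)
ranked = sorted(scores, key=scores.get, reverse=True)
```

The rarer term ("search") contributes more than the common one ("web"), so the document containing both ranks first.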

Slide 4: Does Word Order Matter?
  Try “information retrieval” versus “retrieval information”
  Do you get the same results?
  The query parser
    Interprets query syntax: +, -, “”
      Rarely used
    General query from free text
      Critical for precision
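A query parser for the syntax listed here (+, -, quoted phrases, and field prefixes such as title:) might look roughly like the sketch below. The regular expression, the output layout, and the name `parse_query` are assumptions for illustration; the sketch expects straight ASCII double quotes and does not describe any real engine's parser.

```python
import re

# Optional +/- sign, optional "field:" prefix, then a quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:(\w+):)?(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a query into required (+), excluded (-), and optional clauses."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for sign, field, phrase, word in TOKEN.findall(query):
        text = phrase if phrase else word
        if not text:
            continue  # skip an empty quoted phrase
        clause = {
            "field": field or None,
            "terms": text.lower().split(),  # phrase terms keep their order
            "phrase": bool(phrase),
        }
        if sign == "+":
            parsed["required"].append(clause)
        elif sign == "-":
            parsed["excluded"].append(clause)
        else:
            parsed["optional"].append(clause)
    return parsed

q = parse_query('+title:search -spam "information retrieval" engines')
```

Because a phrase clause keeps its terms in order, `"information retrieval"` and `"retrieval information"` parse to different clauses, which is where word order starts to matter.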

Slide 6: Precision Enhancement
  Phrase induction
    All terms, the closer the better
  URL and title matching
  Site clustering
    Group URLs from same site
  Quality-based reranking
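Site clustering, as described on this slide, groups result URLs from the same site. A minimal sketch follows, assuming that "site" means the URL's host name and that one result per site is shown; both choices, and the name `cluster_by_site`, are illustrative assumptions.

```python
from urllib.parse import urlsplit

def cluster_by_site(ranked_urls, per_site=1):
    """Show at most per_site results per host; fold the rest under their host.

    ranked_urls is assumed to be ordered best first, and that order is kept.
    """
    shown = []
    counts = {}
    folded = {}
    for url in ranked_urls:
        host = urlsplit(url).hostname or ""
        if counts.get(host, 0) < per_site:
            shown.append(url)
            counts[host] = counts.get(host, 0) + 1
        else:
            folded.setdefault(host, []).append(url)
    return shown, folded

# Hypothetical ranked result list.
results = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/x",
]
shown, folded = cluster_by_site(results)
```

Here the second page from `a.example` is folded away, so the first screen of results covers two sites instead of one.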

Slide 7: Link Analysis
  Authors vote via links
    Pages with more inlinks are higher quality
  Not all links are equal
    Links from higher-quality sites are better
    Links in context are better
  Resistant to spam
    Only cross-site links considered
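The "only cross-site links considered" rule can be sketched as a simple inlink counter. Treating the host name as the site, and counting each linking site once per target, are assumptions added for the example; the slide does not say how the votes were tallied.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def cross_site_inlinks(links):
    """Count, per target page, the distinct sites linking to it.

    links is an iterable of (source_url, target_url) pairs. Links whose
    source and target share a host are ignored as same-site links.
    """
    voters = defaultdict(set)
    for source, target in links:
        source_host = urlsplit(source).hostname
        target_host = urlsplit(target).hostname
        if source_host != target_host:
            voters[target].add(source_host)
    return {page: len(hosts) for page, hosts in voters.items()}

# Hypothetical link graph.
links = [
    ("http://a.example/1", "http://b.example/"),
    ("http://a.example/2", "http://b.example/"),
    ("http://b.example/about", "http://b.example/"),
    ("http://c.example/", "http://b.example/"),
]
votes = cross_site_inlinks(links)
```

A site cannot raise its own score by linking to itself, and many pages on one site still count as a single vote under this assumed tally.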

Slide 8: PageRank (Page’98)
  Limiting distribution of a random walk
    Jump to a random page with probability $\epsilon$
    Follow a link with probability $1 - \epsilon$
  Probability of landing at a page D:
    $P(D) = \epsilon/T + (1 - \epsilon) \sum_{C \to D} P(C)/L(C)$
    Sum over pages C leading to D
    T = total number of pages
    L(C) = number of links on page C
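The random walk on this slide (jump to a random page with some small probability, otherwise follow a link from the current page) can be approximated by power iteration. The sketch below is an assumption-laden illustration: the jump probability of 0.15, the fixed iteration count, and the uniform treatment of pages without outlinks are choices made here, not taken from the slides.

```python
def pagerank(links, epsilon=0.15, iterations=50):
    """Approximate the limiting distribution of the random walk.

    links maps each page to the list of pages it links to. epsilon is the
    probability of jumping to a random page (0.15 is an assumed value).
    """
    pages = sorted(set(links) | {d for outs in links.values() for d in outs})
    total = len(pages)
    rank = {p: 1.0 / total for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share epsilon / T.
        new_rank = {p: epsilon / total for p in pages}
        for c in pages:
            outs = links.get(c, [])
            if outs:
                # Page C passes (1 - epsilon) * P(C) / L(C) along each link.
                share = (1 - epsilon) * rank[c] / len(outs)
                for d in outs:
                    new_rank[d] += share
            else:
                # Assumption: a page with no links spreads its rank uniformly.
                share = (1 - epsilon) * rank[c] / total
                for d in pages:
                    new_rank[d] += share
        rank = new_rank
    return rank

# Hypothetical three-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this toy graph page `c` collects rank from both `a` and `b`, so it ends up with the highest score, while `b`, which only receives half of `a`'s vote, ends up lowest.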

Slide 9: HITS (Kleinberg’98)
  Hubs: pages that point to many good pages
  Authorities: pages pointed to by many good pages
  Operates over a vicinity graph
    Pages relevant to a query
  Refined by the IBM Clever group
    Further contextualization
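The mutual definition of hubs and authorities can be sketched as an iterative update over the query's vicinity graph. The Euclidean normalisation and the fixed number of iterations below are assumptions for the example, and building the vicinity graph itself is left out.

```python
import math

def hits(links, iterations=20):
    """Compute hub and authority scores for a small link graph.

    links maps each page to the list of pages it points to.
    """
    pages = sorted(set(links) | {d for outs in links.values() for d in outs})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of pages pointing to the page.
        auth = {p: 0.0 for p in pages}
        for source, outs in links.items():
            for target in outs:
                auth[target] += hub[source]
        # Hub: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        # Normalise both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: three hub-like pages pointing at two candidate authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(graph)
```

Page `a1` is pointed to by all three hubs and becomes the top authority; `h1` and `h2` point to both authorities and outscore `h3` as hubs.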

Slide 10: Hyperlink Vector Voting (Li’97)
  Index documents by in-link anchor texts
    Follow links backward
  Can be both precision and recall enhancing
    The “evil empire”
  How to combine with standard ranking?
    Relative weight is a tuning issue
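Indexing a document by the anchor text of links that point to it can be sketched as an inverted index keyed on anchor terms. The tuple layout, the whitespace tokenisation, and the example URLs are assumptions for illustration.

```python
from collections import defaultdict

def build_anchor_index(links):
    """Map each anchor-text term to the set of pages the anchors point to.

    links is an iterable of (source_url, target_url, anchor_text) triples.
    The target is indexed under words that appear only on the linking pages.
    """
    index = defaultdict(set)
    for _source, target, anchor_text in links:
        for term in anchor_text.lower().split():
            index[term].add(target)
    return index

# Hypothetical links; the target page itself never needs to contain these words.
links = [
    ("http://a.example/", "http://big.example/", "evil empire"),
    ("http://c.example/", "http://big.example/", "Big Corp home"),
]
anchor_index = build_anchor_index(links)
```

A query for "evil" now retrieves `http://big.example/` even if that page never uses the word, which shows both the recall gain and the risk the slide alludes to.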

Slide 11: Evaluation
  No industry-standard benchmark
    Evaluations are qualitative
    Excessive claims abound
    Press is not discerning
  Shifting target
    Indices change daily
    Cross-engine comparison elusive

Slide 12: Complexity Analysis
  Search is both CPU and I/O intensive
    I/O to access postings
      Random access
    CPU to compute scores
  Caching strategies are very effective
    Term cache has 40% hit rate
  Expensive queries are long and loaded with rare terms
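A term cache of the kind mentioned here keeps recently used postings in memory so that repeated terms avoid a random-access read. The sketch below assumes a least-recently-used eviction policy and a caller-supplied `fetch` function standing in for the disk read; neither detail comes from the slides.

```python
from collections import OrderedDict

class TermCache:
    """LRU cache of postings lists that also tracks its own hit rate."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch          # hypothetical function: term -> postings
        self.entries = OrderedDict()
        self.hits = 0
        self.lookups = 0

    def postings(self, term):
        self.lookups += 1
        if term in self.entries:
            self.hits += 1
            self.entries.move_to_end(term)  # mark as most recently used
            return self.entries[term]
        value = self.fetch(term)            # cache miss: costs one I/O
        self.entries[term] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0

# Hypothetical query-term stream; the lambda stands in for reading postings.
cache = TermCache(capacity=2, fetch=lambda term: [len(term)])
for term in ["web", "search", "web", "engine", "search"]:
    cache.postings(term)
```

Frequent terms stay resident while rare terms pass through, which is consistent with the slide's remark that queries loaded with rare terms are the expensive ones.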

Slide 13: Performance versus Size
  [Plot: time versus index size]

Slide 14: Complexity Analysis
  CPU costs asymptotically constant
    Due to term truncation
  I/O cost can be kept to one I/O per term
    Again due to truncation
  Implies the bigger the better
    No advantage to distributed search

Slide 15: The Economics of Big Indices
  Very large indices require distributed search
    Easy scalability; maintenance
    Practical hardware limitations
  Implies Cost = Size * Throughput
    Since each half of a big index requires the same hardware to sustain the same throughput
  Worse: queries needing a big index are hard to monetize

Slide 16: How to Have your Cake...
  Layered search
    Small, high-quality engine for common queries
      Low cost per query; high revenue per query
    Large, low-throughput engine for rare queries
      High cost per query; low revenue per query
  Average query costs can be kept low
    While still offering comprehensiveness
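The claim that average query cost stays low under layering is a weighted average, sketched below. The traffic split (90% common, 10% rare) and the relative costs (1 versus 10 units per query) are hypothetical numbers chosen only to make the arithmetic concrete, and the model assumes each query is routed to exactly one layer.

```python
def average_cost(layers):
    """Average cost per query for a list of (query_fraction, cost_per_query)."""
    total = sum(fraction for fraction, _ in layers)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("query fractions must sum to 1")
    return sum(fraction * cost for fraction, cost in layers)

# Hypothetical: the small engine serves 90% of queries at 1 unit each,
# the large engine serves the remaining 10% at 10 units each.
layered = average_cost([(0.9, 1.0), (0.1, 10.0)])

# For comparison: sending every query to the large engine.
big_only = average_cost([(1.0, 10.0)])
```

Under these assumed numbers the layered design averages about 1.9 units per query against 10 for the large engine alone, while rare queries still reach the comprehensive index.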

Slide 18: Novel Search Engines
  Ask Jeeves
    Question answering
    Directory for the Hidden Web
  Direct Hit
    Direct popularity
    Click-stream mining

Slide 21: Summary
  Search engines are surprisingly effective
    Given short queries
  Precision-enhancing techniques are critical
  Centralized search is maximally efficient
    But one can achieve a big index through layering