@ Carnegie Mellon Databases User-Centric Web Crawling Sandeep Pandey & Christopher Olston Carnegie Mellon University.

Presentation transcript:

Slide 1: User-Centric Web Crawling
Sandeep Pandey & Christopher Olston, Carnegie Mellon University

Slide 2: Web Crawling
One important application (our focus): search
Topic-specific search engines + general-purpose ones
[Figure: crawler, WWW, repository, index, search queries, user]

Slide 3: Out-of-date Repository
Web is always changing [Arasu et al., TOIT’01]
– 23% of Web pages change daily
– 40% of commercial Web pages change daily
Many problems may arise due to an out-of-date repository
– Hurt both precision and recall

Slide 4: Web Crawling Optimization Problem
Not enough resources to (re)download every web document every day/hour
– Must pick and choose → optimization problem
Others: objective function = avg. freshness, age
Our goal: focus directly on impact on users
[Figure: crawler, WWW, repository, index, search queries, user]

Slide 5: Web Search User Interface
1. User enters keywords
2. Search engine returns ranked list of results
3. User visits subset of results
[Figure: ranked list of result documents]

Slide 6: Objective: Maximize Repository Quality (as perceived by users)
Suppose a user issues search query q:
$\mathrm{Quality}_q = \sum_{\text{documents } D} (\text{likelihood of viewing } D) \times (\text{relevance of } D \text{ to } q)$
Given a workload W of user queries:
$\text{Average quality} = \frac{1}{K} \sum_{\text{queries } q \in W} \left( \mathrm{freq}_q \times \mathrm{Quality}_q \right)$
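The two formulas on this slide can be sketched directly in code. This is only an illustration: the function names and the input layout are hypothetical, and K is assumed to be the total query frequency in the workload, since the slide does not define it.

```python
from typing import Dict, Tuple

# One workload entry: (freq_q, view likelihood per document, relevance per document).
QueryEntry = Tuple[float, Dict[str, float], Dict[str, float]]


def query_quality(view_likelihood: Dict[str, float],
                  relevance: Dict[str, float]) -> float:
    """Quality_q = sum over documents D of (likelihood of viewing D) * (relevance of D to q)."""
    return sum(p * relevance.get(doc, 0.0) for doc, p in view_likelihood.items())


def average_quality(workload: Dict[str, QueryEntry]) -> float:
    """Average quality = (1/K) * sum over q in W of freq_q * Quality_q.

    Assumption: K is the total query frequency, so the result is a weighted mean.
    """
    k = sum(freq for freq, _, _ in workload.values())
    if k == 0:
        return 0.0
    total = sum(freq * query_quality(view, rel) for freq, view, rel in workload.values())
    return total / k


workload = {
    "q1": (3, {"d1": 0.5, "d2": 0.25}, {"d1": 1.0, "d2": 0.5}),
    "q2": (1, {"d1": 0.5}, {"d1": 0.5}),
}
print(average_quality(workload))
```

Frequent queries dominate the average, which is the point of weighting by freq_q.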

Slide 7: Viewing Likelihood
Depends primarily on rank in list [Joachims KDD’02]
From AltaVista data [Lempel et al. WWW’03]:
$\mathrm{ViewProbability}(r) \propto r^{-1.5}$
[Figure: probability of viewing versus rank]
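A small sketch of the power-law viewing model. The slide only gives proportionality, so normalising over a fixed-length result list is an assumption made here for illustration; the function name is hypothetical.

```python
from typing import List


def view_probabilities(num_results: int, exponent: float = 1.5) -> List[float]:
    """Return ViewProbability(r) for r = 1..num_results, proportional to r ** -exponent.

    Assumption: probabilities are normalised to sum to 1 over the result list.
    """
    weights = [r ** -exponent for r in range(1, num_results + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = view_probabilities(10)
for rank, p in enumerate(probs, start=1):
    print(rank, round(p, 4))
```

Whatever the normalisation, the ratio between two ranks depends only on the exponent: rank 1 is viewed 4^1.5 = 8 times as often as rank 4.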

Slide 8: Relevance Scoring Function
Search engines’ internal notion of how well a document matches a query
Each D/Q pair → numerical score ∈ [0,1]
Combination of many factors, including:
– Vector-space similarity (e.g., TF.IDF cosine metric)
– Link-based factors (e.g., PageRank)
– Anchortext of referring pages
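As one concrete instance of a score in [0,1], here is a minimal TF.IDF cosine sketch. It covers only the vector-space factor named on the slide, not link-based factors or anchortext. The IDF variant, log(1 + N/df), and all function names are assumptions chosen for illustration, not the scoring function used in the talk.

```python
import math
from collections import Counter
from typing import Dict, List


def idf(term: str, doc_freq: Dict[str, int], num_docs: int) -> float:
    """Inverse document frequency; terms unseen in the corpus get weight 0."""
    df = doc_freq.get(term, 0)
    return math.log(1 + num_docs / df) if df else 0.0


def tfidf(tokens: List[str], doc_freq: Dict[str, int], num_docs: int) -> Dict[str, float]:
    """Map each term to (term frequency) * (inverse document frequency)."""
    return {t: c * idf(t, doc_freq, num_docs) for t, c in Counter(tokens).items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse, non-negative vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    # Clamp to guard against floating-point overshoot just above 1.
    return min(1.0, dot / norm) if norm else 0.0


def relevance(query: List[str], doc: List[str], doc_freq: Dict[str, int], num_docs: int) -> float:
    """Score a document/query pair in [0, 1]."""
    return cosine(tfidf(query, doc_freq, num_docs), tfidf(doc, doc_freq, num_docs))


doc_freq = {"cancer": 2, "seminar": 1, "symptoms": 1, "management": 1}
doc1 = ["seminar", "cancer", "symptoms"]
print(relevance(["cancer"], doc1, doc_freq, num_docs=2))
```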

Slide 9: (Caveat)
Using scoring function for absolute relevance
– Normally only used for relative ranking
– Need to craft scoring function carefully

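Slide 10: Measuring Quality
$\text{Avg. Quality} = \sum_q \Big( \mathrm{freq}_q \times \sum_D (\text{likelihood of viewing } D) \times (\text{relevance of } D \text{ to } q) \Big)$
Where each term comes from:
– freq_q: query logs
– likelihood of viewing D: ViewProb( Rank(D, q) ), with ViewProb taken from usage logs and Rank from the scoring function over the (possibly stale) repository
– relevance of D to q: scoring function over “live” copy of D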

Slide 11: Lessons from Quality Metric
$\text{Avg. Quality} = \sum_q \Big( \mathrm{freq}_q \times \sum_D \mathrm{ViewProb}(\mathrm{Rank}(D, q)) \times (\text{relevance of } D \text{ to } q) \Big)$
ViewProb(r) is monotonically nonincreasing
– Quality is maximized when the ranking function orders documents in descending order of relevance
Out-of-date repository: scrambles ranking → lowers quality
Let ΔQ_D = loss in quality due to inaccurate information about D
– Alternatively, improvement in quality if we (re)download D
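A brute-force sketch of ΔQ_D for a single query may make the definition concrete. Documents are ranked by their repository (possibly stale) scores, while relevance is read from the live scores. The function names, the unnormalised ViewProb(r) = r^-1.5, and the tie-breaking by document id are all assumptions for illustration. A later slide notes that exact recomputation of all ranks for all queries is expensive, and the talk estimates ΔQ_D approximately instead.

```python
from typing import Dict


def view_prob(rank: int) -> float:
    """Unnormalised viewing likelihood, proportional to rank ** -1.5."""
    return rank ** -1.5


def quality(repo_scores: Dict[str, float], live_scores: Dict[str, float]) -> float:
    """Sum over D of ViewProb(Rank(D, q)) * (relevance of D to q) for one query.

    Rank comes from the repository scores; relevance comes from the live scores.
    """
    ranking = sorted(repo_scores, key=lambda d: (-repo_scores[d], d))
    return sum(view_prob(i + 1) * live_scores[d] for i, d in enumerate(ranking))


def delta_q(doc: str, repo_scores: Dict[str, float],
            live_scores: Dict[str, float], freq: float = 1.0) -> float:
    """Improvement in quality if `doc` alone is redownloaded, weighted by query frequency."""
    refreshed = dict(repo_scores)
    refreshed[doc] = live_scores[doc]
    return freq * (quality(refreshed, live_scores) - quality(repo_scores, live_scores))


repo = {"A": 0.9, "B": 0.5, "C": 0.1}   # scores computed from stale copies
live = {"A": 0.9, "B": 0.5, "C": 0.95}  # C changed and is now highly relevant
print(delta_q("C", repo, live))  # positive: C moves from rank 3 to rank 1
print(delta_q("A", repo, live))  # zero: A's repository copy is already accurate
```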

Slide 12: ΔQ_D: Improvement in Quality
[Figure: REDOWNLOAD replaces the repository copy of D (stale) with the Web copy of D (fresh); Repository Quality += ΔQ_D]

Slide 13: Download Prioritization
Idea: Given ΔQ_D for each doc., prioritize (re)downloading accordingly
Q: How to measure ΔQ_D?
Two difficulties:
1. Live copy unavailable
2. Given both the “live” and repository copies of D, measuring ΔQ_D may require computing ranks of all documents for all queries
Approach: (1) Estimate ΔQ_D for past versions, (2) Forecast current ΔQ_D
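The estimate-then-forecast approach can be sketched as a simple prioritisation loop. The slides do not say how the forecast is computed, so using the mean of past ΔQ_D estimates is purely an assumption here, as are the function names and the fixed download budget.

```python
import heapq
from statistics import mean
from typing import Dict, List


def forecast_delta_q(past_estimates: List[float]) -> float:
    """Forecast the current ΔQ_D from estimates made for past versions.

    Assumption: a plain mean; documents with no history get a forecast of 0.
    """
    return mean(past_estimates) if past_estimates else 0.0


def pick_downloads(history: Dict[str, List[float]], budget: int) -> List[str]:
    """Return up to `budget` documents with the largest forecasted ΔQ_D, best first."""
    forecasts = {doc: forecast_delta_q(est) for doc, est in history.items()}
    return heapq.nlargest(budget, forecasts, key=forecasts.get)


history = {
    "a": [0.1, 0.3],
    "b": [0.0, 0.0],
    "c": [0.5, 0.7],
    "d": [],
}
print(pick_downloads(history, budget=2))
```

A document that changes often but never affects query results, like "b" above, stays at the bottom of the queue.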

Slide 14: Overhead of Estimating ΔQ_D
Estimate while updating inverted index

Slide 15: Forecast Future ΔQ_D
Data: 48 weekly snapshots of 15 web sites sampled from OpenDirectory topics
Queries: AltaVista query log
[Figure: avg. weekly ΔQ_D in the first 24 weeks versus the second 24 weeks; series: Top 50%, Top 80%, Top 90%]

Slide 16: Summary
Estimate ΔQ_D at index time
Forecast future ΔQ_D
Prioritize downloading according to forecasted ΔQ_D

Slide 17: Overall Effectiveness
Staleness = fraction of out-of-date documents* [Cho et al. 2000]
Embarrassment = probability that user visits irrelevant result* [Wolf et al. 2002]
* Used “shingling” to filter out “trivial” changes
Scoring function: PageRank (similar results for TF.IDF)
[Figure: quality (fraction of ideal) versus resource requirement for Min. Staleness, Min. Embarrassment, and User-Centric]
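For readers unfamiliar with shingling, a minimal sketch of how it can filter out trivial changes: two versions of a page are compared by the overlap of their word w-shingles. The shingle width, the threshold, and the function names are assumptions for illustration. The slides do not give the settings used in the experiments.

```python
from typing import Set, Tuple


def shingles(text: str, width: int = 4) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous word sequences of length `width`."""
    words = text.lower().split()
    if len(words) < width:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + width]) for i in range(len(words) - width + 1)}


def resemblance(old: str, new: str, width: int = 4) -> float:
    """Jaccard overlap of the two shingle sets, in [0, 1]."""
    a, b = shingles(old, width), shingles(new, width)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_trivial_change(old: str, new: str, threshold: float = 0.9, width: int = 4) -> bool:
    """Treat a change as trivial when the versions still resemble each other closely."""
    return resemblance(old, new, width) >= threshold


print(resemblance("a b c d e", "a b c d f", width=2))
```

This measure reacts to how much text changed, not to whether the change matters for any query, which is the contrast drawn on the next slides.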

Slide 18: Reasons for Improvement
Does not rely on size of text change to estimate importance
[Figure: example page (boston.com), tagged as important by shingling measure, although it did not match many queries in workload]

Slide 19: Reasons for Improvement
Accounts for “false negatives”
Does not always ignore frequently-updated pages
[Figure: example page (washingtonpost.com); user-centric crawling repeatedly re-downloads this page]

Slide 20: Related Work (1/2)
General-purpose Web crawling:
– Min. Staleness [Cho, Garcia-Molina, SIGMOD’00]: maximize average freshness or age for a fixed set of docs.
– Min. Embarrassment [Wolf et al., WWW’02]: maximize weighted avg. freshness for a fixed set of docs.; document weights determined by prob. of “embarrassment”
– [Edwards et al., WWW’01]: maximize average freshness for a growing set of docs.; how to balance new downloads vs. redownloading old docs.

Slide 21: Related Work (2/2)
Focused/topic-specific crawling
– [Chakrabarti, many others]
– Select subset of pages that match user interests
– Our work: given a set of pages, decide when to (re)download each based on predicted content shifts + user interests

Slide 22: Summary
Crawling: an optimization problem
Objective: maximize quality as perceived by users
Approach:
– Measure ΔQ_D using query workload and usage logs
– Prioritize downloading based on forecasted ΔQ_D
Various reasons for improvement
– Accounts for false positives and negatives
– Does not rely on size of text change to estimate importance
– Does not always ignore frequently updated pages

Slide 23: THE END
Paper available at:

Slide 24: Most Closely Related Work
[Wolf et al., WWW’02]:
– Maximize weighted avg. freshness for fixed set of docs.
– Document weights determined by prob. of “embarrassment”
User-Centric Crawling:
– Which queries are affected by a change, and by how much?
  Change A: significantly alters relevance to several common queries
  Change B: only affects relevance to infrequent queries, and not by much
– Metric penalizes false negatives
  Doc. ranked #1000 for a popular query should be ranked #2
  Small embarrassment but big loss in quality
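To see why this false negative is costly under the quality metric, one can plug the two ranks into the viewing-likelihood model from the Viewing Likelihood slide, ViewProbability(r) ∝ r^-1.5. The ratio below is a worked illustration, not a figure from the talk:

```latex
\[
\frac{\mathrm{ViewProbability}(2)}{\mathrm{ViewProbability}(1000)}
  = \left(\frac{1000}{2}\right)^{1.5}
  = 500^{1.5}
  \approx 1.1 \times 10^{4}
\]
```

At rank #1000 the document is viewed roughly 11,000 times less often than it would be at rank #2, so almost all of its relevance contribution to quality is lost, even though few users ever see it and are "embarrassed" by it.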

Slide 25: Inverted Index
Doc1: “Seminar: Cancer Symptoms”
Word      Posting list: DocID (freq)
Cancer    Doc7 (2), Doc9 (1), Doc1 (1)
Seminar   Doc5 (1), Doc1 (1), Doc6 (1)
Symptoms  Doc1 (1), Doc8 (2), Doc4 (3)

Slide 26: Updating Inverted Index
Stale Doc1: “Seminar: Cancer Symptoms”
Live Doc1: “Cancer management: how to detect breast cancer”
Posting list for Cancer: Doc7 (2), Doc9 (1), Doc1 (1); the Doc1 entry becomes Doc1 (2)
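The posting-list update in this example can be sketched with a toy in-memory index. The class, its methods, and the regex tokenizer (no stemming or stop words) are hypothetical and only meant to mirror the Doc1 example.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class InvertedIndex:
    """Toy inverted index: word -> {doc_id: term frequency}."""

    def __init__(self) -> None:
        self.postings: Dict[str, Dict[str, int]] = defaultdict(dict)
        self.doc_terms: Dict[str, Counter] = {}

    def update(self, doc_id: str, text: str) -> None:
        """Index a new document, or replace the stale copy of an existing one."""
        new_counts = Counter(tokenize(text))
        old_counts = self.doc_terms.get(doc_id, Counter())
        # Drop postings for words that no longer occur in the document.
        for word in old_counts.keys() - new_counts.keys():
            del self.postings[word][doc_id]
            if not self.postings[word]:
                del self.postings[word]
        # Insert or overwrite postings for words in the live copy.
        for word, freq in new_counts.items():
            self.postings[word][doc_id] = freq
        self.doc_terms[doc_id] = new_counts


index = InvertedIndex()
index.update("Doc1", "Seminar: Cancer Symptoms")
print(index.postings["cancer"])  # Doc1 appears once
index.update("Doc1", "Cancer management: how to detect breast cancer")
print(index.postings["cancer"])  # Doc1 now appears twice
```

Because the update touches exactly the postings whose scores change, it is a natural place to compare old and new scores for the document.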

Slide 27: Measure ΔQ_D While Updating Index
Compute previous and new scores of the downloaded document while updating postings
Maintain an approximate mapping between score and rank for each query term (20 bytes per mapping in our exps.)
Compute previous and new ranks (approximately) using the computed scores and score-to-rank mapping
Measure ΔQ_D using previous and new ranks (by applying an approximate function derived in the paper)
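One way to picture an approximate score-to-rank mapping is a handful of (score, rank) checkpoints per query term, searched with bisection. This is a hypothetical sketch: the compact 20-byte representation and the function that turns ranks into ΔQ_D are described in the paper and are not reproduced here. The class name and checkpoint scheme are assumptions.

```python
import bisect
from typing import List


class ScoreToRank:
    """Approximate score -> rank mapping from a few sampled checkpoints."""

    def __init__(self, scores: List[float], num_checkpoints: int = 5) -> None:
        ranked = sorted(scores, reverse=True)
        self._n = len(ranked)
        step = max(1, self._n // num_checkpoints)
        ranks = list(range(1, self._n + 1, step))
        if ranks and ranks[-1] != self._n:
            ranks.append(self._n)  # always keep the lowest-ranked score
        self._ranks = ranks
        # Negated scores are ascending, which is what bisect expects.
        self._neg_scores = [-ranked[r - 1] for r in ranks]

    def rank(self, score: float) -> int:
        """Estimate the rank a document with `score` would receive.

        Returns the rank of the first checkpoint whose score is <= `score`,
        so the estimate is never better than the true rank.
        """
        i = bisect.bisect_left(self._neg_scores, -score)
        if i == len(self._ranks):
            return self._n + 1  # below every known score
        return self._ranks[i]


mapping = ScoreToRank([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
previous_rank = mapping.rank(0.01)  # stale copy barely matched the query term
new_rank = mapping.rank(0.75)       # live copy matches it well
print(previous_rank, new_rank)
```

The previous and new ranks obtained this way are the inputs to the final step on the slide.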

Slide 28: Out-of-date Repository
[Figure: Web copy of D (fresh) versus repository copy of D (stale)]