@ Carnegie Mellon Databases User-Centric Web Crawling Sandeep Pandey & Christopher Olston Carnegie Mellon University
Web Crawling
One important application (our focus): search
Topic-specific search engines + general-purpose ones
[Figure: search engine components: WWW, crawler, repository, index, search queries, user]

Out-of-date Repository
Web is always changing [Arasu et al., TOIT’01]
– 23% of Web pages change daily
– 40% of commercial Web pages change daily
Many problems may arise due to an out-of-date repository
– Hurts both precision and recall

Web Crawling Optimization Problem
Not enough resources to (re)download every web document every day/hour
– Must pick and choose: an optimization problem
Others: objective function = avg. freshness, age
Our goal: focus directly on impact on users
[Figure: search engine components: WWW, crawler, repository, index, search queries, user]

Web Search User Interface
1. User enters keywords
2. Search engine returns ranked list of results
3. User visits subset of results
[Figure: ranked result list of documents]

Objective: Maximize Repository Quality (as perceived by users)
Suppose a user issues search query q:
$\text{Quality}_q = \sum_{\text{documents } D} (\text{likelihood of viewing } D) \times (\text{relevance of } D \text{ to } q)$
Given a workload W of user queries:
$\text{Average quality} = \frac{1}{K} \sum_{\text{queries } q \in W} (\text{freq}_q \times \text{Quality}_q)$

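The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names, the workload layout, and the choice K = total query frequency are assumptions (the slide does not define K), and the viewing-likelihood function is passed in by the caller.

```python
from typing import Callable, Dict, List, Tuple

# One workload entry: (query frequency, ranked list of doc ids, relevance per doc).
WorkloadEntry = Tuple[float, List[str], Dict[str, float]]


def query_quality(ranked_docs: List[str],
                  relevance: Dict[str, float],
                  view_prob: Callable[[int], float]) -> float:
    """Quality_q: sum over documents of view likelihood (by rank) times relevance."""
    return sum(view_prob(rank) * relevance[doc]
               for rank, doc in enumerate(ranked_docs, start=1))


def average_quality(workload: List[WorkloadEntry],
                    view_prob: Callable[[int], float]) -> float:
    """Frequency-weighted average of Quality_q over the workload.

    Assumption: K is taken to be the total query frequency.
    """
    k = sum(freq for freq, _, _ in workload)
    total = sum(freq * query_quality(ranked, rel, view_prob)
                for freq, ranked, rel in workload)
    return total / k
```

For example, with a toy `view_prob = lambda r: 1.0 / r`, a query that ranks a fully relevant document first scores higher than one that ranks it second, which is the behavior the objective is meant to reward.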
Viewing Likelihood
Depends primarily on rank in list [Joachims KDD’02]
From AltaVista data [Lempel et al. WWW’03]: $\text{ViewProbability}(r) \propto r^{-1.5}$
[Figure: probability of viewing vs. rank]

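A minimal sketch of the power-law viewing model, assuming an unnormalized form in which rank 1 has value `scale`. The `scale` parameter is a hypothetical stand-in for the proportionality constant, which the slide does not give.

```python
def view_probability(rank: int, exponent: float = 1.5, scale: float = 1.0) -> float:
    """ViewProbability(r) proportional to r^(-exponent); ranks start at 1."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return scale * rank ** -exponent
```

Under this model the fourth result is viewed one eighth as often as the first, so errors near the top of the ranking dominate the quality metric.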
Relevance Scoring Function
Search engines’ internal notion of how well a document matches a query
Each D/Q pair is mapped to a numerical score in [0,1]
Combination of many factors, including:
– Vector-space similarity (e.g., TF.IDF cosine metric)
– Link-based factors (e.g., PageRank)
– Anchortext of referring pages

(Caveat)
Using scoring function for absolute relevance
– Normally only used for relative ranking
– Need to craft scoring function carefully

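Measuring Quality
$\text{Avg. Quality} = \sum_q \left( \text{freq}_q \times \sum_D \text{ViewProb}(\text{Rank}(D, q)) \times (\text{relevance of } D \text{ to } q) \right)$
where (likelihood of viewing D) = ViewProb( Rank(D, q) ), and each term is obtained from:
– freq_q: query logs
– ViewProb: usage logs
– Rank(D, q): scoring function over (possibly stale) repository
– relevance of D to q: scoring function over “live” copy of D
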
Lessons from Quality Metric
$\text{Avg. Quality} = \sum_q \left( \text{freq}_q \times \sum_D \text{ViewProb}(\text{Rank}(D, q)) \times (\text{relevance of } D \text{ to } q) \right)$
ViewProb(r) is monotonically nonincreasing
Quality is maximized when the ranking function orders documents in descending order of relevance
Out-of-date repository: scrambles ranking, which lowers quality
Let $\Delta Q_D$ = loss in quality due to inaccurate information about D
– Alternatively, improvement in quality if we (re)download D

$\Delta Q_D$: Improvement in Quality
[Figure: REDOWNLOAD replaces the repository copy of D (stale) with the Web copy of D (fresh)]
Repository Quality += $\Delta Q_D$

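To make the definition concrete, here is a brute-force sketch of ΔQ_D for a single query: rank by repository scores, measure quality against live relevance, then refresh D's score and measure again. The function names and the r^(-1.5) viewing model with constant 1 are assumptions for illustration; the full metric would sum this difference over all queries, weighted by query frequency.

```python
from typing import Dict


def view_prob(rank: int) -> float:
    # Assumed viewing model: proportional to rank^(-1.5), constant set to 1.
    return rank ** -1.5


def quality(repo_scores: Dict[str, float], live_scores: Dict[str, float]) -> float:
    """Rank by (possibly stale) repository scores; credit live relevance."""
    ranking = sorted(repo_scores, key=lambda d: (-repo_scores[d], d))
    return sum(view_prob(r) * live_scores[d]
               for r, d in enumerate(ranking, start=1))


def delta_q(doc: str, repo_scores: Dict[str, float],
            live_scores: Dict[str, float]) -> float:
    """Quality gained for this query if only `doc` is redownloaded."""
    refreshed = dict(repo_scores)
    refreshed[doc] = live_scores[doc]
    return quality(refreshed, live_scores) - quality(repo_scores, live_scores)
```

A document whose stale score still matches its live score yields ΔQ_D = 0 for that query, so redownloading it would not change what users see.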
Download Prioritization
Idea: Given $\Delta Q_D$ for each doc., prioritize (re)downloading accordingly
Q: How to measure $\Delta Q_D$?
Two difficulties:
1. Live copy unavailable
2. Given both the “live” and repository copies of D, measuring $\Delta Q_D$ may require computing ranks of all documents for all queries
Approach: (1) Estimate $\Delta Q_D$ for past versions, (2) Forecast current $\Delta Q_D$

Overhead of Estimating $\Delta Q_D$
Estimate while updating inverted index

Forecast Future $\Delta Q_D$
Data: 48 weekly snapshots of 15 web sites sampled from OpenDirectory topics
Queries: AltaVista query log
[Figure: avg. weekly $\Delta Q_D$, first 24 weeks vs. second 24 weeks; series: Top 50%, Top 80%, Top 90%]

Summary
Estimate $\Delta Q_D$ at index time
Forecast future $\Delta Q_D$
Prioritize downloading according to forecasted $\Delta Q_D$

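The estimate, forecast, prioritize loop can be sketched as follows. The forecasting rule here (mean of past ΔQ_D estimates) is an assumption chosen for simplicity, not necessarily the forecaster used in the paper, and `history` and `budget` are hypothetical names.

```python
import heapq
from statistics import mean
from typing import Dict, List


def forecast_delta_q(past_estimates: List[float]) -> float:
    """Assumed forecaster: average of past ΔQ_D estimates (0.0 if none yet)."""
    return mean(past_estimates) if past_estimates else 0.0


def pick_downloads(history: Dict[str, List[float]], budget: int) -> List[str]:
    """Return up to `budget` doc ids with the highest forecasted ΔQ_D."""
    forecasts = {doc: forecast_delta_q(est) for doc, est in history.items()}
    return heapq.nlargest(budget, forecasts, key=forecasts.get)
```

With `history = {"a": [0.1, 0.3], "b": [0.0, 0.0], "c": [0.5, 0.7]}` and a budget of two downloads, this selects `c` and then `a`; document `b`, whose past changes never affected quality, is skipped.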
Overall Effectiveness
Staleness = fraction of out-of-date documents* [Cho et al. 2000]
Embarrassment = probability that user visits irrelevant result* [Wolf et al. 2002]
* Used “shingling” to filter out “trivial” changes
Scoring function: PageRank (similar results for TF.IDF)
[Figure: quality (fraction of ideal) vs. resource requirement; series: Min. Staleness, Min. Embarrassment, User-Centric]

Reasons for Improvement
Does not rely on size of text change to estimate importance
[Figure (boston.com): tagged as important by shingling measure, although did not match many queries in workload]

Reasons for Improvement
Accounts for “false negatives”
Does not always ignore frequently-updated pages
[Figure (washingtonpost.com): user-centric crawling repeatedly re-downloads this page]

Related Work (1/2)
General-purpose Web crawling:
– Min. Staleness [Cho, Garcia-Molina, SIGMOD’00]: optimize average freshness or age for fixed set of docs.
– Min. Embarrassment [Wolf et al., WWW’02]: maximize weighted avg. freshness for fixed set of docs.; document weights determined by prob. of “embarrassment”
– [Edwards et al., WWW’01]: maximize average freshness for a growing set of docs.; how to balance new downloads vs. redownloading old docs.

Related Work (2/2)
Focused/topic-specific crawling
– [Chakrabarti, many others]
– Select subset of pages that match user interests
– Our work: given a set of pages, decide when to (re)download each based on predicted content shifts + user interests

Summary
Crawling: an optimization problem
Objective: maximize quality as perceived by users
Approach:
– Measure $\Delta Q_D$ using query workload and usage logs
– Prioritize downloading based on forecasted $\Delta Q_D$
Various reasons for improvement
– Accounts for false positives and negatives
– Does not rely on size of text change to estimate importance
– Does not always ignore frequently updated pages

THE END
Paper available at:

Most Closely Related Work
[Wolf et al., WWW’02]:
– Maximize weighted avg. freshness for fixed set of docs.
– Document weights determined by prob. of “embarrassment”
User-Centric Crawling:
– Which queries are affected by a change, and by how much?
  Change A: significantly alters relevance to several common queries
  Change B: only affects relevance to infrequent queries, and not by much
– Metric penalizes false negatives
  Doc. ranked #1000 for a popular query should be ranked #2
– Small embarrassment but big loss in quality

Inverted Index
Doc1: “Seminar: Cancer Symptoms”
Word       Posting list: DocID (freq)
Cancer     Doc7 (2), Doc9 (1), Doc1 (1)
Seminar    Doc5 (1), Doc1 (1), Doc6 (1)
Symptoms   Doc1 (1), Doc8 (2), Doc4 (3)

Updating Inverted Index
Stale Doc1: “Seminar: Cancer Symptoms”
Live Doc1: “Cancer management: how to detect breast cancer”
Posting list for Cancer: Doc7 (2), Doc9 (1), Doc1 (1), where the Doc1 entry is updated from Doc1 (1) to Doc1 (2)

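The update shown on this slide can be sketched with a dictionary-based index. This is an illustrative structure, not the paper's implementation: the index layout (term to `{doc_id: freq}`), the lowercase tokenizer, and the function names are all assumptions.

```python
import re
from collections import Counter
from typing import Dict

Index = Dict[str, Dict[str, int]]  # term -> {doc_id: term frequency}


def term_freqs(text: str) -> Counter:
    """Lowercase word counts; a deliberately simple tokenizer."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def update_index(index: Index, doc_id: str, old_text: str, new_text: str) -> None:
    """Replace doc_id's postings for the stale text with those for the live text."""
    old, new = term_freqs(old_text), term_freqs(new_text)
    for term in old.keys() | new.keys():
        postings = index.setdefault(term, {})
        if new[term]:
            postings[doc_id] = new[term]   # add or overwrite the posting
        else:
            postings.pop(doc_id, None)     # term no longer occurs in the doc
            if not postings:
                del index[term]
```

Applied to the slide's example, the Doc1 posting under "cancer" changes from frequency 1 to 2, and Doc1 disappears from the "seminar" and "symptoms" lists. Because every affected posting is touched here, this is also the natural point to compute the document's previous and new scores.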
Measure $\Delta Q_D$ While Updating Index
1. Compute previous and new scores of the downloaded document while updating postings
2. Maintain an approximate mapping between score and rank for each query term (20 bytes per mapping in our exps.)
3. Compute previous and new ranks (approximately) using the computed scores and score-to-rank mapping
4. Measure $\Delta Q_D$ using previous and new ranks (by applying an approximate function derived in the paper)

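Steps 2 and 3 can be illustrated with a small step-function mapping built from a few sampled (score, rank) points per query term. This is a hypothetical structure for illustration only: the slides do not describe the actual mapping or the approximate ΔQ_D function, and the class and function names below are assumptions.

```python
from bisect import bisect_right
from typing import List, Tuple


class ScoreToRank:
    """Approximate score -> rank lookup for one query term.

    Built from sampled (score, rank) points; a higher score means a
    better (smaller) rank. Lookups round down to the nearest sampled
    score, which gives a conservative (worse) rank estimate.
    """

    def __init__(self, samples: List[Tuple[float, int]]):
        self.points = sorted(samples)              # ascending by score
        self.scores = [s for s, _ in self.points]

    def rank(self, score: float) -> int:
        i = bisect_right(self.scores, score) - 1   # last sample with score <= query
        if i < 0:
            return self.points[0][1]               # below all samples: worst known rank
        return self.points[i][1]


def rank_change(mapping: ScoreToRank, old_score: float,
                new_score: float) -> Tuple[int, int]:
    """Approximate previous and new ranks of a redownloaded document."""
    return mapping.rank(old_score), mapping.rank(new_score)
```

With samples `[(0.9, 1), (0.5, 10), (0.1, 100)]`, a document whose score rises from 0.05 to 0.7 is estimated to move from rank 100 to rank 10, without re-ranking any other document.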
Out-of-date Repository
[Figure: Web copy of D (fresh) vs. repository copy of D (stale)]