User-Centric Web Crawling*
Christopher Olston (CMU & Yahoo! Research**)
* Joint work with Sandeep Pandey
** Work done at Carnegie Mellon
Distributed Sources of Dynamic Information
[Diagram: sources A, B, and C (sensors, web sites) feed a central monitoring node under resource constraints]
– Support integrated querying
– Maintain historical archive
Workload-driven Approach
Goal: meet usage needs while adhering to resource constraints
Tactic: pay attention to the workload (workload = usage + data dynamics)
Current focus: autonomous sources
– Data archival from Web sources [VLDB’04]
– Supporting Web search [WWW’05] (this talk)
Thesis work: cooperative sources [VLDB’00, SIGMOD’01, SIGMOD’02, SIGMOD’03a, SIGMOD’03b]
Outline
Introduction: monitoring distributed sources
User-centric web crawling
– Model + approach
– Empirical results
– Related & future work
Web Crawling to Support Search
[Diagram: a crawler fetches pages from web sites A, B, and C under a resource constraint into the search engine’s repository and index, which serve users’ search queries]
Q: Given a full repository, when to refresh each page?
Approach
Faced with an optimization problem.
Others:
– Maximize freshness, age, or similar
– Boolean model of document change
Our approach:
– User-centric optimization objective
– Rich notion of document change, attuned to the user-centric objective
Web Search User Interface
1. User enters keywords
2. Search engine returns a ranked list of results
3. User visits a subset of the results
[Illustration: ranked result list pointing to documents]
Objective: Maximize Repository Quality, from the Search Perspective
Suppose a user issues search query q:
Quality_q = Σ_{documents d} (likelihood of viewing d) × (relevance of d to q)
Given a workload W of user queries:
Average quality = (1/K) × Σ_{queries q ∈ W} ( freq_q × Quality_q )
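To make the metric concrete, here is a minimal Python sketch of Quality_q and the workload average, under the assumption that view probabilities, relevance scores, and query frequencies are available as simple callables and dictionaries; all names are illustrative, not taken from the talk:

```python
# Sketch of the repository-quality metric (illustration only).

def quality_of_query(ranked_docs, view_prob, relevance, q):
    """Quality_q = sum over documents d of P(view d) * relevance(d, q).

    ranked_docs : doc ids in the order the (possibly stale) index ranks them
    view_prob   : function rank -> probability the user views that rank (1-based)
    relevance   : function (doc, query) -> relevance of the *live* copy of doc to q
    """
    return sum(view_prob(rank) * relevance(d, q)
               for rank, d in enumerate(ranked_docs, start=1))

def average_quality(workload, ranked_for, view_prob, relevance):
    """Average quality = (1/K) * sum over queries q of freq_q * Quality_q.

    workload   : dict query -> freq_q
    ranked_for : function query -> ranked list of doc ids
    K is taken here as the total query frequency (an assumption; the slide
    does not define K explicitly).
    """
    K = sum(workload.values())
    return sum(freq * quality_of_query(ranked_for(q), view_prob, relevance, q)
               for q, freq in workload.items()) / K
```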
Viewing Likelihood
[Plot: probability of viewing vs. rank]
Depends primarily on rank in the result list [Joachims KDD’02]
From AltaVista data [Lempel et al. WWW’03]: ViewProbability(r) ∝ r^–1.5
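The cited power-law fit can be captured in a tiny helper; returning it unnormalized is my own simplification, since the slide only gives the proportionality:

```python
def view_probability(rank, exponent=1.5):
    """ViewProbability(r) proportional to r^-1.5, per the AltaVista-derived fit
    cited on the slide. Unnormalized; scale as needed (an assumption)."""
    return rank ** (-exponent)
```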
Relevance Scoring Function
Search engines’ internal notion of how well a document matches a query
Each document/query pair is assigned a numerical score in [0,1]
Combination of many factors, e.g.:
– Vector-space similarity (e.g., TF.IDF cosine metric)
– Link-based factors (e.g., PageRank)
– Anchortext of referring pages
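As an illustration of such a scoring function, a hedged sketch that mixes a TF.IDF cosine score with a link-based factor; the linear combination, its weights, and the assumption that PageRank has been normalized to [0,1] are all hypothetical, not the engine's actual formula:

```python
import math

def tfidf_cosine(query_terms, doc_terms, df, n_docs):
    """TF.IDF cosine similarity between a query and a document, both given as
    term -> frequency dicts; df maps term -> document frequency."""
    def weight(tf, term):
        return tf * math.log(n_docs / (1 + df.get(term, 0)))
    dot = sum(weight(qtf, t) * weight(doc_terms.get(t, 0), t)
              for t, qtf in query_terms.items())
    q_norm = math.sqrt(sum(weight(tf, t) ** 2 for t, tf in query_terms.items()))
    d_norm = math.sqrt(sum(weight(tf, t) ** 2 for t, tf in doc_terms.items()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

def relevance(query_terms, doc_terms, pagerank, df, n_docs,
              w_text=0.7, w_link=0.3):
    """Hypothetical combination of a text score and a link-based score.
    Weights and the linear mix are illustrative assumptions only."""
    return w_text * tfidf_cosine(query_terms, doc_terms, df, n_docs) + w_link * pagerank
```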
(Caveat)
Using the scoring function for absolute relevance (normally it is only used for relative ranking)
– Need to ensure the scoring function has meaning on an absolute scale
  – Probabilistic IR models, PageRank: okay
  – Unclear whether TF-IDF does (still debated, I believe)
Bottom line: a stricter interpretability requirement than “good relative ordering”
Measuring Quality
Avg. Quality = Σ_q ( freq_q × Σ_d ( (likelihood of viewing d) × (relevance of d to q) ) )
Each ingredient comes from a different source:
– freq_q: query logs
– likelihood of viewing d: ViewProb( Rank(d, q) ), with view probabilities from usage logs and Rank computed by the scoring function over the (possibly stale) repository
– relevance of d to q: the scoring function over the “live” copy of d
Lessons from Quality Metric
Avg. Quality = Σ_q ( freq_q × Σ_d ( ViewProb( Rank(d, q) ) × Relevance(d, q) ) )
– ViewProb(r) is monotonically nonincreasing
– Quality is maximized when the ranking function orders documents in descending order of true relevance
– An out-of-date repository scrambles the ranking and lowers quality
Let ΔQ_D = the loss in quality due to inaccurate information about D
(alternatively, the improvement in quality if we (re)download D)
ΔQ_D: Improvement in Quality
[Diagram: re-downloading replaces the stale repository copy of D with the fresh web copy]
Repository Quality += ΔQ_D
Formula for Quality Gain (ΔQ_D)
Re-download document D at time t.
Quality beforehand:  Q(t–) = Σ_q ( freq_q × Σ_d ( ViewProb( Rank_t–(d, q) ) × Relevance(d, q) ) )
Quality after re-download:  Q(t) = Σ_q ( freq_q × Σ_d ( ViewProb( Rank_t(d, q) ) × Relevance(d, q) ) )
Quality gain:  ΔQ_D(t) = Q(t) – Q(t–) = Σ_q ( freq_q × Σ_d ( ΔVP × Relevance(d, q) ) )
  where ΔVP = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_t–(d, q) )
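The quality-gain formula translates almost directly into code. A minimal sketch, assuming the rankings just before and after the re-download are available and that documents with unchanged rank are skipped; function and parameter names are assumptions:

```python
def delta_q_for_download(workload, affected_docs, rank_before, rank_after,
                         view_prob, relevance):
    """Quality gain from re-downloading a document at time t:
    dQ_D(t) = sum_q freq_q * sum_d (VP_t(d,q) - VP_t-(d,q)) * Rel(d,q)

    workload      : dict query -> freq_q
    affected_docs : function query -> documents whose rank changed
                    (unchanged documents contribute zero)
    rank_before, rank_after : functions (doc, query) -> 1-based rank
    view_prob     : function rank -> viewing probability
    relevance     : function (doc, query) -> relevance of the live copy
    """
    gain = 0.0
    for q, freq in workload.items():
        for d in affected_docs(q):
            delta_vp = view_prob(rank_after(d, q)) - view_prob(rank_before(d, q))
            gain += freq * delta_vp * relevance(d, q)
    return gain
```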
Download Prioritization
Idea: given ΔQ_D for each document, prioritize (re)downloading accordingly
Three difficulties:
1. ΔQ_D depends on the order of downloading
2. Given both the “live” and repository copies of D, measuring ΔQ_D is computationally expensive
3. The live copy is usually unavailable
Difficulty 1: Order of Downloading Matters
– ΔQ_D depends on the relative rank positions of D; hence, ΔQ_D depends on the order of downloading
– To reduce implementation complexity, avoid tracking inter-document ordering dependencies
– Assume ΔQ_D is independent of the downloading of other documents
ΔQ_D(t) = Σ_q ( freq_q × Σ_d ( ΔVP × Relevance(d, q) ) )
  where ΔVP = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_t–(d, q) )
Difficulty 3: Live Copy Unavailable
– Take measurements upon re-downloading D (the live copy is available at that time)
– Use forecasting techniques to project forward in time
[Timeline: past re-downloads yield measurements ΔQ_D(t1), ΔQ_D(t2), …; forecast ΔQ_D(t_now)]
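The slides do not name a specific forecasting technique; one simple possibility is an exponentially weighted moving average over the past measured ΔQ_D values, sketched below (the EWMA choice and its smoothing factor are my assumptions):

```python
def forecast_delta_q(past_measurements, alpha=0.5):
    """Forecast the next delta-Q_D from past per-download measurements using an
    exponentially weighted moving average. The EWMA and alpha=0.5 are assumptions
    for illustration; the slide only says 'forecasting techniques'."""
    if not past_measurements:
        return 0.0
    estimate = past_measurements[0]
    for x in past_measurements[1:]:
        estimate = alpha * x + (1 - alpha) * estimate
    return estimate
```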
Ability to Forecast ΔQ_D
[Plot: avg. weekly ΔQ_D (log scale), first 24 weeks vs. second 24 weeks, with the Top 50% / Top 80% / Top 90% of documents marked]
– Data: 15 web sites sampled from OpenDirectory topics
– Queries: AltaVista query log
– Documents downloaded once per week, in random order
Strategy So Far
– Measure the shift in quality (ΔQ_D) each time document D is re-downloaded
– Forecast future ΔQ_D, treating each D independently
– Prioritize re-downloading by ΔQ_D (see the scheduling sketch below)
Remaining difficulty:
2. Given both the “live” and repository copies of D, measuring ΔQ_D is computationally expensive
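Putting the pieces together, re-download scheduling reduces to picking the documents with the highest forecast ΔQ_D under a resource constraint. A minimal sketch; the per-cycle budget formulation and all names are assumptions:

```python
import heapq

def schedule_downloads(forecast, candidates, budget):
    """Greedily choose which documents to (re)download this crawl cycle.

    forecast   : dict doc -> forecast delta-Q_D
    candidates : iterable of docs eligible for re-download
    budget     : number of downloads allowed this cycle (resource constraint)
    Returns the `budget` documents with the highest forecast quality gain.
    """
    return heapq.nlargest(budget, candidates, key=lambda d: forecast.get(d, 0.0))
```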
Difficulty 2: Metric Expensive to Compute
One problem: measurements of other documents are required.
Example: the “live” copy of D becomes less relevant to query q than before
– Now D is ranked too high
– Some users visit D in lieu of Y, which is more relevant
– Result: less-than-ideal quality
Results for q:  Actual ranking: 1. X, 2. D, 3. Y, 4. Z   Ideal ranking: 1. X, 2. Y, 3. Z, 4. D
Upon re-downloading D, measuring the quality gain requires knowing the relevance of Y and Z
Solution: estimate! Use approximate relevance-to-rank mapping functions, fit in advance for each query
Estimation Procedure (DETAIL)
Focus on a single query q (later we’ll see how to sum across all affected queries)
Let F_q(rel) be the relevance-to-rank mapping for q:
– We use a piecewise linear function in log-log space
– Let r1 = D’s old rank (r1 = F_q(Rel(D_old, q))) and r2 = D’s new rank
– Use an integral approximation of the summation
ΔQ_D,q = Σ_d ( ΔVP(d,q) × Rel(d,q) )
       = ΔVP(D,q) × Rel(D,q) + Σ_{d≠D} ( ΔVP(d,q) × Rel(d,q) )
       ≈ ΔVP(D,q) × Rel(D,q) + Σ_{r=r1+1…r2} ( VP(r–1) – VP(r) ) × F_q^–1(r)
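A sketch of the per-query estimate in Python, assuming F_q and its inverse are supplied (e.g., fit offline as a piecewise linear function in log-log space, as the slide describes). Evaluating the rank summation directly rather than via the integral approximation is a simplification of mine:

```python
def estimate_delta_q_for_query(rel_new, rel_old, F_q, F_q_inv, view_prob):
    """Estimate delta-Q_D,q from the old and new relevance of D to q alone.

    F_q      : approximate relevance -> rank mapping for query q
    F_q_inv  : its inverse, rank -> relevance
    view_prob: function rank -> viewing probability
    Implements: [VP(r2) - VP(r1)] * Rel(D,q)
                + sum over ranks between r1 and r2 of (VP(r-1) - VP(r)) * F_q_inv(r)
    where r1 = F_q(rel_old) is D's old rank and r2 = F_q(rel_new) its new rank.
    """
    r1 = max(1, round(F_q(rel_old)))   # D's old rank
    r2 = max(1, round(F_q(rel_new)))   # D's new rank
    own_term = (view_prob(r2) - view_prob(r1)) * rel_new

    # Documents lying between the two rank positions each shift by one slot;
    # the sign flips depending on whether D moved down or up the ranking.
    lo, hi, sign = (r1, r2, 1.0) if r2 >= r1 else (r2, r1, -1.0)
    others_term = sign * sum((view_prob(r - 1) - view_prob(r)) * F_q_inv(r)
                             for r in range(lo + 1, hi + 1))
    return own_term + others_term
```

This plays the role of the function g(Rel(D,q), Rel(D_old,q)) referred to on the next slide, specialized to a single query.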
Where we stand … (DETAIL)
ΔQ_D,q = ΔVP(D,q) × Rel(D,q) + Σ_{d≠D} ( ΔVP(d,q) × Rel(d,q) )
– first term: ΔVP(D,q) ≈ VP( F_q(Rel(D, q)) ) – VP( F_q(Rel(D_old, q)) )
– second term: ≈ f(Rel(D,q), Rel(D_old,q))
Together: ΔQ_D,q ≈ g(Rel(D,q), Rel(D_old,q))
Context: ΔQ_D = Σ_q ( freq_q × ΔQ_D,q )
Difficulty 2, continued
Additional problem: must measure the effect of the shift in rank across all queries.
Solution: couple measurements with index updating operations
Sketch:
– Basic index unit: the posting. Conceptually: [ term ID | document ID | scoring factors ]
– Each time a posting is inserted/deleted/updated, compute the old & new relevance contribution from the term/document pair*
– Transform using the estimation procedure, and accumulate across the postings touched to get ΔQ_D
* assumes the scoring function treats term/document pairs independently
Background: Text Indexes (DETAIL)
Example dictionary and postings:
  Dictionary (term, # docs, total freq): aid 1 1 | all 2 2 | cold 1 1 | duck 1 2
  Postings (doc #, freq): 58 1 | 37 1 | 62 1 | 15 1 | 41 2
Basic index unit: the posting
– One posting for each term/document pair
– Contains the information needed by the scoring function (number of occurrences, font size, etc.)
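For concreteness, a toy version of this dictionary-plus-postings structure (standard inverted-index bookkeeping, not code from the talk):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Posting:
    doc_id: int
    freq: int          # plus any other per-pair scoring factors (font size, etc.)

def build_index(docs):
    """Build a toy dictionary + postings structure like the slide's example.
    docs: dict doc_id -> list of terms."""
    postings = defaultdict(list)          # term -> [Posting, ...]
    for doc_id, terms in docs.items():
        counts = defaultdict(int)
        for t in terms:
            counts[t] += 1
        for term, freq in counts.items():
            postings[term].append(Posting(doc_id, freq))
    dictionary = {term: (len(plist), sum(p.freq for p in plist))
                  for term, plist in postings.items()}   # term -> (# docs, total freq)
    return dictionary, postings
```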
Pre-Processing: Approximate the Workload (DETAIL)
Break multi-term queries into sets of single-term queries
– Now, term = query
– The index has one posting for each query/document pair
(Same dictionary/postings example as above, with each dictionary term now playing the role of a query)
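A small sketch of that pre-processing step; how query-frequency mass is apportioned to terms is an assumption here (each term simply inherits the full query frequency):

```python
from collections import defaultdict

def approximate_workload(query_log):
    """Approximate the query workload by splitting each multi-term query into
    single-term queries, so that term = query and each posting corresponds to a
    query/document pair. query_log: dict query string -> frequency."""
    term_freq = defaultdict(int)
    for query, freq in query_log.items():
        for term in query.split():
            term_freq[term] += freq
    return dict(term_freq)
```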
Taking Measurements During Index Maintenance (DETAIL)
While updating the index:
– Initialize a bank of ΔQ_D accumulators, one per document (actually materialized on demand using a hash table)
– Each time a posting is inserted/deleted/updated:
  – Compute the new & old relevance contributions for the query/document pair: Rel(D,q), Rel(D_old,q)
  – Compute ΔQ_D,q using the estimation procedure and add it to the accumulator: ΔQ_D += freq_q × g(Rel(D,q), Rel(D_old,q))
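A sketch of the accumulator bookkeeping, assuming a per-query estimator g (such as the estimation function sketched earlier, wrapped per term) is supplied; the class and method names are illustrative, not from the Lucene implementation:

```python
from collections import defaultdict

class DeltaQAccumulator:
    """Accumulate delta-Q_D while postings are updated, following the slide's
    outline. Accumulators are materialized on demand in a hash table. The
    estimator g is a callable (query_term, rel_new, rel_old) -> delta-Q_D,q;
    its exact form is an assumption of this sketch."""

    def __init__(self, term_freq, g):
        self.term_freq = term_freq            # freq_q for each single-term query
        self.g = g                            # per-query quality-gain estimator
        self.delta_q = defaultdict(float)     # doc_id -> accumulated delta-Q_D

    def on_posting_update(self, term, doc_id, rel_new, rel_old):
        """Call whenever the posting for (term, doc_id) is inserted, deleted, or
        updated, with the new and old relevance contributions of that pair."""
        freq = self.term_freq.get(term, 0)
        self.delta_q[doc_id] += freq * self.g(term, rel_new, rel_old)
```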
Measurement Overhead
Implemented in Lucene
Caveat: does not handle factors that do not depend on a single term/document pair, e.g. term proximity and anchortext inclusion
Summary of Approach
– User-centric metric of search repository quality
– (Re)downloading a document improves quality
– Prioritize downloading by expected quality gain
– Metric adaptations to enable a feasible and efficient implementation
Next: Empirical Results
Introduction: monitoring distributed sources
User-centric web crawling
– Model + approach
– Empirical results
– Related & future work
Overall Effectiveness
Baselines:
– Min. Staleness: staleness = fraction of out-of-date documents* [Cho et al. 2000]
– Min. Embarrassment: embarrassment = probability that a user visits an irrelevant result* [Wolf et al. 2002]
* Used “shingling” to filter out “trivial” changes
Scoring function: PageRank (similar results for TF.IDF)
[Plot: quality (fraction of ideal) vs. resource requirement for the Min. Staleness, Min. Embarrassment, and User-Centric policies]
Reasons for Improvement
Example (boston.com): tagged as important by the staleness- and embarrassment-based techniques, although it did not match many queries in the workload
– User-centric crawling does not rely on the size of the text change to estimate importance
Reasons for Improvement
Example (washingtonpost.com): user-centric crawling repeatedly re-downloads this page
– Accounts for “false negatives”
– Does not always ignore frequently-updated pages
Related Work (1/2)
General-purpose web crawling [Cho & Garcia-Molina, SIGMOD’00], [Edwards et al., WWW’01]
– Maximize average freshness or age
– Balance new downloads vs. re-downloading old documents
Focused/topic-specific crawling [Chakrabarti, many others]
– Select the subset of documents that match user interests
Our work: given a set of documents, decide when to (re)download each one
Most Closely Related Work
[Wolf et al., WWW’02]:
– Maximize weighted average freshness
– Document weight = probability of “embarrassment” if not fresh
User-Centric Crawling:
– Measure the interplay between update and query workloads: when document X is updated, which queries are affected by the update, and by how much?
– Metric penalizes false negatives: a document ranked #1000 for a popular query that should be ranked #2 causes only small embarrassment but a big loss in quality
Future Work: Detecting Change-Rate Changes
– Current techniques schedule monitoring to exploit existing change-rate estimates (e.g., ΔQ_D)
– No provision to explore change-rates explicitly
– Explore/exploit tradeoff: ongoing work on a Bandit Problem formulation
– Bad case: change-rate = 0, so we never monitor, and won’t notice a future increase in the change-rate
Summary
Approach:
– User-centric metric of search engine quality
– Schedule downloading to maximize quality
Empirical results:
– High quality with few downloads
– Good at picking the “right” documents to re-download