Slide 1: Focused Crawling and Collection Synthesis
Donna Bergmark, Cornell Information Systems
CUL Metadata WG Meeting, December 20, 2002
Slide 2: Outline
Crawlers
Collection Synthesis
Focused Crawling
Some Results
Student Project (Fall 2002)
Slide 3: Definition
Spider = robot = crawler
Crawlers are computer programs that roam the Web with the goal of automating specific Web-related tasks.
Slide 4: Crawlers – Some Background
Resource discovery
Crawlers and internet history
Crawling and crawlers
Mercator
Slide 5: Resource Discovery
Finding info on the Web
– Surfing (random strategy; the goal is serendipity)
– Searching (inverted indices; specific info)
– Crawling ("all" the info)
Uses for crawling
– Find stuff
– Gather stuff
– Check stuff
Slide 6: Crawlers and Internet History
1991: HTTP
1992: 26 servers
1993: 60+ servers; self-register; Archie
1994 (early): first crawlers
1996: search engines abound
1998: focused crawling
1999: web graph studies
2002: use for digital libraries
Slide 7: Crawling and Crawlers
The Web overlays the internet; a crawl overlays the Web.
[Diagram: a crawl expanding outward from a seed page]
Slide 8: Crawler Issues
The Web is so big
Visit order
The URL itself
Politeness
Robot traps
The hidden Web
System considerations
Slide 9: Standard for Robot Exclusion
Martijn Koster (1994)
http://any-server:80/robots.txt
Maintained by the webmaster
Forbids access to pages and directories
Commonly excluded: /cgi-bin/
Adherence is voluntary for the crawler (see the sketch below)
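A minimal sketch of how a polite crawler might honor the exclusion standard, using Python's standard urllib.robotparser; the host name and user-agent string are placeholders taken from the slide, not a real server.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://any-server:80/robots.txt")   # placeholder host from the slide
rp.read()                                        # fetch and parse robots.txt

url = "http://any-server:80/cgi-bin/search"      # /cgi-bin/ is commonly excluded
if rp.can_fetch("MyCrawler", url):
    print("allowed:", url)
else:
    print("excluded by robots.txt:", url)

Because adherence is voluntary, nothing forces a crawler to make this check; well-behaved crawlers simply choose to.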
Slide 10: Robot Traps
Cycles in the Web graph
Infinite links on a page
Traps set out by the webmaster
Slide 11: The Hidden Web
Dynamic pages are increasing
Subscription pages
Username and password pages
Research is in progress on how crawlers can "get into" the hidden Web
Slide 12: System Issues
Crawlers are complicated systems
Efficiency is of utmost importance
Crawlers are demanding of system and network resources
Slide 14: Mercator Features
Written in Java
One file configures a crawl
Can add your own code
– Extend one or more of Mercator's base classes
– Add totally new classes called by your own
Industrial-strength crawler
– Uses its own DNS and java.net package
Slide 15: Collection Synthesis
The NSDL
– National Science Digital Library
– Educational materials for K-thru-grave
– A collection of digital collections
Collection (automatically derived)
– 20–50 items on a topic, represented by their URLs
– Expository in nature
– Precision trumps recall
Slide 16: The Crawler is the Key
A general search engine is good for precise results, few in number
A search engine must cover all topics, not just scientific ones
For automatic collection assembly, a Web crawler is needed
A focused crawler is the key
Slide 17: Focused Crawling
Slide 18: Focused Crawling
[Diagram: a breadth-first crawl visits pages 1–7 from root R; a focused crawl prunes off-topic branches (marked X) and visits only pages 1–5]
Slide 19: Collections and Clusters
Traditional: the document universe is divided into clusters, or collections
Each collection is represented by its centroid
Web: the size of the document universe is infinite
Agglomerative clustering is used instead
Two aspects:
– Collection descriptor
– Rule for when items belong to that collection
Slide 20: (figure: two example clusters, Q = 0.2 and Q = 0.6)
Slide 21: The Setup
A virtual collection of items about Chebyshev Polynomials
Slide 22: Adding a Centroid
An empty collection of items about Chebyshev Polynomials
Slide 23: Document Vector Space
Classic information retrieval technique
Each word is a dimension in N-space
Each document is a vector in N-space
Example: normalize the weights
Both the "centroid" and a downloaded document are term vectors (illustrated below)
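As a rough illustration of the term-vector model (not the project's actual code), a page's text can be turned into an L2-normalized term vector and compared to a centroid with the cosine measure:

import math
from collections import Counter

def term_vector(text):
    # Bag-of-words term vector with L2-normalized weights.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def cosine(v1, v2):
    # Cosine correlation between two sparse, normalized term vectors.
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())

centroid = term_vector("chebyshev polynomials orthogonal approximation recurrence")
page = term_vector("chebyshev polynomials are orthogonal polynomials used in approximation")
print(cosine(centroid, page))   # correlation between the page and the centroid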
Slide 24: Agglomerate
A collection with 3 items about Chebyshev Polynomials
Slide 25: Where Does the Centroid Come From?
"Chebyshev Polynomials"
A really good centroid for a collection about Chebyshev Polynomials
Slide 26: Building a Centroid
1. Google("Chebyshev Polynomials") → {url1 … urln}
2. Let H be a hash of (k, v) pairs where k = word, v = frequency
3. For each url in {url1 … urln}:
   d ← download(url)
   V ← term_vector(d)
   For each term t in V: if t is not in H, add it with value 0; H(t)++
4. Compute tf-idf weights; C ← top 20 terms
(A runnable sketch follows.)
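A minimal sketch of that procedure, assuming the seed URLs have already been obtained from a search engine and that download and tokenize helpers exist; the project's tf-idf weighting follows the Berkeley formulation, so the plain log IDF here is only a stand-in.

import math
from collections import Counter

def build_centroid(seed_urls, download, tokenize, top_n=20):
    # Build a centroid term vector for a topic from search-engine hits.
    term_freq = Counter()   # total occurrences of each term
    doc_freq = Counter()    # how many documents contain each term
    n_docs = 0
    for url in seed_urls:
        terms = tokenize(download(url))
        if not terms:
            continue
        n_docs += 1
        term_freq.update(terms)
        doc_freq.update(set(terms))
    # tf-idf weight per term (stand-in formula); keep the top_n terms.
    weights = {t: tf * math.log((n_docs + 1) / doc_freq[t])
               for t, tf in term_freq.items()}
    top = sorted(weights, key=weights.get, reverse=True)[:top_n]
    return {t: weights[t] for t in top}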
Slide 27: Dictionary
Given centroids C1, C2, C3, …
The dictionary is C1 + C2 + C3 + …
– Terms are the union of the terms in the Ci
– Term frequencies are the total frequencies across the Ci
– Document frequency of a term is how many Ci contain it
– Term IDF is as from Berkeley
The dictionary is 300–500 terms
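A sketch of merging per-topic centroids into a single dictionary, under the same caveat: the Berkeley IDF formula is not reproduced here, so an ordinary log IDF stands in for it.

import math
from collections import Counter

def build_dictionary(centroids):
    # Union of terms; frequencies summed across centroids;
    # document frequency = number of centroids containing the term.
    term_freq = Counter()
    doc_freq = Counter()
    for c in centroids:
        term_freq.update(c)
        doc_freq.update(c.keys())
    n = len(centroids)
    return {t: (term_freq[t], math.log((n + 1) / doc_freq[t]))
            for t in term_freq}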
Slide 28: Focused Crawling
Recall the cartoon for a focused crawl (root R, pages 1–5, off-topic branches marked X)
A simple way to do it is with two "knobs"
Slide 29: Focusing the Crawl
Threshold: a page is on-topic if its correlation to the closest centroid is above this value
Cutoff: follow links from pages whose "distance" from the closest on-topic ancestor is less than the cutoff
(A sketch of the two knobs follows.)
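A rough sketch of how the two knobs might gate a crawl; the function and parameter names are illustrative, not Mercator's API.

def cosine(v1, v2):
    # Cosine correlation between two sparse term vectors.
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())

def classify_page(page_vector, centroids, threshold, cutoff, parent_distance):
    # Decide whether a page is on-topic and whether to follow its links.
    corr = max(cosine(page_vector, c) for c in centroids)
    on_topic = corr >= threshold          # threshold knob
    distance = 0 if on_topic else parent_distance + 1
    follow_links = distance < cutoff      # cutoff knob (fixed-cutoff tunneling)
    return on_topic, distance, follow_links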
Slide 30: Illustration
[Diagram of a crawl tree with Cutoff = 1; pages with Corr >= threshold are on-topic]
Slide 31: (figure: documents ranked from closest to furthest from the centroid)
Slide 32: Collection "Evaluation"
Assume higher correlations are good
With human relevance assessments, one can also compute a "precision" curve
Precision P(n) after considering the n most highly ranked items is the number of relevant items divided by n (see the example below)
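For concreteness, a small precision-at-n calculation over a ranked list; the URLs and relevance judgments are hypothetical.

def precision_at_n(ranked, relevant, n):
    # P(n): fraction of the top-n ranked items judged relevant.
    return sum(1 for item in ranked[:n] if item in relevant) / n

ranked = ["u1", "u2", "u3", "u4", "u5"]   # hypothetical ranking
relevant = {"u1", "u2", "u4"}             # hypothetical human judgments
print(precision_at_n(ranked, relevant, 4))   # 3 of the top 4 are relevant -> 0.75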
Slide 33: (figure: results with Cutoff = 0, Threshold = 0.3)
Slide 35: Tunneling with Cutoff
Nugget – dud – dud – … – dud – nugget
Notation: 0 – X – X – … – X – 0
Fixed cutoff: 0 – X1 – X2 – … – Xc
Adaptive cutoff: 0 – X1 – X2 – … – X?
Slide 36: Statistics
Collected 500,000 documents
Number of seeds: 4
Path data for all but the seeds
6,620 completed paths (0–X…X–0)
100,000s of incomplete paths (0–X…X…)
Slide 37: (chart: nuggets that are x steps from a nugget)
Slide 38: (chart: nuggets that are x steps from a seed and/or a nugget)
Slide 39: Better parents have better children.
Slide 40: Using the Empirical Observations
Use the path history
Use the page quality (cosine correlation)
The current distance should increase exponentially as you get away from quality nodes
Distance = 0 if this page is a nugget; otherwise 1, or (1 − corr) × exp(2 × parent's distance / cutoff)
(One reading of this rule is sketched below.)
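One reading of that rule, with the "or" taken as "whichever is larger"; that interpretation is an assumption here, and the exact formula is given in the ECDL paper.

import math

def adaptive_distance(is_nugget, corr, parent_distance, cutoff):
    # Distance for adaptive tunneling: 0 at a nugget, otherwise a value
    # that grows exponentially with the parent's distance from quality pages.
    # Reading the slide's "1 or ..." as max(1, ...) is an assumption.
    if is_nugget:
        return 0.0
    return max(1.0, (1.0 - corr) * math.exp(2.0 * parent_distance / cutoff))

Links from a page would then be followed only while this distance stays below the cutoff, as on the "Focusing the Crawl" slide.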
Slide 41: Results
Details are in the ECDL paper
Smaller frontier → more documents per second
More documents downloaded in the same time
Higher-scoring documents were downloaded
A cutoff of 20 averaged 7 steps at the cutoff
Slide 42: Fall 2002 Student Project
[Architecture diagram: a query such as "Chebyshev Polynomials" yields term vectors, centroids, and a dictionary; Mercator crawls with these and produces collection URLs and an HTML collection description]
Slide 43: Conclusion
We've covered crawling: history, technology, use
Focused crawling with tunneling
Adaptive cutoff with tunneling
We have a good experimental setup for exploring automatic collection synthesis