Slide 1: Web Crawling / Collection Aggregation
CS431, Spring 2004, Carl Lagoze
April 5, Lecture 19
Slide 2: The Web is a BIG Graph
– The "diameter" of the Web
– Even the static part cannot be crawled completely
– New technology: the focused crawl
Slide 3: Crawling and Crawlers
– The Web overlays the internet
– A crawl overlays the Web, starting from a seed
Slide 4: Crawler Issues
– System considerations
– The URL itself
– Politeness
– Visit order
– Robot traps
– The hidden Web
Slide 5: Standard for Robot Exclusion
– Martijn Koster (1994)
– http://any-server:80/robots.txt
– Maintained by the webmaster
– Forbids access to pages and directories; /cgi-bin/ is commonly excluded
– Adherence is voluntary for the crawler
– Specification: http://www.robotstxt.org/wc/norobots.html
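A polite crawler checks robots.txt before each fetch. A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content and URLs here are hypothetical, matching the slide's /cgi-bin/ example:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; /cgi-bin/ is a commonly excluded path.
robots_txt = """
User-agent: *
Disallow: /cgi-bin/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The crawler checks before fetching; adherence is voluntary.
print(rp.can_fetch("*", "http://any-server/cgi-bin/search"))  # disallowed
print(rp.can_fetch("*", "http://any-server/index.html"))      # allowed
```

In a real crawler the file would be fetched from http://any-server:80/robots.txt (e.g. via `rp.set_url(...)` and `rp.read()`) rather than parsed from a literal.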
Slide 6: Visit Order
– The frontier: the set of URLs waiting to be crawled
– Breadth-first: FIFO queue
– Depth-first: LIFO queue
– Best-first: priority queue
– Random
– Refresh rate
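The three queue disciplines above can be sketched over the same small frontier. The URLs and the best-first relevance scores are hypothetical:

```python
from collections import deque
import heapq

links = ["/a", "/b", "/c"]  # hypothetical frontier contents

# Breadth-first: FIFO queue
fifo = deque(links)
bfs_order = [fifo.popleft() for _ in range(len(links))]

# Depth-first: LIFO queue (a stack)
lifo = list(links)
dfs_order = [lifo.pop() for _ in range(len(links))]

# Best-first: priority queue keyed by negated relevance score,
# so the highest-scoring URL is popped first.
scored = [(-0.9, "/b"), (-0.2, "/a"), (-0.5, "/c")]
heapq.heapify(scored)
best_order = [heapq.heappop(scored)[1] for _ in range(len(scored))]

print(bfs_order, dfs_order, best_order)
```

Only the frontier data structure changes; the rest of the crawl loop is identical, which is why visit order is a pluggable policy.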
Slide 7: Robot Traps
– Cycles in the Web graph
– Infinite links on a page
– Traps set out by the webmaster
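Two standard defenses against these traps are a visited set (breaks cycles) and a depth cap (bounds pages that generate infinite links). A minimal sketch; the depth limit and URLs are hypothetical:

```python
from urllib.parse import urldefrag

MAX_DEPTH = 5  # hypothetical limit on path depth
visited = set()

def should_visit(url, depth):
    # Drop the fragment so /page#a and /page#b count as one page.
    url, _ = urldefrag(url)
    if depth > MAX_DEPTH or url in visited:
        return False
    visited.add(url)
    return True

print(should_visit("http://example.org/page#a", 1))  # first visit
print(should_visit("http://example.org/page#b", 1))  # same page, skipped
print(should_visit("http://example.org/deep", 9))    # beyond depth cap
```

Real crawlers normalize URLs more aggressively (case, trailing slashes, session-ID query parameters), since traps often vary exactly those parts.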
Slide 8: The Hidden Web
– Dynamic pages are increasing
– Subscription pages
– Username-and-password pages
– Research in progress on how crawlers can "get into" the hidden Web
Slide 9: Redefining Order-Making for Networked Information
– Challenge: accommodate, not impose, ordering mechanisms
– Ordering mechanisms should be independent of:
  – physical location
  – who owns the content
  – who manages the content
Slide 10: Tools for Order-Making
– Better search engines: Google
– Better metadata: Dublin Core, INDECS, IMS
– Tools for selection and specialization: collection services
Slide 11: Collections in the Traditional Library
– Selection: defining the resources
– Specialization: defining the mechanisms
– Management: defining the policies
– Examples: http://campusgw.library.cornell.edu/about/spcollections.html and http://scriptorium.lib.duke.edu/
Slide 12: The Traditional Model Doesn't Map
– Irrelevance of locality, both among and within resources
– Blurring of containment: inter-resource linkages
– Loss of permanence: ephemeral resources are the norm
Slide 13: Defining a Digital Collection
A criterion for selecting a set of resources, possibly distributed across multiple repositories.
Slide 14: Collection Synthesis
– The NSDL
  – National Science Digital Library
  – Educational materials for K-through-grave (lifelong learning)
  – A collection of digital collections
– Collection (automatically derived): 20-50 items on a topic, represented by their URLs, expository in nature; precision trumps recall
– Collection description (automatically derived)
Slide 15: The Crawler is the Key
– A general search engine is good for precise results, few in number
– A search engine must cover all topics, not just scientific ones
– Automatic collection assembly needs a Web crawler
– A focused crawler is the key
Slide 16: Focused Crawling
Slide 17: Breadth-first crawl vs. focused crawl
[Figure: a breadth-first crawl from root R visits pages 1-7 in level order; a focused crawl from R prunes off-topic pages (marked X) and expands only the relevant ones.]
Slide 18: Collections and Clusters
– Traditional: the document universe is divided into clusters, or collections
– Each collection is represented by its centroid
– Web: the size of the document universe is effectively infinite
– Agglomerative clustering is used instead
– Two aspects:
  – a collection descriptor
  – a rule for when items belong to that collection
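The membership rule can be sketched as a similarity threshold against the collection's centroid, matching the Q values shown on the next slide. The vectors and Q values here are hypothetical:

```python
# Sketch: an item joins a collection when its similarity to the
# collection's centroid clears a quality threshold Q.
def similarity(u, v):
    # Dot product over sparse term-weight vectors.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

centroid = {"chebyshev": 0.7, "polynomials": 0.7}   # hypothetical centroid
item = {"chebyshev": 0.6, "recurrence": 0.8}        # hypothetical document

def belongs(item, centroid, q):
    return similarity(item, centroid) >= q

print(belongs(item, centroid, 0.2))  # lenient threshold admits the item
print(belongs(item, centroid, 0.6))  # strict threshold rejects it
```

A low Q grows the collection quickly but admits marginal items; a high Q keeps precision at the cost of recall, which matches the deck's "precision trumps recall" goal.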
Slide 19: [Figure: two cases, Q = 0.2 and Q = 0.6]
Slide 20: The Setup
A virtual collection of items about Chebyshev polynomials.
Slide 21: Adding a Centroid
An empty collection of items about Chebyshev polynomials.
Slide 22: Document Vector Space
– A classic information retrieval technique
– Each word is a dimension in N-space
– Each document is a vector in N-space
– Normalize the weights
– Both the "centroid" and the downloaded document are term vectors
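A minimal sketch of the vector-space model: documents become normalized term-weight vectors, and two vectors are compared by cosine similarity (which, on normalized vectors, is just the dot product). The texts are hypothetical:

```python
import math
from collections import Counter

def term_vector(text):
    # Each distinct word is a dimension; weights are normalized
    # term frequencies (unit-length vector).
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def cosine(u, v):
    # Dot product; equals cosine similarity for unit vectors.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

centroid = term_vector("chebyshev polynomials orthogonal polynomials")
doc = term_vector("chebyshev polynomials of the first kind")
print(round(cosine(centroid, doc), 3))  # prints 0.5
```

The same `cosine` call scores a freshly downloaded page against the collection centroid during a focused crawl.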
Slide 23: Agglomerate
A collection with 3 items about Chebyshev polynomials.
Slide 24: Where Does the Centroid Come From?
"Chebyshev Polynomials": a really good centroid for a collection about Chebyshev polynomials.
Slide 25: Building a Centroid
1. Google("Chebyshev Polynomials") → url1, url2, …
2. Let H be a hash (k, v) where k = word and v = frequency
3. For each url in {url1, url2, …}:
   D ← download(url)
   V ← term_vector(D)
   For each term t in V: if t is not in H, add it with value 0; then H(t)++
4. Compute tf-idf weights; C ← top 20 terms (by weight)
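The steps above can be made runnable. In this sketch the Google search and the downloads are stubbed out with canned pages, and the idf table is a hypothetical stand-in for corpus-derived document frequencies:

```python
from collections import Counter

# Stub for step 1 and download(url): canned pages keyed by URL.
pages = {
    "url1": "chebyshev polynomials are orthogonal polynomials",
    "url2": "recurrence for chebyshev polynomials",
}
# Hypothetical inverse document frequencies (step 4 needs a corpus).
idf = {"chebyshev": 2.0, "polynomials": 1.5, "orthogonal": 3.0,
       "recurrence": 3.0, "are": 0.1, "for": 0.1}

def build_centroid(urls, top_k=20):
    h = Counter()                     # step 2: word -> frequency
    for url in urls:                  # step 3: aggregate term counts
        h.update(pages[url].split())  # download(url) stubbed
    # Step 4: tf-idf weights, then keep the top_k terms by weight.
    weighted = {t: f * idf.get(t, 1.0) for t, f in h.items()}
    top = sorted(weighted, key=weighted.get, reverse=True)[:top_k]
    return {t: weighted[t] for t in top}

centroid = build_centroid(["url1", "url2"])
print(sorted(centroid, key=centroid.get, reverse=True)[:2])
```

Stop words like "are" and "for" fall out naturally because their idf is tiny, so the surviving top terms characterize the topic.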
Slide 26: Dictionary
– Given centroids C1, C2, C3, …
– The dictionary is C1 + C2 + C3 + …
  – Terms are the union of the terms in the Ci
  – Term frequencies are the total frequencies across the Ci
  – Document frequency for a term t is how many Ci contain t
  – Term IDF is based on Berkeley's DocFreqs
– The dictionary is 300-500 terms
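Merging centroids into a dictionary can be sketched as follows. The centroid contents are hypothetical, and the `log(n/df)` idf formula is a generic stand-in for the slide's Berkeley-derived document frequencies:

```python
import math
from collections import Counter

# Hypothetical centroids C1, C2: term -> frequency.
centroids = [
    {"chebyshev": 4, "polynomials": 5},
    {"fourier": 3, "polynomials": 2},
]

term_freq = Counter()
doc_freq = Counter()
for c in centroids:
    term_freq.update(c)        # total frequency across the Ci
    doc_freq.update(c.keys())  # how many Ci contain each term

n = len(centroids)
# Generic idf; the deck instead bases idf on external DocFreqs.
idf = {t: math.log(n / doc_freq[t]) for t in doc_freq}
print(doc_freq["polynomials"], round(idf["chebyshev"], 3))
```

Terms shared across many centroids (like "polynomials" here) get low idf and thus carry little discriminating weight between collections.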
Slide 27: Tunneling with Cutoff
A crawl path may pass through off-topic "dud" pages between on-topic "nuggets"; the cutoff bounds how many consecutive duds are tolerated.
– Nugget - dud - dud … - dud - nugget
– Notation: 0 - X - X - … - X - 0
– Fixed cutoff: 0 - X1 - X2 - … - Xc
– Adaptive cutoff: 0 - X1 - X2 - … - X?
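Fixed-cutoff tunneling can be sketched over a path of page relevance scores. The cutoff value, relevance threshold, and scores are all hypothetical:

```python
CUTOFF = 3       # hypothetical fixed cutoff c
THRESHOLD = 0.5  # hypothetical relevance threshold for "nugget"

def follow_path(scores):
    """scores: relevance of successive pages along one crawl path.
    Returns the pages kept before the path is abandoned."""
    kept, duds = [], 0
    for s in scores:
        if s >= THRESHOLD:
            duds = 0        # nugget (0): reset the dud counter
        else:
            duds += 1       # dud (X): one more step off-topic
            if duds > CUTOFF:
                break       # fixed cutoff reached: abandon the path
        kept.append(s)
    return kept

print(follow_path([0.8, 0.1, 0.2, 0.9]))       # a nugget rescues the path
print(follow_path([0.8, 0.1, 0.2, 0.1, 0.1]))  # cut off mid-tunnel
```

An adaptive cutoff would replace the constant `CUTOFF` with a value that depends on the path so far, e.g. how promising its earlier nuggets were.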
Slide 28: Statistics
– Collected 500,000 documents
– Number of seeds: 4
– Path data for all but the seeds
– 6,620 completed paths (0-x…x-0)
– 100,000s of incomplete paths (0-x…x…)
Slide 29: [Chart: nuggets that are x steps from a nugget]
Slide 30: [Chart: nuggets that are x steps from a seed and/or a nugget]
Slide 31: Better parents have better children.
Slide 32: NSDL (http://www.nsdl.org)
Slide 33: Metadata Repository
– Central storage of all metadata about all resources in the NSDL
  – Defines the extent of the NSDL collection
  – Metadata includes collections, items, annotations, etc.
– MR main functions: aggregation, normalization, redistribution
– Ingest of metadata by various means: harvesting, manual, automatic, cross-walking
– Open access to MR contents for service builders via OAI-PMH
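OAI-PMH harvesting, as used for both ingest and open access here, is a sequence of plain HTTP GET requests with protocol parameters. A minimal sketch of building such requests; the base URL is hypothetical:

```python
from urllib.parse import urlencode

BASE = "http://example.org/oai"  # hypothetical OAI-PMH endpoint

def list_records(metadata_prefix="oai_dc", resumption_token=None):
    # OAI-PMH ListRecords request; a resumptionToken continues a
    # partial harvest and replaces the other arguments.
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return BASE + "?" + urlencode(params)

print(list_records())
print(list_records(resumption_token="batch-2"))
```

A harvester would GET each URL, parse the XML response, and loop while the response carries a resumptionToken; `oai_dc` (Dublin Core) is the format every OAI-PMH repository must support, which fits the deck's Dublin Core crosswalk strategy.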
Slide 34: Metadata Strategy
– Collect and redistribute any native (XML) metadata format
– Provide crosswalks to Dublin Core from eight standard formats
  – Dublin Core, DC-GEM, LTSC (IMS), ADL (SCORM), MARC, FGDC, EAD
– Concentrate on collection-level metadata
– Use automatic generation to augment item-level metadata
Slide 35: Importing Metadata into the MR
Collections → harvest → staging area → cleanup and crosswalks → database load → Metadata Repository
Slide 36: Exporting Metadata from the MR
Slide 37: NSDL Data Warehouse
A Web of entities and relationships.
Slide 38: NSDL Data Warehouse: Entities and their Relationships
[Diagram: digital sources (document repositories, databases, Web resources, publisher repositories) feed the warehouse via harvesting, gathering, and normalization (wholesale); a diverse network of specialized partners builds portals via specialized mining, annotation, and augmentation (retail).]