Web Crawling/Collection Aggregation CS431, Spring 2004, Carl Lagoze April 5 Lecture 19.


1 Web Crawling/Collection Aggregation CS431, Spring 2004, Carl Lagoze April 5 Lecture 19

2 The Web is a BIG Graph “Diameter” of the Web Cannot crawl even the static part, completely New technology: the focused crawl

3 Crawling and Crawlers Web overlays the internet A crawl overlays the web seed

4 Crawler Issues System Considerations The URL itself Politeness Visit Order Robot Traps The hidden web

5 Standard for Robot Exclusion Martijn Koster (1994) http://any-server:80/robots.txt Maintained by the webmaster Forbid access to pages, directories Commonly excluded: /cgi-bin/ Adherence is voluntary for the crawler Specification: http://www.robotstxt.org/wc/norobots.html
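A minimal sketch of how a polite crawler might honor the exclusion standard, using Python's standard `urllib.robotparser`. The robots.txt content, the user-agent name `cs431-crawler`, and the URLs are assumptions made up for illustration; a real crawler would fetch the file from the server before requesting anything else on that host.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a webmaster might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = make_parser(ROBOTS_TXT)

# Adherence is voluntary: the crawler itself must ask before each fetch.
allowed = parser.can_fetch("cs431-crawler", "http://any-server/index.html")
blocked = parser.can_fetch("cs431-crawler", "http://any-server/cgi-bin/search")
```

Here `allowed` is true and `blocked` is false, because only the second URL falls under an excluded directory.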

6 Visit Order The frontier Breadth-first: FIFO queue Depth-first: LIFO queue Best-first: Priority queue Random Refresh rate
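The three queue disciplines on this slide can be sketched with one crawl loop in which only the frontier changes. The toy link graph and the relevance scores below are assumptions for illustration, not data from the lecture.

```python
import heapq
from collections import deque

def crawl_order(graph, seed, policy="bfs", score=None):
    """Return the order in which pages are visited under a frontier policy."""
    visited, order = {seed}, []
    if policy == "best":
        frontier = [(-score(seed), seed)]  # max-priority via negated score
    else:
        frontier = deque([seed])
    while frontier:
        if policy == "bfs":
            page = frontier.popleft()      # FIFO queue
        elif policy == "dfs":
            page = frontier.pop()          # LIFO queue
        else:
            _, page = heapq.heappop(frontier)  # priority queue
        order.append(page)
        for link in graph.get(page, []):
            if link not in visited:
                visited.add(link)
                if policy == "best":
                    heapq.heappush(frontier, (-score(link), link))
                else:
                    frontier.append(link)
    return order

# Hypothetical link graph rooted at R, with made-up relevance scores.
GRAPH = {"R": ["A", "B"], "A": ["C", "D"], "B": ["E"]}
SCORES = {"R": 1.0, "A": 0.5, "B": 0.9, "C": 0.4, "D": 0.3, "E": 0.1}

bfs = crawl_order(GRAPH, "R", "bfs")
dfs = crawl_order(GRAPH, "R", "dfs")
best = crawl_order(GRAPH, "R", "best", score=SCORES.get)
```

On this graph the three policies visit the same six pages in three different orders, which is the whole point of choosing a visit order when the crawl cannot be completed.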

7 Robot Traps Cycles in the Web graph Infinite links on a page Traps set out by the Webmaster
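One common family of defenses against traps, sketched under assumptions: remember visited URLs (breaks cycles), bound the link depth, and cap the pages taken from any one host (bounds pages that generate links forever). The limits and the `calendar_trap` link generator are hypothetical; the slide does not prescribe a particular defense.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit

def guarded_crawl(get_links, seed, max_depth=5, max_pages_per_host=50):
    """Breadth-first crawl with three simple guards against robot traps."""
    seen, per_host, fetched = set(), {}, []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        url = urldefrag(url).url              # ignore #fragments
        host = urlsplit(url).netloc
        if url in seen or depth > max_depth:  # cycle guard, depth guard
            continue
        if per_host.get(host, 0) >= max_pages_per_host:  # per-host budget
            continue
        seen.add(url)
        per_host[host] = per_host.get(host, 0) + 1
        fetched.append(url)
        for link in get_links(url):
            queue.append((link, depth + 1))
    return fetched

def calendar_trap(url):
    """Hypothetical trap: each page links to itself and to a 'next day' page."""
    day = int(url.rsplit("=", 1)[1])
    return [url, f"http://trap.example/cal?day={day + 1}"]

by_budget = guarded_crawl(calendar_trap, "http://trap.example/cal?day=0",
                          max_depth=100, max_pages_per_host=10)
by_depth = guarded_crawl(calendar_trap, "http://trap.example/cal?day=0",
                         max_depth=3, max_pages_per_host=50)
```

Without the guards this crawl would never terminate; with them it stops after 10 pages (host budget) or 4 pages (depth limit).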

8 The Hidden Web Dynamic pages increasing Subscription pages Username and password pages Research in progress on how crawlers can “get into” the hidden web

9 Redefining Order Making for Networked Information Challenge: Accommodate not impose ordering mechanisms Ordering mechanisms should be independent of: –Physical location –Who owns the content –Who manages the content

10 Tools for Order Making Better search engines –google Better metadata –Dublin Core, INDECS, IMS Tools for selection and specialization –Collection Services

11 Collections in the Traditional Library Selection – defining the resources Specialization – defining the mechanisms Management – defining the policies. http://campusgw.library.cornell.edu/about/spcollections.html http://scriptorium.lib.duke.edu/

12 Traditional Model Doesn’t Map Irrelevance of locality – both among and within resources Blurring of containment – inter-resource linkages Loss of permanence – ephemeral resources are the norm

13 Defining a Digital Collection A criterion for selecting a set of resources, possibly distributed across multiple repositories

14 Collection Synthesis The NSDL –National Science Digital Library –Educational materials for K-thru-grave –A collection of digital collections Collection (automatically derived) –20-50 items on a topic, represented by their URLs, expository in nature, precision trumps recall. Collection description (automatically derived)

15 Crawler is the Key A general search engine is good for precise results, few in number A search engine must cover all topics, not just scientific For automatic collection assembly, a Web crawler is needed A focused crawler is the key

16 Focused Crawling

17 [Figure: two crawl trees rooted at R with numbered nodes. Left panel: Breadth-first crawl. Right panel: Focused crawl, with some nodes marked X]

18 Collections and Clusters Traditional – document universe is divided into clusters, or collections Each collection represented by its centroid Web – size of document universe is infinite Agglomerative clustering is used instead Two aspects: –Collection descriptor –Rule for when items belong to that Collection

19 [Figure: two panels, labeled Q = 0.2 and Q = 0.6]

20 The Setup A virtual collection of items about Chebyshev Polynomials

21 Adding a Centroid An empty collection of items about Chebyshev Polynomials

22 Document Vector Space Classic information retrieval technique Each word is a dimension in N-space Each document is a vector in N-space Example: Normalize the weights Both the “centroid” and the downloaded document are term vectors
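A small sketch of the vector-space idea: each distinct word is a dimension, the weights are normalized to unit length, and the centroid is compared with a downloaded document by cosine similarity. The tokenizer and the sample texts are assumptions for illustration; raw term counts stand in for whatever weighting the lecture's system used.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Map text to a unit-length vector of term weights (raw counts here)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}

def cosine(u, v):
    """Cosine similarity of two unit-length vectors is their dot product."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Hypothetical centroid and two hypothetical downloaded documents.
centroid = term_vector("chebyshev polynomials orthogonal polynomials recurrence")
on_topic = term_vector("Chebyshev polynomials satisfy a three-term recurrence")
off_topic = term_vector("cheap flights and hotel deals")

sim_on = cosine(centroid, on_topic)
sim_off = cosine(centroid, off_topic)
```

A membership rule can then be as simple as comparing the similarity against a threshold: the on-topic page scores well above zero, while the off-topic page shares no terms with the centroid and scores exactly zero.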

23 Agglomerate A collection with 3 items about Ch. Polys.

24 Where does the Centroid come from? “Chebyshev Polynomials” A really good centroid for a collection about C.P.’s

25 Building a Centroid
1. Google(“Chebyshev Polynomials”) → url1, url2, …
2. Let H be a hash (k,v) where k=word, value=freq
3. For each url in {url1, url2, …} do
   D ← download(url)
   V ← term vector(D)
   For each term t in V do
      If t not in H add it with value 0
      H(t)++
4. Compute tf-idf weights. C ← top 20 terms (by weight).
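The steps above can be sketched in Python as follows. Several things here are assumptions rather than the lecture's actual implementation: the search and download steps are replaced by a list of already-downloaded texts, `H(t)++` is read literally as one increment per document containing the term, and the idf part uses a hypothetical background document-frequency table (`doc_freq`, `corpus_size`) with one common smoothing of the idf formula.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Distinct lower-cased words of a document with their counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def build_centroid(documents, doc_freq, corpus_size, top_k=20):
    """Steps 2-4: accumulate H over the pages, weight by tf-idf, keep the top terms."""
    h = Counter()                       # step 2: H maps word -> frequency
    for text in documents:              # step 3: one pass per downloaded page
        for term in term_vector(text):  # each distinct term once per page
            h[term] += 1                # H(t)++
    # Step 4: tf-idf with an assumed background document-frequency table.
    weights = {
        t: f * math.log(corpus_size / (1 + doc_freq.get(t, 0)))
        for t, f in h.items()
    }
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return dict(top)

# Hypothetical stand-ins for the pages returned by the search in step 1.
pages = [
    "Chebyshev polynomials are orthogonal",
    "the Chebyshev polynomials of the first kind",
    "the weather is nice",
]
background = {"the": 900, "are": 800, "is": 850, "of": 880}  # made-up counts
centroid = build_centroid(pages, background, corpus_size=1000, top_k=2)
```

With these made-up inputs the two surviving terms are "chebyshev" and "polynomials": they occur in two pages and are rare in the background table, whereas "the" is frequent everywhere and is weighted down.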

26 Dictionary Given centroids C1, C2, C3 … Dictionary is C1 + C2 + C3 … –Terms are union of terms in Ci –Term Frequencies are total frequency in Ci –Document Frequency is how many C’s have t –Term IDF is based on Berkeley’s DocFreqs Dictionary is 300-500 terms
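A sketch of how the dictionary could be assembled from several centroids, assuming each centroid is a mapping from term to frequency. The external document-frequency source mentioned on the slide is not modeled; only the union of terms, the total frequency, and the count of centroids containing each term are computed. The two sample centroids are made up.

```python
from collections import Counter

def build_dictionary(centroids):
    """Merge centroids into (total term frequency, centroid frequency) tables."""
    term_freq = Counter()
    centroid_freq = Counter()
    for c in centroids:
        term_freq.update(c)             # add this centroid's frequencies
        centroid_freq.update(c.keys())  # +1 for each term the centroid contains
    return term_freq, centroid_freq

# Hypothetical centroids C1 and C2.
c1 = {"chebyshev": 5, "polynomial": 3}
c2 = {"polynomial": 2, "fourier": 4}
term_freq, centroid_freq = build_dictionary([c1, c2])
```

The keys of `term_freq` are the union of the centroids' terms, so the dictionary size grows with the number of topics rather than with the number of crawled pages.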

27 Tunneling with Cutoff Nugget – dud – dud … – dud – nugget Notation: 0 – X – X … – X – 0 Fixed cutoff: 0 – X1 – X2 – … – Xc Adaptive cutoff: 0 – X1 – X2 – … – X?
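A minimal sketch of tunneling with a fixed cutoff: the crawler keeps following links through duds, but stops expanding a path once it contains `cutoff` consecutive duds. The toy graph and the nugget test are assumptions for illustration, and the adaptive variant is not shown. As a simplification, each page is only considered along the first path that reaches it.

```python
from collections import deque

def tunnel_crawl(graph, is_nugget, seed, cutoff):
    """Return the nuggets found when paths are cut after `cutoff` duds in a row."""
    nuggets, seen = [], {seed}
    queue = deque([(seed, 0)])  # (page, consecutive duds on the path so far)
    while queue:
        page, duds = queue.popleft()
        if is_nugget(page):
            nuggets.append(page)
            duds = 0            # a nugget resets the count: 0 - X - X - 0
        else:
            duds += 1
            if duds >= cutoff:  # this dud is Xc: do not follow its links
                continue
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, duds))
    return nuggets

# Hypothetical graph: nugget N lies behind 2 duds, nugget M behind 3 duds.
GRAPH = {"S": ["d1", "d3"], "d1": ["d2"], "d2": ["N"],
         "d3": ["d4"], "d4": ["d5"], "d5": ["M"]}
NUGGETS = {"S", "N", "M"}

found_c2 = tunnel_crawl(GRAPH, NUGGETS.__contains__, "S", cutoff=2)
found_c3 = tunnel_crawl(GRAPH, NUGGETS.__contains__, "S", cutoff=3)
found_c4 = tunnel_crawl(GRAPH, NUGGETS.__contains__, "S", cutoff=4)
```

Raising the cutoff finds more nuggets at the cost of downloading more duds, which is the trade-off the path statistics on the following slides measure.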

28 Statistics Collected 500,000 documents Number of seeds: 4 Path data for all but seeds 6620 completed paths (0-x…x-0) 100,000s incomplete paths (0-x…x..)

29 Nuggets that are x steps from a nugget

30 Nuggets that are x steps from a seed and/or a nugget

31 Better parents have better children.

32 NSDL http://www.nsdl.org

33 Metadata Repository Central storage of all metadata about all resources in the NSDL –Defines the extent of NSDL collection –Metadata includes collections, items, annotations, etc. MR main functions –Aggregation –Normalization –Redistribution Ingest of metadata by various means –Harvesting, manual, automatic, cross-walking Open access to MR contents for service builders via OAI-PMH
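For service builders, access over OAI-PMH comes down to HTTP requests carrying a `verb` parameter. The sketch below only builds `ListRecords` request URLs; the base URL is a made-up placeholder, not the NSDL endpoint. Per the protocol, a `resumptionToken` is sent on its own, without the other arguments.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # Continuation request: the token replaces all other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # selective harvesting by datestamp
        if set_spec:
            params["set"] = set_spec    # selective harvesting by set
    return f"{base_url}?{urlencode(params)}"

# Hypothetical repository base URL.
BASE = "http://repository.example/oai"
first_page = list_records_url(BASE, from_date="2004-04-05")
next_page = list_records_url(BASE, resumption_token="abc123")
```

A harvester would fetch `first_page`, parse the XML response, and keep requesting continuation pages until the response carries no further resumption token.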

34 Metadata Strategy Collect and redistribute any native (XML) metadata format Provide crosswalks to Dublin Core from eight standard formats –Dublin Core, DC-GEM, LTSC (IMS), ADL (SCORM), MARC, FGDC, EAD Concentrate on collection-level metadata Use automatic generation to augment item-level metadata

35 Importing metadata into the MR Collections Harvest Staging area Cleanup and crosswalks Database load Metadata Repository
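The import path on this slide (harvest, staging area, cleanup and crosswalks, database load) can be sketched as a small pipeline. Everything concrete here is an assumption for illustration: the native field names, the mapping to Dublin Core elements, the rule that a record needs a title, and the use of a plain list as the repository.

```python
# Hypothetical native field names mapped to Dublin Core elements.
DC_CROSSWALK = {
    "main_title": "title",
    "author": "creator",
    "keywords": "subject",
    "abstract": "description",
}

def cleanup(record):
    """Trim whitespace and drop empty or non-string fields."""
    return {k: v.strip() for k, v in record.items()
            if isinstance(v, str) and v.strip()}

def crosswalk(record, mapping=DC_CROSSWALK):
    """Rename the fields the mapping knows about; drop the rest."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}

def import_records(harvested, repository):
    """Harvest -> staging -> cleanup and crosswalk -> load."""
    staging = list(harvested)           # staging area
    for record in staging:
        dc = crosswalk(cleanup(record))
        if "title" in dc:               # assumed minimal requirement
            repository.append(dc)       # database load
    return repository

harvested = [
    {"main_title": " Chebyshev Polynomials ", "author": "A. Author",
     "format": "text/html"},
    {"author": "No Title"},
]
repo = import_records(harvested, [])
```

Only the first record survives, normalized to the `title` and `creator` elements; the second is rejected at load time because it has no title.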

36 Exporting metadata from the MR

37 NSDL Data Warehouse A Web of Entities and Relationships

38 [Diagram labels] Digital Sources: Data Stores, Document Repositories, Databases, Web Resources, Publisher Repositories; Harvesting, Gathering, Normalization; NSDL Data Warehouse: Entities and their Relationships (wholesale); Diverse Network of Specialized Partners (retail); Specialized Mining, Annotation, Augmentation, Portals

