Slide 1: Web Crawling / Collection Aggregation
CS431, Spring 2004, Carl Lagoze
April 5, Lecture 19
Slide 2: The Web is a BIG Graph
– The "diameter" of the Web
– Even the static part cannot be crawled completely
– New technology: the focused crawl
Slide 3: Crawling and Crawlers
– The Web overlays the internet
– A crawl overlays the Web, starting from a seed
Slide 4: Crawler Issues
– System considerations
– The URL itself
– Politeness
– Visit order
– Robot traps
– The hidden Web
Slide 5: Standard for Robot Exclusion
– Martijn Koster (1994)
– http://any-server:80/robots.txt
– Maintained by the webmaster
– Forbids access to pages and directories; /cgi-bin/ is commonly excluded
– Adherence is voluntary for the crawler
– Specification: http://www.robotstxt.org/wc/norobots.html
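A polite crawler checks robots.txt before each fetch. A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content and URLs here are hypothetical, matching the slide's /cgi-bin/ example:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; /cgi-bin/ is a commonly excluded path.
robots_txt = """
User-agent: *
Disallow: /cgi-bin/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The crawler checks before fetching; adherence is voluntary.
print(rp.can_fetch("*", "http://any-server/cgi-bin/search"))  # disallowed
print(rp.can_fetch("*", "http://any-server/index.html"))      # allowed
```

In a real crawler the file would be fetched from http://any-server:80/robots.txt (e.g. via `rp.set_url(...)` and `rp.read()`) rather than parsed from a literal.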
Slide 6: Visit Order
– The frontier: the set of URLs waiting to be crawled
– Breadth-first: FIFO queue
– Depth-first: LIFO queue
– Best-first: priority queue
– Random
– Refresh rate
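The three queue disciplines above can be sketched over the same small frontier. The URLs and the best-first relevance scores are hypothetical:

```python
from collections import deque
import heapq

links = ["/a", "/b", "/c"]  # hypothetical frontier contents

# Breadth-first: FIFO queue
fifo = deque(links)
bfs_order = [fifo.popleft() for _ in range(len(links))]

# Depth-first: LIFO queue (a stack)
lifo = list(links)
dfs_order = [lifo.pop() for _ in range(len(links))]

# Best-first: priority queue keyed by negated relevance score,
# so the highest-scoring URL is popped first.
scored = [(-0.9, "/b"), (-0.2, "/a"), (-0.5, "/c")]
heapq.heapify(scored)
best_order = [heapq.heappop(scored)[1] for _ in range(len(scored))]

print(bfs_order, dfs_order, best_order)
```

Only the frontier data structure changes; the rest of the crawl loop is identical, which is why visit order is a pluggable policy.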
Slide 7: Robot Traps
– Cycles in the Web graph
– Infinite links on a page
– Traps set out by the webmaster
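Two standard defenses against these traps are a visited set (breaks cycles) and a depth cap (bounds pages that generate infinite links). A minimal sketch; the depth limit and URLs are hypothetical:

```python
from urllib.parse import urldefrag

MAX_DEPTH = 5  # hypothetical limit on path depth
visited = set()

def should_visit(url, depth):
    # Drop the fragment so /page#a and /page#b count as one page.
    url, _ = urldefrag(url)
    if depth > MAX_DEPTH or url in visited:
        return False
    visited.add(url)
    return True

print(should_visit("http://example.org/page#a", 1))  # first visit
print(should_visit("http://example.org/page#b", 1))  # same page, skipped
print(should_visit("http://example.org/deep", 9))    # beyond depth cap
```

Real crawlers normalize URLs more aggressively (case, trailing slashes, session-ID query parameters), since traps often vary exactly those parts.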
Slide 8: The Hidden Web
– Dynamic pages are increasing
– Subscription pages
– Username-and-password pages
– Research in progress on how crawlers can "get into" the hidden Web
Slide 9: Redefining Order-Making for Networked Information
– Challenge: accommodate, not impose, ordering mechanisms
– Ordering mechanisms should be independent of:
  – physical location
  – who owns the content
  – who manages the content
Slide 10: Tools for Order-Making
– Better search engines: Google
– Better metadata: Dublin Core, INDECS, IMS
– Tools for selection and specialization: collection services
Slide 11: Collections in the Traditional Library
– Selection: defining the resources
– Specialization: defining the mechanisms
– Management: defining the policies
– Examples: http://campusgw.library.cornell.edu/about/spcollections.html and http://scriptorium.lib.duke.edu/
Slide 12: The Traditional Model Doesn't Map
– Irrelevance of locality, both among and within resources
– Blurring of containment: inter-resource linkages
– Loss of permanence: ephemeral resources are the norm
Slide 13: Defining a Digital Collection
A criterion for selecting a set of resources, possibly distributed across multiple repositories.
Slide 14: Collection Synthesis
– The NSDL
  – National Science Digital Library
  – Educational materials for K-through-grave (lifelong learning)
  – A collection of digital collections
– Collection (automatically derived): 20-50 items on a topic, represented by their URLs, expository in nature; precision trumps recall
– Collection description (automatically derived)
Slide 15: The Crawler is the Key
– A general search engine is good for precise results, few in number
– A search engine must cover all topics, not just scientific ones
– Automatic collection assembly needs a Web crawler
– A focused crawler is the key
Slide 16: Focused Crawling
Slide 17: Breadth-first crawl vs. focused crawl
[Figure: a breadth-first crawl from root R visits pages 1-7 in level order; a focused crawl from R prunes off-topic pages (marked X) and expands only the relevant ones.]
Slide 18: Collections and Clusters
– Traditional: the document universe is divided into clusters, or collections
– Each collection is represented by its centroid
– Web: the size of the document universe is effectively infinite
– Agglomerative clustering is used instead
– Two aspects:
  – a collection descriptor
  – a rule for when items belong to that collection
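The membership rule can be sketched as a similarity threshold against the collection's centroid, matching the Q values shown on the next slide. The vectors and Q values here are hypothetical:

```python
# Sketch: an item joins a collection when its similarity to the
# collection's centroid clears a quality threshold Q.
def similarity(u, v):
    # Dot product over sparse term-weight vectors.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

centroid = {"chebyshev": 0.7, "polynomials": 0.7}   # hypothetical centroid
item = {"chebyshev": 0.6, "recurrence": 0.8}        # hypothetical document

def belongs(item, centroid, q):
    return similarity(item, centroid) >= q

print(belongs(item, centroid, 0.2))  # lenient threshold admits the item
print(belongs(item, centroid, 0.6))  # strict threshold rejects it
```

A low Q grows the collection quickly but admits marginal items; a high Q keeps precision at the cost of recall, which matches the deck's "precision trumps recall" goal.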
Slide 19: [Figure: two cases, Q = 0.2 and Q = 0.6]
Slide 20: The Setup
A virtual collection of items about Chebyshev polynomials.
Slide 21: Adding a Centroid
An empty collection of items about Chebyshev polynomials.
Slide 22: Document Vector Space
– A classic information retrieval technique
– Each word is a dimension in N-space
– Each document is a vector in N-space
– Normalize the weights
– Both the "centroid" and the downloaded document are term vectors
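A minimal sketch of the vector-space model: documents become normalized term-weight vectors, and two vectors are compared by cosine similarity (which, on normalized vectors, is just the dot product). The texts are hypothetical:

```python
import math
from collections import Counter

def term_vector(text):
    # Each distinct word is a dimension; weights are normalized
    # term frequencies (unit-length vector).
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def cosine(u, v):
    # Dot product; equals cosine similarity for unit vectors.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

centroid = term_vector("chebyshev polynomials orthogonal polynomials")
doc = term_vector("chebyshev polynomials of the first kind")
print(round(cosine(centroid, doc), 3))  # prints 0.5
```

The same `cosine` call scores a freshly downloaded page against the collection centroid during a focused crawl.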
Slide 23: Agglomerate
A collection with 3 items about Chebyshev polynomials.
Slide 24: Where Does the Centroid Come From?
"Chebyshev Polynomials": a really good centroid for a collection about Chebyshev polynomials.
Slide 25: Building a Centroid
1. Google("Chebyshev Polynomials") → url1, url2, …
2. Let H be a hash (k, v) where k = word and v = frequency
3. For each url in {url1, url2, …}:
   D ← download(url)
   V ← term_vector(D)
   For each term t in V: if t is not in H, add it with value 0; then H(t)++
4. Compute tf-idf weights; C ← top 20 terms (by weight)
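The steps above can be made runnable. In this sketch the Google search and the downloads are stubbed out with canned pages, and the idf table is a hypothetical stand-in for corpus-derived document frequencies:

```python
from collections import Counter

# Stub for step 1 and download(url): canned pages keyed by URL.
pages = {
    "url1": "chebyshev polynomials are orthogonal polynomials",
    "url2": "recurrence for chebyshev polynomials",
}
# Hypothetical inverse document frequencies (step 4 needs a corpus).
idf = {"chebyshev": 2.0, "polynomials": 1.5, "orthogonal": 3.0,
       "recurrence": 3.0, "are": 0.1, "for": 0.1}

def build_centroid(urls, top_k=20):
    h = Counter()                     # step 2: word -> frequency
    for url in urls:                  # step 3: aggregate term counts
        h.update(pages[url].split())  # download(url) stubbed
    # Step 4: tf-idf weights, then keep the top_k terms by weight.
    weighted = {t: f * idf.get(t, 1.0) for t, f in h.items()}
    top = sorted(weighted, key=weighted.get, reverse=True)[:top_k]
    return {t: weighted[t] for t in top}

centroid = build_centroid(["url1", "url2"])
print(sorted(centroid, key=centroid.get, reverse=True)[:2])
```

Stop words like "are" and "for" fall out naturally because their idf is tiny, so the surviving top terms characterize the topic.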
Slide 26: Dictionary
– Given centroids C1, C2, C3, …
– The dictionary is C1 + C2 + C3 + …
  – Terms are the union of the terms in the Ci
  – Term frequencies are the total frequencies across the Ci
  – Document frequency for a term t is how many Ci contain t
  – Term IDF is based on Berkeley's DocFreqs
– The dictionary is 300-500 terms
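Merging centroids into a dictionary can be sketched as follows. The centroid contents are hypothetical, and the `log(n/df)` idf formula is a generic stand-in for the slide's Berkeley-derived document frequencies:

```python
import math
from collections import Counter

# Hypothetical centroids C1, C2: term -> frequency.
centroids = [
    {"chebyshev": 4, "polynomials": 5},
    {"fourier": 3, "polynomials": 2},
]

term_freq = Counter()
doc_freq = Counter()
for c in centroids:
    term_freq.update(c)        # total frequency across the Ci
    doc_freq.update(c.keys())  # how many Ci contain each term

n = len(centroids)
# Generic idf; the deck instead bases idf on external DocFreqs.
idf = {t: math.log(n / doc_freq[t]) for t in doc_freq}
print(doc_freq["polynomials"], round(idf["chebyshev"], 3))
```

Terms shared across many centroids (like "polynomials" here) get low idf and thus carry little discriminating weight between collections.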
Slide 27: Tunneling with Cutoff
A crawl path may pass through off-topic "dud" pages between on-topic "nuggets"; the cutoff bounds how many consecutive duds are tolerated.
– Nugget - dud - dud … - dud - nugget
– Notation: 0 - X - X - … - X - 0
– Fixed cutoff: 0 - X1 - X2 - … - Xc
– Adaptive cutoff: 0 - X1 - X2 - … - X?
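Fixed-cutoff tunneling can be sketched over a path of page relevance scores. The cutoff value, relevance threshold, and scores are all hypothetical:

```python
CUTOFF = 3       # hypothetical fixed cutoff c
THRESHOLD = 0.5  # hypothetical relevance threshold for "nugget"

def follow_path(scores):
    """scores: relevance of successive pages along one crawl path.
    Returns the pages kept before the path is abandoned."""
    kept, duds = [], 0
    for s in scores:
        if s >= THRESHOLD:
            duds = 0        # nugget (0): reset the dud counter
        else:
            duds += 1       # dud (X): one more step off-topic
            if duds > CUTOFF:
                break       # fixed cutoff reached: abandon the path
        kept.append(s)
    return kept

print(follow_path([0.8, 0.1, 0.2, 0.9]))       # a nugget rescues the path
print(follow_path([0.8, 0.1, 0.2, 0.1, 0.1]))  # cut off mid-tunnel
```

An adaptive cutoff would replace the constant `CUTOFF` with a value that depends on the path so far, e.g. how promising its earlier nuggets were.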
Slide 28: Statistics
– Collected 500,000 documents
– Number of seeds: 4
– Path data for all but the seeds
– 6,620 completed paths (0-x…x-0)
– 100,000s of incomplete paths (0-x…x…)
Slide 29: [Chart: nuggets that are x steps from a nugget]
Slide 30: [Chart: nuggets that are x steps from a seed and/or a nugget]
Slide 31: Better parents have better children.
Slide 32: NSDL (http://www.nsdl.org)
Slide 33: Metadata Repository
– Central storage of all metadata about all resources in the NSDL
  – Defines the extent of the NSDL collection
  – Metadata includes collections, items, annotations, etc.
– MR main functions: aggregation, normalization, redistribution
– Ingest of metadata by various means: harvesting, manual, automatic, cross-walking
– Open access to MR contents for service builders via OAI-PMH
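OAI-PMH harvesting, as used for both ingest and open access here, is a sequence of plain HTTP GET requests with protocol parameters. A minimal sketch of building such requests; the base URL is hypothetical:

```python
from urllib.parse import urlencode

BASE = "http://example.org/oai"  # hypothetical OAI-PMH endpoint

def list_records(metadata_prefix="oai_dc", resumption_token=None):
    # OAI-PMH ListRecords request; a resumptionToken continues a
    # partial harvest and replaces the other arguments.
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return BASE + "?" + urlencode(params)

print(list_records())
print(list_records(resumption_token="batch-2"))
```

A harvester would GET each URL, parse the XML response, and loop while the response carries a resumptionToken; `oai_dc` (Dublin Core) is the format every OAI-PMH repository must support, which fits the deck's Dublin Core crosswalk strategy.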
Slide 34: Metadata Strategy
– Collect and redistribute any native (XML) metadata format
– Provide crosswalks to Dublin Core from eight standard formats
  – Dublin Core, DC-GEM, LTSC (IMS), ADL (SCORM), MARC, FGDC, EAD
– Concentrate on collection-level metadata
– Use automatic generation to augment item-level metadata
Slide 35: Importing Metadata into the MR
Collections → harvest → staging area → cleanup and crosswalks → database load → Metadata Repository
Slide 36: Exporting Metadata from the MR
Slide 37: NSDL Data Warehouse
A Web of entities and relationships.
Slide 38: NSDL Data Warehouse: Entities and their Relationships
[Diagram: digital sources (document repositories, databases, Web resources, publisher repositories) feed the warehouse via harvesting, gathering, and normalization (wholesale); a diverse network of specialized partners builds portals via specialized mining, annotation, and augmentation (retail).]