Collection Synthesis CS 502 – Carl Lagoze – Cornell University


1 Collection Synthesis CS 502 – 2003-03-31 Carl Lagoze – Cornell University
Introduce myself as a researcher in the Cornell Computer Science department currently assigned to Cornell Information Systems doing web-related stuff. I am also now part of the NSDL project. Acknowledgements: Donna Bergmark

2 Divine Chaos of Global Web
XXX

3 Chaos as an Attribute
Universality – anyone can participate in any role: publisher, author, consumer
Uniformity – all are treated as equals
Decentralization – anarchism rules beyond basic protocol agreement

4 Costs of Chaos
Quantity without consistent quality – mixing Nobel Prize winners with winners of a first-grade writing contest
Uniformity vs. the importance of specialization – sophisticated tools, technologies, and guidance for using many classes of information
The importance of information integrity – reliability, security and privacy of users and information providers, long-term survivability

5 The Order-Making Tradition

6 Digital Libraries and Order Making
The enduring role of libraries (and other information intermediaries) in the transition from physical to digital will be order making – making information easier to find and manage. With due acknowledgement to Jim O’Donnell and David Levy

7 Digital Libraries as Portals

8 Redefining Order Making for Networked Information
Challenge: accommodate, not impose, ordering mechanisms
Ordering mechanisms should be independent of:
Physical location
Who owns the content
Who manages the content

9 Tools for Order Making
Better search engines (Google)
Better metadata (Dublin Core, INDECS, IMS)
Tools for selection and specialization: Collection Services

10 Collections in the Traditional Library
Selection – defining the resources
Specialization – defining the mechanisms
Management – defining the policies

11 Traditional Model Doesn’t Map
Irrelevance of locality – both among and within resources
Blurring of containment – inter-resource linkages
Loss of permanence – ephemeral resources are the norm

12 Defining a Digital Collection
A criterion for selecting a set of resources, possibly spread across multiple distributed repositories

13 Collection Synthesis
The NSDL – National Science Digital Library
Educational materials for K-thru-grave
A collection of digital collections
Collection (automatically derived): 20-50 items on a topic, represented by their URLs, expository in nature; precision trumps recall
Collection description (automatically derived)
On to the second topic: our work in Collection Synthesis. Having access to Mercator allowed us to apply crawling to an application motivated by an NSF-funded project currently underway: building the National Science Digital Library. Basically the library is a collection of collections. Most of the NSDL-funded projects are in fact aimed at putting together collections on certain topics. The Core Infrastructure, which is the part that Cornell is working on, provides a way to search and access these collections. Only metadata will be kept at the central site. The collections are hand-assembled for now, but automation would be highly desirable. I have been exploring the question of what kinds of automatic techniques would help create content for the digital library. However, rather than archiving this content, I hope instead simply to save the URLs of the collection. This material is scientific, rather than historical, so archiving does not entirely apply. So: how would you go about, say, assembling a collection of Web resources to teach you about Chebyshev Polynomials? A search engine would give you 2 or 3, but not many more, and they would not necessarily be expository. And if you go to a topic group like Yahoo or dmoz, you'll find most of the material is not expository.

14 Crawler is the Key
A general search engine is good for precise results, few in number
A search engine must cover all topics, not just scientific ones
For automatic collection assembly, a Web crawler is needed
A focused crawler is the key
If we aim to build collections automatically, we need to crawl the Web (as opposed to searching by hand). As previously noted, search engines are good when you want "the" answer – it is usually the top hit. Crawlers are good if you want to crawl the Web in batch style and assemble materials offline. A powerful crawler lets you cover more of the Web. It can be focused, whereas search-engine crawlers must collect all material. A powerful crawler implies running a parallel crawler; there are several parallel / distributed crawlers out there, including Mercator.

15 Focused Crawling
The Web is a BIG graph
"Diameter" of the Web – growing
Cannot completely crawl even the static part
New technology: the focused crawl
Lawrence and Giles in 1998 pointed out that the Web was quickly outpacing search-engine indexing (their numbers put coverage below 20%). Since the search engines need to be ready for any query, their crawl needs to be broad and general. Rapidly expanding – the arrows. Bottom portion: crawled and cached for indexing by search-engine crawlers. Right portion: crawl just what you want. Theoretically the same percentage and effort, but more choosy – no .com, for example.

16 Focused Crawling
[Figure: a breadth-first crawl (nodes 1-7) compared with a focused crawl that expands only toward relevant pages, marked R]
We crawl the Web to find materials on Chebyshev Polynomials – not ALL of them (unlike traditional crawling); we skip irrelevant parts of the Web. Here is a simple picture. Recall that the web crawl is a tree, because we do not revisit nodes. Now we add relevance (R). The point of a focused crawl is that it is more efficient at finding what you want. A normal crawl is shown on the left. Suppose node 7 is about our topic: this is a relevance judgement. There is no such thing as "relevance" for a search-engine crawler; it has to find and index everything. (Although focused search engines have recently undergone development.) So that's enough on crawling and focused crawling, and the relationship between crawling and search engines … now on to collection synthesis. How do we get a cluster of related items?

17 Collections and Clusters
Traditional – the document universe is divided into clusters, or collections
Each collection represented by its centroid
Web – the size of the document universe is effectively infinite
Agglomerative clustering is used instead
Two aspects: a collection descriptor, and a rule for when items belong to that collection
In classic information retrieval, there was a document universe that could be divided into collections C1, C2, … Cn. Represent the documents as term vectors in N-space, where N is the size of the dictionary (i.e. the English language). Compute the center of mass for each collection (the "centroid"). Why? To find an item, you can match it against the centroids first. This approach does not work for the Web. So we go the reverse way – start with collection descriptors and then agglomerate by crawling. Use centroids to describe the desired collections and then put items into them as they are encountered on the Web. Here is a picture of how collections would grow …

18 [Figure: a downloaded document matched against two centroid term vectors, with correlations Q = 0.2 and Q = 0.6]
Once you have the centroids, or semantic descriptions of the desired collections, the crawl does the rest. This picture illustrates matching HTML documents against the various centroid term vectors (two are shown here). Each document has a different correlation (here, Q) with each vector. Put the document into one or more collections where it is sufficiently close to the centroid. This illustrates how a crawl helps build collections, but how is it really done?

19 The Setup
1. A virtual collection on a particular topic ("Chebyshev Polynomials"). It starts off as a collection with nothing in it.
[Figure: an empty virtual collection of items about Chebyshev Polynomials]

20 Adding a Centroid
We add a centroid for this collection:
The most important terms for this subject area (20-40 or so)
Weighted by how important they are relative to each other (using IDF and the like)
The centroid is a vector in term space (there is a thing in computer science known as "the curse of dimensionality" – I reduce the dimensionality up front)
Distance to the centroid is cosine-based: cosine(centroid, document) is what gives the 0.2 and 0.6 shown on the earlier slide
[Figure: an empty collection of items about Chebyshev Polynomials, now with a centroid attached]

21 Document Vector Space
Classic information retrieval technique
Each word is a dimension in N-space
Each document is a vector in N-space
Example: <0, 0.003, 0, 0, .01, .984, 0, .001>
Normalize the weights
Both the "centroid" and the downloaded document are term vectors
N is the number of words in your dictionary of interesting words. The dictionary is ALWAYS finite, just like N-space. For us, it is the union of the terms occurring in the centroids. In the example, we pretend to have a dictionary of 8 words (or concepts); the document shown here is mostly about the sixth concept. It is also possible to use phrases in addition to words. For example, "vector space" is more precise in meaning than either word alone would be. A document is turned into a vector of weights, with the i-th weight being the "importance" of the i-th term (or word phrase) in our dictionary. So "closest" means the smallest angle between the document vector and the centroid vector. We use the cosine of the angle. (A small sketch of this follows.)
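As a concrete illustration, here is a minimal Python sketch (not the project's actual code) of turning text into a normalized term vector over a fixed dictionary and computing the cosine used as the correlation Q. The dictionary, the toy texts, and the function names are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical dictionary: the union of terms appearing in the centroids.
DICTIONARY = ["chebyshev", "polynomial", "recurrence", "orthogonal",
              "approximation", "interpolation", "cosine", "degree"]

def term_vector(text):
    """Turn raw text into a normalized weight vector over DICTIONARY."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w in DICTIONARY)
    vec = [counts[t] for t in DICTIONARY]          # raw term frequencies
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def cosine(u, v):
    """Cosine of the angle between two vectors (both already normalized)."""
    return sum(a * b for a, b in zip(u, v))

# Toy example: a "centroid" and a downloaded document.
centroid = term_vector("chebyshev polynomial orthogonal polynomial recurrence")
document = term_vector("Chebyshev polynomials satisfy a three-term recurrence.")
print(round(cosine(centroid, document), 3))  # correlation Q with the centroid
```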

22 Agglomerate
Now we simply crawl. Each document that is encountered is compared with each of our centroids and added to a collection if it is sufficiently close. If we put a limit on collection size (20 items, say) then the "ball" around the centroid gets smaller and smaller as further-out items are replaced by items closer in. How fast the ball shrinks is not particularly important, but I'll have a slide later on that shows empirical results. At the end we want a small ball, implying all items have a high cosine correlation with the centroid, which (we hope) represents our subject area well. Recall is not important; precision is. Later on I will show a slide of empirical results about precision. One more thing – since automated library-building techniques are all about efficiency, why not do many topics at once, for example all mathematics topics? Doing 20 separate crawls to assemble D1…D20 is MUCH more work than doing one crawl and, for each downloaded document, computing its correlation with each of the 20 centroids. (See the sketch below.)
[Figure: a collection with 3 items about Chebyshev Polynomials]
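A minimal sketch of this one-crawl, many-centroids agglomeration step, reusing the hypothetical term_vector and cosine helpers from the earlier sketch; the collection-size limit and threshold values are illustrative, not the project's actual settings.

```python
import heapq

MAX_SIZE = 20        # illustrative cap on collection size
THRESHOLD = 0.3      # illustrative minimum correlation to consider a page

def assign(url, text, centroids, collections):
    """Compare one downloaded page against every centroid; keep the best MAX_SIZE."""
    doc = term_vector(text)                       # from the earlier sketch
    for topic, centroid in centroids.items():
        q = cosine(centroid, doc)
        if q < THRESHOLD:
            continue
        heap = collections.setdefault(topic, [])  # min-heap of (q, url)
        if len(heap) < MAX_SIZE:
            heapq.heappush(heap, (q, url))
        elif q > heap[0][0]:
            # the "ball" shrinks: a closer item replaces the furthest-out one
            heapq.heapreplace(heap, (q, url))

# During the crawl, call assign() once per downloaded document;
# at the end, each collections[topic] holds the top-correlated URLs.
```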

23 Where does the Centroid come from?
One remaining problem – where does the centroid come from? We go from a topic to a nice weighted term vector that describes the collection. For this we leverage search-engine technology…
[Figure: the topic phrase "Chebyshev Polynomials" is turned into a really good centroid for a collection about Chebyshev Polynomials]

24 Building a Centroid
1. Google("Chebyshev Polynomials") → url1, url2, …
2. Let H be a hash (k, v) where k = word, v = frequency
3. For each url in {url1, url2, …} do
     D ← download(url)
     V ← term_vector(D)
     For each term t in V do
       If t not in H, add it with value 0
       H(t)++
4. Compute tf-idf weights. C ← top 20 terms (by weight).
We leverage search-engine technology – they have inverted indices, page ranks, have done massive crawls, and know which words are important. Why not build a centroid out of a search result? Recall that the first few hits are pretty good. We can use a search engine to map the topic into n URLs (n about 5 or so). This uses "page scraping"; in the case of Google there is an API that is useful. Once the n top hits have been counted, multiply tf by idf. We get the inverse document frequencies from Berkeley: send them a query with all terms t in H, get back their idf values. Very handy. Keep the best terms as the centroid. (A sketch in Python follows.)
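A hedged Python sketch of this centroid-building recipe. The search step and the idf lookup are stubbed out (the Google API and the Berkeley document-frequency service mentioned in the notes are not reproduced here); the helper names and sample data are invented for illustration.

```python
import re
from collections import Counter
from math import log

def build_centroid(search_hits, idf, top_k=20):
    """search_hits: list of page texts returned for the topic query (stubbed here).
    idf: dict mapping term -> inverse document frequency (e.g. from an external service).
    Returns the top_k (term, weight) pairs - the centroid for the new collection."""
    tf = Counter()
    for text in search_hits:
        for term in re.findall(r"[a-z]+", text.lower()):
            tf[term] += 1                                       # step 3: term frequencies
    weights = {t: f * idf.get(t, 0.0) for t, f in tf.items()}   # step 4: tf-idf
    return sorted(weights.items(), key=lambda kv: -kv[1])[:top_k]

# Illustrative stand-ins for the search results and idf values:
hits = ["Chebyshev polynomials are orthogonal polynomials ...",
        "The Chebyshev recurrence relates polynomials of successive degree ..."]
idf = {"chebyshev": log(1000 / 3), "orthogonal": log(1000 / 50),
       "polynomials": log(1000 / 120), "recurrence": log(1000 / 80)}
print(build_centroid(hits, idf, top_k=5))
```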

25 Dictionary
Given centroids C1, C2, C3, …, the dictionary is C1 + C2 + C3 + …
Terms are the union of the terms in the Ci
Term frequency is the total frequency across the Ci
Document frequency is how many Ci contain t
Term IDF is based on Berkeley's document frequencies
A dictionary is needed during the crawl – as downloaded documents are turned into term vectors, we keep only those words that are in the dictionary. (Remember, precision trumps recall; if we miss some documents, that is OK.) So we don't use a thesaurus – though you might want to use one when describing the resulting collections. Our space of words for a couple dozen centroids (20-40 terms each) is small enough to classify documents on the fly. (A small sketch follows.)
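A small sketch of building that dictionary from the centroids, again using invented names; it simply unions the centroid terms and records how many centroids contain each term.

```python
from collections import Counter

def build_dictionary(centroids):
    """centroids: dict mapping topic -> list of (term, weight) pairs.
    Returns (terms, doc_freq): the union of centroid terms, and for each
    term the number of centroids that contain it."""
    doc_freq = Counter()
    for pairs in centroids.values():
        for term in {t for t, _ in pairs}:   # count each term once per centroid
            doc_freq[term] += 1
    return sorted(doc_freq), dict(doc_freq)

centroids = {"chebyshev polynomials": [("chebyshev", 4.1), ("recurrence", 2.0)],
             "plane geometry": [("triangle", 3.2), ("recurrence", 0.4)]}
terms, df = build_dictionary(centroids)
print(terms)   # ['chebyshev', 'recurrence', 'triangle']
print(df)      # per-term counts, e.g. recurrence appears in 2 centroids
```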

26 Collection “Evaluation”
Assume higher correlations are good
With human relevance assessments, one can also compute a "precision" curve
Precision P(n) after considering the n most highly ranked items is the number of relevant items divided by n
Intuitively, the tighter the ball around the centroid, the better the collection will be. But high correlation is not necessarily the same as high precision. Precision curves are a good way to evaluate the collection that results; unfortunately this requires human input. I'll show some in a minute. Basically you want two things: good centroids that will attract the "right stuff", and good correlation metrics, so that the higher the correlation, the more likely the item is to be useful to this collection. Other evaluation measures: harvest rate – how fast you get relevant documents (shown earlier); frontier size – a balance between a frontier full of junky links and no links at all; processing rate – documents/second (all crawls slow down over time). A small sketch of the P(n) computation follows.
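A minimal sketch of the P(n) computation described above, given human relevance judgements for a ranked list; the example judgements are made up.

```python
def precision_curve(relevant_flags):
    """relevant_flags: list of booleans, ordered by rank (rank 1 first).
    Returns P(n) for n = 1..len(relevant_flags)."""
    curve, hits = [], 0
    for n, rel in enumerate(relevant_flags, start=1):
        hits += rel
        curve.append(hits / n)
    return curve

# Illustrative judgements for the top 6 items of a synthesized collection:
print([round(p, 2) for p in precision_curve([True, False, True, True, False, True])])
# -> [1.0, 0.5, 0.67, 0.75, 0.6, 0.67]
```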

27 [Figure: precision curves for a plane-geometry collection, cutoff = 0, threshold = 0.3]
Here are some precision curves. The subject is plane geometry. The crawl was 4 minutes, and we are looking at the top 35 documents; we also consider the top 40 search results from Google. The crawl looked at 5785 documents in all that had a correlation of at least 25% with the nearest centroid. What I like about this plot is that it demonstrates how good a search engine is for getting a few results. But to build up a modest-sized collection, you really want to do a crawl. You should be able to keep your collection relatively pure. Extending this crawl for a longer time and inspecting the same collection afterwards (first 41 hits from Google, top 50 from the crawl) …

28 Precision is a good way of estimating the value of rank as a way of predicting relevance. Here, rank is simply the correlation value for the collection synthesis crawl (highest correlation has rank 1), and search result order for the Google search (first hit is rank 1). Here is a chart for one of the classes (one about geometry) where relevance judgements were assigned to each document in the result set (41 results for Google, 53 results for the Collection Synthesis crawl). I think this was for a five-hour crawl. In both cases, in this particular class, both collection synthesis and Google had a relevant document in the rank 1 position. Google had relevant documents in the first 6, whereas the second-highest-ranking document in the synthesized collection was not relevant (precision falls to .5). In the long run, however, Google’s results get more and more irrelevant, while collection synthesis hovers just above .5. We hope to improve even this by adding some page characterization to our crawling, to weed out course lists, calls for papers, and the like.

29 Tunneling with Cutoff
Nugget – dud – dud … – dud – nugget
Notation: 0 – X – X … – X – 0
Fixed cutoff: 0 – X1 – X2 – … – Xc
Adaptive cutoff: 0 – X1 – X2 – … – X?
The precision results were obtained with cutoff = 0 and threshold = 0.3. It was a bit unsatisfying that things always seemed to work best at cutoff = 0, no matter what the threshold. So we then explored replacing the fixed cutoff with an adaptive one. Tunneling through trash to get to a nugget – how far should we tunnel before giving up? This could vary between 0 and the diameter of the Web, and maybe it depends on what you've seen of the trash so far. To help us get a grip on this we looked at lots of paths that were generated by our crawls. A path is simply two nuggets (or a seed and a nugget) connected by duds. With a fixed cutoff, incomplete paths were where crawls stopped at Cutoff duds past the previous ancestor. With an adaptive cutoff, the "?" denotes a value that is computed somehow on the fly during the crawl. Some researchers have put a learner here to compute the value; our approach is more to figure out a formula to use, based on empirical observations: try to figure out what increases the likelihood of getting to a nugget even though you are looking at trash, and then how to compute that likelihood. To collect statistics, we set the threshold to 0.5 and the cutoff to 20. Nuggets are pages whose correlation is at least 0.5, so these settings should encourage crawls to wander away from nuggets. We looked at a 35,000-node crawl where path data was available. (A sketch of the cutoff logic follows.)
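A minimal sketch of the fixed-cutoff tunneling rule described above: during the crawl, each frontier entry carries the number of consecutive duds seen since the last nugget, and a path is abandoned once that count exceeds the cutoff. The threshold, cutoff, and helper names are illustrative, not the project's actual implementation.

```python
from collections import deque

THRESHOLD = 0.5   # correlation at or above this makes a page a "nugget"
CUTOFF = 20       # how many consecutive duds we tunnel through before giving up

def focused_crawl(seeds, score, links):
    """score(url) -> correlation with the nearest centroid (stubbed).
    links(url) -> outgoing URLs on that page (stubbed).
    Returns the set of nuggets found."""
    # Each frontier entry carries the number of duds since the last nugget (or seed).
    frontier = deque((url, 0) for url in seeds)
    seen, nuggets = set(seeds), set()
    while frontier:
        url, duds = frontier.popleft()
        if score(url) >= THRESHOLD:
            nuggets.add(url)
            duds = 0                      # reset the tunnel length at a nugget
        elif duds >= CUTOFF:
            continue                      # too far from the last nugget: stop tunneling
        else:
            duds += 1
        for child in links(url):
            if child not in seen:
                seen.add(child)
                frontier.append((child, duds))
    return nuggets
```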

30 Statistics Collected
500,000 documents
Number of seeds: 4
Path data for all but seeds
6620 completed paths (0-x…x-0)
100,000s of incomplete paths (0-x…x..)
500,000 downloaded documents, while only a small portion of the Web, is a reasonable size on which to collect statistics. The subject area was mathematics. There were 4 seeds: search4science, yahoo, archives.math.utk, and purplemath.com; at least one of these (yahoo) is in the largest SCC of the Web. A nugget is either a seed or a page whose cosine correlation is at least 0.50 with one of the centroids. Path data was collected by retaining parent information (except for seeds) during the crawl. For completed paths, seeds were considered to be "honorary" nuggets. Incomplete paths were not yet at the cutoff (20) when we stopped the crawl. There are 365,050 lines in the PathData file. The longest path, complete or incomplete, had 31 edges in it. There were more than 1 million URLs on the frontier. 6620 completed paths × about 7.5 nodes/path ≈ 50,000 nodes in the completed paths, so probably around 300,000 incomplete paths.

31 Nuggets that are x steps from a nugget
We got 2000 completed paths, almost all 1 link away. In this sample, starting from nuggets, you never had to go out more than 7 or 8 steps. The next slide shows the same data starting from the seeds (one of which was a nugget)

32 Nuggets that are x steps from a seed and/or a nugget
If we call seeds nuggets, then we get this picture. Most nuggets were one step from a nugget, though there was a significant chance of stumbling across a nugget about 7 steps out. This suggests that about 7 steps is a good maximum. The seeds were sometimes random (yahoo), sometimes related (search4science), sometimes a near-nugget, and once a nugget. By 7 or 8 links out from the seeds you have the bulk of the nuggets. (The idea behind choosing a cutoff of 20 was to give the crawl every chance of finding a nugget serendipitously. 20 is essentially "infinity", or at least the diameter of the Web.)

33 Better parents have better children.
We have been looking at the world in black-and-white terms. Here we look at nodes that aren't nuggets but are almost. Consider X-O where score(X) = .45-.5, just outside the ball around the centroid, and compare them with the population in general. Here is what we get: this just shows that higher-quality pages have higher-quality children on average. The X axis shows correlation buckets, from "1" (the lowest correlations) up to "17" (the highest; there are no nodes with correlation above .90). This suggests that the cutoff should take the parent's correlation into account.

34 Results
Details in our ECDL 2002 paper
Smaller frontier → more docs/second
More documents downloaded in the same time
Higher-scoring documents were downloaded
Cutoff of 20 averaged 7 steps at the cutoff
As we move away from a nugget, the average correlation decreases. The distance is no longer the number of hops. A cutoff of 20, which is now a distance rather than a number of hops, is pretty reasonable, and usually corresponds to about 7 hops, which as we saw before gets you most of the nuggets you are going to get. Performance – that is, the focus of the crawl – was much improved, as seen in the number of documents downloaded per second, a relatively smaller growth in frontier size, many more documents downloaded in the same time due to the efficiency of the focused crawl, and higher-scoring documents, too.

35 Fall 2002 Student Project
[Architecture diagram: a Query is turned into a Centroid; Mercator runs the focused crawl using the centroids, dictionary, and term vectors; the output is a set of Collection URLs plus an HTML Collection Description, illustrated here for Chebyshev Polynomials]
A project has just been completed (or almost completed) to tie all these concepts together into a well-packaged Java system that can hook on to Mercator but potentially to other crawlers as well. We don't have a good name for it yet, but here is what it looks like. Here we see just one collection represented, but in fact we had 31 topics in astronomy, from "astronomy comets hale-bopp" to "astronomy meteors torino effect". There is a general search-engine class that can be implemented for many search engines in the future. The output is intended to be a top-level HTML page that describes the whole astronomy collection, with links to an HTML page for each sub-collection. Natural language processing code can be inserted here; this is where we would eventually like to hook on automatic Dublin Core metadata generation as well. All this can be done on the fly in Palo Alto by having Mercator call the collection phase, passing it downloaded documents one by one. The collection phase passes back "instructions" concerning the page: is it a nugget? Should links be followed from this page? (Eventually, which links.) The query, centroid, and description synthesis are one-time steps: query and centroid synthesis happen when Mercator instantiates this code; description synthesis happens when Mercator terminates the crawl (time is up, whatever). Each package shown here also imports and exports XML. Thus you could run the first two phases and keep the XML for the centroids, and then start or resume a crawl from that. (Centroids and the dictionary don't change during an extended crawl; collections do.) A sketch of the per-page "instructions" interface follows.
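The notes describe the collection phase as a callback that Mercator invokes once per downloaded page and that answers two questions: is the page a nugget, and should its links be followed? Here is a hedged sketch of that interface; the actual student project was a Java system, and the class and field names here are invented, reusing the hypothetical term_vector and cosine helpers from the earlier sketch.

```python
from dataclasses import dataclass

@dataclass
class PageInstructions:
    """What the collection phase tells the crawler about one downloaded page."""
    is_nugget: bool        # add the page's URL to a collection?
    follow_links: bool     # keep expanding links from this page?

class CollectionPhase:
    def __init__(self, centroids, threshold=0.5, cutoff=20):
        self.centroids = centroids      # topic -> centroid term vector
        self.threshold = threshold      # correlation that makes a page a nugget
        self.cutoff = cutoff            # max duds to tunnel through
        self.collections = {}           # topic -> list of URLs

    def process(self, url, text, duds_so_far):
        """Called by the crawler for each downloaded document."""
        doc = term_vector(text)                        # from the earlier sketch
        best_topic, best_q = max(
            ((t, cosine(c, doc)) for t, c in self.centroids.items()),
            key=lambda tq: tq[1])
        if best_q >= self.threshold:
            self.collections.setdefault(best_topic, []).append(url)
            return PageInstructions(is_nugget=True, follow_links=True)
        return PageInstructions(is_nugget=False,
                                follow_links=duds_so_far < self.cutoff)
```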

