Collection Synthesis Donna Bergmark Cornell Digital Library Research Group March 12, 2002
Collection – what is it? For a digital library, it could be a set of URLs The documents pointed to are about the same topic They may or may not be archived They may be collected by hand or automatically
Collections and Clusters Clusters are collections of items The items within the cluster are closer to each other than to items in other clusters There exist many statistical methods for cluster identification If clusters are pre-existing, then collection synthesis is a “classification problem”
The Document Vector Space Classic approach in IR
Document Vector Space Model Classic “Saltonian” theory Originally based on collections Each word is a dimension in N-space Each document is a vector in N-space Best to use normalized weights Example:
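As a minimal sketch of the model (an illustration only, not the example shown on the original slide): each vocabulary word is one dimension, and a document becomes a vector of normalized term counts. The four-word `vocabulary` and the sample text are assumptions chosen for the illustration.

```python
import math
from collections import Counter

def document_vector(text, vocabulary):
    """Return a unit-length term-count vector with one dimension per vocabulary word."""
    counts = Counter(text.lower().split())
    raw = [float(counts[word]) for word in vocabulary]
    length = math.sqrt(sum(w * w for w in raw))
    # Normalize so that long and short documents are comparable.
    return [w / length for w in raw] if length > 0 else raw

# Hypothetical 4-word vocabulary, so N = 4 dimensions.
vocabulary = ["algebra", "equation", "graph", "matrix"]
doc = "graph the equation then solve the equation"
vector = document_vector(doc, vocabulary)
print(vector)
```

Words outside the vocabulary ("the", "then", "solve") simply do not contribute a dimension.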
Distance in DV Space How similar are two documents, or a document and a query? You look at their vectors in N-space If there is overlap, the documents are similar If there is no overlap, the documents are orthogonal (i.e., totally unrelated)
Cosine Correlation Correlation ranges between 0 and 1 (term weights are non-negative) 0: nothing in common at all (orthogonal) 1: all terms in common (complete overlap) Easy to compute Intuitive
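Cosine Correlation Given vectors x, y both consisting of real numbers x1, x2, … xN and y1, y2, … yN Compute cosine correlation by:
$$\cos(x, y) = \frac{\sum_{i=1}^{N} x_i y_i}{\sqrt{\sum_{i=1}^{N} x_i^2}\,\sqrt{\sum_{i=1}^{N} y_i^2}}$$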
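A minimal sketch of the computation, using the standard cosine formula (dot product divided by the product of the vector lengths). The function name `cosine_correlation` and the choice to return 0.0 for an all-zero vector are assumptions of this sketch.

```python
import math

def cosine_correlation(x, y):
    """Cosine of the angle between two equal-length vectors of real numbers."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Assumption: an empty (all-zero) vector has nothing in common with anything.
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

orthogonal = cosine_correlation([1.0, 0.0], [0.0, 1.0])       # no shared terms
parallel = cosine_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # same direction
```

If the vectors are already normalized to unit length, the denominator is 1 and the correlation reduces to the dot product.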
The Dictionary Usual to keep a dictionary of actual words (or their stems) Efficient word lookup Common words left out Their document frequency df(I) Their discrimination value idf(I)
Computing the Document Vector Download a document, get the words, look each one up in our dictionary For each word that is actually in the dictionary, compute a weight for it: W(I) = tf(I) * idf(I)
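The weighting step can be sketched as follows. The slides do not say how idf(I) is computed, so this sketch assumes the common form log(N / df(I)) over N documents; the helper names, the stop-word set, and the sample documents are likewise hypothetical.

```python
import math
from collections import Counter

def build_dictionary(documents, stop_words):
    """Map each non-stop word to its idf, assumed here to be log(N / df)."""
    n_docs = len(documents)
    df = Counter()
    for text in documents:
        # Count each word once per document to get its document frequency.
        df.update(set(text.lower().split()) - stop_words)
    return {word: math.log(n_docs / count) for word, count in df.items()}

def document_vector(text, dictionary):
    """Weight every dictionary word in the document: W(I) = tf(I) * idf(I)."""
    tf = Counter(w for w in text.lower().split() if w in dictionary)
    return {word: count * dictionary[word] for word, count in tf.items()}

documents = [
    "solve the linear equation",
    "graph the equation",
    "the matrix has eigenvalues",
]
dictionary = build_dictionary(documents, stop_words={"the"})
vec = document_vector("graph the equation equation", dictionary)
```

Words that are not in the dictionary (here the stop word "the") are skipped, so the vector stays sparse.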
Assembling a Collection Download a document Compute its term vector Add it to the collection it is most like, based on its vector and the collection’s vector How to get the collection vectors?
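A minimal sketch of the assignment step, assuming the collection vectors (centroids) are already available and that vectors are stored sparsely as word-to-weight mappings. The function names, topic vectors, and URL below are hypothetical.

```python
import math

def cosine(x, y):
    """Cosine correlation of two sparse vectors stored as {word: weight}."""
    dot = sum(w * y.get(term, 0.0) for term, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

def add_to_nearest_collection(url, vector, centroids, collections):
    """File the URL under the topic whose centroid it correlates with best."""
    best_topic = max(centroids, key=lambda topic: cosine(vector, centroids[topic]))
    score = cosine(vector, centroids[best_topic])
    collections.setdefault(best_topic, []).append((score, url))
    return best_topic, score

centroids = {
    "Graphing Equations": {"graph": 1.0, "equation": 0.5},
    "Eigenvectors/Eigenvalues": {"matrix": 0.8, "eigenvalue": 1.0},
}
collections = {}
topic, score = add_to_nearest_collection(
    "http://example.org/plots.html", {"graph": 2.0, "axis": 1.0}, centroids, collections
)
```

Storing the score with each URL makes it possible to rank a collection's members later.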
Collections: virtual to real
The Centroids “Centroid” is what I called the collection’s document vector It is critical to the quality of the collection that is assembled Where do the centroids come from? How to weight the terms?
The Topic Hierarchy
0 Algebra
  1 Basic Algebra
    2 Equations
      3 Graphing Equations
    2 Polynomials
  1 Linear Algebra
    2 Eigenvectors/Eigenvalues
  :
Building a seed URL set Given topic “T” Find hubs/authorities on that topic Exploit a search engine to do this How many results to keep? I chose 7; Kleinberg chooses 200. Google does not allow automated searches without prior permission
Query: Graphing Basic Algebra…
  Accessone.com/~bbunge/Algebra/Algebra.html
  Library.thinkquest.org/20991/prealg/eq.html
  Library.thinkquest.org/20991/prealg/graph.html
  Sosmath.com/algebra/algebra.html
  Algebrahelp.com/
  Archives.math.utk.edu/topics/algebra.html
  Purplemath.com/modules/modules.htm
Results: Centroids 26 centroids (from about 30 topics) Seed sets must have at least 4 URLs All terms from seed URL documents were extracted and weighted Kept the top 40 words in each vector Union of the vectors became our dictionary Centroid evaluation: 90% of seed URLs classified with “their” centroid
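The centroid construction can be sketched roughly as below. The slides do not state the term-weighting formula, so summed term frequency over the seed documents is used purely as a stand-in; the function names and sample seed texts are hypothetical, while the limits of 40 terms and 4 seed URLs come from the slide.

```python
from collections import Counter

TOP_TERMS = 40   # words kept per centroid vector
MIN_SEEDS = 4    # a topic needs at least this many seed documents

def build_centroid(seed_texts, top_terms=TOP_TERMS):
    """Weight all terms in the seed documents and keep the heaviest ones."""
    weights = Counter()
    for text in seed_texts:
        # Stand-in weighting (assumption): raw term frequency summed over seeds.
        weights.update(text.lower().split())
    return dict(weights.most_common(top_terms))

def build_centroids(seed_sets, min_seeds=MIN_SEEDS):
    """Return one centroid per topic with enough seeds, plus the shared dictionary."""
    centroids = {
        topic: build_centroid(texts)
        for topic, texts in seed_sets.items()
        if len(texts) >= min_seeds
    }
    # The dictionary is the union of the words in all centroid vectors.
    dictionary = set().union(*centroids.values())
    return centroids, dictionary

seed_sets = {
    "Polynomials": ["polynomial degree", "polynomial roots",
                    "factor polynomial", "polynomial terms"],
    "Equations": ["solve equation"],   # too few seeds, so no centroid
}
centroids, dictionary = build_centroids(seed_sets)
```

Dropping topics with too few seeds is one way that about 30 topics could yield only 26 centroids.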
Three Knobs for Crawl Control
  “On topic”: a downloaded page correlates with the nearest centroid at a level of at least Q, where 0 < Q <= 1.0
  Cutoff: how many off-topic pages to travel through before cutting off this search line? 0 <= Cutoff <= D
  Time limit: how many hours to crawl
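One way the three knobs might interact in a crawl loop is sketched below. This is not the crawler used for the results: the breadth-first order, the `fetch` and `score` callables, and the toy in-memory web are all assumptions of the sketch. `fetch(url)` is assumed to return `(vector, links)` or `None`, and `score(vector)` the correlation with the nearest centroid.

```python
import time
from collections import deque

def crawl(seeds, fetch, score, q=0.5, cutoff=0, time_limit_hours=1.0):
    """Breadth-first crawl that keeps on-topic pages and prunes off-topic paths."""
    deadline = time.monotonic() + time_limit_hours * 3600
    # Each frontier entry carries the number of consecutive off-topic pages
    # travelled through on the way to it.
    frontier = deque((url, 0) for url in seeds)
    seen = set(seeds)
    kept = []
    while frontier and time.monotonic() < deadline:
        url, off_topic_run = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue
        vector, links = page
        correlation = score(vector)
        if correlation >= q:
            kept.append((correlation, url))   # on topic: keep it, reset the run
            off_topic_run = 0
        else:
            off_topic_run += 1
            if off_topic_run > cutoff:
                continue                      # cut off this search line
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append((link, off_topic_run))
    return sorted(kept, reverse=True)

# Toy web: each "vector" is already a correlation value, so score is the identity.
pages = {"a": (0.9, ["b", "c"]), "b": (0.1, ["d"]), "c": (0.7, []), "d": (0.8, [])}
strict = crawl(["a"], pages.get, lambda v: v, q=0.5, cutoff=0)
lenient = crawl(["a"], pages.get, lambda v: v, q=0.5, cutoff=1)
```

With Cutoff = 0 the on-topic page "d" is never reached because it is only linked from the off-topic page "b"; with Cutoff = 1 the crawl travels through "b" and finds it.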
Results: Some Collections Built 26 collections in Math Keep only the best-correlating URLs for each class Best Cutoff is 0 I have crawled (for math) for about 5 hours Some collections are larger than others
Collection “Evaluation” The only automatic evaluation method is the correlation value, i.e., how close an item is to the collection's centroid With human relevance assessments, one can also compute a “precision” curve Precision P(n), after considering the n most highly ranked items, is the number of relevant items among them divided by n
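The precision curve follows directly from that definition. A minimal sketch, assuming the collection is already ranked by correlation and that the human judgments are available as a set of relevant URLs (the names `u1` to `u4` are placeholders):

```python
def precision_curve(ranked_urls, relevant):
    """Return [P(1), P(2), ...], where P(n) = relevant items in the top n, divided by n."""
    hits = 0
    curve = []
    for n, url in enumerate(ranked_urls, start=1):
        if url in relevant:
            hits += 1
        curve.append(hits / n)
    return curve

ranked = ["u1", "u2", "u3", "u4"]    # best-correlating item first
relevant = {"u1", "u3", "u4"}        # hypothetical human relevance assessments
curve = precision_curve(ranked, relevant)
```

For this ranking the curve is 1/1, 1/2, 2/3, 3/4: it dips at the one non-relevant item and then recovers.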
Results: Class 14
  Mathforum.org/dr.math/problems/keesha html
  Mathforum.org/dr.math/problems/kmiller html
  Mathforum.org/dr.math/problems/santiago html
  :
  Mtl.math.uiuc.edu/message_board/messages/326.html
Conclusions We are still working on the collections. Picking parameters. Will add machine learning. Discussion? Questions?