Collection Synthesis
Donna Bergmark, Cornell Digital Library Research Group
March 12, 2002

Collection – what is it?
- For a digital library, it could be a set of URLs
- The documents pointed to are about the same topic
- They may or may not be archived
- They may be collected by hand or automatically

Collections and Clusters
- Clusters are collections of items
- The items within a cluster are closer to each other than to items in other clusters
- Many statistical methods exist for cluster identification
- If clusters are pre-existing, then collection synthesis is a "classification problem"

The Document Vector Space
- Classic approach in IR

Document Vector Space Model
- Classic "Saltonian" theory
- Originally based on collections
- Each word is a dimension in N-space
- Each document is a vector in N-space
- Best to use normalized weights
- Example:
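The example on the original slide was an image; as a stand-in, here is a minimal sketch of the model with a hypothetical four-term vocabulary (the documents and terms below are illustrative, not from the slide):

```python
import math

vocabulary = ["algebra", "equation", "graph", "polynomial"]

def to_vector(text, vocab):
    """One component per vocabulary term: here, raw term counts."""
    words = text.split()
    return [words.count(term) for term in vocab]

def normalize(v):
    """Normalized weights, as the slide recommends (unit length)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

doc = "algebra equation equation graph"
vec = to_vector(doc, vocabulary)   # [1, 2, 1, 0]
unit = normalize(vec)              # same direction, length 1.0
```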

Distance in DV Space
- How similar are two documents, or a document and a query?
- Look at their vectors in N-space
- If there is overlap, the documents are similar
- If there is no overlap, the documents are orthogonal (i.e., totally unrelated)

Cosine Correlation
- Correlation ranges between 0 and 1
- 0 → nothing in common at all (orthogonal)
- 1 → all terms in common (complete overlap)
- Easy to compute
- Intuitive

Cosine Correlation
- Given vectors x and y, both consisting of real numbers x1, x2, …, xN and y1, y2, …, yN
- Compute the cosine correlation by:

  cos(x, y) = (x1*y1 + x2*y2 + … + xN*yN) / (sqrt(x1^2 + … + xN^2) * sqrt(y1^2 + … + yN^2))
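The formula on the original slide was an image; a minimal implementation of the cosine correlation just described, in plain Python:

```python
import math

def cosine(x, y):
    """Cosine correlation of two term vectors: dot(x, y) / (|x| * |y|).
    Returns 0.0 for a zero vector (nothing in common with anything)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    if nx == 0 or ny == 0:
        return 0.0
    return dot / (nx * ny)

print(cosine([1, 2, 1], [1, 2, 1]))  # identical vectors -> 1.0
print(cosine([1, 0], [0, 1]))        # orthogonal vectors -> 0.0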

The Dictionary
- Usual to keep a dictionary of actual words (or their stems)
- Allows efficient word lookup
- Common words are left out
- Stores each term's document frequency df(i)
- Stores each term's discrimination value idf(i)

Computing the Document Vector
- Download a document, extract its words, and look each one up in our dictionary
- For each word that is actually in the dictionary, compute a weight for it:

  w(i) = tf(i) * idf(i)
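A sketch of this weighting step. The idf form log(N/df) is an assumption, since the slide names idf but does not spell out its formula, and all data below is hypothetical:

```python
import math
from collections import Counter

def document_vector(words, dictionary, doc_freq, num_docs):
    """Weight each dictionary term in a document: w(i) = tf(i) * idf(i).
    idf(i) = log(N / df(i)) is one common choice (an assumption here;
    the slides do not give the exact formula). Words not in the
    dictionary are ignored."""
    tf = Counter(w for w in words if w in dictionary)
    return {
        term: count * math.log(num_docs / doc_freq[term])
        for term, count in tf.items()
    }

words = ["algebra", "algebra", "graph", "the"]   # "the" is not in the dictionary
dictionary = {"algebra", "graph"}
doc_freq = {"algebra": 10, "graph": 50}          # df(i) over a 100-document corpus
weights = document_vector(words, dictionary, doc_freq, num_docs=100)
# "algebra": tf=2, idf=log(10); "graph": tf=1, idf=log(2)
```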

Assembling a Collection
- Download a document
- Compute its term vector
- Add it to the collection it is most like, based on its vector and the collection's vector
- How do we get the collection vectors?
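The assembly step above can be sketched as a nearest-centroid assignment; the centroid names and vector values here are hypothetical:

```python
import math

def cosine(x, y):
    """Cosine correlation between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def best_collection(doc_vec, centroids):
    """Return the name of the centroid the document correlates with most."""
    return max(centroids, key=lambda name: cosine(doc_vec, centroids[name]))

# Hypothetical centroids over a shared 3-term vocabulary
centroids = {
    "basic-algebra": [0.9, 0.4, 0.0],
    "linear-algebra": [0.1, 0.2, 0.95],
}
doc = [0.0, 0.1, 0.99]
print(best_collection(doc, centroids))  # "linear-algebra"
```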

Collections: virtual to real

The Centroids
- "Centroid" is what I called the collection's document vector
- It is critical to the quality of the collection that is assembled
- Where do the centroids come from?
- How to weight the terms?

The Topic Hierarchy
0 Algebra
  1 Basic Algebra
    2 Equations
      3 Graphing Equations
    2 Polynomials
  1 Linear Algebra
    2 Eigenvectors/Eigenvalues
:

Building a seed URL set
- Given topic "T"
- Find hubs/authorities on that topic
- Exploit a search engine to do this
- How many results to keep? I chose 7; Kleinberg chooses 200
- Google does not allow automated searches without prior permission

Query: Graphing Basic Algebra…
- accessone.com/~bbunge/Algebra/Algebra.html
- library.thinkquest.org/20991/prealg/eq.html
- library.thinkquest.org/20991/prealg/graph.html
- sosmath.com/algebra/algebra.html
- algebrahelp.com/
- archives.math.utk.edu/topics/algebra.html
- purplemath.com/modules/modules.htm

Results: Centroids
- 26 centroids (from about 30 topics)
- Seed sets must have at least 4 URLs
- All terms from seed URL documents were extracted and weighted
- Kept the top 40 words in each vector
- The union of the vectors became our dictionary
- Centroid evaluation: 90% of seed URLs classified with "their" centroid
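A sketch of how such centroids could be built from seed documents, keeping the top-40 cut mentioned above; the helper names and weights are hypothetical:

```python
from collections import Counter

def build_centroid(seed_docs, top_n=40):
    """Build a centroid from seed documents: sum the term weights across
    all seed documents, then keep only the top_n highest-weighted terms
    (the slides keep the top 40 words in each vector)."""
    totals = Counter()
    for doc in seed_docs:          # each doc is a {term: weight} dict
        totals.update(doc)
    return dict(totals.most_common(top_n))

seeds = [
    {"algebra": 3.0, "equation": 1.0},
    {"algebra": 2.0, "graph": 0.5},
]
centroid = build_centroid(seeds, top_n=2)
# {"algebra": 5.0, "equation": 1.0} -- "graph" falls below the cut
```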

Three Knobs for Crawl Control
- "On topic": the downloaded page correlates with the nearest centroid at least Q, where 0 < Q <= 1.0
- Cutoff: how many off-topic pages to travel through before cutting off this search line? 0 <= Cutoff <= D
- Time limit: how many hours to crawl
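The three knobs can be sketched in a single crawl loop; `fetch` and `score` are hypothetical callbacks, and the default parameter values are illustrative, not the settings used in the actual crawls:

```python
import time

def focused_crawl(frontier, fetch, score, q=0.3, cutoff=2, time_limit_hours=5.0):
    """Focused crawl controlled by the three knobs:
      q       -- minimum correlation with the nearest centroid to count as on-topic
      cutoff  -- off-topic pages tolerated along a path before pruning that line
      time_limit_hours -- wall-clock bound on the whole crawl
    `frontier` holds (url, off_topic_run) pairs; fetch(url) returns
    (page, outlinks); score(page) returns the correlation with the
    nearest centroid."""
    deadline = time.time() + time_limit_hours * 3600
    collected = []
    while frontier and time.time() < deadline:
        url, off_topic_run = frontier.pop(0)
        page, outlinks = fetch(url)
        if score(page) >= q:        # on topic: keep it and reset the run
            collected.append(url)
            next_run = 0
        else:                       # off topic: extend the run
            next_run = off_topic_run + 1
        if next_run <= cutoff:      # prune this search line past the cutoff
            frontier.extend((link, next_run) for link in outlinks)
    return collected
```

With cutoff = 0 (the best value found below), any off-topic page immediately ends its search line.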

Results: Some Collections
- Built 26 collections in Math
- Keep the best-correlating URLs for each class
- The best Cutoff is 0
- I have crawled (for math) about 5 hours
- Some collections are larger than others

Collection "Evaluation"
- The only automatic evaluation method is the correlation value: how close an item is to the collection
- With human relevance assessments, one can also compute a "precision" curve
- Precision P(n) after considering the n most highly ranked items is the number of relevant items, divided by n
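A minimal sketch of P(n), with hypothetical rankings and relevance judgments:

```python
def precision_at(n, ranked, relevant):
    """P(n): fraction of the n most highly ranked items that are relevant."""
    top = ranked[:n]
    return sum(1 for item in top if item in relevant) / n

ranked = ["d1", "d2", "d3", "d4"]         # hypothetical ranking
relevant = {"d1", "d3"}                   # hypothetical human judgments
print(precision_at(2, ranked, relevant))  # 0.5 (d1 relevant, d2 not)
```

Plotting P(n) for n = 1, 2, …, N gives the precision curve mentioned above.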

Results: Class 14
- mathforum.org/dr.math/problems/keesha html
- mathforum.org/dr.math/problems/kmiller html
- mathforum.org/dr.math/problems/santiago html
:
- mtl.math.uiuc.edu/message_board/messages/326.html

Conclusions
- We are still working on the collections
- Picking parameters
- Will add machine learning

Discussion? Questions?