More on Document Similarity and Clustering
How similar are these two documents (again)?
Are these two documents about the same topic?

Reprise
• Vector Model of IR
• Mapping onto a space
• Distance between documents

Objectives for this lecture
• The cluster hypothesis
• Clustering Methods
• Non-Text Retrieval

Vector Model
[Figure: a vector with one position per word; labelled positions include Sunderland, Football and Club]

Vector Model Implementation
[Figure: the same word vector (Sunderland, Football, Club) shown as stored values]

Query/Document Match
[Figure: a Query vector set against a Document vector, position by position]

Two sorts of vector model
• Full Model
  – uses counts of terms in documents rather than just whether they appear or not
  – in fact it uses weights of terms to reflect their importance
  – Inverse Document Frequency and across-collection Term Frequency
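The weighting idea in the Full Model can be sketched in a few lines. This is a minimal illustration, not the lecture's own formula: it assumes the common variant weight = term count × log(N / document frequency), and the function name `tf_idf_vectors` and the toy documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return (vocabulary, weight vectors) for a list of tokenised documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Assumed weighting: raw count * log(N / df). Rare terms score higher.
        vectors.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

# Hypothetical toy collection.
docs = [
    ["sunderland", "football", "club"],
    ["football", "club", "football"],
    ["sunderland", "city", "council"],
]
vocab, vectors = tf_idf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 2) for w in vec])
```

A term that occurs in every document gets weight 0 under this variant, which is one way of saying that it does not help to tell documents apart.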

Similarity
• Documents and Queries are similar if documents have entries in query word positions
• Very good documents will have high counts in query positions, especially of infrequently occurring query terms
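One way to turn "entries in query word positions" into a number is the cosine of the angle between the query vector and the document vector. The slide does not name a specific measure, so cosine similarity is an assumption here, as are the word positions and the counts in the example.

```python
import math

def dot(u, v):
    """Sum of products of matching positions."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)

# Hypothetical positions: [sunderland, football, club, city]
query = [1, 1, 0, 0]
doc_a = [2, 3, 1, 0]   # has counts in both query positions
doc_b = [0, 0, 1, 4]   # has nothing in the query positions

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

Here `doc_a` scores well above `doc_b`, because only `doc_a` has entries where the query does. Using weighted vectors instead of raw counts would additionally favour infrequently occurring query terms.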

Cluster Hypothesis
• Closely associated documents tend to be relevant to the same request

Clustering Methods
• Inside Out (Bottom Up)
• Outside In (Top Down)
• Both these methods are hierarchical
• Non-hierarchical clustering is also possible

Inside Out (Bottom Up)
• Minimum spanning tree
• Cluster each document with its nearest neighbour
• Merge with the nearest cluster
• Repeat until “good enough” or sufficiently few clusters
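The bottom-up steps above can be sketched as a small merge loop. This is an illustration under stated assumptions, not the lecture's implementation: documents are stood in for by 2-D points, distance is Euclidean, "nearest cluster" is taken to mean the smallest distance between any two members (single link), and "sufficiently few clusters" is a `target_clusters` parameter invented for the example.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def single_link(c1, c2, points):
    """Distance between clusters = distance of their two closest members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def bottom_up_clusters(points, target_clusters):
    """Start with one cluster per point; repeatedly merge the nearest pair."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > max(target_clusters, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters

# Hypothetical document positions in a 2-D space.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(bottom_up_clusters(points, 3))
print(bottom_up_clusters(points, 2))
```

Recording the distance at which each merge happens gives the heights needed to draw a dendrogram. The loop compares every pair of clusters on every pass, which is fine for a handful of points but too slow for a real collection.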

Bottom-up Clustering: dendrogram
[Figure: dendrogram with a Similarity axis and documents d1 to d11 as leaves (the leaf labels are cut off in the transcript). After Chakrabarti 2003]

Outside In (Top Down)
• Divide the whole space in two
• Divide each subpart in two
• Repeat
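The divide-and-repeat idea can be sketched as a recursive split. The slide does not say how the space is divided, so the rule used here is an assumption: pick the two points that are farthest apart and send every other point to whichever of the two is closer. The stopping rule (`max_size`) and all names are likewise made up for the illustration.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def split_in_two(points, ids):
    """Split ids into two groups around the two mutually farthest points."""
    # Assumed seeding rule: the pair of points with the largest distance.
    seed_a, seed_b = max(
        ((i, j) for i in ids for j in ids if i < j),
        key=lambda pair: euclidean(points[pair[0]], points[pair[1]]),
    )
    left, right = [], []
    for i in ids:
        d_a = euclidean(points[i], points[seed_a])
        d_b = euclidean(points[i], points[seed_b])
        (left if d_a <= d_b else right).append(i)
    return left, right

def top_down_clusters(points, ids=None, max_size=2):
    """Divide the whole set in two, then each part in two, and repeat."""
    if ids is None:
        ids = list(range(len(points)))
    if len(ids) <= max(max_size, 1):
        return [ids]
    left, right = split_in_two(points, ids)
    if not left or not right:  # all points coincide; nothing left to divide
        return [ids]
    return (top_down_clusters(points, left, max_size)
            + top_down_clusters(points, right, max_size))

# Hypothetical document positions in a 2-D space.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(top_down_clusters(points, max_size=2))
```

On this toy data the first split separates the outlying point from the rest, and the second split separates the two remaining pairs.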

Outside in Clustering

Outside In (2)

Outside In (3)

Term Vectors as Feature Vectors
• Documents don’t have to be text
• Vectors don’t have to be term vectors
• Term Vectors are a sort of feature vector
• Features might be:
  – Colour
  – Melody
  – Shape
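As a sketch of a non-text feature vector, colour can be captured by counting how many pixels of an image fall into each coarse colour bin. The binning scheme, the function names and the tiny hand-made "images" below are all assumptions for illustration; the resulting vectors are compared in the same way as term vectors.

```python
import math

def colour_histogram(pixels, bins_per_channel=2):
    """Count pixels per coarse RGB bin; pixels are (r, g, b) values in 0-255."""
    hist = [0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each channel value 0-255 to a bin number 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1
    return hist

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical four-pixel "images".
mostly_red = [(250, 10, 10)] * 3 + [(10, 10, 250)]
also_red = [(200, 30, 30)] * 4
mostly_blue = [(10, 10, 250)] * 4

h_red = colour_histogram(mostly_red)
print(h_red)
print(cosine_similarity(h_red, colour_histogram(also_red)))
print(cosine_similarity(h_red, colour_histogram(mostly_blue)))
```

The two reddish images come out more similar to each other than to the blue one, so the clustering methods described for documents could be applied to these vectors unchanged.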

Conclusions
• What a cluster is and why it might be useful
• How clusters could be formed
• How the vector model might be used in non-textual domains

Reading
• Soumen Chakrabarti, Mining the Web, Morgan Kaufmann Publishers, 2003 (p. 84 onwards)