Thesaurus Creation / Term Clustering

Presentation transcript:

Thesaurus Creation / Term Clustering
Two major applications:
1. Query expansion: fleshing out sparse queries with related words improves recall (at the possible expense of reduced precision).
2. Term-set dimensionality reduction: similar outcome with a smaller model.
CS466-10

Term Dimensionality Reduction
Query: Water Spaniel Diseases
Original term dimensions: Spaniel, Spaniels, disease, diseases, collie, illness, ant, ...
Original vector VQ: 1 0 0 1 0 0 0 0
Reduced dimensions (term clusters): DOG (Spaniel, Spaniels, Collie, Poodle, ...), ILL (disease, diseases, illness, illnesses, sickness, ...), INSECT (ant, ...)
Reduced vector VQ': DOG = 1, ILL = 1
Problem: reduced flexibility in partial weighting of the synonym set. Synonyms get as much weight as the original term.
Equivalent to query expansion when the weight of every synonym i is 1.
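
A minimal sketch of the reduction above, assuming a hand-built term-to-cluster table modeled on the slide's DOG / ILL / INSECT example. The names TERM_TO_CLUSTER, CLUSTERS and reduce_vector are illustrative, not from the slides. It also shows the stated problem: a synonym-only text maps to the same reduced vector as the original query.

```python
# Hypothetical term -> cluster table, modeled on the slide's example.
TERM_TO_CLUSTER = {
    "spaniel": "DOG", "spaniels": "DOG", "collie": "DOG", "poodle": "DOG",
    "disease": "ILL", "diseases": "ILL", "illness": "ILL",
    "illnesses": "ILL", "sickness": "ILL",
    "ant": "INSECT",
}
CLUSTERS = ["DOG", "ILL", "INSECT"]


def reduce_vector(terms):
    """Map a bag of terms onto binary cluster dimensions.

    Terms that belong to no cluster (e.g. "water" here) are dropped.
    """
    present = {
        TERM_TO_CLUSTER[t.lower()]
        for t in terms
        if t.lower() in TERM_TO_CLUSTER
    }
    return [1 if cluster in present else 0 for cluster in CLUSTERS]


query_vector = reduce_vector(["Water", "Spaniel", "Diseases"])
synonym_vector = reduce_vector(["Collie", "sickness"])
print(query_vector, synonym_vector)
```

Both calls give the same vector over (DOG, ILL, INSECT), so the reduced model cannot weight "Spaniel" above "Collie".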

Query Expansion
Query: Water Spaniel Diseases
Terms: Water, Spaniel, diseases, Spaniels (stem), disease (stem), illness (syn), collie (syn), ant
Original vector: 1 1 1 0 0 0 0 0
Expanded vector: 1 1 1 weight1 weight2 weight3 weight4 0
The weight of added term i is set according to semantic dist(wi, t).
Related document set:
D1: Water Spaniels
D2: Water Spaniel illnesses
D3: Collie diseases
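
A rough sketch of weighted query expansion on the slide's document set. The weight values (0.9, 0.6, 0.3) and the related-term table are assumptions made up for illustration; the slides only say that the weights depend on semantic distance. The function names expand_query and score are likewise hypothetical.

```python
def expand_query(query_weights, related):
    """Return a copy of the query with related terms added at reduced weight.

    `related` maps a query term to a list of (related_term, weight) pairs.
    """
    expanded = dict(query_weights)
    for term in query_weights:
        for rel_term, weight in related.get(term, []):
            # Never lower the weight of a term that is already present.
            expanded[rel_term] = max(expanded.get(rel_term, 0.0), weight)
    return expanded


def score(doc_terms, weights):
    """Sum the query weights of the distinct terms in a document."""
    return sum(weights.get(t, 0.0) for t in set(doc_terms))


query = {"water": 1.0, "spaniel": 1.0, "diseases": 1.0}
# Assumed weights: stem variants close to 1, synonyms lower.
related = {
    "spaniel": [("spaniels", 0.9), ("collie", 0.3)],
    "diseases": [("disease", 0.9), ("illness", 0.6), ("illnesses", 0.6)],
}
docs = {
    "D1": ["water", "spaniels"],
    "D2": ["water", "spaniel", "illnesses"],
    "D3": ["collie", "diseases"],
}

expanded = expand_query(query, related)
for name, terms in docs.items():
    print(name, score(terms, query), round(score(terms, expanded), 2))
```

With these assumed weights every document scores higher after expansion, which is the recall gain; D3 now also gets credit for "collie", which is where precision can suffer.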

Query Expansion
Query: Water Spaniel diseases
document1: … water spaniels ….. …

Simplest Term Clustering
Stemming is a clustering method.
Original term set: computing, computers, computation, compute, flies, flown, flew, houses
Clustered term set (after stemming): comput*, fly*, house
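
To illustrate stemming as clustering, here is a toy sketch that groups terms by a crude suffix-stripped stem. The suffix list and the function toy_stem are assumptions for illustration only, not a real stemming algorithm. Note that pure suffix stripping cannot merge irregular forms such as "flown" and "flew" into fly*; that needs an exception list or a lemmatizer.

```python
from collections import defaultdict

# Assumed suffix list, checked in this order (longest first).
SUFFIXES = ("ation", "ing", "ers", "es", "e", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def cluster_by_stem(terms):
    """Group terms whose toy stems are identical."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[toy_stem(term)].append(term)
    return dict(clusters)


terms = ["computing", "computers", "computation", "compute",
         "flies", "flown", "flew", "houses"]
for stem, members in cluster_by_stem(terms).items():
    print(stem + "*", members)
```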

Another simple clustering method: pre-existing thesauri (e.g. Roget's)
Term equivalence classes:
illness → disease, sickness, unwell, sick, ill, … (different pos)
PhD → Ph.D, PhD, Phd, Ph.D., ….
Loosely related topic sets:
DOG → Spaniel, Collie, Schnauzer, bulldog, Poodle, …. (same part of speech (pos))

Term Clustering
Non-hierarchical methods: single pass (Salton, '71)
Given a clustering threshold/target size and a similarity function sim(i, j):
1. Pick a random document Dj.
2. Assign a document di to cluster Cj if sim(Dj, di) is within the threshold, and recalculate the centroid; else create a new cluster Ck with di as its centroid.
3. Exclude di from the document list.
4. Repeat until the document list is empty.
[Figure: documents plotted as points, with D1, D2 and D3 marked.]
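
The single-pass steps above can be sketched as follows. This assumes cosine similarity, and assumes a document joins the most similar existing cluster when its similarity to that cluster's centroid is at least the threshold; the slides do not fix either choice. All function names are illustrative.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def single_pass_cluster(docs, threshold, seed=0):
    """Cluster document vectors in one pass; returns lists of document indices."""
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)  # documents are visited in random order
    clusters, centroids = [], []
    for i in order:
        best, best_sim = None, threshold
        for k, center in enumerate(centroids):
            sim = cosine(docs[i], center)
            if sim >= best_sim:  # assumption: higher similarity means closer
                best, best_sim = k, sim
        if best is None:
            # No cluster is similar enough: start a new one with this document.
            clusters.append([i])
            centroids.append(list(docs[i]))
        else:
            clusters[best].append(i)
            centroids[best] = centroid([docs[j] for j in clusters[best]])
    return clusters


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(single_pass_cluster(docs, threshold=0.8))
```

Because each document is seen once and centroids move as members are added, the result can depend on visiting order; the seed parameter only makes a run repeatable.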

Types of Clustering Behavior/Criterion: Sim(ti, tj)
Document level: co-occurrence in the same document
Verb-object:
Syntagmatic similarity (terms appear together in a region): sim(drink, wine), sim(eat, meat), sim(drink, water)
Paradigmatic similarity (terms appear as objects of the same verb): sim(wine, water), sim(wine, drink), sim(wine, meat); based on objects of "drink" or of all verbs
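
A small sketch contrasting the two verb-object criteria, assuming a made-up list of (verb, object) pairs. Syntagmatic similarity is taken here as the raw co-occurrence count of a verb with an object, and paradigmatic similarity as the cosine between two objects' verb-count vectors; both measures are illustrative choices, not definitions from the slides.

```python
import math
from collections import Counter, defaultdict

# Hypothetical verb-object observations.
pairs = [
    ("drink", "wine"), ("drink", "wine"), ("drink", "water"),
    ("pour", "wine"), ("pour", "water"),
    ("eat", "meat"), ("eat", "bread"),
]

cooccurrence = Counter(pairs)

# For each object, count the verbs it was seen with (its "context").
contexts = defaultdict(Counter)
for verb, obj in pairs:
    contexts[obj][verb] += 1


def syntagmatic(verb, obj):
    """How often the verb and the object appear together."""
    return cooccurrence[(verb, obj)]


def paradigmatic(a, b):
    """Cosine similarity of the verb-count vectors of two objects."""
    va, vb = contexts[a], contexts[b]
    dot = sum(count * vb[verb] for verb, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


print(syntagmatic("drink", "wine"), syntagmatic("drink", "meat"))
print(round(paradigmatic("wine", "water"), 3), paradigmatic("wine", "meat"))
```

In this toy data "wine" and "water" never co-occur with each other, yet they score high paradigmatically because they share the verbs "drink" and "pour".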

N-gram:
Syntagmatic similarity (terms occur together): sim(Hong, Kong), sim(soap, opera), sim(soap, suds)
Paradigmatic similarity (terms occur in the same context): sim(opera, suds), sim(tall, short), sim(long, short), sim(Hong, Kong)
Example contexts: soap + {opera, suds, residue}; {Ivory, Dial, Lye} + soap