Published by Gaven Pasker. Modified over 10 years ago.
1
CS466-101 Thesaurus Creation / Term Clustering
Two major applications:
1. Query expansion: fleshing out sparse queries with related words improves recall (at the possible expense of reduced precision)
2. Term-set dimensionality reduction: similar outcome with a smaller model
2
CS466-102 Term Dimensionality Reduction
Query: Water Spaniel Diseases
Original vector V_Q
  terms:  Spaniel, Spaniels, disease, diseases, collie, illness, ant
  values: 1 0 0 1 0 0 0 0
Reduced vector V_Q'
  clusters: DOG, ILL, INSECT
  values:   1 1 0
Other cluster members shown: Collie, Poodle (DOG); illnesses, sickness (ILL)
Problem: reduced flexibility in partial weighting of the synonym set.
Synonyms get as much weight as the original term.
Equivalent to query expansion when the weight of every synonym i is 1.
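The reduction on this slide can be sketched as a lookup from terms to cluster labels. This is a minimal illustration only: the `TERM_TO_CLUSTER` map and the `reduce_vector` helper are hypothetical names, and the cluster memberships are assumed from the terms visible on the slide.

```python
# Hypothetical term-to-cluster map (assumed for illustration, not a real thesaurus).
TERM_TO_CLUSTER = {
    "spaniel": "DOG", "spaniels": "DOG", "collie": "DOG", "poodle": "DOG",
    "disease": "ILL", "diseases": "ILL", "illness": "ILL",
    "illnesses": "ILL", "sickness": "ILL",
    "ant": "INSECT",
}
CLUSTERS = ["DOG", "ILL", "INSECT"]


def reduce_vector(terms):
    """Map a bag of terms to a binary vector over the cluster labels."""
    present = set()
    for term in terms:
        label = TERM_TO_CLUSTER.get(term.lower())
        if label is not None:  # terms outside the map (e.g. "Water") are dropped
            present.add(label)
    return [1 if label in present else 0 for label in CLUSTERS]


query = ["Water", "Spaniel", "Diseases"]
document = ["Collie", "illnesses"]
print(reduce_vector(query))     # [1, 1, 0]
print(reduce_vector(document))  # [1, 1, 0]
```

The query and the "Collie illnesses" document collapse to the same reduced vector, which illustrates the stated problem: every member of a cluster counts as much as the original term.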
3
CS466-103 Query Expansion
Query: Water Spaniel Diseases
Terms:    Water, Spaniel, diseases, Spaniels, diseases, illness, collie, ant
Original: 1 1 1 0 0 0 0 0
Expanded: 1 1 1, then partial weights indexed 1, 2, 3, 4 for the added terms (labelled stem, syn, stem, syn), then 0
Weight i depends on the semantic distance dist(w_i, t).
Related document set:
  D1: Water Spaniels
  D2: Water Spaniel illnesses
  D3: Collie diseases
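A rough sketch of weighted query expansion against the three documents on this slide. The expansion weights (0.8 for stem variants, 0.5 for synonyms, 0.2 for a loosely related term) are assumed values chosen for illustration, and `expand` and `score` are hypothetical helper names.

```python
# Original query terms carry weight 1.
ORIGINAL = {"water": 1.0, "spaniel": 1.0, "diseases": 1.0}

# Assumed partial weights for expansion terms (not taken from the slide).
EXPANSIONS = {
    "spaniels": 0.8, "disease": 0.8,     # stem variants
    "illness": 0.5, "illnesses": 0.5,    # synonyms
    "collie": 0.2,                       # loosely related term
}


def expand(query, expansions):
    """Add expansion terms to a query; original terms keep their weight."""
    expanded = dict(expansions)
    expanded.update(query)
    return expanded


def score(query, doc_terms):
    """Sum the query weights of the distinct terms found in the document."""
    return sum(query.get(term, 0.0) for term in {t.lower() for t in doc_terms})


DOCS = {
    "D1": ["Water", "Spaniels"],
    "D2": ["Water", "Spaniel", "illnesses"],
    "D3": ["Collie", "diseases"],
}

expanded_query = expand(ORIGINAL, EXPANSIONS)
for name, terms in DOCS.items():
    print(name, score(ORIGINAL, terms), score(expanded_query, terms))
```

With these assumed weights, every document scores higher under the expanded query than under the original one, because each contains at least one expansion term.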
4
CS466-104 Query Expansion
Query: Water Spaniel diseases
document1: … water spaniels ….. …
5
CS466-105 Simplest Term Clustering
Stemming is a clustering method.
Original term set: computing, flies, computers, houses, computation, flown, compute, flew
Clustered term set (after stemming): house, comput*, fly*
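To make the idea concrete, here is a toy sketch that groups the slide's terms by stem. The suffix list and the irregular-form table are assumptions made for this example; `toy_stem` is a hypothetical helper and not a real stemmer such as Porter's.

```python
from collections import defaultdict

# Irregular forms that plain suffix stripping cannot handle (assumed lookup table).
IRREGULAR = {"flies": "fly", "flown": "fly", "flew": "fly"}

# Toy suffix list, tried in this order (longer suffixes first).
SUFFIXES = ["ation", "ing", "ers", "er", "es", "e", "s"]


def toy_stem(word):
    """Return a crude stem: irregular lookup first, then suffix stripping."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Keep at least four characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def cluster_by_stem(terms):
    """Group terms that share a stem; each stem names one cluster."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[toy_stem(term)].append(term)
    return dict(clusters)


terms = ["computing", "flies", "computers", "houses", "computation",
         "flown", "compute", "flew", "house"]
for stem, members in cluster_by_stem(terms).items():
    print(stem, members)
```

Nine original terms collapse into three clusters, so the term vocabulary shrinks while related forms are matched together.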
6
CS466-106 Another simple clustering method: pre-existing thesauri (e.g. Roget's)
Term equivalence classes:
  illness: disease, sickness (same part of speech (POS)); unwell, sick, ill (different POS), …
  PhD: Ph.D, PhD, Phd, Ph.D., ….
Loosely related topic sets:
  DOG: Spaniel, Collie, Schnauzer, bulldog, Poodle, ….
7
CS466-107 Term Clustering
Non-hierarchical methods: single pass (Salton, '71)
Given a clustering threshold/target size and a similarity function sim(i, j):
1. Pick a random document D_j.
2. Assign each document d_i whose similarity sim(D_j, d_i) passes the threshold to cluster C_j and recalculate the centroid; otherwise create a new cluster C_k with centroid d_k.
3. Exclude d_i from the document list.
4. Repeat until the document list is empty.
(Figure: documents D1, D2, D3.)
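The single-pass procedure can be sketched roughly as follows. Several choices here are assumptions rather than details from the slide: cosine similarity over raw term counts stands in for sim, a document joins the best-matching cluster when its similarity is at least the threshold, and documents are processed in list order instead of being picked at random. `single_pass`, `cosine`, and `centroid` are hypothetical helper names.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average the term counts of all vectors in a cluster."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def single_pass(docs, threshold):
    """Cluster documents in one pass over the list."""
    clusters = []  # each cluster: {"docs": [...], "vectors": [...], "centroid": {...}}
    for doc in docs:
        vec = Counter(doc.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # Close enough to an existing cluster: join it and update the centroid.
            best["docs"].append(doc)
            best["vectors"].append(vec)
            best["centroid"] = centroid(best["vectors"])
        else:
            # Otherwise the document seeds a new cluster.
            clusters.append({"docs": [doc], "vectors": [vec], "centroid": dict(vec)})
    return clusters


docs = ["water spaniel diseases", "water spaniel illnesses", "ant colonies"]
for cluster in single_pass(docs, threshold=0.5):
    print(cluster["docs"])
```

Because each document is examined once, the result depends on processing order and on the threshold; a lower threshold yields fewer, larger clusters.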
8
CS466-108 Types of Clustering Behavior/Criterion: Sim(t_i, t_j)
Document level: co-occurrence in same document
Verb-object:
  Syntagmatic similarity (appears together in region): sim(drink, wine), sim(eat, meat), sim(drink, water)
  Paradigmatic similarity (appears as objects of the same verb): sim(wine, water), sim(wine, drink), sim(wine, meat)
  Based on object of drink or of all verbs
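The two verb-object criteria can be contrasted with a small sketch. The verb-object pairs below are made-up toy data, and the specific measures (raw co-occurrence count for syntagmatic similarity, cosine between verb profiles for paradigmatic similarity) are assumptions picked for illustration; the slide does not prescribe a formula.

```python
import math
from collections import Counter, defaultdict

# Toy (verb, object) pairs: assumed data, not drawn from a real corpus.
PAIRS = [("drink", "wine"), ("drink", "water"), ("drink", "wine"),
         ("eat", "meat"), ("pour", "wine"), ("pour", "water")]

pair_counts = Counter(PAIRS)

# For each object, count the verbs it appears under.
verb_profile = defaultdict(Counter)
for verb, obj in PAIRS:
    verb_profile[obj][verb] += 1


def syntagmatic(verb, obj):
    """How often the verb and the object appear together."""
    return pair_counts[(verb, obj)]


def paradigmatic(obj_a, obj_b):
    """Cosine similarity between the verb profiles of two objects."""
    a, b = verb_profile[obj_a], verb_profile[obj_b]
    dot = sum(a[v] * b[v] for v in a if v in b)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


print(syntagmatic("drink", "wine"))
print(paradigmatic("wine", "water"))
print(paradigmatic("wine", "meat"))
```

In this toy data, wine and water never co-occur with each other, yet they score high paradigmatically because they share the verbs drink and pour; wine and meat share no verb and score zero.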
9
CS466-109 N-gram
Syntagmatic similarity (occur together): sim(Hong, Kong), sim(soap, opera), sim(soap, suds)
Paradigmatic similarity (occur in same context): sim(opera, suds), sim(tall, short), sim(long, short)
sim(Hong, Kong)
(Figure: contexts of "soap": soap opera, soap suds, soap residue; Ivory soap, Dial soap, Lye soap.)