CS466-101 Thesaurus Creation / Term Clustering


1 CS466-101 Thesaurus Creation / Term Clustering

Two major applications:
1. Query expansion: fleshing out sparse queries with related words → improves recall (at the possible expense of reduced precision)
2. Termset dimensionality reduction: similar outcome with a smaller model

2 CS466-102 Term Dimensionality Reduction

Query: Water Spaniel Diseases

Original vector V_Q
  terms:  Spaniel  Spaniels  disease  diseases  collie  illness  ant
  values: 1 0 0 1 0 0 0 0

Reduced vector V_Q'
  classes: DOG  ILL  INSECT
  values:  1    1    0

The reduced vector also matches:
  → Collie illnesses
  → Poodle sickness

Problem: reduced flexibility in partial weighting of the synonym set
  → Synonyms get as much weight as the original term
  → Equivalent to query expansion when the weight λ_i for all synonyms is 1
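The reduction on this slide can be sketched in a few lines of Python. This is only an illustration: the TERM_CLASS table and the reduce_vector helper are hypothetical names, and the class assignments are assumed from the slide's DOG / ILL / INSECT example rather than taken from any real thesaurus.

```python
# Hypothetical term-to-class table, assumed from the slide's example.
TERM_CLASS = {
    "spaniel": "DOG", "spaniels": "DOG", "collie": "DOG", "poodle": "DOG",
    "disease": "ILL", "diseases": "ILL", "illness": "ILL",
    "illnesses": "ILL", "sickness": "ILL",
    "ant": "INSECT",
}
CLASSES = ["DOG", "ILL", "INSECT"]


def reduce_vector(terms):
    """Map query terms to a binary vector over thesaurus classes."""
    present = {TERM_CLASS[t.lower()] for t in terms if t.lower() in TERM_CLASS}
    return [1 if c in present else 0 for c in CLASSES]


query = reduce_vector(["Water", "Spaniel", "Diseases"])
other = reduce_vector(["Poodle", "sickness"])
print(query, other)
```

Because every member of a class collapses to the same dimension, the two queries above become indistinguishable, which is the loss of partial weighting that the slide points out.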

3 CS466-103 Query Expansion

Query: Water Spaniel Diseases

  Term:      Water  Spaniel  diseases  Spaniels  diseases  illness  collie  ant
  Original:  1      1        1         0         0         0        0       0
  Expanded:  1      1        1         λ1        λ2        λ3       λ4      0

The added terms λ1 to λ4 are labelled stem, syn, stem, syn.
λ_i depends on semantic dist(w_i, t)

Related document set:
  D1: Water Spaniels
  D2: Water Spaniel illnesses
  D3: Collie diseases
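A minimal Python sketch of weighted query expansion follows, under stated assumptions: the RELATED table, its weights (0.9 for stem variants, 0.6 and 0.4 for synonyms), and the helper names expand_query and score are all hypothetical and chosen only to mirror the slide's example. The slide does not give numeric weights.

```python
# Hypothetical related-term table: term -> [(related term, assumed weight)].
RELATED = {
    "spaniel": [("spaniels", 0.9), ("collie", 0.4)],
    "diseases": [("disease", 0.9), ("illness", 0.6), ("illnesses", 0.6)],
}


def expand_query(terms, related):
    """Original terms get weight 1; related terms get their partial weight."""
    weights = {t.lower(): 1.0 for t in terms}
    for term in list(weights):
        for rel, w in related.get(term, []):
            weights[rel] = max(weights.get(rel, 0.0), w)
    return weights


def score(query_weights, doc_terms):
    """Sum the weights of the query terms that appear in the document."""
    return sum(w for t, w in query_weights.items() if t in doc_terms)


docs = {
    "D1": {"water", "spaniels"},
    "D2": {"water", "spaniel", "illnesses"},
    "D3": {"collie", "diseases"},
}
expanded = expand_query(["Water", "Spaniel", "Diseases"], RELATED)
for name, terms in docs.items():
    print(name, round(score(expanded, terms), 2))
```

Setting every related weight to 1 makes this behave like the class-based reduction on the previous slide; weights below 1 let synonyms count for less than the original term.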

4 CS466-104 Query Expansion

Query: Water Spaniel diseases
document1: … water spaniels ….. …

5 CS466-105 Simplest Term Clustering

→ Stemming is a clustering method

Original Term Set: computing, flies, computers, houses, computation, flown, compute, flew, house

  (stemming)

Clustered Term Set: comput*, fly*
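To make the idea concrete, here is a toy Python sketch that groups the slide's terms by stem. It is not a real stemming algorithm: the suffix list, the minimum stem length of 4, and the IRREGULAR lookup (needed for forms such as flown and flew) are assumptions made only so that this small example clusters as the slide shows.

```python
from collections import defaultdict

# Hypothetical lookup for irregular forms; a real stemmer or lemmatizer
# would handle these differently.
IRREGULAR = {"flies": "fly", "flown": "fly", "flew": "fly"}
SUFFIXES = ("ation", "ing", "ers", "er", "es", "e", "s")


def toy_stem(term):
    """Strip the first matching suffix, keeping at least 4 characters."""
    term = term.lower()
    if term in IRREGULAR:
        return IRREGULAR[term]
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 4:
            return term[: -len(suffix)]
    return term


def cluster_by_stem(terms):
    """Group terms that share a stem: each stem is one cluster."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[toy_stem(term)].append(term)
    return dict(clusters)


terms = ["computing", "flies", "computers", "houses", "computation",
         "flown", "compute", "flew", "house"]
print(cluster_by_stem(terms))
```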

6 CS466-106 Another simple clustering method: pre-existing thesauri (e.g. Roget's)

Term equivalence classes:
  illness → disease, sickness, unwell, sick, ill, …
    (disease, sickness: same part of speech (pos); unwell, sick, ill: different pos)
  PhD → Ph.D, PhD, Phd, Ph.D., ….

Loosely related topic sets:
  DOG → Spaniel, Collie, Schnauzer, bulldog, Poodle, ….

7 CS466-107 Term Clustering

Non-hierarchical methods: single pass (Salton, '71)

Given a clustering threshold/target size and a similarity function sim(i, j):
  → Pick a random document D_j
  → Assign a document d_i : sim(D_j, d_i) < threshold to cluster C_j and recalculate the centroid;
    else create a new cluster C_k with centroid d_k
  → Exclude d_i from the document list
  → Repeat until the document list is empty

[Figure: documents D1, D2, D3 with the clustering threshold marked]
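The single-pass procedure can be sketched as below. Several choices here are assumptions rather than part of the slide: documents are sparse term-weight dictionaries, sim is cosine similarity, documents are processed in list order instead of at random (to keep the example reproducible), and a document joins its most similar cluster when the similarity is at least the threshold. The slide writes its test as "sim < threshold", which reads like a distance-style comparison; with a true similarity the comparison is flipped, as done here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Average of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def single_pass(docs, threshold):
    """Assign each document to the closest cluster or start a new one."""
    clusters = []  # each cluster: {"members": [...], "centroid": {...}}
    for doc in docs:
        best, best_sim = None, 0.0
        for cluster in clusters:
            s = cosine(doc, cluster["centroid"])
            if s > best_sim:
                best, best_sim = cluster, s
        if best is not None and best_sim >= threshold:
            best["members"].append(doc)
            best["centroid"] = centroid(best["members"])
        else:
            clusters.append({"members": [doc], "centroid": dict(doc)})
    return clusters


docs = [
    {"water": 1, "spaniel": 1},
    {"water": 1, "spaniel": 1, "illness": 1},
    {"collie": 1, "disease": 1},
]
for c in single_pass(docs, threshold=0.5):
    print(len(c["members"]), c["centroid"])
```

A single pass is cheap, but the result depends on the order in which documents are visited, which is one reason the slide's version picks documents at random.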

8 CS466-108 Types of Clustering Behavior/Criterion

Sim(t_i, t_j)

Document level: co-occurrence in the same document

Verb-object:
  Syntagmatic similarity (appears together in region):
    sim(drink, wine), sim(eat, meat), sim(drink, water)
  Paradigmatic similarity (appears as objects of the same verb):
    sim(wine, water), sim(wine, drink), sim(wine, meat)
    based on object of drink or of all verbs
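The two verb-object criteria can be contrasted with a small Python sketch. Everything here is illustrative: the PAIRS list is made-up data, syntagmatic similarity is approximated by a raw co-occurrence count, and paradigmatic similarity by the cosine between the verb profiles of two objects. The slide does not specify either measure.

```python
import math
from collections import Counter, defaultdict

# Hypothetical (verb, object) pairs, as if extracted from parsed text.
PAIRS = [
    ("drink", "wine"), ("drink", "water"), ("drink", "wine"),
    ("eat", "meat"), ("pour", "wine"), ("pour", "water"),
]


def syntagmatic_count(verb, obj, pairs):
    """How often the verb and object appear together."""
    return sum(1 for v, o in pairs if v == verb and o == obj)


def object_profiles(pairs):
    """For each object, count the verbs it appears with."""
    profiles = defaultdict(Counter)
    for verb, obj in pairs:
        profiles[obj][verb] += 1
    return profiles


def cosine(u, v):
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def paradigmatic_sim(a, b, profiles):
    """Objects are similar when they occur with the same verbs."""
    return cosine(profiles.get(a, Counter()), profiles.get(b, Counter()))


profiles = object_profiles(PAIRS)
print(syntagmatic_count("drink", "wine", PAIRS))
print(round(paradigmatic_sim("wine", "water", profiles), 3))
print(paradigmatic_sim("wine", "meat", profiles))
```

In this toy data wine and water never co-occur with each other, yet they come out as similar because both appear as objects of drink and pour, whereas wine and meat share no verb.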

9 CS466-109 N-gram

Syntagmatic similarity (occur together):
  sim(Hong, Kong), sim(soap, opera), sim(soap, suds)

Paradigmatic similarity (occur in same context):
  sim(opera, suds), sim(tall, short), sim(long, short)

[Figure: contexts of "soap": soap followed by opera / suds / residue; Ivory / Dial / Lye followed by soap]

