(C) 2003, The University of Michigan
Information Retrieval
Handout #5
January 28, 2005

Course Information
Instructor: Dragomir R. Radev
Office: 3080, West Hall Connector
Phone: (734)
Office hours: M & Th 12-1 or via
Course page:
Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

Approximate string matching

Levenshtein edit distance
Examples:
– Theatre -> theater
– Ghaddafi -> Qadafi
– Computer -> counter
Edit distance: the minimum number of inserts, deletes, and substitutions needed to turn one string into the other
– Edit transcript: the sequence of operations that achieves it
Computed through dynamic programming

Recurrence relation
Three dependencies
– D(i,0) = i
– D(0,j) = j
– D(i,j) = min[D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j)]
Simple edit distance:
– t(i,j) = 0 if S1(i) = S2(j), and 1 otherwise

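The recurrence can be sketched in a few lines of Python. This is a minimal illustration, assuming unit costs for all three operations; the function name `edit_distance` and the table name `D` are chosen here for illustration and are not taken from the course demo.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance with unit-cost inserts, deletes, and substitutions."""
    n, m = len(s1), len(s2)
    # D[i][j] = distance between the first i characters of s1
    # and the first j characters of s2.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # base condition D(i,0) = i
    for j in range(m + 1):
        D[0][j] = j  # base condition D(0,j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # delete from s1
                D[i][j - 1] + 1,      # insert into s1
                D[i - 1][j - 1] + t,  # match or substitute
            )
    return D[n][m]


print(edit_distance("theatre", "theater"))  # 2
print(edit_distance("vintner", "writers"))  # 5
```

The table has (n+1) x (m+1) cells and each cell takes constant time, so the sketch runs in O(nm) time and space.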
Example (Gusfield 1997)
Dynamic programming table: rows V I N T N E R (with index i), columns W R I T E R S; first column D(i,0) = i
      W R I T E R S
V 1   1
I 2   2
N 3   3
T 4   4
N 5   5
E 6   6
R 7   7

Example (cont’d) (Gusfield 1997)
      W R I T E R S
V
I
N
T 4   4 4 4 4 *
N 5   5
E 6   6
R 7   7

Tracebacks (Gusfield 1997)
      W R I T E R S
V
I
N
T 4   4 4 4 4 *
N 5   5
E 6   6
R 7   7

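A traceback recovers an edit transcript by walking from D(n,m) back to D(0,0) and, at each cell, stepping to a neighbor that explains the cell's value. The sketch below is one way to do this, assuming unit costs; the name `edit_transcript` is hypothetical, and the letters M (match), R (replace), I (insert), and D (delete) follow Gusfield's notation. When several neighbors tie, this version prefers the diagonal, so it returns one optimal transcript out of possibly many.

```python
def edit_transcript(s1: str, s2: str) -> str:
    """Return one optimal edit transcript that turns s1 into s2."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    # Walk back from the bottom-right cell to the top-left cell.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:
                ops.append("M" if t == 0 else "R")
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")  # character of s1 is deleted
            i -= 1
        else:
            ops.append("I")  # character of s2 is inserted
            j -= 1
    return "".join(reversed(ops))


transcript = edit_transcript("vintner", "writers")
print(transcript)
# Every operation other than a match costs 1.
print(sum(op != "M" for op in transcript))  # 5
```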
Weighted edit distance
Used to emphasize the relative cost of different edit operations
Useful in bioinformatics
– Homology information
– BLAST
– BLOSUM
– http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html

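A weighted variant only changes the three cost terms of the recurrence. The sketch below is illustrative: the parameter names `ins_cost`, `del_cost`, and `sub_cost`, and the vowel-based weighting, are assumptions made for this example. Note that BLOSUM matrices hold similarity scores to be maximized, whereas this sketch minimizes costs, so a BLOSUM table could not be plugged in unchanged.

```python
from typing import Callable


def unit_sub_cost(a: str, b: str) -> float:
    """Default substitution cost: 0 for a match, 1 for a mismatch."""
    return 0.0 if a == b else 1.0


def weighted_edit_distance(
    s1: str,
    s2: str,
    ins_cost: float = 1.0,
    del_cost: float = 1.0,
    sub_cost: Callable[[str, str], float] = unit_sub_cost,
) -> float:
    """Edit distance where each operation type carries its own cost."""
    n, m = len(s1), len(s2)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + del_cost,
                D[i][j - 1] + ins_cost,
                D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )
    return D[n][m]


VOWELS = set("aeiou")


def vowel_friendly_cost(a: str, b: str) -> float:
    # Hypothetical weighting: swapping one vowel for another costs half a unit.
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5
    return 1.0


print(weighted_edit_distance("cat", "cut"))                                # 1.0
print(weighted_edit_distance("cat", "cut", sub_cost=vowel_friendly_cost))  # 0.5
```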
Web sites:
–
–
Demo:
– /clair4/class/ir-w05/dp
– ./dp.pl theater theatre

Clustering

Clustering
Exclusive/overlapping clusters
Hierarchical/flat clusters
The cluster hypothesis
– Documents in the same cluster tend to be relevant to the same queries

Representations for document clustering
Typically: vector-based
– Words: “cat”, “dog”, etc.
– Features: document length, author name, etc.
Each document is represented as a vector in an n-dimensional space
Similar documents appear nearby in the vector space (distance measures are needed)

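As a small illustration of the vector view, the sketch below builds word-count vectors and compares them with cosine similarity. Both choices are assumptions made for this example (the slides do not prescribe a weighting scheme or a distance measure), and the helper names `tf_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text: str) -> Counter:
    """Represent a document as a sparse vector of word counts."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


d1 = tf_vector("cat dog")
d2 = tf_vector("cat")
d3 = tf_vector("fish bird")
print(cosine(d1, d2))  # shares one of two words with d1
print(cosine(d1, d3))  # no words in common
```

Each distinct word is one dimension, so n is the vocabulary size; the `Counter` stores only the nonzero coordinates.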
Hierarchical clustering
Dendrograms
E.g., language similarity:
[Figure: dendrogram of language similarity]

Another example
Kingdom = animal
Phylum = Chordata
Subphylum = Vertebrata
Class = Osteichthyes
Subclass = Actinopterygii
Order = Salmoniformes
Family = Salmonidae
Genus = Oncorhynchus
Species = Oncorhynchus kisutch (Coho salmon)

Clustering using dendrograms
REPEAT
  Compute pairwise similarities
  Identify closest pair
  Merge pair into single node
UNTIL only one node left
Q: what is the equivalent Venn diagram representation?
Example: cluster the following sentences:
A B C B A A D C C A D E C D E F C D A E F G F D A A C D A B A

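The REPEAT/UNTIL loop can be sketched as follows. Several things here are assumptions made for illustration: similarity is Jaccard overlap of word sets, clusters are compared by their most similar pair of members, and the four toy sentences are hypothetical (they are not the sentences from the slide). The function name `agglomerate` is also hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Cluster = Tuple[str, ...]  # a cluster is a tuple of document names


def jaccard(a: frozenset, b: frozenset) -> float:
    """Word-set overlap: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(docs: Dict[str, str]) -> List[Tuple[Cluster, Cluster]]:
    """Merge the closest pair until one node is left; return the merge order."""
    tokens = {name: frozenset(text.split()) for name, text in docs.items()}
    clusters: List[Cluster] = [(name,) for name in docs]
    merges = []

    def similarity(c1: Cluster, c2: Cluster) -> float:
        # Similarity of the most similar cross-cluster pair of documents.
        return max(jaccard(tokens[a], tokens[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Compute pairwise similarities and identify the closest pair.
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: similarity(*pair))
        # Merge the pair into a single node.
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


toy = {"s1": "A B C", "s2": "A B C D", "s3": "E F G", "s4": "E F H"}
for left, right in agglomerate(toy):
    print(left, "+", right)
```

The list of merges is exactly the information a dendrogram draws: each merge is one internal node joining two subtrees.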
Methods
Single-linkage
– One common pair is sufficient
– Disadvantages: long chains
Complete-linkage
– All pairs have to match
– Disadvantages: too conservative
Average-linkage
Centroid-based (online)
– Look at distances to centroids
Demo:
– /clair4/class/ir-w05/clustering

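The four methods differ only in how the distance between two clusters is derived from distances between their members. The sketch below states each rule in distance terms, assuming points are coordinate tuples compared with Euclidean distance; the function names are hypothetical.

```python
import math
from itertools import product
from typing import List, Tuple

Point = Tuple[float, ...]


def single_link(c1: List[Point], c2: List[Point]) -> float:
    """Distance of the closest cross-cluster pair (one close pair is enough)."""
    return min(math.dist(a, b) for a, b in product(c1, c2))


def complete_link(c1: List[Point], c2: List[Point]) -> float:
    """Distance of the farthest cross-cluster pair (all pairs must be close)."""
    return max(math.dist(a, b) for a, b in product(c1, c2))


def average_link(c1: List[Point], c2: List[Point]) -> float:
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(a, b) for a, b in product(c1, c2)) / (len(c1) * len(c2))


def centroid(cluster: List[Point]) -> Point:
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))


def centroid_link(c1: List[Point], c2: List[Point]) -> float:
    """Distance between the two cluster centroids."""
    return math.dist(centroid(c1), centroid(c2))


left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(left, right))    # 2.0
print(complete_link(left, right))  # 5.0
print(average_link(left, right))   # 3.5
print(centroid_link(left, right))  # 3.5
```

Single-linkage always reports the smallest of these values, which is why it tends to chain clusters together through a few close points; complete-linkage reports the largest, which is why it merges reluctantly.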
k-means
Needed: small number k of desired clusters
Hard vs. soft decisions
Example: Weka

k-means
initialize cluster centroids to arbitrary vectors
while further improvement is possible do
  for each document d do
    find the cluster c whose centroid is closest to d
    assign d to cluster c
  end for
  for each cluster c do
    recompute the centroid of cluster c based on its documents
  end for
end while

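The pseudocode translates fairly directly into Python. The sketch below makes several assumptions that the slide leaves open: the initial centroids are k input points drawn at random, "further improvement" is read as "some assignment changed", distance is Euclidean, and an empty cluster keeps its previous centroid. The function name `kmeans` and the parameters `max_iter` and `seed` are hypothetical, and the sample points are made up for illustration.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def kmeans(points: List[Point], k: int, max_iter: int = 100, seed: int = 0):
    """Hard-assignment k-means; returns (cluster index per point, centroids)."""
    rng = random.Random(seed)
    # Initialize centroids to arbitrary vectors: here, k distinct input points.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each document to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so no further improvement is possible
        assignment = new_assignment
        # Recompute each centroid as the mean of its assigned documents.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment, centroids


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(data, k=2)
print(labels)
print(centers)
```

Which cluster gets index 0 depends on the random initialization, so only the grouping is meaningful, not the label values.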
Example
Cluster the following vectors into two groups:
– A =
– B =
– C =
– D =
– E =
– F =

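Complexity
Complexity = O(kn) distance computations per iteration, because at each step each of the n documents has to be compared to each of the k centroids.
With m-dimensional vectors, each comparison costs O(m), so one iteration costs O(knm); the total cost also grows with the number of iterations.
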
Weka
A general environment for machine learning (e.g. for classification and clustering)
Book by Witten and Frank

Demos
– pletKM.html
– r
% cd /data2/tools/weka
% export CLASSPATH=/data2/tools/weka-3-3-4/weka.jar
% java weka.clusterers.SimpleKMeans -t data/weather.arff

Human clustering
Significant disagreement in the number of clusters, overlap of clusters, and the composition of clusters (Macskassy et al. 1998).