1 Converting Categories to Numbers for Approximate Nearest Neighbor Search Huang-Cheng Kuo (郭煌政), Department of Computer Science and Information Engineering, National Chiayi University 2004/10/20
2 Outline Introduction Motivation Measurement Algorithms Experiments Conclusion
3 Introduction Memory-Based Reasoning –Case-Based Reasoning –Instance-Based Learning Given a training dataset and a new object, predict the class (target value) of the new object. Focus on table data
4 Introduction K Nearest Neighbor Search –Compute the similarity between the new object and each object in the training dataset. –Time is linear in the size of the dataset Similarity: Euclidean distance Multi-dimensional Index –Spatial data structure, such as an R-tree –Numeric data
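A minimal sketch of the linear-scan K nearest neighbor search described on this slide, assuming numeric feature vectors and majority voting over the k neighbors. The names `euclidean` and `knn_predict` and the voting rule are illustrative assumptions, not taken from the slides.

```python
import heapq
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """Predict a class by scanning every training object (linear in dataset size).

    train: list of (feature_vector, label) pairs.
    The majority vote among the k nearest objects is an assumed decision rule.
    """
    neighbors = heapq.nsmallest(k, train, key=lambda row: euclidean(row[0], query))
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
         ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.2, 0.8), "b")]
print(knn_predict(train, (0.05, 0.1), k=3))
```

Every query touches all training objects, which is the cost a multi-dimensional index is meant to avoid.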
5 Introduction Indexing on Categorical Data? –Linear order of the categories –Existing correct ordering? –Best ordering? Store the mapped data in a multi-dimensional data structure as a filtering mechanism
6 Measurement for Ordering Ordering Problem Given an undirected, weighted, complete graph, a simple path through all vertices is an ordering of the vertices. The edge weights are the distances between pairs of vertices. The ordering problem is to find a path, called the ordering path, of maximal value according to a certain scoring function.
7 Measurement for Ordering Relationship Scoring Reasonable Ordering Score In an ordering path, a 3-tuple (v_{i-1}, v_i, v_{i+1}) is reasonable if and only if dist(v_{i-1}, v_{i+1}) ≥ dist(v_{i-1}, v_i) and dist(v_{i-1}, v_{i+1}) ≥ dist(v_i, v_{i+1}).
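The reasonableness test can be sketched as below. The slides do not spell out how the per-tuple results are aggregated; this sketch assumes the score is the fraction of consecutive 3-tuples that are reasonable, which is consistent with the "No Ordering => 1/3" baseline reported in the experiments. The function name is hypothetical.

```python
def reasonable_ordering_score(path, dist):
    """Fraction of consecutive 3-tuples on the path that are reasonable.

    path: list of vertex indices; dist: symmetric distance matrix.
    Aggregating as a fraction is an assumption, not stated in the slides.
    """
    if len(path) < 3:
        raise ValueError("need at least three vertices to form a 3-tuple")
    reasonable = 0
    for i in range(1, len(path) - 1):
        a, b, c = path[i - 1], path[i], path[i + 1]
        # The outer pair must be at least as far apart as each adjacent pair.
        if dist[a][c] >= dist[a][b] and dist[a][c] >= dist[b][c]:
            reasonable += 1
    return reasonable / (len(path) - 2)

# Four categories placed on a line at 0, 1, 3, 6.
pos = [0, 1, 3, 6]
dist = [[abs(p - q) for q in pos] for p in pos]
print(reasonable_ordering_score([0, 1, 2, 3], dist))
print(reasonable_ordering_score([0, 2, 1, 3], dist))
```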
8 Measurement for Mapping Pairwise Difference Scoring –Normalized distance matrix –Mapping values of categories –Dist_m(v_i, v_j) = |mapping(v_i) - mapping(v_j)|
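A hedged sketch of pairwise difference scoring as a root mean squared error between the normalized distance matrix and the mapped distances Dist_m. Normalizing by the largest entry and averaging over unordered pairs are assumptions; the slides name the ingredients but not the exact formula.

```python
import math

def normalize(dist):
    """Scale a distance matrix so its largest entry is 1 (assumed normalization)."""
    largest = max(max(row) for row in dist)
    if largest == 0:
        raise ValueError("all distances are zero")
    return [[d / largest for d in row] for row in dist]

def mapping_rmse(dist_norm, mapping):
    """RMSE between dist_norm[i][j] and |mapping[i] - mapping[j]| over all pairs i < j."""
    n = len(mapping)
    squared_error = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            mapped = abs(mapping[i] - mapping[j])
            squared_error += (dist_norm[i][j] - mapped) ** 2
            pairs += 1
    return math.sqrt(squared_error / pairs)

# Three categories whose true distances come from positions 0, 1, 4.
pos = [0, 1, 4]
dist_norm = normalize([[abs(p - q) for q in pos] for p in pos])
print(mapping_rmse(dist_norm, [0.0, 0.25, 1.0]))  # mapping that preserves all distances
print(mapping_rmse(dist_norm, [0.0, 0.5, 1.0]))   # evenly spaced mapping
```

A lower value means the one-dimensional mapping distorts the original category distances less.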
9 Algorithms Prim-like Ordering Kruskal-like Ordering Divisive Ordering GA Approach Ordering –A vertex is a category –A graph represents a distance matrix
10 Prim-like Ordering Algorithm Prim’s Minimum Spanning Tree –Initially, choose a least edge (u, v) –Add the edge to the tree; S = {u, v} –Choose a least edge connecting a vertex in S and a vertex, w, not in S –Add the edge to the tree; Add w to S –Repeat until all vertices are in S
11 Prim-like Ordering Algorithm Prim-like Ordering –Choose a least edge (u, v) –Add the edge to the ordering path; S = {u, v} –Choose a least edge connecting a vertex in S and a vertex, w, not in S –If the edge creates a cycle on the path, discard the edge, and choose again –Else, add the edge to the ordering path; Add w to S –Repeat until all vertices are in S
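The Prim-like procedure can be sketched as follows. This is one interpretation, not the author's implementation: to keep the result a simple path, the sketch only accepts edges that attach a new vertex to one of the two current path endpoints, and treats every other candidate edge as discarded. The function name is hypothetical.

```python
from collections import deque

def prim_like_ordering(dist):
    """Grow an ordering path greedily from the least edge.

    dist: symmetric distance matrix. Returns a list of vertex indices.
    Assumption: new vertices may only be attached at a path endpoint.
    """
    n = len(dist)
    if n < 2:
        return list(range(n))
    # Start from the least edge in the graph.
    u, v = min(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda e: dist[e[0]][e[1]])
    path = deque([u, v])
    remaining = set(range(n)) - {u, v}
    while remaining:
        head, tail = path[0], path[-1]
        candidates = sorted(remaining)  # sorted for deterministic tie-breaking
        w_head = min(candidates, key=lambda w: dist[head][w])
        w_tail = min(candidates, key=lambda w: dist[tail][w])
        if dist[head][w_head] < dist[tail][w_tail]:
            path.appendleft(w_head)
            remaining.remove(w_head)
        else:
            path.append(w_tail)
            remaining.remove(w_tail)
    return list(path)

pos = [0, 10, 1, 4, 3]
dist = [[abs(p - q) for q in pos] for p in pos]
order = prim_like_ordering(dist)
print(order, [pos[i] for i in order])
```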
12 Kruskal-like Ordering Algorithm Kruskal Minimum Spanning Tree –Initially, choose a least edge (u, v) –Add the edge to the tree; S = {u, v} –Choose a least edge as long as the edge does not create a cycle in the tree –Add the edge to the tree; Add the two vertices to S –Repeat until all vertices are in S
13 Kruskal-like Ordering Algorithm Kruskal-like Ordering –Initially, choose a least edge (u, v) and add it to the ordering path; S = {u, v} –Choose a least edge as long as the edge does not create a cycle on the path and the degree of each of its two vertices on the path stays <= 2 –Add the edge to the ordering path; Add the two vertices to S –Repeat until all vertices are in S A heap can be used to speed up choosing the least edge
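A minimal sketch of the Kruskal-like ordering under stated assumptions: edges are popped from a binary heap, a union-find structure detects cycles, and the loop stops once n - 1 edges connect all vertices into a single path. The stopping rule and the helper names are assumptions made for this illustration.

```python
import heapq

def kruskal_like_ordering(dist):
    """Build an ordering path by accepting least edges under path constraints.

    An edge is accepted only if both endpoints currently have degree < 2
    and the endpoints lie in different path fragments (no cycle).
    """
    n = len(dist)
    if n < 2:
        return list(range(n))
    heap = [(dist[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    heapq.heapify(heap)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    adj = [[] for _ in range(n)]
    accepted = 0
    while accepted < n - 1:
        _, i, j = heapq.heappop(heap)
        if len(adj[i]) >= 2 or len(adj[j]) >= 2:
            continue  # would give a vertex degree 3
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # would close a cycle
        parent[ri] = rj
        adj[i].append(j)
        adj[j].append(i)
        accepted += 1
    # Walk the path from the lowest-numbered endpoint.
    start = next(v for v in range(n) if len(adj[v]) == 1)
    path, prev = [start], None
    while len(path) < n:
        cur = path[-1]
        nxt = next(w for w in adj[cur] if w != prev)
        path.append(nxt)
        prev = cur
    return path

pos = [0, 10, 1, 4, 3]
dist = [[abs(p - q) for q in pos] for p in pos]
order = kruskal_like_ordering(dist)
print(order, [pos[i] for i in order])
```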
14 Divisive Ordering Algorithm Idea: –Pick a central vertex, and split the remaining vertices –Building a binary tree: vertices are the leaves Central Vertex:
15 Divisive Ordering Algorithm A_R is closer to P than A_L is. B_L is closer to P than B_R is. [Figure: tree diagram with central vertex P, subtree A with children A_L and A_R, and subtree B with children B_L and B_R]
16 Clustering Splitting a Set of Vertices into Two Groups –Each group has at least one vertex –Close (similar) vertices in the same group; distant vertices in different groups Clustering Algorithms –Two clusters
17 Clustering Clustering –Grouping a set of objects into classes of similar objects Agglomerative Hierarchical Clustering Algorithm –Singleton clusters –Merge similar clusters
18 Clustering Clustering Algorithm: Cluster Similarity –Single link dist(Ci, Cj) = min(dist(p, q)), p in Ci, q in Cj –Complete link dist(Ci, Cj) = max(dist(p, q)), p in Ci, q in Cj –Average link -- adopted in our study dist(Ci, Cj) = avg(dist(p, q)), p in Ci, q in Cj –others
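The three cluster-to-cluster distances listed above can be written directly from their definitions. The function name and the `link` parameter are hypothetical names chosen for this sketch.

```python
def cluster_distance(ci, cj, dist, link="average"):
    """Distance between clusters ci and cj (lists of vertex indices).

    single: minimum pairwise distance; complete: maximum; average: mean.
    """
    pairwise = [dist[p][q] for p in ci for q in cj]
    if link == "single":
        return min(pairwise)
    if link == "complete":
        return max(pairwise)
    if link == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown link type: {link}")

pos = [0, 1, 5, 7]
dist = [[abs(p - q) for q in pos] for p in pos]
for link in ("single", "complete", "average"):
    print(link, cluster_distance([0, 1], [2, 3], dist, link))
```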
19 Clustering Clustering Implementation Issues –Which pair of clusters to be merged: Keep cluster-to-cluster similarity for each pair –Recursively partition sets of vertices while building the binary tree: Non-recursive version with a stack
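As an illustration of the splitting step, the sketch below runs average-link agglomerative clustering until exactly two clusters remain. For brevity it recomputes cluster distances on every pass instead of keeping a cluster-to-cluster similarity table as the slide suggests; `split_in_two` is a hypothetical name.

```python
def split_in_two(vertices, dist):
    """Split vertices into two groups by average-link agglomerative merging.

    Starts from singleton clusters and merges the closest pair until two remain.
    """
    if len(vertices) < 2:
        raise ValueError("need at least two vertices to split")
    clusters = [[v] for v in vertices]

    def average_link(ci, cj):
        return sum(dist[p][q] for p in ci for q in cj) / (len(ci) * len(cj))

    while len(clusters) > 2:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_link(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a stays valid
    return clusters

pos = [0, 1, 2, 10, 11]
dist = [[abs(p - q) for q in pos] for p in pos]
print(split_in_two(list(range(5)), dist))
```

Each group is guaranteed at least one vertex because merging stops at two clusters.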
20 GA Ordering Algorithm Genetic Algorithm for Optimization Problems Chromosome: solution Population: pool of solutions Genetic Operations –Crossover –Mutation
21 GA Ordering Algorithm Encoding a Solution –Binary string –Ordered list of categories – in our ordering problem Fitness Function –Reasonable ordering score Selecting Chromosomes for crossover –High fitness value => high probability
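One common way to realize "high fitness value => high probability" is fitness-proportional (roulette-wheel) selection. The slides do not say which selection scheme was used, so the sketch below is an assumption; it relies on the standard library's `random.choices`, and `select_parents` is a hypothetical name.

```python
import random

def select_parents(population, fitness, rng=random):
    """Pick two chromosomes with probability proportional to their fitness.

    population: list of chromosomes (ordered lists or strings of categories).
    fitness: non-negative scores, e.g. reasonable ordering scores; not all zero.
    """
    return rng.choices(population, weights=fitness, k=2)

population = ["ABCDE", "BDAEC", "EDCBA"]
fitness = [0.9, 0.3, 0.6]
print(select_parents(population, fitness))
```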
22 GA Ordering Algorithm Crossover –Single point –Multiple points –Mask Single-point crossover of AB | CDE and BD | AEC results in ABAEC and BDCDE => illegal (some categories are duplicated and others are missing)
23 GA Ordering Algorithm Repair Illegal Chromosome ABAEC –AB*EC => fill D in the * position Repair Illegal Chromosome ABABC –AB**C –D and E are missing –Whichever of D and E is closest to B is filled into the first * position
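The crossover and repair steps can be sketched as follows. The repair rule follows the slide: repeated categories become holes, and each hole is filled with the missing category closest to the category just before it. Filling holes left to right and the function names are assumptions of this sketch.

```python
def single_point_crossover(p1, p2, point):
    """Swap the tails of two parent orderings at the given cut point."""
    return list(p1[:point]) + list(p2[point:]), list(p2[:point]) + list(p1[point:])

def repair(child, categories, dist):
    """Turn an illegal chromosome into a permutation of the categories.

    dist: nested dict, dist[a][b] = distance between categories a and b.
    """
    seen, result = set(), []
    for gene in child:
        if gene in seen:
            result.append(None)  # repeated category becomes a hole
        else:
            seen.add(gene)
            result.append(gene)
    missing = [c for c in categories if c not in seen]
    for i, gene in enumerate(result):
        if gene is None:
            # The first gene is never a hole, so result[i - 1] always exists.
            left = result[i - 1]
            fill = min(missing, key=lambda c: dist[left][c])
            result[i] = fill
            missing.remove(fill)
    return result

# Hypothetical distances derived from positions on a line.
pos = {"A": 0, "B": 1, "C": 5, "D": 2, "E": 4}
dist = {a: {b: abs(pos[a] - pos[b]) for b in pos} for a in pos}
c1, c2 = single_point_crossover("ABCDE", "BDAEC", 2)
print("".join(c1), "".join(c2))
print("".join(repair(c1, "ABCDE", dist)), "".join(repair("ABABC", "ABCDE", dist)))
```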
24 Mapping Function Ordering Path Mapping(v i ) =
25 Experiments Synthetic Data (width/length = 5)
26 Experiments Synthetic Data (width/length = 10)
27 Experiments Synthetic Data: Reasonable Ordering Score for Divisive Algorithm –width/length = 5 => 0.82 –width/length = 10 => 0.9 –No Ordering => 1/3 Divisive algorithm is better than Prim-like algorithm when number of categories > 100
28 Experiments Synthetic Data (width/length = 5)
29 Experiments Synthetic Data (width/length = 10)
30 Experiments Divisive ordering is the best among the three ordering algorithms. For the divisive ordering algorithm on > 100 categories, RMSE scores are around 0.07 when width/length = 5 and around 0.05 when width/length = 10. Prim-like ordering algorithm: 0.12 and 0.1, respectively.
31 Experiments “Census-Income” dataset from the University of California, Irvine (UCI) KDD Archive 33 nominal attributes, 7 continuous attributes Sample 5000 records for training dataset. Sample 2000 records for approximate KNN search experiment.
32 Experiments Distance Matrix: distance between two categories V. Ganti, J. Gehrke, and R. Ramakrishnan, “CACTUS-Clustering Categorical Data Using Summaries,” ACM KDD, 1999 D = {d_1, d_2, …, d_n} is a set of n tuples. D is a subset of D_1 × D_2 × … × D_k, where D_i is a categorical domain, for 1 ≤ i ≤ k. d_i =.
33 Experiments
34 Experiments Approximate KNN – nominal attributes
35 Experiments Approximate KNN – nominal attributes
36 Experiments Approximate KNN – nominal attributes
37 Experiments Approximate KNN – all attributes
38 Experiments Approximate KNN – all attributes
39 Experiments Approximate KNN – all attributes
40 Conclusion Developed Ordering Algorithms –Prim-like –Kruskal-like –Divisive –GA-based Devised Measurement –Reasonable ordering score –Root mean squared error
41 Conclusion What next? –New categories, new mapping function –New index structure? –Training mapping function for a given ordering path.
42 Thank you.