New Algorithms for Efficient High-Dimensional Nonparametric Classification
Ting Liu, Andrew W. Moore, and Alexander Gray
Overview
- Introduction
- k Nearest Neighbors (k-NN)
  - KNS1: conventional k-NN search
- New algorithms for k-NN classification
  - KNS2: for skewed-class data
  - KNS3: "are at least t of the k-NN positive?"
- Results
- Comments
Introduction: k-NN
- k-NN is a nonparametric classification method: given a data set of n points, it finds the k points closest to a query point q and predicts the label held by the majority of them.
- Its computational cost is too high in many applications, especially in the high-dimensional case.
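A minimal brute-force sketch of the classifier being accelerated; the function name and signature are illustrative, not from the paper:

```python
# Brute-force k-NN classification: O(n*d) distance computations per query,
# which is the cost KNS1/KNS2/KNS3 aim to reduce.
import numpy as np

def knn_classify(X_train, y_train, q, k=9):
    """Return the majority label among the k training points closest to q.

    Assumes y_train holds small non-negative integer class labels.
    """
    dists = np.linalg.norm(X_train - q, axis=1)  # distance to every point
    nn_idx = np.argsort(dists)[:k]               # indices of the k closest
    return int(np.bincount(y_train[nn_idx]).argmax())
```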
Introduction: KNS1
- KNS1: conventional k-NN search with a ball-tree.
- Ball-tree (binary):
  - The root node represents the full set of points.
  - A leaf node contains a small subset of the points.
  - A non-leaf node has two child nodes.
  - Pivot of a node: one of the points in the node, or the centroid of the node's points.
  - Radius of a node: the maximum distance from the pivot to any point owned by the node.
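A hedged construction sketch consistent with the definitions above; the slides do not fix a splitting rule, so a common farthest-pair split is assumed, and `BallNode`/`leaf_size` are illustrative names:

```python
import numpy as np

class BallNode:
    """One ball: pivot (here the centroid), radius, and two optional children."""
    def __init__(self, points, leaf_size=20):
        self.points = points
        self.pivot = points.mean(axis=0)
        self.radius = float(np.linalg.norm(points - self.pivot, axis=1).max())
        self.left = self.right = None
        if len(points) > leaf_size:
            # Farthest-pair split (an assumption): pick two far-apart points
            # and send every point to the nearer of the two.
            a = points[np.linalg.norm(points - points[0], axis=1).argmax()]
            b = points[np.linalg.norm(points - a, axis=1).argmax()]
            to_a = (np.linalg.norm(points - a, axis=1)
                    <= np.linalg.norm(points - b, axis=1))
            if to_a.any() and (~to_a).any():   # guard against degenerate splits
                self.left = BallNode(points[to_a], leaf_size)
                self.right = BallNode(points[~to_a], leaf_size)
```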
Introduction: KNS1
- Bound the distance from a query point q: for any point x in a node, the triangle inequality gives
  max(|q - Pivot| - Radius, 0) <= |q - x| <= |q - Pivot| + Radius.
- Trade off the cost of construction against the tightness of the radii of the balls.
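The same bounds as code, assuming the `BallNode` sketch above:

```python
import numpy as np

def ball_bounds(q, node):
    """Triangle-inequality bounds on |q - x| for every point x in `node`."""
    d = float(np.linalg.norm(q - node.pivot))
    return max(d - node.radius, 0.0), d + node.radius  # (min_dist, max_dist)
```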
Introduction: KNS1
- KNS1 is a recursive procedure: PSout = BallKNN(PSin, Node)
  - PSin consists of the k-NN of q in V (the set of points searched so far).
  - PSout consists of the k-NN of q in V ∪ Node.
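A sketch of this recursion under the earlier `BallNode` assumption; `ps` is a sorted list of (distance, point) pairs, and the initial call is `ball_knn([], root, q, k)`:

```python
import numpy as np

def ball_knn(ps_in, node, q, k):
    """Return the k-NN of q in V ∪ node, given ps_in = the k-NN of q in V."""
    worst = ps_in[-1][0] if len(ps_in) == k else np.inf
    if max(float(np.linalg.norm(q - node.pivot)) - node.radius, 0.0) >= worst:
        return ps_in                       # whole ball is pruned by the bound
    if node.left is None:                  # leaf: test each point directly
        ps = list(ps_in)
        for x in node.points:
            d = float(np.linalg.norm(q - x))
            if len(ps) < k or d < ps[-1][0]:
                ps.append((d, x))
                ps.sort(key=lambda p: p[0])   # keep the best k, sorted
                ps = ps[:k]
        return ps
    # Search the closer child first so later pruning is more effective.
    kids = sorted((node.left, node.right),
                  key=lambda c: float(np.linalg.norm(q - c.pivot)))
    ps = ball_knn(ps_in, kids[0], q, k)
    return ball_knn(ps, kids[1], q, k)
```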
KNS2
- KNS2 targets skewed-class data: one class is much more frequent than the other.
- Goal: find the number of positive points among the k-NN of q without explicitly finding the k-NN set.
- Basic idea: build two ball-trees, Postree (small, positive class) and Negtree (negative class), then proceed in two steps (see the sketch below):
  - Step 1 "Find Positive": search Postree with KNS1 to find the k nearest positive neighbors, Posset_k.
  - Step 2 "Insert Negative": search Negtree, using Posset_k as bounds to prune faraway nodes and to count the negative points that would be inserted into the true nearest-neighbor set.
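A high-level skeleton of the two steps, assuming `ball_knn` from the KNS1 sketch and the `neg_count` recursion sketched after the NegCount slide below:

```python
def kns2(postree, negtree, q, k):
    """Return the number of positive points among the k-NN of q.

    Assumes the positive class has at least k points.
    """
    # Step 1 "Find Positive": distances to the k nearest positive neighbors.
    dists = sorted(d for d, _ in ball_knn([], postree, q, k))
    # Step 2 "Insert Negative": count negatives that displace positives.
    n, C = neg_count(len(dists), [0] * len(dists), negtree, dists, q)
    return n
```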
KNS2
Definitions:
- Dists = {Dist_1, ..., Dist_k}: the distances from q to its k nearest positive neighbors, sorted in increasing order (Dist_1 <= ... <= Dist_k).
- V: the set of points in the negative balls visited so far.
- (n, C): n is the number of positive points among the k-NN of q; C = {C_1, ..., C_n}, where C_i = |{x in V : |x - q| < Dist_i}| is the number of negative points in V closer to q than the i-th positive neighbor.
- Note: the i-th positive neighbor belongs to the k-NN of q exactly when fewer than k points are closer than it, i.e. (i - 1) positives plus C_i negatives, so the condition is C_i + i <= k.
KNS2
- Step 2 "Insert Negative" is implemented by the recursive function
  (n_out, C_out) = NegCount(n_in, C_in, Node, j_parent, Dists)
  - (n_in, C_in) summarize the interesting negative points for V;
  - (n_out, C_out) summarize the interesting negative points for V ∪ Node.
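A simplified sketch consistent with the definitions above; the paper's NegCount also threads a j_parent argument for tighter pruning, which is omitted here:

```python
import numpy as np

def neg_count(n, C, node, dists, q):
    """Fold the negative points in `node` into (n, C); k = len(dists)."""
    k = len(dists)
    lo = max(float(np.linalg.norm(q - node.pivot)) - node.radius, 0.0)
    if n == 0 or lo >= dists[n - 1]:
        return n, C                  # ball cannot affect the counts: prune
    if node.left is None:            # leaf: count each negative point
        for x in node.points:
            d = float(np.linalg.norm(q - x))
            for i in range(n):
                if d < dists[i]:
                    C[i] += 1        # one more negative closer than Dist_{i+1}
        # Drop positives that can no longer be among the k-NN (C_i + i > k).
        while n > 0 and C[n - 1] + n > k:
            n -= 1
        return n, C
    n, C = neg_count(n, C, node.left, dists, q)
    return neg_count(n, C, node.right, dists, q)
```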
KNS3
- KNS3 answers the question "are at least t of the k nearest neighbors positive?"
- No constraint of skewness between the classes.
- Proposition: at least t of the k-NN of q are positive iff Dist_t^pos <= Dist_m^neg, where Dist_t^pos is the distance to the t-th nearest positive point, Dist_m^neg is the distance to the m-th nearest negative point, and m = k - t + 1 (so m + t = k + 1).
- Instead of directly computing these two distances exactly, we compute lower and upper bounds on them and tighten the bounds until the comparison is decided.
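The proposition checked with exact distances, as a brute-force illustration (KNS3 itself never computes these distances exactly); assumes at least t positive and m negative points exist:

```python
import numpy as np

def at_least_t_positive(pos_points, neg_points, q, k, t):
    """True iff at least t of the k nearest neighbors of q are positive."""
    m = k - t + 1                    # so m + t = k + 1
    d_pos = np.sort(np.linalg.norm(pos_points - q, axis=1))
    d_neg = np.sort(np.linalg.norm(neg_points - q, axis=1))
    return bool(d_pos[t - 1] <= d_neg[m - 1])
```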
KNS3
- P is a set of balls from Postree and N is a set of balls from Negtree; the lower and upper bounds on Dist_t^pos and Dist_m^neg are computed from P and N, and balls are split to tighten the bounds.
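A sketch of how such bounds can be read off a set of balls without opening them, assuming the `BallNode` sketch: every point hidden in a ball inherits that ball's per-point lower or upper bound, and the j-th smallest of those bounds bounds the j-th smallest true distance.

```python
import numpy as np

def bound_jth_dist(balls, q, j, upper=False):
    """Lower (or upper) bound on the distance from q to its j-th nearest
    point among all points covered by `balls` (needs >= j covered points)."""
    per_point = []
    for b in balls:
        d = float(np.linalg.norm(q - b.pivot))
        bnd = d + b.radius if upper else max(d - b.radius, 0.0)
        per_point.extend([bnd] * len(b.points))  # every hidden point gets bnd
    per_point.sort()
    return per_point[j - 1]
```

With these bounds, if the upper bound on Dist_t^pos from P is at most the lower bound on Dist_m^neg from N the answer is yes; if the lower bound on Dist_t^pos exceeds the upper bound on Dist_m^neg the answer is no; otherwise a ball is split and the bounds are recomputed.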
Experimental results
- Real data [dataset summary table omitted]
Experimental results k=9, t=ceiling(k/2), Randomly pick 1% negative records and 50% positive records as test (986 points) Train on the reaming data points
Comments
- Why k-NN? It is a natural baseline.
- No free lunch: for uniformly distributed high-dimensional data there is no benefit.
- The observed speedups suggest that the intrinsic dimensionality of real data is much lower.