New Algorithms for Efficient High-Dimensional Nonparametric Classification
Ting Liu, Andrew W. Moore, and Alexander Gray
Overview
- Introduction
- k Nearest Neighbors (k-NN)
  - KNS1: conventional k-NN search
- New algorithms for k-NN classification
  - KNS2: for skewed-class data
  - KNS3: "are at least t of the k-NN positive?"
- Results
- Comments
Introduction: k-NN
- k-NN is a nonparametric classification method: given a data set of n points, it finds the k points closest to a query point q and predicts the label held by the majority of them.
- Its computational cost is too high in many applications, especially in the high-dimensional case.
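A minimal brute-force sketch of the classifier being accelerated; the function name and signature are illustrative, not from the paper:

```python
# Brute-force k-NN classification: O(n*d) distance computations per query,
# which is the cost KNS1/KNS2/KNS3 aim to reduce.
import numpy as np

def knn_classify(X_train, y_train, q, k=9):
    """Return the majority label among the k training points closest to q.

    Assumes y_train holds small non-negative integer class labels.
    """
    dists = np.linalg.norm(X_train - q, axis=1)  # distance to every point
    nn_idx = np.argsort(dists)[:k]               # indices of the k closest
    return int(np.bincount(y_train[nn_idx]).argmax())
```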
Introduction: KNS1
- KNS1: conventional k-NN search with a ball-tree.
- Ball-tree (binary):
  - The root node represents the full set of points.
  - A leaf node contains a small subset of the points.
  - A non-leaf node has two child nodes.
  - Pivot of a node: one of the points in the node, or the centroid of the node's points.
  - Radius of a node: the maximum distance from the pivot to any point owned by the node.
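A hedged construction sketch consistent with the definitions above; the slides do not fix a splitting rule, so a common farthest-pair split is assumed, and `BallNode`/`leaf_size` are illustrative names:

```python
import numpy as np

class BallNode:
    """One ball: pivot (here the centroid), radius, and two optional children."""
    def __init__(self, points, leaf_size=20):
        self.points = points
        self.pivot = points.mean(axis=0)
        self.radius = float(np.linalg.norm(points - self.pivot, axis=1).max())
        self.left = self.right = None
        if len(points) > leaf_size:
            # Farthest-pair split (an assumption): pick two far-apart points
            # and send every point to the nearer of the two.
            a = points[np.linalg.norm(points - points[0], axis=1).argmax()]
            b = points[np.linalg.norm(points - a, axis=1).argmax()]
            to_a = (np.linalg.norm(points - a, axis=1)
                    <= np.linalg.norm(points - b, axis=1))
            if to_a.any() and (~to_a).any():   # guard against degenerate splits
                self.left = BallNode(points[to_a], leaf_size)
                self.right = BallNode(points[~to_a], leaf_size)
```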
Introduction: KNS1
- Bound the distance from a query point q: for any point x in a node, the triangle inequality gives
  max(|q - Pivot| - Radius, 0) <= |q - x| <= |q - Pivot| + Radius.
- Trade off the cost of construction against the tightness of the radii of the balls.
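The same bounds as code, assuming the `BallNode` sketch above:

```python
import numpy as np

def ball_bounds(q, node):
    """Triangle-inequality bounds on |q - x| for every point x in `node`."""
    d = float(np.linalg.norm(q - node.pivot))
    return max(d - node.radius, 0.0), d + node.radius  # (min_dist, max_dist)
```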
Introduction: KNS1
- KNS1 is a recursive procedure: PSout = BallKNN(PSin, Node)
  - PSin consists of the k-NN of q in V (the set of points searched so far).
  - PSout consists of the k-NN of q in V ∪ Node.
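A sketch of this recursion under the earlier `BallNode` assumption; `ps` is a sorted list of (distance, point) pairs, and the initial call is `ball_knn([], root, q, k)`:

```python
import numpy as np

def ball_knn(ps_in, node, q, k):
    """Return the k-NN of q in V ∪ node, given ps_in = the k-NN of q in V."""
    worst = ps_in[-1][0] if len(ps_in) == k else np.inf
    if max(float(np.linalg.norm(q - node.pivot)) - node.radius, 0.0) >= worst:
        return ps_in                       # whole ball is pruned by the bound
    if node.left is None:                  # leaf: test each point directly
        ps = list(ps_in)
        for x in node.points:
            d = float(np.linalg.norm(q - x))
            if len(ps) < k or d < ps[-1][0]:
                ps.append((d, x))
                ps.sort(key=lambda p: p[0])   # keep the best k, sorted
                ps = ps[:k]
        return ps
    # Search the closer child first so later pruning is more effective.
    kids = sorted((node.left, node.right),
                  key=lambda c: float(np.linalg.norm(q - c.pivot)))
    ps = ball_knn(ps_in, kids[0], q, k)
    return ball_knn(ps, kids[1], q, k)
```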
KNS2
- KNS2 targets skewed-class data: one class is much more frequent than the other.
- Goal: find the number of positive points among the k-NN of q without explicitly finding the k-NN set.
- Basic idea: build two ball-trees, Postree (small, positive class) and Negtree (negative class), then proceed in two steps (see the sketch below):
  - Step 1 "Find Positive": search Postree with KNS1 to find the k nearest positive neighbors, Posset_k.
  - Step 2 "Insert Negative": search Negtree, using Posset_k as bounds to prune faraway nodes and to count the negative points that would be inserted into the true nearest-neighbor set.
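A high-level skeleton of the two steps, assuming `ball_knn` from the KNS1 sketch and the `neg_count` recursion sketched after the NegCount slide below:

```python
def kns2(postree, negtree, q, k):
    """Return the number of positive points among the k-NN of q.

    Assumes the positive class has at least k points.
    """
    # Step 1 "Find Positive": distances to the k nearest positive neighbors.
    dists = sorted(d for d, _ in ball_knn([], postree, q, k))
    # Step 2 "Insert Negative": count negatives that displace positives.
    n, C = neg_count(len(dists), [0] * len(dists), negtree, dists, q)
    return n
```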
KNS2
Definitions:
- Dists = {Dist_1, ..., Dist_k}: the distances from q to its k nearest positive neighbors, sorted in increasing order (Dist_1 <= ... <= Dist_k).
- V: the set of points in the negative balls visited so far.
- (n, C): n is the number of positive points among the k-NN of q; C = {C_1, ..., C_n}, where C_i = |{x in V : |x - q| < Dist_i}| is the number of negative points in V closer to q than the i-th positive neighbor.
- Note: the i-th positive neighbor belongs to the k-NN of q exactly when fewer than k points are closer than it, i.e. (i - 1) positives plus C_i negatives, so the condition is C_i + i <= k.
KNS2
- Step 2 "Insert Negative" is implemented by the recursive function
  (n_out, C_out) = NegCount(n_in, C_in, Node, j_parent, Dists)
  - (n_in, C_in) summarize the interesting negative points for V;
  - (n_out, C_out) summarize the interesting negative points for V ∪ Node.
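A simplified sketch consistent with the definitions above; the paper's NegCount also threads a j_parent argument for tighter pruning, which is omitted here:

```python
import numpy as np

def neg_count(n, C, node, dists, q):
    """Fold the negative points in `node` into (n, C); k = len(dists)."""
    k = len(dists)
    lo = max(float(np.linalg.norm(q - node.pivot)) - node.radius, 0.0)
    if n == 0 or lo >= dists[n - 1]:
        return n, C                  # ball cannot affect the counts: prune
    if node.left is None:            # leaf: count each negative point
        for x in node.points:
            d = float(np.linalg.norm(q - x))
            for i in range(n):
                if d < dists[i]:
                    C[i] += 1        # one more negative closer than Dist_{i+1}
        # Drop positives that can no longer be among the k-NN (C_i + i > k).
        while n > 0 and C[n - 1] + n > k:
            n -= 1
        return n, C
    n, C = neg_count(n, C, node.left, dists, q)
    return neg_count(n, C, node.right, dists, q)
```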
KNS3
- KNS3 answers the question "are at least t of the k nearest neighbors positive?"
- No constraint of skewness between the classes.
- Proposition: at least t of the k-NN of q are positive iff Dist_t^pos <= Dist_m^neg, where Dist_t^pos is the distance to the t-th nearest positive point, Dist_m^neg is the distance to the m-th nearest negative point, and m = k - t + 1 (so m + t = k + 1).
- Instead of directly computing these two distances exactly, we compute lower and upper bounds on them and tighten the bounds until the comparison is decided.
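The proposition checked with exact distances, as a brute-force illustration (KNS3 itself never computes these distances exactly); assumes at least t positive and m negative points exist:

```python
import numpy as np

def at_least_t_positive(pos_points, neg_points, q, k, t):
    """True iff at least t of the k nearest neighbors of q are positive."""
    m = k - t + 1                    # so m + t = k + 1
    d_pos = np.sort(np.linalg.norm(pos_points - q, axis=1))
    d_neg = np.sort(np.linalg.norm(neg_points - q, axis=1))
    return bool(d_pos[t - 1] <= d_neg[m - 1])
```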
KNS3
- P is a set of balls from Postree and N is a set of balls from Negtree; the lower and upper bounds on Dist_t^pos and Dist_m^neg are computed from P and N, and balls are split to tighten the bounds.
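A sketch of how such bounds can be read off a set of balls without opening them, assuming the `BallNode` sketch: every point hidden in a ball inherits that ball's per-point lower or upper bound, and the j-th smallest of those bounds bounds the j-th smallest true distance.

```python
import numpy as np

def bound_jth_dist(balls, q, j, upper=False):
    """Lower (or upper) bound on the distance from q to its j-th nearest
    point among all points covered by `balls` (needs >= j covered points)."""
    per_point = []
    for b in balls:
        d = float(np.linalg.norm(q - b.pivot))
        bnd = d + b.radius if upper else max(d - b.radius, 0.0)
        per_point.extend([bnd] * len(b.points))  # every hidden point gets bnd
    per_point.sort()
    return per_point[j - 1]
```

With these bounds, if the upper bound on Dist_t^pos from P is at most the lower bound on Dist_m^neg from N the answer is yes; if the lower bound on Dist_t^pos exceeds the upper bound on Dist_m^neg the answer is no; otherwise a ball is split and the bounds are recomputed.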
Experimental results
- Real data [dataset summary table omitted]
Experimental results k=9, t=ceiling(k/2), Randomly pick 1% negative records and 50% positive records as test (986 points) Train on the reaming data points
Comments
- Why k-NN? It is a natural baseline.
- No free lunch: for uniformly distributed high-dimensional data there is no benefit.
- The observed speedups suggest that the intrinsic dimensionality of real data is much lower.