A Fast and Scalable Nearest Neighbor Based Classification


A Fast and Scalable Nearest Neighbor Based Classification Taufik Abidin and William Perrizo Department of Computer Science North Dakota State University

Outline
- Nearest Neighbors Classification
- Problems
- SMART TV (SMall Absolute diffeRence of ToTal Variation): A Fast and Scalable Nearest Neighbors Classification Algorithm
- SMART TV in Image Classification

Search for the K-Nearest Neighbors / Classification
Given a (large) TRAINING SET, R(A1,…,An,C), with C = CLASSES and {A1,…,An} = FEATURES.
Classification is labeling unclassified objects based on the class-label assignment pattern of the objects in the training set.
kNN classification goes as follows: given an unclassified object, search the training set for its k-nearest neighbors, then vote the class.
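The kNN procedure just described can be sketched as a brute-force scan (a minimal illustration of plain kNN, not the paper's SMART-TV method; all names and data are ours):

```python
from collections import Counter
import math

def knn_classify(training_set, labels, x, k=3):
    """Brute-force kNN: scan the whole training set, take the k closest
    points to x, and vote the majority class label."""
    # Rank training points by Euclidean distance to x.
    order = sorted(range(len(training_set)),
                   key=lambda i: math.dist(training_set[i], x))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Tiny example: two clusters around (0,0) and (10,10).
train = [(0, 0), (1, 0), (0, 1), (10, 10), (9, 10), (10, 9)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(train, labels, (0.5, 0.5), k=3))  # -> a
```

Note the scan visits every training point, which is exactly the linear cost the next slide complains about.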

Problems with KNN
Finding the k-nearest-neighbor set can be expensive when the training set contains millions of objects (a very large training set): the search is linear in the size of the training set.
Can we make it faster (more scalable)?

A file R(A1..An) containing horizontal structures (records) is normally processed vertically (vertical scans). Predicate trees (P-trees): vertically partition the file; compress each vertical bit slice into a basic P-tree; horizontally process these basic P-trees using one multi-operand logical AND.

R(A1 A2 A3 A4), horizontal records scanned vertically (each attribute 3 bits wide; the last record is repeated):

R[A1] R[A2] R[A3] R[A4]
 010   111   110   001
 011   111   110   000
 010   110   101   001
 010   111   101   111
 101   010   001   100
 010   010   001   101
 111   000   001   100
 111   000   001   100

Vertical partitioning yields 12 bit slices, R11, R12, R13, ..., R41, R42, R43. 1-dimensional P-trees are built by recording the truth of the predicate "pure 1" recursively on halves, until there is purity. For P11 (slice R11 = 0 0 0 0 1 0 1 1):

1. The whole slice is not pure1 -> 0
2. The 1st half is not pure1 -> 0; but it is pure (pure0), so this branch ends
3. The 2nd half is not pure1 -> 0
4. The 1st half of the 2nd half is not pure1 -> 0
5. The 2nd half of the 2nd half is pure1 -> 1
6. The 1st half of the 1st-of-the-2nd is pure1 -> 1
7. The 2nd half of the 1st-of-the-2nd is not pure1 -> 0

E.g., to count occurrences of 111 000 001 100, AND the basic P-trees, complementing (') the slices where the pattern bit is 0:

P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43

The root count of the result is 2 (0 at the 2^3 level, 0 0 at the 2^2 level, and 0 1 at the 2^1 level: one pure1 pair contributes 2^1 = 2).
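As a runnable illustration of the slide's example (our own toy code, with the 8-row table as we read it from the slide, last record repeated; not the authors' implementation):

```python
def build_ptree(bits):
    """Record the predicate 'pure 1' recursively on halves.
    A node is (value, children): value 1 means the segment is pure 1;
    value 0 with no children means pure 0 (the branch ends)."""
    if all(bits):
        return (1, [])
    if not any(bits):
        return (0, [])          # pure 0: this branch ends
    mid = len(bits) // 2
    return (0, [build_ptree(bits[:mid]), build_ptree(bits[mid:])])

def root_count(tree, size):
    """Number of 1-bits the P-tree represents."""
    value, children = tree
    if value == 1:
        return size
    if not children:
        return 0
    half = size // 2
    return root_count(children[0], half) + root_count(children[1], size - half)

# Slice R11 (first bit of A1 over the 8 records):
r11 = [0, 0, 0, 0, 1, 0, 1, 1]
p11 = build_ptree(r11)
print(root_count(p11, 8))  # -> 3

# Counting pattern 111 000 001 100: AND the 12 slices
# (complemented where the pattern bit is 0), then count 1s.
rows = ["010111110001", "011111110000", "010110101001", "010111101111",
        "101010001100", "010010001101", "111000001100", "111000001100"]
pattern = "111000001100"
slices = [[int(r[i]) for r in rows] for i in range(12)]
acc = [1] * 8
for i, pbit in enumerate(pattern):
    s = slices[i] if pbit == "1" else [1 - b for b in slices[i]]
    acc = [a & b for a, b in zip(acc, s)]
print(sum(acc))  # -> 2
```

Here the AND is done on uncompressed slices for clarity; the point of P-trees is that the same AND can be done level by level on the compressed trees.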

Total Variation
The total variation of a set X about the mean, μ, measures the total squared separation of the objects in X from μ.
We will use the concept of functional contours in this presentation to determine a small pruned superset of the nearest-neighbor set (which will then be scanned).
First we discuss functional contours in general, then consider the specific TV contours.
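The slide's formula image did not survive extraction; written out in the notation used on the later slides (where TV(a) = Σ_{x∈R} (x−a)∘(x−a)), the definition is:

```latex
\mathrm{TV}(X,\mu) \;=\; \sum_{x \in X} (x-\mu)\circ(x-\mu) \;=\; \sum_{x \in X} \lVert x-\mu\rVert^{2}
```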

Given f: R(A1..An) -> Y (any range) and S ⊆ Y (any subset of the range), define contour(f,S) ≡ f⁻¹(S).

graph(f) = { (a1,...,an, f(a1,...,an)) | (a1,...,an) ∈ R }

There is a DUALITY between functions f: R(A1..An) -> Y and derived attributes Af of R, given by x.Af ≡ f(x), where Dom(Af) = Y. From the derived-attribute point of view (R* denotes R extended with the column Af):

Contour(Af, S) = SELECT A1..An FROM R* WHERE R*.Af ∈ S.

If S = {a}, then f⁻¹({a}) is Isobar(f, a).

= xRd=1..n(xd2 - 2adxd + ad2) i,j,k bit slices indexes R(A1..An) TV(a)=xR(x-a)o(x-a) If we use d for a index variable over the dimensions, = xRd=1..n(xd2 - 2adxd + ad2) i,j,k bit slices indexes = xRd=1..n(k2kxdk)2 - 2xRd=1..nad(k2kxdk) + |R||a|2 = xd(i2ixdi)(j2jxdj) - 2xRd=1..nad(k2kxdk) + |R||a|2 = xdi,j 2i+jxdixdj - 2 x,d,k2k ad xdk + |R||a|2 = x,d,i,j 2i+j xdixdj - |R||a|2 2 dad x,k2kxdk + = x,d,i,j 2i+j xdixdj - |R|dadad 2|R| dadd + = x,d,i,j 2i+j xdixdj + dadad ) |R|( -2dadd + TV(a) = i,j,d 2i+j |Pdi^dj| - |R||a|2 k2k+1 dad |Pdk| + The first term does not depend upon a, thus, the simpler derived attribute, TV-TV() does not have that term but has with identical contours as TV (just lowered the graph by the constant, TV() ). We also find it useful to apply a log to this simpler Total Variation function (to reduce the number of bit slices. The resulting functional is called the High-Dimension-ready Total Variation or HDTV(a).

TV(a) = x,d,i,j 2i+j xdixdj + |R| ( -2dadd + dadad ) From equation 7, f(a)=TV(a)-TV() TV(a) = x,d,i,j 2i+j xdixdj + |R| ( -2dadd + dadad ) = |R| ( -2d(add-dd) + d(adad- dd) ) + dd2 ) = |R|( dad2 - 2ddad = |R| |a-|2 so f()=0 and letting g(a) HDTV(a) = ln( f(a) )= ln|R| + ln|a-|2 Taking  g / ad (a) = | a- |2 2( a -)d The Gradient of g at a = 2/| a- |2 (a -) The gradient =0 iff a= and gradient length depends only on the length of a- so isobars are hyper-circles The gradient function is has the form, h(r) = 2/r in along any ray from , Integrating, we get that g(a) has the form, 2ln|a-| along any coordinate direction (in fact any radial direction from ), so the shape of graph(g) is a funnel: To get an -contour, we move in and out along a- by  to inner point, b=(1-/|a-|)(a-) and outer point c=(1+/|a-|)(a-). Then take f(b) and f(c) as lower and upper endpoints of the red vertical interval. Then we use formulas on that interval to get a P-tree for the -contour (which is a well-pruned superset of the -nbrhd of a  -contour (radius  about a) What inteval endpts gives an exact -contour in feature space? a f(b) f(c) b c

To classify a:
1. Calculate the basic P-trees for the derived attribute column of each training point.
2. Calculate b and c (which depend upon a and ε).
3. Get the feature-space P-tree for those points whose derived attribute value is in [f(b), f(c)]. (Note: when the camera-ready paper was submitted we were still doing this step by sorting the TV(a) values and then forming the predicate tree; now we use the contour approach, which speeds up this step considerably.)
4. Use that P-tree to prune out the candidate NNS.
5. If the root count of the candidate set is now small, proceed to scan and assign votes using Gaussian vote weights; else look for another pruning functional (e.g., a dimension-projection functional for the major a − μ dimensions).

For additional vertical pruning we can use any other functional contours that can easily be computed (e.g., the dimension-projection functionals).

Graph of TV, TV−TV(μ), and HDTV (figure: the TV and TV−TV(μ) surfaces and the HDTV funnel over the X-Y plane, marking TV(μ) = TV(x33) and TV(x15)).

Dataset
KDDCUP-99 dataset (network intrusion dataset): 4.8 million records, 32 numerical attributes; 6 classes, each containing >10,000 records.

Class distribution:
  Normal        972,780
  IP sweep       12,481
  Neptune     1,072,017
  Port sweep     10,413
  Satan          15,892
  Smurf       2,807,886

Testing set: 120 records, 20 per class.
4 synthetic datasets (randomly generated):
  10,000 records (SS-I)
  100,000 records (SS-II)
  1,000,000 records (SS-III)
  2,000,000 records (SS-IV)

Dataset (Cont.)
OPTICS dataset: 8,000 points, 8 classes (CL-1, CL-2, …, CL-8), 2 numerical attributes.
Training set: 7,920 points.
Testing set: 80 points, 10 per class.

Dataset (Cont.)
IRIS dataset: 150 samples, 3 classes (iris-setosa, iris-versicolor, and iris-virginica), 4 numerical attributes.
Training set: 120 samples.
Testing set: 30 samples, 10 per class.

Speed and Scalability
Speed and scalability comparison (k=5, hs=25):

  Cardinality (×1000):    10     100    1000    2000    4891
  SMART-TV              0.14    0.33    2.01    3.88    9.27
  P-KNN                 0.89    1.06    3.94   12.44   30.79
  KNN                   0.39    2.34   23.47   49.28      NA

Machine used: Intel Pentium 4 CPU 2.6 GHz, 3.8 GB RAM, running Red Hat Linux.

Classification Accuracy (Cont.)
Classification accuracy comparison (SS-III), k=5, hs=25 (table: per-class true positives (TP), false positives (FP), precision (P), recall (R), and F-measure (F) for SMART-TV, P-KNN, and KNN on the classes normal, ipsweep, neptune, portsweep, satan, and smurf).

Overall Classification Accuracy Comparison
Overall accuracy of SMART-TV, P-KNN, and KNN on the IRIS, OPTICS, SS-I, SS-II, SS-III, and SS-IV datasets (table: accuracies range from 0.71 to 0.99; KNN has no result (NA) on the largest dataset).

Summary
A nearest-neighbor-based classification algorithm that begins its classification steps by approximating a set of candidate nearest neighbors.
The absolute difference of total variation between the data points in the training set and the unclassified point is used to select the candidates.
The algorithm is fast, and it scales well to very large datasets. Its classification accuracy is very comparable to that of the KNN algorithm.

Appendix: Image Preprocessing
We extracted color and texture features from the original pixels of the images.
Color features: we used the HSV color space and quantized the images into 54 bins, i.e., 6 × 3 × 3 bins.
Texture features: we used multi-resolution Gabor filters with two scales and four orientations (see B.S. Manjunath, IEEE Trans. on Pattern Analysis and Machine Intelligence, 1996).
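A sketch of the 54-bin HSV quantization described above (our own code using Python's colorsys; the bin ordering is our assumption, and the Gabor texture features are not shown):

```python
import colorsys

def hsv_bin(r, g, b, nh=6, ns=3, nv=3):
    """Quantize an RGB pixel (0..255 per channel) into one of the
    nh * ns * nv = 54 HSV bins (6 hue x 3 saturation x 3 value)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)  # all in [0,1]
    hb = min(int(h * nh), nh - 1)
    sb = min(int(s * ns), ns - 1)
    vb = min(int(v * nv), nv - 1)
    return (hb * ns + sb) * nv + vb

def color_histogram(pixels):
    """Normalized 54-bin color histogram of a list of RGB pixels."""
    hist = [0] * 54
    for p in pixels:
        hist[hsv_bin(*p)] += 1
    total = sum(hist)
    return [c / total for c in hist]

print(len(color_histogram([(255, 0, 0), (0, 255, 0), (0, 0, 255)])))  # -> 54
```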

Image Dataset
Corel images (http://wang.ist.psu.edu/docs/related)
10 categories; originally, each category has 100 images.
Number of feature attributes: 54 from color features, 16 from texture features.
We randomly generated several bigger datasets to evaluate the speed and scalability of the algorithms.
Testing set: 50 images, 5 for each category.

Image Dataset

Example on Corel Dataset

Results
Per-class classification accuracy on the Corel dataset (classes C1–C10) for SMART-TV with hs = 15, 25, 35 and KNN with k = 3, 5, 7 (table: accuracies range from 0.43 to 0.96).

Results Classification Time

Results Preprocessing Time

Appendix: Overview of SMART-TV
Preprocessing phase: from the large training set, compute the root counts and measure the HDTV of each object; store the root count and HDTV values.
Classifying phase: given an unclassified object, approximate the candidate set of NNs, search for the k-nearest neighbors within the candidate set, and vote.

Preprocessing Phase
Compute the root counts of each class Cj, 1 ≤ j ≤ number of classes. Store the results. Complexity: O(kdb²), where k is the number of classes, d is the total number of dimensions, and b is the bit-width.
Compute the HDTV values for each class Cj, 1 ≤ j ≤ number of classes. Complexity: O(n), where n is the cardinality of the training set. Also retain these results.

Classifying Phase
Using the stored root count and TV values, given an unclassified object: approximate the candidate set of NNs, search for the k-nearest neighbors from the candidate set, and vote.

Classifying Phase
For each class Cj with nj objects, 1 ≤ j ≤ number of classes, do the following:
a. Compute the total variation of the unclassified object about Cj.
b. Find the hs objects in Cj such that the absolute difference between the total variation of the objects in Cj and the total variation of the unclassified object about Cj is smallest; let A be the array of these objects.
c. Store all objectIDs in A into TVGapList.

Classifying Phase (Cont.)
For each objectID_t, 1 ≤ t ≤ Len(TVGapList), where Len(TVGapList) equals hs times the total number of classes, retrieve the corresponding object's features from the training set, measure the pairwise Euclidean distance between it and the unclassified object, and determine the k nearest neighbors.
Vote the class label for the unclassified object using the k nearest neighbors.
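The two classifying-phase slides can be sketched end to end (a simplified, scan-based illustration of the TVGapList idea, our own code rather than the authors' P-tree implementation; hs and k are small made-up values):

```python
import math
from collections import Counter

def tv(point, objs):
    """Total variation of `point` about the set `objs`:
    sum over x in objs of |x - point|^2."""
    return sum(sum((xd - pd)**2 for xd, pd in zip(x, point)) for x in objs)

def smart_tv_classify(classes, x, hs=2, k=3):
    """classes: dict label -> list of training points; x: unclassified."""
    candidates = []                       # the pooled TVGapList
    for label, objs in classes.items():
        tvx = tv(x, objs)                 # TV of x about this class
        # hs objects whose TV about their class is closest to tv(x).
        ranked = sorted(objs, key=lambda o: abs(tv(o, objs) - tvx))
        candidates += [(o, label) for o in ranked[:hs]]
    # kNN vote over the candidate set only (len = hs * number of classes).
    candidates.sort(key=lambda ol: math.dist(ol[0], x))
    votes = Counter(label for _, label in candidates[:k])
    return votes.most_common(1)[0][0]

classes = {"a": [(0, 0), (1, 0), (0, 1), (1, 1)],
           "b": [(9, 9), (10, 9), (9, 10), (10, 10)]}
print(smart_tv_classify(classes, (0.4, 0.6)))  # -> a
```

In SMART-TV proper, the TV values come from precomputed P-tree root counts and the candidate pruning is done with the contour P-tree rather than a scan; only the final small candidate set is scanned.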