A Fast and Scalable Nearest Neighbor Based Classification

Presentation transcript:

A Fast and Scalable Nearest Neighbor Based Classification Taufik Abidin and William Perrizo Department of Computer Science North Dakota State University

Outline
Nearest Neighbors Classification Problems
SMART TV (SMall Absolute diffeRence of ToTal Variation): A Fast and Scalable Nearest Neighbors Classification Algorithm
SMART TV in Image Classification

Classification
Given a (large) TRAINING SET, R(A1,…,An, C), with C = CLASSES and (A1,…,An) = FEATURES, the classification task is to label unclassified objects based on the pre-defined class labels of objects in the training set.
Prominent classification algorithms: SVM, KNN, Bayesian, etc.
K-nearest-neighbor classification: for an unclassified object, search the training set for its k nearest neighbors, then vote the class.
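As a point of reference, the baseline procedure on this slide (search the training set for the k nearest neighbors, then vote) can be sketched in a few lines of Python. This is a generic illustration, not the authors' code; the array and function names are placeholders.

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k=5):
    """Label x by majority vote among its k nearest training objects.

    Brute-force baseline: computes the distance from x to every
    training object, so the cost grows linearly with the training set size.
    """
    dists = np.linalg.norm(train_X - x, axis=1)   # Euclidean distance to all objects
    nn_idx = np.argsort(dists)[:k]                # indices of the k nearest neighbors
    votes = Counter(train_y[i] for i in nn_idx)   # vote the class
    return votes.most_common(1)[0][0]
```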

Problems with KNN
Finding the k nearest neighbors is expensive when the training set contains millions of objects (a very large training set).
The classification time is linear in the size of the training set.
Can we make it faster and more scalable?

P-Tree Vertical Data Structure
The construction steps of P-trees:
1. Convert the data into binary
2. Vertically project each attribute
3. Vertically project each bit position
4. Compress each bit slice into a P-tree
Example relation R(A1, A2, A3, A4) with 3-bit attributes and its binary form:
A1 A2 A3 A4      R[A1] R[A2] R[A3] R[A4]
 2  7  6  1       010   111   110   001
 6  7  6  0       011   111   110   000
 2  7  5  1       010   111   101   001
 2  7  5  7       010   111   101   111
 5  2  1  4       101   010   001   100
 2  2  1  5       010   010   001   101
 7  0  1  4       111   000   001   100
Each attribute R[Ai] is projected into bit slices Ri1, Ri2, Ri3, and each bit slice is compressed into a P-tree P11, P12, …, P43.
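A minimal sketch of steps 1-3 (binary conversion and vertical projection into bit slices), using plain bit vectors rather than the compressed P-tree representation; the root count of a slice is simply its number of 1-bits.

```python
import numpy as np

# Example relation from the slide: 7 rows, 4 attributes, 3-bit values.
R = np.array([
    [2, 7, 6, 1],
    [6, 7, 6, 0],
    [2, 7, 5, 1],
    [2, 7, 5, 7],
    [5, 2, 1, 4],
    [2, 2, 1, 5],
    [7, 0, 1, 4],
])

BITS = 3  # bit-width of each attribute

def bit_slices(column, bits=BITS):
    """Vertically project one attribute into its bit slices,
    most significant bit first (steps 1-3 of P-tree construction)."""
    return [(column >> b) & 1 for b in range(bits - 1, -1, -1)]

# Root count of a bit slice = number of 1-bits in it.
slices_A1 = bit_slices(R[:, 0])
root_counts_A1 = [int(s.sum()) for s in slices_A1]
print(root_counts_A1)   # counts of 1s in the slices R11, R12, R13
```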

Total Variation
The Total Variation of a set X about a point a (for example, the mean) measures the total squared separation of the objects in X from a:
TV(X, a) = Σ_{x∈X} (x − a)·(x − a) = Σ_{x∈X} Σ_{i=1..n} (x_i − a_i)²
(The slide plots TV(X, a) as a function of a.)

Total Variation (Cont.)
The sums needed for TV can be obtained from the bit slices. For example, for the two values 2 (binary 10) and 3 (binary 11):
Summing value by value: 2¹×1 + 2⁰×0 + 2¹×1 + 2⁰×1 = 5
Summing by bit slice: 2¹×2 + 2⁰×1 = 5
That is, weighting the count of 1-bits in each bit slice by its power of 2 gives the same total as summing the values directly.

Total Variation (Cont.)
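Putting the bit-slice sums together, TV(X, a) about any point a can be computed from counts of 1-bits alone, via the standard expansion of the squared distances. The sketch below illustrates the idea under the assumption of unsigned integer attributes, with plain bit vectors standing in for compressed P-trees; it is not the paper's exact formulation.

```python
import numpy as np

def tv_from_bit_slices(X, a, bits=3):
    """Total variation TV(X, a) computed attribute-by-attribute from
    vertical bit slices, using only counts of 1-bits (root counts)."""
    n, d = X.shape
    total = 0.0
    for i in range(d):
        col = X[:, i]
        slices = [((col >> b) & 1) for b in range(bits)]   # bit slices of attribute i
        # sum_x x_i   = sum_b 2^b * rc(slice_b)
        s1 = sum((1 << b) * int(slices[b].sum()) for b in range(bits))
        # sum_x x_i^2 = sum_{b,c} 2^(b+c) * rc(slice_b AND slice_c)
        s2 = sum((1 << (b + c)) * int((slices[b] & slices[c]).sum())
                 for b in range(bits) for c in range(bits))
        # sum_x (x_i - a_i)^2 = sum_x x_i^2 - 2*a_i*sum_x x_i + n*a_i^2
        total += s2 - 2 * a[i] * s1 + n * a[i] ** 2
    return total

# Sanity check against the direct definition TV(X, a) = sum_x ||x - a||^2
X = np.array([[2, 7], [6, 7], [2, 7], [5, 2]])
a = X.mean(axis=0)
assert np.isclose(tv_from_bit_slices(X, a), ((X - a) ** 2).sum())
```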


The Independency of RC
The root count operations are independent of the point about which the total variation is measured, which allows us to run the operations once in advance and retain the count results.
In the classification task, the set of classes is known and unchanged. Thus, the total variation of an object about its class can be pre-computed.

Overview of SMART-TV
Preprocessing phase (over the large training set): compute the root counts, measure the TV of each object, and store the root count and TV values.
Classifying phase (for an unclassified object): approximate the candidate set of NNs, search the k nearest neighbors from the candidate set, and vote.

Preprocessing Phase
Compute the root counts of each class Cj, 1 ≤ j ≤ number of classes, and store the results. Complexity: O(kdb²), where k is the number of classes, d is the number of dimensions, and b is the bit-width.
Compute the total variation of each training object about its class Cj, 1 ≤ j ≤ number of classes, and retain the results. Complexity: O(n), where n is the cardinality of the training set.
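A minimal sketch of this phase, with NumPy arrays standing in for P-trees: the per-class sums and sums of squares play the role of the stored root counts, and TV(Cj, x) is pre-computed for every training object x in Cj. Function and field names are placeholders, not the authors' implementation.

```python
import numpy as np

def preprocess(train_X, train_y):
    """SMART-TV preprocessing phase (sketch): for each class Cj, store the
    quantities needed to evaluate TV(Cj, .) and the pre-computed TV(Cj, x)
    for every member x of Cj."""
    model = {}
    for cls in np.unique(train_y):
        idx = np.where(train_y == cls)[0]
        X_c = train_X[idx]
        n_c = len(idx)
        # TV(C, x) = sum_y ||y - x||^2 = sum_y ||y||^2 - 2 x . sum_y y + n ||x||^2
        sum_y  = X_c.sum(axis=0)       # per-attribute sums (root-count style sums)
        sum_y2 = (X_c ** 2).sum()      # total sum of squares over the class
        tv = sum_y2 - 2 * X_c @ sum_y + n_c * (X_c ** 2).sum(axis=1)
        model[cls] = {"idx": idx, "sum": sum_y, "sumsq": sum_y2,
                      "n": n_c, "tv": tv}   # TV(C, x) for every member x
    return model
```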

Classifying Phase
Using the stored root count and TV values, for an unclassified object: approximate the candidate set of NNs, search the k nearest neighbors from the candidate set, and vote.

Classifying Phase
For each class Cj with nj objects, 1 ≤ j ≤ number of classes, do the following:
a. Compute TV(Cj, x), where x is the unclassified object.
b. Find the hs objects in Cj whose total variation values are closest to TV(Cj, x), i.e., the objects y in Cj for which the absolute difference |TV(Cj, y) − TV(Cj, x)| is smallest. Let A be the array of these objects.
c. Store all objectIDs in A into TVGapList.

Classifying Phase (Cont.)
For each objectIDt, 1 ≤ t ≤ Len(TVGapList), where Len(TVGapList) is equal to hs times the total number of classes, retrieve the corresponding object features from the training set and measure the pairwise Euclidean distance between that object and the unclassified object x, then determine the k nearest neighbors of x.
Vote the class label for x using the k nearest neighbors.
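Continuing the preprocessing sketch above (same stand-ins and placeholder names), the classifying phase could look like this:

```python
import numpy as np
from collections import Counter

def classify(x, train_X, train_y, model, k=5, hs=25):
    """SMART-TV classifying phase (sketch): build a small candidate set
    by total-variation gap, then run ordinary kNN voting inside it."""
    candidates = []
    for cls, m in model.items():
        # TV(Cj, x) for the unclassified object x, from the stored per-class sums.
        tv_x = m["sumsq"] - 2 * x @ m["sum"] + m["n"] * (x ** 2).sum()
        gap = np.abs(m["tv"] - tv_x)          # |TV(Cj, y) - TV(Cj, x)| for members y
        nearest = np.argsort(gap)[:hs]        # hs smallest gaps in this class
        candidates.extend(m["idx"][nearest])  # objectIDs into the TV-gap list
    candidates = np.array(candidates)
    dists = np.linalg.norm(train_X[candidates] - x, axis=1)   # Euclidean distances
    knn = candidates[np.argsort(dists)[:k]]                   # k nearest within candidates
    return Counter(train_y[i] for i in knn).most_common(1)[0][0]

# Usage: model = preprocess(train_X, train_y); label = classify(x, train_X, train_y, model)
```

Because the candidate set has only hs times (number of classes) members, the final distance computation touches a small, fixed number of objects regardless of the training-set size.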

Dataset
KDDCUP-99 Dataset (Network Intrusion Dataset): 4.8 million records, 32 numerical attributes, 6 classes, each containing >10,000 records.
Class distribution: Normal 972,780; IP sweep 12,481; Neptune 1,072,017; Port sweep 10,413; Satan 15,892; Smurf 2,807,886.
Testing set: 120 records, 20 per class.
4 synthetic datasets (randomly generated): 10,000 records (SS-I), 100,000 records (SS-II), 1,000,000 records (SS-III), 2,000,000 records (SS-IV).

Dataset (Cont.) OPTICS dataset 8,000 points, 8 classes (CL-1, CL-2,…,CL-8) 2 numerical attributes Training set: 7,920 points Testing set: 80 points, 10 per class

Dataset (Cont.) IRIS dataset 150 samples 3 classes (iris-setosa, iris-versicolor, and iris-virginica) 4 numerical attributes Training set: 120 samples Testing set: 30 samples, 10 per class

Speed and Scalability
Speed and Scalability Comparison (k=5, hs=25), cardinality in thousands:

Algorithm   10      100     1000    2000    4891
SMART-TV    0.14    0.33    2.01    3.88    9.27
P-KNN       0.89    1.06    3.94    12.44   30.79
KNN         0.39    2.34    23.47   49.28   NA

Machine used: Intel Pentium 4 CPU 2.6 GHz, 3.8 GB RAM, running Red Hat Linux

Classification Accuracy (Cont.)
Classification Accuracy Comparison (SS-III), k=5, hs=25
Columns: Algorithm, Class, TP (true positives), FP (false positives), P (precision), R (recall), F (F-measure)
SMART-TV: normal 18 1.00 0.90 0.95; ipsweep 20 1 0.98; neptune; portsweep; satan 17 2 0.85 0.87; smurf 4 0.83 0.91
P-KNN: 15 0.75 0.86; 14 0.93 0.70 0.80; 5 0.89
KNN: 3 0.94

Overall Classification Accuracy Comparison
Datasets: SMART-TV, PKNN, KNN
IRIS 0.97 0.71; OPTICS 0.96 0.99; SS-I 0.72 0.89; SS-II 0.92 0.91; SS-III 0.94; SS-IV; NI 0.93 NA

Outline
Nearest Neighbors Classification Problems
SMART TV (SMall Absolute diffeRence of ToTal Variation): A Fast and Scalable Nearest Neighbors Classification Algorithm
SMART TV in Image Classification

Image Preprocessing
We extracted color and texture features from the original pixels of the images.
Color features: we used the HSV color space and quantized the images into 54 bins, i.e., 6 x 3 x 3 bins.
Texture features: we used multi-resolution Gabor filters with two scales and four orientations (see B.S. Manjunath, IEEE Trans. on Pattern Analysis and Machine Intelligence, 1996).
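A minimal sketch of this feature extraction, assuming scikit-image; the specific Gabor frequencies and the use of the mean and standard deviation of each filter response are assumptions for illustration, not the authors' exact settings.

```python
import numpy as np
from skimage.color import rgb2hsv, rgb2gray
from skimage.filters import gabor

def color_features(rgb_image):
    """54-bin HSV color histogram: hue x saturation x value = 6 x 3 x 3 bins."""
    hsv = rgb2hsv(rgb_image)                       # H, S, V each in [0, 1]
    hist, _ = np.histogramdd(hsv.reshape(-1, 3), bins=(6, 3, 3),
                             range=((0, 1), (0, 1), (0, 1)))
    return hist.ravel() / hist.sum()               # normalized 54-dim vector

def texture_features(rgb_image, frequencies=(0.1, 0.3)):
    """16 Gabor texture features: 2 scales x 4 orientations,
    mean and standard deviation of each filter response magnitude."""
    gray = rgb2gray(rgb_image)
    feats = []
    for f in frequencies:                          # two scales (assumed frequencies)
        for theta in np.arange(4) * np.pi / 4:     # four orientations
            real, imag = gabor(gray, frequency=f, theta=theta)
            mag = np.hypot(real, imag)             # magnitude of the complex response
            feats.extend([mag.mean(), mag.std()])
    return np.array(feats)                         # 2 x 4 x 2 = 16 values
```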

Image Dataset - 54 from color features - 16 from texture features Corel images (http://wang.ist.psu.edu/docs/related) 10 categories Originally, each category has 100 images Number of feature attributes: - 54 from color features - 16 from texture features We randomly generated several bigger size datasets to evaluate the speed and scalability of the algorithms. 50 images for testing set, 5 for each category

Image Dataset

Example on Corel Dataset

Results
Classification results on the Corel dataset by class, for SMART-TV (hs=15, 25, 35) and KNN, each with k=3, 5, 7:
C1 0.69 0.72 0.75 0.74 0.73 0.78 0.81 0.77 0.79; C2 0.64 0.60 0.59 0.62 0.68 0.63 0.66; C3 0.65 0.67 0.76 0.57 0.70; C4 0.84 0.87 0.90 0.88; C5 0.91 0.92 0.93 0.89 0.94; C6 0.61 0.71; C7 0.85; C8 0.96; C9 0.52 0.43 0.45 0.54; C10 0.82

Results Classification Time

Results Preprocessing Time

Summary
A nearest-neighbor-based classification algorithm that begins its classification by approximating a set of candidate nearest neighbors.
The absolute difference in total variation between the data points in the training set and the unclassified point is used to select the candidates.
The algorithm is fast and scales well to very large datasets. Its classification accuracy is very comparable to that of the KNN algorithm.