Fast and Scalable Nearest Neighbor Based Classification
Taufik Abidin and William Perrizo
Department of Computer Science, North Dakota State University


Classification: Given a (large) TRAINING SET, R(A1,…,An, C), with C = CLASSES and {A1,…,An} = FEATURES, classification is labeling unclassified objects based on the training set. kNN classification goes as follows: given an unclassified object and the training set, search for the k nearest neighbors, then vote the class.
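For concreteness, here is a minimal brute-force sketch of the kNN voting just described (hypothetical names, Euclidean distance); this is the horizontal one-scan baseline, not the P-tree method of this presentation:

```python
from collections import Counter

def knn_classify(training_set, labels, unclassified, k=3):
    """Brute-force kNN: one scan of the training set, then a majority vote.

    training_set : list of numeric feature tuples (A1..An)
    labels       : list of class labels C, aligned with training_set
    unclassified : the feature tuple to classify
    """
    # Squared Euclidean distance to every training tuple (1 scan).
    dists = [(sum((x - u) ** 2 for x, u in zip(row, unclassified)), c)
             for row, c in zip(training_set, labels)]
    # Keep the k nearest and vote the class.
    k_nearest = sorted(dists)[:k]
    return Counter(c for _, c in k_nearest).most_common(1)[0][0]

# Tiny usage example
train = [(1, 1), (1, 2), (8, 8), (9, 8)]
classes = ["low", "low", "high", "high"]
print(knn_classify(train, classes, (2, 1), k=3))  # -> "low"
```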

Problems with kNN: Finding the k-nearest-neighbor set from horizontally structured (record-oriented) data can be expensive for a large training set (containing millions or trillions of tuples):
– the cost is linear in the size of the training set (1 scan);
– closed kNN is much more accurate but requires 2 scans.
Vertically structuring the data can help.

Vertical Predicate-tree (P-tree) structuring: vertically partition the table and compress each vertical bit slice into a basic P-tree. A data table R(A1..An), containing horizontal structures (records), is then processed with vertical scans rather than record scans.

The basic (1-D) P-tree for a bit slice such as R11 is built by recording the truth of the predicate "pure 1" recursively on halves until purity is reached: the whole slice gets 0 if it is not pure-1, each half is tested the same way, and a branch ends as soon as a segment is pure (pure-1 or pure-0).

P-trees are processed using multi-operand logical ANDs. E.g., to count the tuples matching a given value pattern (1 where a basic P-tree appears, 0 where a complement P'-tree appears), AND the corresponding trees and take the root count:
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43
(The slide's example table R(A1 A2 A3 A4), its bit slices R11..R43, and the level-by-level tree diagram are figure content not reproduced here.)
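A minimal sketch of the basic P-tree idea (hypothetical dict-based nodes, no compression tricks): the "pure-1" predicate is recorded recursively on halves, root counts give 1-bit counts, and ANDing bit slices counts value patterns:

```python
def build_ptree(bits):
    """Basic (1-D) P-tree for one bit slice: record whether the segment is
    pure-1; if it is neither all 1s nor all 0s, split into halves and
    recurse until purity is reached."""
    if all(bits):                       # pure-1 segment
        return {"pure1": True, "len": len(bits)}
    if not any(bits):                   # pure-0 segment: branch ends
        return {"pure1": False, "len": len(bits)}
    half = len(bits) // 2
    return {"pure1": False, "len": len(bits),
            "children": [build_ptree(bits[:half]), build_ptree(bits[half:])]}

def root_count(node):
    """Number of 1-bits represented by the (sub)tree."""
    if "children" in node:
        return sum(root_count(c) for c in node["children"])
    return node["len"] if node["pure1"] else 0

def ptree_and(slices):
    """In practice multi-operand ANDs are done on the compressed trees
    themselves; here we AND the raw slices and rebuild, which yields the
    same root count."""
    return build_ptree([all(col) for col in zip(*slices)])

r11 = [0, 0, 0, 0, 1, 1, 0, 1]           # an illustrative bit slice
r12 = [1, 1, 0, 0, 1, 0, 0, 1]
print(root_count(build_ptree(r11)))       # -> 3
print(root_count(ptree_and([r11, r12])))  # tuples with 1 in both slices -> 2
```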

Total Variation: The total variation of a set X about a point a, TV(a), is the sum of the squared separations of the objects in X from a:

TV(a) = Σ_{x∈X} (x−a)∘(x−a)

We will use the concept of functional contours (in particular, the TV contours) in this presentation to identify a well-pruned, small superset of the nearest-neighbor set of an unclassified sample, which can then be scanned efficiently. First we discuss functional contours in general, then the specific TV contours.

There is a DUALITY between functions f: R(A1..An) → Y and derived attributes Af of R, given by x.Af ≡ f(x), where Dom(Af) = Y.

Given f: R(A1..An) → Y and S ⊆ Y, define contour(f,S) ≡ f⁻¹(S). From the derived-attribute point of view,
Contour(f,S) = SELECT A1..An FROM R* WHERE R*.Af ∈ S,
where R* is R extended with the derived attribute Af. If S = {a}, then f⁻¹({a}) is Isobar(f, a).

In the A1 × … × An × Y space, graph(f) = { (a1,...,an, f(a1,...,an)) | (a1,...,an) ∈ R }, and contour(f,S) is the set of tuples of R lying under the portion of graph(f) over S. (The slide's diagram of R, R*, and graph(f) is not reproduced here.)
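A tiny illustration of the duality (hypothetical relation and functional): materializing the derived attribute Af and selecting a contour is just a predicate on that column.

```python
# contour(f, S) = { x in R : f(x) in S }  ==  SELECT A1..An FROM R* WHERE R*.Af IN S
R = [(1, 2), (3, 4), (5, 6), (7, 8)]           # hypothetical relation R(A1, A2)

f = lambda x: x[0] + x[1]                      # any functional f: R -> Y

# R* is R extended with the derived attribute Af = f(x).
R_star = [(a1, a2, f((a1, a2))) for a1, a2 in R]

S = {7, 11}                                    # a subset of the range Y
contour = [row[:2] for row in R_star if row[2] in S]
print(contour)                                 # -> [(3, 4), (5, 6)]
```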

TV(a) = Σ_{x∈R} (x−a)∘(x−a). Using d as an index over the dimensions and i, j, k as bit-slice indexes (so x_d = Σ_k 2^k x_dk):

TV(a) = Σ_{x∈R} Σ_{d=1..n} (x_d² − 2 a_d x_d + a_d²)
      = Σ_{x∈R} Σ_d (Σ_i 2^i x_di)(Σ_j 2^j x_dj) − 2 Σ_{x∈R} Σ_d a_d (Σ_k 2^k x_dk) + |R| |a|²
      = Σ_{x,d,i,j} 2^(i+j) x_di x_dj − 2 Σ_d a_d Σ_{x,k} 2^k x_dk + |R| |a|²
TV(a) = Σ_{d,i,j} 2^(i+j) |P_di∧dj| − Σ_k 2^(k+1) Σ_d a_d |P_dk| + |R| |a|²

where |P| denotes the root count of the P-tree P. Equivalently, since Σ_{x∈R} x_d = |R| μ_d,

TV(a) = Σ_{x,d,i,j} 2^(i+j) x_di x_dj + |R| ( −2 Σ_d a_d μ_d + Σ_d a_d a_d )
      = Σ_{x,d,i,j} 2^(i+j) x_di x_dj − 2|R| Σ_d a_d μ_d + |R| Σ_d a_d a_d.

The first term does not depend on a. Thus the derived attribute coming from f(a) = TV(a) − TV(μ), which drops that first term entirely, has contours identical to those of TV (just a lowered graph). We also find it useful to post-compose a log function to reduce the number of bit slices. The resulting functional is called the High-Dimension-ready Total Variation, HDTV(a).
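The following sketch evaluates the last closed form above from plain bit counts (hypothetical names; the counts stand in for compressed P-tree root counts, and the data are assumed to be non-negative integers fitting in nbits bits), and checks it against the brute-force definition:

```python
def tv_from_root_counts(R, a, nbits=8):
    """TV(a) = sum_{x in R} (x-a).(x-a), computed from bit-slice counts only:
        TV(a) = sum_{d,i,j} 2^(i+j) |P_di ^ P_dj|
              - sum_k 2^(k+1) sum_d a_d |P_dk|
              + |R| * |a|^2
    """
    n = len(a)                                   # number of dimensions
    bit = lambda v, k: (v >> k) & 1              # k-th bit of an integer value

    term1 = sum(2 ** (i + j) *
                sum(bit(x[d], i) & bit(x[d], j) for x in R)   # |P_di ^ P_dj|
                for d in range(n) for i in range(nbits) for j in range(nbits))
    term2 = sum(2 ** (k + 1) *
                sum(a[d] * sum(bit(x[d], k) for x in R)       # |P_dk|
                    for d in range(n))
                for k in range(nbits))
    term3 = len(R) * sum(ad * ad for ad in a)                 # |R| * |a|^2
    return term1 - term2 + term3

R = [(7, 2), (0, 5), (1, 1), (4, 6)]             # small integer-valued relation
a = (3, 3)
brute = sum(sum((xd - ad) ** 2 for xd, ad in zip(x, a)) for x in R)
print(tv_from_root_counts(R, a), brute)          # both print 48
```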

The isobars of HDTV are hyper-circles centered at μ, and graph(g) is a log-shaped hyper-funnel. From equation 7,

f(a) = TV(a) − TV(μ) = |R| ( −2 Σ_d (a_d μ_d − μ_d μ_d) + Σ_d (a_d a_d − μ_d μ_d) ) = |R| Σ_d (a_d − μ_d)² = |R| |a−μ|²,

so f(μ) = 0 and g(a) ≡ HDTV(a) = ln( f(a) ) = ln|R| + ln|a−μ|².

For an ε-contour ring (radius ε about a), going inward and outward from a along a−μ by ε gives the inner point b = μ + (1 − ε/|a−μ|)(a−μ) and the outer point c = μ + (1 + ε/|a−μ|)(a−μ). Then g(b) and g(c) are the lower and upper endpoints of a vertical interval, S, defining the ε-contour. An easy P-tree calculation on that interval provides a P-tree mask for the ε-contour (no scan required). (The slide's funnel diagram showing a, b, c, g(b), and g(c) is not reproduced here.)
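A sketch of computing b and c and masking the training points whose HDTV values fall in [HDTV(b), HDTV(c)] (hypothetical names; a plain scan stands in for the P-tree interval calculation; a and all training points are assumed distinct from μ so the logs are defined):

```python
import math

def hdtv(x, mu, R_size):
    """HDTV(x) = ln(TV(x) - TV(mu)) = ln|R| + ln|x - mu|^2."""
    dist2 = sum((xd - md) ** 2 for xd, md in zip(x, mu))
    return math.log(R_size) + math.log(dist2)

def epsilon_contour_mask(R, a, mu, eps):
    """Inner point b and outer point c along a - mu, then a boolean mask of
    training points whose HDTV value falls in [HDTV(b), HDTV(c)].
    (In the real method this mask comes from a P-tree interval calculation,
    not from the scan used here.)"""
    dist = math.sqrt(sum((ad - md) ** 2 for ad, md in zip(a, mu)))
    b = tuple(md + (1 - eps / dist) * (ad - md) for ad, md in zip(a, mu))
    c = tuple(md + (1 + eps / dist) * (ad - md) for ad, md in zip(a, mu))
    lo, hi = hdtv(b, mu, len(R)), hdtv(c, mu, len(R))
    return [lo <= hdtv(x, mu, len(R)) <= hi for x in R]

R = [(1.0, 1.0), (2.0, 2.5), (8.0, 8.0), (9.0, 7.5)]
mu = tuple(sum(col) / len(R) for col in zip(*R))
print(epsilon_contour_mask(R, a=(8.5, 8.0), mu=mu, eps=1.0))
```

Note that the mask describes a ring about μ, so it can admit points far from a on the opposite side of μ; that is exactly why the further dimension-projection pruning on the next slide can be needed.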

As pre-processing, calculate basic P-trees for the HDTV derived attribute. To classify a:
1. Calculate b and c (which depend upon a and ε).
2. Form the mask P-tree for the training points whose HDTV values lie in [HDTV(b), HDTV(c)]. (Note: when the paper was submitted we were still doing this step by sorting TV(a) values; we now use the contour approach, which speeds this step up considerably. The performance-evaluation graphs in this paper are still based on the old method, and without Gaussian vote weighting.)
3. Use that mask P-tree to prune down the candidate NNS.
4. If the root count of the candidate set is small enough, scan it and assign class votes using, e.g., a Gaussian vote function; otherwise prune further using a dimension projection.
If more pruning is needed (i.e., the HDTV(a) contour is still too big to scan), use a dimension-projection contour (e.g., f(a) = a1); the dim-i projection P-trees are already computed, since they are the basic P-trees of R.Ai. Form that contour mask P-tree and AND it with the HDTV contour P-tree; the result is a mask for the intersection.
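Putting the steps together, a hedged end-to-end sketch of the prune-then-vote scheme described above (hypothetical names; plain lists instead of P-trees; the dimension-projection step is reduced to a simple range test on one attribute; Gaussian vote weighting as mentioned):

```python
import math
from collections import defaultdict

def smart_tv_classify(R, labels, a, eps=1.0, max_candidates=1000, sigma=1.0):
    """1-2. build the HDTV interval mask for the eps-contour about a,
    3.   prune the candidate NNS with that mask,
    4.   if the candidate set is small enough, scan it and take a
         Gaussian-weighted class vote; otherwise prune further with a
         dimension projection (here: a range test on dimension 0)."""
    mu = tuple(sum(col) / len(R) for col in zip(*R))
    mask = epsilon_contour_mask(R, a, mu, eps)          # from the sketch above
    candidates = [i for i, keep in enumerate(mask) if keep]

    if len(candidates) > max_candidates:                # step 4: extra pruning
        candidates = [i for i in candidates
                      if abs(R[i][0] - a[0]) <= eps]    # dim-0 projection contour

    votes = defaultdict(float)
    for i in candidates:                                # final scan of the pruned set
        d2 = sum((xd - ad) ** 2 for xd, ad in zip(R[i], a))
        votes[labels[i]] += math.exp(-d2 / (2 * sigma ** 2))   # Gaussian vote
    return max(votes, key=votes.get) if votes else None

R = [(1.0, 1.0), (2.0, 2.5), (8.0, 8.0), (9.0, 7.5)]
labels = ["blue", "blue", "red", "red"]
print(smart_tv_classify(R, labels, a=(8.5, 8.0)))       # -> "red"
```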

Graphs of TV, TV−TV(μ), and HDTV. (Figure: the three surfaces over the X–Y plane, marking TV(μ) = TV(x33) and TV(x15) and the lowered, log-flattened graphs; not reproduced here.)

Experiments: Datasets
1. KDDCUP-99 dataset (network intrusion dataset)
– 4.8 million records, 32 numerical attributes
– 6 classes, each containing >10,000 records
– Class distribution:
  Normal: 972,780
  IP sweep: 12,481
  Neptune: 1,072,017
  Port sweep: 10,413
  Satan: 15,892
  Smurf: 2,807,886
– Testing set: 120 records, 20 per class
– 4 synthetic datasets (randomly generated):
  10,000 records (SS-I)
  100,000 records (SS-II)
  1,000,000 records (SS-III)
  2,000,000 records (SS-IV)

Speed and scalability (k = 5). (Note: SMART-TV was done by sorting the derived attribute; now we use the much faster P-tree interval mask.) The slide's table reports results versus training-set cardinality (×1000) for SMART-TV, Vertical Closed-KNN, and KNN (some KNN entries NA); the figures themselves are in the table, which is not reproduced here. Machine used: Intel Pentium 4 CPU, 2.6 GHz, 3.8 GB RAM, running Red Hat Linux.

Datasets (cont.)
2. OPTICS dataset
– ~8,000 points, 8 classes (CL-1, CL-2, …, CL-8)
– 2 numerical attributes
– Training set: 7,920 points
– Testing set: 80 points, 10 per class

Datasets (cont.)
3. IRIS dataset
– 150 samples
– 3 classes (iris-setosa, iris-versicolor, and iris-virginica)
– 4 numerical attributes
– Training set: 120 samples
– Testing set: 30 samples, 10 per class

Classification accuracy comparison (overall F-score). (Note: SMART-TV class voting was done with equal votes for each training neighbor; now we use Gaussian vote weighting and get better accuracy than the other two.) The slide's table compares SMART-TV, PKNN, and KNN on the IRIS, OPTICS, SS-I, SS-II, SS-III, SS-IV, and NI datasets (with an NA entry on the NI row), plus overall accuracy; the figures themselves are not reproduced here.

Summary
A nearest-neighbor-based classification algorithm that starts its classification steps by approximating the nearest-neighbor set: a total-variation functional is used to prune down the NNS candidate set, and classification then finishes in the traditional way. The algorithm is fast and scales well to very large datasets. Its classification accuracy is very comparable to that of closed kNN (which is better than kNN).