K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU.

Presentation transcript:

k-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU

Introduction
 We explored distance-metric-based computation using P-trees
 Defined a new distance metric, called HOB distance
 Revealed some useful properties of P-trees
 A new method of nearest neighbor classification using P-trees, called Closed-KNN
 A new algorithm for k-clustering using P-trees, with efficient statistical computation from the P-trees

Overview
1. Data Mining - classification and clustering
2. Various distance metrics: Minkowski, Manhattan, Euclidean, Max, Canberra, Chord, and HOB distance - neighborhoods and decision boundaries
3. P-trees and their properties
4. k-nearest neighbor classification - Closed-KNN using Max and HOB distance
5. k-clustering - overview of existing algorithms - our new algorithm - computation of mean and variance from the P-trees

Data Mining: extracting knowledge from a large amount of data.
Functionalities: feature selection, association rule mining, classification & prediction, cluster analysis, outlier analysis, evolution analysis.
[Figure: the information pyramid - raw data at the base (more data, less information), useful information at the top; data mining moves up the pyramid.]

Classification: predicting the class of a data object; also called supervised learning.
Training data (class labels are known):
Feature1  Feature2  Feature3  Class
a1        b1        c1        A
a2        b2        c2        A
a3        b3        c3        B
A classifier built from the training data predicts the class of a sample with unknown class, e.g. (a, b, c) -> predicted class of the sample.

Types of Classifier
Eager classifier: builds a classifier model in advance, e.g. decision tree induction, neural networks
Lazy classifier: uses the raw training data directly, e.g. k-nearest neighbor

Clustering
The process of grouping objects into classes with the objective that the data objects are similar to the objects in the same cluster and dissimilar to the objects in other clusters.
[Figure: a two-dimensional space showing 3 clusters.]
Clustering is often called unsupervised learning or unsupervised classification: the class labels of the data objects are unknown.

Distance Metric
Measures the dissimilarity between two data points. A distance metric is a function, d, of two n-dimensional points X and Y, such that
d(X, Y) is positive definite: if X ≠ Y, d(X, Y) > 0; if X = Y, d(X, Y) = 0
d(X, Y) is symmetric: d(X, Y) = d(Y, X)
d satisfies the triangle inequality: d(X, Y) + d(Y, Z) ≥ d(X, Z)

Various Distance Metrics
Let X = (x_1, x_2, ..., x_n) and Y = (y_1, y_2, ..., y_n).
Minkowski distance or L_p distance:  d_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
Manhattan distance (p = 1):  d_1(X, Y) = \sum_{i=1}^{n} |x_i - y_i|
Euclidean distance (p = 2):  d_2(X, Y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }
Max distance (p = ∞):  d_∞(X, Y) = \max_{i} |x_i - y_i|
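A minimal sketch (not from the slides; the function names are mine) showing the metrics above as special cases of one L_p function; the points (2, 1) and (6, 4) in the check are X and Y from the worked example on the next slide.

```python
from math import inf

def minkowski(X, Y, p):
    """L_p distance: (sum |x_i - y_i|^p)^(1/p); p = inf gives the Max distance."""
    diffs = [abs(x - y) for x, y in zip(X, Y)]
    if p == inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

manhattan = lambda X, Y: minkowski(X, Y, 1)    # p = 1
euclidean = lambda X, Y: minkowski(X, Y, 2)    # p = 2
max_dist  = lambda X, Y: minkowski(X, Y, inf)  # p = infinity

# Values from the worked example on the next slide: X = (2, 1), Y = (6, 4)
print(manhattan((2, 1), (6, 4)))  # 7
print(euclidean((2, 1), (6, 4)))  # 5.0
print(max_dist((2, 1), (6, 4)))   # 4
```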

An Example
In a two-dimensional space, let X = (2, 1), Y = (6, 4), and Z = (6, 1), the corner of the right triangle XZY.
Manhattan:  d_1(X, Y) = XZ + ZY = 4 + 3 = 7
Euclidean:  d_2(X, Y) = XY = 5
Max:        d_∞(X, Y) = max(XZ, ZY) = XZ = 4
So d_1 ≥ d_2 ≥ d_∞; in general, for any positive integer p, d_p(X, Y) ≥ d_{p+1}(X, Y).

Some Other Distances
Canberra distance:  \sum_{i=1}^{n} \frac{|x_i - y_i|}{x_i + y_i}
Squared chord distance:  \sum_{i=1}^{n} (\sqrt{x_i} - \sqrt{y_i})^2
Squared chi-squared distance:  \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}

HOB Similarity
Higher Order Bit (HOB) similarity: for two integer scalars A and B, with a_i and b_i the i-th bits of A and B (left to right) and m the number of bits,
HOBS(A, B) = max { s : 0 ≤ s ≤ m and a_i = b_i for all i ≤ s }
i.e. the number of matching bits counted from the most significant bit down to the first mismatch.
Example with 8-bit values x_1, x_2, y_1, y_2: HOBS(x_1, y_1) = 3, HOBS(x_2, y_2) = 4.

HOB Distance
The HOB distance between two scalar values A and B:  d_v(A, B) = m - HOBS(A, B)
Continuing the previous example: d_v(x_1, y_1) = 8 - 3 = 5 and d_v(x_2, y_2) = 8 - 4 = 4.
The HOB distance between two points X and Y:  d_h(X, Y) = max_i { d_v(x_i, y_i) }
In our example (two-dimensional data): d_h(X, Y) = max(5, 4) = 5.
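A minimal sketch of HOBS and the HOB distance for m-bit integers (the function names are mine, not the paper's).

```python
def hobs(a, b, m=8):
    """HOBS(A, B): number of matching bits from the most significant bit down."""
    s = 0
    for i in range(m - 1, -1, -1):              # bit m-1 is the most significant
        if ((a >> i) & 1) == ((b >> i) & 1):
            s += 1
        else:
            break
    return s

def d_v(a, b, m=8):
    """HOB distance between scalars: d_v(A, B) = m - HOBS(A, B)."""
    return m - hobs(a, b, m)

def d_h(X, Y, m=8):
    """HOB distance between points: the max of d_v over the dimensions."""
    return max(d_v(x, y, m) for x, y in zip(X, Y))

# e.g. two arbitrary 8-bit values agreeing in their top 3 bits:
# hobs(0b10110000, 0b10100000) == 3, so d_v(0b10110000, 0b10100000) == 5
```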

HOB Distance Is a Metric
HOB distance is positive definite: if X = Y, d_h(X, Y) = 0; if X ≠ Y, d_h(X, Y) > 0
HOB distance is symmetric
HOB distance satisfies the triangle inequality

Neighborhood of a Point
The neighborhood of a target point T is a set of points S such that X ∈ S if and only if d(T, X) ≤ r. If X is a point on the boundary, d(T, X) = r.
[Figures: the neighborhood of T with radius r under the Manhattan, Euclidean, Max, and HOB distances - a diamond, a circle, and (for Max and HOB) a square of side 2r.]

Decision Boundary
The decision boundary between points A and B is the locus of points X satisfying d(A, X) = d(B, X).
[Figures: decision boundaries between A and B under the Euclidean, Max, and Manhattan distances, for boundary angles greater than and less than 45 degrees.]
The decision boundary for HOB distance is perpendicular to the axis along which the distance is maximum.

Remotely Sensed Imagery Data
An image is a collection of pixels; each pixel represents a square area on the ground.
Several attributes or bands are associated with each pixel, e.g. red, green, and blue reflectance values, soil moisture, nitrate.
Band Sequential (BSQ) file: one file for each band.
Bit Sequential (bSQ) file: one file for each bit of each band; B_{i,j} is the bSQ file for the j-th bit of the i-th band.

Peano Count Tree (P-tree)
We form one P-tree from each bSQ file; P_{i,j} is the basic P-tree for bit j of band i.
The root of the P-tree is the count of 1 bits in the entire image; the root has 4 children with the counts of the 4 quadrants.
Quadrants are divided recursively until a quadrant contains only one bit, unless the node is pure0 (all bits 0) or pure1 (all bits 1).
[Figure: an example P-tree with root count 55 and first-level quadrant counts 16, 8, 15, 16; pure quadrants are not subdivided further.]
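The quadrant recursion above can be sketched roughly as follows; this toy, uncompressed dict representation and the function name are my own, not the paper's storage format. It assumes a square 2^k x 2^k bit grid.

```python
def build_ptree(grid, r0, c0, size):
    """Build a Peano count tree node for the size x size quadrant at (r0, c0)."""
    ones = sum(grid[r][c] for r in range(r0, r0 + size)
                          for c in range(c0, c0 + size))
    node = {"count": ones, "children": None}
    if ones == 0 or ones == size * size or size == 1:
        return node                      # pure0, pure1, or a single bit: stop
    half = size // 2
    node["children"] = [                 # the 4 quadrants, in Peano order
        build_ptree(grid, r0,        c0,        half),
        build_ptree(grid, r0,        c0 + half, half),
        build_ptree(grid, r0 + half, c0,        half),
        build_ptree(grid, r0 + half, c0 + half, half),
    ]
    return node

# For an 8x8 bSQ bit grid: build_ptree(grid, 0, 0, 8)["count"] is the root count,
# i.e. the number of 1 bits in the whole image.
```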

Peano Mask Tree (PMT)
In the PMT, each node stores a mask instead of a count: 0 represents a pure0 node, 1 represents a pure1 node, and m represents a mixed node.
[Figure: the P-tree from the previous slide (root count 55, first-level counts 16, 8, 15, 16) alongside the corresponding PMT with m/0/1 node labels.]

P-tree ANDing
[Figure: ANDing two mask trees node by node - a pure0 operand gives a pure0 result, a pure1 operand passes the other subtree through, and two mixed nodes are ANDed recursively.]
ORing and the COMPLEMENT operation are performed in a similar way.
There are also other P-tree structures (such as the PVT) and ANDing algorithms that are beyond the scope of this presentation.

Value & Interval P-trees
The value P-tree P_i(v) represents the pixels that have value v for band i: there is a 1 in P_i(v) at a pixel location if that pixel has the value v for band i, and a 0 otherwise.
Let b_j be the j-th bit of the value v and P_{i,j} the basic P-tree for band i, bit j. Define
Pt_{i,j} = P_{i,j}   if b_j = 1
Pt_{i,j} = P'_{i,j}  if b_j = 0   (the complement)
Then P_i(v) = Pt_{i,1} AND Pt_{i,2} AND Pt_{i,3} AND ... AND Pt_{i,m}
The interval P-tree: P_i(v_1, v_2) = P_i(v_1) OR P_i(v_1+1) OR P_i(v_1+2) OR ... OR P_i(v_2)
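A rough sketch of this construction, with plain Python integers used as bit masks in place of compressed P-trees; the AND/complement algebra is the same, but the helper names are mine.

```python
def value_ptree(basic, value, m, n_pixels):
    """basic[j] = mask of pixels whose j-th bit (j=0 is most significant) is 1."""
    full = (1 << n_pixels) - 1                  # the pure1 mask
    result = full
    for j in range(m):
        bit = (value >> (m - 1 - j)) & 1
        pt = basic[j] if bit == 1 else (full ^ basic[j])  # complement if bit is 0
        result &= pt
    return result

def interval_ptree(basic, v1, v2, m, n_pixels):
    """P_i(v1, v2): OR of the value P-trees for every value in [v1, v2]."""
    result = 0
    for v in range(v1, v2 + 1):
        result |= value_ptree(basic, v, m, n_pixels)
    return result
```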

Notations
P_1 & P_2 : P_1 AND P_2
P_1 | P_2 : P_1 OR P_2
P' : COMPLEMENT of P
P_{i,j} : basic P-tree for band i, bit j
P_i(v) : value P-tree for value v of band i
P_i(v_1, v_2) : interval P-tree for interval [v_1, v_2] of band i
P_0 : pure0-tree, a P-tree whose root node is pure0
P_1 : pure1-tree, a P-tree whose root node is pure1
rc(P) : root count of P-tree P
N : number of pixels;  n : number of bands;  m : number of bits

Properties of P-trees
1. a)  b)
2. a)  b)  c)  d)
3. a)  b)  c)  d)
4. rc(P_1 | P_2) = 0  ⟹  rc(P_1) = 0 and rc(P_2) = 0
5. v_1 ≠ v_2  ⟹  rc{P_i(v_1) & P_i(v_2)} = 0
6. rc(P_1 | P_2) = rc(P_1) + rc(P_2) - rc(P_1 & P_2)
7. rc{P_i(v_1) | P_i(v_2)} = rc{P_i(v_1)} + rc{P_i(v_2)}, where v_1 ≠ v_2

P-tree Header
Header of a P-tree file, to make a generalized P-tree structure. Fields (with sizes of 1, 2, and 4 words): Format Code, Fan-out, # of levels, Root count, Length of the body in bytes; followed by the Body of the P-tree.

k-Nearest Neighbor Classification
1) Select a suitable value for k
2) Determine a suitable distance metric
3) Find the k nearest neighbors of the sample using the selected metric
4) Find the plurality class of the nearest neighbors by voting on the class labels of the NNs
5) Assign the plurality class to the sample to be classified
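For reference, a minimal sketch of the five steps above in plain (non-P-tree) form; the names are mine, and any distance function can be passed in.

```python
from collections import Counter

def knn_classify(sample, training, k, dist):
    """training: list of (feature_vector, class_label) pairs; dist: any metric."""
    neighbors = sorted(training, key=lambda t: dist(sample, t[0]))[:k]  # step 3
    votes = Counter(label for _, label in neighbors)                    # step 4
    return votes.most_common(1)[0][0]                                   # step 5
```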

Closed-KNN
T is the target pixel. With k = 3, to find the third nearest neighbor, KNN arbitrarily selects one point from the boundary line of the neighborhood; Closed-KNN includes all points on the boundary.
Closed-KNN yields higher classification accuracy than traditional KNN.

Searching Nearest Neighbors
We begin the search by finding the exact matches. Let the target sample be T = (v_1, v_2, ..., v_n).
The initial neighborhood is the point T itself. We expand the neighborhood along each dimension: along dimension i, [v_i] is expanded to the interval [v_i - a_i, v_i + b_i], for some positive integers a_i and b_i.
Expansion continues until there are at least k points in the neighborhood.

HOB Similarity Method for KNN
In this method, we match bits of the target to the training data. First we find matches in all 8 bits of each band (exact matching).
Let b_{i,j} be the j-th bit of the i-th band of the target pixel. Define
Pt_{i,j} = P_{i,j}   if b_{i,j} = 1
Pt_{i,j} = P'_{i,j}  otherwise   (the complement)
and Pv_{i,1-j} = Pt_{i,1} & Pt_{i,2} & Pt_{i,3} & ... & Pt_{i,j}
Pnn = Pv_{1,1-8} & Pv_{2,1-8} & Pv_{3,1-8} & ... & Pv_{n,1-8}
If rc(Pnn) < k, update Pnn = Pv_{1,1-7} & Pv_{2,1-7} & Pv_{3,1-7} & ... & Pv_{n,1-7}, and so on, dropping one low-order bit at a time.

An Analysis of the HOB Method
Let the i-th band value of the target T be v_i = 105 = 01101001b.
Exact match:    [01101001] = [105, 105]
1st expansion:  [0110100-] = [01101000, 01101001] = [104, 105]
2nd expansion:  [011010--] = [01101000, 01101011] = [104, 107]
 The neighborhood does not expand evenly on both sides: the target is 105, but the center of [104, 111] (the 3rd expansion) is (104 + 111) / 2 = 107.5.
 And it expands by powers of 2.
 But it is computationally very cheap.
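The expansion intervals above can be checked with a few lines (assuming 8-bit band values; the helper name is mine).

```python
def hob_interval(v, bits_dropped, m=8):
    """Fix the top (m - bits_dropped) bits of v; the rest range over all values."""
    lo = v & ~((1 << bits_dropped) - 1)
    hi = lo | ((1 << bits_dropped) - 1)
    return lo, hi

print(hob_interval(105, 0))  # (105, 105)  exact match
print(hob_interval(105, 1))  # (104, 105)  1st expansion
print(hob_interval(105, 2))  # (104, 107)  2nd expansion
print(hob_interval(105, 3))  # (104, 111)  3rd expansion; center 107.5, not 105
```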

Perfect Centering Method
The Max distance metric provides a better neighborhood by keeping the target in the center and expanding by 1 on both sides.
Initial neighborhood P-tree (exact matching): Pnn = P_1(v_1) & P_2(v_2) & P_3(v_3) & ... & P_n(v_n)
If rc(Pnn) < k:  Pnn = P_1(v_1-1, v_1+1) & P_2(v_2-1, v_2+1) & ... & P_n(v_n-1, v_n+1)
If rc(Pnn) < k:  Pnn = P_1(v_1-2, v_1+2) & P_2(v_2-2, v_2+2) & ... & P_n(v_n-2, v_n+2), and so on.
Computationally costlier than the HOB similarity method, but gives slightly better classification accuracy.
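A minimal sketch of this expansion loop, reusing the interval_ptree mask helper sketched earlier; rc() is taken to be the popcount of the mask, and the names and clamping to [0, 255] are my assumptions.

```python
def perfect_centering(basic_by_band, target, k, m, n_pixels, max_value=255):
    """basic_by_band[i] = list of basic bit masks for band i; target = (v_1..v_n)."""
    r = 0
    while True:
        pnn = (1 << n_pixels) - 1                       # start from the pure1 mask
        for i, v in enumerate(target):                  # AND one interval per band
            lo, hi = max(v - r, 0), min(v + r, max_value)
            pnn &= interval_ptree(basic_by_band[i], lo, hi, m, n_pixels)
        if bin(pnn).count("1") >= k:                    # rc(Pnn) >= k: done
            return pnn
        r += 1                                          # expand by 1 on both sides
```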

Finding the Plurality Class
Let P_c(i) be the value P-tree for class i.
Plurality class = the class i that maximizes rc{ Pnn & P_c(i) }.

Performance
Experiments were run on two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND.
The data contain 6 bands: red, green, and blue reflectance values, soil moisture, nitrate, and yield (the class label).
Band values range from 0 to 255 (8 bits); we consider 8 classes or levels of yield values: 0 to 7.

Performance – Accuracy 1997 Dataset:

Performance - Accuracy (cont.) 1998 Dataset:

Performance - Time 1997 Dataset: both axes on a logarithmic scale

Performance - Time (cont.) 1998 Dataset: both axes on a logarithmic scale

k-Clustering
Partitioning the data into k clusters, C_1, C_2, ..., C_k, so as to minimize some criterion function, such as
- the sum of squared Euclidean distances measured from the centroids of the clusters (the total variance): \sum_{i=1}^{k} \sum_{p \in C_i} d^2(p, c_i), where c_i is the centroid or mean of C_i, or
- the sum of pair-wise weights: \sum_{i=1}^{k} \sum_{p, q \in C_i} w(p, q), where w is a weight function, usually the distance between p and q.

k-Means Algorithm
1. Arbitrarily select k initial cluster centers
2. Assign each data point to its nearest center
3. Update the centers to the means of the clusters
4. Repeat steps 2 & 3 until there is no change
Good optimization, but very slow: complexity is O(nNkt), where n = # of dimensions, N = # of data points, k = # of clusters, t = # of iterations.
To address the speed issue, other algorithms have been proposed that sacrifice some quality.
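For comparison, a minimal NumPy sketch of the four steps above (names and the iteration cap are mine; this is the plain algorithm, not the P-tree variant).

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """points: (N, n) array. Returns the final centers and cluster labels."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]        # step 1
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                      # step 2
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])            # step 3
        if np.allclose(new, centers):                                  # step 4
            break
        centers = new
    return centers, labels
```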

Divisive Approach
1. Initially consider the whole space as one hyperbox
2. Select a hyperbox to split
3. Select an axis and a cut-point
4. Split the selected hyperbox by a hyperplane perpendicular to the selected axis through the selected cut-point
5. Repeat steps 2-4 until there are k hyperboxes; each hyperbox is a cluster
The mean-split algorithm, the variance-based algorithm, and our proposed new algorithm all follow the divisive approach; they differ in their strategies for selecting the hyperbox, the axis, and the cut-point.

Mean-Split Algorithm
The initial hyperbox (the whole space) is assigned a number k, i.e. k clusters will be formed from this hyperbox. Let L be the number of clusters assigned to a hyperbox; when it is split, L_i clusters are assigned to the i-th sub-hyperbox (i = 1, 2) in proportion to a weighted combination of its number of points n and its volume V, with a weight parameter 0 ≤ α ≤ 1.
1. Select a hyperbox with L > 1
2. Select the axis with the largest spread of the projected data
3. The mean of the projected data is the cut-point
Fast, but poor optimization.

Variance-Based Algorithm
1. Select the hyperbox with the largest variance
2. By checking each point on each dimension of the selected hyperbox, find the optimal cut-point, t_opt, that gives the maximum variance reduction on the projected data, where w_i and σ_i² are the weight and variance of the i-th interval (i = 1, 2)
Still computationally costly, but the optimization is closer to k-means.

Our Algorithm
When a new hyperbox is formed, find two means m_1 and m_2 for each dimension using the projected data:
a. Arbitrarily select two values for m_1 and m_2 (m_1 < m_2)
b. Update m_1 = mean of the projected data in the interval [0, (m_1+m_2)/2]
c. Update m_2 = mean of the projected data in the interval [(m_1+m_2)/2, upper_limit]
d. Repeat steps b & c until there is no change in m_1 and m_2
1. Select the hyperbox and axis for which (m_2 - m_1) is largest
2. Cut-point = (m_1 + m_2) / 2
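A rough sketch of steps a-d on one projected axis, working directly on the projected values; the function name, initialization choice, and convergence tolerance are my assumptions.

```python
def two_means_cut(values, m1=None, m2=None, tol=1e-6):
    """Iterate the two means on one axis and return (m1, m2, cut_point)."""
    values = sorted(values)
    if m1 is None or m2 is None:
        m1, m2 = float(values[0]), float(values[-1])     # arbitrary initial means
    while True:
        mid = (m1 + m2) / 2.0
        left = [v for v in values if v <= mid]           # interval [0, (m1+m2)/2]
        right = [v for v in values if v > mid]           # the upper interval
        new1 = sum(left) / len(left) if left else m1
        new2 = sum(right) / len(right) if right else m2
        if abs(new1 - m1) < tol and abs(new2 - m2) < tol:
            return m1, m2, mid                           # cut-point = (m1 + m2) / 2
        m1, m2 = new1, new2
```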

Our Algorithm (cont.)
We represent each cluster by a P-tree; the initial cluster is the pure1-tree, P_1.
Let P_Ci be the P-tree for cluster C_i. The P-trees for the two new clusters after splitting along axis j are:
P_Ci1 = P_Ci & P_j(0, (m_1+m_2)/2)
P_Ci2 = P_Ci & P_j((m_1+m_2)/2, upper_limit)
Note: P_j((m_1+m_2)/2, upper_limit) is the complement of P_j(0, (m_1+m_2)/2).

Computing Sum & Mean from P-trees
For all points, for dimension (band) i:
sum_i = \sum_{j=1}^{m} 2^{m-j} \, rc(P_{i,j})        mean_i = sum_i / N
For the points in a cluster, with the template P-tree P_t being the P-tree representing the cluster:
sum_i = \sum_{j=1}^{m} 2^{m-j} \, rc(P_t \& P_{i,j})   mean_i = sum_i / rc(P_t)
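A minimal sketch of these sums, again with bit masks standing in for P-trees and popcount standing in for rc(); the helper names are mine.

```python
def rc(mask):
    """Root count: here simply the popcount of the mask."""
    return bin(mask).count("1")

def band_sum(basic, m, template):
    """Sum of band values over the pixels in 'template':
       sum over j of 2^(m-1-j) * rc(template & P_j), with bit j=0 most significant."""
    return sum((1 << (m - 1 - j)) * rc(template & basic[j]) for j in range(m))

def band_mean(basic, m, template):
    return band_sum(basic, m, template) / rc(template)

# For all N pixels, pass template = (1 << N) - 1 (the pure1 tree);
# for a cluster, pass that cluster's P-tree mask.
```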

Computing Variance from P-trees
Variance = E[X²] - (E[X])², i.e. the mean of the squared values minus the square of the mean.
For all points in the space: since x = \sum_j 2^{m-j} x_j (x_j is the j-th bit), x² = \sum_j \sum_l 2^{2m-j-l} x_j x_l, so
E[X²] = \frac{1}{N} \sum_{j=1}^{m} \sum_{l=1}^{m} 2^{2m-j-l} \, rc(P_{i,j} \& P_{i,l})
For the points in a cluster, AND each term with the template P-tree P_t and divide by rc(P_t) instead of N.

Performance
Unlike the variance-based method, which checks each point on the axis, our method rapidly converges to the optimal cut-point, t_opt.
It avoids scanning the database by computing sums and means from the root counts of the P-trees.
It is much faster than the variance-based method, while the optimization is as good as that of the variance-based method.

Conclusion
 Analyzed the effect of various distance metrics
 Used a new metric, HOB distance, for fast P-tree-based computation
 Revealed useful properties of P-trees
 Using P-trees, developed a fast new method of KNN, called Closed-KNN, giving higher classification accuracy
 Designed a new, fast k-clustering algorithm that computes sum, mean, and variance from P-trees without scanning the database

Thank You