Fast Similarity Metric Based Data Mining Techniques Using P-trees: k-Nearest Neighbor Classification
- Distance-metric-based computation using P-trees
- A new distance metric, called the HOBbit distance
- Some useful properties of P-trees
- A new P-tree nearest-neighbor classification method, called Closed-KNN
These notes contain NDSU confidential & proprietary material. Patents pending on bSQ, P-tree technology.
Data Mining: extracting knowledge from a large amount of data.
Functionalities: feature selection, association rule mining, classification & prediction, cluster analysis, outlier analysis.
The Information Pyramid: data mining moves from a large volume of raw data at the base to a small amount of useful information at the top (sometimes a single bit: Y/N) – more data volume, less information per unit of data.
Classification: predicting the class of a data object; also called supervised learning.
Training data (class labels are known and supervise the learning):
  Feature1  Feature2  Feature3  Class
  a1        b1        c1        A
  a2        b2        c2        A
  a3        b3        c3        B
Sample with unknown class: (a, b, c) – the classifier predicts the class of the sample.
Eager classifier: builds a classifier model in advance, e.g. decision tree induction, neural networks.
Lazy classifier: uses the raw training data directly, e.g. k-nearest neighbor.
Clustering (unsupervised learning – chapter 8): the process of grouping objects into classes with the objective that data objects are similar to the objects in the same cluster and dissimilar to the objects in other clusters.
[Figure: a two-dimensional space showing 3 clusters.]
Clustering is often called unsupervised learning or unsupervised classification because the class labels of the data objects are unknown.
Distance Metric (used in both classification and clustering): measures the dissimilarity between two data points.
A metric is a function, d, of two n-dimensional points X and Y, such that
- d(X, Y) is positive definite: if X ≠ Y then d(X, Y) > 0; if X = Y then d(X, Y) = 0
- d(X, Y) is symmetric: d(X, Y) = d(Y, X)
- d satisfies the triangle inequality: d(X, Y) + d(Y, Z) ≥ d(X, Z)
Various Distance Metrics
For any positive integer p, the Minkowski distance, or L_p distance: d_p(X, Y) = ( Σ_{i=1..n} |x_i − y_i|^p )^(1/p)
- Manhattan distance (p = 1): d_1(X, Y) = Σ_{i=1..n} |x_i − y_i|
- Euclidean distance (p = 2): d_2(X, Y) = sqrt( Σ_{i=1..n} (x_i − y_i)^2 )
- Max distance (p = ∞): d_∞(X, Y) = max_{i=1..n} |x_i − y_i|
An Example
A two-dimensional space with X = (2, 1), Y = (6, 4), and Z = (6, 1), the corner point below Y:
- Manhattan: d_1(X, Y) = |XZ| + |ZY| = 4 + 3 = 7
- Euclidean: d_2(X, Y) = |XY| = 5
- Max: d_∞(X, Y) = max(|XZ|, |ZY|) = |XZ| = 4
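A minimal Python sketch (not part of the original notes) that reproduces these numbers for X = (2, 1) and Y = (6, 4):

```python
def minkowski(x, y, p):
    """Minkowski (L_p) distance between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def max_distance(x, y):
    """L_infinity (max) distance."""
    return max(abs(a - b) for a, b in zip(x, y))

X, Y = (2, 1), (6, 4)
print(minkowski(X, Y, 1))   # Manhattan: |2-6| + |1-4| = 7
print(minkowski(X, Y, 2))   # Euclidean: sqrt(16 + 9) = 5.0
print(max_distance(X, Y))   # Max: max(4, 3) = 4
```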
Some Other Distances
- Canberra distance: d_c(X, Y) = Σ_{i=1..n} |x_i − y_i| / (x_i + y_i)
- Squared chord distance: d_sc(X, Y) = Σ_{i=1..n} ( sqrt(x_i) − sqrt(y_i) )^2
- Squared chi-squared distance: d_chi(X, Y) = Σ_{i=1..n} (x_i − y_i)^2 / (x_i + y_i)
HOBbit Similarity
Higher Order Bit (HOBbit) similarity between two scalars (integers) A and B:
  HOBbitS(A, B) = max{ s : a_i = b_i for all 1 ≤ i ≤ s }
where a_i, b_i are the i-th bits of A and B (left to right, most significant bit first) and m is the number of bits; i.e., the number of consecutive high-order bit positions on which A and B agree.
Example with two 8-bit values per point: HOBbitS(x_1, y_1) = 3 (the first three bits agree) and HOBbitS(x_2, y_2) = 4 (the first four bits agree).
HOBbit Distance (related to Hamming distance)
The HOBbit distance between two scalar values A and B: d_v(A, B) = m − HOBbitS(A, B).
Continuing the previous example (m = 8): d_v(x_1, y_1) = 8 − 3 = 5 and d_v(x_2, y_2) = 8 − 4 = 4.
The HOBbit distance between two points X and Y: d_h(X, Y) = max_{i=1..n} d_v(x_i, y_i).
In our example (two-dimensional data): d_h(X, Y) = max(5, 4) = 5.
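A minimal sketch of these definitions (assuming 8-bit unsigned band values, so m = 8; the function names are ours, not from the notes):

```python
M = 8  # number of bits per band value

def hobbit_similarity(a, b, m=M):
    """Number of consecutive most-significant bits on which a and b agree."""
    s = 0
    for i in range(m - 1, -1, -1):          # scan bits from MSB to LSB
        if (a >> i) & 1 == (b >> i) & 1:
            s += 1
        else:
            break
    return s

def hobbit_scalar_distance(a, b, m=M):
    """d_v(A, B) = m - HOBbitS(A, B)."""
    return m - hobbit_similarity(a, b, m)

def hobbit_distance(x, y, m=M):
    """d_h(X, Y) = max over dimensions of the scalar HOBbit distance."""
    return max(hobbit_scalar_distance(a, b, m) for a, b in zip(x, y))
```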
HOBbit Distance Is a Metric
- HOBbit distance is positive definite: if X = Y then d_h(X, Y) = 0; if X ≠ Y then d_h(X, Y) > 0
- HOBbit distance is symmetric
- HOBbit distance satisfies the triangle inequality
Neighborhood of a Point
The neighborhood of a target point T is the set of points S such that X ∈ S if and only if d(T, X) ≤ r. If X is a point on the boundary, d(T, X) = r.
[Figure: the neighborhoods of width 2r around T under the Manhattan, Euclidean, Max, and HOBbit metrics.]
Decision Boundary
The decision boundary between points A and B is the locus of points X satisfying d(A, X) = d(B, X).
[Figure: decision boundaries between A and B for the Manhattan, Euclidean, and max distances, for boundary angles greater than and less than 45°.]
The decision boundary for the HOBbit distance is perpendicular to the axis along which the distance is maximal.
Minkowski Metrics
L_p metrics (aka Minkowski metrics): d_p(X, Y) = ( Σ_{i=1..n} w_i |x_i − y_i|^p )^(1/p), with the weights w_i assumed = 1.
[Figure: unit disks and dividing lines for p = 1 (Manhattan), p = 2 (Euclidean), p = 3, 4, ..., p = ∞ (max, "chessboard"), and p = 1/2, 1/3, 1/4, ...]
d_max(X, Y) ≡ max_i |x_i − y_i|, and d_max ≡ lim_{p→∞} d_p(X, Y).
Proof (sketch): let b ≡ max_i(a_i) and let k be the duplicity of b in the sum. For p large enough, the other terms satisfy a_i^p << b^p (since (a_i/b)^p → 0), so Σ_{i=1..n} a_i^p ≈ k·b^p, hence ( Σ_{i=1..n} a_i^p )^(1/p) ≈ k^(1/p)·b, and k^(1/p) → 1 as p → ∞.
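A small numeric check of this limit (using a hypothetical pair of points, not data from the notes): d_p approaches max_i |x_i − y_i| as p grows.

```python
def d_p(x, y, p):
    """Unweighted Minkowski L_p distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (2, 1, 7), (6, 4, 9)           # hypothetical points; |differences| = 4, 3, 2
for p in (1, 2, 4, 8, 16, 32, 64):
    print(p, round(d_p(x, y, p), 4))  # values converge toward max(|differences|) = 4
```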
P > 1 Minkowski Metrics
[Spreadsheet table: for various exponents q > 1 and sample coordinate pairs (x_1, y_1), (x_2, y_2), the columns (x_1 − y_1)^q, (x_2 − y_2)^q, and their sum raised to the power 1/q; as q grows the computed distance approaches the MAX distance.]
P < 1 Minkowski Metrics
d_{1/p}(X, Y) = ( Σ_{i=1..n} |x_i − y_i|^(1/p) )^p
p = 0 (the limit as p → 0) does not exist (the expression does not converge).
[Spreadsheet tables of the computed values for various exponents q < 1, in the same format as the P > 1 tables.]
Min dissimilarity function
The d_min function, d_min(X, Y) = min_{i=1..n} |x_i − y_i|, is strange: it is not a pseudo-metric.
[Figure: the unit disk of d_min, and the neighborhood of the blue point relative to the red point (the dividing neighborhood – those points closer to the blue point than to the red). Major bifurcations!]
Other Interesting Metrics
- Canberra metric: d_c(X, Y) = Σ_{i=1..n} |x_i − y_i| / (x_i + y_i) – a normalized Manhattan distance
- Squared chord metric: d_sc(X, Y) = Σ_{i=1..n} ( sqrt(x_i) − sqrt(y_i) )^2 – already discussed as L_p with p = 1/2
- Squared chi-squared metric: d_chi(X, Y) = Σ_{i=1..n} (x_i − y_i)^2 / (x_i + y_i)
- HOBbit metric (Higher Order Binary bit): d_H(X, Y) = max_{i=1..n} { m − HOB(x_i, y_i) }, where, for m-bit integers A = a_1..a_m and B = b_1..b_m, HOB(A, B) = max{ s : a_i = b_i for all 1 ≤ i ≤ s } (related to the Hamming distance in coding theory)
- Scalar product metric: d_s(X, Y) = X ∘ Y = Σ_{i=1..n} x_i · y_i
- Hyperbolic metrics (which map infinite space 1-1 onto a sphere)
Which are rotationally invariant? Translationally invariant? Other?
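A minimal sketch of the first few metrics above (assuming non-negative coordinates with x_i + y_i > 0, so the denominators and square roots are defined):

```python
import math

def canberra(x, y):
    """Normalized Manhattan distance."""
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y))

def squared_chord(x, y):
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x, y))

def squared_chi_squared(x, y):
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y))

def scalar_product(x, y):
    return sum(a * b for a, b in zip(x, y))
```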
Notations
- P_1 & P_2 : P_1 AND P_2 (also written P_1 ^ P_2)
- P_1 | P_2 : P_1 OR P_2
- P′ : complement P-tree of P
- P_{i,j} : basic P-tree for band i, bit j
- P_i(v) : value P-tree for value v of band i
- P_i([v_1, v_2]) : interval P-tree for the interval [v_1, v_2] of band i
- P_0 : pure0-tree, a P-tree whose root node is pure0
- P_1 : pure1-tree, a P-tree whose root node is pure1
- rc(P) : root count of P-tree P
- N : number of pixels; n : number of bands; m : number of bits
Properties of P-trees
1.–3. (identities on P-tree AND, OR, and complement; formulas not shown in these notes)
4. rc(P_1 | P_2) = 0 iff rc(P_1) = 0 and rc(P_2) = 0
5. v_1 ≠ v_2 ⇒ rc{ P_i(v_1) & P_i(v_2) } = 0
6. rc(P_1 | P_2) = rc(P_1) + rc(P_2) − rc(P_1 & P_2)
7. rc{ P_i(v_1) | P_i(v_2) } = rc{ P_i(v_1) } + rc{ P_i(v_2) }, where v_1 ≠ v_2
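A quick sanity check of root-count properties 4 and 6 (a sketch only: plain integer bit masks stand in for P-trees, and rc() is simply a popcount, which is what a P-tree's root count reports):

```python
def rc(mask):
    """Root count of a bit-mask stand-in for a P-tree: the number of 1-bits."""
    return bin(mask).count("1")

p1 = 0b10110010   # stand-in for P-tree P_1 over 8 pixels
p2 = 0b01110000   # stand-in for P-tree P_2

# Property 6: rc(P_1 | P_2) = rc(P_1) + rc(P_2) - rc(P_1 & P_2)
assert rc(p1 | p2) == rc(p1) + rc(p2) - rc(p1 & p2)

# Property 4: rc(P_1 | P_2) = 0  iff  rc(P_1) = 0 and rc(P_2) = 0
assert (rc(p1 | p2) == 0) == (rc(p1) == 0 and rc(p2) == 0)
```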
k-Nearest Neighbor Classification and Closed-KNN
1) Select a suitable value for k
2) Determine a suitable distance metric
3) Find the k nearest neighbors of the sample using the selected metric
4) Find the plurality class of the nearest neighbors by voting on the class labels of the NNs
5) Assign the plurality class to the sample being classified
T is the target pixel. With k = 3, to find the third nearest neighbor, traditional KNN arbitrarily selects one point from the boundary line of the neighborhood; Closed-KNN includes all points on the boundary (see the sketch below).
Closed-KNN yields higher classification accuracy than traditional KNN.
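A minimal sketch of the Closed-KNN voting rule (plain Python over explicit points, not the P-tree implementation described on the following slides); every training point tied with the k-th nearest neighbor is kept rather than broken off arbitrarily:

```python
from collections import Counter

def closed_knn_classify(train, target, k, dist):
    """train: list of (point, label) pairs; dist: any distance function."""
    ranked = sorted(train, key=lambda item: dist(item[0], target))
    if len(ranked) > k:
        r = dist(ranked[k - 1][0], target)               # distance of the k-th neighbor
        ranked = [item for item in ranked if dist(item[0], target) <= r]
    votes = Counter(label for _, label in ranked)        # plurality vote over the closed neighborhood
    return votes.most_common(1)[0][0]
```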
Searching Nearest Neighbors
We begin searching by finding the exact matches. Let the target sample be T = (v_1, v_2, ..., v_n). The initial neighborhood is the point T itself. We then expand the neighborhood along each dimension: along dimension i, [v_i] is expanded to the interval [v_i − a_i, v_i + b_i], for some positive integers a_i and b_i. Expansion continues until there are at least k points in the neighborhood.
HOBbit Similarity Method for KNN
In this method we match bits of the target to the training data. First, find the pixels matching in all 8 bits of each band (the exact matches). Let b_{i,j} be the j-th bit of the i-th band of the target pixel.
Define the target-Ptree, Pt:
  Pt_{i,j} = P_{i,j}  if b_{i,j} = 1
  Pt_{i,j} = P′_{i,j} otherwise
and the precision-value-Ptree:
  Pv_{i,j} = Pt_{i,1} & Pt_{i,2} & Pt_{i,3} & … & Pt_{i,j}
A sketch of this construction follows.
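A minimal sketch of the construction above, with integer bit masks standing in for the basic P-trees P_{i,j} and Python's & / ~ standing in for P-tree AND and complement:

```python
N_PIXELS = 8
ALL_ONES = (1 << N_PIXELS) - 1

def target_ptree(basic_ptree, target_bit):
    """Pt_{i,j} = P_{i,j} if the target's bit is 1, else the complement P'_{i,j}."""
    return basic_ptree if target_bit == 1 else (~basic_ptree) & ALL_ONES

def precision_value_ptree(basic_ptrees, target_bits, j):
    """Pv_{i,j} = Pt_{i,1} & Pt_{i,2} & ... & Pt_{i,j}: pixels matching the target
    on the j high-order bits of band i."""
    pv = ALL_ONES
    for p, b in zip(basic_ptrees[:j], target_bits[:j]):
        pv &= target_ptree(p, b)
    return pv
```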
An Analysis of the HOBbit Method
Let the i-th band value of the target T be v_i = 105 = 01101001b.
- Exact match: [01101001] = [105, 105]
- 1st expansion: [0110100-] = [01101000, 01101001] = [104, 105]
- 2nd expansion: [011010--] = [01101000, 01101011] = [104, 107]
The neighborhood does not expand evenly on both sides: the target is 105, yet the center of, e.g., [104, 111] is (104 + 111) / 2 = 107.5. The interval also expands by powers of 2. The method is, however, computationally very cheap.
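A minimal sketch (ours, not from the notes) of this interval expansion: the j-th expansion frees the j low-order bits of the target value.

```python
def hobbit_interval(value, j):
    """Interval reached after freeing the j low-order bits of the target value."""
    lo = value & ~((1 << j) - 1)       # clear the j low-order bits
    hi = lo | ((1 << j) - 1)           # set the j low-order bits
    return lo, hi

for j in range(4):
    print(j, hobbit_interval(105, j))
# 0 (105, 105)   exact match
# 1 (104, 105)   1st expansion
# 2 (104, 107)   2nd expansion
# 3 (104, 111)   target 105 is no longer centered
```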
Perfect Centering Method
The max distance metric provides a better neighborhood by
- keeping the target in the center
- expanding by 1 on both sides
Initial neighborhood P-tree (exact matching): Pnn = P_1(v_1) & P_2(v_2) & P_3(v_3) & … & P_n(v_n)
If rc(Pnn) < k: Pnn = P_1([v_1 − 1, v_1 + 1]) & P_2([v_2 − 1, v_2 + 1]) & … & P_n([v_n − 1, v_n + 1])
If rc(Pnn) < k: Pnn = P_1([v_1 − 2, v_1 + 2]) & P_2([v_2 − 2, v_2 + 2]) & … & P_n([v_n − 2, v_n + 2])
and so on, until rc(Pnn) ≥ k.
This is computationally costlier than the HOBbit similarity method, but gives slightly better classification accuracy.
Let P_c(i) be the value P-tree for class i. Then the plurality class = argmax_i rc{ P_c(i) & Pnn }. (A sketch of the expansion loop follows.)
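A minimal sketch of the perfect-centering expansion loop; interval_ptree is a hypothetical helper standing in for the interval P-tree P_i([lo, hi]), returned here as a bit mask over pixels:

```python
def perfect_centering_neighborhood(interval_ptree, target, k, n_bands, n_pixels):
    """interval_ptree(i, lo, hi): bit mask of pixels whose band-i value lies in [lo, hi]."""
    all_ones = (1 << n_pixels) - 1
    e = 0                                   # current expansion radius
    while True:
        pnn = all_ones
        for i in range(n_bands):
            pnn &= interval_ptree(i, target[i] - e, target[i] + e)
        if bin(pnn).count("1") >= k:        # rc(Pnn) >= k: enough neighbors found
            return pnn
        e += 1                              # expand the interval by 1 on both sides
```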
Performance
Experiments were run on two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND.
The data contain 6 bands: red, green, and blue reflectance values, soil moisture, nitrate, and yield (the class label). Band values range from 0 to 255 (8 bits).
We consider 8 classes, or levels, of yield values: 0 to 7.
Performance – Accuracy
[Accuracy chart for the 1997 dataset.]
Performance - Accuracy (cont.)
[Accuracy chart for the 1998 dataset.]
Performance - Time
[Timing chart for the 1997 dataset; both axes on a logarithmic scale.]
Performance - Time (cont.)
[Timing chart for the 1998 dataset; both axes on a logarithmic scale.]