Fast Similarity Metric Based Data Mining Techniques Using P-trees: k-Nearest Neighbor Classification

Fast Similarity Metric Based Data Mining Techniques Using P-trees: k-Nearest Neighbor Classification
- Distance metric based computation using P-trees
- A new distance metric, called HOBbit distance
- Some useful properties of P-trees
- A new P-tree nearest-neighbor classification method, called Closed-KNN
These notes contain NDSU confidential & proprietary material. Patents pending on bSQ and P-tree technology.

Data Mining: extracting knowledge from a large amount of data. Functionalities: feature selection, association rule mining, classification & prediction, cluster analysis, outlier analysis. Information pyramid (figure): raw data at the base, useful information (sometimes a single bit: Y/N) at the top, with data mining moving up the pyramid; more data volume = less information.

Classification: predicting the class of a data object; also called supervised learning.
Training data (class labels are known and supervise the learning):
Feature1  Feature2  Feature3  Class
a1        b1        c1        A
a2        b2        c2        A
a3        b3        c3        B
A sample with unknown class, (a, b, c), is fed to the classifier, which outputs the predicted class of the sample.
Eager classifier: builds a classifier model in advance, e.g. decision tree induction, neural network.
Lazy classifier: uses the raw training data directly, e.g. k-nearest neighbor.

Clustering (unsupervised learning, chapter 8): the process of grouping objects into classes with the objective that data objects are similar to the objects in the same cluster and dissimilar to the objects in other clusters. (Figure: a two-dimensional space showing 3 clusters.) Clustering is often called unsupervised learning or unsupervised classification because the class labels of the data objects are unknown.

Distance Metric (used in both classification and clustering): measures the dissimilarity between two data points. A metric is a function d of two n-dimensional points X and Y such that:
- d(X, Y) is positive definite: if X ≠ Y, d(X, Y) > 0; if X = Y, d(X, Y) = 0
- d(X, Y) is symmetric: d(X, Y) = d(Y, X)
- d(X, Y) satisfies the triangle inequality: d(X, Y) + d(Y, Z) ≥ d(X, Z)
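
A minimal sketch (assumed, not from the slides) that spot-checks these three axioms on a few sample points, using the Euclidean distance as d:

import itertools
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def check_metric_axioms(d, points, tol=1e-9):
    for x, y, z in itertools.product(points, repeat=3):
        assert d(x, x) <= tol                       # identity
        assert x == y or d(x, y) > 0                # positive definiteness
        assert abs(d(x, y) - d(y, x)) <= tol        # symmetry
        assert d(x, y) + d(y, z) >= d(x, z) - tol   # triangle inequality

check_metric_axioms(euclidean, [(2, 1), (6, 4), (6, 1)])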

Various Distance Metrics. For any positive integer p, the Minkowski distance, or L_p distance, is d_p(X, Y) = (Σ_{i=1..n} |x_i − y_i|^p)^(1/p). Manhattan distance: p = 1. Euclidean distance: p = 2. Max distance: p = ∞, d_∞(X, Y) = max_{i=1..n} |x_i − y_i|.
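
A minimal sketch (assumed) of the L_p family; p = float("inf") gives the max (chessboard) distance:

def minkowski(x, y, p):
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)                              # max distance
    return sum(d ** p for d in diffs) ** (1.0 / p)

manhattan = lambda x, y: minkowski(x, y, 1)            # L1
euclidean = lambda x, y: minkowski(x, y, 2)            # L2
max_dist  = lambda x, y: minkowski(x, y, float("inf")) # L-infinity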

An Example. A two-dimensional space with X = (2, 1), Y = (6, 4) and corner point Z = (6, 1): Manhattan, d_1(X, Y) = XZ + ZY = 4 + 3 = 7; Euclidean, d_2(X, Y) = XY = 5; Max, d_∞(X, Y) = max(XZ, ZY) = XZ = 4. Note that d_1 ≥ d_2 ≥ d_∞.
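
A quick arithmetic check (assumed code, not from the slides) of the example's numbers:

x, y = (2, 1), (6, 4)
d1 = sum(abs(a - b) for a, b in zip(x, y))            # Manhattan: 4 + 3 = 7
d2 = sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5   # Euclidean: sqrt(16 + 9) = 5.0
dmax = max(abs(a - b) for a, b in zip(x, y))          # Max: max(4, 3) = 4
print(d1, d2, dmax)                                   # 7 5.0 4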

Some Other Distances. Canberra distance: d_c(X, Y) = Σ_{i=1..n} |x_i − y_i| / (x_i + y_i). Squared chord distance: d_sc(X, Y) = Σ_{i=1..n} (√x_i − √y_i)². Squared chi-squared distance: d_chi(X, Y) = Σ_{i=1..n} (x_i − y_i)² / (x_i + y_i).
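
A minimal sketch (assumed) of these three dissimilarities, for vectors with non-negative components (the denominators assume x_i + y_i > 0):

import math

def canberra(x, y):
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y))

def squared_chord(x, y):
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x, y))

def squared_chi_squared(x, y):
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y))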

HOBbit Similarity. Higher Order Bit (HOBbit) similarity: HOBbitS(A, B) = max{s : a_i = b_i for all 1 ≤ i ≤ s}, where A and B are two integer scalars, a_i and b_i are the i-th bits of A and B (left to right), and m is the number of bits. Example (bit patterns shown on the slide): HOBbitS(x_1, y_1) = 3, HOBbitS(x_2, y_2) = 4.
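
A minimal sketch (assumed) of HOBbit similarity for m-bit integers: the number of leading (high-order) bits on which A and B agree.

def hobbit_similarity(a, b, m=8):
    bits_a = format(a, "0{}b".format(m))
    bits_b = format(b, "0{}b".format(m))
    s = 0
    while s < m and bits_a[s] == bits_b[s]:
        s += 1
    return s

# e.g. hobbit_similarity(105, 121) == 3  (01101001 vs 01111001 agree on 3 leading bits)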

HOBbit Distance (related to Hamming distance). The HOBbit distance between two scalar values A and B: d_v(A, B) = m − HOBbitS(A, B). Continuing the previous example: d_v(x_1, y_1) = 8 − 3 = 5 and d_v(x_2, y_2) = 8 − 4 = 4. The HOBbit distance between two points X and Y: d_h(X, Y) = max_{i=1..n} d_v(x_i, y_i). In our example (considering 2-dimensional data): d_h(X, Y) = max(5, 4) = 5.
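
A minimal sketch (assumed) of the scalar and point HOBbit distances; hobbit_similarity is repeated so the snippet is self-contained:

def hobbit_similarity(a, b, m=8):
    s = 0
    while s < m and format(a, "0{}b".format(m))[s] == format(b, "0{}b".format(m))[s]:
        s += 1
    return s

def d_v(a, b, m=8):
    return m - hobbit_similarity(a, b, m)            # scalar HOBbit distance

def d_h(x, y, m=8):
    return max(d_v(a, b, m) for a, b in zip(x, y))   # point HOBbit distance: max over bands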

HOBbit Distance Is a Metric. HOBbit distance is positive definite: if X = Y, d_h(X, Y) = 0; if X ≠ Y, d_h(X, Y) > 0. HOBbit distance is symmetric, and it satisfies the triangle inequality.

Neighborhood of a Point. The neighborhood of a target point T is a set of points S such that X ∈ S if and only if d(T, X) ≤ r. If X is a point on the boundary, then d(T, X) = r. (Figure: the width-2r neighborhoods of T under the Manhattan, Euclidean, Max and HOBbit metrics.)

Decision Boundary. The decision boundary between points A and B is the locus of points X satisfying d(A, X) = d(B, X). (Figure: decision boundaries for the Manhattan, Euclidean and Max distances, drawn for the cases where the angle of AB is greater than and less than 45°.) The decision boundary for the HOBbit distance is perpendicular to the axis that gives the maximum distance.

Minkowski Metrics. The L_p metrics (aka Minkowski metrics): d_p(X, Y) = (Σ_{i=1..n} w_i·|x_i − y_i|^p)^(1/p), with the weights w_i assumed = 1. (Figure: unit disks and dividing lines for p = 1 (Manhattan), p = 2 (Euclidean), p = 3, 4, …, p = max (chessboard), and p = ½, ⅓, ¼, ….) Here d_max ≡ max_i |x_i − y_i| and d_∞ ≡ lim_{p→∞} d_p(X, Y). Proof (sketch): lim_{p→∞} {Σ_{i=1..n} a_i^p}^(1/p) = max_i(a_i) ≡ b. For p large enough, the other a_i^p become negligible next to b^p (since (a_i/b)^p → 0 whenever a_i < b), so Σ_{i=1..n} a_i^p ≈ k·b^p, where k is the multiplicity of b in the sum; hence {Σ_{i=1..n} a_i^p}^(1/p) ≈ k^(1/p)·b, and k^(1/p) → 1.
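
A quick numerical illustration (assumed, not from the slides) that d_p approaches the max metric as p grows:

def d_p(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (2, 1), (6, 4)
for p in (1, 2, 4, 8, 16, 64):
    print(p, d_p(x, y, p))
# 7.0 at p = 1, 5.0 at p = 2, then decreasing toward max(|6-2|, |4-1|) = 4 as p grows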

P>1 Minkowski Metrics. (Table: spreadsheet examples evaluating d_p for several exponents p > 1 and pairs of points, compared against the max metric.)

P<1 Minkowski Metrics. d_{1/p}(X, Y) = (Σ_{i=1..n} |x_i − y_i|^{1/p})^p. For p = 0 (the limit as p → 0) the metric does not exist (it does not converge). (Table: spreadsheet examples evaluating these formulas for several exponents below 1.)

Min dissimilarity function. The d_min function, d_min(X, Y) = min_{i=1..n} |x_i − y_i|, is strange: it is not a pseudo-metric. (Figure: its unit disk, and the neighborhood of the blue point relative to the red point, i.e. the dividing neighborhood of points closer to the blue point than to the red one; major bifurcations!)
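
A small check (assumed) of why d_min is not even a pseudo-metric: the triangle inequality fails.

def d_min(x, y):
    return min(abs(a - b) for a, b in zip(x, y))

X, Y, Z = (0, 0), (1, 1), (0, 1)
print(d_min(X, Z), d_min(Z, Y), d_min(X, Y))   # 0 0 1
# d_min(X, Z) + d_min(Z, Y) = 0 < 1 = d_min(X, Y), violating the triangle inequality.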

Other Interesting Metrics.
Canberra metric: d_c(X, Y) = Σ_{i=1..n} |x_i − y_i| / (x_i + y_i), a normalized Manhattan distance.
Squared chord metric: d_sc(X, Y) = Σ_{i=1..n} (√x_i − √y_i)², already discussed as L_p with p = ½.
Squared chi-squared metric: d_chi(X, Y) = Σ_{i=1..n} (x_i − y_i)² / (x_i + y_i).
HOBbit metric (Higher Order Bit): d_H(X, Y) = max_{i=1..n} {m − HOB(x_i, y_i)}, where, for m-bit integers A = a_1..a_m and B = b_1..b_m, HOB(A, B) = max{s : a_i = b_i for all 1 ≤ i ≤ s} (related to the Hamming distance in coding theory).
Scalar product metric: X · Y = Σ_{i=1..n} x_i·y_i.
Hyperbolic metrics (which map infinite space 1-1 onto a sphere).
Which of these are rotationally invariant? Translationally invariant? Other?

Notations:
P_1 & P_2 : P_1 AND P_2 (also written P_1 ^ P_2)
P_1 | P_2 : P_1 OR P_2
P′ : complement P-tree of P
P_{i,j} : basic P-tree for band i, bit j
P_i(v) : value P-tree for value v of band i
P_i([v_1, v_2]) : interval P-tree for the interval [v_1, v_2] of band i
P_0 : pure0-tree, a P-tree whose root node is pure 0
P_1 : pure1-tree, a P-tree whose root node is pure 1
rc(P) : root count of the P-tree P
N : number of pixels
n : number of bands
m : number of bits
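
A minimal stand-in sketch (assumed): real P-trees are compressed, quadrant-wise bit trees (the patented NDSU structure); here a flat integer bitmask plays the same role so the AND / OR / complement / root-count notation can be exercised.

class BitVec:
    def __init__(self, bits, n):           # bits: integer bitmask, n: number of pixels
        self.bits, self.n = bits, n
    def __and__(self, other):
        return BitVec(self.bits & other.bits, self.n)
    def __or__(self, other):
        return BitVec(self.bits | other.bits, self.n)
    def comp(self):                         # complement "P-tree"
        return BitVec(~self.bits & ((1 << self.n) - 1), self.n)
    def rc(self):                           # root count = number of 1-bits
        return bin(self.bits).count("1")

def basic_ptree(values, bit, m=8):          # "P_{i,j}" built from one band's pixel values
    mask = 0
    for k, v in enumerate(values):
        if (v >> (m - 1 - bit)) & 1:        # bit 0 = most significant bit
            mask |= 1 << k
    return BitVec(mask, len(values))

# usage: (basic_ptree([105, 121, 7], 0) & basic_ptree([105, 121, 7], 1)).rc()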

Properties of P-trees:
1.-3. (equation images not preserved in this transcript)
4. rc(P_1 | P_2) = 0 iff rc(P_1) = 0 and rc(P_2) = 0
5. v_1 ≠ v_2 ⟹ rc{P_i(v_1) & P_i(v_2)} = 0
6. rc(P_1 | P_2) = rc(P_1) + rc(P_2) − rc(P_1 & P_2)
7. rc{P_i(v_1) | P_i(v_2)} = rc{P_i(v_1)} + rc{P_i(v_2)}, where v_1 ≠ v_2
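
A quick self-contained check (assumed) of properties 4 and 6 on random bitmasks, with rc(P) modeled as the number of 1-bits; properties 5 and 7 follow because value P-trees for distinct values of the same band have disjoint 1-bits.

import random

def rc(mask):
    return bin(mask).count("1")

N = 16
for _ in range(1000):
    p1 = random.getrandbits(N)
    p2 = random.getrandbits(N)
    assert (rc(p1 | p2) == 0) == (rc(p1) == 0 and rc(p2) == 0)   # property 4
    assert rc(p1 | p2) == rc(p1) + rc(p2) - rc(p1 & p2)          # property 6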

k-Nearest Neighbor Classification and Closed-KNN:
1) Select a suitable value for k.
2) Determine a suitable distance metric.
3) Find the k nearest neighbors of the sample using the selected metric.
4) Find the plurality class of the nearest neighbors by voting on the class labels of the NNs.
5) Assign the plurality class to the sample to be classified.
T is the target pixel. With k = 3, to find the third nearest neighbor, KNN arbitrarily selects one point from the boundary line of the neighborhood; Closed-KNN instead includes all points on the boundary, as sketched below. Closed-KNN yields higher classification accuracy than traditional KNN.
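
A minimal sketch (assumed; not the paper's P-tree implementation) contrasting closed-KNN with plain KNN: every training point whose distance ties the k-th smallest is kept rather than breaking the tie arbitrarily.

from collections import Counter

def closed_knn_classify(train, target, k, dist):
    # train: list of (point, label) pairs; dist: a distance function
    radius = sorted(dist(p, target) for p, _ in train)[k - 1]   # distance to the k-th neighbor
    votes = Counter(label for p, label in train if dist(p, target) <= radius)
    return votes.most_common(1)[0][0]                           # plurality class

# Example usage with the max metric (hypothetical data):
d_max = lambda x, y: max(abs(a - b) for a, b in zip(x, y))
train = [((1, 1), "A"), ((2, 2), "A"), ((8, 8), "B"), ((9, 9), "B")]
print(closed_knn_classify(train, (3, 3), k=3, dist=d_max))      # 'A'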

Searching Nearest Neighbors. We begin the search by finding the exact matches. Let the target sample be T = (v_1, v_2, …, v_n). The initial neighborhood is the point T itself. We then expand the neighborhood along each dimension: along dim-i, [v_i] is expanded to the interval [v_i − a_i, v_i + b_i], for some positive integers a_i and b_i. Expansion continues until there are at least k points in the neighborhood.

HOBbit Similarity Method for KNN. In this method we match bits of the target to the training data. First, find the pixels matching in all 8 bits of each band (the exact matches). Let b_{i,j} = the j-th bit of the i-th band of the target pixel. Define the target P-tree Pt: Pt_{i,j} = P_{i,j} if b_{i,j} = 1, and Pt_{i,j} = P′_{i,j} otherwise. And define the precision-value P-tree: Pv_{i,1→j} = Pt_{i,1} & Pt_{i,2} & Pt_{i,3} & … & Pt_{i,j}.
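
A minimal sketch (assumed; flat bitmasks stand in for P-trees) of the HOBbit method: the neighborhood is widened by requiring fewer matching high-order bits per band, i.e. ANDing fewer target P-trees, until at least k training pixels remain.

def knn_hobbit(band_values, target, k, m=8):
    # band_values[i][p] = value of band i at pixel p; target[i] = v_i
    n_pixels = len(band_values[0])
    all_ones = (1 << n_pixels) - 1

    def pt(i, j):
        # target "P-tree" for band i, bit j (0 = MSB): P_{i,j} if the target bit is 1, its complement otherwise
        mask = 0
        for p, v in enumerate(band_values[i]):
            if ((v >> (m - 1 - j)) & 1) == ((target[i] >> (m - 1 - j)) & 1):
                mask |= 1 << p
        return mask

    for j in range(m, 0, -1):            # require the j high-order bits of every band to match
        pnn = all_ones
        for i in range(len(band_values)):
            for b in range(j):
                pnn &= pt(i, b)
        if bin(pnn).count("1") >= k:
            return [p for p in range(n_pixels) if (pnn >> p) & 1]
    return list(range(n_pixels))         # worst case: the whole training set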

An Analysis of the HOBbit Method. Let the i-th band value of the target T be v_i = 105 = 01101001b, giving the exact-match interval [01101001, 01101001] = [105, 105]. 1st expansion: [0110100-] = [01101000, 01101001] = [104, 105]. 2nd expansion: [011010--] = [01101000, 01101011] = [104, 107]. It does not expand evenly on both sides: the target is 105, while the center of [104, 111] (the next expansion) is (104 + 111) / 2 = 107.5. And it expands by powers of 2. But it is computationally very cheap.
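
A quick check (assumed) of the intervals around v = 105 as the low-order bits are freed, level by level:

v = 105
for e in range(0, 4):            # e = number of freed low-order bits
    lo = (v >> e) << e           # clear the low e bits
    hi = lo + (1 << e) - 1       # set the low e bits
    print(e, [lo, hi])
# 0 [105, 105]   1 [104, 105]   2 [104, 107]   3 [104, 111]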

Perfect Centering Method. The max distance metric provides a better neighborhood by keeping the target in the center and expanding by 1 on both sides. Initial neighborhood P-tree (exact matching): Pnn = P_1(v_1) & P_2(v_2) & P_3(v_3) & … & P_n(v_n). If rc(Pnn) < k: Pnn = P_1([v_1−1, v_1+1]) & P_2([v_2−1, v_2+1]) & … & P_n([v_n−1, v_n+1]). If rc(Pnn) is still < k: Pnn = P_1([v_1−2, v_1+2]) & P_2([v_2−2, v_2+2]) & … & P_n([v_n−2, v_n+2]), and so on. Computationally costlier than the HOBbit similarity method, but gives slightly better classification accuracy. Let P_c(i) be the value P-tree for class i; the plurality class is the class i maximizing rc(P_c(i) & Pnn).
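
A minimal sketch (assumed; bitmasks stand in for value / interval P-trees) of the perfect centering expansion: grow a radius-r interval around every target value until the neighborhood holds at least k training pixels.

def knn_perfect_centering(band_values, target, k, max_value=255):
    n_pixels = len(band_values[0])
    all_ones = (1 << n_pixels) - 1

    def interval_ptree(i, lo, hi):        # stands in for P_i([lo, hi])
        mask = 0
        for p, v in enumerate(band_values[i]):
            if lo <= v <= hi:
                mask |= 1 << p
        return mask

    for r in range(0, max_value + 1):     # r = 0 is the exact-match neighborhood
        pnn = all_ones
        for i, v in enumerate(target):
            pnn &= interval_ptree(i, v - r, v + r)
        if bin(pnn).count("1") >= k:
            return [p for p in range(n_pixels) if (pnn >> p) & 1]
    return list(range(n_pixels))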

Performance. Experiments were run on two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND. The data contains 6 bands: red, green and blue reflectance values, soil moisture, nitrate, and yield (the class label). Band values range from 0 to 255 (8 bits). We consider 8 classes, or levels, of yield values: 0 to 7.

Performance – Accuracy 1997 Dataset:

Performance - Accuracy (cont.) 1998 Dataset:

Performance - Time. 1997 dataset: both axes in logarithmic scale.

Performance - Time (cont.). 1998 dataset: both axes in logarithmic scale.