The Universality of Nearest Neighbor Sets in Classification and Prediction
Dr. William Perrizo, Dr. Gregory Wettstein, Dr. Amal Shehan Perera, and Tingda Lu
Computer Science, North Dakota State University, Fargo, ND, USA

Example: Medical Expert System ("Ask a Nurse")
Symptoms plus past diagnoses are collected into a table called CASES. For each undiagnosed new_symptoms, CASES is searched for matches:
SELECT DIAGNOSIS FROM CASES WHERE CASES.SYMPTOMS = new_symptoms
If there is a predominant DIAGNOSIS, then report it;
elseif there is no predominant DIAGNOSIS, then classify instead of query, i.e., find the fuzzy matches (near neighbors):
SELECT DIAGNOSIS FROM CASES WHERE CASES.SYMPTOMS ≅ new_symptoms
else call your doctor in the morning.

Almost universally, decision making is consulting a database of past expert decisions (your own or those of other experts). Past similar decisions are collected into a table called CASES (either explicitly or in one's head). For each new decision case, search for descriptive-feature matches (or near matches) and decide based on the predominant case. Sometimes this is called CASE-BASED REASONING.
SELECT CASE FROM CASES WHERE CASES.DESCRIPTIVE_FEATURES = [or ~=] new_descriptive_features
If there is a predominant CASE, then report it;
elseif there is no predominant CASE, then classify instead of query, i.e., find the fuzzy matches (near neighbors):
SELECT CASE FROM CASES WHERE CASES.DESCR_FEATURES = [or ~=] new_descr_features
else make the default decision.

Near Neighbor Classification
Given a (large) TRAINING SET T(A1,..., An, C) with CLASS attribute C and FEATURES A = (A1,..., An), C-classification of an unclassified sample (a1,..., an) is just:
SELECT Max(Count(T.C)) FROM T WHERE T.A1 = a1 AND T.A2 = a2 ... AND T.An = an GROUP BY T.C;
i.e., it is just a SELECTION, since C-classification is assigning to (a1,..., an) the most frequent C-value among the tuples of T with A = (a1,..., an). But if the EQUALITY SELECTION is empty, then we need a FUZZY QUERY to find NEAR NEIGHBORS instead of exact matches. That is Nearest Neighbor Classification (NNC). Based on the definition of "near", essentially all classification and prediction algorithms are nearest-neighbor-vote-based algorithms.
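
As an illustrative sketch only (not from the paper), this selection-then-fuzzy-fallback view of classification can be written over an in-memory training table in Python; the table layout, the Euclidean distance, and the eps radius are assumptions made here for the example:

from collections import Counter

def classify(T, a, eps=1.0):
    """Classify sample `a` against training tuples T = [(features, c), ...].
    Try the equality selection first; if it is empty, fall back to a fuzzy
    (near-neighbor) selection within radius `eps` and take the plurality class."""
    # SELECT Max(Count(T.C)) FROM T WHERE A1=a1 AND ... AND An=an GROUP BY T.C
    exact = Counter(c for feats, c in T if feats == a)
    if exact:
        return exact.most_common(1)[0][0]
    # Equality selection empty: fuzzy query for near neighbors instead
    def euclid(feats):
        return sum((x - y) ** 2 for x, y in zip(feats, a)) ** 0.5
    near = Counter(c for feats, c in T if euclid(feats) <= eps)
    return near.most_common(1)[0][0] if near else None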

Data Mining
Data mining has three general methodologies for extracting information and knowledge from data:
– Rule Mining: discovering strong antecedent-consequent relationships among subsets of the column attributes.
– Classification & Prediction: discovering signatures for the individual values of a specified column (the class attribute) from the values of the other attributes (the feature attributes).
– Clustering: using some notion of tuple similarity to group training-table rows so that within a group (a cluster) there is high similarity.

Prediction and Classification
The classification and prediction problem is a very interesting data mining problem: predict a class label based on past (assumed correct) prediction activities. Typically the training datasets of past predictions are extremely large, which is good for accuracy but bad for speed of prediction. Immediately one runs into two famous problems, the curse of cardinality and the curse of dimensionality.
The curse of cardinality: if the number of horizontal records in a training file is very large, standard vertical processing of horizontal record structures can take an unacceptably long time (e.g., if there are millions or billions of horizontal records to scan); and the gold-standard prediction/classification method, k-Nearest Neighbor, will not yield the most accurate result unless a second scan is made. Thus the curse of cardinality is both a time curse and an accuracy curse.

The Curse of Cardinality
The curse of cardinality is a very serious problem for Near Neighbor Classifiers (NNCs) of horizontal data, in both speed and accuracy. NNCs for horizontally structured data require a scan of the entire training set to determine the k nearest neighbors:
Take records 1, 2, ..., k as the initial k-Nearest-Neighbor set (kNN set).
Get the (k+1)st record; if it is closer than any one in the kNN set, replace.
Get the (k+2)nd; if it is closer than any one in the kNN set, replace.
Get the (k+3)rd, and so on.
With horizontally structured data and only one vertical scan, the best one can do is determine one of possibly many "k nearest neighbor sets" (i.e., there may be many training points that are just as near as the kth one selected). Our solution is to use a vertical data organization so that one horizontal scan yields the Closed kNN set (CkNN) immediately (all neighbors within a given distance). The cost depends on the number of attributes. This ameliorates the curse of cardinality considerably (PAKDD 2003 paper).
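
A minimal sketch (assuming in-memory tuples and a caller-supplied distance function; not the authors' implementation) contrasting the one-scan replacement procedure described above with the closed neighbor set that the vertical approach delivers directly:

import heapq

def knn_one_scan(T, a, k, dist):
    """One scan of horizontal records: keep the k closest seen so far.
    Ties at the k-th distance are broken arbitrarily, so this returns only
    one of possibly many 'k nearest neighbor sets'."""
    heap = []  # max-heap on distance via negation
    for feats, c in T:
        d = dist(feats, a)
        if len(heap) < k:
            heapq.heappush(heap, (-d, feats, c))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, feats, c))
    return [(feats, c) for _, feats, c in heap]

def closed_nn_set(T, a, radius, dist):
    """Closed neighbor set: ALL training points within `radius` of `a`
    (the 'all neighbors within a given distance' set mentioned above)."""
    return [(feats, c) for feats, c in T if dist(feats, a) <= radius]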

Curse of Cardinality (2)
With horizontally structured data, the only way to get a fair classification vote, in which all neighbors at a given distance get an equal vote, is to make a second vertical scan of the entire training set, which is expensive. For that reason, most kNNC implementations on horizontally structured data disregard the other neighbors at the same distance from the unclassified sample, not because the k voters suffice, but because it is too expensive to find the other neighbors at that same distance and enfranchise them as well. Of course, if the training set is such that any neighbor gives a representative vote for all neighbors at that same distance, then kNN is just as good as Closed kNN. But that would have to mean that all neighbors at the same distance have the same class, i.e., the classes form concentric rings around the unclassified sample; if that is known, no sophisticated analysis is required to classify. So we solve the curse of cardinality (speed and accuracy) by using vertical data. It is more common, however, to "solve" the curse of cardinality by resorting to model-based classification, in which a compact model is first built to represent the training-set information (the training phase) and then that closed-form, compact model is used over and over again to classify and predict (the prediction phase).

k-Nearest Neighbor Classification (kNNC) and closed-k-Nearest Neighbor Classification (ckNNC)
1) Select a suitable value for k.
2) Determine a suitable distance or similarity notion (a definition of "near").
3) Find the [closed] k-nearest-neighbor set of the unclassified sample.
4) Find the plurality class in that nearest-neighbor set.
5) Assign the plurality class as the predicted class of the sample T.
Illustration: let T be the unclassified point. Using Euclidean distance and k = 3, move out from T until at least 3 neighbors are found ("That's 1! ... That's 2! ..."). If several training points lie on the same boundary circle as the third neighbor, 3NN arbitrarily selects one of them as the 3rd nearest neighbor, whereas C3NN includes all points on that boundary ("That's 6, more than 3!").
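
To make step 3's [closed] qualifier concrete, here is a hedged Python sketch of ckNNC that extends the neighbor set past k whenever additional points are tied with the k-th distance (so all six boundary points in the 3NN illustration would vote); the distance function is supplied by the caller:

from collections import Counter

def cknn_classify(T, t, k, dist):
    """Closed-kNN: rank training tuples (features, class) by distance to t,
    then include EVERY tuple tied with the k-th nearest before voting."""
    ranked = sorted(T, key=lambda fc: dist(fc[0], t))
    if len(ranked) > k:
        kth_d = dist(ranked[k - 1][0], t)
        cut = k
        # keep extending while points sit on the same boundary as the k-th
        while cut < len(ranked) and dist(ranked[cut][0], t) == kth_d:
            cut += 1
        ranked = ranked[:cut]
    votes = Counter(c for _, c in ranked)      # step 4: plurality class
    return votes.most_common(1)[0][0]          # step 5: assign it to t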

Closed k-Nearest Neighbor Classification (ckNNC)
ckNNC often yields much higher classification accuracy than traditional kNNC. At what additional cost? Actually, the additional cost can be negative: ckNNC can be both faster AND more accurate when using a vertical data organization, because the closed nearest-neighbor set is produced with just one horizontal scan of the vertical data structures. The next slide describes our P-tree vertical data organization, which facilitates this faster and more accurate ckNNC.

Predicate-tree (P-tree) technology
Current practice: structure data into horizontal records and process them vertically (scans). P-tree technology instead vertically projects each attribute, then vertically projects each bit position of each attribute, and then compresses each bit slice into a basic P-tree. For a relation R(A1, A2, A3, A4), with each attribute shown in base 10 and base 2, the bit slices are R11, R12, R13, R21, ..., R43, and the corresponding basic P-trees are P11, P12, ..., P43; basic P-trees are then ANDed horizontally to answer queries.
Top-down construction of the 1-dimensional P-tree representation of R11, denoted P11, records the truth of the universal predicate pure1 in a tree recursively on halves (1/2^1 subsets) until purity is achieved: Is the whole pure1? false (0). Is the left half pure1? false. Is the right half pure1? false. Is the left half of the right half pure1? false, but it is pure (pure0), so this branch ends. Is the right half of the right half pure1? true (1).
(The slide's figure shows R, its bit slices, and the resulting basic P-trees step by step.)
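
The following is only a schematic Python sketch of the construction just described; the relation's actual values are not recoverable from the slide, so the sample column is hypothetical, and a real P-tree implementation stores the tree far more compactly:

def build_ptree(bits):
    """Basic P-tree over one bit slice: each node records the truth of the
    universal predicate pure1 on its half, recursing until purity."""
    if all(bits):
        return 1                      # pure1: this half is all 1s
    if not any(bits):
        return 0                      # pure0: branch ends
    mid = len(bits) // 2
    return ('mixed', build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def bit_slices(column, width):
    """Vertically project a base-10 column into its bit-position slices."""
    return [[(v >> (width - 1 - i)) & 1 for v in column] for i in range(width)]

A1 = [7, 7, 7, 7, 5, 2, 2, 2]             # hypothetical 3-bit attribute values
P11 = build_ptree(bit_slices(A1, 3)[0])   # P-tree of the high-order bit slice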

Curse of Dimensionality
As files get very wide (have many columns), intuition breaks down and the critical notion of "near" ceases to work in Near Neighbor Classification. Why? To what limit does the volume of the unit disk go as n → ∞?
For n = 1, the volume is 2 (the length of the line from -1 to +1).
For n = 2, the volume is πr², about 3.1416.
For n = 3, the volume is (4/3)πr³, about 4.1888, etc.
Intuition might tell us that the volume is heading toward ∞ as n goes to ∞, or that it will top out asymptotically at some finite number. In fact, the volume of the unit disk goes to 0 as n → ∞, reaching its maximum at dimension 5.
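
A quick numerical check of this claim (not part of the slides), using the standard closed form V_n = pi^(n/2) / Gamma(n/2 + 1) for the volume of the unit n-ball:

import math

def unit_ball_volume(n):
    """Volume of the n-dimensional unit ball: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

for n in (1, 2, 3, 5, 10, 20, 50):
    print(n, unit_ball_volume(n))
# n=1: 2.0   n=2: 3.1416   n=3: 4.1888   n=5: 5.2638 (the maximum)
# n=10: 2.5502   n=20: 0.0258   n=50: about 1.7e-13 -- heading to 0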

Therefore, there are almost no [Euclidean-distance] neighbors in high dimensions
The volume of the unit disk goes to 0 as n → ∞ (with its maximum at dimension 5), but the volume of the circumscribing cube [-1, 1]^n goes to ∞ as n → ∞: 2 × 2 × 2 = 2³, ..., 2ⁿ → ∞. This is not intuitive to most people, and it has consequences for NNC. It tells us that there is a lot of volume in a high-dimensional unit cube, but in its inscribed unit disk there is almost nothing: no "space", no volume, and therefore (relatively speaking) no points. E.g., there are lots of potential near neighbors in a high-dimensional unit cube; thus, when using, e.g., L1 or Hamming distance to define neighborhoods, there are plenty of neighbors, but there are almost no near neighbors in a high-dimensional unit disk when using Euclidean distance. Our solution: use cubes, not disks, as neighborhoods (this is typically the distance of choice in Genomics, Bioinformatics, and Biomedical Informatics). P-tree vertical technology facilitates the construction and analysis of cube neighborhoods.
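
An illustrative experiment, entirely an assumption of this write-up (uniform random points in [-1, 1]^20, radius 1), showing how full the cube neighborhood is and how empty the Euclidean one is:

import random

def l2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def linf(p, q):   # cube neighborhood: largest per-coordinate difference
    return max(abs(a - b) for a, b in zip(p, q))

random.seed(0)
dim, n = 20, 10000
points = [tuple(random.uniform(-1, 1) for _ in range(dim)) for _ in range(n)]
center = tuple(0.0 for _ in range(dim))

in_cube = sum(1 for p in points if linf(p, center) <= 1)   # all of them
in_disk = sum(1 for p in points if l2(p, center) <= 1)     # essentially none
print("cube neighbors:", in_cube, " disk neighbors:", in_disk)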

Is all classification Near Neighbor Classification?
The NNC approach takes into consideration typical training-data-set size characteristics, which cause the curse of cardinality (many rows) and the curse of dimensionality (many columns). Model-based classifiers are often less accurate in general, and are really Near Neighbor Classifiers using an alternate idea of "near". We conclude that ckNNC using the P-tree vertical approach is a good choice, since it retains (indeed improves) the speed and retains (indeed improves) the accuracy of the Near Neighbor approach.

Near Neighbor Classification
Given a (large) TRAINING SET T(A1,..., An, C) with CLASS attribute C and FEATURES A = (A1,..., An), C-classification (in pseudo-SQL) of an unclassified sample (a1,..., an) is just:
SELECT Max(Count(T.C)) FROM T WHERE T.A1 = a1 AND T.A2 = a2 ... AND T.An = an GROUP BY T.C;
i.e., it is just a SELECTION, since C-classification is assigning to (a1,..., an) the most frequent [neighboring] C-value among the tuples of T with A = (a1,..., an). If the EQUALITY SELECTION is empty, then we need a FUZZY QUERY to find NEAR NEIGHBORS instead of exact matches. That is Nearest Neighbor Classification (NNC).

Nearest Neighbor Classification (sample-based) and Eager Classification (model-based)
Given a TRAINING SET R(A1,..., An, C), with C = CLASSES and (A1,..., An) = FEATURES, Nearest Neighbor Classification (NNC) means selecting a set of R-tuples with similar features and letting the corresponding class values vote. Nearest Neighbor Classification won't work very well if the vote is inconclusive (close to a tie) or if "similar" (near) is not well defined. Then we build a MODEL of the TRAINING SET (at, possibly, great one-time [build-phase] expense). Once a MODEL is built, eager classification uses the model to assign the class. Model-less methods like Nearest Neighbor are called lazy or sample-based.

Eager Classification (model-based)
Examples of eager classifiers (models): decision trees, probabilistic models (Bayesian classifiers), neural networks, support vector machines, etc. How do you decide when an EAGER model is good enough to use? How do you decide if a Nearest Neighbor Classifier is working well enough? We have a TEST PHASE. Typically we set aside some training tuples as a Test Set (test tuples cannot be used in model building and cannot be used as nearest neighbors). If the classifier passes the test (a high enough percentage of test tuples are correctly classified by the classifier), it is accepted.
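
A minimal sketch of the test phase just described (simple holdout evaluation, an assumption of this write-up); classify(train, sample) is any classifier function, for instance the near-neighbor sketch given earlier:

import random

def holdout_accuracy(data, classify, test_fraction=0.2, seed=0):
    """Set aside test tuples (never used for model building or as neighbors),
    then report the fraction of them the classifier labels correctly."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    correct = sum(1 for feats, c in test if classify(train, feats) == c)
    return correct / len(test)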

EXAMPLE: Computer Ownership
The TRAINING SET for predicting who owns a computer is:
Customer: | Age | Salary | Job | Owns Computer |
| 24 | 55,000 | Programmer | yes |
| 58 | 94,000 | Doctor | no |
| 48 | 14,000 | Laborer | no |
| 58 | 19,000 | Domestic | no |
| 28 | 18,000 | Builder | no |
A Decision Tree (model) classifier is built from this TRAINING SET. Is this a Near Neighbor Classifier? Where are near neighborhoods involved? The training subset at the bottom of each decision path represents a near neighborhood of any unclassified sample that traverses the decision tree to that leaf. The concept of "near" or "highly correlated" here: the unclassified sample meets the same set of conditions or criteria as the near neighbors at the bottom of that path of condition criteria. We are using a different (accumulative) "correlation" definition along each branch of the decision tree, and the subsets at the leaf of each branch are true near-neighbor sets for the respective correlations or notions of nearness.
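
A small sketch (scikit-learn assumed; the Job column is dropped for brevity and the query customer is hypothetical) of the sense in which a decision-tree leaf acts as a near neighborhood: the training rows that land in the same leaf as the query are its "neighbors", and they cast the vote:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# The slide's training set, numeric features only (Age, Salary)
X = np.array([[24, 55000], [58, 94000], [48, 14000], [58, 19000], [28, 18000]])
y = np.array(["yes", "no", "no", "no", "no"])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

query = np.array([[30, 60000]])            # hypothetical unclassified customer
leaf = tree.apply(query)[0]                # leaf the query falls into
neighborhood = X[tree.apply(X) == leaf]    # training rows in that same leaf
print("leaf neighborhood:\n", neighborhood, "\npredicted:", tree.predict(query)[0])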

Neural Network Classifiers
Training a neural network classifier means adjusting the weights and biases, through back-propagation, until acceptable performance is reached. The matrices of weights and biases are the determiners of our near-neighbor sets. We continue to train by adjusting weights and biases until the near-neighbor sets (inputs producing the same class) are sufficiently "near" to each other to give us the required level of accuracy.

Support Vector Machine (SVM) Classifiers
The very first step in Support Vector Machine (SVM) classification is to isolate a neighborhood in which to examine the boundary, and the margins of the boundary, between classes (assuming a binary classification problem). Thus Support Vector Machines are Nearest Neighbor Classifiers as well.

CONCLUSIONS AND FUTURE WORK
We have made the case that classification and prediction algorithms are nearest-neighbor-vote classifications and predictions. The conclusion depends upon how one defines "near"; we have shown that there clearly are "nearness", "correlation", or "similarity" notions that provide these definitions. Broadly speaking, NNC is the way we always proceed in classification. Faced with a classification or prediction problem, the standard approach is to use a model-based classification method unless it just doesn't work well enough, and only then to use Nearest Neighbor Classification. We argue the reverse: "It is all Nearest Neighbor Classification", so standard NNC should be used UNLESS it takes too long, and only then should one consider giving up accuracy (of the near-neighbor set) for speed by using a model (decision tree or neural network). With a vertical data structure like P-trees, NNC can be applied efficiently in most cases.