The Universality of Nearest Neighbor Sets in Classification and Prediction Dr. William Perrizo, Dr. Gregory Wettstein, Dr. Amal Shehan Perera and Tingda.

The Universality of Nearest Neighbor Sets in Classification and Prediction Dr. William Perrizo, Dr. Gregory Wettstein, Dr. Amal Shehan Perera and Tingda Lu Computer Science North Dakota State University Fargo, ND 58108 USA

Prediction and Classification The classification and prediction problem is a very interesting data mining problem. The problem is to predict a class label based on past (assumed correct) prediction activities. Typically the training datasets of past predictions are extremely large (which is good for accuracy but bad for speed of prediction). One immediately runs into two famous problems: the curse of cardinality and the curse of dimensionality. The curse of cardinality refers to the fact that, if the number of records in a training file is very large, standard vertical processing (scanning) of horizontally structured records (get 1st record, process it, get next record, process it, get next, process it, ...) can take an unacceptably long time.

The curse of cardinality The curse of cardinality is a very serious problem for Near Neighbor Classifiers (NNCs) of horizontal data, since NNCs require a scan of the entire training set to determine the [k] nearest neighbors: take records 1, 2, ..., k as the initial k Nearest Neighbor set (kNN set). Get the (k+1)st record. If it is closer than any record in the kNN set, replace that record. Get the (k+2)nd. If it is closer than any record in the kNN set, replace. Get the (k+3)rd, and so on. With only one scan, the best one can do is determine one of many "k nearest neighbor sets" (i.e., there may be many training points that are just as near as the kth one selected). Our solution is to use a vertical data organization so that the cost is dependent on the number of attributes (which typically does not change over time) and is nearly independent of the number of records (which is extremely large and typically gets larger over time as the business grows). This ameliorates the curse of cardinality problem considerably.

Curse of Cardinality - 2 The only way to get a fair classification vote, in which all neighbors at a given distance get an equal vote, is to make a second scan of the entire training set, which is expensive. For that reason, most kNNC implementations disregard the other neighbors at the same distance from the unclassified sample, not because the k voters suffice, but because it is too expensive to find the other neighbors at that same distance and to enfranchise them for voting as well. Of course, if the training set is such that any neighbor gives a representative vote for all neighbors at that same distance, then kNN is just as good as Closed kNN. But that would have to mean that all neighbors at the same distance have the same class, which means the classes are concentric rings around the unclassified sample. If that is known, no sophisticated analysis is required to classify. So, we solve the curse of cardinality by using vertical data. However, it is more common to solve the curse of cardinality by resorting to model-based classification, in which a compact model is first built to represent the training set information (the training phase), and then that closed-form, compact model is used over and over again to classify and predict (the prediction phase).

k-Nearest Neighbor Classification (kNNC) and closed-k-Nearest Neighbor Classification (ckNNC)
1) Select a suitable value for k.
2) Determine a suitable distance or similarity notion (definition of near).
3) Find the [closed] k nearest neighbor set of the unclassified sample.
4) Find the plurality class in the nearest neighbor set.
5) Assign the plurality class as the predicted class of the sample.
Example: Let T be the unclassified point or sample. Using Euclidean distance and k = 3, find the 3 closest neighbors by moving out from T until ≥ 3 neighbors are found. In the figure, the expanding neighborhood first reaches 1 neighbor, then 2, and then a boundary line that brings the total to 6 (more than 3). 3NN arbitrarily selects one point from this boundary line as the 3rd nearest neighbor, whereas C3NN includes all points on this boundary line.
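The difference between the kNN set and the closed kNN set can be sketched in a few lines of Python. This is only an illustration on ordinary horizontal records, not the Ptree-based method of the presentation: it sorts the training set instead of scanning it, and the function names (`knn_sets`, `plurality`) and the seven training points are made up to mirror the 1, 2, 6 pattern of the example.

```python
import math
from collections import Counter

def knn_sets(train, sample, k):
    """Return (kNN set, closed kNN set) for `sample` under Euclidean distance.

    `train` is a list of (point, label) pairs. The kNN set keeps exactly k
    records; the closed kNN set also keeps every record tied with the k-th.
    """
    ranked = sorted(train, key=lambda rec: math.dist(rec[0], sample))
    knn = ranked[:k]
    radius = math.dist(knn[-1][0], sample)  # distance of the k-th neighbor
    closed = [rec for rec in ranked if math.dist(rec[0], sample) <= radius]
    return knn, closed

def plurality(neighbors):
    """Return the most frequent class label among the neighbors."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical data: one point at distance 1, one at distance 2,
# four points tied at distance 3, and one far away.
train = [
    ((1, 0), "a"), ((0, 2), "b"), ((3, 0), "a"),
    ((0, 3), "b"), ((-3, 0), "b"), ((0, -3), "b"), ((5, 5), "a"),
]
knn, closed = knn_sets(train, (0, 0), k=3)
print(len(knn), plurality(knn))        # 3 voters
print(len(closed), plurality(closed))  # 6 voters
```

With this data the arbitrary third voter tips 3NN toward class "a", while the closed set, which enfranchises all four points on the boundary, votes "b".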

Closed k Nearest Neighbor Classification (ckNNC) often yields much higher classification accuracy than traditional kNNC. At what additional cost? Actually, it can be at a negative additional cost, i.e., it can be faster AND more accurate using vertical data organization! The next slide describes our Ptree vertical data organization, which facilitates this faster and more accurate ckNNC.

Predicate tree (Ptree) technology
Current practice: structure data into horizontal records and process them vertically (by scans). Predicate tree technology: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic Ptree, and process by horizontally ANDing basic Ptrees.
The example relation R(A1, A2, A3, A4) has 3-bit attribute values, so it yields the twelve bit slices R11, R12, R13, R21, R22, R23, R31, R32, R33, R41, R42, R43, which are compressed into the basic Ptrees P11, P12, P13, P21, P22, P23, P31, P32, P33, P41, P42, P43.
The 1-dimensional Ptree representation of R11, denoted P11, is constructed top-down by recording the truth of the universal predicate pure1 in a tree, recursively on halves (1/2^1 subsets), until purity is achieved. E.g., compression of R11 into P11 goes as follows:
1. Whole is pure1? false, so record 0.
2. Left half pure1? false, so record 0. But it is pure (pure0), so this branch ends.
3. Right half pure1? false, so record 0.
4. Left half of right half pure1? false, so record 0.
5. Right half of right half pure1? true, so record 1.
[Figure: the relation R shown in base 10 and base 2 as horizontally structured records, its bit slices R11 ... R43, and the resulting basic Ptrees P11 ... P43.]
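The top-down pure1 construction can be sketched as a short recursive function. This is a minimal illustration, not the authors' implementation: the nested-list node layout, the names `build_ptree` and `count_ones`, and the 8-bit slice used below are all assumptions chosen so that the five questions listed for R11 come out false, false, false, false, true.

```python
def build_ptree(bits):
    """Build a pure1 tree for a non-empty bit slice by recursive halving.

    A leaf is 1 (the half is pure1) or 0 (the half is pure0, so the branch
    ends). A mixed half becomes [0, left_subtree, right_subtree], where the
    leading 0 records that the pure1 predicate is false there.
    """
    if all(b == 1 for b in bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return [0, build_ptree(bits[:mid]), build_ptree(bits[mid:])]

def count_ones(tree, size):
    """Count the 1-bits represented by `tree`, which covers `size` bits."""
    if tree == 1:
        return size
    if tree == 0:
        return 0
    _, left, right = tree
    half = size // 2
    return count_ones(left, half) + count_ones(right, size - half)

# Hypothetical bit slice (not taken from the slide's table).
bit_slice = [0, 0, 0, 0, 1, 0, 1, 1]
ptree = build_ptree(bit_slice)
print(ptree)                              # [0, 0, [0, [0, 1, 0], 1]]
print(count_ones(ptree, len(bit_slice)))  # 3
```

The pure0 left half collapses to a single leaf, which is where the compression comes from; counts such as the number of 1-bits can then be read off the tree without touching the individual records.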

Curse of Dimensionality The curse of dimensionality refers to the fact that, as files get very wide (have many columns), intuition breaks down and the critical notion of "near" ceases to work in Near Neighbor Classification. Why? To what limit does the volume of the unit disk go as the dimension, n, goes to infinity? Just to get some idea, let us consider a few initial volume numbers (with radius r = 1). For n=1, the volume is 2 (the length of the line from -1 to +1). For n=2, the volume is π r^2 ≈ 3.1416. For n=3, the volume is (4/3) π r^3 ≈ 4.1888, etc. Intuition might tell us that the volume is heading toward ∞ as n goes to ∞. Or it might suggest to us that it will top out asymptotically at some number > 4.1888. In fact, the volume of the unit disk goes to 0 as n → ∞, reaching its maximum at dimension = 5.
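The volume claim is easy to check numerically with the standard closed form for the volume of the n-dimensional unit ball, V_n = π^(n/2) / Γ(n/2 + 1). The formula is not stated on the slide; it is supplied here as a well-known fact, and the function name `unit_ball_volume` is just a label for this sketch.

```python
import math

def unit_ball_volume(n):
    """Volume of the n-dimensional unit ball: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

volumes = {n: unit_ball_volume(n) for n in range(1, 21)}
peak = max(volumes, key=volumes.get)  # dimension with the largest volume

for n in (1, 2, 3, 5, 10, 20):
    print(n, round(volumes[n], 4))
print("largest volume at dimension", peak)
```

The values rise from 2 at n=1 to a maximum at n=5 and then shrink steadily toward 0.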

Therefore, there are almost no [Euclidean distance] neighbors in high dimensions The volume of the unit disk goes to 0 as n → ∞, with its maximum at dimension = 5. But the volume of the unit cube (the cube from -1 to +1 in each dimension, in which the unit disk is inscribed) goes to ∞ as n → ∞: 2^1 = 2, 2^2 = 4, 2^3 = 8, then 16, 32, 64, ..., 2^n. This is not intuitive to most people, and it is more than a counterintuitive party conversation piece. It has consequences in NNC. It tells us that there is a lot of volume in a high dimensional unit cube but, in its inscribed unit disk, there is almost nothing (no "space", no volume, and therefore, relatively speaking, no points). E.g., there are lots of potential near neighbors in a high dimensional unit cube. Thus, when neighborhoods are cubes (e.g., when the max-coordinate L∞ distance or the Hamming distance is used to define neighborhoods), there are plenty of neighbors, but there are almost no near neighbors in a high dimensional unit disk when using Euclidean distance. Our solution: use cubes, not disks, as neighborhoods (this is typically the distance of choice in Genomics, Bioinformatics and BioMedical Informatics). Ptree vertical technology facilitates the construction and analysis of cube neighborhoods.
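To see how quickly the disk empties out relative to the cube, one can compute the fraction of the cube [-1, +1]^n (volume 2^n) that lies inside its inscribed unit ball. The sketch below again assumes the standard ball-volume formula π^(n/2) / Γ(n/2 + 1); the function name is hypothetical.

```python
import math

def ball_fraction_of_cube(n):
    """Fraction of the cube [-1, 1]^n occupied by its inscribed unit ball."""
    ball = math.pi ** (n / 2) / math.gamma(n / 2 + 1)
    cube = 2.0 ** n
    return ball / cube

fractions = [ball_fraction_of_cube(n) for n in range(1, 21)]
for n in (1, 2, 3, 5, 10, 20):
    print(n, f"{fractions[n - 1]:.2e}")
```

In 2 dimensions the disk covers about 79% of the square; by 10 dimensions it covers well under 1% of the cube, so uniformly scattered points almost never fall within Euclidean distance 1 of the center even though they lie inside the cube neighborhood.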

But all classification is Near Neighbor Classification! So far we have described our patented Ptree vertical data approach to prediction and classification. The approach takes into consideration the typical size characteristics of training data sets: cardinality (which causes the curse of cardinality) and dimension (which causes the curse of dimensionality). Next we show that model-based classifiers are really (less accurate) Near Neighbor Classifiers in which an alternate idea of "near" is employed. After having made that case, we conclude that closed NNC using the Ptree vertical approach is a good choice, since it retains (in fact, often improves) the speed of the so-called model-based methods and it retains the accuracy (and, with closed NNC, often improves it) of the Near Neighbor approach.

Near Neighbor Classification Given a (large) TRAINING SET T(A1,...,An, C) with CLASS, C, and FEATURES A=(A1,...,An), C-Classification of an unclassified sample, (a1,...,an), is just: SELECT Max (Count (T.Ci)) FROM T WHERE T.A1=a1 AND T.A2=a2 ... AND T.An=an GROUP BY T.C; i.e., it is just a SELECTION, since C-Classification is assigning to (a1,...,an) the most frequent C-value among the training tuples selected by A=(a1,...,an). But, if the EQUALITY SELECTION is empty, then we need a FUZZY QUERY to find NEAR NEIGHBORs (NNs) instead of exact matches. That's Nearest Neighbor Classification (NNC).
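The "selection first, fuzzy query second" idea can be sketched in Python. Everything concrete here is an assumption made for illustration: the function names, the tiny binary-feature training set, and the choice of "Hamming distance at most 1" as the notion of near.

```python
from collections import Counter

def hamming(u, v):
    """Number of positions in which two equal-length tuples differ."""
    return sum(a != b for a, b in zip(u, v))

def classify(train, sample, max_dist=1):
    """Vote among exact matches; if there are none, vote among near neighbors.

    `train` is a list of (features, class) pairs. Returns the plurality
    class, or None when neither selection finds any training tuple.
    """
    exact = [c for feats, c in train if feats == sample]  # equality selection
    if exact:
        return Counter(exact).most_common(1)[0][0]
    near = [c for feats, c in train if hamming(feats, sample) <= max_dist]
    if near:
        return Counter(near).most_common(1)[0][0]
    return None

# Hypothetical training set T(A1, A2, A3, C).
train = [((1, 0, 1), "x"), ((1, 0, 1), "x"), ((1, 0, 1), "y"), ((0, 1, 1), "y")]
print(classify(train, (1, 0, 1)))  # exact matches exist
print(classify(train, (0, 1, 0)))  # falls back to near neighbors
print(classify(train, (1, 1, 0)))  # nothing within distance 1
```

The first call is a pure selection with a vote; the second only succeeds because the equality test is relaxed to a neighborhood, which is exactly the step that turns a query into Nearest Neighbor Classification.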


Example: Medical Expert System (Ask a Nurse) Symptoms plus past diagnoses are collected into a table called CASES. For each undiagnosed new_symptoms, CASES is searched for matches: SELECT DIAGNOSIS FROM CASES WHERE CASES.SYMPTOMS = new_symptoms. If there is a predominant DIAGNOSIS, then report it. Else, if there is no predominant DIAGNOSIS, then classify instead of query, i.e., find the fuzzy matches (near neighbors): SELECT DIAGNOSIS FROM CASES WHERE CASES.SYMPTOMS ≅ new_symptoms. Else, call your doctor in the morning.

Nearest Neighbor Classification (sample-based) and Eager Classification (model-based) Given a TRAINING SET, R(A1,...,An, C), with C = CLASSES and (A1,...,An) = FEATURES, Nearest Neighbor Classification (NNC) means selecting a set of R-tuples with features similar to those of the unclassified sample and then letting the corresponding class values vote. Nearest Neighbor Classification won't work very well if the vote is inconclusive (close to a tie) or if similar (near) is not well defined. In that case we build a MODEL of the TRAINING SET (at, possibly, great one-time expense). When a MODEL is built first, the technique is called Eager classification, whereas model-less methods like Nearest Neighbor are called Lazy or Sample-based.

Eager Classification (model-based) Eager classifier models can be: decision trees, probabilistic models (e.g., Bayesian classifiers), Neural Networks, Support Vector Machines, etc. How do you decide when an EAGER model is good enough to use? How do you decide if a Nearest Neighbor Classifier is working well enough? We have a TEST PHASE. Typically, we set aside some training tuples as a Test Set (then, of course, those Test tuples cannot be used in model building and cannot be used as nearest neighbors). If the classifier passes the test (a high enough % of Test tuples are correctly classified by the classifier), it is accepted.

EXAMPLE: Computer Ownership The TRAINING SET for predicting who owns a computer is:
Customer ( Age | Salary | Job | Owns Computer )
| 24 | 55,000 | Programmer | yes |
| 58 | 94,000 | Doctor | no |
| 48 | 14,000 | Laborer | no |
| 58 | 19,000 | Domestic | no |
| 28 | 18,000 | Builder | no |
[Figure: a Decision Tree (model) classifier built from TRAINING.]
Is this a Near Neighbor Classifier? Where are Near Neighborhoods involved? In actuality, what we are doing is saying that the training subset at the bottom of each decision path represents a near neighborhood of any unclassified sample that traverses the decision tree to that leaf. The concept of "near" or "correlation" used is that the unclassified sample meets the same set of condition criteria as the near neighbors at the bottom of that path of condition criteria. Thus, in a real sense, we are using a different (accumulative) "correlation" definition along each branch of the decision tree, and the subsets at the leaf of each branch are true Near Neighbor sets for the respective correlations or notions of nearness.
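The claim that each leaf is a near neighbor set can be made concrete with the five training tuples above. The decision tree itself does not survive in this transcript, so the two splits used below (Salary below 50,000, then Age below 40) are purely hypothetical, as are the function names; only the training rows come from the slide.

```python
from collections import Counter

# Training tuples from the slide: (Age, Salary, Job, Owns Computer).
TRAINING = [
    (24, 55000, "Programmer", "yes"),
    (58, 94000, "Doctor", "no"),
    (48, 14000, "Laborer", "no"),
    (58, 19000, "Domestic", "no"),
    (28, 18000, "Builder", "no"),
]

def leaf(age, salary):
    """Follow a hypothetical decision path and return the leaf reached."""
    if salary < 50000:
        return "salary < 50000"
    if age < 40:
        return "salary >= 50000, age < 40"
    return "salary >= 50000, age >= 40"

def classify(age, salary):
    """Treat the training tuples in the sample's leaf as its near neighbors."""
    target = leaf(age, salary)
    neighbors = [row for row in TRAINING if leaf(row[0], row[1]) == target]
    votes = Counter(row[3] for row in neighbors)
    return votes.most_common(1)[0][0], neighbors

print(classify(30, 60000))  # shares a leaf with the Programmer tuple
print(classify(50, 16000))  # shares a leaf with three "no" tuples
```

Here "near" means "satisfies the same sequence of split conditions", so the tree's prediction is literally a plurality vote over the neighbor set stored at the leaf.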

Neural Network classifiers For any Neural Network classifier we train the network, e.g., by adjusting the weights and biases through back-propagation until we reach an acceptable level of performance. In so doing we are using the matrix of weights and biases as the determiners of our near neighbor sets. We don't stop training until those near neighbor sets (the sets of inputs that produce the same class output) are sufficiently "near" to each other to give us a level of accuracy that is sufficient for our needs.

Support Vector Machine (SVM) classifiers The very first step in Support Vector Machines (SVM) classification is often to isolate a neighborhood in which to examine the boundary and the margins of the boundary between classes (assuming a binary classification problem). Thus, Support Vector Machines are Nearest Neighbor Classifiers also.

CONCLUSIONS AND FUTURE WORK We have made the case that classification and prediction algorithms are nearest neighbor vote classifications and predictions. The conclusion depends upon how one defines "near". We have shown that there are clearly "nearness" or "correlation" or "similarity" notions that provide these definitions. Broadly speaking, this (NNC) is the way we always proceed in Classification. This is important because the first decision that needs to be made when faced with a classification or prediction problem is which classification or prediction algorithm to employ. What good does this understanding do for someone faced with a classification or prediction problem? In a real sense, the point of this paper is to head off the standard way of approaching Classification, which seems to be that of using a model-based classification method unless it just doesn't work well enough and only then using Nearest Neighbor Classification. Our point is that "It is all Nearest Neighbor Classification" essentially, and that standard NNC should be used UNLESS it takes too long. Only then should one consider giving up accuracy (of the near neighbor set) for speed by using a model (Decision Tree or Neural Network).
