pTrees predicate Tree technologies

Presentation transcript:

pTrees (predicate Tree technologies) provide fast, accurate, horizontal processing of compressed, data-mining-ready, vertical data structures. Applications:
PINE (Podium Incremental Neighborhood Evaluator) uses pTrees for Closed k Nearest Neighbor Classification.
FAUST (Fast Accurate Unsupervised, Supervised Treemining) uses pTrees for classification and clustering of spatial data.
MYRRH (ManY-Relationship-Rule Harvester) uses pTrees for association rule mining over multiple relationships (the slide diagrams document, course and person entities linked by Text, Enroll and Buy relationships).
PGP-D (Pretty Good Protection of Data) protects vertical pTree data, e.g. key = array(offset, pad): 5,54 | 7,539 | 87,3 | 209,126 | 25,896 | 888,23 | ...
ConCur (Concurrency Control) uses pTrees for ROCC and ROLL concurrency control.
DOVE (DOmain VEctors) uses pTrees for database query processing.
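All of these applications rest on the same storage idea: each attribute (or each bit position of each attribute) of the horizontal table is stored as its own vertical bit column, and queries become bitwise AND/OR/COUNT over those columns. The following is a minimal Python sketch of that idea, with uncompressed Python-integer bit vectors standing in for real (compressed) pTrees; the names vertical_slices and root_count are illustrative, not taken from any pTree codebase.

# A minimal sketch of vertical (bit-sliced) storage, assuming binary attributes.
# Real pTrees compress each column into a predicate tree; here a plain Python
# integer serves as an uncompressed bit vector, one bit per training row.

def vertical_slices(rows, num_attrs):
    """rows: list of 0/1 tuples (horizontal records).
    Returns one int per attribute; bit r is set iff rows[r][a] == 1."""
    slices = [0] * num_attrs
    for r, row in enumerate(rows):
        for a, bit in enumerate(row):
            if bit:
                slices[a] |= 1 << r
    return slices

def root_count(bitvec):
    """Root count of an (uncompressed) pTree: the number of 1-bits."""
    return bin(bitvec).count("1")

A selection such as "rows with one attribute = 1 and another = 0" is then a single AND of one slice with the complement of another, followed by root_count, instead of a scan of the horizontal file.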

PINE (Podium Incremental Neighborhood Evaluator) uses pTrees for Closed k Nearest Neighbor Classification (CkNNC).

First, plain 3NN using horizontal data to classify an unclassified sample a = (0 0 0 0 0 0) over attributes a5, a6, a11, a12, a13, a14 (a10 = C is the class label).

Training table:
Key  a1 a2 a3 a4 a5 a6 a7 a8 a9 a10=C a11 a12 a13 a14 a15 a16 a17 a18 a19 a20
t12   1  0  1  0  0  0  1  1  0    1    0   1   1   0   1   1   0   0   0   1
t13   1  0  1  0  0  0  1  1  0    1    0   1   0   0   1   0   0   0   1   1
t15   1  0  1  0  0  0  1  1  0    1    0   1   0   1   0   0   1   1   0   0
t16   1  0  1  0  0  0  1  1  0    1    1   0   1   0   1   0   0   0   1   0
t21   0  1  1  0  1  1  0  0  0    1    1   0   1   0   0   0   1   1   0   1
t27   0  1  1  0  1  1  0  0  0    1    0   0   1   1   0   0   1   1   0   0
t31   0  1  0  0  1  0  0  0  1    1    1   0   1   0   0   0   1   1   0   1
t32   0  1  0  0  1  0  0  0  1    1    0   1   1   0   1   1   0   0   0   1
t33   0  1  0  0  1  0  0  0  1    1    0   1   0   0   1   0   0   0   1   1
t35   0  1  0  0  1  0  0  0  1    1    0   1   0   1   0   0   1   1   0   0
t51   0  1  0  1  0  0  1  1  0    0    1   0   1   0   0   0   1   1   0   1
t53   0  1  0  1  0  0  1  1  0    0    0   1   0   0   1   0   0   0   1   1
t55   0  1  0  1  0  0  1  1  0    0    0   1   0   1   0   0   1   1   0   0
t57   0  1  0  1  0  0  1  1  0    0    0   0   1   1   0   0   1   1   0   0
t61   1  0  1  0  1  0  0  0  1    0    1   0   1   0   0   0   1   1   0   1
t72   0  0  1  1  0  0  1  1  0    0    0   1   1   0   1   1   0   0   0   1
t75   0  0  1  1  0  0  1  1  0    0    0   1   0   1   0   0   1   1   0   0

Scan the table, keeping the three closest rows seen so far: start with {t12 (d=2), t13 (d=1), t15 (d=2)}; then t16 d=2, don't replace; t21 d=4, don't replace; t27 d=4, don't replace; t31 d=3, don't replace; t32 d=3, don't replace; t33 d=2, don't replace; t35 d=3, don't replace; t51 d=2, don't replace; t53 d=1, replace (t15 drops out); t55 d=2, don't replace; t57 d=2, don't replace; t61 d=3, don't replace; t72 d=2, don't replace; t75 d=2, don't replace.

3NN set after the scan (a5 a6 a10=C a11 a12 a13 a14 | distance from a = 000000):
t12   0  0  1  0  1  1  0 | 2
t13   0  0  1  0  1  0  0 | 1
t53   0  0  0  0  1  0  0 | 1
C=1 wins (2 votes to 1)!
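For contrast with the pTree version on the later slides, here is a sketch of the horizontal first scan just described: Hamming distance over the six relevant attributes, keeping the three closest rows seen so far and replacing a current candidate only when a strictly smaller distance appears. The attribute numbers and the sample come from the slide; the helper names are illustrative.

# Horizontal first scan of 3NN (Hamming distance on a5, a6, a11, a12, a13, a14).
ATTRS = (5, 6, 11, 12, 13, 14)   # 1-based attribute numbers from the slide
CLASS_ATTR = 10                  # a10 = C

def hamming(row, sample, attrs=ATTRS):
    """row, sample: dicts mapping attribute number -> bit."""
    return sum(row[a] != sample[a] for a in attrs)

def first_scan_3nn(table, sample, k=3):
    """table: dict key -> row. Returns a list of (key, distance) candidates."""
    candidates = []
    for key, row in table.items():
        d = hamming(row, sample)
        if len(candidates) < k:
            candidates.append((key, d))
            continue
        worst = max(range(k), key=lambda i: candidates[i][1])
        if d < candidates[worst][1]:          # strictly closer: replace
            candidates[worst] = (key, d)      # e.g. t53 (d=1) replaces t15 (d=2)
    return candidates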

Next, Closed 3NN using horizontal data: a second pass is necessary to find all other voters that are at distance 2 from a (the distance of the farthest current neighbor).

Vote after the 1st scan, for the unclassified sample a = (0 0 0 0 0 0) and the training table above (a5 a6 a10=C a11 a12 a13 a14 | distance):
t12   0  0  1  0  1  1  0 | 2
t13   0  0  1  0  1  0  0 | 1
t53   0  0  0  0  1  0  0 | 1

Second scan: t12 d=2, already voted; t13 d=1, already voted; t15 d=2, include it also; t16 d=2, include it also; t21 d=4, don't include; t27 d=4, don't include; t31 d=3, don't include; t32 d=3, don't include; t33 d=2, include it also; t35 d=3, don't include; t51 d=2, include it also; t53 d=1, already voted; t55 d=2, include it also; t57 d=2, include it also; t61 d=3, don't include; t72 d=2, include it also; t75 d=2, include it also.

With all ties at distance 2 included, the vote is 6 for C=0 against 5 for C=1: C=0 wins now!
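The "closed" second pass simply admits every training point whose distance does not exceed the distance of the current farthest candidate, so ties at the boundary are not dropped arbitrarily. A sketch continuing the one above (reusing hamming and CLASS_ATTR):

def closed_knn_vote(table, sample, candidates):
    """candidates: output of first_scan_3nn. Every row within the radius of the
    farthest candidate (here 2) votes; majority of a10 = C decides the class."""
    radius = max(d for _, d in candidates)
    voters = [k for k, row in table.items() if hamming(row, sample) <= radius]
    ones = sum(table[k][CLASS_ATTR] for k in voters)
    zeros = len(voters) - ones
    return 1 if ones > zeros else 0   # on the slide data: 5 ones vs 6 zeros, so C=0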

PINE: a Closed 3NN method using pTrees (vertical data structures).

pTree-based C3NN goes as follows: first let all training points at distance=0 vote, then distance=1, then distance=2, ... until at least 3 votes have been cast.

For distance=0 (exact matches), construct the P-tree Ps, then AND it with PC and with PC' to compute the vote for each class. On this data there are no neighbors at distance=0 (Ps is all zeros).

(The slide shows the vertical pTrees for a1..a20 and for C and C' over the training keys t12..t75, together with Ps.)
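With vertical data, the distance=0 question ("which training rows match the sample exactly on a5, a6, a11, a12, a13, a14?") becomes a single chain of ANDs over bit vectors, using a complemented slice wherever the sample bit is 0. A sketch under the same uncompressed bit-vector assumption as earlier (P maps attribute number to bit vector, Pc is the class mask; all names are illustrative):

ATTRS = (5, 6, 11, 12, 13, 14)
N_ROWS = 17                        # t12 .. t75
ALL_ROWS = (1 << N_ROWS) - 1       # mask covering all training rows

def exact_match_ptree(P, sample, attrs=ATTRS, all_rows=ALL_ROWS):
    """Ps: bit set for every training row equal to the sample on attrs."""
    Ps = all_rows
    for a in attrs:
        slice_a = P[a] if sample[a] == 1 else (~P[a] & all_rows)
        Ps &= slice_a
    return Ps

# Votes at distance 0:
#   for C=1:  root_count(Ps & Pc)
#   for C=0:  root_count(Ps & ~Pc & ALL_ROWS)
# On the slide's data Ps == 0, i.e. no neighbors at distance 0.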

pTree-based C3NN, distance=1 neighbors: construct the P-tree

  PD(s,1) = OR over i in {5,6,11,12,13,14} of Pi,
  where Pi = P|si-ti|=1 AND ( AND over j in {5,6,11,12,13,14}-{i} of P|sj-tj|=0 ),

i.e. OR together the six single-dimension pTrees P5, P6, P11, P12, P13, P14, each of which selects the training points that differ from a in exactly that one attribute and agree in the other five. As at distance 0, ANDing PD(s,1) with PC and PC' gives the per-class votes at distance 1.

(The slide shows the six component pTrees P5..P14, their OR PD(s,1), the class pTrees C and C' (a10=C), and the full vertical training pTrees for a1..a20.)
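The same pattern generalizes to any exact distance d: OR together one AND-term per choice of d mismatched attributes. The sketch below (ring_ptree is an illustrative name, reusing ATTRS and ALL_ROWS from the previous sketch) computes PD(s,1) for d=1 and PD(s,2) for d=2 exactly as the formulas on these slides describe, again over uncompressed bit vectors.

from itertools import combinations

def ring_ptree(P, sample, d, attrs=ATTRS, all_rows=ALL_ROWS):
    """PD(s,d): bit set for training rows at Hamming distance exactly d from sample."""
    result = 0
    for mismatched in combinations(attrs, d):
        term = all_rows
        for a in attrs:
            target = sample[a] ^ (1 if a in mismatched else 0)   # required bit at a
            term &= P[a] if target == 1 else (~P[a] & all_rows)
        result |= term                 # for d=1 this ORs P5, P6, ..., P14
    return result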

pTree-based C3NN, distance=2 neighbors: OR together all double-dimension interval pTrees:

  PD(s,2) = OR over pairs i,j in {5,6,11,12,13,14} of Pi,j,
  where Pi,j = P|si-ti|=1 AND P|sj-tj|=1 AND ( AND over k in {5,6,11,12,13,14}-{i,j} of P|sk-tk|=0 ),

giving the 15 pairwise pTrees P5,6 P5,11 P5,12 P5,13 P5,14 P6,11 P6,12 P6,13 P6,14 P11,12 P11,13 P11,14 P12,13 P12,14 P13,14.

We now have 3 nearest neighbors; we could quit and declare C=1 the winner? But with the closed C3NN set (all ties at distance 2 included) we can declare C=0 the winner!

PINE generalizes CkNN further: all training samples vote, weighted by their nearness to a (~ Olympic podiums).

(The slide shows the 15 pairwise pTrees and the training pTrees for a5, a6, a10=C, a11, a12, a13, a14.)
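Putting the pieces together, pTree-based closed kNN expands the distance d = 0, 1, 2, ... and lets every row in each ring vote until at least k votes are in; because whole rings vote, all ties at the final distance are automatically included. The sketch below reuses root_count and ring_ptree from above; the nearness weight 1/(d+1) is only an assumed stand-in for PINE's podium weighting, which these slides do not spell out.

def ptree_closed_knn(P, Pc, sample, k=3, attrs=ATTRS, all_rows=ALL_ROWS, weighted=False):
    """Returns the winning class (1 or 0). With weighted=True, votes are scaled by
    an assumed nearness weight 1/(d+1) in the spirit of PINE's podium idea."""
    votes = {0: 0.0, 1: 0.0}
    counted = 0
    for d in range(len(attrs) + 1):
        ring = ring_ptree(P, sample, d, attrs, all_rows)
        w = 1.0 / (d + 1) if weighted else 1.0
        ones = root_count(ring & Pc)
        zeros = root_count(ring & ~Pc & all_rows)
        votes[1] += w * ones
        votes[0] += w * zeros
        counted += ones + zeros
        if counted >= k:               # closed: the whole final ring has voted
            break
    return 1 if votes[1] > votes[0] else 0

On the slide's data the unweighted run stops after the d=2 ring with 11 voters, 5 for C=1 and 6 for C=0, so C=0 is returned, matching the horizontal closed-3NN result above.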