
Fast Attribute-based Unsupervised and Supervised Table Clustering using P-Trees
Dr. William Perrizo, North Dakota State University

Contents
Introduction
Predicate-Trees
FAUST_P Algorithm
Performance
Conclusion and Future Work

The contents of this presentation include: an introduction, a brief overview of P-Trees, the novel FAUST_P algorithm illustrated on the Iris dataset, a performance measure, and conclusions and future work.

Introduction
Exponential growth in image data: e.g., NASA has been capturing Earth images at resolutions down to 15 m since the 1970s
Data is archived long before it can be properly analyzed
Existing clustering algorithms are too slow

Since the advent of digital image technology and remote sensing imagery (RSI), massive amounts of image data have been collected worldwide. For example, since 1972, NASA and the U.S. Geological Survey, through the Landsat Data Continuity Mission, have been capturing images of Earth at resolutions down to 15 meters. Consider also the current use of Unmanned Air Vehicles (UAVs) for security purposes, where data is collected on a massive scale and classifying objects of interest, such as tanks and enemy hideouts, is of utmost importance. Because existing clustering algorithms are slow, much of this data is archived without proper analysis. There is thus a pressing demand for a fast clustering algorithm that can keep pace with massive image data collection.

P-Trees
Predicate-Trees are data-mining-ready, lossless, compressed data structures
An effective tool for horizontal processing of vertical data
Each tree is obtained by recursively partitioning one vertical bit strip of the data
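To make the vertical layout concrete, here is a minimal Python sketch (illustrative names, not the NDSU implementation) of the starting point of P-tree construction: slicing an integer attribute column into one bit vector per bit position. A real P-tree would then compress each slice by recursive partitioning into pure-1, pure-0, and mixed segments; this sketch stops at the uncompressed level.

    import numpy as np

    def bit_slices(column, nbits=8):
        """Split an integer attribute column into nbits vertical bit vectors.

        Returns an (nbits, n) boolean array; row j holds bit j of every value,
        most significant bit first. These rows are the raw material that a
        P-tree would compress by recursive partitioning.
        """
        col = np.asarray(column, dtype=np.uint64)
        shifts = np.arange(nbits - 1, -1, -1).astype(np.uint64)  # MSB .. LSB
        return ((col[None, :] >> shifts[:, None]) & np.uint64(1)).astype(bool)

    # Example: the ten setosa Pedal Width values from the next slide, 5 bits each
    pwd_se = [2, 2, 2, 2, 4, 3, 2, 2, 1, 2]
    slices = bit_slices(pwd_se, nbits=5)
    print(slices.astype(int))   # the five vertical bit strips
    print(slices.sum(axis=1))   # 1-count per strip (the root counts of a P-tree)

All predicate evaluation later in the talk (PA>c, complement, AND) then reduces to bitwise operations on these strips instead of scans of horizontal records.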

FAUST_P
Initially PREM = pure1 (every tuple is still unclassified).

1. For each attribute and class, calculate the class means and the gaps between adjacent class means. Build the table MT(attr, class, mean, gapL, gapH, gapREL), sorted descending on gapREL = (gapL + gapH) / (2 * mean), where gapL is the gap below the class mean and gapH the gap above it.
2. Take the MT record with the maximum gapREL. Set cL = mean - gapL/2 and cH = mean + gapH/2. Then PCLASS = PA>cL & P'A>cH & PREM, and PREM = PREM & P'CLASS.
3. Repeat step 2 until every class has a pTree.
4. Repeat steps 1 through 3 until convergence.

[The slide works the algorithm on a 30-row sample of the Iris dataset: ten rows each of setosa (se), versicolor (ve), and virginica (vi), with the four attributes Sepal Length (sSL), Sepal Width (sWD), Pedal Length (pSL), and Pedal Width (pWD) shown both as integers and as their vertical bit slices.]

Step 1 (attr, class, calculate means and mean-gaps) yields, per attribute, the sorted class means and the gaps between adjacent means:

sSL: se 51 (gap 12) vi 63 (gap 7) ve 70
sWD: ve 32 (gap 1) vi 33 (gap 2) se 35
pSL: se 14 (gap 33) ve 47 (gap 13) vi 60
pWD: se 2 (gap 12) ve 14 (gap 11) vi 25

and the MT table below (not yet sorted on gR; at the two extreme classes of each attribute the missing gap is filled in by mirroring the single existing gap):

cl at  mn  gL  gH  gR
se sSL 51  12  12  .26   = (12+12)/(2*51)
se sWD 35   2   2  .06   = ( 2+ 2)/(2*35)
se pSL 14  33  33  2.36  = (33+33)/(2*14)
se pWD  2  12  12  6     = (12+12)/(2* 2)
vi sSL 63   8   7  .12   = ( 8+ 7)/(2*63)
vi sWD 33   1   2  .05   = ( 1+ 2)/(2*33)
vi pSL 60  13  13  2.2   = (13+13)/(2*60)
vi pWD 25  11  11  .44   = (11+11)/(2*25)
ve sSL 70   7   7  .1    = ( 7+ 7)/(2*70)
ve sWD 32   1   1  .03   = ( 1+ 1)/(2*32)
ve pSL 47  33  13  .94   = (33+13)/(2*47)
ve pWD 14  12  11  .82   = (12+11)/(2*14)

We are now separating out the setosa class in step 2.
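Step 1 can be sketched directly from the tables above. This is an illustrative reconstruction in Python, not the authors' code; in particular, the name build_mt is invented, and mirroring the single gap at the two extreme classes is an assumption read off the slide's filled-in values.

    import numpy as np

    def build_mt(X, y):
        """X: (n, k) attribute matrix; y: length-n array of class labels.
        Returns rows (gapREL, cls, attr, mean, gapL, gapH), sorted descending
        on gapREL = (gapL + gapH) / (2 * mean)."""
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        mt = []
        for a in range(X.shape[1]):
            # class means on attribute a, sorted so that neighbors define gaps
            means = sorted((X[y == c, a].mean(), c) for c in np.unique(y))
            for i, (m, c) in enumerate(means):
                lo = m - means[i - 1][0] if i > 0 else None
                hi = means[i + 1][0] - m if i + 1 < len(means) else None
                gapL = lo if lo is not None else hi  # extreme class: mirror the
                gapH = hi if hi is not None else lo  # single existing gap
                mt.append(((gapL + gapH) / (2 * m), c, a, m, gapL, gapH))
        return sorted(mt, reverse=True)

On the 30-row sample this puts (se, pWD) on top with gapREL = 6, the record that step 2 selects next.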
2. The MT record with the maximum gapREL is (se, pWD, mn 2, gL 12, gH 12, gR 6).

cL = mean - gapL/2 = 2 - 12/2 = -4, so PA>cL = Ppure1 (every value exceeds -4).
cH = mean + gapH/2 = 2 + 12/2 = 8 = 01000 in binary, so PA>cH = P4,4 | (P4,3 & (P4,2 | P4,1 | P4,0)).

MT sorted descending on gR:

cl at  mn gL gH  gR
se pWD  2 12 12  6
se pSL 14 33 33  2.36
vi pSL 60 13 13  2.2
ve pSL 47 33 13  .94
ve pWD 14 12 11  .82
vi pWD 25 11 11  .44
se sSL 51 12 12  .26
vi sSL 63  8  7  .12
ve sSL 70  7  7  .1
se sWD 35  2  2  .06
vi sWD 33  1  2  .05
ve sWD 32  1  1  .03

Psetosa = PA>cL & P'A>cH & PREM = Ppure1 & P'A>cH & Ppure1 = P'A>cH
PREM = PREM & P'setosa = P'setosa
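The PA>cH expansion above is an instance of the bit-slice threshold formula: fold the strips Pm ... Pk from right to left, ANDing in Pi where bit bi of c is 1 and ORing it in where bi is 0, with k the rightmost 0-bit of c. A hedged sketch under the same conventions as the earlier bit_slices code (row 0 = most significant bit; the function name is illustrative):

    import numpy as np

    def ptree_greater_than(slices, c):
        """slices: (nbits, n) bool array, row 0 = MSB; c: integer threshold.
        Returns the mask of rows whose attribute value is > c."""
        nbits = slices.shape[0]
        bits = [(c >> i) & 1 for i in range(nbits)]  # b_0 (LSB) .. b_{nbits-1}
        if 0 not in bits:            # c is all ones: no nbits-bit value exceeds it
            return np.zeros(slices.shape[1], dtype=bool)
        k = bits.index(0)            # rightmost bit position with bit-value 0
        result = slices[nbits - 1 - k]   # start the right-binding fold at P_k
        for i in range(k + 1, nbits):
            P_i = slices[nbits - 1 - i]
            result = (P_i & result) if bits[i] else (P_i | result)
        return result

For cH = 8 = 01000 this folds to P4 | (P3 & (P2 | (P1 | P0))), exactly the expansion on the slide, and on the sample's Pedal Width column the complement ~ptree_greater_than(slices, 8) selects precisely the ten setosa rows.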

Performance
O(k), where k is the number of attributes
k is significantly smaller than the n of horizontal methods, where n can be on the order of billions
95% accuracy achieved in the first epoch

As can be seen from the FAUST_P algorithm, its complexity is O(k), where k is the number of attributes (columns). This is extremely fast considering that all horizontal methods are at least O(n), assuming no suitable indexing. The value of k is generally small, ranging from 2 to 7 in the case of Landsat data. Even high-attribute images with, say, k = 200 can be classified rapidly in comparison to horizontal methods, where n is on the order of a billion or more. Our algorithm achieves an accuracy of 95% on the IRIS dataset with only one epoch; higher accuracy can be achieved at the cost of more time.

Conclusion and Future Work
An extremely fast supervised clustering algorithm based on P-Trees
Future work: use the standard deviation in place of the mean for better accuracy

In this paper, we propose a fast attribute-based unsupervised and supervised clustering algorithm. Our algorithm is extremely fast, with only a small compromise in accuracy. In future work, we plan to propose a divisive method that initially places all data points in one cluster and then splits clusters on the maximum gap. We are also in the process of using the standard deviation in place of the mean for better accuracy.