
Fast Attribute-based Unsupervised and Supervised Table Clustering using P-Trees
Dr. William Perrizo, North Dakota State University

Contents
Introduction
Predicate-Trees
FAUST_P Algorithm
Performance
Conclusion and Future Work

The contents of this presentation include: an introduction, a brief overview of P-Trees, the novel FAUST_P algorithm illustrated on the Iris dataset, a performance measure, and conclusions and future work.

Introduction
Exponential growth in image data: e.g., NASA has been capturing Earth images at resolutions down to 15 m since the 1970s
Data is archived long before it can be properly analyzed
Existing clustering algorithms are too slow

Since the advent of digital image technology and remote sensing imagery (RSI), massive amounts of image data have been collected worldwide. For example, since 1972, NASA and the U.S. Geological Survey, through the Landsat Data Continuity Mission, have been capturing images of Earth at resolutions down to 15 meters. Consider also the current use of Unmanned Air Vehicles (UAVs) for security purposes, where data is collected on a massive scale and classifying objects of interest, such as tanks and enemy hideouts, is of utmost importance. Because existing clustering algorithms are slow, much of this data is archived without proper analysis. There is thus a pressing demand for a fast clustering algorithm that can keep pace with massive image data collection.

P-Trees
Predicate-Trees are data-mining-ready, lossless, compressed data structures
An effective tool for horizontal processing of vertical data
Each tree is obtained by recursively partitioning one vertical bit strip of the data
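To make the vertical layout concrete, here is a minimal Python sketch (illustrative names, not the NDSU implementation) of the starting point of P-tree construction: slicing an integer attribute column into one bit vector per bit position. A real P-tree would then compress each slice by recursive partitioning into pure-1, pure-0, and mixed segments; this sketch stops at the uncompressed level.

    import numpy as np

    def bit_slices(column, nbits=8):
        """Split an integer attribute column into nbits vertical bit vectors.

        Returns an (nbits, n) boolean array; row j holds bit j of every value,
        most significant bit first. These rows are the raw material that a
        P-tree would compress by recursive partitioning.
        """
        col = np.asarray(column, dtype=np.uint64)
        shifts = np.arange(nbits - 1, -1, -1).astype(np.uint64)  # MSB .. LSB
        return ((col[None, :] >> shifts[:, None]) & np.uint64(1)).astype(bool)

    # Example: the ten setosa Pedal Width values from the next slide, 5 bits each
    pwd_se = [2, 2, 2, 2, 4, 3, 2, 2, 1, 2]
    slices = bit_slices(pwd_se, nbits=5)
    print(slices.astype(int))   # the five vertical bit strips
    print(slices.sum(axis=1))   # 1-count per strip (the root counts of a P-tree)

All predicate evaluation later in the talk (PA>c, complement, AND) then reduces to bitwise operations on these strips instead of scans of horizontal records.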

FAUST_P
Initially PREM = pure1 (every tuple is still unclassified).

1. For each attribute and class, calculate the class means and the gaps between adjacent class means. Build the table MT(attr, class, mean, gapL, gapH, gapREL), sorted descending on gapREL = (gapL + gapH) / (2 * mean), where gapL is the gap below the class mean and gapH the gap above it.
2. Take the MT record with the maximum gapREL. Set cL = mean - gapL/2 and cH = mean + gapH/2. Then PCLASS = PA>cL & P'A>cH & PREM, and PREM = PREM & P'CLASS.
3. Repeat step 2 until every class has a pTree.
4. Repeat steps 1 through 3 until convergence.

[The slide works the algorithm on a 30-row sample of the Iris dataset: ten rows each of setosa (se), versicolor (ve), and virginica (vi), with the four attributes Sepal Length (sSL), Sepal Width (sWD), Pedal Length (pSL), and Pedal Width (pWD) shown both as integers and as their vertical bit slices.]

Step 1 (attr, class, calculate means and mean-gaps) yields, per attribute, the sorted class means and the gaps between adjacent means:

sSL: se 51 (gap 12) vi 63 (gap 7) ve 70
sWD: ve 32 (gap 1) vi 33 (gap 2) se 35
pSL: se 14 (gap 33) ve 47 (gap 13) vi 60
pWD: se 2 (gap 12) ve 14 (gap 11) vi 25

and the MT table below (not yet sorted on gR; at the two extreme classes of each attribute the missing gap is filled in by mirroring the single existing gap):

cl at  mn  gL  gH  gR
se sSL 51  12  12  .26   = (12+12)/(2*51)
se sWD 35   2   2  .06   = ( 2+ 2)/(2*35)
se pSL 14  33  33  2.36  = (33+33)/(2*14)
se pWD  2  12  12  6     = (12+12)/(2* 2)
vi sSL 63   8   7  .12   = ( 8+ 7)/(2*63)
vi sWD 33   1   2  .05   = ( 1+ 2)/(2*33)
vi pSL 60  13  13  2.2   = (13+13)/(2*60)
vi pWD 25  11  11  .44   = (11+11)/(2*25)
ve sSL 70   7   7  .1    = ( 7+ 7)/(2*70)
ve sWD 32   1   1  .03   = ( 1+ 1)/(2*32)
ve pSL 47  33  13  .94   = (33+13)/(2*47)
ve pWD 14  12  11  .82   = (12+11)/(2*14)

We are now separating out the setosa class in step 2.
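Step 1 can be sketched directly from the tables above. This is an illustrative reconstruction in Python, not the authors' code; in particular, the name build_mt is invented, and mirroring the single gap at the two extreme classes is an assumption read off the slide's filled-in values.

    import numpy as np

    def build_mt(X, y):
        """X: (n, k) attribute matrix; y: length-n array of class labels.
        Returns rows (gapREL, cls, attr, mean, gapL, gapH), sorted descending
        on gapREL = (gapL + gapH) / (2 * mean)."""
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        mt = []
        for a in range(X.shape[1]):
            # class means on attribute a, sorted so that neighbors define gaps
            means = sorted((X[y == c, a].mean(), c) for c in np.unique(y))
            for i, (m, c) in enumerate(means):
                lo = m - means[i - 1][0] if i > 0 else None
                hi = means[i + 1][0] - m if i + 1 < len(means) else None
                gapL = lo if lo is not None else hi  # extreme class: mirror the
                gapH = hi if hi is not None else lo  # single existing gap
                mt.append(((gapL + gapH) / (2 * m), c, a, m, gapL, gapH))
        return sorted(mt, reverse=True)

On the 30-row sample this puts (se, pWD) on top with gapREL = 6, the record that step 2 selects next.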
2. The MT record with the maximum gapREL is (se, pWD, mn 2, gL 12, gH 12, gR 6).

cL = mean - gapL/2 = 2 - 12/2 = -4, so PA>cL = Ppure1 (every value exceeds -4).
cH = mean + gapH/2 = 2 + 12/2 = 8 = 01000 in binary, so PA>cH = P4,4 | (P4,3 & (P4,2 | P4,1 | P4,0)).

MT sorted descending on gR:

cl at  mn gL gH  gR
se pWD  2 12 12  6
se pSL 14 33 33  2.36
vi pSL 60 13 13  2.2
ve pSL 47 33 13  .94
ve pWD 14 12 11  .82
vi pWD 25 11 11  .44
se sSL 51 12 12  .26
vi sSL 63  8  7  .12
ve sSL 70  7  7  .1
se sWD 35  2  2  .06
vi sWD 33  1  2  .05
ve sWD 32  1  1  .03

Psetosa = PA>cL & P'A>cH & PREM = Ppure1 & P'A>cH & Ppure1 = P'A>cH
PREM = PREM & P'setosa = P'setosa
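The PA>cH expansion above is an instance of the bit-slice threshold formula: fold the strips Pm ... Pk from right to left, ANDing in Pi where bit bi of c is 1 and ORing it in where bi is 0, with k the rightmost 0-bit of c. A hedged sketch under the same conventions as the earlier bit_slices code (row 0 = most significant bit; the function name is illustrative):

    import numpy as np

    def ptree_greater_than(slices, c):
        """slices: (nbits, n) bool array, row 0 = MSB; c: integer threshold.
        Returns the mask of rows whose attribute value is > c."""
        nbits = slices.shape[0]
        bits = [(c >> i) & 1 for i in range(nbits)]  # b_0 (LSB) .. b_{nbits-1}
        if 0 not in bits:            # c is all ones: no nbits-bit value exceeds it
            return np.zeros(slices.shape[1], dtype=bool)
        k = bits.index(0)            # rightmost bit position with bit-value 0
        result = slices[nbits - 1 - k]   # start the right-binding fold at P_k
        for i in range(k + 1, nbits):
            P_i = slices[nbits - 1 - i]
            result = (P_i & result) if bits[i] else (P_i | result)
        return result

For cH = 8 = 01000 this folds to P4 | (P3 & (P2 | (P1 | P0))), exactly the expansion on the slide, and on the sample's Pedal Width column the complement ~ptree_greater_than(slices, 8) selects precisely the ten setosa rows.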

Performance
O(k), where k is the number of attributes
k is significantly smaller than the n of horizontal methods, where n can be on the order of billions
95% accuracy achieved in the first epoch

As can be seen from the FAUST_P algorithm, its complexity is O(k), where k is the number of attributes (columns). This is extremely fast considering that all horizontal methods are at least O(n), assuming no suitable indexing. The value of k is generally small, ranging from 2 to 7 in the case of Landsat data. Even high-attribute images with, say, k = 200 can be classified rapidly in comparison to horizontal methods, where n is on the order of a billion or more. Our algorithm achieves an accuracy of 95% on the IRIS dataset with only one epoch; higher accuracy can be achieved at the cost of more time.

Conclusion and Future Work
An extremely fast supervised clustering algorithm based on P-Trees
Future work: use the standard deviation in place of the mean for better accuracy

In this paper, we propose a fast attribute-based unsupervised and supervised clustering algorithm. Our algorithm is extremely fast, with only a small compromise in accuracy. In future work, we plan to propose a divisive method that initially places all data points in one cluster and then splits clusters on the maximum gap. We are also in the process of using the standard deviation in place of the mean for better accuracy.