1
Fast Attribute-based Unsupervised and Supervised Table Clustering using P-Trees
Dr. William Perrizo, North Dakota State University
2
Contents
Introduction
Brief overview of P-Trees
The FAUST_P algorithm, illustrated on the Iris dataset
Performance
Conclusion and Future Work
3
Introduction
Exponential growth in image data, e.g. NASA capturing Earth images down to 15 m resolution since the 1970s
Data is archived long before proper analysis
Existing clustering algorithms are slow

Since the advent of digital imaging technology and remote sensing imagery (RSI), massive amounts of image data have been collected worldwide. For example, since 1972, NASA and the U.S. Geological Survey, through the Landsat Data Continuity Mission, have been capturing images of Earth down to 15 meters resolution. Consider also the current use of Unmanned Air Vehicles (UAVs) for security purposes, where data collection is massive and classifying objects of interest, such as tanks or enemy hideouts, is of utmost importance. Because existing clustering algorithms are slow, much of this data is archived without proper analysis. There is thus a pressing demand for a fast clustering algorithm that can cope with massive image data collection.
4
P-Trees
Predicate-Trees are data-mining-ready, lossless and compressed data structures
An effective tool for horizontal processing of vertical data
The tree is obtained by recursively partitioning the vertical strips of data
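As a concrete illustration of horizontal (bitwise) processing of vertical data, here is a minimal Python sketch, assuming 4-bit attribute values. The helper names (bit_slices, p_greater, etc.) are ours, not from the presentation, and plain Python lists stand in for the compressed pTrees:

```python
# Minimal sketch of the vertical bit-slice idea behind P-Trees (assumed 4-bit values).
# Each attribute column is decomposed into one bit vector per bit position; a range
# predicate such as A > c is then answered purely with bitwise AND/OR over slices.

def bit_slices(column, bits=4):
    """Decompose an integer column into bit vectors, high bit first."""
    return [[(v >> j) & 1 for v in column] for j in range(bits - 1, -1, -1)]

def p_and(a, b):
    return [x & y for x, y in zip(a, b)]

def p_or(a, b):
    return [x | y for x, y in zip(a, b)]

def p_not(a):
    return [1 - x for x in a]

def p_greater(slices, c, bits=4):
    """Bit vector marking rows with value > c, built slice by slice.
    Standard bit-sliced comparison, processed from least significant bit up:
    if bit j of c is 1, AND in the slice; if it is 0, OR it in."""
    result = [0] * len(slices[0])
    for j in range(bits):            # j = bit position, 0 = LSB
        s = slices[bits - 1 - j]     # slices are stored high bit first
        if (c >> j) & 1:
            result = p_and(s, result)
        else:
            result = p_or(s, result)
    return result

column = [5, 12, 3, 9]
slices = bit_slices(column)
mask = p_greater(slices, 8)          # rows with value > 8
print(mask)  # -> [0, 1, 0, 1]
```

Unrolling the loop for a 5-bit attribute and c = 8 yields exactly a nested AND/OR formula over the slices, which is the shape of the predicate expressions used later in the deck.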
5
FAUST_P
Initially PREM = Ppure1 (the all-ones pTree).

1. For each attribute and class, calculate the class means and the gaps between adjacent means. Build the mean table MT(attr, class, mn, gapL, gapH, gapREL), sorted descending on gapREL = (gapL + gapH)/(2*mn), where gapL is the gap down to the next-lower class mean and gapH the gap up to the next-higher one.
2. Take the MT record with maximum gapREL. Set cut points cL = mn - gapL/2 and cH = mn + gapH/2, and compute PCLASS = PA>cL & P'A>cH & PREM, then PREM = PREM & P'CLASS.
3. Repeat step 2 until every class has a pTree.
4. Repeat steps 1-3 until convergence.

Worked example on the Iris dataset (classes: se = setosa, ve = versicolor, vi = virginica; attributes: sLN = sepal length, sWD = sepal width, pLN = pedal length, pWD = pedal width). Class means, with the gap between adjacent means:

sLN: se 51 | 12 | vi 63 |  7 | ve 70
sWD: ve 32 |  1 | vi 33 |  2 | se 35
pLN: se 14 | 33 | ve 47 | 13 | vi 60
pWD: se  2 | 12 | ve 14 | 11 | vi 25

For a boundary class the missing gap is filled in with the existing one. The relative gaps gapREL = (gapL + gapH)/(2*mn) are then:

se sLN (12+12)/(2*51)   vi sLN (12+ 7)/(2*63)   ve sLN ( 7+ 7)/(2*70)
se sWD ( 2+ 2)/(2*35)   vi sWD ( 1+ 2)/(2*33)   ve sWD ( 1+ 1)/(2*32)
se pLN (33+33)/(2*14)   vi pLN (13+13)/(2*60)   ve pLN (33+13)/(2*47)
se pWD (12+12)/(2* 2)   vi pWD (11+11)/(2*25)   ve pWD (12+11)/(2*14)

MT sorted descending on gapREL: se pWD, se pLN, vi pLN, ve pLN, ve pWD, vi pWD, se sLN, vi sLN, ve sLN, se sWD, vi sWD, ve sWD.

Separating out the setosa class: the maximum record is se pWD (gapREL = 24/4 = 6), giving cL = 2 - 12/2 = -4 and cH = 2 + 12/2 = 8. Every pedal width exceeds -4, so PA>cL = Ppure1, while PA>cH = (P4,4 | (P4,3 & (P4,2 | (P4,1 | P4,0)))). Hence

Psetosa = PA>cL & P'A>cH & PREM = Ppure1 & P'A>cH & Ppure1 = P'A>cH
PREM = PREM & P'setosa = P'setosa
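The mean-table and cut-point bookkeeping above can be sketched in a few lines of Python. This is a sketch, not the paper's implementation: the means are the ones shown on the slide, the function and variable names (mean_table, means, etc.) are ours, and pTrees are replaced by plain tuples for brevity:

```python
# Hedged sketch of FAUST_P's mean-table construction and cut-point selection,
# using the per-class attribute means from the slide (classes se/ve/vi).

# attribute -> list of (class, mean), ordered by mean, as on the slide
means = {
    "sLN": [("se", 51), ("vi", 63), ("ve", 70)],
    "sWD": [("ve", 32), ("vi", 33), ("se", 35)],
    "pLN": [("se", 14), ("ve", 47), ("vi", 60)],
    "pWD": [("se", 2), ("ve", 14), ("vi", 25)],
}

def mean_table(means):
    """Build MT(attr, class, mn, gapL, gapH, gapREL), sorted descending on
    gapREL = (gapL + gapH) / (2 * mn).  A boundary class has only one
    neighboring gap; the missing side is filled in with the existing one
    (the slide's fill-ins)."""
    mt = []
    for attr, row in means.items():
        for i, (cls, m) in enumerate(row):
            gap_lo = row[i][1] - row[i - 1][1] if i > 0 else None
            gap_hi = row[i + 1][1] - row[i][1] if i < len(row) - 1 else None
            gap_lo = gap_lo if gap_lo is not None else gap_hi
            gap_hi = gap_hi if gap_hi is not None else gap_lo
            gap_rel = (gap_lo + gap_hi) / (2 * m)
            mt.append((attr, cls, m, gap_lo, gap_hi, gap_rel))
    return sorted(mt, key=lambda r: -r[-1])

mt = mean_table(means)
attr, cls, m, gL, gH, gR = mt[0]       # record with the largest relative gap
cL, cH = m - gL / 2, m + gH / 2        # cut points around that class mean
print(attr, cls, gR, cL, cH)  # -> pWD se 6.0 -4.0 8.0
```

Running this reproduces the slide's first separation step: pedal width isolates setosa first, with cut points -4 and 8; the class pTree would then be formed by ANDing the corresponding A > cL and NOT(A > cH) predicate trees with PREM.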
6
Performance
O(k), where k is the number of attributes
k is significantly smaller than the n of horizontal methods, where n can be on the order of billions
95% accuracy achieved in the first epoch

As the FAUST_P algorithm shows, its complexity is O(k), where k is the number of attributes (columns). This is extremely fast, considering that all horizontal methods are at least O(n), assuming no suitable indexing. The value of k is generally small, ranging from 2 to 7 in the case of Landsat data; even high-attribute images with, for example, k = 200 can be classified rapidly, whereas n for horizontal methods is on the order of a billion or more. Our algorithm achieves an accuracy of 95% on the Iris dataset with only one epoch; higher accuracy can be achieved at the cost of time.
7
Conclusion and Future Work
An extremely fast supervised clustering algorithm based on P-Trees
Future work: use the standard deviation in place of the mean for better accuracy

In this paper, we propose a fast attribute-based unsupervised and supervised clustering algorithm. It is extremely fast, with a small compromise on accuracy. In future work, we plan to propose a divisive method that initially places all data points in one cluster and splits on the maximum gap. We are also in the process of using the standard deviation for better accuracy.