Multimedia Data Mining using P-trees*
William Perrizo, William Jockheck, Amal Perera, Dongmei Ren, Weihua Wu, Yi Zhang
Computer Science Department, North Dakota State University, USA
MDM/KDD2002 Workshop, Edmonton, Alberta, Canada, July 23, 2002
*P-tree technology is patent pending by NDSU
Outline
– Multimedia Data Mining
– Peano Count Trees (P-trees)
– Properties of P-trees
– Data Mining Techniques Using P-trees
– Implementation Issues and Performance
– Conclusion
Multimedia Data Mining
Extract high-level information from large multimedia data sets. Typically done in two steps:
– Capture specific features of the data as feature vectors or tuples in a table or feature space.
– Mine those tuples for information/knowledge: association rule mining (ARM), clustering, or classification on the feature vectors.
Multimedia Data
Remotely Sensed Imagery (RSI)
– Usually 2-D (or 3-D) and relatively smooth.
– Large datasets (e.g., Landsat ETM+ ~100,000,000 pixels).
Video/audio data mining
– Usually results in high-dimensional feature spaces.
– Multimedia datasets are usually very large.
Text mining (feature space is high-dimensional but sparse).
P-trees are well suited for representing such feature spaces:
– Lossless compressed representation.
– Good at manipulating high-dimensional data sets.
Precision Agriculture Dataset
TIFF image and other measurements (1320×1320): RGB, Moisture, Yield, Nitrate.
The Peano Count Tree (P-tree)
A P-tree represents feature-vector data bit by bit, in a recursive, quadrant-by-quadrant, losslessly compressed manner.
– First: given a feature-vector space, vertically fragment it by column, as in the Storage Decomposition Model (SDM; e.g., Bubba, circa 1985). In SDM, each column is a separate file retaining the original row order; this is sometimes called the “vertical database model.”
– Second: for P-trees, vertically fragment further by bit position (bit-SDM): each bit position of each column becomes a separate file, again retaining the original row order. Each resulting file is called a bit-SeQuential (bSQ) or bSDM file. The high-order bSQ file IS the data; the others are deltas (à la MPEG).
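The bit-SDM decomposition above can be sketched in a few lines. This is an illustrative reconstruction (the function name and layout are assumptions, not the authors' code):

```python
# A minimal sketch of the bit-SDM decomposition: each bit position of an
# 8-bit band becomes its own vertical bSQ file, preserving row order.
# j = 0 is the high-order bit file ("the data"); the rest act as deltas.
def to_bsq(band_values, bits=8):
    return [[(v >> (bits - 1 - j)) & 1 for v in band_values]
            for j in range(bits)]

band = [0b11100010, 0b01101100]       # two example pixel values
bsq = to_bsq(band)
print(bsq[0], bsq[1])                 # [1, 0] [1, 1]
```

Each inner list keeps the original row order, so the 8 bit files can be recombined losslessly into the original band.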
An Example of a P-tree
Quadrant-based, with pure (pure-1/pure-0) quadrants, in Peano (Z) ordering.
[Figure: an 8×8 bSQ file shown both as a 1-D bit string and in 2-D raster order, with its quadrant-based Peano count tree: the root holds the total 1-bit count, each interior node holds the 1-bit count of its quadrant, and pure-1 or pure-0 quadrants are not subdivided further.]
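The count-tree construction can be written as a short recursion. This is an illustrative reconstruction over a small 4×4 bit matrix, not the authors' code:

```python
# A minimal sketch of building a Peano count tree over a 2^k x 2^k bit
# matrix: each node stores the 1-bit count of its quadrant, and pure
# (all-0 or all-1) quadrants terminate the recursion as leaves.
def ptree(bits, r=0, c=0, size=None):
    if size is None:
        size = len(bits)
    count = sum(bits[i][j] for i in range(r, r + size)
                           for j in range(c, c + size))
    if count in (0, size * size) or size == 1:   # pure quadrant: leaf
        return (count, None)
    h = size // 2                                # Peano/Z order of children
    kids = [ptree(bits, r,     c,     h), ptree(bits, r,     c + h, h),
            ptree(bits, r + h, c,     h), ptree(bits, r + h, c + h, h)]
    return (count, kids)

m = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [1, 1, 1, 0],
     [1, 1, 1, 1]]
root = ptree(m)
print(root[0])            # root count: 11
```

The two pure quadrants (all-1 and all-0) collapse to single nodes, which is where the compression comes from.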
P-tree Components
[Figure: the example P-tree annotated with its root count (55), levels, fan-out (4), and quadrant IDs (QIDs). Level-1 counts are 16, 8, 15, 16; the mixed quadrants subdivide further (counts 3, 0, 4, 1 and 4, 4, 3, 4). The pixel at raster position (7, 1) has binary coordinates (111, 001), whose bits interleave to 10.10.11, i.e., the Peano QID 2.2.3.]
P-tree Variants
– Peano Mask Tree (PM-tree): 3-value logic (pure-1, pure-0, mixed).
– Pure1-trees: most compressed, 2-value logic (a node is 1 if its quadrant is pure-1, else 0).
– Truth- or predicate-trees: 2-value logic (1-bit = true, 0-bit = false). Given any condition on quadrants (e.g., pure-0, mixed, pure-1), record a 1-bit for each quadrant where the condition holds, else a 0-bit.
– All are lossless compressed representations of the dataset.
P-tree Operations
[Figure: the example count tree (root count 55) alongside the equivalent mask tree, whose nodes are 1 (pure-1), 0 (pure-0), or m (mixed); the basic P-trees for the 1st and 2nd bit positions; the AND and OR results of those two P-trees; and the complement of the example tree, whose root count is 64 − 55 = 9.]
P-tree ANDing Operation
[Figure: PM-tree1 AND PM-tree2 = Result, computed node by node on the compressed mask trees.]
ANDing can also be performed on depth-first pure-1 path codes: PM-tree1’s pure-1 paths {0, 100, 101, 102, 12, 132, 20, 21, 220, 221, 223, 23, 3} AND PM-tree2’s pure-1 paths {0, 20, 21, 22, 231} yield the result paths {0, 20, 21, 220, 221, 223, 231}.
Parallel software implementations on computer clusters are very fast; hardware implementations are being developed.
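The node-by-node AND can be sketched on a small nested-tuple representation. This is an assumption for illustration; the paper's implementation operates on compressed mask trees and pure-1 path codes:

```python
# A minimal sketch of ANDing two quadrant trees, each encoded as 1
# (pure-1), 0 (pure-0), or a 4-tuple of children (mixed).
def p_and(a, b):
    if a == 0 or b == 0:      # anything AND pure-0 is pure-0
        return 0
    if a == 1:                # pure-1 is the identity for AND
        return b
    if b == 1:
        return a
    kids = tuple(p_and(x, y) for x, y in zip(a, b))
    if all(k == 0 for k in kids):   # re-collapse to a pure node
        return 0
    if all(k == 1 for k in kids):
        return 1
    return kids

t1 = (1, 0, (1, 1, 0, 0), 1)
t2 = (1, 1, 1, 0)
print(p_and(t1, t2))          # (1, 0, (1, 1, 0, 0), 0)
```

Note that pure-0 nodes short-circuit the recursion, which is why ANDing stays cheap on compressed trees.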
Various P-trees
– Basic P-trees P_{i,j}: one per band i and bit position j.
– Value P-trees P_i(v): AND of the basic P-trees (or their complements) for value v of band i.
– Tuple P-trees P(v_1, v_2, …, v_n): AND of value P-trees.
– Interval P-trees P_i(v_1, v_2): OR of value P-trees over the interval [v_1, v_2].
– Cube P-trees P([v_11, v_12], …, [v_N1, v_N2]): AND of interval P-trees.
– Predicate P-trees P(p): derived from the others via multiway AND, OR, and COMPLEMENT.
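The value P-tree construction can be sketched with flat bit vectors standing in for the compressed trees (names and layout here are assumptions for illustration):

```python
# A minimal sketch of P_i(v): for each bit position j, AND in the basic
# P-tree P_{i,j} if bit j of v is 1, else its complement. The surviving
# 1-bits mark exactly the tuples whose band-i value equals v.
def value_ptree(bit_vectors, v):
    bits = len(bit_vectors)
    result = [1] * len(bit_vectors[0])
    for j, pij in enumerate(bit_vectors):          # j = 0 is high-order
        want = (v >> (bits - 1 - j)) & 1
        result = [r & (b if want else 1 - b) for r, b in zip(result, pij)]
    return result

# pixels with 2-bit values [3, 1, 2, 3] -> bit files [1,0,1,1], [1,1,0,1]
hi, lo = [1, 0, 1, 1], [1, 1, 0, 1]
print(value_ptree([hi, lo], 3))   # [1, 0, 0, 1]: pixels equal to 3
```

Tuple, interval, and cube P-trees then follow by ANDing/ORing these value results, as listed above.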
Scalability of P-tree Operations
[Chart: time (ms, 0–60) for a software multiway AND vs. dataset size (0–18 million tuples), measured on a Beowulf cluster of 16 dual P2 266 MHz processors with 128 MB RAM.]
Properties of P-trees
4. rc(P_1 | P_2) = 0 ⇔ rc(P_1) = 0 and rc(P_2) = 0
5. v_1 ≠ v_2 ⇒ rc{P_i(v_1) & P_i(v_2)} = 0
6. rc(P_1 | P_2) = rc(P_1) + rc(P_2) − rc(P_1 & P_2)
7. rc{P_i(v_1) | P_i(v_2)} = rc{P_i(v_1)} + rc{P_i(v_2)}, where v_1 ≠ v_2
Notation
P_1 & P_2 : P_1 AND P_2
P_1 | P_2 : P_1 OR P_2
P´ : COMPLEMENT of P
P_{i,j} : basic P-tree for band i, bit j
P_i(v) : value P-tree for value v of band i
P_i(v_1, v_2) : interval P-tree for the interval [v_1, v_2] of band i
P_0 : pure0-tree, a P-tree whose root node is pure-0
P_1 : pure1-tree, a P-tree whose root node is pure-1
rc(P) : root count of P-tree P
N : number of pixels; n : number of bands; m : number of bits
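Property 6 (inclusion-exclusion on root counts) is easy to check on a toy example. Here plain bit lists stand in for compressed P-trees; this is an illustrative sketch, not the P-tree implementation:

```python
# Verify rc(P1 | P2) = rc(P1) + rc(P2) - rc(P1 & P2) on flat bit lists,
# where the root count rc is simply the number of 1-bits.
def rc(bits):
    return sum(bits)

p1 = [1, 1, 0, 0, 1, 0, 1, 0]
p2 = [1, 0, 0, 1, 1, 0, 0, 0]
p_or  = [a | b for a, b in zip(p1, p2)]
p_and = [a & b for a, b in zip(p1, p2)]
assert rc(p_or) == rc(p1) + rc(p2) - rc(p_and)
print(rc(p1), rc(p2), rc(p_and), rc(p_or))   # 4 3 2 5
```

The same identity holds on the compressed trees, since AND, OR, and rc all commute with the quadrant decomposition.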
Techniques Using P-trees
– DTI classifiers
– Bayesian classifiers
– Association rule mining (ARM)
– KNN and closed-KNN classifiers
Techniques Using P-trees
DTI Classifiers
– For large amounts of multimedia data and for data streams, standard decision tree induction (DTI) is very limited in effectiveness.
– Fast calculation of measures such as information gain through P-tree ANDing enables P-tree technology to handle large quantities of data and streaming data.
– The P-tree-based decision tree induction classification method has been shown to be significantly faster than existing DTI classification methods.
Techniques Using P-trees
Bayesian Classifiers
– Computing conditional probabilities can be prohibitive for many multimedia applications, since the data volume is often large. From the very first paper in the 2002 KDD proceedings: “For massive datasets Bayesian methods still begin by a ‘load data into memory’ step, make compromising assumptions, or resort to subsampling to skirt the issue.”
– Naïve Bayesian classification is used to minimize computational costs, but can give poor results (a compromising assumption!).
– P-tree technology avoids the need for Naïve Bayes or subsampling, since conditional probability values derive directly from ANDing P-trees.
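The last point can be sketched as follows, with bit lists standing in for P-trees (illustrative names, not the authors' API):

```python
# Class-conditional probabilities from root counts of ANDed P-trees:
# P(X = v | C = c) is approximated by rc(P(v) & P_class(c)) / rc(P_class(c)),
# so no raw-data scan, memory load, or subsampling is needed.
def rc(bits):
    return sum(bits)

def cond_prob(p_value, p_class):
    both = [a & b for a, b in zip(p_value, p_class)]
    return rc(both) / rc(p_class)

p_v = [1, 0, 1, 1, 0, 0]      # tuples where attribute i has value v
p_c = [1, 1, 1, 0, 0, 0]      # tuples labeled with class c
print(cond_prob(p_v, p_c))    # 2/3, i.e. about 0.667
```

Multi-attribute conditionals work the same way, by ANDing one value P-tree per attribute before dividing.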
Techniques Using P-trees
Association Rule Mining (ARM)
– In most cases, multimedia data sizes are too large to be mined in reasonable time using existing algorithms.
– P-tree techniques used in an efficient association rule mining algorithm, P-ARM, have shown significant improvement compared with FP-growth and Apriori.
Techniques Using P-trees
KNN and Closed-KNN Classifiers
– KNN classifiers typically have a very high cost associated with rebuilding the classifier when new data arrives (e.g., data streams); constructing the neighborhood is the expensive operation.
– P-tree technology finds closed-KNN neighborhoods quickly.
– Experimental results have shown that P-tree closed-KNN yields higher classification accuracy as well as significantly higher speed.
– Our P-KNN algorithm, combined with genetic algorithms (GAs), earned honorable mention in the 2002 KDD-cup competition (task 2) and won one of its two subproblems (the “broad classification problem”). The KDD-cup task-2 data was very much multimedia: hierarchical categorical data, undirected graph data, and text data (MEDLINE abstracts).
Closed-KNN
[Figure: a neighborhood around the target pixel T (the black dot), with several training points on its boundary.]
– For k = 3, standard KNN arbitrarily selects one point from the boundary as the 3rd neighbor; closed-KNN includes all points on the boundary.
– Closed-KNN yields surprisingly higher classification accuracy than traditional KNN, and the closed neighborhood is yielded naturally by P-KNN, while traditional KNN requires another full dataset scan to find it. Therefore, P-KNN is both faster and more accurate.
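The tie-keeping rule above can be sketched directly. This is a hypothetical helper, not the P-tree implementation (which arrives at the closed neighborhood by expanding neighborhood rings):

```python
# Closed-KNN: instead of breaking ties arbitrarily at the k-th distance,
# keep every point whose distance is within the k-th smallest distance.
def closed_knn(distances, k):
    cutoff = sorted(distances)[k - 1]
    return [i for i, d in enumerate(distances) if d <= cutoff]

dists = [2.0, 1.0, 3.0, 2.0, 5.0]
print(closed_knn(dists, 2))    # [0, 1, 3]: both distance-2.0 boundary ties kept
```

For k = 2 the closed neighborhood here has three members, illustrating why closed-KNN can include more than k voters when the boundary is tied.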
Performance: Accuracy
1997 TIFF-Yield dataset.
[Chart: classification accuracy (%, roughly 40–80) vs. training-set size (256 to 262,144 pixels) for KNN with Manhattan (L1), Euclidean (L2), Max (L∞), and HOBbit (high-order basic bit) metrics, and for P-tree closed-KNN with Perfect Centering and with HOBbit.]
Performance: Accuracy (cont.)
1998 TIFF-Yield dataset.
[Chart: classification accuracy (%, roughly 20–65) vs. training-set size (256 to 262,144 pixels) for the same six classifiers: KNN with Manhattan, Euclidean, Max, and HOBbit metrics, and P-tree closed-KNN with Perfect Centering and with HOBbit.]
Performance: Time
1997 dataset; both axes logarithmic.
[Chart: per-sample classification time (sec, 0.00001 to 1) vs. training-set size (256 to 262,144 pixels) for KNN with Manhattan, Euclidean, Max, and HOBbit metrics, and P-tree closed-KNN with Perfect Centering and with HOBbit.]
Performance: Time (cont.)
1998 dataset; both axes logarithmic.
[Chart: per-sample classification time (sec, 0.00001 to 1) vs. training-set size (256 to 262,144 pixels) for the same six classifiers.]
Conclusion
– One of the major issues in multimedia data mining is the sheer size of the resulting feature space.
– The P-tree, a data-mining-ready structure, addresses this issue and facilitates efficient mining of data streams.
– P-tree methods can be faster and more accurate at the same time.
Questions?
William.Perrizo@ndsu.nodak.edu
Computer Science Department, North Dakota State University, USA