Multimedia Data Mining using P-trees*
William Perrizo, William Jockheck, Amal Perera, Dongmei Ren, Weihua Wu, Yi Zhang
Computer Science Department, North Dakota State University, USA
MDM/KDD2002 Workshop, Edmonton, Alberta, Canada, July 23, 2002
*P-tree technology is patent pending by NDSU
Outline
– Multimedia Data Mining
– Peano Count Trees (P-trees)
– Properties of P-trees
– Data Mining Techniques Using P-trees
– Implementation Issues and Performance
– Conclusion
Multimedia Data Mining
Extract high-level information from large multimedia data sets. Typically done in two steps:
– Capture specific features of the data as feature vectors or tuples in a table or feature space.
– Mine those tuples for information or knowledge, e.g., by association rule mining (ARM), clustering, or classification on the feature vectors.
Multimedia Data
Remotely Sensed Imagery (RSI)
– Usually 2-D (or 3-D) and relatively smooth
– Large datasets (e.g., Landsat ETM+, ~100,000,000 pixels)
Video-audio data mining
– Usually results in high-dimensional feature spaces
– Multimedia datasets are usually very large
Text mining
– Feature space is high dimensional but sparse
P-trees are well suited for representing such feature spaces
– Lossless compressed representation
– Good at manipulating high-dimensional data sets
Precision Agriculture Dataset
TIFF image and other measurements (1320×1320)
[Figure: images labeled RGB, Moisture, Yield, and Nitrate]
The Peano Count Tree (P-tree)
A P-tree represents feature vector data bit by bit, in a recursive, quadrant-by-quadrant, losslessly compressed manner.
– First: given a feature vector space, vertically fragment it by column. This is the Storage Decomposition Model (SDM; e.g., Bubba, circa 1985). In SDM, each column is a separate file that retains the original row order. It is sometimes called the "Vertical Database Model".
– Second: for P-trees, vertically fragment further by bit position (bit-SDM, or bSDM). Each bit position of each column is a file that retains the original row order. Each resulting file is called a bit-SeQuential (bSQ) or bSDM file.
– The high-order bSQ file IS data. The others are DELTAs (à la MPEG).
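As a rough illustration of the bit-position fragmentation described above, the following Python sketch splits one column of unsigned integers into bit-sequential lists, high-order bit first, and rebuilds the column to show that nothing is lost. The names `to_bsq` and `from_bsq` and the list-of-ints representation are assumptions made for this sketch; they are not the authors' file format.

```python
def to_bsq(column, bits=8):
    """Split a column of unsigned ints into one bit list per bit position.

    The first list is the high-order bit plane; row order is preserved.
    """
    return [[(value >> (bits - 1 - j)) & 1 for value in column]
            for j in range(bits)]


def from_bsq(planes):
    """Rebuild the original column from its bit planes (lossless)."""
    bits = len(planes)
    rows = len(planes[0])
    return [sum(planes[j][row] << (bits - 1 - j) for j in range(bits))
            for row in range(rows)]


column = [5, 3, 7, 0]             # 3-bit values: 101, 011, 111, 000
planes = to_bsq(column, bits=3)
print(planes)                     # [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 1, 0]]
print(from_bsq(planes) == column)  # True
```

Each inner list plays the role of one bSQ file; a P-tree would then be built over each list separately.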
An example of a P-tree
[Figure: a bSQ file, the same file in 2-D raster order, and the resulting P-tree. Labeled terms: quadrant-based; pure (pure-1/pure-0) quadrant; Peano or Z-ordering; root count.]
[Figure: P-tree example. Labeled terms: Peano or Z-ordering; pure-1/pure-0 quadrant; root count; level; fan-out; QID (quadrant ID). Example pixel: (7, 1) = (111, 001) in binary.]
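The pixel example above, (7, 1) written in binary as (111, 001), can be turned into a quadrant path by pairing the row and column bits level by level. The sketch below assumes one particular convention (quadrant digit = 2 × row bit + column bit, high-order bits first); the slide text does not state the convention, and the function names are hypothetical.

```python
def quadrant_id(row, col, levels):
    """Return the quadrant path of a pixel, one digit (0-3) per tree level.

    Assumed convention: digit = 2 * row_bit + col_bit, high-order bits first.
    """
    path = []
    for level in range(levels - 1, -1, -1):
        row_bit = (row >> level) & 1
        col_bit = (col >> level) & 1
        path.append(2 * row_bit + col_bit)
    return path


def peano_index(row, col, levels):
    """Position of the pixel in Peano (Z) order: the path read as base 4."""
    index = 0
    for digit in quadrant_id(row, col, levels):
        index = index * 4 + digit
    return index


print(quadrant_id(7, 1, 3))   # [2, 2, 3]
print(peano_index(7, 1, 3))   # 43
```

Under this convention, pixels that share a path prefix lie in the same quadrant, which is what lets a pure quadrant be stored as a single node.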
– Peano Mask Tree or PM-tree (3-value logic)
– Pure1-trees (most compressed, 2-value logic): pure-1 quadrant = 1, else 0
– Truth- or Predicate-trees (2-value logic: 1-bit = T, 0-bit = F): given any condition on a quadrant (e.g., pure-0, mixed, pure-1), the node holds a 1-bit if the condition is true for that quadrant, else a 0-bit
– All are lossless compressed representations of the dataset
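To make the mask-tree idea concrete, here is a minimal Python sketch that builds a PM-tree from a square 0/1 raster and recovers the root count from it. The nested-list representation (a leaf `0` or `1` for a pure quadrant, a four-element list for a mixed one), the child order, and the sample raster are all assumptions chosen for illustration.

```python
def build_pm_tree(bits, top=0, left=0, size=None):
    """Build a Peano mask tree from a square 2^k x 2^k raster of 0/1 values.

    A pure quadrant becomes the leaf 0 or 1; a mixed quadrant becomes a
    list of four subtrees (upper-left, upper-right, lower-left, lower-right).
    """
    if size is None:
        size = len(bits)
    if size == 1:
        return bits[top][left]
    half = size // 2
    children = [build_pm_tree(bits, top + dr, left + dc, half)
                for dr in (0, half) for dc in (0, half)]
    if all(child == 0 for child in children):
        return 0
    if all(child == 1 for child in children):
        return 1
    return children


def root_count(tree, size):
    """Number of 1-bits in a (sub)tree that covers size x size pixels."""
    if tree == 1:
        return size * size
    if tree == 0:
        return 0
    return sum(root_count(child, size // 2) for child in tree)


RASTER = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]

tree = build_pm_tree(RASTER)
print(tree[0], tree[3])        # 1 1  (two pure-1 quadrants)
print(root_count(tree, 8))     # 55
```

Only the two mixed quadrants are expanded further, which is where the compression comes from.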
P-tree Operations
[Figure: a count-tree with root count 55 (quadrant counts 16, 8, 15, 16) and the corresponding mask-tree (root m; quadrants 1, m, m, 1); the P-trees for the 1st and 2nd bits with their AND and OR results; and the complement trees (count-tree root 9, quadrant counts 0, 8, 1, 0; mask-tree root m, quadrants 0, m, m, 0).]
P-tree ANDing Operation
[Figure: PM-tree1 (root m; quadrants 1, m, m, 1) ANDed with PM-tree2 (root m; quadrants 1, 0, m, 0) gives a result with root m and quadrants 1, 0, m, 0. Also shown: depth-first pure-1 path codes.]
– Parallel software implementations on computer clusters are very fast.
– Hardware implementations are being developed.
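The ANDing shown above can be sketched recursively on mask trees. This is a minimal illustration, not the authors' implementation (it does not use the pure-1 path codes): it assumes a nested-list representation in which a pure quadrant is the leaf `0` or `1` and a mixed quadrant is a list of four subtrees. The function names are hypothetical.

```python
def p_and(a, b):
    """AND two mask trees of the same size."""
    if a == 0 or b == 0:
        return 0                      # a pure-0 operand forces pure-0
    if a == 1:
        return b                      # a pure-1 operand leaves the other unchanged
    if b == 1:
        return a
    children = [p_and(x, y) for x, y in zip(a, b)]
    # Two mixed quadrants can AND to pure-0; collapse that case to a leaf.
    return 0 if all(child == 0 for child in children) else children


def p_not(a):
    """Complement a mask tree: swap pure-0 and pure-1, keep mixed nodes mixed."""
    if a == 0 or a == 1:
        return 1 - a
    return [p_not(child) for child in a]


def p_or(a, b):
    """OR via De Morgan: a | b = not(not a & not b)."""
    return p_not(p_and(p_not(a), p_not(b)))


a = [1, [1, 1, 1, 0], 0, [0, 1, 0, 1]]
b = [[0, 1, 0, 0], 1, 1, [1, 0, 1, 0]]
print(p_and(a, b))   # [[0, 1, 0, 0], [1, 1, 1, 0], 0, 0]
print(p_or(a, b))    # 1
```

Whole quadrants are resolved at the first pure node, so the work depends on the size of the compressed trees rather than on the number of pixels.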
Various P-trees
– Basic P-trees: P_{i,j}
– Value P-trees: P_i(v)
– Tuple P-trees: P(v_1, v_2, …, v_n)
– Interval P-trees: P_i(v_1, v_2)
– Cube P-trees: P([v_11, v_12], …, [v_N1, v_N2])
– Predicate P-trees: P(p)
[Figure: diagram deriving these P-tree types from one another; the arrows are labeled with the operations used (AND; OR; COMPLEMENT; one multiway AND, OR, COMPLEMENT program).]
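One plausible reading of the derivation above is that a value P-tree is the AND of the basic P-trees of a band, complemented wherever the corresponding bit of the value is 0, and that a tuple P-tree is the AND of one value P-tree per band. The sketch below illustrates that reading with uncompressed bit lists standing in for P-trees; the function names and the sample data are assumptions.

```python
def value_ptree(planes, value):
    """Bit list marking the rows of one band that equal `value`.

    planes[j] is the bit list for bit position j (high-order first).
    Uses the plane itself where the value has a 1-bit, its complement otherwise.
    """
    bits = len(planes)
    result = [1] * len(planes[0])
    for j in range(bits):
        want = (value >> (bits - 1 - j)) & 1
        result = [r & (b if want else 1 - b)
                  for r, b in zip(result, planes[j])]
    return result


def tuple_ptree(bands, values):
    """AND of the value bit lists, one per band."""
    result = [1] * len(bands[0][0])
    for planes, value in zip(bands, values):
        result = [r & b for r, b in zip(result, value_ptree(planes, value))]
    return result


# Band 1 holds the 3-bit values [5, 3, 7, 5]; band 2 holds the 2-bit values [2, 2, 0, 1].
band1 = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 1, 1]]
band2 = [[1, 1, 0, 0], [0, 0, 0, 1]]

print(value_ptree(band1, 5))               # [1, 0, 0, 1]
print(tuple_ptree([band1, band2], [5, 2]))  # [1, 0, 0, 0]
```

The sum of a resulting bit list corresponds to the root count of the matching P-tree.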
Scalability of P-tree Operations
[Figure: software multi-way ANDing; x-axis: dataset size (million tuples); y-axis: time (ms). Run on a Beowulf cluster of 16 dual P2 266 MHz processors with 128 MB RAM.]
Properties of P-trees
1. a), b): [formulas not preserved in the slide text]
2. a) to d): [formulas not preserved in the slide text]
3. a) to d): [formulas not preserved in the slide text]
4. rc(P_1 | P_2) = 0 ⇒ rc(P_1) = 0 and rc(P_2) = 0
5. v_1 ≠ v_2 ⇒ rc{P_i(v_1) & P_i(v_2)} = 0
6. rc(P_1 | P_2) = rc(P_1) + rc(P_2) - rc(P_1 & P_2)
7. rc{P_i(v_1) | P_i(v_2)} = rc{P_i(v_1)} + rc{P_i(v_2)}, where v_1 ≠ v_2
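The root-count identities in properties 5 to 7 can be checked numerically on small examples. In the sketch below, Python integers used as bit masks stand in for P-trees and `rc` counts the 1-bits; both the stand-in and the sample values are assumptions for illustration.

```python
def rc(mask):
    """Root count: number of 1-bits in the mask."""
    return bin(mask).count("1")


# Property 6: inclusion-exclusion on two arbitrary masks.
p1 = 0b11010010
p2 = 0b01110001
print(rc(p1 | p2), rc(p1) + rc(p2) - rc(p1 & p2))   # 6 6

# Properties 5 and 7: masks for two different values of the same band
# cannot overlap, so the root counts simply add.
column = [5, 3, 7, 5]


def value_mask(col, v):
    return sum(1 << row for row, x in enumerate(col) if x == v)


m5, m3 = value_mask(column, 5), value_mask(column, 3)
print(rc(m5 & m3))                       # 0
print(rc(m5 | m3), rc(m5) + rc(m3))      # 3 3
```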
Notations
– P_1 & P_2 : P_1 AND P_2
– P_1 | P_2 : P_1 OR P_2
– P´ : COMPLEMENT of P
– P_{i,j} : basic P-tree for band i, bit j
– P_i(v) : value P-tree for value v of band i
– P_i(v_1, v_2) : interval P-tree for interval [v_1, v_2] of band i
– P^0 : pure0-tree, a P-tree whose root node is pure0
– P^1 : pure1-tree, a P-tree whose root node is pure1
– rc(P) : root count of P-tree P
– N : number of pixels
– n : number of bands
– m : number of bits
Techniques Using P-trees
– DTI (decision tree induction) Classifiers
– Bayesian Classifiers
– ARM
– KNN and Closed KNN Classifiers
Techniques Using P-trees
DTI Classifiers
– For large amounts of multimedia data and data streams, standard decision tree induction (DTI) is very limited in effectiveness.
– Fast calculation of measures such as information gain, through P-tree ANDing, enables P-tree technology to handle large quantities of data and streaming data.
– The P-tree-based DTI classification method was shown to be significantly faster than existing DTI classification methods.
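As a hedged sketch of how information gain can be computed from ANDing alone: every count the formula needs is the number of rows that have a given attribute value and a given class, which is the root count of the AND of the two corresponding P-trees. Below, Python integers used as bit masks stand in for P-trees; the function names and sample masks are assumptions, not the authors' code.

```python
from math import log2


def rc(mask):
    """Root count: number of 1-bits in the mask."""
    return bin(mask).count("1")


def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def information_gain(attr_masks, class_masks, n):
    """Gain of splitting n rows on an attribute, using only AND and rc."""
    base = entropy([rc(c) for c in class_masks.values()])
    remainder = 0.0
    for a in attr_masks.values():
        size = rc(a)
        if size:
            split = [rc(a & c) for c in class_masks.values()]
            remainder += size / n * entropy(split)
    return base - remainder


classes = {"yes": 0b11110000, "no": 0b00001111}
perfect = {"high": 0b11110000, "low": 0b00001111}   # separates the classes
useless = {"x": 0b11001100, "y": 0b00110011}        # independent of the class

print(information_gain(perfect, classes, 8))   # 1.0
print(information_gain(useless, classes, 8))   # 0.0
```

No pass over individual rows is needed once the masks exist, which is the property the slide relies on.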
Techniques Using P-trees
Bayesian Classifiers
– Computing conditional probabilities can be prohibitive for many multimedia applications, since the data volume is often large. From the very first paper in the 2002 KDD proceedings: "For massive datasets Bayesian methods still begin by a 'load data into memory' step, make compromising assumptions, or resort to subsampling to skirt the issue."
– Naïve Bayesian classification is used to minimize computational costs, but can give poor results (a compromising assumption!).
– P-tree technology avoids the need for Naïve Bayesian classification or subsampling, since conditional probability values derive directly from ANDing P-trees.
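A minimal sketch of the last point: the probability of a class given several feature values is a ratio of root counts, obtained by ANDing the value P-trees of the features with each class P-tree, with no independence assumption. Python integers used as bit masks stand in for P-trees; the function name and sample masks are assumptions, and the class masks are assumed to partition the rows.

```python
from functools import reduce
from operator import and_


def rc(mask):
    """Root count: number of 1-bits in the mask."""
    return bin(mask).count("1")


def class_posteriors(evidence_masks, class_masks):
    """P(class | all feature values) from root counts of ANDed masks."""
    evidence = reduce(and_, evidence_masks)
    counts = {label: rc(evidence & mask)
              for label, mask in class_masks.items()}
    total = sum(counts.values())
    if total == 0:
        return {}          # no training row matches every feature value
    return {label: count / total for label, count in counts.items()}


# Masks of the rows where feature 1 and feature 2 take the observed values.
evidence_masks = [0b11101100, 0b01111010]
class_masks = {"A": 0b11110000, "B": 0b00001111}

posteriors = class_posteriors(evidence_masks, class_masks)
print(max(posteriors, key=posteriors.get))   # A
```

Here three rows match both feature values, two of them in class A, so the estimate for A is 2/3.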
Techniques Using P-trees
Association Rule Mining (ARM)
– In most cases multimedia data sizes are too large to be mined in reasonable time using existing algorithms.
– P-ARM, an efficient association rule mining algorithm based on P-tree techniques, has shown significant improvement compared with FP-growth and Apriori.
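The following sketch shows only the counting step that an ANDing-based approach makes cheap: the support of an itemset is the root count of the AND of its item P-trees, divided by the number of rows. It is not the P-ARM algorithm itself, whose details are not given on the slide. Python integers used as bit masks stand in for P-trees, and all names and data are assumptions.

```python
from functools import reduce
from operator import and_


def rc(mask):
    """Root count: number of 1-bits in the mask."""
    return bin(mask).count("1")


def support(itemset, item_masks, n):
    """Fraction of the n rows that contain every item in the itemset."""
    return rc(reduce(and_, (item_masks[item] for item in itemset))) / n


def confidence(antecedent, consequent, item_masks, n):
    """Confidence of the rule antecedent -> consequent."""
    both = support(antecedent | consequent, item_masks, n)
    return both / support(antecedent, item_masks, n)


item_masks = {"a": 0b11110000, "b": 0b11001100, "c": 0b10101010}

print(support({"a", "b"}, item_masks, 8))         # 0.25
print(confidence({"a"}, {"b"}, item_masks, 8))    # 0.5
```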
Techniques Using P-trees
KNN and Closed KNN Classifiers
– KNN classifiers typically have a very high cost associated with rebuilding the classifier when new data arrives (e.g., data streams). Constructing the neighborhood is the high-cost operation.
– P-tree technology finds closed-KNN neighborhoods quickly.
– Experimental results have shown that P-tree closed-KNN yields higher classification accuracy as well as significantly higher speed.
– Our P-KNN algorithm, combined with GAs, earned honorable mention in the 2002 KDD-cup competition (task 2) and actually won one of the two subproblems (the "broad classification problem"). The KDD-cup task 2 data was very much multimedia: hierarchical categorical data, undirected graph data, and text data (Medline abstracts).
Closed-KNN
[Figure: the black dot T is the target pixel.]
– For k = 3, to find the 3rd nearest neighbor, standard KNN arbitrarily selects one point from the boundary as the 3rd neighbor. Closed-KNN includes all points on the boundary.
– Closed-KNN yields surprisingly higher classification accuracy than traditional KNN, and the closed neighborhood is produced naturally by P-KNN, while traditional KNN requires another full dataset scan to find the closed neighborhood. Therefore, P-KNN is both faster and more accurate.
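The closed neighborhood itself is easy to state in code. The sketch below is a plain list-based illustration of the definition (keep every point that is at least as close as the k-th nearest one), not the P-tree-based P-KNN procedure; the function names, the Manhattan metric, and the sample points are assumptions.

```python
from collections import Counter


def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


def closed_knn(train, target, k, dist=manhattan):
    """All training points no farther than the k-th nearest neighbor.

    Ties on the boundary are all kept, so the result may hold more than k points.
    Assumes 1 <= k <= len(train).
    """
    distances = sorted(dist(x, target) for x, _ in train)
    radius = distances[k - 1]
    return [(x, label) for x, label in train if dist(x, target) <= radius]


def classify(train, target, k, dist=manhattan):
    """Majority vote over the closed neighborhood."""
    votes = Counter(label for _, label in closed_knn(train, target, k, dist))
    return votes.most_common(1)[0][0]


train = [((1, 0), "a"), ((0, 2), "b"), ((2, 0), "b"),
         ((0, -2), "b"), ((5, 5), "a"), ((0, 1), "a")]

print(len(closed_knn(train, (0, 0), k=3)))   # 5
print(classify(train, (0, 0), k=3))          # b
```

With k = 3, three points tie at distance 2 for third place. Standard KNN would keep one of them arbitrarily; the closed neighborhood keeps all three, giving five neighbors and a 3-to-2 vote for class "b".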
Performance – Accuracy
[Figure: 1997 TIFF-Yield dataset; x-axis: training set size (no. of pixels); y-axis: accuracy (%). Series: KNN-Manhattan (L1), KNN-Euclidean (L2), KNN-Max (L∞), KNN-Hobbit (high-order basic bit), P-tree: Perfect Centering (closed-KNN), P-tree: Hobbit (closed-KNN).]
Performance – Accuracy (cont.)
[Figure: 1998 TIFF-Yield dataset; x-axis: training set size (no. of pixels); y-axis: accuracy (%). Series: KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree: Perfect Centering (closed-KNN), P-tree: Hobbit (closed-KNN).]
Performance – Time
[Figure: 1997 dataset, both axes on logarithmic scales; x-axis: training set size (no. of pixels); y-axis: per-sample classification time (sec). Series: KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree: Perfect Centering (closed-KNN), P-tree: Hobbit (closed-KNN).]
Performance – Time (cont.)
[Figure: 1998 dataset, both axes on logarithmic scales; x-axis: training set size (no. of pixels); y-axis: per-sample classification time (sec). Series: KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree: Perfect Centering (closed-KNN), P-tree: Hobbit (closed-KNN).]
Conclusion
– One of the major issues in multimedia data mining is the sheer size of the resulting feature space.
– The P-tree, a data-mining-ready structure, deals with this issue and facilitates efficient data mining of streams.
– P-tree methods can be faster and more accurate at the same time.
Questions?
Computer Science Department, North Dakota State University, USA
MDM/KDD2002 Workshop, Edmonton, Alberta, Canada