Multimedia Data Mining using P-trees*
William Perrizo, William Jockheck, Amal Perera, Dongmei Ren, Weihua Wu, Yi Zhang
Computer Science Department, North Dakota State University, USA
MDM/KDD 2002 Workshop, Edmonton, Alberta, Canada, July 23, 2002
*P-tree technology is patent pending by NDSU

Outline
– Multimedia Data Mining
– Peano Count Trees (P-trees)
– Properties of P-trees
– Data Mining Techniques Using P-trees
– Implementation Issues and Performance
– Conclusion

Multimedia Data Mining
– Extract high-level information from large multimedia data sets. This is typically done in two steps (a sketch of step one follows below):
– Capture specific features of the data as feature vectors, i.e., tuples in a table or feature space.
– Mine those tuples for information/knowledge: association rule mining (ARM), clustering, or classification on the feature vectors.
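As a minimal, hypothetical sketch of the first step (not from the paper), the code below turns a multi-band image into a feature table by taking per-block band means as the feature vectors; the function name block_features and the choice of block means are assumptions made for illustration.

```python
import numpy as np

def block_features(image, block=4):
    """Capture features: one tuple per block, the mean of each band over that
    block. Block means are just one possible feature; any per-pixel or
    per-region descriptor could be used instead."""
    h, w, bands = image.shape
    rows = []
    for i in range(0, h - h % block, block):
        for j in range(0, w - w % block, block):
            rows.append(image[i:i + block, j:j + block].reshape(-1, bands).mean(axis=0))
    return np.array(rows)

# A 16x16 image with 3 bands yields 16 feature tuples of 3 values each.
img = np.random.randint(0, 256, (16, 16, 3), dtype=np.uint8)
print(block_features(img).shape)   # (16, 3)
```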

Multimedia Data
– Remotely Sensed Imagery (RSI): usually 2-D (or 3-D) and relatively smooth; large datasets (e.g., Landsat ETM+ has roughly 100,000,000 pixels).
– Video/audio data mining: usually results in high-dimensional feature spaces, and the datasets are usually very large.
– Text mining: the feature space is high-dimensional but sparse.
P-trees are well suited for representing such feature spaces:
– Lossless, compressed representation
– Good at manipulating high-dimensional data sets

Precision Agriculture Dataset (figure): a 1320×1320 TIFF image and other measurements, with RGB, Moisture, Yield, and Nitrate bands.

The Peano Count Tree (P-tree)
A P-tree represents feature-vector data bit-by-bit, in a recursive, quadrant-by-quadrant, losslessly compressed manner.
– First: given a feature-vector space, vertically fragment it by column, as in the Storage Decomposition Model (SDM, e.g., Bubba, circa 1985), sometimes called the "Vertical Database Model." In SDM, each column is a separate file retaining the original row order.
– Second: for P-trees, we vertically fragment further by bit position (bit-SDM): each bit position of each column is a file, retaining the original row order. Each resulting file is called a bit-SeQuential (bSQ) or bSDM file. The high-order bSQ file IS data; the others are deltas (à la MPEG).
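A minimal sketch of the bit-SDM step, assuming 8-bit band values stored in Peano/raster order; the function name to_bsq is illustrative, not the NDSU implementation.

```python
import numpy as np

def to_bsq(band, nbits=8):
    """Split one band (a 1-D array of unsigned ints, original row order kept)
    into nbits bit-sequential (bSQ) vectors, one per bit position.
    Index 0 is the high-order bit, i.e., the bSQ file that 'IS data'."""
    band = np.asarray(band, dtype=np.uint8)
    return [((band >> (nbits - 1 - j)) & 1).astype(np.uint8) for j in range(nbits)]

# Example: a tiny band of 4 pixel values -> 8 bit vectors of length 4 each.
bsq_files = to_bsq([200, 13, 255, 7])
print(bsq_files[0])   # high-order bits: [1 0 1 0]
```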

An Example of a P-tree (figure): a quadrant-based tree with Pure-1/Pure-0 quadrants in Peano (Z) ordering, showing the root count, the bSQ file, and the same file in 2-D raster order.

P-tree Anatomy (figure): Peano (Z) ordering, Pure-1/Pure-0 quadrants, Root Count, Level, Fan-out, and QID (Quadrant ID), e.g., the quadrant with QID (7, 1) = (111, 001).
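The count tree itself can be sketched as a recursive quadrant split over a 2-D bSQ bit plane. This is a simplified illustration (the class name PCountTree is an assumption); it expects a square plane whose side is a power of two.

```python
import numpy as np

class PCountTree:
    """Peano count tree node: the root count of 1-bits in its quadrant,
    with four children (Peano/Z order: NW, NE, SW, SE) unless the quadrant
    is pure-0, pure-1, or a single pixel."""
    def __init__(self, bits):
        bits = np.asarray(bits, dtype=np.uint8)
        self.count = int(bits.sum())
        if self.count in (0, bits.size) or bits.shape[0] == 1:
            self.children = None                      # pure quadrant: stop here
        else:
            h = bits.shape[0] // 2
            quads = [bits[:h, :h], bits[:h, h:], bits[h:, :h], bits[h:, h:]]
            self.children = [PCountTree(q) for q in quads]

# Root count of an 8x8 bit plane equals the number of 1-bits in it.
plane = np.random.randint(0, 2, (8, 8), dtype=np.uint8)
print(PCountTree(plane).count, int(plane.sum()))
```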

– Peano Mask Tree or PM-tree (3-value logic: pure-1, pure-0, mixed)
– Pure-1 Trees (most compressed, 2-value logic): 1 if the quadrant is pure-1, else 0
– Truth- or Predicate-Trees (2-value logic: 1-bit = true, 0-bit = false): given any condition on a quadrant (e.g., pure-0, mixed), record a 1-bit for the quadrant if the condition is true, else a 0-bit (see the sketch below).
– All are lossless, compressed representations of the dataset.
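A truth/predicate tree can be sketched the same way: evaluate the chosen condition on every quadrant and recurse until quadrants are pure. The encoding below (1/0 leaves, a (bit, children) pair for mixed quadrants) and the name predicate_tree are assumptions for illustration.

```python
import numpy as np

def predicate_tree(bits, pred):
    """Truth/predicate tree: each node stores 1 if pred(quadrant) holds, else 0.
    Recursion stops at pure (all-0 or all-1) quadrants, as in a P-tree."""
    bit = 1 if pred(bits) else 0
    if bits.min() == bits.max():          # pure quadrant, no need to split further
        return bit
    h = bits.shape[0] // 2                # Peano order: NW, NE, SW, SE
    quads = [bits[:h, :h], bits[:h, h:], bits[h:, :h], bits[h:, h:]]
    return (bit, [predicate_tree(q, pred) for q in quads])

pure1 = lambda q: bool(q.all())           # condition for the Pure-1 tree
plane = np.array([[1, 1, 0, 1],
                  [1, 1, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=np.uint8)
print(predicate_tree(plane, pure1))       # (0, [1, (0, [0, 1, 1, 1]), 0, 1])
```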

P-tree Operations (figure): the count tree for the example (root count 55) and the corresponding mask tree, the P-trees for the 1st and 2nd bit positions, their AND and OR results, and the complement (root count 9).

P-tree ANDing Operation (figure): ANDing PM-tree1 and PM-tree2 by following depth-first pure-1 path codes; only paths that are pure-1 in both operands survive into the result. Parallel software implementations on computer clusters are very fast, and hardware implementations are being developed.
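For the AND itself, here is a small sketch over an assumed PM-tree encoding (1 = pure-1, 0 = pure-0, a four-element list = mixed, in Peano order); the rules are the ones implied by the slide: pure-0 dominates, pure-1 acts as the identity, otherwise recurse. Names are illustrative, not the NDSU API.

```python
def pm_and(a, b):
    """AND two PM-trees: pure-0 dominates, pure-1 is the identity,
    otherwise AND the four quadrants and re-collapse if they come out pure-0."""
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    kids = [pm_and(x, y) for x, y in zip(a, b)]
    return 0 if kids == [0, 0, 0, 0] else kids

def pm_complement(a):
    """Complement a PM-tree by flipping pure nodes and recursing into mixed ones."""
    return 1 - a if a in (0, 1) else [pm_complement(x) for x in a]

t1 = [1, [1, 0, 0, 1], 0, 1]           # mixed root with one mixed quadrant
t2 = [1, 1, [0, 1, 1, 0], 1]
print(pm_and(t1, t2))                  # [1, [1, 0, 0, 1], 0, 1]
print(pm_and(t1, pm_complement(t1)))   # 0  (a tree ANDed with its complement)
```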

Various P-trees
– Basic P-trees Pi,j (one per band i and bit position j)
– Value P-trees Pi(v): AND of basic P-trees or their complements
– Tuple P-trees P(v1, v2, …, vn): AND of value P-trees
– Interval P-trees Pi(v1, v2): OR of value P-trees
– Cube P-trees P([v11, v12], …, [vN1, vN2]): AND/OR combinations of interval P-trees
– Predicate P-trees P(p): built with multiway AND, OR, and COMPLEMENT
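To make the construction of a value P-tree concrete, the sketch below works on uncompressed bit vectors (where the P-tree AND reduces to a bitwise AND): Pi(v) is the AND over bit positions of the basic vector or its complement, according to the bits of v. The names and the 8-bit width are assumptions.

```python
import numpy as np

def value_ptree(basic, v, nbits=8):
    """Value P-tree Pi(v) on uncompressed bit vectors: AND the basic bit
    vector for each position j if bit j of v is 1, else its complement."""
    result = np.ones_like(basic[0])
    for j in range(nbits):
        bit = (v >> (nbits - 1 - j)) & 1
        result &= basic[j] if bit else 1 - basic[j]
    return result

band = np.array([200, 13, 255, 7], dtype=np.uint8)
basic = [((band >> (7 - j)) & 1).astype(np.uint8) for j in range(8)]
mask = value_ptree(basic, 13)       # 1 exactly where the band value equals 13
print(mask, int(mask.sum()))        # [0 1 0 0] 1  -> rc{Pi(13)} = 1
```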

Scalability of P-tree Operations (chart): time in ms for software multi-way ANDing versus dataset size in millions of tuples, measured on a Beowulf cluster of 16 dual P2 266 MHz processors with 128 MB RAM.

Properties of P-trees
1. a) b)
2. a) b) c) d)
3. a) b) c) d)
4. rc(P1 | P2) = 0 iff rc(P1) = 0 and rc(P2) = 0
5. v1 ≠ v2 implies rc{Pi(v1) & Pi(v2)} = 0
6. rc(P1 | P2) = rc(P1) + rc(P2) − rc(P1 & P2)
7. rc{Pi(v1) | Pi(v2)} = rc{Pi(v1)} + rc{Pi(v2)}, where v1 ≠ v2
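Properties 4–7 are easy to sanity-check on uncompressed bit masks, where the root count is simply the number of 1-bits; a small sketch under that stand-in (not the P-tree implementation itself):

```python
import numpy as np

rng = np.random.default_rng(0)
rc = lambda p: int(p.sum())                    # root count = number of 1-bits

p1 = rng.integers(0, 2, 64, dtype=np.uint8)    # stand-ins for two P-tree masks
p2 = rng.integers(0, 2, 64, dtype=np.uint8)
# Property 6: inclusion-exclusion on root counts.
assert rc(p1 | p2) == rc(p1) + rc(p2) - rc(p1 & p2)

# Properties 5 and 7: distinct values of one band give disjoint value P-trees.
band = rng.integers(0, 4, 64, dtype=np.uint8)
pv1, pv2 = (band == 1).astype(np.uint8), (band == 2).astype(np.uint8)
assert rc(pv1 & pv2) == 0
assert rc(pv1 | pv2) == rc(pv1) + rc(pv2)
print("properties 5-7 hold on this example")
```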

Notation
– P1 & P2: P1 AND P2
– P1 | P2: P1 OR P2
– P′: COMPLEMENT of P
– Pi,j: basic P-tree for band i, bit j
– Pi(v): value P-tree for value v of band i
– Pi(v1, v2): interval P-tree for interval [v1, v2] of band i
– P0: pure-0 tree, a P-tree whose root node is pure-0
– P1 (pure-1 tree): a P-tree whose root node is pure-1
– rc(P): root count of P-tree P
– N: number of pixels; n: number of bands; m: number of bits

Techniques Using P-trees
– Decision Tree Induction (DTI) classifiers
– Bayesian classifiers
– Association Rule Mining (ARM)
– KNN and closed-KNN classifiers

Techniques Using P-trees: DTI Classifiers
– For large amounts of multimedia data and for data streams, standard DTI is very limited in effectiveness.
– Fast calculation of measurements such as information gain through P-tree ANDing enables the P-tree technology to handle large quantities of data and streaming data (see the sketch below).
– The P-tree based decision tree induction classification method was shown to be significantly faster than existing DTI classification methods.
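To illustrate the "information gain through P-tree ANDing" point, the sketch below computes the gain of one attribute purely from root counts of ANDed masks (uncompressed stand-ins for P-trees); entropy, info_gain, and the toy data are assumptions made for the example.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (bits) of a list of class counts."""
    c = np.asarray([x for x in counts if x > 0], dtype=float)
    p = c / c.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(class_masks, value_masks):
    """Information gain of one attribute from root counts alone:
    each contingency cell is rc{Pi(v) & P(class = c)}, no data scan needed."""
    rc = lambda m: int(m.sum())
    n = sum(rc(c) for c in class_masks)            # classes partition the pixels
    gain = entropy([rc(c) for c in class_masks])   # entropy before the split
    for v in value_masks:
        cell = [rc(v & c) for c in class_masks]
        if sum(cell):
            gain -= (sum(cell) / n) * entropy(cell)
    return gain

cls = np.array([0, 0, 0, 0, 1, 1, 1, 1]); att = np.array([0, 0, 1, 1, 0, 1, 1, 1])
class_masks = [(cls == c).astype(np.uint8) for c in (0, 1)]
value_masks = [(att == v).astype(np.uint8) for v in (0, 1)]
print(round(info_gain(class_masks, value_masks), 3))   # ~0.049
```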

Techniques Using P-trees: Bayesian Classifiers
– Computing conditional probabilities can be prohibitive for many multimedia applications, since the data volume is often large. From the very first paper in the KDD 2002 proceedings: "For massive datasets Bayesian methods still begin by a 'load data into memory' step, make compromising assumptions, or resort to subsampling to skirt the issue."
– Naïve Bayesian classification is used to minimize computational costs, but can give poor results (a compromising assumption!).
– P-tree technology avoids the need for Naïve Bayes or subsampling, since conditional probability values derive directly from ANDing P-trees (see the sketch below).
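The conditional probabilities themselves are single root-count ratios; a hedged sketch using bit-mask stand-ins (all names are illustrative):

```python
import numpy as np

def cond_prob(value_mask, class_mask):
    """P(attribute = v | class = c) = rc{Pi(v) & P(class = c)} / rc{P(class = c)}.
    Joint counts over several attributes are just more ANDs (a tuple P-tree),
    which is why no naive independence assumption or subsampling is needed."""
    return int((value_mask & class_mask).sum()) / int(class_mask.sum())

cls = np.array([0, 0, 1, 1, 1, 1], dtype=np.uint8)
att = np.array([1, 0, 1, 1, 0, 1], dtype=np.uint8)
print(cond_prob((att == 1).astype(np.uint8), (cls == 1).astype(np.uint8)))  # 0.75
```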

Techniques Using P-trees: Association Rule Mining (ARM)
– In most cases, multimedia data sets are too large to be mined in reasonable time using existing algorithms.
– P-tree techniques used in an efficient association rule mining algorithm, P-ARM, have shown significant improvement over FP-growth and Apriori.
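In the same spirit, the support count at the heart of P-ARM is a root count of ANDed item P-trees; a minimal sketch with assumed names and toy transaction masks:

```python
import numpy as np

def support(masks):
    """Support of an itemset = root count of the AND of its items' masks."""
    return int(np.bitwise_and.reduce(masks).sum())

# Toy market-basket data: one bit vector per item (1 = present in transaction).
item = {
    "A": np.array([1, 1, 0, 1, 1], dtype=np.uint8),
    "B": np.array([1, 0, 0, 1, 1], dtype=np.uint8),
    "C": np.array([0, 1, 1, 1, 0], dtype=np.uint8),
}
print(support([item["A"], item["B"]]))              # 3
print(support([item["A"], item["B"], item["C"]]))   # 1
```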

Techniques Using P-trees: KNN and Closed-KNN Classifiers
– KNN classifiers typically have a very high cost associated with rebuilding the classifier when new data arrives (e.g., data streams); constructing the neighborhood is the expensive operation.
– P-tree technology finds closed-KNN neighborhoods quickly.
– Experimental results have shown that P-tree closed-KNN yields higher classification accuracy as well as significantly higher speed.
– Our P-KNN algorithm, combined with genetic algorithms (GAs), earned honorable mention in the 2002 KDD-Cup competition (task 2) and actually won one of the two subproblems (the "broad classification problem"). The KDD-Cup task-2 data was very much multimedia: hierarchical categorical data, undirected graph data, and text data (MEDLINE abstracts).

Closed-KNN (figure): the black dot is the target pixel T.
– For k = 3, when finding the 3rd nearest neighbor, standard KNN arbitrarily selects one point from the boundary as the 3rd neighbor; closed-KNN includes all points on the boundary.
– Closed-KNN yields a surprisingly higher classification accuracy than traditional KNN, and the closed neighborhood is yielded naturally by P-KNN, while traditional KNN requires another full dataset scan to find it. Therefore, P-KNN is both faster and more accurate (see the sketch below).
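A minimal sketch of the closed-neighborhood idea itself (not the P-tree or HOBbit implementation): take the k-th smallest distance and keep every training point within it, so boundary ties are included rather than arbitrarily broken. The max (L-infinity) metric and all names are assumptions.

```python
import numpy as np

def closed_knn_vote(train_X, train_y, x, k):
    """Closed-KNN vote: find the distance of the k-th nearest neighbor, then
    include ALL training points within that distance (the closed neighborhood)."""
    d = np.abs(train_X - x).max(axis=1)      # L-infinity (max) distance
    d_k = np.partition(d, k - 1)[k - 1]      # distance of the k-th neighbor
    neighbors = train_y[d <= d_k]            # boundary ties are all kept
    values, counts = np.unique(neighbors, return_counts=True)
    return values[np.argmax(counts)]         # plurality vote

train_X = np.array([[0, 0], [1, 1], [2, 2], [2, 0], [0, 2]])
train_y = np.array([0, 0, 1, 1, 1])
# Standard 3-NN would arbitrarily drop two of the four tied boundary points;
# closed-KNN keeps all of them and votes over the full closed neighborhood.
print(closed_knn_vote(train_X, train_y, np.array([1, 1]), k=3))   # 1
```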

Performance – Accuracy (chart), 1997 TIFF-Yield dataset: accuracy (%) versus training set size (number of pixels) for KNN-Manhattan (L1), KNN-Euclidean (L2), KNN-Max (L∞), KNN-HOBbit (high-order basic bit), P-tree Perfect Centering (closed-KNN), and P-tree HOBbit (closed-KNN).

Performance – Accuracy (cont.) (chart), 1998 TIFF-Yield dataset: accuracy (%) versus training set size (number of pixels) for the same six methods (KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-HOBbit, P-tree Perfect Centering (closed-KNN), and P-tree HOBbit (closed-KNN)).

Performance – Time (chart), 1997 dataset, both axes in logarithmic scale: per-sample classification time (sec) versus training set size (number of pixels) for KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-HOBbit, P-tree Perfect Centering (closed-KNN), and P-tree HOBbit (closed-KNN).

Performance – Time (cont.) (chart), 1998 dataset, both axes in logarithmic scale: per-sample classification time (sec) versus training set size (number of pixels) for the same six methods.

Conclusion
– One of the major issues in multimedia data mining is the sheer size of the resulting feature space.
– The P-tree, a data-mining-ready structure, addresses this issue and facilitates efficient data mining of streams.
– P-tree methods can be faster and more accurate at the same time.

Questions?
Computer Science Department, North Dakota State University, USA
MDM/KDD 2002 Workshop, Edmonton, Alberta, Canada