The P-tree Structure and its Algebra Qin Ding Maleq Khan Amalendu Roy

The P-tree Structure and its Algebra. Qin Ding, Maleq Khan, Amalendu Roy, William Perrizo. Department of Computer Science, North Dakota State University, USA. (P-tree technology is patented by NDSU.)

Outline: P-tree structure and variations; P-tree algebra; P-tree properties; P-tree performance.

Introduction to P-tree: The Peano Count Tree (P-tree) is a lossless representation of the original spatial data and a data-mining-ready structure.

Spatial Data: Remotely sensed imagery data (satellite images, aerial photography). Ground data (yield production, moisture, nitrate, temperature).

Remotely Sensed Imagery Data (figures: TIFF image; yield map).

Spatial Data Formats. Existing formats: BSQ (Band Sequential), BIL (Band Interleaved by Line), BIP (Band Interleaved by Pixel). New format: bSQ (bit Sequential).

Spatial Data Formats (Cont.) Example: a 2 x 2 image with two bands.
BAND-1:
  254 (1111 1110)   127 (0111 1111)
   14 (0000 1110)   193 (1100 0001)
BAND-2:
   37 (0010 0101)   240 (1111 0000)
  200 (1100 1000)    19 (0001 0011)
BSQ format (2 files):
  Band 1: 254 127 14 193
  Band 2: 37 240 200 19
BIL format (1 file): 254 127 37 240 14 193 200 19
BIP format (1 file): 254 37 127 240 14 200 193 19
bSQ format (16 files), one column per file and one row per pixel:
  B11 B12 B13 B14 B15 B16 B17 B18 B21 B22 B23 B24 B25 B26 B27 B28
   1   1   1   1   1   1   1   0   0   0   1   0   0   1   0   1
   0   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0
   0   0   0   0   1   1   1   0   1   1   0   0   1   0   0   0
   1   1   0   0   0   0   0   1   0   0   0   1   0   0   1   1
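The four layouts of the two-band example can be reproduced with a short Python sketch. The function names are hypothetical and chosen only for illustration; the pixel values are the ones listed in the BSQ files of the example, and bit position 1 is assumed to be the most significant bit.

```python
# Pixel values of the 2 x 2, two-band example, in raster order.
band1 = [254, 127, 14, 193]
band2 = [37, 240, 200, 19]
bands = [band1, band2]
width = 2  # pixels per image line


def to_bsq(bands):
    """BSQ: one file per band, each holding that band's pixels in raster order."""
    return [list(b) for b in bands]


def to_bil(bands, width):
    """BIL: for each image line, emit that line from every band in turn."""
    out = []
    for start in range(0, len(bands[0]), width):
        for b in bands:
            out.extend(b[start:start + width])
    return out


def to_bip(bands):
    """BIP: for each pixel, emit its value in every band."""
    return [v for pixel in zip(*bands) for v in pixel]


def to_bit_sequential(bands, nbits=8):
    """bSQ: one bit file per (band, bit position).

    Key (i, j) is band i, bit j, with j = 1 the most significant bit
    (an assumption that matches the B11..B28 columns of the example).
    """
    return {
        (i + 1, j + 1): [(v >> (nbits - 1 - j)) & 1 for v in b]
        for i, b in enumerate(bands)
        for j in range(nbits)
    }


bil = to_bil(bands, width)
bip = to_bip(bands)
bit_files = to_bit_sequential(bands)
```

Here `bit_files[(1, 1)]` corresponds to file B11 and `bit_files[(2, 8)]` to file B28.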

bSQ Format: Split each band into eight separate files, one for each bit position. Reasons for using the bSQ format: Different bits contribute differently to the value. The bSQ format facilitates the representation of a precision hierarchy (from 1-bit up to 8-bit precision). The bSQ format facilitates the creation of an efficient data structure, the P-tree.

Peano Count Tree (P-tree): A P-tree represents spatial data bit by bit in a recursive quadrant-by-quadrant arrangement. A P-tree is a lossless, compressed representation of the original data.

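An example of a P-tree. A bSQ bit file is arranged in 2-D space in raster order and then divided into quadrants in Peano (Z) order:
  1 1 1 1 1 1 0 0
  1 1 1 1 1 0 0 0
  1 1 1 1 1 1 0 0
  1 1 1 1 1 1 1 0
  1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1
  0 1 1 1 1 1 1 1
Resulting P-tree (counts of 1 bits per quadrant):
  Level 0 (root count): 55
  Level 1: 16 8 15 16
  Level 2: under 8: 3 0 4 1; under 15: 4 4 3 4
Terms illustrated: Peano or Z-ordering; pure (pure-1/pure-0) quadrant; root count; level; fan-out; QID (quadrant ID).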
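A minimal sketch of how such a count tree could be built recursively is given below. The `(count, children)` node layout and the name `build_ptree` are assumptions made for illustration, not the authors' implementation, and the 8 x 8 bit file is reconstructed from the counts and leaf patterns shown in the slides.

```python
# 8 x 8 bit file of the example (reconstructed from the slide's counts; an assumption).
BITS = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]


def build_ptree(bits, row=0, col=0, size=None):
    """Return a (count, children) node for the size x size quadrant at (row, col).

    children is None for a pure quadrant or a single pixel; otherwise it is a
    list of four nodes in Peano order: upper-left, upper-right, lower-left,
    lower-right.
    """
    if size is None:
        size = len(bits)
    if size == 1:
        return (bits[row][col], None)
    half = size // 2
    kids = [
        build_ptree(bits, row + dr, col + dc, half)
        for dr, dc in ((0, 0), (0, half), (half, 0), (half, half))
    ]
    count = sum(k[0] for k in kids)
    if count == 0 or count == size * size:
        return (count, None)  # pure-0 or pure-1: children need not be stored
    return (count, kids)


root = build_ptree(BITS)
level1 = [k[0] for k in root[1]]
```

Pure quadrants are stored as a single count, which is where the compression comes from.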

An example of a P-tree (Cont.): quadrant IDs. Quadrants at each level are numbered 0 to 3 in Peano (Z) order, and the QID of a pixel lists one quadrant number per level. For the pixel at (7, 1), written in binary as (111, 001), interleaving the bits of the two coordinates gives the QID 10.10.11, i.e. 2.2.3. Terms illustrated: Peano or Z-ordering; pure (pure-1/pure-0) quadrant; root count; level; fan-out; QID (quadrant ID).
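The bit interleaving behind the QID can be sketched as follows. The function name `quadrant_id` is hypothetical, and the sketch assumes the first coordinate contributes the high bit of each quadrant digit, which is what the (7, 1) example implies.

```python
def quadrant_id(first, second, nbits):
    """Interleave coordinate bits, most significant first.

    Each level contributes one digit 0..3: the bit of the first coordinate
    is the high bit of the digit, the bit of the second coordinate the low bit.
    """
    digits = []
    for k in range(nbits - 1, -1, -1):
        hi = (first >> k) & 1
        lo = (second >> k) & 1
        digits.append(2 * hi + lo)
    return digits


qid = quadrant_id(7, 1, 3)
qid_decimal = ".".join(str(d) for d in qid)
qid_binary = ".".join(format(d, "02b") for d in qid)
```

For the example pixel, `qid_decimal` is `2.2.3` and `qid_binary` is `10.10.11`.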

P-tree variation: PM-tree. The Peano Mask tree (PM-tree) uses a mask instead of a count: 1 denotes pure-1, 0 denotes pure-0, and m denotes mixed. It provides an efficient way for ANDing. PM-tree of the example bit file:
  Level 0: m
  Level 1: 1 m m 1
  Level 2: under the first m: m 0 1 m; under the second m: 1 1 m 1
  Leaves (under the three level-2 m nodes): 1110 0010 1101

P-tree variations: P1-tree and P0-tree. In the P1-tree, 1 indicates a pure-1 quadrant and 0 indicates any other quadrant. In the P0-tree, 1 indicates a pure-0 quadrant and 0 indicates any other quadrant. (Figure: P1-tree and P0-tree of the example bit file; both roots are 0.)

P-tree Algebra: AND, OR, Complement, Other (XOR, etc.). In the complement, each count is replaced by the quadrant size minus the count, so the root count becomes 64 - 55 = 9.
P-tree:
  Level 0: 55
  Level 1: 16 8 15 16
  Level 2: under 8: 3 0 4 1; under 15: 4 4 3 4
  Leaves: 1110 0010 1101
Complement:
  Level 0: 9
  Level 1: 0 8 1 0
  Level 2: under 8: 1 4 0 3; under 1: 0 0 1 0
  Leaves: 0001 1101 0010
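The complement of a count tree can be sketched in a few lines of Python. The `(count, children)` node layout and the helper names are assumptions for illustration; the tree literal encodes the counts and leaf bits of the example.

```python
def leaf(bits):
    """Four single-pixel nodes for a mixed 2 x 2 quadrant."""
    return [(b, None) for b in bits]


# (count, children); children is None for a pure quadrant or a single pixel.
PTREE = (55, [
    (16, None),
    (8, [(3, leaf([1, 1, 1, 0])), (0, None), (4, None), (1, leaf([0, 0, 1, 0]))]),
    (15, [(4, None), (4, None), (3, leaf([1, 1, 0, 1])), (4, None)]),
    (16, None),
])


def complement(node, size):
    """Replace every count by (pixels covered - count); size is pixels covered."""
    count, kids = node
    new_kids = None if kids is None else [complement(k, size // 4) for k in kids]
    return (size - count, new_kids)


comp = complement(PTREE, 64)
```

The tree shape is unchanged; only the counts flip, so a pure-1 quadrant of 16 becomes a pure-0 quadrant of 0.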

P-tree Algebra (Cont.)
P-tree-1:
  Level 0: m
  Level 1: 1 m m 1
  Level 2: under the first m: m 0 1 m; under the second m: 1 1 m 1
  Leaves: 1110 0010 1101
P-tree-2:
  Level 0: m
  Level 1: 1 0 m 0
  Level 2: under m: 1 1 1 m
  Leaves: 0100
AND-Result:
  Level 0: m
  Level 1: 1 0 m 0
  Level 2: under m: 1 1 m m
  Leaves: 1101 0100
OR-Result:
  Level 0: m
  Level 1: 1 m 1 1
  Level 2: under m: m 0 1 m
  Leaves: 1110 0010

P-tree ANDing Operation
  Operand 1 (quadrant)   Operand 2 (quadrant)   Result (quadrant)
  1                      X2                     X2
  0                      X2                     0
  X1                     1                      X1
  X1                     0                      0
  m                      m                      0 if all four sub-quadrants result in 0; otherwise m
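These rules translate into a short recursive function. The sketch below assumes a hypothetical encoding in which a PM-tree node is `1` (pure-1), `0` (pure-0), or a list of four child nodes (mixed); the two operand trees are the ones from the algebra example.

```python
def pm_and(a, b):
    """AND two PM-tree nodes following the quadrant rules."""
    if a == 1:
        return b            # 1 AND X2 = X2
    if b == 1:
        return a            # X1 AND 1 = X1
    if a == 0 or b == 0:
        return 0            # anything AND 0 = 0
    # Both mixed: AND the four sub-quadrants pairwise.
    kids = [pm_and(x, y) for x, y in zip(a, b)]
    return 0 if all(k == 0 for k in kids) else kids


# Mixed nodes are lists of four children in Peano order; leaves are bit lists.
T1 = [1,
      [[1, 1, 1, 0], 0, 1, [0, 0, 1, 0]],
      [1, 1, [1, 1, 0, 1], 1],
      1]
T2 = [1,
      0,
      [1, 1, 1, [0, 1, 0, 0]],
      0]

RESULT = pm_and(T1, T2)
```

Whole subtrees are skipped whenever one operand is pure, which is why the mask form is convenient for ANDing.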

P-tree ANDing Operation (Cont.)
PM-tree1:
  Level 0: m
  Level 1: 1 m m 1
  Level 2: under the first m: m 0 1 m; under the second m: 1 1 m 1
  Leaves: 1110 0010 1101
PM-tree2:
  Level 0: m
  Level 1: 1 0 m 0
  Level 2: under m: 1 1 1 m
  Leaves: 0100
Result:
  Level 0: m
  Level 1: 1 0 m 0
  Level 2: under m: 1 1 m m
  Leaves: 1101 0100
Depth-first pure-1 path codes:
  PM-tree1: 0 100 101 102 12 132 20 21 220 221 223 23 3
  PM-tree2: 0 20 21 22 231
ANDing the path codes:
  0 & 0 -> 0
  20 & 20 -> 20
  21 & 21 -> 21
  220 221 223 & 22 -> 220 221 223
  23 & 231 -> 231
  RESULT: 0 20 21 220 221 223 231
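ANDing on pure-1 path codes amounts to prefix matching: when one code is a prefix of another, the longer (more specific) code survives. The sketch below uses a simple all-pairs comparison for clarity; it is not the authors' merge procedure, and `and_paths` is a hypothetical name.

```python
def and_paths(paths1, paths2):
    """AND two lists of pure-1 path codes (strings of quadrant digits)."""
    result = set()
    for a in paths1:
        for b in paths2:
            if a.startswith(b):      # b's quadrant contains a's quadrant
                result.add(a)
            elif b.startswith(a):    # a's quadrant contains b's quadrant
                result.add(b)
    return sorted(result)


P1 = "0 100 101 102 12 132 20 21 220 221 223 23 3".split()
P2 = "0 20 21 22 231".split()
ANDED = and_paths(P1, P2)
```

Because both inputs are already in depth-first order, a single merge pass over the two lists would avoid the quadratic comparison; the all-pairs loop is kept here only because it is easier to read.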

Value P-tree: The value P-tree Pi(v) is the P-tree of value v in band i. Value v can be expressed in 1-bit to 8-bit precision. Value P-trees can be constructed by ANDing basic P-trees or their complements, e.g. Pi(110) = Pi,1 AND Pi,2 AND Pi,3'.
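The construction can be illustrated on uncompressed bit vectors, which stand in for the basic P-trees here; a real P-tree would store the same bits in compressed quadrant form. The helper names are hypothetical, the pixel values are Band 1 of the earlier 2 x 2 example, and bit 1 is assumed to be the most significant bit.

```python
band1 = [254, 127, 14, 193]  # Band 1 of the 2 x 2 example


def basic(band, j, nbits=8):
    """Bit j (1 = most significant) of every pixel: a stand-in for P_{i,j}."""
    return [(v >> (nbits - j)) & 1 for v in band]


def p_and(a, b):
    return [x & y for x, y in zip(a, b)]


def p_not(a):
    return [1 - x for x in a]


def value_ptree(band, bits):
    """AND P_{i,j} where bit j of the value is 1 and its complement where it is 0.

    bits is a string such as '110' (3-bit precision).
    """
    result = [1] * len(band)
    for j, bit in enumerate(bits, start=1):
        plane = basic(band, j)
        result = p_and(result, plane if bit == "1" else p_not(plane))
    return result


p_110 = value_ptree(band1, "110")
root_count = sum(p_110)
```

Only the pixel with value 193 (1100 0001) starts with the bits 110, so the mask selects that pixel alone.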

Tuple P-tree: The tuple P-tree P(v1, v2, …, vn) is the P-tree of the pixels that have value vi in band i, for all i from 1 to n: P(v1, v2, …, vn) = P1(v1) AND P2(v2) AND … AND Pn(vn). If value vj is not given, it can be any value in band j; the tuple P-tree is then written P(v1, v2, …, vj-1, , vj+1, …, vn), and the jth AND operand is simply omitted.
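A small sketch of the tuple construction, again on uncompressed masks rather than real P-trees. The helper names are hypothetical, `None` is used here to mark a value that is not given, and the data are the two bands of the earlier 2 x 2 example at 3-bit precision.

```python
band1 = [254, 127, 14, 193]
band2 = [37, 240, 200, 19]


def value_mask(band, v, precision, nbits=8):
    """1 where the top `precision` bits of a pixel equal v: a stand-in for P_i(v)."""
    return [int((p >> (nbits - precision)) == v) for p in band]


def tuple_mask(bands, values, precision):
    """AND the value masks of all bands; a value of None means 'any value',
    so that band's operand is simply skipped."""
    result = [1] * len(bands[0])
    for band, v in zip(bands, values):
        if v is not None:
            mask = value_mask(band, v, precision)
            result = [r & m for r, m in zip(result, mask)]
    return result


both = tuple_mask([band1, band2], [0b110, 0b000], 3)
any_band1 = tuple_mask([band1, band2], [None, 0b111], 3)
```

With 3-bit precision the pixels are (7, 1), (3, 7), (0, 6), and (6, 0), so `both` selects the last pixel and `any_band1` selects the second.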

Interval P-tree: An interval P-tree Pi(v1, v2) is the P-tree for values in the interval [v1, v2] of band i. We have Pi(v1, v2) = OR Pi(v), for all v in [v1, v2].
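The interval definition can be sketched by ORing value masks. As before, uncompressed bit lists stand in for P-trees, the helper names are hypothetical, and the data are Band 1 of the 2 x 2 example at 3-bit precision.

```python
band1 = [254, 127, 14, 193]


def value_mask(band, v, precision, nbits=8):
    """1 where the top `precision` bits of a pixel equal v: a stand-in for P_i(v)."""
    return [int((p >> (nbits - precision)) == v) for p in band]


def interval_mask(band, lo, hi, precision):
    """OR of the value masks for every v in [lo, hi]."""
    result = [0] * len(band)
    for v in range(lo, hi + 1):
        mask = value_mask(band, v, precision)
        result = [r | m for r, m in zip(result, mask)]
    return result


p_3_to_6 = interval_mask(band1, 3, 6, 3)
```

At 3-bit precision the Band 1 pixels have values 7, 3, 0, and 6, so the interval [3, 6] selects the second and fourth pixels.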

Peano Cube (P-cube): The (v1, v2, v3)th cell of the P-cube contains P(v1, v2, v3) = P1,v1 AND P2,v2 AND P3,v3, where, e.g., Pi,vi = Pi,110 = Pi,1 AND Pi,2 AND P'i,3. (The P-cube in the figure shows just the root counts of the P-trees.) The P-cube can be rolled up (shown on the left of the figure), sliced, diced, and so on. The characteristic function applied to the NPZ truth tree is a bit-map index for each attribute.

P-tree Properties. Here rc(P) denotes the root count of P-tree P. Lemma 1: For any two P-trees P1 and P2, rc(P1 | P2) = 0 ⇒ rc(P1) = 0 and rc(P2) = 0. More strictly, rc(P1 | P2) = 0 if and only if rc(P1) = 0 and rc(P2) = 0.

P-tree Properties (Cont.) Lemma 2: a) rc(P1) = 0 or rc(P2) = 0 ⇒ rc(P1 & P2) = 0. b) rc(P1) = 0 and rc(P2) = 0 ⇒ rc(P1 & P2) = 0. c) rc( ) = 0 d) rc( ) = N e) f) g) h) i) j)

P-tree Properties (Cont.) Lemma 3: v1 ≠ v2 ⇒ rc{Pi(v1) & Pi(v2)} = 0, for any band i. Lemma 4: rc(P1 | P2) = rc(P1) + rc(P2) - rc(P1 & P2). Theorem: rc{Pi(v1) | Pi(v2)} = rc{Pi(v1)} + rc{Pi(v2)}, where v1 ≠ v2.
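The theorem follows from Lemma 4 once Lemma 3 sets the intersection term to zero. A quick numerical check on plain bit masks, used here as a stand-in for P-trees, illustrates both statements; the masks and the integer encoding are arbitrary choices made for this sketch.

```python
def rc(mask):
    """Root count: the number of 1 bits in the mask."""
    return bin(mask).count("1")


# Two arbitrary overlapping masks over 8 pixels (Lemma 4).
p1 = 0b11110000
p2 = 0b00111100
lemma4_lhs = rc(p1 | p2)
lemma4_rhs = rc(p1) + rc(p2) - rc(p1 & p2)

# Two masks for different values of the same band can never overlap,
# because a pixel holds only one value (Lemma 3), so the last term vanishes.
pv1 = 0b10010000
pv2 = 0b00000110
theorem_lhs = rc(pv1 | pv2)
theorem_rhs = rc(pv1) + rc(pv2)
```

The practical consequence is that the root count of an OR over distinct values of one band is obtained by adding the individual root counts, without performing the OR.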

P-tree Performance Comparison of file size for different bits of Band 1 & 2 of a TIFF image

P-tree Performance (Cont.) Comparison of file size for different bits of Band 3 & 4 of a SPOT image

P-tree Performance (Cont.) Comparison of file size for different bits of Band 5 & 6 of a TM image

P-tree Performance (Cont.) (Figure: time vs. lowest bit number for PC-tree, PMT, and Peano sequence.) Time required to perform the ANDing operation on a TM file (40 million pixels).

P-tree Performance (Cont.) Average time required to perform ANDing operation on a TM file (40 million pixels)

Related Work. Related structures: quadtrees and their variants (point quadtrees, region quadtrees); HH-codes. Similarities: quadrant based. Differences: P-trees focus on the count. P-trees are not indexes; rather, they are representations of the datasets themselves. P-trees are particularly useful for data mining because they contain the aggregate information needed for data mining.

Conclusion: P-tree algebra and properties. P-trees for efficient data mining: association rule mining, classification, clustering. P-tree applications, from spatial to non-spatial data: precision agriculture, DNA microarray data, VLSI test data analysis, stock market data, imagery.