Decision Tree Classification of Spatial Data Streams Using Peano Count Trees
Qiang Ding, Qin Ding*, William Perrizo
Department of Computer Science, North Dakota State University, USA
(P-tree technology is patented by NDSU)

Decision Tree
A flow-chart-like hierarchical tree structure:
- Root: represents the entire dataset
- Internal node (a node with children): denotes a test on one test attribute
- Branch: represents an outcome of the test
- Leaf node (a node without children): represents a class label or class distribution

Data Stream
- Large quantities of data
- Open-ended (data continues to arrive)
- Updated periodically
- Need to mine in real time or near real time
Examples: spatial data streams, e.g., AVHRR imagery (used, among other things, for forest fire detection); stock market data

Building Decision Tree Classifiers
Classifiers are built from training data (already classified). Once the classifier is built (trained), it is applied to new, unclassified samples to predict their class.
Basic algorithm (a greedy algorithm):
- The tree is constructed top-down, in a recursive divide-and-conquer manner.
- At the start, all training samples are at the root.
- The training sample set is partitioned recursively based on a selected attribute (the test attribute) at each internal node.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

E.g., Binary Tree Building Algorithm
Input: training dataset S, split-selection method
Output: decision tree
Top-Down Decision Tree Induction Scheme (binary splits):
BuildTree(S, split-selection-method)
(1) If all points in S are in the same class, then return;
(2) Use the split-selection method to evaluate splits for each attribute;
(3) Use the best split found to partition S into S1 and S2;
(4) BuildTree(S1, split-selection-method);
(5) BuildTree(S2, split-selection-method).
Build the whole tree by calling BuildTree(TrainingData, split-selection-method).
Splits are not always binary (a split may use the entire decision-attribute value set), and the split-selection method can be information gain, gain ratio, gini index, or another measure. A runnable sketch of this scheme appears below.
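The following Python sketch mirrors the scheme above for binary splits. The Node class and the find_best_split helper (which returns an (attribute, value) pair, or None when no useful split exists) are illustrative assumptions, not part of the paper.

from collections import Counter

class Node:
    def __init__(self, label=None, split=None, left=None, right=None):
        self.label = label      # class label for a leaf node
        self.split = split      # (attribute, value) test for an internal node
        self.left = left        # subtree where the test holds
        self.right = right      # subtree where the test fails

def build_tree(samples, find_best_split):
    """samples: list of (attribute_dict, class_label) pairs."""
    labels = [c for _, c in samples]
    if len(set(labels)) == 1:                       # (1) all points in one class
        return Node(label=labels[0])
    split = find_best_split(samples)                # (2) evaluate candidate splits
    if split is None:                               # no useful split: majority-class leaf
        return Node(label=Counter(labels).most_common(1)[0][0])
    attr, value = split
    s1 = [s for s in samples if s[0][attr] == value]    # (3) partition S into S1 and S2
    s2 = [s for s in samples if s[0][attr] != value]
    return Node(split=split,
                left=build_tree(s1, find_best_split),   # (4)
                right=build_tree(s2, find_best_split))  # (5)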

Information Gain (ID3)
T = the training set; S = any set of samples.
freq(Ci, S) = the number of samples in S that belong to class Ci; |S| = the number of samples in S.
The information (entropy) of set S is defined as
  Info(S) = - Σ_i [ freq(Ci, S) / |S| ] * log2( freq(Ci, S) / |S| ).
Now consider the same measurement after T has been partitioned into subsets {T1, ..., Tn} in accordance with the n outcomes of a test X (e.g., Ai-value > ai). The expected information requirement is the weighted sum over the subsets:
  E(X) = Σ_j ( |Tj| / |T| ) * Info(Tj).
The quantity Gain(X) = Info(T) - E(X) measures the information gained by partitioning T in accordance with test X. The gain criterion selects the test that maximizes this gain.
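A small worked example (not from the paper, just standard ID3 arithmetic): if S contains 9 samples of class C1 and 5 of class C2, then Info(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940 bits. If a test X splits S into a pure subset of 4 samples and a subset of 10 samples containing 5 of each class, then E(X) = (4/14)*0 + (10/14)*1 ≈ 0.714, so Gain(X) ≈ 0.940 - 0.714 = 0.226 bits.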

Background on Spatial Data
- Band corresponds to an attribute; pixel corresponds to a transaction (tuple); a value is usually 0-255 (one byte).
- Different images have different numbers of bands:
  TM4/5: 7 bands (B, G, R, NIR, MIR, TIR, MIR2)
  TM7: 8 bands (B, G, R, NIR, MIR, TIR, MIR2, PC)
  TIFF: 3 bands (B, G, R)
  Ground data: individual bands (Yield, Soil-Moisture, Nitrate, Temperature)

Spatial Data Formats
Existing formats:
- BSQ (Band Sequential)
- BIL (Band Interleaved by Line)
- BIP (Band Interleaved by Pixel)
New format:
- bSQ (bit Sequential)

Spatial Data Formats (Cont.)
Example: a 2x2 image with two bands.
BAND-1: 254 (1111 1110), 127 (0111 1111), 14 (0000 1110), 193 (1100 0001)
BAND-2: 37 (0010 0101), 240 (1111 0000), 200 (1100 1000), 19 (0001 0011)
BSQ format (2 files):
  Band 1: 254 127 14 193
  Band 2: 37 240 200 19
BIL format (1 file): 254 127 37 240 14 193 200 19
BIP format (1 file): 254 37 127 240 14 200 193 19
bSQ format (16 files): B11 B12 ... B18, B21 B22 ... B28, one file per band and bit position. Listing each pixel's 16 bits under these files:
  Pixel 1: 1 1 1 1 1 1 1 0 | 0 0 1 0 0 1 0 1
  Pixel 2: 0 1 1 1 1 1 1 1 | 1 1 1 1 0 0 0 0
  Pixel 3: 0 0 0 0 1 1 1 0 | 1 1 0 0 1 0 0 0
  Pixel 4: 1 1 0 0 0 0 0 1 | 0 0 0 1 0 0 1 1
For example, file B11 holds the bits 1 0 0 1 (the most significant Band-1 bit of each pixel).

bSQ Format and P-trees
The bSQ format separates each band into 8 files, one per bit slice.
Reasons for using the bSQ format:
- Different bits contribute to the value differently.
- bSQ facilitates the representation of a precision hierarchy (from 1-bit up to 8-bit precision).
- bSQ facilitates the creation of an efficient, data-mining-ready P-tree data structure, a P-tree algebra, and a data cube of P-tree root counts.
P-trees (Peano trees) represent bit slices, band values and tuples in a recursive quadrant-by-quadrant, compressed but lossless representation of the original data. A sketch of the bit-slicing step follows.
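As a small illustration of the bSQ idea (an assumed sketch, not code from the paper), a band of 8-bit values can be split into its eight bit-slice files as follows; numpy is used for the array arithmetic.

import numpy as np

def to_bit_slices(band):
    """Split a 2-D array of 8-bit band values into 8 bit-slice arrays
    (index 0 = most significant bit, index 7 = least significant bit)."""
    band = np.asarray(band, dtype=np.uint8)
    return [((band >> (7 - i)) & 1).astype(np.uint8) for i in range(8)]

# The 2x2 Band-1 example above:
band1 = [[254, 127],
         [14, 193]]
slices = to_bit_slices(band1)
print(slices[0])   # B11: [[1 0]
                   #       [0 1]]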

An example of a P-tree
One bit slice of an 8x8 group of pixels, arranged in 2-D space in raster order, is traversed in Peano (Z) order and summarized quadrant by quadrant. For the example bit slice shown on the slide, the root count is 55 (the number of 1-bits in the whole 8x8 group), the four level-1 quadrant counts are 16, 8, 15 and 16, and the mixed quadrants are expanded further (counts 3 0 4 1 and 4 4 3 4) down to pure (pure-1 or pure-0) quadrants, which are not expanded.
Key terms: root count, level, fan-out, QID (quadrant ID), pure (pure-1/pure-0) quadrant.
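A minimal sketch of building such a count tree from one bit slice, assuming a square 2^k x 2^k array and quadrant fan-out 4; the Ptree class and its fields are illustrative assumptions, not the paper's implementation.

import numpy as np

class Ptree:
    def __init__(self, count, children=None):
        self.count = count          # number of 1-bits in this quadrant
        self.children = children    # None for a pure quadrant, else 4 subtrees

def build_ptree(bits):
    """bits: a 2^k x 2^k array of 0/1 values for one bit slice."""
    bits = np.asarray(bits)
    n = bits.shape[0]
    count = int(bits.sum())
    if count == 0 or count == n * n or n == 1:      # pure-0, pure-1, or a single bit
        return Ptree(count)
    h = n // 2                                       # split into the 4 quadrants (Peano/Z order)
    quads = [bits[:h, :h], bits[:h, h:], bits[h:, :h], bits[h:, h:]]
    return Ptree(count, [build_ptree(q) for q in quads])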

An example of a P-tree (Cont.)
Each quadrant is identified by a quadrant ID (QID) formed from the interleaved row and column bits along the Peano (Z) ordering. For example, the pixel at row 7, column 1 of the 8x8 group has binary coordinates (111, 001), which interleave to 10.10.11, i.e., the QID path 2.2.3.

P-tree variation – PM-tree
The Peano Mask tree (PM-tree) stores a mask instead of a count at each node: 1 denotes a pure-1 quadrant, 0 denotes a pure-0 quadrant, and m denotes a mixed quadrant (3-value logic). It provides an efficient representation for ANDing P-trees.

Ptree Algebra
Operations: AND, OR, complement, and others (XOR, etc.). Writing each node's four children in Peano order inside parentheses, and the 4-bit leaf masks in brackets:
  PCT (count tree):       55 ( 16, 8 ( 3, 0, 4, 1 ), 15 ( 4, 4, 3, 4 ), 16 )
  PMT1 (mask tree):       m ( 1, m ( m[1110], 0, 1, m[0010] ), m ( 1, 1, m[1101], 1 ), 1 )
  Complement of the PCT:  9 ( 0, 8 ( 1, 4, 0, 3 ), 1 ( 0, 0, 1, 0 ), 0 ), with leaf masks complemented bitwise: 1110 -> 0001, 0010 -> 1101, 1101 -> 0010
  A second operand, PMT2: m ( 1, 0, m ( 1, 1, 1, m[0100] ), 0 )
  PMT1 AND PMT2:          m ( 1, 0, m ( 1, 1, m[1101], m[0100] ), 0 )
  PMT1 OR PMT2:           m ( 1, m ( m[1110], 0, 1, m[0010] ), 1, 1 )
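The 3-value AND can be sketched as follows, assuming a PM-tree node is either the string '1' (pure-1), the string '0' (pure-0), or a list of four child nodes in Peano order; this is an illustration, not the paper's implementation. Applied to PMT1 and PMT2 above (with the 4-bit leaf masks written as lists of four '0'/'1' strings), it reproduces the AND result shown.

def pm_and(a, b):
    """AND of two PM-trees. A node is '1' (pure-1), '0' (pure-0),
    or a list of four child nodes in Peano order."""
    if a == '0' or b == '0':
        return '0'
    if a == '1':
        return b
    if b == '1':
        return a
    children = [pm_and(x, y) for x, y in zip(a, b)]
    if all(c == '0' for c in children):     # collapse back to a pure-0 node
        return '0'
    return children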

Peano Cube (P-cube)
The (v1, v2, v3)th cell of the P-cube contains P(v1, v2, v3) = P1,v1 AND P2,v2 AND P3,v3, where each value P-tree is itself an AND of bit P-trees, e.g., Pi,110 = Pi,1 AND Pi,2 AND P'i,3 (P'i,3 being the complement of Pi,3). The P-cube shown holds just the root counts of these P-trees. A P-cube can be rolled up, sliced, diced, etc.
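As a quick, assumption-laden illustration of the value P-tree idea (not the paper's code): the root count of Pi,110 is simply the number of pixels whose band-i value has bit pattern 110 in its top three bits, which can be computed from the bit slices by ANDing the appropriate slices or their complements.

import numpy as np

def value_ptree_root_count(bit_slices, pattern):
    """bit_slices: list of 2-D 0/1 arrays for one band, most significant bit first.
    pattern: bit string such as '110'; returns the number of pixels whose
    leading bits match the pattern (the root count of the value P-tree)."""
    mask = np.ones_like(np.asarray(bit_slices[0]), dtype=bool)
    for slice_, bit in zip(bit_slices, pattern):
        s = np.asarray(slice_)
        mask &= (s == 1) if bit == '1' else (s == 0)   # complement slice for a 0 bit
    return int(mask.sum())

# e.g., with the to_bit_slices() sketch from earlier:
# value_ptree_root_count(to_bit_slices(band1), '110')   # -> 1 (only 193 has top bits 110)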

Data Smoothing Using P-trees
Bottom-up purity shifting: replace quadrant counts of 3 with 4 and counts of 1 with 0. The data is smoothed and the P-tree becomes more compressed.

Decision Tree Induction Using P-trees
Basic ideas:
- Calculate information gain (or gain ratio, gini index, etc.) using the count information recorded in the P-trees.
- P-tree generation replaces sub-sample-set creation; it is done once for the life of the dataset, so its cost is amortized over the life of the data.
- P-trees also determine whether all samples at a node are in the same class, without an additional database scan.

Some Notation and Definitions
Store the decision path for each node. For example, the decision path for node N09 is (Band 2, value 0011), (Band 3, value 1000); the decision path for the ROOT is empty. The class attribute is B[0].
Given node N's decision path (B[1], v[1]), (B[2], v[2]), ..., (B[t], v[t]), the P-tree of the dataset represented by node N is
  P_Set(N) = P_B[1](v[1]) AND P_B[2](v[2]) AND ... AND P_B[t](v[t]).
For the ROOT node R, where T is the whole training set, P_T is the full (unit) P-tree.
(The slide's diagram shows the example tree with nodes N01 through N14: the root tests B2 with branches 0010, 0011, 0111, 1010 and 1011; it matches the final decision tree given in the Phase 2 example below.)

Using P-trees to Calculate Information Gain
Let P be node N's P-tree and RC the root-count function. Node N's information is
  I = - Σ_i p_i * log2(p_i), where p_i = RC(P AND P_B,v[i]) / RC(P),
B is the class (decision) attribute and v[1], ..., v[n] are the class values occurring at node N.
The information gain of attribute A at node N is Gain(A) = I - E(A), where the entropy of splitting on A is
  E(A) = Σ_j [ RC(P AND P_A,vA[j]) / RC(P) ] * I_j,
I_j being the information of the node obtained by restricting N to A = vA[j], and vA[1], ..., vA[m] the possible A values at node N. A sketch of this computation appears below.
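A minimal sketch of this computation (illustrative, not the paper's code): here each P-tree is modeled as a set of pixel ids, set intersection stands in for the P-tree AND, and len() stands in for the root count RC.

from math import log2

def info(counts):
    """I = -sum p_i log2 p_i over the nonzero class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def gain(node_pixels, class_value_sets, attr_value_sets):
    """node_pixels: set of pixel ids at node N (its P-tree, modeled as a set).
    class_value_sets: {class value: pixel-id set}  (value P-trees of the class band)
    attr_value_sets:  {A value: pixel-id set}      (value P-trees of candidate band A)."""
    def node_info(pixels):
        return info([len(pixels & s) for s in class_value_sets.values()])
    e_attr = 0.0
    for s in attr_value_sets.values():
        subset = node_pixels & s                      # AND of P-trees
        if subset:
            e_attr += (len(subset) / len(node_pixels)) * node_info(subset)
    return node_info(node_pixels) - e_attr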

Example – Phase 1
BSQ training set (row,col | B1 | B2 | B3 | B4):
0,0 | 0011 | 0111 | 1000 | 1011
0,1 | 0011 | 0011 | 1000 | 1111
0,2 | 0111 | 0011 | 0100 | 1011
0,3 | 0111 | 0010 | 0101 | 1011
1,0 | 0011 | 0111 | 1000 | 1011
1,1 | 0011 | 0011 | 1000 | 1011
1,2 | 0111 | 0011 | 0100 | 1011
1,3 | 0111 | 0010 | 0101 | 1011
2,0 | 0010 | 1011 | 1000 | 1111
2,1 | 0010 | 1011 | 1000 | 1111
2,2 | 1010 | 1010 | 0100 | 1011
2,3 | 1111 | 1010 | 0100 | 1011
3,0 | 0010 | 1011 | 1000 | 1111
3,1 | 1010 | 1011 | 1000 | 1111
3,2 | 1111 | 1010 | 0100 | 1011
3,3 | 1111 | 1010 | 0100 | 1011
bSQ files for band 1 (one column per bit file, rows in raster order):
B11  B12  B13  B14
0000 0011 1111 1111
0000 0011 1111 1111
0011 0001 1111 0001
0111 0011 1111 0011
(B21-B24, B31-B34 and B41-B44 are formed the same way from bands 2, 3 and 4.)
Basic P-trees: P11, P12, P13, P14; P21, P22, P23, P24; P31, P32, P33, P34; P41, P42, P43, P44; and their complements P11', ..., P44'.
Value P-trees: P1,0000 through P1,1111; P2,0000 through P2,1111; P3,0000 through P3,1111; P4,0000 through P4,1111 (one P-tree per band value).

Example – Phase 2
Initially the decision tree is a single node representing the entire training set.
Start with A = B2: calculate I, E and Gain(B2) using the root counts of the value P-trees; likewise calculate Gain(B3) and Gain(B4).
Select the best split attribute, say B2. A branch is created for each value of B2 and the samples are partitioned accordingly:
  B2=0010 -> SampleSet1
  B2=0011 -> SampleSet2
  B2=0111 -> SampleSet3
  B2=1010 -> SampleSet4
  B2=1011 -> SampleSet5
The algorithm is then applied recursively to each sub-sample set, stopping, e.g., when all samples belong to the same class or no attributes remain.
This is the final decision tree:
  B2=0010 -> B1=0111
  B2=0011 -> B3=0100 -> B1=0111
             B3=1000 -> B1=0011
  B2=0111 -> B1=0011
  B2=1010 -> B1=1111
  B2=1011 -> B1=0010

Preliminary Performance Study
P-Classifier versus ID3: classification time with respect to dataset size. (The slide shows a chart of classification cost against dataset size, from 20 to 80 million samples, comparing ID3 and the P-Classifier.)

Conclusion
Decision tree classification of spatial data streams using P-trees is especially efficient for streaming data.