Decision Tree Induction for High-Dimensional Data Using P-Trees

Presentation transcript:

Decision Tree Induction for High-Dimensional Data Using P-Trees
Anne Denton, William Perrizo
Computer Science, North Dakota State University
1. Introduction to data mining and vertical data
2. P-trees
3. P-tree Lazy DTI algorithm
4. Performance

Introduction: Data Mining versus Querying. There is a whole spectrum of techniques for getting information from data, running from querying to data mining: SQL SELECT-FROM-WHERE queries; complex queries (nested, EXISTS, ...); fuzzy queries, search engines, BLAST searches; searching and aggregating with OLAP (roll-up, drill-down, slice/dice, ...); and machine learning and data mining proper, including supervised learning (classification, regression), unsupervised learning (clustering), association rule mining, data prospecting, fractals, and more.

On the query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record '02). On the data mining end, the surface has barely been scratched, but even those scratches have had a great impact: one of the early scratchers (Walmart) became the biggest corporation in the world last year, while a non-scratcher (KMart) filed for bankruptcy.

Our approach: vertical, compressed data structures (Predicate-trees or Peano-trees; P-tree technology is patent pending by North Dakota State University) processed horizontally (whereas DBMSs process horizontal data vertically). P-trees address the curses of scalability and dimensionality.

Predicate tree technology: current practice in data mining is to structure data into horizontal records and then scan those records vertically. The P-tree approach instead vertically projects each attribute, then vertically projects each bit position of each attribute, then compresses each bit slice, using a predicate, into a basic P-tree (e.g., compress bit slice R11 into P11 using the universal predicate pure1), and then processes these vertical structures horizontally.

In the slide's example, a relation R(A1, A2, A3, A4) with 3-bit attributes is projected onto R[A1], ..., R[A4] and then onto bit files R11, R12, R13, R21, ..., R43 (first subscript = attribute, second = bit position). The 1-dimensional P-tree version of R11, P11, is built by recording the truth of the predicate "pure1" in a tree, recursively on halves, until purity is achieved: 1. Whole is pure1? false, record 0. 2. Left half pure1? false, record 0. 3. Right half pure1? false, record 0. 4. Left half of right half? false, record 0. 5. Right half of right half? true, record 1. 6. Left half of left half of right half? true, record 1. 7. Right half of left half of right half? false, record 0. Whenever a segment is pure (pure1, or pure0), that branch ends.

To count occurrences of the value pattern 7, 0, 1, 4 (bits 111 000 001 100), horizontally AND the corresponding basic P-trees and their complements: P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43. The root count of the resulting P-tree gives the number of occurrences (2 in the example).
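To make the recursive construction concrete, here is a minimal Python sketch, assuming an illustrative nested-tuple node encoding ('pure1', 'pure0', or ('mixed', left, right)); it is a sketch of the idea, not NDSU's implementation, and the example bit slice is illustrative rather than the slide's exact R11.

    # Minimal sketch of basic 1-D P-tree construction: record the truth of the
    # "pure1" predicate recursively on halves until purity is reached.
    # Illustrative node encoding: 'pure1', 'pure0', or ('mixed', left, right).

    def build_ptree(bits):
        if all(bits):
            return 'pure1'      # whole segment is 1s: pure, branch ends
        if not any(bits):
            return 'pure0'      # whole segment is 0s: pure, branch ends
        mid = len(bits) // 2
        return ('mixed', build_ptree(bits[:mid]), build_ptree(bits[mid:]))

    def root_count(node, size):
        """Count the 1-bits of the slice from the tree alone (no raw-data scan)."""
        if node == 'pure1':
            return size
        if node == 'pure0':
            return 0
        _, left, right = node
        return root_count(left, size // 2) + root_count(right, size - size // 2)

    r11 = [0, 0, 0, 0, 1, 0, 1, 1]          # illustrative bit slice
    p11 = build_ptree(r11)
    print(root_count(p11, len(r11)))        # 3

The root count comes from the tree structure alone, which is why counts of arbitrary patterns can be computed without rescanning the raw data.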

Vertical Data Structures: History. In the 1980s, vertical data structures were proposed for record-based workloads: the Decomposition Storage Model (DSM, Copeland et al.); the Bit Transposed File (BTF, Wang et al.); Viper; vertical auxiliary and system structures such as Domain & Request Vectors (DVA/ROLL/ROCC, Perrizo, Shi, et al.); and Bit-Mapped Indexes (BMIs) (in fact, all indexes are vertical). Why vertical data structures? For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it may introduce too much post-processing. In data mining, however, the result is often a bit (yes/no) or another unstructured result, where horizontal structuring has no advantage.

2-Dimensional Pure1-trees: a node is 1 iff its quadrant is purely 1-bits. E.g., take a bit file (from, say, the high-order bit of the RED band of an 8x8 2-D image): 1111110011111000111111001111111011110000111100001111000001110000. Arrange it in spatial raster order (8 rows of 8 bits) and run-length compress it into a quadrant tree using Peano order.
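As a sketch of the quadrant decomposition, here is the 2-D analogue of the earlier 1-D sketch, in the same illustrative node encoding and assuming the image side is a power of two; it uses the slide's bit file as input.

    # Sketch of a 2-D Pure1-tree: a node is pure1 iff its quadrant is all 1s;
    # otherwise split into four quadrants in Peano (Z) order and recurse.

    def build_ptree_2d(grid):
        flat = [b for row in grid for b in row]
        if all(flat):
            return 'pure1'
        if not any(flat):
            return 'pure0'
        h = len(grid) // 2
        quads = ([row[:h] for row in grid[:h]],   # upper-left quadrant
                 [row[h:] for row in grid[:h]],   # upper-right quadrant
                 [row[:h] for row in grid[h:]],   # lower-left quadrant
                 [row[h:] for row in grid[h:]])   # lower-right quadrant
        return ('mixed',) + tuple(build_ptree_2d(q) for q in quads)

    bit_file = ("11111100111110001111110011111110"
                "11110000111100001111000001110000")
    grid = [[int(c) for c in bit_file[r * 8:(r + 1) * 8]] for r in range(8)]
    tree = build_ptree_2d(grid)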

Applying P-tree technology to telescope data (the celestial sphere) using a variant of the Peano space-filling curve on triangulations: traverse the southern hemisphere in the reverse direction (just the identical pattern pushed down instead of pulled up), arriving at the southern neighbor of the start point. This "Peano ordering" produces a sphere-surface-filling curve with good continuity characteristics (coordinates: RA, dec).

Applying P-trees to astronomical data using standard recession angle and declination coordinates: the sphere is mapped to a cylinder and then flattened onto north and south planes (declination from -90° through 0° to 90°; recession angle from 0° to 360°). Sphere → Cylinder → Plane.

3-Dimensional P-trees (to be used if the data is naturally 3-D, e.g., solids).

Logical Operations on P-trees (used to get counts of any pattern): AND and OR of P-tree 1 and P-tree 2 produce the AND result and OR result directly on the compressed trees. The P-tree AND operation is faster than a bit-by-bit AND because there are shortcuts: any pure0 operand node means the corresponding result node is pure0, so, e.g., only quadrant 2 of P-tree 1, P-tree 2, etc. may need to be loaded at all. The more operands there are in the AND, the greater the benefit from this shortcut (more pure0 nodes).
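A sketch of a multi-operand AND with this shortcut, in the same illustrative node encoding used in the earlier sketches (a pure0 operand ends the recursion immediately, so the remaining operands at that position are never examined):

    # Sketch of a multi-operand P-tree AND with the pure0 shortcut.

    def ptree_and(nodes):
        if any(n == 'pure0' for n in nodes):
            return 'pure0'                  # shortcut: skip all other operands
        mixed = [n for n in nodes if n != 'pure1']
        if not mixed:
            return 'pure1'                  # every operand is pure1 here
        if len(mixed) == 1:
            return mixed[0]                 # AND with pure1 is the identity
        fanout = len(mixed[0]) - 1          # children per 'mixed' node (2 or 4)
        children = tuple(ptree_and([n[i + 1] for n in mixed])
                         for i in range(fanout))
        if all(c == 'pure0' for c in children):
            return 'pure0'                  # re-collapse to stay compressed
        return ('mixed',) + children

E.g., ptree_and([p11, p12]) over basic P-trees built from equal-length bit slices; the root count of the result then gives the number of rows matching the conjunctive pattern.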

Our DataMIME™ System (DataMIME™ = information without noise): your data is loaded over the Internet through the DII (Data Integration Interface), using the Data Integration Language (DIL), into the P-tree base, a lossless, compressed, distributed, vertically-structured database; your data mining is then performed through the DMI (Data Mining Interface), using the P-tree (Predicates) Query Language (PQL).

Generalized Raster and Peano Sorting: generalizes to any table with numeric attributes (not just images). Starting from the unsorted relation, Generalized Raster sorting orders rows by attributes first and bit position second; Generalized Peano sorting orders rows by bit position first and attributes second. A sketch of the two sort keys follows.
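Here is a minimal sketch of the two sort keys, assuming unsigned integer attributes of a fixed bit width (the function names and toy rows are illustrative):

    # Generalized Raster key: attribute-major (attributes 1st, bit position 2nd).
    # Generalized Peano key: bit-position-major (bit position 1st, attributes
    # 2nd), i.e., one bit from each attribute at a time (bit interleaving).

    def raster_key(row, width):
        return tuple((value >> (width - 1 - p)) & 1
                     for value in row for p in range(width))

    def peano_key(row, width):
        return tuple((value >> (width - 1 - p)) & 1
                     for p in range(width) for value in row)

    rows = [(5, 1, 6), (2, 7, 3), (5, 0, 6)]      # toy relation, 3-bit attributes
    rows.sort(key=lambda r: peano_key(r, width=3))

Sorting by the Peano key groups rows whose high-order bits agree across all attributes, which tends to create larger pure quadrants in the resulting P-trees.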

Generalized Peano sorting works! The slide's chart shows the classification speed improvement (time in seconds, 0 to 120) on 5 UCI Machine Learning Repository data sets (adult, spam, mushroom, function, crop), comparing Unsorted, Generalized Raster, and Generalized Peano orderings.

ACM KDD-Cup 2002 win using P-tree technology (NDSU team).

P-tree Lazy Decision Tree Induction method. Our algorithm uses decision tree induction, selecting attributes successively based on their relevance to the classification task. Data points that match the unclassified sample in all selected attributes are considered relevant to the prediction task, and the class label of the sample of interest is determined by a plurality vote among those points. Attribute selection is based on maximizing information gain. In contrast to conventional decision tree induction, tree branches are constructed only as needed, based on the sample that is to be classified (lazy decision tree induction); see the sketch below. Continuous attributes are handled using a window function and a count calculation using P-trees. Bagging (multiple runs using different training bags) is used to improve the result.
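A minimal sketch of the lazy induction loop, written over plain Python lists for clarity; the published method computes the same counts via P-tree AND operations instead of scans, and also handles continuous attributes (window function) and bagging, which this sketch omits. The helper names are illustrative, not the authors' code.

    # Minimal sketch of lazy decision tree induction: grow only the single
    # branch that the unclassified sample would follow, restricting the
    # training data step by step, then take a plurality vote.
    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        base = entropy(labels)
        remainder = 0.0
        for val, cnt in Counter(row[attr] for row in rows).items():
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == val]
            remainder += (cnt / len(rows)) * entropy(subset)
        return base - remainder

    def lazy_dti_classify(rows, labels, sample):
        attrs = set(range(len(sample)))
        while attrs and len(set(labels)) > 1:
            best = max(attrs, key=lambda a: info_gain(rows, labels, a))
            attrs.remove(best)
            keep = [i for i, row in enumerate(rows) if row[best] == sample[best]]
            if not keep:            # no training point matches: stop early
                break
            rows = [rows[i] for i in keep]
            labels = [labels[i] for i in keep]
        return Counter(labels).most_common(1)[0][0]   # plurality vote

    rows = [(1, 0), (1, 1), (0, 1), (1, 0)]
    labels = ['a', 'a', 'b', 'a']
    print(lazy_dti_classify(rows, labels, sample=(1, 0)))   # 'a'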

Results: Accuracy. Comparable to C4.5 after much less development time. Evaluation used 5 data sets from the UCI Machine Learning Repository (adult, spam, sick-euthyroid, kr-vs-kp, mushroom) plus 2 additional data sets (crop, gene-function), with improvement through multiple-path bagging (20 paths). The slide's chart shows paired error rates: adult 16 / 14.9, spam 7.2 / 7.1, sick-euthyroid 2.2 / 2.9, kr-vs-kp 0.8 / 0.84, gene-fctn 15.7 / 15.5, crop 19.

Results: Speed. Evaluated on the largest UCI data set used (spam); the slide's chart shows the scaling of execution time as a function of training set size.