Decision Tree Induction for High-Dimensional Data Using P-Trees


1 Decision Tree Induction for High-Dimensional Data Using P-Trees
Anne Denton, William Perrizo
Computer Science, North Dakota State University
Outline: 1. Introduction to data mining and vertical data; 2. P-trees; 3. P-tree Lazy DTI algorithm; 4. Performance

2 Introduction: Data Mining versus Querying
There is a whole spectrum of techniques for getting information from data, ranging from querying to data mining:
- Querying: SQL SELECT-FROM-WHERE, complex queries (nested, EXISTS, ...), fuzzy queries, search engines, BLAST searches
- Searching and aggregating: OLAP (rollup, drilldown, slice/dice, ...)
- Data mining and machine learning: association rule mining, data prospecting, supervised learning (classification, regression), unsupervised learning (clustering), fractals, ...

On the query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record '02). On the data mining end, the surface has barely been scratched, but even those scratches had a great impact: one of the early scratchers (Walmart) became the biggest corporation in the world last year, while a non-scratcher (KMart) filed for bankruptcy.

Our approach: vertical, compressed data structures (Predicate or Peano trees, P-trees[1]) processed horizontally, whereas DBMSs process horizontal data vertically. P-trees address the curses of scalability and dimensionality.

[1] P-tree technology is patent pending by North Dakota State University.

3 Predicate tree technology
Current practice in data mining structures data into horizontal records, and horizontally structured records are scanned vertically. Predicate tree technology instead vertically projects each attribute of R(A1, A2, A3, A4) into R[A1], R[A2], R[A3], R[A4], then vertically projects each bit position of each attribute into bit slices R11, R12, R13, R21, ..., R43, and then compresses each bit slice, using a predicate, into a basic P-tree; e.g., R11 is compressed into P11 using the universal predicate pure1. Processing is then done vertically (vertical scans).

The 1-dimensional P-tree version of R11, P11, is built by recording the truth of the predicate "pure1" in a tree, recursively on halves, until purity is achieved. For R11 = 0 0 0 0 1 0 1 1 the construction proceeds: 1. whole slice pure1? false, so 0; 2. left half pure1? false, so 0, but it is pure (pure0) so this branch ends; 3. right half pure1? false, so 0; 4. left half of the right half pure1? false, so 0; 5. right half of the right half pure1? true, so 1; 6. left half of the left half of the right half pure1? true, so 1; 7. right half of the left half of the right half pure1? false, so 0, but it is pure (pure0) so this branch ends.

Basic P-trees are then ANDed horizontally. For example, to count occurrences of the value combination (7, 0, 1, 4) over (A1, A2, A3, A4), compute the root count of P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43, where a prime denotes the complement P-tree.
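A minimal sketch in Python of how such a basic 1-D P-tree could be built and its root count read off, assuming a simple node class; the names (PTreeNode, build_ptree, root_count) and representation are illustrative, not the authors' implementation:

```python
# Minimal sketch of basic 1-D P-tree construction and root-count evaluation.
# Illustrative only; not the authors' code.

class PTreeNode:
    """Records whether its interval of the bit slice is pure1, pure0, or mixed."""
    def __init__(self, pure1, pure0, children=None):
        self.pure1 = pure1              # every bit in the interval is 1
        self.pure0 = pure0              # every bit in the interval is 0
        self.children = children or []  # two halves when the interval is mixed

def build_ptree(bits):
    """Record the truth of the pure1 predicate recursively on halves until purity."""
    if all(bits):
        return PTreeNode(pure1=True, pure0=False)
    if not any(bits):
        return PTreeNode(pure1=False, pure0=True)   # pure0: this branch ends too
    mid = len(bits) // 2
    return PTreeNode(False, False,
                     [build_ptree(bits[:mid]), build_ptree(bits[mid:])])

def root_count(node, size):
    """Count of 1-bits in the interval of length `size` covered by `node`."""
    if node.pure1:
        return size
    if node.pure0:
        return 0
    half = size // 2
    return (root_count(node.children[0], half)
            + root_count(node.children[1], size - half))

# The bit slice R11 = 0,0,0,0,1,0,1,1 from the slide's worked example compresses into P11.
r11 = [0, 0, 0, 0, 1, 0, 1, 1]
p11 = build_ptree(r11)
print(root_count(p11, len(r11)))   # 3
```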

4 Vertical Data Structures History
In the 1980s, vertical data structures were proposed for record-based workloads: the Decomposition Storage Model (DSM, Copeland et al.), the Bit Transposed File (BTF, Wang et al.), and Viper. Vertical auxiliary and system structures include Domain and Request Vectors (DVA/ROLL/ROCC, Perrizo, Shi, et al.) and Bit Mapped Indexes (BMIs); in fact, all indexes are vertical.

Why vertical data structures? For record-based workloads (where the result is a set of records), changing the horizontal record structure R(A1, A2, A3, A4) and then having to reconstruct it may introduce too much post-processing. In data mining, however, the result is often a bit (Yes/No) or some other unstructured result, so horizontal structuring has no advantage.

5 2-Dimensional Pure1-trees
A node is 1 iff its quadrant is purely 1-bits. Start from a bit file (e.g., the high-order bit of the RED band of a 2-D image), view it in spatial raster order as a 2-D bit array, and run-length compress it into a quadrant tree using Peano order: a pure quadrant becomes a leaf, and a mixed quadrant is split into its four sub-quadrants.
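A minimal sketch of this quadrant-tree construction, under the assumption of a square power-of-two bit array and an ad-hoc representation (1 for a pure1 leaf, 0 for a pure0 leaf, a four-element list for a mixed node in Peano order); illustrative only:

```python
# Illustrative 2-D pure1-tree (quadrant tree) construction; names are assumptions.

def build_quadtree(grid):
    """grid: square 2-D list of 0/1 bits whose side is a power of two.
    Returns 1 or 0 for a pure quadrant, or a list of four child subtrees
    in Peano order (upper-left, upper-right, lower-left, lower-right)."""
    flat = [b for row in grid for b in row]
    if all(flat):
        return 1            # purely 1-bits: leaf
    if not any(flat):
        return 0            # purely 0-bits: leaf
    n = len(grid) // 2
    quads = [[row[:n] for row in grid[:n]],   # upper-left
             [row[n:] for row in grid[:n]],   # upper-right
             [row[:n] for row in grid[n:]],   # lower-left
             [row[n:] for row in grid[n:]]]   # lower-right
    return [build_quadtree(q) for q in quads]
```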

6 Applying P-tree technology to telescope data (the Celestial Sphere)
A variant of the Peano space-filling curve is used on triangulations of the sphere (coordinates: RA, dec). The southern hemisphere is traversed in the reverse direction (the identical pattern pushed down instead of pulled up), arriving at the southern neighbor of the start point. This "Peano ordering" produces a sphere-surface filling curve with good continuity characteristics.

7 Applying P-trees to astronomical data using standard recession angle and declination coordinates
[Figure: Sphere, then Cylinder, then Plane mapping, giving a North plane and a South plane; declination axis runs from 90° through 0° to -90°, recession angle axis starts at 0°.]

8 3-Dimensional P-trees (to be used if the data is naturally 3-D, e.g., solids)

9 Logical Operations on P-trees (used to get counts of any pattern)
Logical AND and OR are applied to P-tree operands to produce result P-trees. The P-tree AND operation is faster than a bit-by-bit AND because there are shortcuts: any pure0 operand node means the corresponding result node is pure0 (e.g., only quadrant 2 needs to be loaded to AND Ptree1, Ptree2, etc.). The more operands there are in the AND, the greater the benefit from this shortcut (more pure0 nodes appear).
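A minimal sketch of the pure0 shortcut, reusing the illustrative quadrant-tree representation sketched earlier (1 / 0 leaves, four-child lists); again an assumption, not the authors' implementation:

```python
# AND of two quadrant trees with the pure0 shortcut: as soon as either
# operand node is pure0, the result node is pure0 and no children are visited.

def ptree_and(a, b):
    if a == 0 or b == 0:
        return 0                      # pure0 shortcut: skip the whole quadrant
    if a == 1:
        return b                      # pure1 operand: result equals the other operand
    if b == 1:
        return a
    children = [ptree_and(ca, cb) for ca, cb in zip(a, b)]
    if all(c == 1 for c in children):
        return 1                      # collapse back to a pure1 leaf
    if all(c == 0 for c in children):
        return 0                      # collapse back to a pure0 leaf
    return children
```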

10 Our DataMIME™ System (DataMIME™ = information without noise)
Your data enters through the DII (Data Integration Interface) using the Data Integration Language (DIL); your data mining is done through the DMI (Data Mining Interface) using the P-tree (Predicates) Query Language (PQL). Both communicate over the Internet with the P-tree Base: a lossless, compressed, distributed, vertically-structured database.

11 Generalized Raster and Peano Sorting
Generalized Raster and Peano sorting generalize to any table with numeric attributes (not just images), starting from the unsorted relation. Raster sorting orders the bits attribute first, bit position second (all bits of attribute 1, then all bits of attribute 2, ...); Peano sorting orders them bit position first, attribute second (the high-order bits of all attributes, then the next bit position, and so on).
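A minimal sketch of the two orderings as sort keys over rows of numeric attributes; the 8-bit width, the example rows, and the helper names are assumptions for illustration:

```python
# Build raster and Peano sort keys for one row of numeric attributes.
# Raster: concatenate whole attributes (attribute first, bit position second).
# Peano:  interleave bit positions across attributes (bit position first, attribute second).

def to_bits(value, width):
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]

def raster_key(row, width=8):
    return [bit for value in row for bit in to_bits(value, width)]

def peano_key(row, width=8):
    slices = [to_bits(value, width) for value in row]
    return [bit for position in range(width) for bit in (s[position] for s in slices)]

rows = [(13, 200, 7), (13, 199, 7), (96, 3, 255)]
rows_raster = sorted(rows, key=raster_key)   # raster-sorted relation
rows_peano = sorted(rows, key=peano_key)     # Peano-sorted relation
```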

12 Generalized Peano Sorting works!
Classification speed improvement (using 5 UCI Machine Learning Repository data sets). [Bar chart: time in seconds (0-120) for the Unsorted, Generalized Raster, and Generalized Peano orderings on the data sets adult, spam, mushroom, function, and crop.]

13 ACM KDD-Cup 2002 win using Ptree technology
NDSU Team

14 P-tree Lazy Decision Tree Induction method
Our algorithm uses decision tree induction, selecting attributes successively based on their relevance to the classification task; attribute selection is based on optimizing information gain. Data points that match the unclassified sample in all selected attributes are considered relevant to the prediction task, and the class label of the sample of interest is determined by a plurality vote among those points. In contrast to conventional decision tree induction, tree branches are constructed only as needed, based on the sample that is to be classified (lazy decision tree induction). Continuous attributes are treated using a window function and a count calculation using P-trees. Bagging is used to improve the result (multiple runs using different training bags).
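A minimal sketch of the lazy induction loop on plain horizontal data, omitting the P-tree count machinery, the window treatment of continuous attributes, and bagging; all names and the split criterion details are illustrative assumptions, not the authors' implementation:

```python
# Lazy decision-tree induction sketch: grow only the path that the
# unclassified sample would follow, then take a plurality vote.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def info_gain(points, labels, attr, value):
    """Gain from splitting on 'attr == value' (the branch the sample would take)."""
    match = [l for p, l in zip(points, labels) if p[attr] == value]
    rest = [l for p, l in zip(points, labels) if p[attr] != value]
    n = len(labels)
    return entropy(labels) - (len(match) / n) * entropy(match) - (len(rest) / n) * entropy(rest)

def lazy_classify(points, labels, sample):
    remaining = list(range(len(sample)))           # attribute indices not yet selected
    while remaining and len(set(labels)) > 1:
        gains = {a: info_gain(points, labels, a, sample[a]) for a in remaining}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        keep = [i for i, p in enumerate(points) if p[best] == sample[best]]
        if not keep:                               # no matching training points: stop
            break
        points = [points[i] for i in keep]         # only points matching the sample remain
        labels = [labels[i] for i in keep]
        remaining.remove(best)
    return Counter(labels).most_common(1)[0][0]    # plurality vote of relevant points
```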

15 Results: Accuracy
Accuracy is comparable to C4.5, after much less development time, on 5 data sets from the UCI Machine Learning Repository (adult, spam, sick-euthyroid, kr-vs-kp, mushroom) plus 2 additional data sets (crop, gene-function). Improvement is obtained through multiple-path bagging (20 paths). [Chart: paired figures per data set; adult 16 / 14.9, spam 7.2 / 7.1, sick-euthyroid 2.2 / 2.9, kr-vs-kp 0.8 / 0.84, gene-fctn 15.7 / 15.5, crop 19.]

16 Results: Speed
Evaluated on the largest of the UCI data sets used (spam). [Chart: scaling of execution time as a function of training set size.]

