RainForest. Presented by: Mehdi Teymouri and Kianoush Malakouti. Instructor: Dr. Vahidipour


RainForest. Presented by: Mehdi Teymouri and Kianoush Malakouti. Instructor: Dr. Vahidipour. Faculty of Electrical and Computer Engineering, University of Kashan; Fall 1396.

Background (1/2): Decision Tree Efficiency & Scalability
The efficiency of existing decision tree algorithms such as ID3, C4.5, and CART is well established for relatively small data sets. But what if D, the disk-resident training set of class-labeled tuples, does not fit in main memory? In other words, how scalable is decision tree induction? The pioneering decision tree algorithms mentioned above assume that the data are memory resident, i.e., that the training tuples fit in main memory.

Background (2/2): Decision Tree Efficiency & Scalability
Efficiency becomes a concern when these algorithms are applied to the mining of very large real-world databases, where the training data most often will not fit in memory. More scalable approaches, capable of handling training data that are too large to fit in memory, are required. Earlier strategies to "save space" included discretizing continuous-valued attributes and sampling the data at each node; these techniques, however, still assume that the training set can fit in memory. Other strategies trade away the quality of the resulting tree.

What is RainForest? (1/2)
RainForest (a whimsical name!) is a scalable framework for decision tree induction. It was first introduced by Gehrke, Ramakrishnan, and Ganti in an article published in the journal Data Mining and Knowledge Discovery (July 2000, Volume 4). It separates the scalability aspects of tree construction from the criteria that determine the quality of the tree.

What is RainForest? (2/2)
It adapts to the amount of main memory available and applies to any decision tree induction algorithm, while also improving performance. Applying the framework to an algorithm yields a scalable version of that algorithm without modifying its result. Evaluation of the quality of the resulting decision tree is therefore not needed; we can concentrate on scalability issues.

AVC-list (1/2)
The method maintains an AVC-list (where "AVC" stands for "Attribute-Value, Class-label") describing the training tuples at each node. The AVC-set of an attribute P at node N gives the aggregated class-label counts for each distinct value of P among the tuples at N. F(N), the family of node N, is the set of tuples of the database that follow the path from the root to N when classified by the tree; the AVC-set of P is the projection of F(N) onto P and the class label. The set of all AVC-sets at a node N is the AVC-group of N.

AVC-list (2/2)
The size of the AVC-set for an attribute A at node N depends only on the number of distinct values of A and the number of class labels in F(N), the set of tuples at N, not on the number of records. Typically this size fits in memory, even for real-world data. RainForest also has techniques for handling the case where the AVC-group does not fit in memory. The method therefore scales to decision tree induction on very large data sets.

Example: Class-Labeled Training Tuples

id  age          income  student  credit_rating  class: buys_computer
1   youth        high    no       fair           no
2   youth        high    no       excellent      no
3   middle_aged  high    no       fair           yes
4   senior       medium  no       fair           yes
5   senior       low     yes      fair           yes
6   senior       low     yes      excellent      no
7   middle_aged  low     yes      excellent      yes
8   youth        medium  no       fair           no
9   youth        low     yes      fair           yes
10  senior       medium  yes      fair           yes
11  youth        medium  yes      excellent      yes
12  middle_aged  medium  no       excellent      yes
13  middle_aged  high    yes      fair           yes
14  senior       medium  no       excellent      no

Example: AVC-sets (at the root node)

age          buys_computer=yes  buys_computer=no
youth        2                  3
middle_aged  4                  0
senior       3                  2

income       buys_computer=yes  buys_computer=no
low          3                  1
medium       4                  2
high         2                  2

student      buys_computer=yes  buys_computer=no
yes          6                  1
no           3                  4

credit_rating  buys_computer=yes  buys_computer=no
fair           6                  2
excellent      3                  3
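
The AVC-sets above can be computed in a single scan over the training tuples. The following short Python sketch (an illustration added here, not code from the slides or the paper) builds the AVC-group of the root node from the example table; names such as build_avc_group are hypothetical.

from collections import Counter, defaultdict

# The 14 class-labeled training tuples from the example table (id column omitted):
# (age, income, student, credit_rating, buys_computer)
ATTRIBUTES = ["age", "income", "student", "credit_rating"]
tuples = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]

def build_avc_group(records):
    """One scan over the records: for every attribute, aggregate (value, class-label) counts."""
    avc_group = {a: defaultdict(Counter) for a in ATTRIBUTES}
    for rec in records:
        *values, label = rec
        for attr, value in zip(ATTRIBUTES, values):
            avc_group[attr][value][label] += 1
    return avc_group

avc_group = build_avc_group(tuples)
print(dict(avc_group["age"]))
# {'youth': Counter({'no': 3, 'yes': 2}),
#  'middle_aged': Counter({'yes': 4}),
#  'senior': Counter({'yes': 3, 'no': 2})}

Note that an AVC-group holds at most (number of distinct values x number of class labels) counters per attribute, regardless of how many tuples were scanned.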

Top-Down Decision Tree Induction Schema: A Common Pattern

BuildTree(Node N, data partition D)
    Apply the algorithm to D to find criterion(N)
    let K be the number of children of N
    if (K > 0)
        create K children C1, C2, ..., CK of N
        use the best split to partition D into D1, D2, ..., DK
        for (i = 1; i <= K; i++)
            BuildTree(Ci, Di)
        endfor
    endif

RainForest Refinement

for each predictor attribute P
    call A.find_best_partitioning(AVC-set of P)
endfor
K = A.decide_splitting_criterion()

Properties of the refinement:
- Scales with the size of the database.
- Adapts gracefully to the amount of main memory available.
- Not restricted to a specific classification algorithm A.
- The utility of a predictor attribute P is examined independently of the other predictor attributes; this isolates an important component, the AVC-set, which allows the separation of scalability issues from the algorithm A that decides on the splitting criterion.
- Total main memory required = the maximum size of any AVC-set.
- Previous evaluations of the algorithm's results still apply, since the resulting tree is unchanged.
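
As a concrete illustration of the refinement, here is a hedged Python sketch (not taken from the paper) in which the split at each node is chosen from the per-attribute AVC-sets alone, so only these aggregates need to be in memory. It reuses build_avc_group, ATTRIBUTES, tuples and Counter from the sketch above; gini, find_best_partitioning and decide_splitting_criterion are illustrative stand-ins for the classification algorithm A.

def gini(counts):
    """Gini index of a class-count mapping such as {'yes': 6, 'no': 1}."""
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def find_best_partitioning(avc_set):
    """Score a multiway split (one branch per attribute value) using a single AVC-set.
    Returns the weighted Gini index of the split; lower is better."""
    total = sum(sum(c.values()) for c in avc_set.values())
    return sum(sum(c.values()) / total * gini(c) for c in avc_set.values())

def decide_splitting_criterion(avc_group):
    """Pick the attribute whose AVC-set yields the purest split."""
    return min(avc_group, key=lambda attr: find_best_partitioning(avc_group[attr]))

def build_tree(records, attributes):
    """The generic BuildTree schema, driven entirely by AVC-sets."""
    labels = Counter(label for *_, label in records)
    if len(labels) == 1 or not attributes:            # pure node, or no attribute left
        return labels.most_common(1)[0][0]            # leaf: majority class label
    avc_group = build_avc_group(records)              # one scan over this partition
    split_attr = decide_splitting_criterion({a: avc_group[a] for a in attributes})
    idx = ATTRIBUTES.index(split_attr)
    remaining = [a for a in attributes if a != split_attr]
    children = {value: build_tree([r for r in records if r[idx] == value], remaining)
                for value in avc_group[split_attr]}   # partition records across children
    return (split_attr, children)

tree = build_tree(tuples, ATTRIBUTES)   # on the example data this splits on 'age' at the root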

Steps in the RainForest Refinement Algorithm
1. AVC-group construction.
2. Choose the splitting attribute and predicate (according to the algorithm A).
3. Partition D across the children nodes.

Main Memory
Depending on the amount of main memory available, three cases can be distinguished:
1. The AVC-group of the root node fits in main memory: RF-Write, RF-Read, RF-Hybrid.
2. Each individual AVC-set of the root node fits in main memory, but the AVC-group of the root node does not: RF-Vertical.
3. None of the individual AVC-sets of the root fit in main memory.
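
Because AVC-set sizes depend only on the number of distinct attribute values and class labels, the case distinction can be made before any tree is grown. The sketch below (illustrative only; the size estimate and the memory budget are assumptions, not formulas from the paper) picks a case for the avc_group built in the earlier sketch.

import sys

def avc_set_bytes(avc_set):
    """Rough size estimate: the attribute values plus one counter entry per (value, class-label) pair."""
    return sum(sys.getsizeof(value) + sys.getsizeof(label) + sys.getsizeof(n)
               for value, counts in avc_set.items()
               for label, n in counts.items())

def choose_case(avc_group, memory_budget):
    sizes = {attr: avc_set_bytes(s) for attr, s in avc_group.items()}
    if sum(sizes.values()) <= memory_budget:
        return "case 1: the AVC-group fits (RF-Write / RF-Read / RF-Hybrid)"
    if max(sizes.values()) <= memory_budget:
        return "case 2: each AVC-set fits, the AVC-group does not (RF-Vertical)"
    return "case 3: no individual AVC-set fits"

print(choose_case(avc_group, memory_budget=64 * 1024))   # the tiny example lands in case 1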

State of a Node (precondition and processing behavior for a tuple t, per state)
Send: precondition: criterion(N) has been computed, N's children nodes have been allocated, and N is the root or the parent of N is in the 'Send' state. Behavior: t is sent to a child according to criterion(N).
Fill: behavior: the AVC-group of N is updated with t.
Write: behavior: t is appended to N's partition.
FillWrite: behavior: the AVC-group of N is updated with t and t is appended to N's partition.
Undecided (new node): no precondition; no processing decided yet.
Dead: precondition: N does not split, or all children of N are in the 'Dead' state.
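
To make the state machine concrete, here is a small Python sketch (my own illustration, not code from the paper) of a node and of how a single tuple t is processed depending on the node's state; the Node fields are assumptions.

from collections import Counter, defaultdict
from enum import Enum, auto

class State(Enum):
    UNDECIDED = auto()
    FILL = auto()
    WRITE = auto()
    FILLWRITE = auto()
    SEND = auto()
    DEAD = auto()

class Node:
    def __init__(self, attributes):
        self.state = State.UNDECIDED
        self.avc_group = {a: defaultdict(Counter) for a in attributes}  # filled in Fill/FillWrite
        self.partition = []        # tuples appended in Write/FillWrite
        self.children = {}         # attribute value -> child Node, used in Send
        self.split_attr = None     # criterion(N), set before the node enters Send

def process_tuple(node, record, attributes):
    """Route one class-labeled tuple through a node according to the node's state."""
    *values, label = record
    if node.state in (State.FILL, State.FILLWRITE):
        for attr, value in zip(attributes, values):
            node.avc_group[attr][value][label] += 1               # update the AVC-group of N
    if node.state in (State.WRITE, State.FILLWRITE):
        node.partition.append(record)                             # append t to N's partition
    if node.state is State.SEND:
        value = values[attributes.index(node.split_attr)]
        process_tuple(node.children[value], record, attributes)   # send t to a child
    # Undecided and Dead nodes ignore the tuple.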

RF-Write
1. Put the root R in the 'Fill' state; one scan of the database constructs the AVC-group of R.
2. Apply the algorithm (compute criterion(R)) and create K children of R. R moves to the 'Send' state; each child is put in the 'Write' state.
3. An additional scan of the database, in which each tuple t is written into one of the K partitions.
4. Recurse on each partition.
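
Putting the pieces together, here is a hedged sketch of RF-Write built on the Node, State and process_tuple helpers above and on decide_splitting_criterion from the refinement sketch (again my own illustration, with a simplified stopping rule): one scan fills the node's AVC-group, the split is chosen, a second scan sends each tuple to a child that writes it into its partition, and the algorithm recurses on each partition.

def rf_write(records, attributes):
    node = Node(attributes)

    # Step 1: the node enters the Fill state; one scan of its partition builds the AVC-group.
    node.state = State.FILL
    for r in records:
        process_tuple(node, r, attributes)

    # Simplified stopping rule: a pure partition becomes a Dead leaf.
    labels = Counter(label for *_, label in records)
    if len(labels) == 1:
        node.state = State.DEAD
        return node

    # Step 2: apply the algorithm to compute criterion(N) and allocate K children.
    node.split_attr = decide_splitting_criterion(node.avc_group)
    if len(node.avc_group[node.split_attr]) < 2:   # no useful split left: make a leaf
        node.state = State.DEAD
        return node
    node.state = State.SEND
    for value in node.avc_group[node.split_attr]:
        child = Node(attributes)
        child.state = State.WRITE
        node.children[value] = child

    # Step 3: an additional scan; each tuple is sent to a child and appended to that child's partition.
    for r in records:
        process_tuple(node, r, attributes)

    # Step 4: recurse on each partition (the placeholder Write child is replaced by its subtree).
    for value, child in node.children.items():
        node.children[value] = rf_write(child.partition, attributes)
    return node

root = rf_write(tuples, ATTRIBUTES)   # on the example data this produces the same splits as the refinement sketch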

Comparison

The End. Thank you for your time.