2 DECISION TREE INDUCTION

3 CLASSIFICATION AND PREDICTION What is classification? What is prediction? Issues for classification and prediction. What is decision tree induction?

4 What is classification? Classification predicts categorical class labels. It constructs a model from the training set, using the values (class labels) of a classifying attribute, and then uses that model to classify new data. We are given a collection of records (the training set); each record contains a set of attributes, one of which is the class.

5 Classification is a two-step process 1) Model construction: describing a set of predetermined classes. – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute. – The set of tuples used for model construction is the training set. – The model is represented as classification rules or decision trees.

6 Classification is a two-step process 2) Model usage: classifying future or unknown objects. – Estimate the accuracy of the model. – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
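To make the two steps concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier. The toy data, the attribute meanings, and the "credit approved" labels are illustrative assumptions, not taken from the slides:

```python
# A minimal sketch of the two-step classification process using scikit-learn.
# The toy data below is purely illustrative.
from sklearn.tree import DecisionTreeClassifier

# Step 1 - model construction: the training set has known class labels.
X_train = [[25, 20000], [35, 60000], [45, 80000], [20, 10000]]  # [age, income]
y_train = ["no", "yes", "yes", "no"]                            # credit approved?

model = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits
model.fit(X_train, y_train)

# Step 2 - model usage: classify a tuple whose class label is unknown.
print(model.predict([[30, 50000]]))   # e.g., ['yes']

# Estimate accuracy (here on the training set; use held-out data in practice).
print(model.score(X_train, y_train))  # 1.0
```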

7 What is prediction? Prediction models continuous-valued functions, i.e., it predicts unknown or missing values. Typical applications: – Credit approval – Target marketing – Medical diagnosis – Fraud detection

8 Issues for classification and prediction There are two issues for classification and prediction. 1) Data preparation. Data cleaning – preprocess data to reduce noise and handle missing values. Relevance analysis (feature selection) – remove irrelevant or redundant attributes. Data transformation – generalize and/or normalize data.

9 Issues for classification and prediction 2) Evaluation. Accuracy – classifier accuracy: how well the model predicts class labels – predictor accuracy: how well the model estimates the value of the predicted attribute. Speed – time to construct the model (training time) – time to use the model (classification/prediction time). Robustness: handling noise and missing values. Scalability: efficiency on disk-resident databases. Interpretability: the understanding and insight provided by the model. Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules.

10 What is decision tree induction? Decision tree – a flow-chart-like tree structure – an internal node denotes a test on an attribute – a branch represents an outcome of the test – leaf nodes represent class labels or class distributions. A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision.

11 Decision tree generation – two phases Decision tree generation consists of two phases. 1) Tree construction: at the start, all the training examples are at the root; partition the examples recursively based on selected attributes. 2) Tree pruning: identify and remove branches that reflect noise or outliers. Use of a decision tree: classify an unknown sample by testing its attribute values against the decision tree.
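As an illustration of "testing attribute values against the tree", the sketch below represents a small decision tree as nested dictionaries and walks it for one sample. The weather attributes are a common textbook example, not taken from these slides:

```python
# Illustrative only: a hand-built decision tree as nested dicts.
# Internal nodes test an attribute; leaves hold a class label (a str).
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny":    {"attribute": "humidity",
                     "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain":     {"attribute": "wind",
                     "branches": {"strong": "no", "weak": "yes"}},
    },
}

def classify(node, sample):
    """Walk from the root, following the branch matching the sample's
    value for each tested attribute, until a leaf is reached."""
    while isinstance(node, dict):
        value = sample[node["attribute"]]
        node = node["branches"][value]
    return node

sample = {"outlook": "sunny", "humidity": "normal", "wind": "weak"}
print(classify(tree, sample))  # -> "yes"
```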

12 Algorithms for decision tree induction Decision trees: rules for classifying data using attributes. The tree consists of decision nodes and leaf nodes. A decision node has two or more branches, each representing values of the attribute being tested. A leaf node represents a homogeneous result (all records in one class), which requires no additional classification testing.

13 Algorithms for decision tree induction ID3 C4.5 CART These are called decision tree induction algorithms.

14 ID3 – background ID3 stands for Iterative Dichotomiser 3. It is an algorithm used to generate a decision tree. ID3 is the precursor to the C4.5 algorithm.

15 (ID3, C4.5) – entropy From the training data we can calculate entropy and information gain. The quantity that measures information is called entropy. Entropy measures the amount of uncertainty in a set of data S: E(S) = −Σ pᵢ log₂ pᵢ, where pᵢ is the proportion of examples in S belonging to class i.
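A small sketch of the entropy calculation in plain Python; the label lists are made-up examples:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy E(S) = -sum(p_i * log2(p_i)) over the classes present in S."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

# A perfectly mixed set has maximum uncertainty (1 bit for two classes);
# a pure set has none.
print(entropy(["yes", "yes", "no", "no"]))  # 1.0
print(entropy(["yes", "yes", "yes"]))       # -0.0, i.e., zero (pure set)
```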

16 INFORMATION GAIN Information gain is based on the decrease in entropy after a dataset is split on an attribute. First, the entropy of the total dataset is calculated. The dataset is then split on the different attributes, and the entropy of each branch is calculated. These branch entropies are added proportionally (weighted by branch size) to get the total entropy of the split.

17 INFORMATION GAIN The resulting entropy is subtracted from the entropy before the split. The result is the information gain, or decrease in entropy: Gain(S, A) = E(S) − Σᵥ (|Sᵥ| / |S|) · E(Sᵥ), where Sᵥ is the subset of S with value v for attribute A. The attribute that yields the largest gain is chosen for the decision node.
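The procedure from the last two slides as a hedged sketch: split on an attribute, weight each branch's entropy by its size, and subtract the weighted sum from the parent entropy. The toy records are illustrative:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(records, attribute, class_key="class"):
    """Gain(S, A) = E(S) - sum_v (|S_v| / |S|) * E(S_v)."""
    parent = entropy([r[class_key] for r in records])
    branches = defaultdict(list)
    for r in records:
        branches[r[attribute]].append(r[class_key])    # split on attribute
    weighted = sum(len(b) / len(records) * entropy(b)  # proportional sum
                   for b in branches.values())
    return parent - weighted

# Illustrative toy data: does 'outlook' tell us more than 'wind'?
data = [
    {"outlook": "sunny",    "wind": "weak",   "class": "no"},
    {"outlook": "sunny",    "wind": "strong", "class": "no"},
    {"outlook": "overcast", "wind": "weak",   "class": "yes"},
    {"outlook": "rain",     "wind": "weak",   "class": "yes"},
    {"outlook": "rain",     "wind": "strong", "class": "no"},
]
print(information_gain(data, "outlook"))  # ~0.571: the larger gain, so chosen
print(information_gain(data, "wind"))     # ~0.420
```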

18 ADVANTAGES AND DISADVANTAGES OF ID3, C4.5 Advantages: Understandable prediction rules are created from the training data. Builds trees quickly, and tends to build short trees. Only needs to test enough attributes until all data is classified.

19 ADVANTAGES AND DISADVANTAGES OF ID3, C4.5 Disadvantages: Data may be over-fitted or over-classified if only a small sample is used. Only one attribute at a time is tested when making a decision.

20 Algorithm – CART CART stands for Classification And Regression Trees. CART generates a binary decision tree. Classification trees vs. regression trees: 1) Splitting criterion: Gini or entropy for classification trees; sum of squared errors for regression trees. 2) Goodness-of-fit measure: misclassification rate for classification trees; sum of squared errors (the same measure as the splitting criterion) for regression trees. 3) Prior probabilities and misclassification costs: available as model tuning parameters for classification trees; regression trees have no priors or misclassification costs – just let it run.
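For the classification side, CART's usual splitting criterion is Gini impurity. A minimal sketch with made-up labels:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2). 0 for a pure node;
    0.5 at worst for two equally mixed classes."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini(["yes", "no", "yes", "no"]))  # 0.5 (maximally mixed)
print(gini(["yes", "yes", "yes"]))       # 0.0 (pure)
```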

21 CART – advantages and disadvantages Advantages: Nonparametric (no probabilistic assumptions). Automatically performs variable selection. Uses any combination of continuous/discrete variables.

22 CART – advantages and disadvantages Disadvantages: Might take a large tree to get good lift, but a large tree is then hard to interpret. Data gets chopped thinner at each split. Instability of model structure: with correlated variables, random data fluctuations can result in entirely different trees.

23 Privacy Preserving Decision Tree Learning Using Unrealized Data Sets Abstract: Privacy preservation is important for machine learning and data mining, but measures designed to protect private information often result in a trade-off: reduced utility of the training samples. This paper introduces a privacy preserving approach that can be applied to decision tree learning without a concomitant loss of accuracy. It describes an approach to preserving the privacy of collected data samples in cases where information from the sample database has been partially lost.

24 Privacy Preserving Decision Tree Learning Using Unrealized Data Sets Abstract: This approach converts the original sample data sets into a group of unreal data sets, from which the original samples cannot be reconstructed without the entire group of unreal data sets. Meanwhile, an accurate decision tree can be built directly from those unreal data sets. This novel approach can be applied directly to the data storage as soon as the first sample is collected. The approach is compatible with other privacy preserving approaches, such as cryptography, for extra protection.

25 INTRODUCTION Collected data samples are important for decision making and pattern recognition. Therefore, privacy-preserving processes have been developed to sanitize private information from the samples while keeping their utility. Research in privacy preserving data mining mainly falls into one of two categories: 1) perturbation and randomization-based approaches, and 2) secure multiparty computation (SMC).

26 Perturbation and randomization-based approaches This paper takes a perturbation and randomization based approach that protects centralized sample data sets used for decision tree data mining. The approach can be applied at any time during the data collection process, so privacy protection is in effect even while samples are still being collected.

27 Secure multiparty computation (SMC) SMC approaches employ cryptographic tools for collaborative data mining computation by multiple parties. SMC research focuses on protocol development for protecting privacy among the involved parties, or on computation efficiency; however, centralized processing of samples and storage privacy are outside the scope of SMC.
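For flavor only: SMC protocols let parties compute a joint function without revealing their private inputs. The sketch below shows a toy additive secret-sharing sum, a standard SMC building block. It is an illustrative assumption, not the protocol from this paper:

```python
import random

M = 2**31 - 1  # a public modulus (illustrative choice)

def share(secret, n_parties):
    """Split a secret into n additive shares that sum to it mod M."""
    shares = [random.randrange(M) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % M)
    return shares

# Three parties hold private inputs; only the total is ever revealed.
inputs = [12, 7, 30]
all_shares = [share(v, 3) for v in inputs]

# Party j sums the j-th share it received from every input owner...
partials = [sum(col) % M for col in zip(*all_shares)]

# ...and the partial sums combine into the joint result without
# exposing any individual input.
print(sum(partials) % M)  # 49 = 12 + 7 + 30
```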

28 Privacy Risk Without the dummy attribute values technique, the average privacy loss per leaked unrealized data set is small, except for the even distribution case (in which the unrealized samples are the same as the originals). By doubling the sample domain, the average privacy loss for a single leaked data set is zero, as the unrealized samples are not linked to any information provider. The randomly picked tests show that the data set complementation approach eliminates the privacy risk in most cases and always improves privacy protection significantly when dummy values are used.

29 CONCLUSION We introduced a new privacy preserving approach, data set complementation, which preserves the utility of training data sets for decision tree learning. Privacy preservation via data set complementation fails if all of the training data sets are leaked, because the data set reconstruction algorithm is generic. Therefore, further research is required to overcome this limitation.

30 Future work and references Future work: further studies are needed to optimize 1) the storage size of the unrealized samples, and 2) the processing time when generating a decision tree from those samples. References: 1) M. Shaneck and Y. Kim, “Efficient Cryptographic Primitives for Private Data Mining,” Proc. 43rd Hawaii Int'l Conf. System Sciences (HICSS), pp. 1-9, 2010. 2) L. Liu, M. Kantarcioglu, and B. Thuraisingham, “Privacy Preserving Decision Tree Mining from Perturbed Data,” Proc. 42nd Hawaii Int'l Conf. System Sciences (HICSS '09), 2009.

31 THANK YOU

