ITCS 6265/8265 Project, Group 5: Gabriel Njock, Tanusree Pai, Ke Wang

Outline
Domain
Problem Statement and Objective
Data Description
Problem Characteristics and Method Used
Implementation
Data Formatting
Feature Selection
Boosting and Derived Attributes
Testing & Results
References

Domain: COIL Challenge
Direct mailings to a company's potential customers - "junk mail" to many - can be a very effective way for the company to market a product or a service. However, as we all know, much of this junk mail is of no interest to the people who receive it. Most of it ends up thrown away, not only wasting the money that the company spent on it, but also filling up landfill sites or needing to be recycled. If the company had a better understanding of who its potential customers were, it would know more accurately whom to send mail to, so some of this waste and expense could be reduced.
Motivation for data mining: cost reduction, realized by targeting only a portion of the potential customers.

Problem Statement
The data used here represents a frequently occurring problem: analysis of data about the customers of a company, in this case an insurance company. Information about customers consists of 86 variables and includes product usage data and socio-demographic data derived from zip codes. The data was supplied by the Dutch data mining company Sentient Machine Research, and is based on real-world business data.

COIL Challenge Objective
The competition consists of two tasks:
1. Predict which customers are potentially interested in a caravan insurance policy.
2. Describe the actual or potential customers, and possibly explain why these customers buy a caravan policy.

Project Objective
Propose a solution that will allow us to predict whether a customer is interested in a caravan insurance policy.
Find the subset of customers whose probability of purchasing a caravan insurance policy is above some boundary probability.

Data Description: Training Set
5822 customer records, 86 attributes. The attributes can be broadly categorized as follows:
Socio-demographic (43)
Insurance policy related (42)
Contribution-per-policy type (21)
Number-of-policies (21)
Decision attribute “Caravan”

Data Description: Test Set
4000 customer records, 85 attributes. The Caravan attribute is missing from the test set, so its value has to be predicted.
Note: Attribute values in both sets were pre-discretized.

Problem Characteristics
The problem reduces to a classification analysis of customers. There are two classes: customers who are interested in purchasing a Caravan policy and customers who are not.
The learning of the classification model is supervised because the training set provides the decision attribute values.

Method Used: Naive Bayesian Classification
Bayesian classifiers are statistical classifiers [1] used to predict class membership probabilities.
Bayes Theorem
If X is a data sample whose class label is unknown and H is some hypothesis such that X belongs to class C, then the probability that the hypothesis H holds given the observed data sample X is denoted P(H|X) and is given by Bayes theorem as
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$

Naive Bayes Classification
The naive Bayesian classifier is based on Bayes theorem:
1. Each data sample is represented by an n-dimensional feature vector, X = (x1, x2, ..., xn), depicting n measurements made on the sample from n attributes, respectively A1, A2, ..., An. For our problem the data set has 86 attributes, including the decision attribute.
2. If there are m classes, C1, C2, ..., Cm, then given an unknown data sample X (i.e. one having no class label), the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naive Bayesian classifier assigns an unknown sample X to the class Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes theorem,
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$

Naive Bayes Classification
3. As P(X) is constant for all classes, only P(X | Ci) P(Ci) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, i.e., P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X | Ci). Otherwise we maximize P(X | Ci) P(Ci).
4. In order to reduce computation in evaluating P(X | Ci), the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another given the class label, so that P(X | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci).
5. To classify an unknown sample X, P(X | Ci) P(Ci) is evaluated for each class Ci. Sample X is then assigned to the class Ci if and only if P(X | Ci) P(Ci) > P(X | Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i. In other words, it is assigned to the class Ci for which P(X | Ci) P(Ci) is the maximum.
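The classification steps above can be sketched in a few lines of Python. This is only an illustrative sketch for discrete attributes, not the Weka implementation used in the project. The function names, the toy data and the Laplace smoothing parameter `alpha` are assumptions introduced here; smoothing is added so that an attribute value unseen in a class does not force the whole product to zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and, per class, the frequency of each attribute value."""
    class_counts = Counter(labels)
    # value_counts[(attribute index, class)][value] -> number of training rows
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for k, value in enumerate(row):
            value_counts[(k, label)][value] += 1
    return class_counts, value_counts

def posterior(row, class_counts, value_counts, n_values, alpha=1.0):
    """Return P(Ci | X) for every class Ci.

    n_values[k] is the number of distinct values attribute k can take.
    alpha is a Laplace smoothing constant (an assumption, not from the slides).
    """
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(Ci)
        for k, value in enumerate(row):
            count = value_counts[(k, c)][value]
            # smoothed estimate of P(x_k | Ci); the product uses the naive assumption
            score *= (count + alpha) / (n_c + alpha * n_values[k])
        scores[c] = score
    norm = sum(scores.values())  # plays the role of P(X)
    return {c: s / norm for c, s in scores.items()}

# Hypothetical toy data: two binary attributes, class 1 = "interested".
rows = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 0)]
labels = [1, 1, 0, 0, 1]
class_counts, value_counts = train_naive_bayes(rows, labels)
probs = posterior((1, 0), class_counts, value_counts, n_values=[2, 2])
predicted = max(probs, key=probs.get)  # class with the highest posterior
```

Dividing by the sum of the scores turns the products P(X | Ci) P(Ci) into posterior probabilities, which is what a cut-off on the predicted probability requires.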

Naive Bayes Classification
Advantages
1. Comparable in performance with decision trees and neural networks.
2. High accuracy and speed when applied to large databases.
3. In theory, Bayesian classifiers have the minimum error rate of all classifiers, provided their assumptions hold.
Disadvantages
1. It makes certain assumptions, notably class conditional independence of the attributes, which may lead to inaccuracy when they do not hold.

Implementation
Data Formatting and Tools Used:
The data set (training as well as test) was available as a simple text file with space-separated values.
A database (COILDB.mdb) was created with two tables (COILDB_TRAIN and COILDB_TEST) having appropriate column definitions for all attributes. Data from the text file was loaded into the tables.
Software used: MS Access
Purpose: allow executing SQL queries for data analysis

Implementation
Data from the text file was loaded into spreadsheets.
Software used: MS Excel, Analyse-It
Purpose: allow statistical analysis (histogram, correlation) with graphical output
.arff files were created.
Software used: Weka
Purpose: use Bayesian classification for machine learning as well as testing

Implementation: Feature Selection
The analysis to determine the relevance of each attribute was carried out in two steps:
1. Analyze the relevance of the demographical attributes.
2. Analyze the relevance of the non-demographical attributes.

Feature Selection – Demographical Attributes
The attribute selection feature in Weka was used to rank the demographical attributes according to information gain. The 4 demographical attributes with the highest information gain values were Customer Type (Mostype), Customer Subtype (Moshoofd), Average Income (Minkgem) and Purchasing Power Class (Mkoopla).
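For readers unfamiliar with the measure, the ranking by information gain can be sketched as follows. This is a plain-Python illustration for discrete attributes and a discrete class, not the Weka attribute selection code; the function names and the toy columns are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the class minus the expected entropy after splitting on the attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rank_by_information_gain(columns, labels):
    """Return attribute names sorted from highest to lowest information gain."""
    return sorted(columns, key=lambda name: information_gain(columns[name], labels),
                  reverse=True)

# Hypothetical toy columns: "informative" separates the classes, "noise" does not.
labels = [0, 0, 1, 1]
columns = {"noise": [0, 1, 0, 1], "informative": [0, 0, 1, 1]}
ranking = rank_by_information_gain(columns, labels)
```

An attribute that splits the records into groups with purer class labels gets a higher gain, which is why it is a natural first filter before testing attribute combinations with the classifier.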

Feature Selection – Demographical Attributes
Simple Naive Bayesian classification was run with different combinations of the 4 demographical attributes, together with the non-demographical attributes, to determine which combination of these four attributes would yield the best accuracy. The percentages of correctly and incorrectly classified instances were then compared across all the combined attribute groups.

A correlation analysis was also conducted on the 4 demographical attributes. The figure shows the Customer Type and Customer Subtype attributes, which have a Pearson correlation coefficient of 0.99.
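The Pearson correlation coefficient used in this analysis can be computed as in the sketch below. The project used Excel and Analyse-It for this step; the Python function and the toy values are assumptions made purely for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical values: the second column is an exact linear function of the first.
r_same = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_opposite = pearson([1, 2, 3, 4], [8, 6, 4, 2])
```

A coefficient close to 1 or -1 means the two attributes carry nearly the same information, so keeping both adds little to the classifier and conflicts with the independence assumption of naive Bayes.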

Feature Selection – Demographical Attributes
Based on the correlation analysis between the 4 demographical attributes and on the comparison of the percentages of correctly and incorrectly classified instances, it was decided to retain only 2 of the 43 demographical attributes.
Average Income (Minkgem) Accuracy:
Purchasing Power Class (Mkoopla) Accuracy:
All attributes Accuracy:

Feature Selection – Policy Attributes
The 42 non-demographical attributes were mainly insurance policy related and of two types:
Contribution-per-policy attributes
Number-of-policies attributes
Preliminary Analysis:
1. The contribution-per-policy and number-of-policies attributes were highly correlated.
2. For 37 out of the 43 policy related attributes (including the caravan policy ownership attribute), more than 90% of the records have only 1 value (mainly 0). These are sparsely used attributes.
3. The vast majority of customers buy mostly the fire, car and third party insurance policies.
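The sparsity check in point 2 of the preliminary analysis could be reproduced along the following lines. The project ran this kind of analysis with SQL queries and spreadsheets; the Python functions, the column names and the toy data here are hypothetical.

```python
from collections import Counter

def dominant_share(values):
    """Fraction of records that hold the single most common value."""
    return Counter(values).most_common(1)[0][1] / len(values)

def sparse_attributes(columns, threshold=0.9):
    """Names of attributes where one value covers more than `threshold` of the records."""
    return [name for name, values in columns.items()
            if dominant_share(values) > threshold]

# Hypothetical columns: "rare_policy" is 0 for 95% of records, "common_policy" is not.
columns = {
    "rare_policy": [0] * 95 + [1] * 5,
    "common_policy": [0] * 50 + [1] * 50,
}
sparse = sparse_attributes(columns)
```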

Boosting and Deriving Attributes
For each pair of attributes (contribution-per-policy, number-of-policies) we performed two kinds of analysis:
1. Determine the correlation factor between the two attributes.
2. Derive the Total Contribution attribute, which is the product of the contribution from a policy and the number of those policies.
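Deriving the Total Contribution attribute amounts to an element-wise product of the two columns of a pair. The sketch below assumes each column is a list of the pre-discretized integer codes, so the product is a product of codes rather than of monetary amounts; the function name and the sample values are hypothetical.

```python
def total_contribution(contribution, count):
    """Element-wise product of a contribution column and its policy-count column."""
    if len(contribution) != len(count):
        raise ValueError("columns must have the same number of records")
    return [p * a for p, a in zip(contribution, count)]

# Hypothetical values for one (contribution-per-policy, number-of-policies) pair.
contribution_codes = [6, 0, 5, 0]
policy_counts = [1, 0, 2, 0]
derived = total_contribution(contribution_codes, policy_counts)
```

Replacing a highly correlated pair with a single derived column also reduces the redundancy that the naive independence assumption handles poorly.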

Boosting and Deriving Attributes
Simple Naive Bayes classification was run using the derived attributes, and the product attribute was found to give a higher accuracy than the two original attributes individually. The three derived attributes that made a significant difference were:
1. Car policies (PPERSAUT, APERSAUT)
2. Fire policies (PBRAND, ABRAND)
3. Private third party insurance policies (PWAPART, AWAPART)

Boosting and Deriving Attributes
The classification was performed again using all combinations of the derived attributes, to determine which of the three derived attributes had to be retained. We compared the percentages of correctly and incorrectly classified instances.
The highest accuracy and lowest error were found when using all three derived attributes.
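Enumerating the attribute combinations to test is straightforward with the standard library. The sketch below only generates the candidate subsets; the labels for the three derived attributes are hypothetical, and training and scoring a classifier on each subset is left out.

```python
from itertools import combinations

def nonempty_subsets(items):
    """All non-empty combinations of the given items, smallest subsets first."""
    return [combo
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]

# Hypothetical labels for the three derived Total Contribution attributes.
derived_attributes = ["car_total", "fire_total", "third_party_total"]
candidates = nonempty_subsets(derived_attributes)
```

With three derived attributes there are 7 non-empty subsets, so an exhaustive comparison of classification accuracy is cheap.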

Feature Selection – Policy Attributes
The number of non-demographical attributes was then reduced from 42 to 39 by replacing 6 attributes with the 3 derived attributes. These attributes were combined with the Average Income and Purchasing Power Class attributes, individually as well as together.
Accuracy in each case:
Avg. Income & Policy Attributes (with 3 derived):
Purchasing Power Class & Policy Attributes (with 3 derived):
Avg. Income, Purchasing Power Class & Policy Attributes (3 derived):

Feature Selection – Policy Attributes
We ranked the remaining attributes again according to information gain to test their relevance. The attributes with significant information gain were related to Boat policies and SSN policies.
Correlation analysis as well as accuracy analysis with classification was used to determine the final set of attributes.
Accuracy reached with 6 attributes (Purchasing Power, 3 derived attributes representing the contribution from Car, Fire and Private Third Party policies, Number of Boat Policies and Number of SSN Policies) was:

Testing & Results
After the model was trained, the test data was used to predict the subset of customers likely to purchase the CARAVAN policy. A cut-off probability of 80% was used.
We obtained 115 records with 80% or higher probability of purchasing the CARAVAN policy.
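Applying the cut-off is a simple filter over the predicted class probabilities. The sketch below assumes a list holding, for each test record, the predicted probability of the "interested" class; the function name and the sample probabilities are hypothetical.

```python
def select_prospects(probabilities, cutoff=0.8):
    """Indices of records whose predicted purchase probability is at least `cutoff`."""
    return [index for index, p in enumerate(probabilities) if p >= cutoff]

# Hypothetical predicted probabilities for five test records.
predicted_probabilities = [0.10, 0.80, 0.95, 0.79, 0.40]
selected = select_prospects(predicted_probabilities)
```

Raising the cut-off shrinks the mailing list and lowers the cost, while lowering it reaches more potential buyers; the 80% value is the choice reported on the slide.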

Conclusion
From a set of 4000 customer records, our classification analysis predicted only 115 with a probability of purchasing the Caravan policy of 80% or higher.
A significant reduction in cost can be obtained by targeting only those customers.

References
[1] The Insurance Company (TIC) Benchmark: The CoIL Challenge Report.
[2] Sentient Machine Research.