Variable Reduction for Predictive Modeling with Clustering

Slides:



Advertisements
Similar presentations
Unido.org/statistics International workshop on industrial statistics 8 – 10 July, Beijing Non response in industrial surveys Shyam Upadhyaya.
Advertisements

Lesson 10: Linear Regression and Correlation
Forecasting Using the Simple Linear Regression Model and Correlation
Ridge Regression Population Characteristics and Carbon Emissions in China ( ) Q. Zhu and X. Peng (2012). “The Impacts of Population Change on Carbon.
Simple Regression. Major Questions Given an economic model involving a relationship between two economic variables, how do we go about specifying the.
Sampling u Randomization u Stratification u Proportional u Clusters u Systematic u Purposive.
Chapter 1 Introduction to Clustering. Section 1.1 Introduction.
Statistical Methods Chichang Jou Tamkang University.
L16: Micro-array analysis Dimension reduction Unsupervised clustering.
© 2000 Prentice-Hall, Inc. Chap Forecasting Using the Simple Linear Regression Model and Correlation.
Introduction to Data Mining Data mining is a rapidly growing field of business analytics focused on better understanding of characteristics and.
Multiple Regression Dr. Andy Field.
CAS Predictive Modeling Seminar Evaluating Predictive Models Glenn Meyers ISO Innovative Analytics October 5, 2006.
Correlation and Linear Regression
Chapter 11 Simple Regression
Linear Regression and Correlation
DATA MINING Team #1 Kristen Durst Mark Gillespie Banan Mandura University of DaytonMBA APR 09.
Data Cleansing for Predictive Models: The Next Level Roosevelt C. Mosley, Jr., FCAS, MAAA CAS Ratemaking & Product Management Seminar Philadelphia, PA.
Slide 1 Estimating Performance Below the National Level Applying Simulation Methods to TIMSS Fourth Annual IES Research Conference Dan Sherman, Ph.D. American.
Data Reduction. 1.Overview 2.The Curse of Dimensionality 3.Data Sampling 4.Binning and Reduction of Cardinality.
Statistical Methods Statistical Methods Descriptive Inferential
2007 CAS Predictive Modeling Seminar Estimating Loss Costs at the Address Level Glenn Meyers ISO Innovative Analytics.
6. Evaluation of measuring tools: validity Psychometrics. 2012/13. Group A (English)
Unsupervised learning introduction
Stat 112 Notes 9 Today: –Multicollinearity (Chapter 4.6) –Multiple regression and causal inference.
Marketing Research Aaker, Kumar, Day and Leone Tenth Edition Instructor’s Presentation Slides 1.
Dimension Reduction in Workers Compensation CAS predictive Modeling Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc.
CHAPTER 5 CORRELATION & LINEAR REGRESSION. GOAL : Understand and interpret the terms dependent variable and independent variable. Draw a scatter diagram.
Multiple Regression I 1 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Chapter 4 Multiple Regression Analysis (Part 1) Terry Dielman.
Glenn Meyers ISO Innovative Analytics 2007 CAS Annual Meeting Estimating Loss Cost at the Address Level.
Correlation & Regression Analysis
Multivariate Analysis and Data Reduction. Multivariate Analysis Multivariate analysis tries to find patterns and relationships among multiple dependent.
Lecture 4 Ways to get data into SAS Some practice programming
Matching. Objectives Discuss methods of matching Discuss advantages and disadvantages of matching Discuss applications of matching Confounding residual.
1 ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION We have seen that the variance of a random variable X is given by the expression above. Variance.
CORRELATION ANALYSIS.
Université d’Ottawa / University of Ottawa 2001 Bio 8100s Applied Multivariate Biostatistics L11.1 Lecture 11: Canonical correlation analysis (CANCOR)
©The McGraw-Hill Companies, Inc. 2008McGraw-Hill/Irwin Linear Regression and Correlation Chapter 13.
FACTOR ANALYSIS.  The basic objective of Factor Analysis is data reduction or structure detection.  The purpose of data reduction is to remove redundant.
Methods of multivariate analysis Ing. Jozef Palkovič, PhD.
Dr. Justin Bateh. Point of Estimate the value of a single sample statistics, such as the sample mean (or the average of the sample data). Confidence Interval.
Multivariate Analysis - Introduction. What is Multivariate Analysis? The expression multivariate analysis is used to describe analyses of data that have.
Correlation and Linear Regression
Unsupervised Learning
Chapter 12 – Discriminant Analysis
PREDICT 422: Practical Machine Learning
Machine Learning Clustering: K-means Supervised Learning
Multiple Regression Prof. Andy Field.
Correlation and Simple Linear Regression
Multivariate Analysis - Introduction
Regression 10/29.
Dimension Reduction in Workers Compensation
Evaluation of measuring tools: validity
Understanding Standards Event Higher Statistics Award
K Nearest Neighbor Classification
MEASURES OF CENTRAL TENDENCY
CORRELATION ANALYSIS.
LESSON 4: MEASURES OF VARIABILITY AND PROPORTION
Additional notes on random variables
Additional notes on random variables
Product moment correlation
5 points piece Homework Check
Multivariate Analysis - Introduction
Statistics Review (It’s not so scary).
Canonical Correlation Analysis
ANalysis Of VAriance Lecture 1 Sections: 12.1 – 12.2
Numerical Descriptive Measures
MGS 3100 Business Analysis Regression Feb 18, 2016
Unsupervised Learning
Presentation transcript:

Variable Reduction for Predictive Modeling with Clustering Presentation Title Variable Reduction for Predictive Modeling with Clustering Casualty Actuarial Society Seminar on Ratemaking Bob Sanche March 14, 2006 Sunday, July 01, 2018

Data Storage and Amount of Predictive Variables Contents Presentation Title Data Storage and Amount of Predictive Variables Predictive Modeling and Model Generalization Dimension Reduction Goal of Variable Clustering What Is Clustering? Variable Clustering When Does Variable Clustering Occur During the Predictive Modeling Process? Example Sunday, July 01, 2018

Data Storage and Predictive Variables Presentation Title Data storage economics “In 1956, IBM sold its first magnetic disk system, RAMAC (Random Access Method of Accounting and Control). It used 50 24-inch metal disks, with 100 tracks per side. It could store 5 megabytes of data and cost $10,000 per megabyte. (As of 2005, disk storage costs less than $1 per gigabyte).” http://en.wikipedia.org/wiki/History_of_computing_hardware 1 gigabyte = 130 numeric characteristics for 1 million policies for $1 Sunday, July 01, 2018

Data Storage and Predictive Variables Presentation Title New data sources Data warehousing External sources (demographics, meteorological, etc.) Policyholder, household or company information Agency Other Data storage economics and availability of data Increase the number of predictive variables Data mining paradigm Additional inputs add lift to the model Sunday, July 01, 2018

Predictive Modeling and Model Generalization Presentation Title A predictive model is created from a number of predictors that are likely to influence future results Y = α1X1 + … + αnXn + β n is universe of all available predictors Goal of predictive modeling Obtain coefficients for α’s and β Predictive of future results Model generalizes well over time Model complexity → Overfitting Sunday, July 01, 2018

Need to reduce model complexity Dimension reduction Presentation Title Need to reduce model complexity Dimension reduction Clustering (K-Means) Rows Variable clustering Columns Alternatives to variable clustering PCA and factor analysis Difficult to interpret and deploy Sunday, July 01, 2018

Goal of Variable Clustering Presentation Title Reduce the number of variables More difficult to identify irrelevant variables than redundant variables Y = α1X1 + … + αmXm + β where m<n Why do we want to reduce the number of variables? Improve efficiency of predictive modeling process Time to develop the model Interpretation of the results Reduce variance of the model estimates Demographics example Average household size, median household size, proportion of families, median vehicles per household could be replaced by only one variable Sunday, July 01, 2018

Divide set of data (variable) into groups of similar characteristics What Is Clustering? Presentation Title “Cluster Analysis is a set of methods for constructing a sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual” B.S. Everitt , The Cambridge Dictionary of Statistics, 1998 Divide set of data (variable) into groups of similar characteristics Unsupervised learning technique Useful only when there is redundancy in the data Sunday, July 01, 2018

Similarity measured by distance or correlation metrics What Is Clustering? Presentation Title Similarity measured by distance or correlation metrics Types of clustering Hierarchical clustering Agglomerative Divisive Partitive (optimization) clustering Sunday, July 01, 2018

Presentation Title Variable Clustering Variable clustering divides a set of numeric* variables into clusters. A cluster representing a large set of variables can be replaced by a single member (cluster representative). * Hamming distance for categorical variables Selection of the cluster representative Sunday, July 01, 2018 Intuitively, we want the cluster representative to be as closely correlated to its own cluster (R2own1) and as uncorrelated to the nearest cluster (R2nearest0). Therefore, the optimal representative of a cluster is a variable where 1-R2 ratio tends to zero

SEMMA process for data mining Sample Explore Modify Model Assess When Does Variable Clustering Occur During the Predictive Modeling Process? Presentation Title SEMMA process for data mining Sample Explore Modify Model Assess Sunday, July 01, 2018

Example 3 CLUSTERS R-SQUARED WITH 1-R2 Ratio Cluster Variable Presentation Title 3 CLUSTERS R-SQUARED WITH 1-R2 Ratio Cluster Variable Own Cluster Next Closest Cluster 1 Rain Days 0.5995 0.0426 0.4183 Snow Days 0.8976 0.0317 0.1095 Annual Snow 0.8940 0.0314 Cluster 2 Population Density 0.9804 0.0228 0.0201 Car Density 0.0113 0.0199 Cluster 3 Population Growth 0.6459 0.0911 0.3896 Legal Expenditures 0.0013 0.3546 Sunday, July 01, 2018

Example Name of Variable or Cluster Population Density Presentation Title Name of Variable or Cluster Population Density Legal Expenditures Snow Days Annual Snow Accumulation Rain Days Sunday, July 01, 2018 Population Growth Car Density 1.00 0.95 0.90 0.85 0.80 0.75 0.70 Proportion of Variance Explained

Need for dimension reduction for model generalization Conclusion Presentation Title Need for dimension reduction for model generalization Variable clustering reduces the amount of variables available for predictive modeling (GLM, etc.) The predictive modeling process using variable clustering Avoid overfitting Increases interpretability Reduces time for modeling Sunday, July 01, 2018