Some simple, useful, but seldom taught statistical techniques Larry Weldon Statistics and Actuarial Science Simon Fraser University Nov. 27, 2008 1.

Slides:



Advertisements
Similar presentations
Chapter 12 Inference for Linear Regression
Advertisements

BANISHING THE THEORY-APPLICATIONS DICHOTOMY FROM STATISTICS EDUCATION Larry Weldon Department of Statistics and Actuarial Science Simon Fraser University,
Uncertainty and confidence intervals Statistical estimation methods, Finse Friday , 12.45–14.05 Andreas Lindén.
FTP Biostatistics II Model parameter estimations: Confronting models with measurements.
Statistics 100 Lecture Set 7. Chapters 13 and 14 in this lecture set Please read these, you are responsible for all material Will be doing chapters
/k 2DS00 Statistics 1 for Chemical Engineering lecture 4.
Review of Univariate Linear Regression BMTRY 726 3/4/14.
5/13/ lecture 91 STATS 330: Lecture 9. 5/13/ lecture 92 Diagnostics Aim of today’s lecture: To give you an overview of the modelling cycle,
TigerStat ECOTS Understanding the population of rare and endangered Amur tigers in Siberia. [Gerow et al. (2006)] Estimating the Age distribution.
Regression Analysis Using Excel. Econometrics Econometrics is simply the statistical analysis of economic phenomena Here, we just summarize some of the.
Class 16: Thursday, Nov. 4 Note: I will you some info on the final project this weekend and will discuss in class on Tuesday.
Data Analysis and the Shackles of Statistical Tradition Larry Weldon Statistics and Actuarial Science SFU.
6/3/20151 FROM DATA TO GRAPHS TO WORDS - BUT WHERE ARE THE MODELS? K. Larry Weldon Simon Fraser University Canada …. …. Findings.
BA 555 Practical Business Analysis
1 BA 275 Quantitative Business Methods Residual Analysis Multiple Linear Regression Adjusted R-squared Prediction Dummy Variables Agenda.
Examining Relationship of Variables  Response (dependent) variable - measures the outcome of a study.  Explanatory (Independent) variable - explains.
Nemours Biomedical Research Statistics April 2, 2009 Tim Bunnell, Ph.D. & Jobayer Hossain, Ph.D. Nemours Bioinformatics Core Facility.
Lecture 24: Thurs. Dec. 4 Extra sum of squares F-tests (10.3) R-squared statistic (10.4.1) Residual plots (11.2) Influential observations (11.3,
Linear Regression and Correlation Analysis
Part III: Inference Topic 6 Sampling and Sampling Distributions
7/2/ Lecture 51 STATS 330: Lecture 5. 7/2/ Lecture 52 Tutorials  These will cover computing details  Held in basement floor tutorial lab,
Quantitative Business Analysis for Decision Making Simple Linear Regression.
Ch. 14: The Multiple Regression Model building
Linear Regression Models Powerful modeling technique Tease out relationships between “independent” variables and 1 “dependent” variable Models not perfect…need.
Correlation and Regression Analysis
CS1100: Computer Science and Its Applications Creating Graphs and Charts in Excel.
Chapter 7 Forecasting with Simple Regression
Correlation & Regression Math 137 Fresno State Burger.
Checking Regression Model Assumptions NBA 2013/14 Player Heights and Weights.
Multiple Linear Regression Response Variable: Y Explanatory Variables: X 1,...,X k Model (Extension of Simple Regression): E(Y) =  +  1 X 1 +  +  k.
Regression Harry R. Erwin, PhD School of Computing and Technology University of Sunderland.
Statistical Methods For Engineers ChE 477 (UO Lab) Larry Baxter & Stan Harding Brigham Young University.
Inference for regression - Simple linear regression
9/14/ Lecture 61 STATS 330: Lecture 6. 9/14/ Lecture 62 Inference for the Regression model Aim of today’s lecture: To discuss how we assess.
Analysis of Covariance Harry R. Erwin, PhD School of Computing and Technology University of Sunderland.
10/1/ lecture 151 STATS 330: Lecture /1/ lecture 152 Variable selection Aim of today’s lecture  To describe some further techniques.
Measures of relationship Dr. Omar Al Jadaan. Agenda Correlation – Need – meaning, simple linear regression – analysis – prediction.
1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation.
23-1 Analysis of Covariance (Chapter 16) A procedure for comparing treatment means that incorporates information on a quantitative explanatory variable,
+ Chapter 12: Inference for Regression Inference for Linear Regression.
Chapter 10 Correlation and Regression
Data Mining: Neural Network Applications by Louise Francis CAS Annual Meeting, Nov 11, 2002 Francis Analytics and Actuarial Data Mining, Inc.
Linear correlation and linear regression + summary of tests
Lecture 8 Simple Linear Regression (cont.). Section Objectives: Statistical model for linear regression Data for simple linear regression Estimation.
Using R for Marketing Research Dan Toomey 2/23/2015
+ Chapter 12: More About Regression Section 12.1 Inference for Linear Regression.
Tutorial 4 MBP 1010 Kevin Brown. Correlation Review Pearson’s correlation coefficient – Varies between – 1 (perfect negative linear correlation) and 1.
Lecture 7: Multiple Linear Regression Interpretation with different types of predictors BMTRY 701 Biostatistical Methods II.
Multiple Regression. Simple Regression in detail Y i = β o + β 1 x i + ε i Where Y => Dependent variable X => Independent variable β o => Model parameter.
Lecture 6: Multiple Linear Regression Adjusted Variable Plots BMTRY 701 Biostatistical Methods II.
Applied Quantitative Analysis and Practices LECTURE#25 By Dr. Osman Sadiq Paracha.
V Bandi and R Lahdelma 1 Forecasting. V Bandi and R Lahdelma 2 Forecasting? Decision-making deals with future problems -Thus data describing future must.
Multiple Regression I 1 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Chapter 4 Multiple Regression Analysis (Part 1) Terry Dielman.
Lecture 6: Multiple Linear Regression Adjusted Variable Plots BMTRY 701 Biostatistical Methods II.
Data Science and Big Data Analytics Chap 3: Data Analytics Using R
Chapter 8: Simple Linear Regression Yang Zhenlin.
McGraw-Hill/IrwinCopyright © 2009 by The McGraw-Hill Companies, Inc. All Rights Reserved. Simple Linear Regression Analysis Chapter 13.
Using Graphs for Information Extraction Gasoline Consumption Example Jan 12, 2010STAT 1001.
Statistics: Unlocking the Power of Data Lock 5 STAT 250 Dr. Kari Lock Morgan Multiple Regression SECTIONS 10.1, 10.3 Multiple explanatory variables (10.1,
Tutorial 5 Thursday February 14 MBP 1010 Kevin Brown.
Data Mining: Neural Network Applications by Louise Francis CAS Convention, Nov 13, 2001 Francis Analytics and Actuarial Data Mining, Inc.
Bootstrapping James G. Anderson, Ph.D. Purdue University.
Introduction Many problems in Engineering, Management, Health Sciences and other Sciences involve exploring the relationships between two or more variables.
Analysis by Simulation and Graphics
Chapter 14 Introduction to Multiple Regression
Correlation & Regression
Analysis by Simulation and Graphics
Regression Analysis Week 4.
1/18/2019 ST3131, Lecture 1.
Presentation transcript:

Some simple, useful, but seldom taught statistical techniques Larry Weldon Statistics and Actuarial Science Simon Fraser University Nov. 27,

Outline of Talk Why simple techniques overlooked Simplest kernel estimation and smoothing Simplest multivariate data display Simplest bootstrap Expanded use of Simulation Conclusion Some simple, useful, but seldom taught statistical techniques

Evolution of Good Ideas in Stats New idea proposed by researcher (e.g. Loess smoothing, bootstrap, coplots) Developed by researchers to optimal form (usually mathematically complex) Considered too advanced for undergrad Undergrad courses do not include good ideas

What new techniques can undergrad stats use? Simplest kernel estimation and smoothing Simplest multivariate data display Simplest bootstrap Expanded use of Simulation

What new techniques can undergrad stats use? Simplest kernel estimation and smoothing Simplest multivariate data display Simplest bootstrap Expanded use of Simulation

Kernel Estimation Windowgrams and Non-Parametric Smoothers

Histogram

Primitive “Windowgram”

Estimate density at grid points 9 d

Windowgram Just records frequency withing “d” of grid point. Like using rectangular “window” 0 1 Grid point Weight Function d

Primitive “Windowgram” go to R Count at grid Join grid counts Rescale to area 1

Extension to “Kernel” 0 1 Grid point Weight Function Simple Concept? Weighted Count?

Advantage of Kernel Discussion Bias - Variance trade off - can demonstrate. (Too wide window - high bias, low var Too narrow window - low bias, high var ) Idea extends to smoothing data sequence

Concept Transition: Count -> Average Count of local data -> Average of local data (X data only) (X,Y data) Simplest Example: Moving Average of Ys (X equi-spaced)

Gasoline Consumption Each Fill - record kms and litres of fuel used Smooth ---> Seasonal Pattern …. Why? 15

Pattern Explainable? Air temperature? Rain on roads? Seasonal Traffic Pattern? Tire Pressure? Info Extraction Useful for Exploration of Cause Smoothing was key technology in info extraction 16

Recap of Non-Parametric Smoothing Very useful in data analysis practice Easy to understand and explain More optimal procedures available in software Good topic for intro course

What new techniques can undergrad stats use? Simplest kernel estimation and smoothing Simplest multivariate data display Simplest bootstrap Expanded use of Simulation

Multivariate Data Display Profile Plots Augmented Scatter Plots Star Plots Coplots

Profile Plot [1] "Density" "Age" [3] "Wgt" "Hgt" [5] "Neck" "Chest" [7] "Abdomen" "Hip" [9] "Thigh" "Knee" [11] "Ankle" "Biceps" [13] "Forearm" "Wrist"

Augmented Scatter Plot

Star Plot (for many variables)

Star Plot - More Detail

Coplot (3 Variables here) Interaction? Fig. 7 Coplots of Abrasion Loss vs Tensile Strength Given Hardness low hardnessmedium hardnesshigh hardness

Ethanol Example (Cleveland) Shows how graphical analysis sometimes more informative than regression analysis. 25

26 Coplot: Visualizing an Interaction

Exercise for 4rth-yr Students Use regression analysis to model interaction in the Ethanol data. Tough to do! After many modeling steps (introducing powers and interactions of predictors and checking residual plots) ….--->

Call: lm(formula = log(NOX) ~ ER + CR + ER.SQ + CR.SQ + ER.CB + ER.QD + ER * CR) (Intercept) 2.080e e * ER e e *** CR 1.665e e e-05 *** ER.SQ 3.066e e e-05 *** CR.SQ e e ER.CB e e e-06 *** ER.QD 7.201e e e-06 *** ER:CR e e e-09 *** --- Signif. codes: 0 ‘***’ ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: on 80 degrees of freedom Multiple R-Squared: 0.957,Adjusted R-squared: F-statistic: on 7 and 80 DF, p-value: < 2.2e-16 28

29 As ER increases, NOX slope against CR decreases.  Interaction Negative

Why “Plots of Multivariate Data?” Allows novice to see data complexity Correct “Model” not an issue Easy to understand and explain

What new techniques can undergrad stats use? Simplest kernel estimation and smoothing Simplest multivariate data display Simplest bootstrap Expanded use of Simulation

Resampling The bootstrap General resampling strategies

The Bootstrap Population: Digits R. Sample of Sample Mean is Precision?

Resample! Resample without replacement Take mean of resample Repeat many times and get SD of means SD of means is 32.8 What did theory say? Sample SD = n=25 so ….. Est SD of means = Sample SD/√n = 33.4 Go to R

Try a harder problem 90th percentile of population? Data men’s BMIs Estimate 90th percentile = 29.7 Precision? Resample SD of sample percentiles is 1.4 Easy, useful.

Why does bootstrap work? 36

Why does the bootstrap work? 37

The Bootstrap Easy, Useful Should be included in intro courses.

What new techniques can undergrad stats use? Simplest kernel estimation and smoothing Simplest multivariate data display Simplest bootstrap Expanded use of Simulation

Expanding the Use of Simulation Bimbo Bakery Example Data: 53 weeks, 6 days per week Deliveries and Sales of loaves of bread Question: Delivery Levels Optimal? Additional Data needed: cost and price of loaf, cost of overage, cost of underage 40

Data Mondays for 53 weeks deliveries / sales 142 / / / / / 111 and 48 more pairs. Also: Economic Parameters cost per loaf = $0.50 sale price per loaf = $1.00 revenue from overage? cost of underage? Profit each day

Method of analysis – step 1 Guess demand distribution for each day Simulate demand, use delivery to infer simulated sales each day (one outlet) Compute simulated daily sales for year Compare with actual daily sales (ecdf) Adjust guess and repeat to estimate demand 42

Compare simulated (red) and actual(blue) sales 43 m=110, s=20m=110, s=30

Method of analysis –step 2 Use estimated demand to compute profit Redo with various delivery adjustments (%) Select optimal delivery adjustment 44

Use fitted demand m=110, s=30 to compute profit (many simul’ns) 45 Max Profit if 38% increase in deliveries

Methods Used? common sense ecdf simulation graphics 46

Summary Software ease suggests useful techniques e.g. nonparametric smoothing, multivariate plots, bootstrap. Simulation – a simple tool useful even for for complex problems Simple tools in education –> wider use Better reputation for statistics!

The End More info on topics like this