
[Figure: datasets arranged by Number of Variables (A Few, Handful, Many) and by Time Stamps (One Time Snapshot, Many, Time Series). Datasets shown: Mobile Phone, Galton Height, Census, Titanic Survivors, Stock Market, Boston Housing, Old Faithful, Gene Array, Climate, MA Schools, Project Management.]
Our datasets have been relatively small, with no time variable.

Comparison of Datasets (values listed as Heights / Housing / Titanic)
# of Variables: 8 / 15 / 11
# Variables Used: 4 / 10 / 11
# Numerical: 3 / 8 / 6
# Categorical: 1 / 2 / 5
# of Observations:
Univariate: Yes
Bivariate - Correlations: Yes
Missing Data: No / Yes - Age
Variable Transformations: Yes - Galton / Potentially / Yes - Age
Data Relationships: Linear / Non-Linear, Partitioned
Regression?: Yes / No
Factor Analysis Relevant?: No / Yes / No
Decision Trees Relevant?: No / Potentially / Yes
Cluster Analysis Relevant?: For 1 Variable / Yes / Potentially
Merge Analytic Models?: No / Potentially / Yes
We can apply similar basic stats to each of these datasets. Depending on the type of data relationships, the newer techniques may (or may not) be applicable. This slide deck walks through the Heights dataset.

Research Question What determines a person’s height?

Hypothesis Brainstorming: Genetics, Nutrition, Immigration / Origins, Disease
Hypotheses:
Sons will be similar to their Dad’s height.
Daughters will be similar to their Mom’s height.

Height Dataset Variables
heights <- read.csv("GaltonFamilies.csv")
Observations: 934
Variables: 8

Dataset Variables and Selection: We only need a subset of the data.
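The selection step can be sketched in base R as follows. This is only an illustration: the small data frame is randomly generated stand-in data, the column names assume the layout of the GaltonFamilies data in the HistData package, and the choice of four columns (three numerical, one categorical) is an assumption about which variables the deck keeps.

```r
# Hypothetical stand-in for the data read from GaltonFamilies.csv.
# Column names assume the HistData::GaltonFamilies layout; values are made up.
set.seed(1)
n <- 20
heights <- data.frame(
  family          = factor(rep(1:5, each = 4)),
  father          = round(rnorm(n, mean = 69, sd = 2.5), 1),
  mother          = round(rnorm(n, mean = 64, sd = 2.3), 1),
  midparentHeight = NA_real_,
  children        = 4L,
  childNum        = rep(1:4, times = 5),
  gender          = factor(sample(c("female", "male"), n, replace = TRUE)),
  childHeight     = round(rnorm(n, mean = 67, sd = 3.5), 1)
)
heights$midparentHeight <- (heights$father + 1.08 * heights$mother) / 2

# Keep only the columns needed for the analysis (assumed selection).
keep <- c("father", "mother", "gender", "childHeight")
heights.sub <- heights[, keep]

# Numerical-only version, useful for correlation matrices.
heights.num <- heights.sub[, sapply(heights.sub, is.numeric)]

str(heights.sub)
```

With the real file, only the `read.csv()` call from the previous slide would replace the generated data frame; the subsetting lines stay the same.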

Histograms: Heights of Father, Mother, and Child appear approximately normal.

Scatterplots: Child Height is somewhat correlated with Father and Mother Heights.

Correlation Matrices
With Categorical Variable (Gender):
library(car)
scatterplotMatrix(heights)
Only Numerical Variables:
library(PerformanceAnalytics)
chart.Correlation(heights.num)
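If the car and PerformanceAnalytics packages are not installed, a rough equivalent of the numerical-only view can be sketched with base R alone. The data below are simulated for illustration (the coefficients used to generate them are made up), and `heights.num` here is a hypothetical stand-in for the deck's numerical subset.

```r
# Simulated stand-in for the numerical columns; generating values are made up.
set.seed(42)
n <- 200
father <- rnorm(n, mean = 69, sd = 2.5)
mother <- rnorm(n, mean = 64, sd = 2.3)
childHeight <- 20 + 0.4 * father + 0.3 * mother + rnorm(n, sd = 2)
heights.num <- data.frame(father, mother, childHeight)

# Pearson correlation matrix, rounded for reading.
cor.mat <- round(cor(heights.num), 2)
print(cor.mat)

# Base-R scatterplot matrix of the same variables.
pairs(heights.num, main = "Heights: pairwise scatterplots")
```

`cor()` gives the numbers and `pairs()` gives the picture; the packaged functions on the slide combine the two into a single chart.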

Children’s Height by Gender: Noticeable difference in heights between genders.

Categorical: Box Plot

Linear Regression Modeling
[Diagram: independent variables (X’s) X1, X2, X3, X4; dependent variable Y; delta]

Comparing Regression Models
[Table rows: Father, Mom, Gender, ChildNum, Intercept, R-squared]
With 4 variables there are 2^4 (16) different combinations. Fortunately, there is an R library, leaps, that can help.

LEAPS Package
Finds the best combination of variables. Starts with 1 variable, then 2, and so on.
Goes through different combinations of variables to find the best ones.
[Plot: axes R-Square and Variable. If highlighted, the variable is in the model; if not highlighted, it is not in the model.]
The three-variable model appears to be the best trade-off between explanation and simplicity.
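The search that leaps automates (its `regsubsets()` function) can be approximated by hand for a problem this small. The sketch below uses only base R and simulated data; the generating coefficients are made up, and `childNum` is deliberately given no effect. It fits every non-empty subset of the four predictors (15 models) and reports the best model of each size.

```r
# Simulated stand-in data; generating coefficients are illustrative only.
set.seed(7)
n <- 300
father   <- rnorm(n, mean = 69, sd = 2.5)
mother   <- rnorm(n, mean = 64, sd = 2.3)
gender   <- factor(sample(c("female", "male"), n, replace = TRUE))
childNum <- sample(1:6, n, replace = TRUE)
childHeight <- 16 + 0.4 * father + 0.3 * mother +
  5 * (gender == "male") + rnorm(n, sd = 2)
d <- data.frame(childHeight, father, mother, gender, childNum)

predictors <- c("father", "mother", "gender", "childNum")

# Every non-empty subset of the predictors: sizes 1, 2, 3, 4.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model per subset and record its fit statistics.
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "childHeight"), data = d)
  s <- summary(fit)
  data.frame(size = length(vars),
             model = paste(vars, collapse = " + "),
             r.squared = s$r.squared,
             adj.r.squared = s$adj.r.squared)
}))

# Best model (highest R-squared) for each number of variables.
best <- do.call(rbind, lapply(split(results, results$size),
                              function(g) g[which.max(g$r.squared), ]))
print(best, row.names = FALSE)
```

R-squared can only rise as variables are added, so the comparison across sizes is better made on adjusted R-squared or on how little the last variable adds, which is the trade-off the slide describes.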

Height Dataset Summary
What determines height?
[Diagram: Genetics, Nutrition, Gender, Heels, Parent’s Height, Child’s Height]
Not able to get data for all our variables!
Gender has the biggest effect.
Parents’ heights also influence a child’s height.
Child Height = Father’s Height * [coefficient missing] + Mother’s Height * 0.29, and if Male then add 5.21 inches.
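A summary equation of this kind is read directly off the fitted coefficients. The following is a minimal sketch on simulated data, so the printed numbers will not match the slide; the generating values and the example parent heights (70 and 65 inches) are assumptions made for illustration.

```r
# Simulated stand-in data; generating values are illustrative only.
set.seed(123)
n <- 500
father <- rnorm(n, mean = 69, sd = 2.5)
mother <- rnorm(n, mean = 64, sd = 2.3)
gender <- factor(sample(c("female", "male"), n, replace = TRUE))
childHeight <- 16 + 0.4 * father + 0.3 * mother +
  5.2 * (gender == "male") + rnorm(n, sd = 2)
d <- data.frame(childHeight, father, mother, gender)

# Three-variable model: both parents' heights plus gender.
fit <- lm(childHeight ~ father + mother + gender, data = d)
coefs <- coef(fit)
print(round(coefs, 2))

# Prediction for a hypothetical son: father 70 in, mother 65 in.
new.child <- data.frame(father = 70, mother = 65,
                        gender = factor("male", levels = levels(d$gender)))

# The same prediction written out as the summary equation.
by.hand <- coefs["(Intercept)"] + coefs["father"] * 70 +
  coefs["mother"] * 65 + coefs["gendermale"]
by.predict <- predict(fit, newdata = new.child)

print(c(by.hand = unname(by.hand), by.predict = unname(by.predict)))
```

The `gendermale` coefficient is the "add this many inches if male" term; `female` is the baseline level and is absorbed into the intercept.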

[Diagram: Types of Analysis by Number of Variables Analyzed. Labels: Applied Stats Class, Predictive Modeling Class, Additional Techniques. Techniques shown: Histograms, Pivot Tables, Correlation Matrices, Regression, Factor Analysis, Cluster Analysis, Decision Trees.]

Factor Analysis on Height Dataset? In this case, FA would not be of any help. There is little correlation among predictors.

For Decision Tree Analysis, what would be the first variable? Linear or partitioned data?

Linear Models vs. Decision Trees: Height variable relationships appear linear. Decision trees would not appear to help.

Decision Tree Output: This model appears less accurate than the regression model, as the output is in discrete values.

Would Cluster Analysis Be Helpful?

Cluster Analysis on Three Variables? Results are not surprising, though it is unclear how to leverage them.

Convert Continuous to Categorical: Height (continuous) becomes a categorical variable with levels S, M, L, XL.

Cluster Analysis of Child Heights: 5 clusters (S, M, L, XL, XXL)
[Plot: children’s heights for each of the cluster centers]
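One way such a five-cluster sizing could be produced is k-means on the single height variable. The deck does not say which clustering method was used, so this is only an assumed approach, shown on simulated heights with base R's `kmeans()`.

```r
# Simulated child heights (inches); values are illustrative only.
set.seed(2024)
childHeight <- c(rnorm(250, mean = 64, sd = 2.4),
                 rnorm(250, mean = 69, sd = 2.6))

# k-means with 5 centers; nstart restarts reduce sensitivity to initialization.
km <- kmeans(childHeight, centers = 5, nstart = 25)

# Cluster ids are arbitrary, so rank the centers to attach ordered size labels.
size.labels <- c("S", "M", "L", "XL", "XXL")
center.rank <- rank(km$centers[, 1])
height.size <- factor(size.labels[center.rank[km$cluster]],
                      levels = size.labels)

print(round(sort(km$centers[, 1]), 1))          # cluster centers, shortest first
print(table(height.size))                       # children per size
print(tapply(childHeight, height.size, range))  # height range per size
```

Because there is only one variable, each cluster is simply an interval of heights, which is why the result works as a continuous-to-categorical conversion.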