Introduction to Multivariate Data Analysis Pekka Malo 30E00500 – Quantitative Empirical Research Spring 2016.

1 Introduction to Multivariate Data Analysis Pekka Malo 30E00500 – Quantitative Empirical Research Spring 2016

2 Agenda
– Course schedule and practical arrangements
– Basic concepts of multivariate analysis
– Selecting a multivariate technique
– Guidelines for multivariate analysis
– Tutorial: Missing value analysis
04.01.16 Missing Value Analysis

3 What is Multivariate Data Analysis?
Analyzing situations when you have
– Many independent (i.e. explanatory) variables and/or
– Many dependent (i.e. response/explained) variables
– Varying degrees of correlation between variables
Multivariate statistics ~ statistical methods that simultaneously analyze several measurements on each data case (e.g. individual) under investigation.

4 Why Multivariate Statistics?
Difficulty of addressing complicated research questions with univariate tools
Several drivers for increasing popularity, e.g.
– Availability of nicely packaged software
– Greater complexity of contemporary research
– Large amounts of data
– Emergence of the data mining perspective (finding unforeseen patterns and associations)
Measurement; Hypothesis testing; Explanation & prediction

5 Qualitative & Quantitative Research
[Diagram labels: Quantitative; Qualitative; Confirmatory; Exploratory]

6 Confirmatory Analysis
[Flow diagram with the elements: Theory / Pre-Knowledge; Hypothesis; Data / Random Sample; Statistical Analysis; Verification; New Hypothesis / Theory; Population]

7 Exploratory Analysis
[Flow diagram with the elements: New Theory / Invention; Data / Random Sample; Statistical Analysis; New Hypothesis / Theory; Population; New Data (Random); Verification; Data Analysis]

8 Experimental vs. Non-experimental Research
Experimental research
– Researcher has control over the levels of at least one independent variable (IV)
– Definition of levels
– Implementation
– Random assignment
– Control over other influential factors
– Attempt to create populations by “treating” subgroups from an originally homogeneous group differently
– Statistical tests to examine whether the treatment had an effect (i.e. do the samples still come from the same population?)
Non-experimental research
– Researcher cannot control the assignment of subjects to the levels of the IV(s)
– Difficult to attribute causality to an IV
– Most multivariate techniques have been developed for non-experimental research
– Investigation of relationships among variables in some predefined population (correlational / survey research)

9 Part I: Some Useful Concepts

10 Data Types and Measurement Scales
Data
– Qualitative (non-metric)
■ Nominal (categorization)
■ Ordinal (rank order)
– Quantitative (metric)
■ Interval (differences)
■ Ratio

11 Measurement Error
Measurement error ~ the degree to which observed values are not representative of the true values (all variables tend to have some error)
Examples of potential causes
– Data entry errors
– Imprecise measurement
– Inability of respondents to provide accurate information
Affects observed relationships and reduces the strength of multivariate techniques

12 Validity and Reliability: Reducing Measurement Error
Validity ~ the degree to which a measure accurately represents what it is expected to
– Do we understand the target of measurement?
– Are we asking the right questions?
Reliability ~ the degree to which the observed variable measures the “true” value and is “error free”
– Does the measure produce consistent results?

13 Statistical Significance and Power
Type I error: H0 is true, but H1 is accepted (α risk)
Type II error: H1 is true, but H0 is accepted (β risk)
Power (1 - β): probability of rejecting H0 when it is false
                      H0 true              H0 false
Fail to reject H0     1 - α                β (Type II error)
Reject H0             α (Type I error)     1 - β (power)
Ref: Hair et al.

14 Determinants of Statistical Power
Effect size
– Magnitude of the effect of interest
– Larger effects are easier to find
Alpha
– Choosing a strict (small) alpha reduces power
– Need to strike a balance between the level of alpha risk and the resulting power
Sample size
– The larger the sample size, the higher the power
– Very large samples can also lead to oversensitivity (even trivially small effects become statistically significant)
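
The three determinants can be illustrated numerically. The following is a minimal sketch, not taken from the slides: it assumes a two-sided one-sample z-test with known unit variance, and the function name `z_test_power` is hypothetical.

```python
from statistics import NormalDist


def z_test_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-sided one-sample z-test (assumed setting, unit variance).

    effect_size is the true mean shift measured in standard deviations.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # rejection threshold for |Z|
    shift = effect_size * n ** 0.5              # mean of the test statistic under H1
    # Probability that the statistic lands in either rejection region
    upper = 1 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower


if __name__ == "__main__":
    # Power grows with sample size and effect size, and shrinks with a stricter alpha.
    for n in (20, 50, 100):
        for d in (0.2, 0.5):
            print(n, d, round(z_test_power(d, n), 3), round(z_test_power(d, n, alpha=0.01), 3))
```

With an effect size of zero the function returns alpha itself, which matches the definition of the Type I error rate.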

15 Power, Alpha-risk and Sample Size
[Figure not preserved in the transcript]
Ref: Hair et al. (2010): Multivariate Data Analysis

16 Part II: Classification of multivariate methods

17 Classification of MV methods
Type of relationship being examined
– Can the variables be divided into independent and dependent classifications based on some theory?
Number of dependent variables
– How many variables are treated as dependent in a single analysis?
Type of dependent / independent variables
– How are the variables measured (metric vs. non-metric)?

18 Dependence vs. Interdependence
p1 → p2: Dependence
– (Logistic) Regression Analysis
– Analysis of Variance (ANOVA)
– Discriminant Analysis
– Canonical Correlation Analysis
– Conjoint Analysis
p1 ↔ p2: Interdependence
– Principal Component Analysis
– Factor Analysis
– Cluster Analysis
– Loglinear Models
– Correspondence Analysis

19 A Decision Tree
Type of relationship: Dependence
– Number of DV(s): several DV(s), multiple relations
■ Structural equations modeling
– Number of DV(s): several DV(s), single relation (by scale of variables)
■ Metric DV(s), metric IV(s): Canonical correlation analysis
■ Metric DV(s), non-metric IV(s): Multivariate analysis of variance
– Number of DV(s): one DV, single relation (by scale of variables)
■ Metric DV: Multiple regression; Conjoint analysis
■ Non-metric DV: Discriminant analysis; Linear probability models
Type of relationship: Interdependence
– Type of interdependence: between variables
■ Factor analysis (FA); Confirmatory FA; Principal component analysis
– Type of interdependence: between respondents/cases
■ Cluster analysis

20 Garbage in, roses out?
Data is the foundation for analytics
– Quality & quantity concerns
– Sample size affects all results
– Know your data well
Careful model evaluation needed
– Generalization vs. over-fitting
– Prefer a parsimonious model
– Ensure practical significance as well as statistical significance

21 Structured Approach to Modeling
Stage 1: Define the research problem
– What is the research problem and what are the objectives?
– What multivariate techniques should be used?
Stage 2: Develop the analysis plan
– Implementation issues (sample sizes, allowable variable types, estimation)
Stage 3: Check the model assumptions
– Are the underlying assumptions of the chosen multivariate model satisfied (e.g. normality, linearity, independence of error terms, equality of variances)?
Stage 4: Evaluate overall model fit
– Does the model achieve acceptable levels on statistical criteria (e.g. significance)?
– Are the proposed relationships identified?
– Is the result practically significant?
Stage 5: Interpret the variates
– Analyze the effects of individual variables by examining the estimated coefficients
– Is there empirical evidence of multivariate relationships that can be generalized?
Stage 6: Validate the model
– How does the model perform on a hold-out dataset?
– Demonstrate the generalizability of the results to the total population

22 Part III: Preparing for multivariate analysis: Missing Value Analysis and Imputation

23 What is Missing Data?
Missing data often occur when a respondent fails to answer one or more questions in a survey.
Missing data ~ information not available for a subject (or case) about whom other information is available.

24 What is Missing Data?

25 Why Do We Have Missing Data?
Missing data process ~ any systematic event external to the respondent (such as data entry errors or data collection problems) or any action on the part of the respondent (such as refusal to answer a question) that leads to missing data

26 Why Do We Have Missing Data?
Examples of reasons (from survey research):
– Respondent does not want to respond to a question
– Respondent is not able to respond to a question
– Question too difficult or complicated
– Ignored by accident
– Missing value has a meaning

27 Is it serious?
Well, it depends on …
1. The pattern of missing data
2. The amount of missing data
3. The reasons why it is missing

28 Patterns of Missing Data
MCAR = missing completely at random (The Good)
– The distribution of missing data is unpredictable (i.e. the cases with missing data are indistinguishable from cases with complete data)
MAR = missing at random, a.k.a. ignorable non-response (The Bad)
– The pattern is predictable from other variables in the data
MNAR = missing not at random, or non-ignorable (The Ugly)
– The pattern is related to the dependent variable and cannot be ignored
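
To make the three patterns concrete, the sketch below simulates them on synthetic data. Everything in it is an illustrative assumption rather than course material: the variables `age` (always observed) and `income` (may go missing), the missingness probabilities, and the helper names.

```python
import random


def simulate_missingness(n: int = 10_000, seed: int = 0):
    """Generate (age, income) rows and one missingness mask per mechanism."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 70)                     # always observed
        income = 1000 + 50 * age + rng.gauss(0, 500)  # the variable that may go missing
        rows.append((age, income))
    # MCAR: every income value has the same 20% chance of being missing
    mcar = [rng.random() < 0.2 for _ in rows]
    # MAR: missingness depends only on the observed variable (age)
    mar = [rng.random() < (0.35 if age > 45 else 0.05) for age, _ in rows]
    # MNAR: missingness depends on the value that goes missing (income itself)
    mnar = [rng.random() < (0.35 if income > 3250 else 0.05) for _, income in rows]
    return rows, {"MCAR": mcar, "MAR": mar, "MNAR": mnar}


def mean_observed_income(rows, mask):
    """Mean income over the cases whose income is not flagged as missing."""
    kept = [income for (_, income), missing in zip(rows, mask) if not missing]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    rows, masks = simulate_missingness()
    print("true mean", round(sum(income for _, income in rows) / len(rows), 1))
    for name, mask in masks.items():
        print(name, round(mean_observed_income(rows, mask), 1))
```

In this setup the complete-case mean stays close to the true mean only under MCAR. Under MAR the bias can in principle be corrected using age, whereas under MNAR the observed data alone do not reveal it.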

29 Patterns of Missing Data (cont’d)
Let us assume we have a data set [Y, X]:
– Y denotes the complete data consisting of two parts: Yobs, the observed data, and Ymis, the values that are missing
– X is additional data
– Mi = 1 if the i-th observation has a missing value in Y
The data is MCAR, MAR, or MNAR depending on which of these quantities the probability of Mi = 1 depends on (the formulas are not preserved in the transcript).
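
For reference, the standard textbook definitions can be written in the notation of this slide. This is a reconstruction under the assumption that the slide used the usual conditions; the original formulas may have been stated differently.

```latex
\begin{align*}
\text{MCAR:}\quad & P(M_i = 1 \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, X) = P(M_i = 1) \\
\text{MAR:}\quad  & P(M_i = 1 \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, X) = P(M_i = 1 \mid Y_{\mathrm{obs}}, X) \\
\text{MNAR:}\quad & P(M_i = 1 \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, X) \neq P(M_i = 1 \mid Y_{\mathrm{obs}}, X)
\end{align*}
```

In words: under MCAR missingness depends on nothing in the data, under MAR it depends only on observed quantities, and under MNAR it still depends on the missing values themselves.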

30 Practice Spotting the Patterns
In groups of 2-3 people, go through the stack of cases and label them as MCAR, MAR, or MNAR. Write down your choice on each slide. Prepare to explain your choice for the other groups.

31 How Much Missing Data Is Too Much? Hair et al. (2010)
– Missing data under 10% for an individual case or observation can generally be ignored, except when the missing data occur in a specific non-random fashion (e.g., concentration in a specific set of questions, attrition at the end of the questionnaire, etc.).
– The number of cases with no missing data must be sufficient for the selected analysis technique if replacement values will not be substituted (imputed) for the missing data.
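
Both guidelines require counting missing values per case and counting complete cases. A minimal sketch of those counts follows; the tutorial itself uses SPSS, so the Python helpers, the use of `None` for a missing answer, and the toy `survey` data are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]  # None marks a missing answer (assumed encoding)


def missing_share_per_case(rows: Sequence[Row]) -> List[float]:
    """Fraction of variables that are missing for each case (row)."""
    return [sum(v is None for v in row) / len(row) for row in rows]


def missing_share_per_variable(rows: Sequence[Row]) -> List[float]:
    """Fraction of cases that are missing for each variable (column)."""
    n_vars = len(rows[0])
    return [sum(row[j] is None for row in rows) / len(rows) for j in range(n_vars)]


def complete_cases(rows: Sequence[Row]) -> List[Row]:
    """Cases with no missing values at all."""
    return [row for row in rows if all(v is not None for v in row)]


# Toy survey: 5 respondents, 4 questions
survey = [
    [5, 3, None, 4],
    [4, None, None, 2],
    [3, 3, 4, 4],
    [None, 2, 5, 1],
    [2, 4, 3, 5],
]

if __name__ == "__main__":
    print(missing_share_per_case(survey))
    print(missing_share_per_variable(survey))
    print(len(complete_cases(survey)), "complete cases")
```

Cases whose share exceeds the 10% guideline, or variables where missingness is concentrated, are the ones to inspect for a non-random pattern.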

32 How to Deal with Missing Values?

34 Imputation = inserting a value into data in a “more or less fabricated way”

35 Ways to Deal With Missing Values
Use all available data
– Compute distribution characteristics and relationships from all valid values
Replacement
– Case substitution
– Mean substitution
– Cold deck (i.e. external source)
– Hot deck
– Model based
■ Expectation maximization
■ Regression
■ Multiple imputation (combine several models)
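
Three of the replacement methods listed above can be sketched in a few lines. These are simplified illustrations under stated assumptions, not the procedures SPSS implements: `y` has missing values encoded as `None`, `x` is a fully observed auxiliary variable, the hot deck picks the single donor with the closest `x`, and the regression is a one-predictor least-squares fit.

```python
from statistics import mean
from typing import List, Optional, Sequence


def mean_substitution(y: Sequence[Optional[float]]) -> List[float]:
    """Replace each missing value with the mean of the observed values."""
    observed_mean = mean(v for v in y if v is not None)
    return [observed_mean if v is None else v for v in y]


def hot_deck(x: Sequence[float], y: Sequence[Optional[float]]) -> List[float]:
    """Replace each missing y with the y of the observed case whose x is closest."""
    donors = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    return [
        yi if yi is not None else min(donors, key=lambda d: abs(d[0] - xi))[1]
        for xi, yi in zip(x, y)
    ]


def regression_imputation(x: Sequence[float], y: Sequence[Optional[float]]) -> List[float]:
    """Replace each missing y with the prediction of a least-squares line fitted on observed cases."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    x_bar = mean(xi for xi, _ in observed)
    y_bar = mean(yi for _, yi in observed)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in observed) / sum(
        (xi - x_bar) ** 2 for xi, _ in observed
    )
    intercept = y_bar - slope * x_bar
    return [yi if yi is not None else intercept + slope * xi for xi, yi in zip(x, y)]


x = [1.0, 2.0, 3.0, 4.4, 5.0]
y = [2.0, 4.0, 6.0, None, 10.0]

if __name__ == "__main__":
    print(mean_substitution(y))
    print(hot_deck(x, y))
    print(regression_imputation(x, y))
```

On this toy data the three methods fill the same gap with three different values, which shows why the choice of method matters. Mean substitution in particular ignores the relationship between `x` and `y`.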

36 Rules of Thumb for Imputation, Hair et al. (2010)
When the amount of missing data is …
Under 10%:
– Any of the imputation methods should be fine
10 to 20%:
– For MCAR data, consider hot deck case substitution and regression methods
– For MAR, model-based methods are necessary
Over 20%:
– If imputation is considered necessary, then use regression for MCAR and model-based methods for MAR
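
The rules of thumb above amount to a small lookup, sketched here as a function. The name `imputation_guideline` is hypothetical, and the treatment of the exact 10% and 20% boundaries is an assumption, since the slide does not say which bracket they belong to.

```python
def imputation_guideline(missing_share: float, mechanism: str) -> str:
    """Return the rule-of-thumb recommendation for a share of missing data in [0, 1]."""
    if not 0.0 <= missing_share <= 1.0:
        raise ValueError("missing_share must be a fraction between 0 and 1")
    mechanism = mechanism.upper()
    if mechanism not in {"MCAR", "MAR"}:
        # The slide gives no rule for MNAR data
        raise ValueError("the rules of thumb cover MCAR and MAR only")
    if missing_share < 0.10:
        return "any imputation method"
    if missing_share <= 0.20:  # assumption: both boundaries fall in the middle bracket
        if mechanism == "MCAR":
            return "hot deck case substitution or regression"
        return "model-based methods"
    # Over 20%: applies only if imputation is considered necessary at all
    return "regression" if mechanism == "MCAR" else "model-based methods"


if __name__ == "__main__":
    for share in (0.05, 0.15, 0.30):
        for mech in ("MCAR", "MAR"):
            print(f"{share:.0%} {mech}: {imputation_guideline(share, mech)}")
```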

37 Imputation with valid data

38 Imputation using known values

39 Imputation by calculating values

40 Imputation of MAR processes

41 Imputation can be used also when…
– Existing value is partially missing (e.g., available as an interval but the unique value is not given)
– Existing value appears to be incorrect or “corrupted”
– Existing value is too confidential and cannot be revealed (e.g., some company datasets with detailed personal information)

42 Watch out for data with time dependence!
Source: anychart.com

44 Tutorial I: Missing values
Form groups of 1-3 students. Check that SPSS is installed on your computer.
Q1: Why worry about missing values?
Q2: How to fix the missing value problem?

45 Further reading
Stef van Buuren: Flexible Imputation of Missing Data (Chapman & Hall/CRC Interdisciplinary Statistics)
Links:
– http://www.stefvanbuuren.nl/mi/index.html
– http://www.stefvanbuuren.nl/mi/Course.html

46 Thank you!

