Lecture 2: Survey Data Analysis, Principal Component Analysis and Factor Analysis, Exemplified by SPSS. Taylan Mavruk.

Topics:
1. Bivariate analysis (quantitative and qualitative variables)
2. Scatter plot and correlation coefficient
3. The straight-line equation and regression
4. Coefficient of determination R² (goodness of fit)
5. Hypothesis testing

Cases, by the measurement level of the Y-variable (dependent) and the X-variable (independent):
- Case 1: qualitative Y, qualitative X
- Case 2: quantitative Y, qualitative X
- Case -: qualitative Y, quantitative X
- Case 3: quantitative Y, quantitative X

Definition: a cross-tabulation combines the frequencies of two (qualitative) variables.
Example: F7b_a * Stratum.
Note that the independent variable (Stratum) should be the column variable if the assumption is that sick leave depends on the size of the company.
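
As an illustration, here is a minimal Python sketch of such a cross-tabulation using pandas; the data and the variable names F7b_a and Stratum are hypothetical stand-ins for the survey file used in the lecture.

```python
import pandas as pd

# Hypothetical stand-in for the survey data: change in sick leave by company size (stratum)
df = pd.DataFrame({
    "Stratum": ["Small", "Small", "Medium", "Large", "Large", "Medium", "Small", "Large"],
    "F7b_a":   ["Increased", "Unchanged", "Decreased", "Unchanged",
                "Increased", "Unchanged", "Decreased", "Unchanged"],
})

# Cross-tabulation with the independent variable (Stratum) as the column variable
table = pd.crosstab(df["F7b_a"], df["Stratum"])
print(table)

# Column percentages make the comparison across company sizes easier to read
print(pd.crosstab(df["F7b_a"], df["Stratum"], normalize="columns").round(2))
```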

Is the identified dependence between the variables statistically significant, or is it due to chance? We use a hypothesis test for dependence (the χ²-test):
- H₀: The variables are independent, i.e. there is no difference in the change in sick leave due to company size
- H₁: The variables are dependent, i.e. there is a difference in the change in sick leave due to company size
The χ²-test tells us that we can reject H₀, since the probability that the difference is due to chance alone is practically zero (p-value = 0.000).
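
A hedged sketch of the corresponding χ²-test of independence in Python, using scipy on a hypothetical contingency table (the counts below are invented for illustration, not the lecture's survey data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = change in sick leave, columns = company size
observed = np.array([
    [30, 25, 15],   # Increased
    [60, 70, 40],   # Unchanged
    [10, 25, 45],   # Decreased
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-square = {chi2:.2f}, df = {dof}, p-value = {p_value:.4f}")

# Reject H0 (independence) when the p-value falls below the chosen significance level
alpha = 0.05
print("Reject H0" if p_value < alpha else "Do not reject H0")
```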

From different values of X we can study the change in Y in the form of mean values.
Example: F5a * Stratum.
The η² value means that (only) 20% of the variation in the Y-variable can be explained by the X-variable (company size).
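
The group means and η² can be computed directly from the raw data; the sketch below uses simulated values in place of F5a and Stratum, and η² is calculated as the between-group sum of squares divided by the total sum of squares:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for F5a (a quantitative survey item) grouped by Stratum
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Stratum": np.repeat(["Small", "Medium", "Large"], 50),
    "F5a": np.concatenate([rng.normal(10, 2, 50),
                           rng.normal(11, 2, 50),
                           rng.normal(13, 2, 50)]),
})

# Group means: how the mean of Y changes across values of X
print(df.groupby("Stratum")["F5a"].mean())

# Eta squared = between-group sum of squares / total sum of squares
grand_mean = df["F5a"].mean()
ss_total = ((df["F5a"] - grand_mean) ** 2).sum()
ss_between = df.groupby("Stratum")["F5a"].apply(
    lambda g: len(g) * (g.mean() - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total
print(f"Eta squared = {eta_squared:.3f}")
```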

Is the identified difference between the stratum means statistically significant, or is it due to chance? We use a hypothesis test (the ANOVA test):
- H₀: There is no difference between the stratum means
- H₁: Two or more of the mean values are different
The ANOVA test tells us that we can reject H₀, since the probability that the difference is due to chance alone is practically zero (p-value = 0.000).
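
A minimal one-way ANOVA in Python with scipy, again on made-up samples standing in for the three strata:

```python
from scipy.stats import f_oneway

# Hypothetical samples of the quantitative variable for three strata (company sizes)
small  = [9.8, 10.5, 11.2, 9.1, 10.0, 10.7]
medium = [11.0, 11.6, 10.9, 12.1, 11.4, 11.8]
large  = [12.9, 13.4, 12.2, 13.8, 13.1, 12.6]

f_stat, p_value = f_oneway(small, medium, large)
print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")

# Reject H0 (all stratum means equal) when the p-value is below the significance level
print("Reject H0" if p_value < 0.05 else "Do not reject H0")
```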

The connection between two quantitative variables is best presented in a scatter plot, and the strength of the connection can be described by the correlation coefficient r.
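
For two quantitative variables, the scatter plot and r can be produced as in this sketch (simulated data; numpy and matplotlib assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt

# Two hypothetical quantitative variables
rng = np.random.default_rng(1)
x = rng.normal(50, 10, 100)
y = 0.8 * x + rng.normal(0, 5, 100)

# Pearson correlation coefficient r
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.2f}")

# Scatter plot of the relationship
plt.scatter(x, y)
plt.xlabel("X")
plt.ylabel("Y")
plt.title(f"Scatter plot (r = {r:.2f})")
plt.show()
```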

The strength of the connection between the variables is measured by the correlation coefficient r: the stronger the connection, the closer r is to ±1.

Y = β₀ + β₁X, where β₀ is the intercept at which the line cuts the Y-axis and β₁ is the regression coefficient (slope).

The simple linear regression model is y = β₀ + β₁x + ε, where μ_y|x = β₀ + β₁x is the mean value of the dependent variable y when the value of the independent variable is x.
- β₀ is the y-intercept, the mean of y when x is 0
- β₁ is the slope, the change in the mean of y per unit change in x
- ε is an error term that describes the effect on y of all factors other than x

- β₀ and β₁ are called regression parameters
- β₀ is the y-intercept and β₁ is the slope
- We do not know the true values of these parameters
- So we must use sample data to estimate them
- b₀ is the estimate of β₀ and b₁ is the estimate of β₁
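
A small illustration of estimating b₀ and b₁ from sample data by least squares; the x and y values are hypothetical:

```python
import numpy as np

# Hypothetical sample data for the dependent (y) and independent (x) variables
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
y = np.array([3.1, 3.3, 3.2, 3.5, 3.4, 3.7])

# Least-squares estimates: b1 (slope) and b0 (intercept) estimate beta1 and beta0
b1, b0 = np.polyfit(x, y, deg=1)
print(f"Estimated line: y-hat = {b0:.3f} + {b1:.3f} x")
```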

Estimated regression line: Ŷ = 3,… + 0,048X

Squaring the correlation coefficient r gives a measure of how well the regression line fits the data: R² (goodness of fit) is the proportion of the total variation in Y that can be explained by the linear relation between X and Y.

Is the identified dependence between X and Y statistically significant, or is it due to chance? We use a hypothesis test (the F-test):
- H₀: There is no dependence, R² = 0
- H₁: There is dependence, R² > 0
(The F-test tests the significance of the overall regression relationship between x and y.)
The F-test tells us that we can reject H₀, since the probability that the dependence is due to chance alone is practically zero (p-value = 0.000).

To test H₀: β₁ = 0 versus Hₐ: β₁ ≠ 0 at the α level of significance:
- Test statistic based on F: F(model) = MSR / MSE = (explained variation / 1) / (SSE / (n − 2))
- Reject H₀ if F(model) > F_α or p-value < α
- F_α is based on 1 numerator and n − 2 denominator degrees of freedom
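
The R² and the overall F-test are reported together by most regression routines; the sketch below uses statsmodels on simulated data to show where the F statistic and its p-value come from:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data for a simple linear regression
rng = np.random.default_rng(2)
x = rng.uniform(0, 100, 80)
y = 3.0 + 0.05 * x + rng.normal(0, 1.0, 80)

# Fit Y = b0 + b1*X by ordinary least squares
X = sm.add_constant(x)          # adds the intercept column
model = sm.OLS(y, X).fit()

# R-squared, the overall F statistic, and its p-value
print(f"R-squared = {model.rsquared:.3f}")
print(f"F = {model.fvalue:.2f}, p-value = {model.f_pvalue:.4g}")

# Reject H0 (no linear dependence, R-squared = 0) when the p-value is below alpha
print("Reject H0" if model.f_pvalue < 0.05 else "Do not reject H0")
```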

- Null and alternative hypotheses and errors in testing
- t-tests about a population mean (standard deviation unknown)
- z-tests about a population proportion

The null hypothesis, denoted H₀, is a statement of the basic proposition being tested. The statement generally represents the status quo and is not rejected unless there is convincing sample evidence that it is false. The alternative or research hypothesis, denoted Hₐ, is a statement (alternative to the null hypothesis) that will be accepted only if there is convincing sample evidence that it is true.

Type I error: rejecting H₀ when it is true. Type II error: failing to reject H₀ when it is false.

Error probabilities:
- Type I error (rejecting H₀ when it is true): α is the probability of making a Type I error; 1 − α is the probability of not making a Type I error
- Type II error (failing to reject H₀ when it is false): β is the probability of making a Type II error; 1 − β is the probability of not making a Type II error

Conclusion by state of nature:
- Reject H₀: probability α if H₀ is true; probability 1 − β if H₀ is false
- Do not reject H₀: probability 1 − α if H₀ is true; probability β if H₀ is false

- Usually set α to a low value, so that there is only a small chance of rejecting a true H₀; typically α = 0.05
- For α = 0.05, strong evidence is required to reject H₀
- Usually choose α between 0.01 and 0.05; α = 0.01 requires very strong evidence to reject H₀; sometimes α is chosen as high as 0.10
- There is a tradeoff between α and β: for a fixed sample size, the lower we set α, the higher β becomes, and the higher α, the lower β

- Let x̄ be the mean of a sample of size n with standard deviation s
- μ₀ is the claimed value of the population mean
- Define the test statistic t = (x̄ − μ₀) / (s / √n)
- If the population being sampled is normal, and if s is used to estimate σ, then the sampling distribution of the t statistic is a t distribution with n − 1 degrees of freedom

Rejection rules (t_α, t_{α/2}, and the p-values are based on n − 1 degrees of freedom, for a sample of size n):
- Hₐ: μ > μ₀: reject H₀ if t > t_α; p-value is the area under the t distribution to the right of t
- Hₐ: μ < μ₀: reject H₀ if t < −t_α; p-value is the area under the t distribution to the left of −t
- Hₐ: μ ≠ μ₀: reject H₀ if |t| > t_{α/2} (that is, t > t_{α/2} or t < −t_{α/2}); p-value is twice the area under the t distribution to the right of |t|
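
A hedged Python sketch of the one-sample t-test, using scipy with invented sample values and a two-sided alternative; the rejection-point check mirrors the rule |t| > t_{α/2}:

```python
import numpy as np
from scipy import stats

# Hypothetical sample; H0: mu = mu0 against a two-sided alternative
sample = np.array([5.2, 4.8, 5.5, 5.1, 4.9, 5.3, 5.6, 5.0, 4.7, 5.4])
mu0 = 5.0

t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}  (df = {len(sample) - 1})")

# Equivalent rejection-point check for alpha = 0.05: |t| > t_{alpha/2, n-1}
t_crit = stats.t.ppf(1 - 0.025, df=len(sample) - 1)
print("Reject H0" if abs(t_stat) > t_crit else "Do not reject H0")
```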

If the sample size n is large, we can reject H₀: p = p₀ at the α level of significance (probability of a Type I error equal to α) if and only if the appropriate rejection point condition holds or, equivalently, if the corresponding p-value is less than α. We have the following rules.

The test statistic is z = (p̂ − p₀) / √(p₀(1 − p₀) / n). Rejection rules:
- Hₐ: p > p₀: reject H₀ if z > z_α; p-value is the area under the standard normal curve to the right of z
- Hₐ: p < p₀: reject H₀ if z < −z_α; p-value is the area under the standard normal curve to the left of −z
- Hₐ: p ≠ p₀: reject H₀ if |z| > z_{α/2} (that is, z > z_{α/2} or z < −z_{α/2}); p-value is twice the area under the standard normal curve to the right of |z|
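
A corresponding sketch of the large-sample z-test for a population proportion; the counts (230 successes out of 400) and the null value p₀ = 0.5 are hypothetical:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: 230 successes in n = 400 trials; H0: p = 0.5 vs Ha: p != 0.5
n, successes, p0, alpha = 400, 230, 0.5, 0.05
p_hat = successes / n

# Large-sample z statistic for a population proportion
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)

# Two-sided p-value: twice the area under the standard normal to the right of |z|
p_value = 2 * (1 - norm.cdf(abs(z)))
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```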

- You have a set of p continuous variables
- You want to repackage their variance into m components
- You will usually want m to be smaller than p
- Each component is a weighted linear combination of the variables

- Data reduction
- Discover and summarize the pattern of intercorrelations among variables
- Test theory about the latent variables underlying a set of measurement variables
- Construct a test instrument
- There are many other uses of PCA and FA

- A principal component is a weighted linear combination of the observed variables
- In PCA, the goal is to explain as much of the total variance in the variables as possible; the goal in Factor Analysis is to explain the covariances or correlations between the variables
- PCA: reduce multiple observed variables into fewer components that summarize their variance
- FA: determine the nature and the number of latent variables that account for the observed variation and covariation among a set of observed indicators
- In short: use Principal Components Analysis to reduce the data into a smaller number of components; use Factor Analysis to understand what underlies the correlations among the variables
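
To make the PCA-versus-FA distinction concrete, here is a small Python sketch with scikit-learn on simulated survey items; the data, the number of components, and the two-factor structure are all assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

# Hypothetical data: 200 respondents, 7 correlated survey items
rng = np.random.default_rng(3)
latent = rng.normal(size=(200, 2))                 # two underlying dimensions
loadings = rng.uniform(0.4, 0.9, size=(2, 7))
X = latent @ loadings + rng.normal(scale=0.5, size=(200, 7))

# PCA: repackage the total variance into a few components
pca = PCA(n_components=2).fit(X)
print("Variance explained by components:", pca.explained_variance_ratio_.round(2))

# FA: model the shared variance (correlations) with a small number of latent factors
fa = FactorAnalysis(n_components=2).fit(X)
print("Factor loadings:\n", fa.components_.round(2))
```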

- Choose Analyze > Data Reduction > Factor
- Click Descriptives and check Initial Solution, Coefficients, KMO and Bartlett's Test of Sphericity, and Anti-image. Click Continue.
- Click Extraction and select Principal Components, Correlation Matrix, Unrotated Factor Solution, Scree Plot, and Eigenvalues Over 1. Click Continue.
- Click Rotation and select Varimax and Rotated Solution. Click Continue.
- Click Options and select Exclude Cases Listwise and Sorted By Size. Click Continue.
- Click OK, and SPSS runs the Principal Components Analysis.
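
The same analysis can be approximated outside SPSS; the sketch below assumes the third-party factor_analyzer package (for KMO, Bartlett's test, and varimax rotation) and uses simulated items in place of the lecture's data file:

```python
import numpy as np
import pandas as pd
# Third-party package; assumed available via `pip install factor_analyzer`
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Hypothetical survey items standing in for the SPSS data file
rng = np.random.default_rng(4)
latent = rng.normal(size=(300, 2))
X = pd.DataFrame(latent @ rng.uniform(0.4, 0.9, (2, 7))
                 + rng.normal(scale=0.5, size=(300, 7)),
                 columns=[f"item{i}" for i in range(1, 8)])

# Bartlett's test of sphericity and the KMO measure (Descriptives step in SPSS)
chi2, p = calculate_bartlett_sphericity(X)
kmo_per_item, kmo_total = calculate_kmo(X)
print(f"Bartlett chi-square = {chi2:.1f}, p = {p:.4f}, overall KMO = {kmo_total:.2f}")

# Principal-components extraction with varimax rotation (Extraction and Rotation steps)
fa = FactorAnalyzer(n_factors=2, method="principal", rotation="varimax")
fa.fit(X)
print("Rotated component loadings (SPSS additionally sorts them by size):")
print(np.round(fa.loadings_, 2))
```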

- Check the correlation matrix: any variables that are not well correlated with some of the others might as well be deleted
- Bartlett's test of sphericity tests the null hypothesis that the correlation matrix is an identity matrix, but it does not help identify individual variables that are not well correlated with the others
- For each variable, check the R² between it and the remaining variables; SPSS reports these as the initial communalities when you run a principal axis factor analysis
- Delete any variable with a low R²
- Look at the partial correlations: pairs of variables with large partial correlations share variance with one another but not with the remaining variables, which is problematic
- Kaiser's MSA tells you, for each variable, how much of this problem exists; the smaller the MSA, the greater the problem
- Variables with small MSAs should be deleted

- From p variables we can extract p components
- Each of the p eigenvalues represents the amount of standardized variance that has been captured by one component
- The first component accounts for the largest possible amount of variance
- The second captures as much as possible of what is left over, and so on
- Each component is orthogonal to the others
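
A short illustration of extracting eigenvalues from the correlation matrix of simulated data and reading off each component's share of the standardized variance; the seven items and two underlying dimensions are assumptions for illustration:

```python
import numpy as np

# Hypothetical data matrix: 200 respondents, 7 survey items
rng = np.random.default_rng(5)
latent = rng.normal(size=(200, 2))
X = latent @ rng.uniform(0.4, 0.9, (2, 7)) + rng.normal(scale=0.6, size=(200, 7))

# Eigenvalues of the correlation matrix: one per extractable component
R = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

# Each eigenvalue, its share of the total standardized variance, and the cumulative share
proportion = eigenvalues / eigenvalues.sum()
for i, (ev, pr) in enumerate(zip(eigenvalues, proportion), start=1):
    print(f"Component {i}: eigenvalue = {ev:.2f}, "
          f"proportion = {pr:.1%}, cumulative = {proportion[:i].sum():.1%}")

# Kaiser criterion: keep components with eigenvalue greater than 1
print("Components retained:", int((eigenvalues > 1).sum()))
```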

- Example: the eigenvalues and proportions of variance for the seven components
- Only the first two components have eigenvalues greater than 1
- There is a big drop in eigenvalue between component 2 and component 3; components 3 to 7 are scree
- Try a two-component solution
- You should also look at solutions with one fewer and with one more component