1
Lecture 2: Survey Data Analysis. Principal Component Analysis and Factor Analysis, exemplified by SPSS. Taylan Mavruk
2
Topics:
1. Bivariate analysis (quantitative and qualitative variables)
2. Scatter plot and correlation coefficient
3. The straight-line equation and regression
4. Coefficient of determination R² (goodness of fit)
5. Hypothesis testing
3
Three cases, by measurement level of the two variables:

                           Case 1        Case 2        Case 3
Y-variable (dependent)     Qualitative   Quantitative  Quantitative
X-variable (independent)   Qualitative   Qualitative   Quantitative
4
A cross-tabulation is the combination of the frequencies of two (qualitative) variables. Example: F7b_a * Stratum. Note that the independent variable (stratum) should be the column variable if the assumption is that sick leave depends on the size of the company.
5
Is the identified dependency between the variables statistically significant, or is it due to chance? We test for dependency with a chi-square test:
H0: The variables are independent, i.e. there is no difference in the change in sick leave due to company size.
H1: The variables are dependent, i.e. there is a difference in the change in sick leave due to company size.
The chi-square test tells us that we can reject H0, since the likelihood that the difference is due to chance is less than 1 in 1,000 (p-value = 0.000).
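The slides run this test through SPSS; as a rough Python equivalent, here is a minimal sketch using scipy.stats.chi2_contingency. The cross-tabulation counts below are invented for illustration, not the survey data behind the slide.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical cross-tabulation: rows = change in sick leave (increase/decrease),
# columns = company-size stratum (small/medium/large). Counts are invented.
table = np.array([[30, 45, 80],
                  [70, 55, 40]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p-value = {p_value:.4f}")
# Reject H0 (independence) when the p-value falls below the chosen alpha.
```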
6
From different values of X we can study the change in Y in the form of mean values. Example: F5a * Stratum. An eta² of 0.20 means that (only) 20% of the variation in the Y-variable can be explained by the X-variable (company size).
7
Is the identified dependency between the stratum means statistically significant, or is it due to chance? We test for dependency with an ANOVA test:
H0: There is no difference between the stratum means.
H1: Two or more of the mean values are different.
The ANOVA test tells us that we can reject H0, since the likelihood that the difference is due to chance is less than 1 in 1,000 (p-value = 0.000).
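In SPSS this is the One-Way ANOVA procedure; a minimal Python sketch with invented group data, using scipy.stats.f_oneway for the F-test plus eta² computed from the sums of squares:

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical sick-leave scores for three company-size strata (invented data).
small  = np.array([3.1, 2.8, 3.5, 3.0, 2.9])
medium = np.array([3.6, 3.9, 3.4, 3.8, 3.7])
large  = np.array([4.2, 4.0, 4.5, 3.9, 4.3])

f_stat, p_value = f_oneway(small, medium, large)
print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")

# Eta-squared: between-group sum of squares over total sum of squares.
groups = [small, medium, large]
all_obs = np.concatenate(groups)
ss_between = sum(len(g) * (g.mean() - all_obs.mean()) ** 2 for g in groups)
ss_total = ((all_obs - all_obs.mean()) ** 2).sum()
print(f"eta^2 = {ss_between / ss_total:.3f}")  # share of variation explained by X
```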
8
The connection between two quantitative variables is best presented by a scatter plot, and the strength of the connection can be summarized by the correlation coefficient r.
9
The strength of the connection between the variables is measured by the correlation coefficient r. The stronger the connection, the closer r is to +1 or −1.
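A minimal sketch of computing r (and its p-value) in Python with scipy.stats.pearsonr; the paired observations are invented:

```python
import numpy as np
from scipy.stats import pearsonr

# Invented paired observations for two quantitative variables.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

r, p_value = pearsonr(x, y)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")  # r near +1: strong positive connection
```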
11
Y = β0 + β1X, where β0 is the intercept at which the line cuts the Y-axis and β1 is the regression coefficient (the slope).
12
The simple linear regression model is y = β0 + β1x + ε, where:
μy|x = β0 + β1x is the mean value of the dependent variable y when the value of the independent variable is x;
β0 is the y-intercept, the mean of y when x is 0;
β1 is the slope, the change in the mean of y per unit change in x;
ε is an error term that describes the effect on y of all factors other than x.
13
β0 and β1 are called regression parameters: β0 is the y-intercept and β1 is the slope. We do not know the true values of these parameters, so we must use sample data to estimate them: b0 is the estimate of β0 and b1 is the estimate of β1.
14
Y = 3.595 + 0.048X
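An equation like this is estimated from sample data; as a sketch (the data below are invented, not the survey data that produced the slide's equation), scipy.stats.linregress returns the estimates b0 and b1 along with r:

```python
import numpy as np
from scipy.stats import linregress

# Invented (x, y) sample for illustration only.
x = np.array([10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([4.1, 4.5, 5.0, 5.6, 5.9, 6.5])

res = linregress(x, y)
print(f"Y = {res.intercept:.3f} + {res.slope:.3f}X")   # b0 and b1
print(f"r = {res.rvalue:.3f}, R^2 = {res.rvalue ** 2:.3f}, p = {res.pvalue:.4f}")
```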
15
A measure of estimation ability is obtained by squaring the correlation coefficient r. R² (goodness of fit) is the proportion of the total variation in Y that can be explained by the linear relation between X and Y.
16
Is the identified dependency between X and Y statistically significant, or is it due to chance? We test for dependency with an F-test:
H0: There is no dependence, R² = 0.
H1: There is dependence, R² > 0.
(The F-test tests the significance of the overall regression relationship between x and y.)
The F-test tells us that we can reject H0, since the likelihood that the difference is due to chance is less than 1 in 1,000 (p-value = 0.000).
17
To test H0: β1 = 0 versus Ha: β1 ≠ 0 at the α level of significance, use the test statistic
F(model) = (explained variation) / (unexplained variation / (n − 2))
Reject H0 if F(model) > Fα or if the p-value < α. Fα is based on 1 numerator and n − 2 denominator degrees of freedom.
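A worked sketch of this rule with invented numbers: for a simple regression, F(model) can be written as R² / ((1 − R²)/(n − 2)), compared against the F distribution with 1 and n − 2 degrees of freedom:

```python
from scipy.stats import f

# Hypothetical values: R^2 = 0.20 from a simple regression on n = 100 cases.
r_squared, n = 0.20, 100

f_model = r_squared / ((1 - r_squared) / (n - 2))   # F with 1 and n-2 df
p_value = f.sf(f_model, dfn=1, dfd=n - 2)           # right-tail area

alpha = 0.05
f_crit = f.ppf(1 - alpha, dfn=1, dfd=n - 2)         # F_alpha rejection point
print(f"F = {f_model:.2f}, F_crit = {f_crit:.2f}, p = {p_value:.6f}")
# Reject H0 if F > F_crit or, equivalently, if p < alpha.
```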
18
Null and Alternative Hypotheses and Errors in Testing
t-Tests about a Population Mean (standard deviation unknown)
z-Tests about a Population Proportion
19
The null hypothesis, denoted H0, is a statement of the basic proposition being tested. The statement generally represents the status quo and is not rejected unless there is convincing sample evidence that it is false. The alternative or research hypothesis, denoted Ha, is an alternative (to the null hypothesis) statement that will be accepted only if there is convincing sample evidence that it is true.
20
Type I Error: Rejecting H0 when it is true. Type II Error: Failing to reject H0 when it is false.
21
Error Probabilities
Type I Error: Rejecting H0 when it is true. α is the probability of making a Type I error; 1 − α is the probability of not making a Type I error.
Type II Error: Failing to reject H0 when it is false. β is the probability of making a Type II error; 1 − β is the probability of not making a Type II error.

                        State of Nature
Conclusion              H0 True               H0 False
Reject H0               Type I error (α)      Correct (1 − β)
Do not reject H0        Correct (1 − α)       Type II error (β)
22
α is usually set to a low value, so that there is only a small chance of rejecting a true H0. Typically, α = 0.05; with α = 0.05, strong evidence is required to reject H0. We usually choose α between 0.01 and 0.05; α = 0.01 requires very strong evidence to reject H0. Sometimes α is chosen as high as 0.10. There is a tradeoff between α and β: for a fixed sample size, the lower we set α, the higher β is, and the higher α, the lower β.
23
Let x̄ be the mean of a sample of size n with standard deviation s, and let μ0 be the claimed value of the population mean. Define the test statistic
t = (x̄ − μ0) / (s / √n)
If the population being sampled is normal, and s is used to estimate σ, then the sampling distribution of the t statistic is a t distribution with n − 1 degrees of freedom.
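In Python, scipy.stats.ttest_1samp applies this test directly; a minimal sketch with an invented sample:

```python
import numpy as np
from scipy.stats import ttest_1samp

# Invented sample; H0: the population mean equals mu0 = 5.0.
sample = np.array([5.3, 4.9, 5.6, 5.1, 5.4, 5.8, 5.2, 5.0])
mu0 = 5.0

t_stat, p_value = ttest_1samp(sample, popmean=mu0)  # two-sided by default
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
# For the one-sided alternatives, pass alternative='greater' or 'less'
# (available in scipy >= 1.6).
```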
24
Alternative     Reject H0 if                                   p-value
Ha: μ > μ0      t > tα                                         Area under the t distribution to the right of t
Ha: μ < μ0      t < −tα                                        Area under the t distribution to the left of −t
Ha: μ ≠ μ0      |t| > tα/2 (either t > tα/2 or t < −tα/2)      Twice the area under the t distribution to the right of |t|
tα, tα/2, and p-values are based on n − 1 degrees of freedom (for a sample of size n).
25
If the sample size n is large, we can reject H0: p = p0 at the α level of significance (probability of Type I error equal to α) if and only if the appropriate rejection point condition holds or, equivalently, if the corresponding p-value is less than α. We have the following rules:
26
The test statistic is
z = (p̂ − p0) / √(p0(1 − p0) / n)
Alternative     Reject H0 if                                   p-value
Ha: p > p0      z > zα                                         Area under the standard normal curve to the right of z
Ha: p < p0      z < −zα                                        Area under the standard normal curve to the left of −z
Ha: p ≠ p0      |z| > zα/2 (either z > zα/2 or z < −zα/2)      Twice the area under the standard normal curve to the right of |z|
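A minimal sketch of this z-test computed by hand in Python; the counts are invented:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: 130 successes out of n = 200; H0: p = p0 = 0.5.
successes, n, p0 = 130, 200, 0.5
p_hat = successes / n

z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
p_two_sided = 2 * norm.sf(abs(z))   # twice the right-tail area beyond |z|
print(f"z = {z:.3f}, two-sided p-value = {p_two_sided:.4f}")
```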
30
You have a set of p continuous variables. You want to repackage their variance into m components, where you will usually want m < p. Each component is a weighted linear combination of the variables.
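A minimal sketch of this in Python with scikit-learn; the data matrix is randomly generated stand-in data, not the survey data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))               # stand-in data: 100 cases, p = 7 variables

X_std = StandardScaler().fit_transform(X)   # PCA is usually run on standardized variables
pca = PCA(n_components=2)                   # keep m = 2 components
scores = pca.fit_transform(X_std)           # component scores for each case

print(pca.explained_variance_ratio_)        # share of variance per component
print(pca.components_)                      # the weights defining each component
```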
31
Data reduction. Discover and summarize patterns of intercorrelation among variables. Test theory about the latent variables underlying a set of measurement variables. Construct a test instrument. There are many other uses of PCA and FA.
32
A principal component is a weighted linear combination of observed variables. In PCA, the goal is to explain as much of the total variance in the variables as possible; the goal in factor analysis is to explain the covariances or correlations between the variables. PCA: reduce multiple observed variables into fewer components that summarize their variance. FA: determine the nature and the number of latent variables that account for observed variation and covariation among a set of observed indicators. Use principal components analysis to reduce the data into a smaller number of components; use factor analysis to understand the underlying latent variables.
33
In SPSS: Analyze > Data Reduction > Factor. Click Descriptives and check Initial Solution, Coefficients, KMO and Bartlett's Test of Sphericity, and Anti-image; click Continue. Click Extraction and select Principal Components, Correlation Matrix, Unrotated Factor Solution, Scree Plot, and Eigenvalues Over 1; click Continue. Click Rotation and select Varimax and Rotated Solution; click Continue. Click Options and select Exclude Cases Listwise and Sorted By Size; click Continue. Click OK, and SPSS completes the principal components analysis.
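Outside SPSS, the factor_analyzer package offers a rough counterpart; a hedged sketch, assuming a DataFrame df of the observed survey items and a hypothetical file name (option names as I recall them from the package, so verify against its documentation):

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Assumed: a DataFrame of observed survey items; the file name is hypothetical.
df = pd.read_csv("survey_items.csv")

# Principal-components extraction with varimax rotation, two factors kept.
fa = FactorAnalyzer(n_factors=2, method="principal", rotation="varimax")
fa.fit(df)

loadings = pd.DataFrame(fa.loadings_, index=df.columns)
print(loadings.round(2))   # rotated loadings, for interpreting the components
```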
34
Check the correlation matrix: if there are any variables not well correlated with some others, you might as well delete them. Bartlett's test of sphericity tests the null hypothesis that the correlation matrix is an identity matrix, but it does not help identify individual variables that are not well correlated with others. For each variable, check the R² between it and the remaining variables; SPSS reports these as the initial communalities when you do a principal axis factor analysis. Delete any variable with a low R². Look at partial correlations: pairs of variables with large partial correlations share variance with one another but not with the remaining variables, which is problematic. Kaiser's MSA tells you, for each variable, how much of this problem exists; the smaller the MSA, the greater the problem. Variables with small MSAs should be deleted.
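Bartlett's statistic itself is simple to compute; a minimal sketch, assuming the standard chi-square approximation χ² = −(n − 1 − (2p + 5)/6) · ln det(R) with p(p − 1)/2 degrees of freedom, run on randomly generated stand-in data:

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(data):
    """Bartlett's test that the correlation matrix is an identity matrix."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(corr))
    dof = p * (p - 1) / 2
    return statistic, chi2.sf(statistic, dof)  # statistic and its p-value

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 7))   # stand-in data: 200 cases, 7 variables
stat, p_value = bartlett_sphericity(X)
print(f"chi2 = {stat:.1f}, p = {p_value:.4f}")  # small p: reject the identity null
```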
35
From p variables we can extract p components. Each of the p eigenvalues represents the amount of standardized variance captured by one component. The first component accounts for the largest possible amount of variance; the second captures as much as possible of what is left over, and so on. Each component is orthogonal to the others.
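These eigenvalues come straight from the correlation matrix; a minimal sketch on stand-in data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 7))                  # stand-in data: 150 cases, p = 7 variables

corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # all p eigenvalues, largest first

print(eigenvalues)
print(eigenvalues / eigenvalues.sum())         # proportion of variance per component
print((eigenvalues > 1).sum())                 # how many pass the eigenvalue-over-1 rule
```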
36
Example of the eigenvalues and proportions of variance for the seven components: only the first two components have eigenvalues greater than 1, and there is a big drop in eigenvalue between component 2 and component 3. Components 3 to 7 are scree. Try a two-component solution; you should also look at solutions with one fewer and with one more component.