Endogeneity in PLS-SEM: The Gaussian Copula Approach

Hult, G. T. M., Hair, J. F., Proksch, D., Sarstedt, M., Pinkwart, A., and Ringle, C. M. (2018). Addressing Endogeneity in International Marketing Applications of Partial Least Squares Structural Equation Modeling, Journal of International Marketing, forthcoming.

What’s the Endogeneity Problem? (I)
Endogeneity can have various roots, such as measurement errors, simultaneous causality, common method variance, and (un)observed heterogeneity. Endogeneity problems often arise from omitted variables that correlate with one or more independent variable(s) and the dependent variable(s) in the regression model. Omitting such variables induces a correlation between the corresponding independent variables and the dependent variable's error term. That is, the independent variables then not only explain the dependent variable, but also the error in the model.

What’s the Endogeneity Problem? (II)
Consider the following regression model, where y represents the dependent variable, x1 and x2 are independent variables, β0 is the intercept, β1 and β2 are the regression coefficients of x1 and x2, and ε is the error term: y = β0 + β1x1 + β2x2 + ε. Assume that the independent variable x2 is uncorrelated with ε (i.e., x2 is exogenous), whereas x1 is endogenous because it is correlated with the error term ε (i.e., Cov(x1, ε) ≠ 0). In this case, the coefficient estimates from standard regression analyses are biased and inconsistent; they can no longer be interpreted causally and may trigger type I and type II errors.
[Figure: path diagram in which x1 and x2 point to y (coefficients β1 and β2); x1 is correlated with ε (corr. ≠ 0), whereas x2 is not (corr. = 0).]
Bias: the difference between an estimator's expected value and the true value of the parameter being estimated.
Consistency: a consistent estimator converges to the correct value as the sample size increases to infinity.
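The bias described above can be illustrated with a small simulation. The following is only a sketch under assumed values: the omitted variable z, the sample size, and all coefficients are hypothetical and are not taken from the slides. Under these assumptions, the large-sample OLS bias in β1 is Cov(x1, ε)/Var(x1) = 0.64/1.64, while the estimate for the exogenous x2 stays close to its true value.

```r
# Sketch: an omitted variable z makes x1 endogenous and biases OLS.
# All names and parameter values are illustrative assumptions.
set.seed(123)
n   <- 5000
z   <- rnorm(n)                  # omitted variable (not available to the analyst)
x1  <- 0.8 * z + rnorm(n)        # endogenous regressor: depends on z
x2  <- rnorm(n)                  # exogenous regressor
eps <- 0.8 * z + rnorm(n)        # error term absorbs the omitted z
y   <- 1 + 0.5 * x1 + 0.3 * x2 + eps

fit <- lm(y ~ x1 + x2)           # standard regression that ignores z
b   <- coef(fit)

cat("cor(x1, eps):", round(cor(x1, eps), 3), "\n")
cat("estimate of beta1 (true value 0.5):", round(b["x1"], 3), "\n")
cat("estimate of beta2 (true value 0.3):", round(b["x2"], 3), "\n")
```

Increasing n does not remove the gap for β1, which is what inconsistency means in this setting.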

On the Relevance Assessment in PLS-SEM
Dealing with endogeneity has been extensively discussed in the literature, especially with respect to different forms of regression and panel models, as well as conjoint analysis. While several studies have discussed endogeneity in the context of factor-based SEM, there is a paucity of research on this topic in PLS-SEM. Some researchers even claim that PLS-SEM does not allow for addressing endogeneity at all. This assertion is astonishing and inaccurate, given that PLS-SEM is grounded in regression analysis, for which numerous approaches for handling endogeneity exist. Hence, research must clarify when and how to address endogeneity in PLS-SEM.

Comparison of Approaches to Deal with Endogeneity
Methods (table columns, left to right): Control variable approach | Instrumental variable (IV) approach | Instrumental variable-free approaches: Gaussian copula | Instrumental variable-free approaches: Latent instrumental variable (LIV)
Criteria (table rows; entries in column order, merged cells appear once):
Number of variables: Data on additional variables must be collected | Instrumental variables have to be identified and data have to be collected | No additional variables needed | No additional variables needed
Distribution of variables: No assumptions required | Endogenous variables have to be non-normally distributed | Endogenous variables have to be non-normally distributed
Nature of dependent variable: Discrete or continuous | Continuous
Statistical tests: Not necessary | Test for significance and relevance | Test for significance
Acceptance in scientific community: Widely accepted and commonly used | Relatively new and therefore rarely used
Implementation in software: No additional implementation necessary | Supported by, for example, SPSS, STATA, and R software packages | The REndo (Gui et al. 2017) package supports the Gaussian copula approach | The REndo (Gui et al. 2017) package supports the LIV model with one dependent and one independent variable

PLS-SEM Endogeneity Assessment Procedure
[Flowchart: Requirements Check (are the assumptions of the Gaussian copula approach fulfilled?) -> if yes, Model Analysis (does the Gaussian copula approach detect endogeneity issues?)]

The Gaussian Copula Approach
Park and Gupta (2012) introduced the Gaussian copula approach, which controls for endogeneity by directly modeling the correlation between the endogenous variable and the error term by means of a copula. For the endogenous variable x1, the copula term c* is added to the regression as an additional regressor:
$c^* = \Phi^{-1}(H(x_1))$
$y = \beta''_0 + \beta''_1 x_1 + \beta''_2 x_2 + \beta''_3 c^* + \varepsilon''$
Here, $H$ is the empirical cumulative distribution function (ECDF) of x1 and $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function.
To determine whether endogeneity is at a critical level, researchers need to assess the significance of the copula coefficient $\beta''_3$. A significant coefficient indicates a critical level of endogeneity. As c* is an estimated quantity, the standard errors of the OLS regression are not correct; Park and Gupta (2012) therefore suggest a bootstrapping approach. The Gaussian copula approach requires the endogenous variable to be nonnormally distributed.
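To make the two equations concrete, the sketch below simulates data that follow the Gaussian copula assumptions and compares plain OLS with the copula-augmented regression. Everything in it is an assumption made for illustration: the correlation rho, the exponential margin of x1, the coefficients, and the clamping constants that keep qnorm() finite. It is not the authors' code.

```r
# Sketch: copula term c* = qnorm(ECDF(x1)) as an additional regressor.
# Simulated data; all parameter values are illustrative assumptions.
set.seed(42)
n   <- 5000
rho <- 0.5                                   # assumed correlation of x1* and eps*
v1  <- rnorm(n)
v2  <- rnorm(n)
x1_star  <- v1
eps_star <- rho * v1 + sqrt(1 - rho^2) * v2  # correlated with x1_star
x1  <- qexp(pnorm(x1_star))                  # non-normal (exponential) margin
x2  <- rnorm(n)                              # exogenous regressor
y   <- 1 + 0.5 * x1 + 0.3 * x2 + eps_star    # true beta1 = 0.5

# Copula term: ECDF of x1, clamped away from 0 and 1, then normal quantile
H      <- ecdf(x1)(x1)
H      <- pmin(pmax(H, 1e-7), 1 - 1e-7)
c_star <- qnorm(H)

b_ols <- coef(lm(y ~ x1 + x2))["x1"]         # ignores endogeneity
b_cop <- coef(lm(y ~ x1 + x2 + c_star))      # includes the copula term

cat("OLS estimate of beta1:   ", round(b_ols, 3), "\n")
cat("Copula estimate of beta1:", round(b_cop["x1"], 3), "\n")
cat("Copula coefficient:      ", round(b_cop["c_star"], 3), "\n")
```

The standard errors reported by lm() for the second model are not valid, for the reason given above; the sketch only compares point estimates.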

Download and Save the Files of the Example
… e.g. to C:\endo

Create and estimate the PLS path model
Data Preparation: create and estimate the PLS path model; generate the data file.
Requirements Check: prepare the R code for the requirements check; run the requirements check in R and analyze the results.
Model Analysis: prepare the R code for the Gaussian copula analysis; run the Gaussian copula analysis in R and interpret the results.

Simple Corporate Reputation Model Example
The book on PLS-SEM explains how to create and estimate the simple corporate reputation model example using SmartPLS 3: Hair, J. F., Hult, G. T. M., Ringle, C. M., and Sarstedt, M. (2017). A Primer on Partial Least Squares Structural Equation Modeling (PLS-SEM), Thousand Oaks, CA: Sage.

Run the Simple Corporate Reputation Model Example in SmartPLS
Settings: missing value marker -99; casewise deletion; path weighting scheme.

Copy Latent Variable Scores to Excel and Save as CSV

CRP_dataset_std.csv … e.g. save to C:\endo

Open CRP_KS-test_code.r with Text Editor
or Notepad++ with highlighting … e.g. from c:\endo

Prepare the R Code (1)
# R code for the Kolmogorov–Smirnov test with Lilliefors correction

# Set directory -> REPLACE WITH THE DIRECTORY INCLUDING THE EXAMPLE CSV FILE
# ON YOUR COMPUTER (i.e., the folder where you saved the data file)
setwd("C:/endo")

# Load required libraries -> PLEASE INSTALL THE "KScorrect" PACKAGE IF YOU HAVE NOT
# ALREADY, E.G., WITH install.packages("KScorrect")
library(KScorrect)

# Read data (extracted standardized latent variable scores from PLS model);
# "CRP_dataset_std.csv" is the name of the csv file including the latent variable scores
CRDdata = read.csv2("CRP_dataset_std.csv", header=TRUE, sep=";", dec=".", stringsAsFactors=FALSE)

# Define the variables for which you would like to run the KS test; in this example,
# the three independent variables when regressing CUSL on LIKE, COMP, and CUSA
LIKE <- CRDdata[,"LIKE"]
COMP <- CRDdata[,"COMP"]
CUSA <- CRDdata[,"CUSA"]

# Run the Kolmogorov–Smirnov test with Lilliefors correction
LcKS(LIKE, "pnorm", nreps = 4999)
LcKS(COMP, "pnorm", nreps = 4999)
LcKS(CUSA, "pnorm", nreps = 4999)

Download, Install and Run the Statistical Software R

Run R and Copy & Paste the Code Into the Console

Results
If the p-value is below 0.05, the variable does not follow a normal distribution. Scroll up (and down) to see the p-values of the other variables.

Requirements Check Before Running Gaussian Copulas
Before running the Gaussian copula approach, we verify that its assumption is met, namely that the variables that potentially exhibit endogeneity are non-normally distributed. We do so by running the Kolmogorov–Smirnov test with Lilliefors correction (Sarstedt and Mooi 2014) on the standardized composite scores of COMP, LIKE, and CUSA, which the PLS path model estimation provides. If the p-value is below 0.05, the variable does not follow a normal distribution. The results show that none of the constructs has normally distributed scores, which allows us to treat them as potentially endogenous in the Gaussian copula analysis.
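For readers who want to see what the Lilliefors correction does, here is a minimal base-R sketch of the idea: because the mean and standard deviation are estimated from the data, the null distribution of the KS statistic is obtained by Monte Carlo simulation instead of the standard KS tables. The function name lilliefors_mc, the simulated data, and the number of replications are hypothetical; this is not the KScorrect implementation used in the slides.

```r
# Sketch: Kolmogorov-Smirnov test for normality with a Monte Carlo
# (Lilliefors-type) correction, base R only. Names are illustrative.
lilliefors_mc <- function(x, nreps = 999) {
  n <- length(x)
  # KS distance to a normal distribution with parameters estimated from v
  ks_stat <- function(v) {
    unname(ks.test(v, "pnorm", mean = mean(v), sd = sd(v))$statistic)
  }
  d_obs <- ks_stat(x)
  # Null distribution: same statistic on samples that really are normal
  d_sim <- replicate(nreps, ks_stat(rnorm(n)))
  p_val <- (sum(d_sim >= d_obs) + 1) / (nreps + 1)
  list(D = d_obs, p.value = p_val)
}

set.seed(2018)
skewed_scores <- rexp(300)     # clearly non-normal (assumed example data)
normal_scores <- rnorm(300)    # normal by construction

res_skewed <- lilliefors_mc(skewed_scores)
res_normal <- lilliefors_mc(normal_scores)

cat("Skewed sample: D =", round(res_skewed$D, 3), " p =", res_skewed$p.value, "\n")
cat("Normal sample: D =", round(res_normal$D, 3), " p =", res_normal$p.value, "\n")
```

A p-value below 0.05 for a variable would, as stated above, indicate that it does not follow a normal distribution and can therefore enter the copula analysis.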

Open CRP_copula_code.r with Text Editor
or Notepad++ with highlighting … e.g. from c:\endo

Prepare the R Code (I)
# R code for correcting for endogeneity in the Corporate Reputation Data PLS model
# using the Gaussian copula approach as described in Park and Gupta (2012)
#
# PLEASE CITE AS:
# Hult, G. T. M., J. F. Hair, D. Proksch, M. Sarstedt, A. Pinkwart, & C. M. Ringle (2018).
# Addressing Endogeneity in International Marketing Applications of Partial
# Least Squares Structural Equation Modeling. Journal of International Marketing,
# forthcoming.

# Set directory -> REPLACE WITH THE DIRECTORY INCLUDING THE EXAMPLE CSV FILE
# ON YOUR COMPUTER (i.e., the folder where you saved the data file)
setwd("C:/endo")

# Load required libraries -> PLEASE INSTALL THE "car" PACKAGE IF YOU HAVE NOT
# ALREADY, E.G., WITH install.packages("car")
library(car)

Prepare the R Code (II)
# Function to create Gaussian Copula
# From Gui, Raluca, Markus Meierer, and Rene Algesheimer (2017),
# "R Package REndo: Fitting Linear Models with Endogenous Regressors using
# Latent Instrumental Variables (Version 1.3),"
createCopula <- function(P){
  H.p <- stats::ecdf(P)
  H.p <- H.p(P)
  # The boundary constants are missing in this transcript; 0.0000001 and 0.9999999
  # are assumed here so that qnorm() returns finite values
  H.p <- ifelse(H.p==0, 0.0000001, H.p)
  H.p <- ifelse(H.p==1, 0.9999999, H.p)
  U.p <- H.p
  p.star <- stats::qnorm(U.p)
  return(p.star)
}

# Function to calculate corrected p-values for regression based on bootstrapped standard errors
bootstrapedSignificance <- function(dataset, bootstrapresults, numIndependentVariables, numCopulas){
  for (i in 1:nrow(summary(bootstrapresults))){
    t <- summary(bootstrapresults)[i, "original"] / summary(bootstrapresults)[i, "bootSE"]
    # df = n (number of observations) - k (number of independent variables + copulas) - 1
    pvalue <- 2 * pt(-abs(t), df=nrow(dataset)-numIndependentVariables-numCopulas-1)
    cat("Pr(>|t|)", rownames(summary(bootstrapresults))[i], ": ", pvalue, "\n")
  }
}

Prepare the R Code (III)
# Read data (extracted standardized latent variable scores from PLS model);
# "CRP_dataset_std.csv" is the name of the csv file including the latent variable scores
CRDdata = read.csv2("CRP_dataset_std.csv", header=TRUE, sep=";", dec=".", stringsAsFactors=FALSE)

# Get the data of the latent variable scores
CUSL <- CRDdata[,"CUSL"]
LIKE <- CRDdata[,"LIKE"]
COMP <- CRDdata[,"COMP"]
CUSA <- CRDdata[,"CUSA"]

# Calculate standard regression (i.e., regress CUSL on LIKE, COMP, and CUSA)
stdModel <- lm(CUSL ~ LIKE + COMP + CUSA)
summary(stdModel)

# Calculate copulas for independent variables within the model
# (i.e., the variables that may be subject to endogeneity issues)
LIKE_star <- createCopula(LIKE)
COMP_star <- createCopula(COMP)
CUSA_star <- createCopula(CUSA)

Prepare the R Code (IV)
# Set bootstrapping rounds
# FOR TESTING PURPOSES, WE RECOMMEND SETTING THIS VALUE TO 100; FOR REPORTING THE FINAL
# RESULTS, WE RECOMMEND SETTING IT TO 10000
bootrounds = 10000

# Calculate Results
# Include Copula for COMP (Model 1)
# Normal regression which includes the Gaussian copula for COMP
copulaResults1 <- lm(CUSL ~ COMP + LIKE + CUSA + COMP_star + 0)
summary(copulaResults1)
# Bootstrap Standard Errors
bootCopulaResults1 <- Boot(copulaResults1, R=bootrounds)
summary(bootCopulaResults1)
# Calculate corrected p-values based on bootstrapped standard errors
# CHECK / UPDATE VALUES! 3 = number of independent variables in the regression model,
# 1 = number of Gaussian copulas in the regression model
bootstrapedSignificance(CRDdata, bootCopulaResults1, 3, 1)
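The bootstrap step and the corrected p-values can also be traced by hand. The sketch below resamples rows with replacement in base R, takes the standard deviation of the bootstrapped coefficients as the standard error, and applies the same t-based p-value formula with df = n - k - 1. It uses simulated data and hypothetical names (dat, boot_est, p_val), keeps an intercept because the simulated scores are not standardized, and is not a substitute for car::Boot.

```r
# Sketch: case-resampling bootstrap of a copula regression in base R.
# Simulated data; all names and values are illustrative assumptions.
set.seed(1)
n      <- 1000
x      <- rexp(n)                               # non-normal regressor
c_star <- qnorm(pmin(ecdf(x)(x), 1 - 1e-7))     # copula term for x
y      <- 1.0 * x + rnorm(n)                    # no endogeneity in this example
dat    <- data.frame(y = y, x = x, c_star = c_star)

fit <- lm(y ~ x + c_star, data = dat)

R <- 200                                        # bootstrap rounds (small for speed)
boot_est <- replicate(R, {
  idx <- sample.int(n, size = n, replace = TRUE)  # resample rows
  coef(lm(y ~ x + c_star, data = dat[idx, ]))
})                                              # matrix: coefficients x R

boot_se <- apply(boot_est, 1, sd)               # bootstrapped standard errors
t_val   <- coef(fit) / boot_se
k       <- 2                                    # independent variables + copulas
p_val   <- 2 * pt(-abs(t_val), df = n - k - 1)

print(round(cbind(estimate = coef(fit), bootSE = boot_se, p = p_val), 4))
```

Because resampling is random, the numbers change with the seed and with R, which is why a larger number of rounds is recommended above for final results.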

Prepare the R Code (V)
# Include Copula for LIKE (Model 2)
# Normal copula regression which includes the Gaussian copula for LIKE
copulaResults2 <- lm(CUSL ~ COMP + LIKE + CUSA + LIKE_star + 0)
summary(copulaResults2)
# Bootstrap standard errors
bootCopulaResults2 <- Boot(copulaResults2, R=bootrounds)
summary(bootCopulaResults2)
# Calculate corrected p-values based on bootstrapped standard errors
# CHECK / UPDATE VALUES! 3 = number of independent variables in the regression model,
# 1 = number of Gaussian copulas in the regression model
bootstrapedSignificance(CRDdata, bootCopulaResults2, 3, 1)

# Include Copula for CUSA (Model 3)
copulaResults3 <- lm(CUSL ~ COMP + LIKE + CUSA + CUSA_star + 0)
summary(copulaResults3)
bootCopulaResults3 <- Boot(copulaResults3, R=bootrounds)
summary(bootCopulaResults3)
# CHECK / UPDATE VALUES! 3 = number of independent variables in the regression model,
# 1 = number of Gaussian copulas in the regression model
bootstrapedSignificance(CRDdata, bootCopulaResults3, 3, 1)

Prepare the R Code (VI)
# Include Copula for LIKE and COMP (Model 4)
# Normal copula regression which includes the Gaussian copulas for COMP and LIKE
copulaResults4 <- lm(CUSL ~ COMP + LIKE + CUSA + COMP_star + LIKE_star + 0)
summary(copulaResults4)
# Bootstrap standard errors
bootCopulaResults4 <- Boot(copulaResults4, R=bootrounds)
summary(bootCopulaResults4)
# Calculate corrected p-values based on bootstrapped standard errors
# CHECK / UPDATE VALUES! 3 = number of independent variables in the regression model,
# 2 = number of Gaussian copulas in the regression model
bootstrapedSignificance(CRDdata, bootCopulaResults4, 3, 2)

# Include Copula for LIKE and CUSA (Model 5)
copulaResults5 <- lm(CUSL ~ COMP + LIKE + CUSA + LIKE_star + CUSA_star + 0)
summary(copulaResults5)
bootCopulaResults5 <- Boot(copulaResults5, R=bootrounds)
summary(bootCopulaResults5)
# CHECK / UPDATE VALUES! 3 = number of independent variables in the regression model,
# 2 = number of Gaussian copulas in the regression model
bootstrapedSignificance(CRDdata, bootCopulaResults5, 3, 2)

Prepare the R Code (VII)
# Include Copula for COMP and CUSA (Model 6)
# Normal copula regression which includes the Gaussian copulas for COMP and CUSA
copulaResults6 <- lm(CUSL ~ COMP + LIKE + CUSA + COMP_star + CUSA_star + 0)
summary(copulaResults6)
# Bootstrap standard errors
bootCopulaResults6 <- Boot(copulaResults6, R=bootrounds)
summary(bootCopulaResults6)
# Calculate corrected p-values based on bootstrapped standard errors
# CHECK / UPDATE VALUES! 3 = number of independent variables in the regression model,
# 2 = number of Gaussian copulas in the regression model
bootstrapedSignificance(CRDdata, bootCopulaResults6, 3, 2)

# Include Copula for LIKE, COMP and CUSA (Model 7)
copulaResults7 <- lm(CUSL ~ COMP + LIKE + CUSA + COMP_star + LIKE_star + CUSA_star + 0)
summary(copulaResults7)
bootCopulaResults7 <- Boot(copulaResults7, R=bootrounds)
summary(bootCopulaResults7)
# CHECK / UPDATE VALUES! 3 = number of independent variables in the regression model,
# 3 = number of Gaussian copulas in the regression model
bootstrapedSignificance(CRDdata, bootCopulaResults7, 3, 3)

Run R and Copy & Paste the Code Into the Console

R Results of Regression Model 1 in the Example (i.e., the regression model includes the copula for COMP)
> # Bootstrap Standard Errors
> bootCopulaResults1 <- Boot(copulaResults1, R=bootrounds)
Loading required namespace: boot
> summary(bootCopulaResults1)
Number of bootstrap replications R = 10000
          original bootBias bootSE bootMed
COMP
LIKE
CUSA
COMP_star
> # Calculate corrected p-values based on bootstrapped standard errors
> bootstrapedSignificance(CRDdata, bootCopulaResults1, 3, 1)
Pr(>|t|) COMP :
Pr(>|t|) LIKE : e-08
Pr(>|t|) CUSA : e-25
Pr(>|t|) COMP_star :
ATTENTION: As bootstrapping is based on random sampling, your results will differ slightly. The results will also differ slightly each time you perform the analysis.

R Results of Regression Model 7 in the Example (i.e., the regression model includes copulas for COMP, LIKE, and CUSA)
> # Bootstrap standard errors
> bootCopulaResults7 <- Boot(copulaResults7, R=bootrounds)
> summary(bootCopulaResults7)
Number of bootstrap replications R = 10000
          original bootBias bootSE bootMed
COMP
LIKE
CUSA
COMP_star
LIKE_star
CUSA_star
> # Calculate corrected p-values based on bootstrapped standard errors
> bootstrapedSignificance(CRDdata, bootCopulaResults7, 3, 3)
Pr(>|t|) COMP :
Pr(>|t|) LIKE : e-05
Pr(>|t|) CUSA : e-16
Pr(>|t|) COMP_star :
Pr(>|t|) LIKE_star :
Pr(>|t|) CUSA_star :
ATTENTION: As bootstrapping is based on random sampling, your results will differ slightly. The results will also differ slightly each time you perform the analysis.

Results Table Including the Outcomes of Model 1
Models (table columns, left to right): Original model | Gaussian copula Model 1 (endogenous variable: COMP) | Gaussian copula Model 2 (endogenous variable: LIKE) | Gaussian copula Model 3 (endogenous variable: CUSA)
Entries are reported as value (p-value).
COMP: 0.016 (0.746) | 0.014 (0.850) | 0.017 (0.763) | 0.021 (0.707)
LIKE: 0.331, < 0.01, 0.370
CUSA: 0.509, 0.511, 0.582
cCOMP: 0.002 (0.973) in Model 1
cLIKE: -0.033 (0.245) in Model 2
cCUSA: -0.041 (0.063) in Model 3

Results Table Including the Outcomes of Model 7
Models (table columns, left to right): Gaussian copula Model 4 (endogenous variables: LIKE, COMP) | Gaussian copula Model 5 (endogenous variables: LIKE, CUSA) | Gaussian copula Model 6 (endogenous variables: COMP, CUSA) | Gaussian copula Model 7 (endogenous variables: LIKE, COMP, CUSA)
Entries are reported as value (p-value).
COMP: -0.006 (0.939) | 0.021 (0.705) | -0.027 (0.717) | -0.037 (0.644)
LIKE: 0.381, 0.000, 0.341, 0.333, 0.362
CUSA: 0.509, 0.580, 0.592, 0.589
cCOMP: 0.019 (0.737) in Model 4 | 0.039 (0.432) in Model 6 | 0.047 (0.395) in Model 7
cLIKE: -0.041 (0.283) in Model 4 | -0.008 (0.790) in Model 5 | -0.024 (0.505) in Model 7
cCUSA: -0.039 (0.084) in Model 5 | -0.049 (0.045) in Model 6 | -0.047 (0.056) in Model 7

Results of the Gaussian Copulas Endogeneity Assessment
The results show that only one Gaussian copula (i.e., cCUSA) is significant (p < 0.1) when treating one variable at a time as endogenous. Including this copula in the model changes the estimated effect of CUSA on CUSL to 0.582 (Model 3), which points to a potential endogeneity problem of CUSA. Similarly, cCUSA is also significant when it is combined with the copulas for LIKE (Model 5, p < 0.1), COMP (Model 6, p < 0.05), and both LIKE and COMP (Model 7, p < 0.1). This supports the possibility that CUSA is endogenous. You may use the results of Model 3 to treat the identified endogeneity problem.

Further Research?
http://journals.ama.org/doi/10.1509/jim.17.0151
Hult, G. T. M., J. F. Hair, D. Proksch, M. Sarstedt, A. Pinkwart, & C. M. Ringle (2018). Addressing Endogeneity in International Marketing Applications of Partial Least Squares Structural Equation Modeling. Journal of International Marketing, forthcoming.

Gaussian copula, graphical representation of solution – part 1

Gaussian copula, graphical representation of solution – part 2
First, we calculate the area under the empirical density function up to the value 5 (i.e., the value of the ECDF at 5). Second, we identify the value for which the area under the standard normal density function is the same as the area just identified.
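The two steps can be reproduced with a few lines of R. The data vector below is made up for illustration (it is not the data behind the figure); with it, 9 of the 20 observations are at or below 5, so the area from the first step is 0.45.

```r
# Sketch: the two-step mapping behind the copula term, for the value 5.
# The data vector is a hypothetical example.
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 10, 12, 14, 15, 18, 20)

# Step 1: area under the empirical density up to 5 (ECDF at 5)
H    <- ecdf(x)
area <- H(5)

# Step 2: value whose area under the standard normal density is the same
z <- qnorm(area)

cat("ECDF at 5:", area, "\n")
cat("Matching standard normal quantile:", round(z, 4), "\n")
cat("Check, pnorm(z):", pnorm(z), "\n")
```

Applying the same mapping to every observed value of the variable yields the copula term used in the regression.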

Explanation of Gaussian Copula approach – Part 1
Key idea: estimate the joint density of the structural error and the endogenous regressor.
Sklar's theorem states that every joint distribution can be written as a function of its margins, and the other way around.
A copula is used to map the two cumulative distribution functions (CDFs) to a joint CDF. We assume that they have a joint normal distribution (therefore the Gaussian copula is used).
We assume that $X$ consists of $x_1$ (an endogenous variable) and $x_2$ (one or more exogenous regressors).
We assume that the CDF of $\varepsilon$ is a normal distribution with mean 0 and variance $\sigma_\varepsilon^2$.
We use the Gaussian copula to get the joint CDF $G(x_1, \varepsilon) = N(x_1^*, \varepsilon^*)$, with $x_1^* = \Phi^{-1}(F_x(x_1))$ and $\varepsilon^* = \Phi^{-1}(F_\varepsilon(\varepsilon))$, where $N$ is the bivariate standard normal distribution with correlation coefficient $\rho$.
We differentiate $G(x_1, \varepsilon)$ to obtain the joint probability density function: $g(x_1, \varepsilon) = \partial^2 G(x_1, \varepsilon) / (\partial x_1 \, \partial \varepsilon)$, which is expressed in terms of the marginal densities.
We could use the density to obtain the likelihood function and then consistently estimate the coefficient of $x_1$ using maximum likelihood estimation (see Park and Gupta, 2012, for the likelihood function).

Explanation of Gaussian Copula approach – Part 2
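We could instead include $x_1^*$ in the equation, based on the following reasoning:
The copula model implies that $x_1^*$ and $\varepsilon^*$ follow the standard bivariate normal distribution with correlation coefficient $\rho$:
$\begin{pmatrix} x_1^* \\ \varepsilon^* \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1-\rho^2} \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}$, with $v_1$ and $v_2$ being independent random variables drawn from a standard normal distribution.
The structural error is: $\varepsilon = F_\varepsilon^{-1}(\Phi(\varepsilon^*)) = \Phi_{\sigma_\varepsilon^2}^{-1}(\Phi(\varepsilon^*)) = \sigma_\varepsilon \varepsilon^*$
We can rewrite the original equation as: $y = x_1 \beta_1 + X_2 \beta_2 + \sigma_\varepsilon \left( \rho \, x_1^* + \sqrt{1-\rho^2} \, v_2 \right)$
We therefore split the structural error into two parts: the first part is correlated with $x_1$, and the second part is uncorrelated with it.
We can estimate the first part by including $x_1^*$ as an additional regressor.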
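As a check on the decomposition above, the following short derivation (a sketch that uses only the assumptions already stated: $v_1$ and $v_2$ are independent standard normal, $x_1^* = v_1$, and $\varepsilon^* = \rho v_1 + \sqrt{1-\rho^2}\, v_2$) confirms that $\varepsilon^*$ has unit variance and correlation $\rho$ with $x_1^*$, and shows which term the additional regressor absorbs:

```latex
\begin{aligned}
\operatorname{Var}(\varepsilon^*) &= \rho^2 \operatorname{Var}(v_1) + (1-\rho^2)\operatorname{Var}(v_2) = \rho^2 + (1-\rho^2) = 1,\\
\operatorname{Cov}(x_1^*, \varepsilon^*) &= \operatorname{Cov}\!\left(v_1,\; \rho v_1 + \sqrt{1-\rho^2}\, v_2\right) = \rho,\\
y &= x_1 \beta_1 + X_2 \beta_2 + \underbrace{\sigma_\varepsilon \rho}_{\text{coefficient of } x_1^*}\, x_1^* + \underbrace{\sigma_\varepsilon \sqrt{1-\rho^2}\; v_2}_{\text{new error term}} .
\end{aligned}
```

Since $v_2$ is independent of $v_1$, the new error term is uncorrelated with both $x_1^*$ and $x_1$, and the coefficient of $x_1^*$ equals $\sigma_\varepsilon \rho$, which is zero exactly when $\rho = 0$, that is, when there is no endogeneity.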
