Discovering and Describing Relationships Farideh Dehkordi-Vakil
Exploring Relationships between Two Quantitative Variables Scatter plots Represent the relationship between two different continuous variables measured on the same subjects. Each point in the plot represents the values for one subject for the two variables.
Exploring Relationships between Two Quantitative Variables Example: Data reported by the Organisation for Economic Co-operation and Development (OECD) on its 29 member nations in 1998. Per capita gross domestic product is on the x-axis. Per capita health care expenditure is on the y-axis.
Exploring Relationships between Two Quantitative Variables We can describe the overall pattern of a scatter plot by its form or shape, its direction, and its strength.
Exploring Relationships between Two Quantitative Variables Form or shape: The form shown by the scatter plot is linear if the points lie in a straight-line pattern. Strength: The relationship is strong if the points lie close to a line, with little scatter.
Exploring Relationships between Two Quantitative Variables Direction: Positive and negative association. Two variables are positively associated when above-average values of one variable tend to occur in individuals with above-average values of the other variable, and below-average values of both also tend to occur together. Two variables are negatively associated when above-average values of one tend to occur in subjects with below-average values of the other, and vice versa.
Exploring Relationships between Two Quantitative Variables Per capita health care example: The “subjects” studied are countries. The form of the relationship is roughly linear. The direction is positive. The relationship is strong.
Correlation It is often useful to have a measure of the degree of association between two variables. For example, you may believe that sales are affected by expenditures on advertising, and want to measure the degree of association between sales and advertising. The correlation coefficient is a numeric measure of the direction and strength of the linear relationship between two continuous variables. The notation for the sample correlation coefficient is r.
Correlation There are several alternative ways to write the algebraic expression for the correlation coefficient. The following is one: $r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\left[\sum (X - \bar{X})^2\right]\left[\sum (Y - \bar{Y})^2\right]}}$, where each sum runs over the n subjects. X and Y represent the two variables of interest, for example advertising and sales, or per capita gross domestic product and per capita health care expenditure. n is the number of subjects in the sample. The notation for the population correlation coefficient is $\rho$.
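As a minimal sketch, the sample correlation coefficient can be computed directly from deviations about the means. The function name `pearson_r` and the advertising and sales figures below are hypothetical, chosen only for illustration, and the code assumes the common deviation-from-mean form of r.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r, deviation-from-mean form."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sums of squared deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    # Note: a constant variable makes sxx or syy zero, so r is undefined
    return sxy / math.sqrt(sxx * syy)

# Hypothetical advertising and sales figures, for illustration only
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(advertising, sales)
```

For these made-up figures r comes out just under 1, reflecting a strong positive linear association.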
Correlation Facts about the correlation coefficient: r has no unit. r > 0 indicates a positive association; r < 0 indicates a negative association. r is always between –1 and +1. Values of r near 0 imply a very weak linear relationship. Correlation measures only the strength of linear association.
Correlation We could perform a hypothesis test to determine whether the value of a sample correlation coefficient (r) gives us reason to believe that the population correlation ($\rho$) is significantly different from zero. The hypothesis test would be H0: $\rho = 0$, Ha: $\rho \neq 0$.
Correlation The test statistic would be $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$. The test statistic has a t-distribution with n - 2 degrees of freedom. Reject H0 if $|t| > t_{\alpha/2,\, n-2}$.
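The test can be sketched in a few lines, assuming the usual form of the statistic, t = r·sqrt(n − 2)/sqrt(1 − r²). The function names and the numbers in the usage lines are hypothetical; the critical value must be looked up in a t table for n − 2 degrees of freedom.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need more than 2 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def reject_two_sided(t, critical_value):
    """Two-sided decision: reject H0 when |t| exceeds the critical value."""
    return abs(t) > critical_value

# Hypothetical sample: r = 0.5 from n = 27 subjects (25 degrees of freedom)
t = correlation_t_statistic(0.5, 27)
# 2.060 is the tabulated upper 0.025 point of t with 25 degrees of freedom
decision = reject_two_sided(t, 2.060)
```

Keeping the critical value as an argument avoids hard-coding a significance level; the same function serves a one-sided test if the comparison is made on t rather than |t|.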
Example: Do wages rise with experience? Many factors affect the wages of workers: the industry they work in, their type of job, their education and experience, and changes in general levels of wages. We will look at a sample of 59 married women who hold customer service jobs in Indiana banks. The following table gives their weekly wages at a specific point in time and their length of service with their employer, in months. The size of the place of work is recorded simply as “large” (100 or more workers) or “small.” Because industry, job type, and the time of measurement are the same for all 59 subjects, we expect to see a clear relationship between wages and length of service.
Example: Do wages rise with experience? The correlation between wages and length of service for the 59 bank workers is r = 0.3535. We expect a positive correlation between length of service and wages in the population of all married female bank workers. Is the sample result convincing that this is true?
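Example: Do wages rise with experience? To compute the correlation, we need the summary sums from the data; replacing these in the formula gives r = 0.3535. We want to test H0: $\rho = 0$ against Ha: $\rho > 0$. The test statistic is $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.3535\sqrt{57}}{\sqrt{1-0.3535^2}} \approx 2.853$.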
Example: Do wages rise with experience? Comparing t = 2.853 with critical values from the t table with n - 2 = 57 degrees of freedom helps us make our decision. Conclusion: Since P(t > 2.853) < .005, we reject H0. The sample gives strong evidence of a positive correlation between wages and length of service in the population.
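The arithmetic of this example can be checked with a short script. Only r = 0.3535 and n = 59 come from the example; the critical value used below is an assumption taken from a standard t table for 50 degrees of freedom, which is slightly larger than the value for 57 degrees of freedom and therefore conservative.

```python
import math

r = 0.3535   # sample correlation reported for the 59 bank workers
n = 59       # number of workers in the sample

# t statistic for H0: rho = 0 against Ha: rho > 0
t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)

# Upper 0.005 critical value for 50 degrees of freedom (standard t table).
# It exceeds the value for 57 df, so passing this bar implies P < .005.
t_crit_conservative = 2.678

reject_h0 = t > t_crit_conservative
print(round(t, 3), reject_h0)  # prints: 2.853 True
```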
Correlograms: An Alternative Method of Data Exploration In evaluating time series data, it is useful to look at the correlation between successive observations over time. This measure of correlation is called autocorrelation and may be calculated as follows: $r_k = \frac{\sum_{t=k+1}^{n} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{n} (y_t - \bar{y})^2}$, where $r_k$ = autocorrelation coefficient for a k-period lag, $\bar{y}$ = mean of the time series, $y_t$ = value of the time series at period t, and $y_{t-k}$ = value of the time series k periods before period t.
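A minimal sketch of the lag-k autocorrelation follows, assuming the standard form in which the numerator sums cross-products of deviations k periods apart and the denominator is the total sum of squared deviations about the series mean. The function name and the example series are hypothetical.

```python
def autocorrelation(y, k):
    """Lag-k autocorrelation r_k of the series y."""
    n = len(y)
    if not 0 < k < n:
        raise ValueError("lag k must satisfy 0 < k < len(y)")
    y_bar = sum(y) / n
    # Numerator: cross-products of deviations k periods apart (t = k+1..n)
    num = sum((y[t] - y_bar) * (y[t - k] - y_bar) for t in range(k, n))
    # Denominator: total sum of squared deviations (t = 1..n)
    den = sum((v - y_bar) ** 2 for v in y)
    return num / den

# Hypothetical series with a steady upward trend
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
r1 = autocorrelation(series, 1)  # 0.625 for this series
```

Python lists are zero-indexed, so `range(k, n)` corresponds to periods t = k + 1 through n in the notation of the slide.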
Correlograms: An Alternative Method of Data Exploration Autocorrelation coefficients for different time lags can be used to answer the following questions about time series data. Are the data random? If so, the autocorrelations between $y_t$ and $y_{t-k}$ are close to zero for any lag, and the successive values of the time series are not related to each other.
Correlograms: An Alternative Method of Data Exploration Is there a trend? If the series has a trend, $y_t$ and $y_{t-k}$ are highly correlated. The autocorrelation coefficients are significantly different from zero for the first few lags and then gradually drop toward zero. The autocorrelation coefficient for lag 1 is often very large (close to 1). A series that contains a trend is said to be non-stationary.
Correlograms: An Alternative Method of Data Exploration Is there a seasonal pattern? If a series has a seasonal pattern, there will be a significant autocorrelation coefficient at the seasonal time lag or at multiples of the seasonal lag. The seasonal lag is 4 for quarterly data and 12 for monthly data.
Correlograms: An Alternative Method of Data Exploration Is it stationary? A stationary time series is one whose basic statistical properties, such as the mean and variance, remain constant over time. Autocorrelation coefficients for a stationary series decline to zero fairly rapidly, generally after the second or third time lag.
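The patterns described in these questions can be illustrated with two made-up series, one with a steady trend and one with a repeating quarterly pattern. This is only a sketch: the series are hypothetical, and the 2/sqrt(n) bound used for comparison is the common rule-of-thumb 5% bound, assumed here.

```python
import math

def autocorrelation(y, k):
    """Lag-k autocorrelation r_k of the series y."""
    n = len(y)
    y_bar = sum(y) / n
    num = sum((y[t] - y_bar) * (y[t - k] - y_bar) for t in range(k, n))
    den = sum((v - y_bar) ** 2 for v in y)
    return num / den

# Hypothetical series, for illustration only
trend = [float(t) for t in range(1, 21)]     # steady upward trend, 20 periods
seasonal = [10.0, 20.0, 30.0, 20.0] * 6      # quarterly pattern, 24 periods

trend_acf = [autocorrelation(trend, k) for k in range(1, 5)]
seasonal_acf = [autocorrelation(seasonal, k) for k in range(1, 5)]

# Rule-of-thumb bound for the seasonal series (assumed: 2 / sqrt(n))
bound = 2.0 / math.sqrt(len(seasonal))
```

In this sketch the trend series gives autocorrelations that start high (0.85 at lag 1) and decline slowly, while the seasonal series has a coefficient of about 0.83 at lag 4, the seasonal lag for quarterly data, well above the bound of about 0.41.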
Correlograms: An Alternative Method of Data Exploration To determine whether the autocorrelation at lag k is significantly different from zero, the following hypotheses and rule of thumb may be used. H0: $\rho_k = 0$, Ha: $\rho_k \neq 0$. For any k, reject H0 if $|r_k| > 2/\sqrt{n}$, where n is the number of observations. This rule of thumb is for $\alpha = 5\%$.
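The rule of thumb is simple enough to express as a one-line check. The sketch below assumes the bound 2/sqrt(n); the function name and the sample values are hypothetical.

```python
import math

def is_significant(r_k, n):
    """Rule of thumb (about the 5% level): reject H0 if |r_k| > 2 / sqrt(n)."""
    return abs(r_k) > 2.0 / math.sqrt(n)

# Hypothetical coefficients from a series of n = 100 observations,
# for which the bound is 2 / sqrt(100) = 0.2
flag_a = is_significant(0.25, 100)   # True: 0.25 exceeds 0.2
flag_b = is_significant(-0.15, 100)  # False: 0.15 does not exceed 0.2
```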
Correlograms: An Alternative Method of Data Exploration The hypothesis test developed to determine whether a particular autocorrelation coefficient is significantly different from zero is: Hypotheses: H0: $\rho_k = 0$, Ha: $\rho_k \neq 0$. Test statistic: $t = \frac{r_k - 0}{1/\sqrt{n-k}}$
Correlograms: An Alternative Method of Data Exploration Reject H0 if $|t| > t_{\alpha/2}$, with n - k degrees of freedom.
Correlograms: An Alternative Method of Data Exploration The plot of the autocorrelations versus time lag is called a correlogram. The horizontal axis is the time lag. The vertical axis is the autocorrelation coefficient. Patterns in a correlogram are used to analyze key features of the data.
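A rough text-based correlogram can be sketched without any plotting library. Everything here is illustrative: the function names, the bar layout (rotated so that lags run down the page), and the use of the 2/sqrt(n) rule-of-thumb bound to flag coefficients are assumptions, and the series is made up.

```python
import math

def autocorrelations(y, max_lag):
    """Autocorrelation coefficients r_1 .. r_max_lag of the series y."""
    n = len(y)
    y_bar = sum(y) / n
    den = sum((v - y_bar) ** 2 for v in y)
    return [
        sum((y[t] - y_bar) * (y[t - k] - y_bar) for t in range(k, n)) / den
        for k in range(1, max_lag + 1)
    ]

def text_correlogram(y, max_lag, width=20):
    """One line per lag: lag, r_k, a bar, and '*' if |r_k| > 2/sqrt(n)."""
    bound = 2.0 / math.sqrt(len(y))
    lines = []
    for k, r_k in enumerate(autocorrelations(y, max_lag), start=1):
        # Bar length is proportional to |r_k|; the symbol shows the sign
        bar = ("+" if r_k >= 0 else "-") * round(abs(r_k) * width)
        flag = " *" if abs(r_k) > bound else ""
        lines.append(f"{k:>3} {r_k:+.3f} {bar}{flag}")
    return lines

# Hypothetical quarterly series with a repeating pattern
for line in text_correlogram([10, 20, 30, 20] * 6, 4):
    print(line)
```

For this series the lag-4 line carries a long bar and a `*` flag, which is the kind of spike at the seasonal lag that a correlogram is meant to reveal.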
Example: Mobile Home Shipments Correlogram for the mobile home shipments. Note that this is quarterly data.
Example: Japanese Exchange Rate As the world’s economy becomes increasingly interdependent, various exchange rates between currencies have become important in making business decisions. For many U.S. businesses, the Japanese exchange rate (in yen per U.S. dollar) is an important decision variable. A time series plot of the Japanese yen to U.S. dollar exchange rate is shown below. On the basis of this plot, would you say the data are stationary? Is there any seasonal component to this time series plot?
Example: Japanese Exchange Rate Here is the autocorrelation structure for EXRJ. With a sample size of 12, the critical value is $2/\sqrt{12} \approx 0.577$. This is the approximate 95% critical value for rejecting the null hypothesis of zero autocorrelation at lag k.
Example: Japanese Exchange Rate The correlogram for EXRJ is given below.
Example: Japanese Exchange Rate Since the autocorrelation coefficients fall below the critical value after just two periods, we can conclude that there is no trend in the data.
Example: Japanese Exchange Rate To check for seasonality at $\alpha = .05$, the hypotheses are: H0: $\rho_{12} = 0$, Ha: $\rho_{12} \neq 0$. Test statistic is: Reject H0 if
Example: Japanese Exchange Rate Since the test statistic does not fall in the rejection region, we do not reject H0; therefore, seasonality does not appear to be an attribute of the data.