MATH1005 STATISTICS, Tutorial 3: Bivariate Data
M.Harahap@maths.usyd.edu.au | http://mahritaharahap.wordpress.com/teaching-areas
In statistics we usually want to analyse an entire population, but collecting data on the whole population is usually impractical, expensive or impossible. Instead we collect a sample from the population (sampling) and draw conclusions about the population parameters using the statistics of the sample (inference), with some stated level of accuracy (confidence level). A population is the collection of all possible individuals, objects, or measurements of interest. A sample is a subset of the population of interest.
Regression
The linear regression line characterises the relationship between two numerical variables. Regression analysis helps us draw insights from data by quantifying the impact of one variable on the other. It examines the relationship between one independent variable (predictor/explanatory) and one dependent variable (response/outcome). The regression line is based on the equation of a straight line: Y = β0 + β1X.
Y: outcome variable (response variable, dependent variable) - the outcome to be measured/predicted.
X: predictor variable (explanatory variable, independent variable) - the variable one can control.
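For example, the least squares line can be fitted in R with lm(). The sketch below assumes the Olympics100mW data frame used later in this tutorial is already loaded, with columns Year and Result.

model <- lm(Result ~ Year, data = Olympics100mW)   # least squares fit of Result = b0 + b1*Year
coef(model)                                         # estimated intercept b0 and slope b1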
Correlation
Correlation measures the association between two numerical variables, with the strength of the relationship measured by the correlation coefficient r. r is a statistic that quantifies the linear relationship between two variables and falls between -1.00 and 1.00. The sign of r indicates the direction of the relationship; its magnitude indicates the strength of the relationship. NOTE: Regression examines the relationship between one independent variable and one dependent variable, summarised by the slope of the regression line. Correlation indicates the association between two numerical variables, with the strength and direction of the relationship measured by the correlation coefficient.
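In R the correlation coefficient can be computed with cor(); a minimal sketch, again assuming the Olympics100mW data frame with columns Year and Result:

r <- cor(Olympics100mW$Year, Olympics100mW$Result)  # Pearson correlation coefficient
r                                                    # sign gives direction, magnitude gives strength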
Strength & Direction of Correlation
[Figure: example scatter plots illustrating direction (positive, negative) and strength (perfect, strong, moderate, weak) of correlation.]
R²: Coefficient of Determination
R-squared gives the proportion of the total variability in the response variable (Y) that is "explained" by the least squares regression line based on the predictor variable (X). It is usually stated as a percentage. Interpretation: approximately R²% of the variation in the dependent variable can be explained by the independent variable through the regression model.
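In R, R-squared can be read from the fitted model; a sketch assuming the model fitted above:

summary(model)$r.squared                             # proportion of variation in Result explained by Year
cor(Olympics100mW$Year, Olympics100mW$Result)^2      # for simple linear regression this equals r^2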
> Result <- Olympics100mW$Result
> Olympics100mW[order(Result), ]
  Year                  Athlete Medal Country Result
  1988 Florence Griffith-Joyner  GOLD     USA  10.54
  2012  Shelly-Ann Fraser-Pryce  GOLD     JAM  10.75
# Sorting by Result shows the fastest winning time: 10.54 s by Florence Griffith-Joyner (USA) at the 1988 Seoul Olympics.
# The scatter plot on the right indicates that a linear regression might be appropriate. This is further supported by the correlation coefficient r = -0.8736502: since r^2 is about 0.76, roughly 76% of the variability in Result is explained by Year.
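A sketch of how such a scatter plot with the fitted line could be produced, using the same assumed data frame:

plot(Result ~ Year, data = Olympics100mW,
     xlab = "Year", ylab = "Winning time (seconds)")   # scatter plot of Result against Year
abline(lm(Result ~ Year, data = Olympics100mW))        # overlay the least squares line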
# The boxplot shows 1 outlier (9502 mins in 1945 by Rani).
# Take a logarithmic transformation of Time to reduce the influence of the outlier, and use log(Time) as the subsequent y variable.
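A sketch of the suggested transformation, using a hypothetical data frame named records with columns Year and Time (the data frame name is illustrative, not from the tutorial):

records$logTime <- log(records$Time)          # log transformation reduces the skew caused by the outlier
boxplot(records$logTime, ylab = "log(Time)")  # re-examine the distribution on the log scale
model2 <- lm(logTime ~ Year, data = records)  # regress log(Time) on Year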