Review
Review
We’ve covered three main topics thus far:
  Data collection
  Data summarization
  Probability
Data Collection
We’ve talked about three ways of collecting data:
Survey
  Sampling frame, questionnaire, probability sample, convenience sample, non-response bias, other types of bias
Observational study
  No assignment of treatments. No causal conclusions.
Randomized experiment
  Random assignment of units/subjects to treatments. If done properly, causal conclusions are possible (though conclusions might not generalize).
  Why randomize?
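A minimal sketch of what random assignment can look like in practice. The subject labels, the group sizes, and the seed are all made up for illustration; the point is only that chance, not the researcher, decides who gets which treatment.

```python
import random

# Hypothetical subject labels (assumption); in practice these are the real units.
subjects = [f"subject_{i}" for i in range(1, 11)]

# A fixed seed is used only so this sketch is reproducible.
rng = random.Random(2024)

# Shuffle a copy so the original roster is left untouched.
shuffled = subjects[:]
rng.shuffle(shuffled)

# First half gets the treatment, second half is the control group.
half = len(shuffled) // 2
treatment = shuffled[:half]
control = shuffled[half:]

print("treatment:", treatment)
print("control:  ", control)
```

Because the split is driven by the shuffle, lurking variables should balance out across the two groups on average, which is what supports a causal conclusion.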
Data summarization
We talked about graphical and numerical summaries for one variable and for two variables. It is important to identify the type of variable.
One categorical/qualitative variable
  graphical: pie chart, bar graph
  numerical: counts/percents/frequencies
One quantitative variable
  graphical: histogram/boxplot (shape, center, spread, outliers)
  numerical: mean, median, standard deviation, inter-quartile range, range, percentiles
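The numerical summaries for one quantitative variable can be sketched with Python's standard `statistics` module. The data values below are invented for illustration. Note that quartile conventions differ between software packages, so the IQR here may not match what JMP reports for the same data.

```python
import statistics

# Hypothetical data (assumption): points scored in eight games.
points = [68, 72, 75, 79, 81, 84, 90, 101]

mean = statistics.mean(points)
median = statistics.median(points)
sd = statistics.stdev(points)            # sample SD (divides by n - 1)

# Quartiles using the module's default method; other software may differ.
q1, q2, q3 = statistics.quantiles(points, n=4)
iqr = q3 - q1

data_range = max(points) - min(points)

print("mean:", mean)
print("median:", median)
print("SD:", round(sd, 2))
print("IQR:", iqr)
print("range:", data_range)
```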
Data summarization
Two variables
One categorical/qualitative and one quantitative
  graphical: side-by-side boxplots
  numerical: means, medians, SDs, IQRs, etc. for each category
Two quantitative
  graphical: scatterplot (form, direction, strength, outliers)
  numerical: means, SDs, etc. for both; correlation coefficient
  If the association is linear, model it with a straight line: slope and intercept of the regression line (prediction, interpretation, extrapolation, etc.)
Two categorical/qualitative
  graphical: plots we didn’t talk about
  numerical: contingency tables; marginal frequencies, conditional frequencies
  Also relative risk and odds ratios
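For two categorical variables, the summaries above can be worked out from a 2x2 contingency table. This is a sketch with invented counts (home/away by win/loss); the variable names are assumptions, not anything from the course data.

```python
# Hypothetical 2x2 contingency table (assumption): game location by result.
table = {
    "home": {"win": 12, "loss": 3},
    "away": {"win": 6, "loss": 9},
}

total = sum(sum(row.values()) for row in table.values())

# Marginal frequency: overall proportion of wins, ignoring location.
marginal_win = sum(row["win"] for row in table.values()) / total

# Conditional frequencies: proportion of wins within each location.
p_win_home = table["home"]["win"] / sum(table["home"].values())
p_win_away = table["away"]["win"] / sum(table["away"].values())

# Relative risk: ratio of the two conditional proportions.
relative_risk = p_win_home / p_win_away

# Odds ratio: ratio of the odds of winning at home to the odds away.
odds_home = table["home"]["win"] / table["home"]["loss"]
odds_away = table["away"]["win"] / table["away"]["loss"]
odds_ratio = odds_home / odds_away

print("marginal P(win):", marginal_win)
print("P(win | home):", p_win_home, " P(win | away):", p_win_away)
print("relative risk:", relative_risk)
print("odds ratio:", odds_ratio)
```

With these made-up counts the relative risk and the odds ratio differ noticeably, a reminder that the two are not interchangeable.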
Probability
To find the probability of event A:
Enumerate the sample space. Count the number of outcomes in event A. Divide by the total number of outcomes (this assumes the outcomes are equally likely).
  Easy to do if the sample space is small
Use probability laws to push symbols around
  Independence, mutually exclusive events, joint = marginal × conditional
  When the sample space is large, this is the only way to approach things
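Both approaches can be sketched on a small example. The example itself (two fair dice, with A = "sum is 7" and B = "first die is 4") is an assumption chosen for illustration. Exact fractions are used so the probability laws can be checked without rounding.

```python
from fractions import Fraction
from itertools import product

# Enumerate the sample space: all 36 equally likely rolls of two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

def prob(event, space):
    """Count outcomes in the event and divide by the number of outcomes."""
    favorable = [o for o in space if event(o)]
    return Fraction(len(favorable), len(space))

def sum_is_7(o):
    return o[0] + o[1] == 7

def first_is_4(o):
    return o[0] == 4

p_a = prob(sum_is_7, sample_space)
p_b = prob(first_is_4, sample_space)
p_a_and_b = prob(lambda o: sum_is_7(o) and first_is_4(o), sample_space)

# Conditional probability by restricting the sample space to B.
space_given_b = [o for o in sample_space if first_is_4(o)]
p_a_given_b = prob(sum_is_7, space_given_b)

print("P(A) =", p_a)
print("P(B) =", p_b)
print("P(A and B) =", p_a_and_b)
print("P(A | B) =", p_a_given_b)
print("joint == marginal * conditional:", p_a_and_b == p_b * p_a_given_b)
print("A and B independent:", p_a_and_b == p_a * p_b)
```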
Duke b-ball
What type of study is this? Survey? Randomized experiment? Observational study?
Might it be reasonable to assume that the opponents are a random sample of all types of opponents Duke could potentially face?
If not, then what we see can’t be generalized to teams Duke might play in the future. (In other words, the population is the teams that Duke has played so far, and we have observations on all of them.)
Limitations
Since this is not a designed experiment, what are the limitations?
Can we make causal conclusions? Nope.
Is there potential for lurking variables? Yup. In fact, I’d bet there are some.
What type of information does looking at this type of data provide?
JMP
Let’s look at a few variables and summarize them graphically and numerically.
Regression vs correlation coefficient
Does a change of units change the value?
  Correlation coefficient (no)
  Regression slope (yes)
Does defining the response and explanatory variable matter?
  Correlation coefficient (no)
  Regression slope (yes)
Provides direction and strength of linear association?
  Correlation coefficient (yes, yes)
  Regression slope (yes, no)
Quantifies linear association between two quantitative variables
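The first two comparisons can be checked numerically. This is a sketch with made-up height and weight values; the helper functions `correlation` and `slope` are written out by hand here for illustration rather than taken from any particular package.

```python
import math

def _sums(x, y):
    """Return the centered sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy, sxx, syy

def correlation(x, y):
    sxy, sxx, syy = _sums(x, y)
    return sxy / math.sqrt(sxx * syy)

def slope(x, y):
    """Least-squares slope for predicting y (response) from x (explanatory)."""
    sxy, sxx, _ = _sums(x, y)
    return sxy / sxx

# Hypothetical data (assumption): heights in inches, weights in pounds.
height_in = [60, 62, 65, 68, 70, 72]
weight_lb = [115, 120, 140, 155, 160, 180]

# Change of units: inches to centimeters.
height_cm = [h * 2.54 for h in height_in]

r_in = correlation(height_in, weight_lb)
r_cm = correlation(height_cm, weight_lb)
b_in = slope(height_in, weight_lb)
b_cm = slope(height_cm, weight_lb)

# Swap the roles of response and explanatory variable.
r_swapped = correlation(weight_lb, height_in)
b_swapped = slope(weight_lb, height_in)

print("r (inches) vs r (cm):", r_in, r_cm)
print("slope (inches) vs slope (cm):", b_in, b_cm)
print("r after swapping roles:", r_swapped)
print("slope after swapping roles:", b_swapped)
```

The correlation is the same in every case, while the slope is rescaled by the unit change and takes a different value when the roles are swapped.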
Correlation coefficient vs regression
Influenced by outliers?
  Correlation coefficient (yes)
  Regression slope (yes); such outliers are sometimes called influential points
Can we conclude that the explanatory variable causes change in the response variable?
  Correlation coefficient (no)
  Regression slope (no)
  Although under a well-designed experiment it is possible
Must both variables be quantitative?
  Correlation coefficient (yes)
  Regression slope (not necessarily, but I don’t think we’ll be able to cover quantitative-qualitative regression, often called ANOVA)
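A small sketch of how a single outlier can move both the correlation coefficient and the regression slope. The data are invented: six points lying close to a line, plus one added point far from the rest. The helper `r_and_slope` is a hand-written illustration, not a library function.

```python
import math

def r_and_slope(x, y):
    """Return (correlation coefficient, least-squares slope of y on x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy), sxy / sxx

# Hypothetical data (assumption): roughly linear with slope near 2.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

r_clean, b_clean = r_and_slope(x, y)

# Add one point far from the others in x and well below the trend in y.
x_out = x + [20]
y_out = y + [5.0]

r_out, b_out = r_and_slope(x_out, y_out)

print("without outlier: r =", round(r_clean, 3), " slope =", round(b_clean, 3))
print("with outlier:    r =", round(r_out, 3), " slope =", round(b_out, 3))
```

One added point is enough to pull both summaries far from their original values, which is why a scatterplot should be checked before trusting either number.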