Statistics 100 Lecture Set 7
Chapters 13 and 14 are covered in this lecture set. Please read these; you are responsible for all material. Will be doing chapters in the coming weeks. Suggested problems: –13.5, 13.17, 13.19, 13.25 –14.5, 14.9, 14.15, 14.23
Recall: –A distribution of a variable tells us what values it takes on and how often it takes these values. –Visualize the empirical (observed dataset) distribution using a histogram
Consider the histogram of a numerical variable for a population. Example: Weights of 20-year-old Canadian males (mean = 182 lbs). What would this look like?
Simulated 100,000 individuals –I made up the values. These are not real. –But they come from a distribution with the correct mean. With a lot of data, we can draw histograms with many, many bins.
Idea: –Many populations of numerical variables have histograms that can be approximated by smooth curves –Can describe the overall shape of a distribution with a mathematical model called a density curve –Describes the main features of a distribution with a single expression Advantage of doing this:
Can fit formula to data –May give better estimates of population parameters than empirical distribution does –Works well if population follows a certain formula (at least approximately) –This is called modeling –Formula is a model A common model:
Example: Treat my 100,000 simulated weights as the population of males, and take a random sample of 100 from it. How do results from the empirical distribution and from the model compare to the population?
The estimate that is closest to the population value is highlighted. [Table columns: Percentile | Population | Sample | Normal model for these data]
Between the 20th and 80th percentiles –Model and empirical estimates are pretty similar. The model is just a tiny bit better … sometimes. This is true in general: decent models and empirical estimates work about equally well at estimating population percentiles in the centre of a distribution.
In the tails –The farther out in the tail, the worse the empirical estimates are –The model still does well there. Empirical estimates perform poorly way out in the tails (below the 10th or above the 90th percentile) because there is too little data there.
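The comparison described on these slides can be re-created with a small simulation. This is only a sketch under stated assumptions: the slides give the mean (182 lbs) but not the spread or the shape of the made-up population, so the normal shape and the standard deviation of 30 lbs below are assumptions, and `empirical_percentile` and `compare` are hypothetical helper names.

```python
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

# Assumption: a normal population with the slides' mean (182 lbs)
# and a made-up standard deviation of 30 lbs.
POP_MEAN, POP_SD = 182.0, 30.0
population = sorted(random.gauss(POP_MEAN, POP_SD) for _ in range(100_000))

# Random sample of 100 individuals from the population.
sample = random.sample(population, 100)
sorted_sample = sorted(sample)

# Normal model fitted to the sample: only its mean and standard deviation.
model = statistics.NormalDist(statistics.mean(sample), statistics.stdev(sample))


def empirical_percentile(sorted_values, p):
    """Value below which roughly p percent of the sorted data fall."""
    index = min(len(sorted_values) - 1, int(p / 100 * len(sorted_values)))
    return sorted_values[index]


def compare(p):
    """Return (population, sample, model) values for the p-th percentile."""
    return (
        empirical_percentile(population, p),
        empirical_percentile(sorted_sample, p),
        model.inv_cdf(p / 100),
    )


for p in (1, 5, 20, 50, 80, 95, 99):
    pop, emp, mod = compare(p)
    print(f"{p:>2}th percentile: population={pop:6.1f}  sample={emp:6.1f}  model={mod:6.1f}")
```

Changing the seed gives a different sample, which is a quick way to see how much the tail estimates from the sample move around compared with the model's.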
Conclusion from this –Models can improve on the ability of data to describe a population! Warning: –Assumes that we can find a good formula –Not “one size fits all” –There are many different models to fit various shapes
The Normal Density model
Can calculate percentages for values in any interval, just like in a histogram: the percentage is the proportion of the area under the curve that lies over that interval.
The Normal Density model –Is actually the source of the empirical rule:
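As a quick check of this connection, the areas under a normal curve within 1, 2 and 3 standard deviations of the mean can be computed directly. The sketch below uses Python's standard-library `statistics.NormalDist`; `area_within` is a hypothetical helper name. The results are roughly 68%, 95% and 99.7%, the usual statement of the empirical rule.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def area_within(k):
    """Proportion of the area under the normal curve within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {area_within(k):.4f}")
```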
Standard scores
Why is this important?
Example The distribution of the heights of young men is approximately normal with a mean of 70 inches and a standard deviation of 2.5 inches. What is the standard score for a height of 72 inches? What is the probability of randomly selecting a young man with a height less than 72 inches?
Example On an IQ test for the year old age group, the scores are approximately normally distributed with a mean of 110 and a standard deviation of 15. Sarah scores 130 on the test. What is her standard score? What percentile is her score?
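One way to work the two examples above is sketched here, assuming the usual definition of a standard score, (value - mean) / standard deviation, and using the standard normal curve for the areas. `standard_score` is a hypothetical helper name.

```python
from statistics import NormalDist


def standard_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


standard_normal = NormalDist()  # mean 0, standard deviation 1

# Heights example: mean 70 inches, SD 2.5 inches, height 72 inches.
z_height = standard_score(72, 70, 2.5)
p_height = standard_normal.cdf(z_height)  # proportion of heights below 72 inches

# IQ example: mean 110, SD 15, Sarah's score 130.
z_iq = standard_score(130, 110, 15)
percentile_iq = 100 * standard_normal.cdf(z_iq)

print(f"height: z = {z_height:.2f}, P(height < 72) = {p_height:.3f}")
print(f"IQ:     z = {z_iq:.2f}, percentile = {percentile_iq:.1f}")
```

The height of 72 inches has a standard score of 0.8, so roughly 79% of young men are shorter. Sarah's standard score is about 1.33, which puts her near the 91st percentile.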
Visualizing and measuring relationships (chapter 14) An important use of statistics is uncovering relationships between variables. This allows one to explore relationships that are hypothesized or to discover potentially new ones.
Visualizing and measuring relationships The manner in which measurement and visualization take place depends on the types of variables we are exploring –Categorical vs. Categorical –Categorical vs. Numerical –Numerical vs. Categorical –Numerical vs. Numerical ... look at this one first
Scatter plots Used to depict a numerical vs. numerical relationship –Each individual has values for 2 variables measured, x and y –Plot y vs. x for each individual
Scatter plots For experiments, there is the variable we adjust and the variable we measure. Response variable (y) –Sometimes called the dependent variable –Measures the outcome of a study Explanatory variable (x) –Sometimes called the independent variable –Explains or influences changes in the response variable Note: When we don’t set the values of either variable but just observe both, there may or may not be explanatory and response variables. It depends on how we plan to use the data.
Example A study was conducted to determine if there is a relationship between the number of powerboat registrations and the number of manatees killed in the Florida Everglades. [Data table: powerboat registrations and manatees killed]
Scatter Diagram or Scatter Plot Scatter plot of the number of manatees killed against the number of powerboat registrations
Scatter Diagram or Scatter Plot In the previous plot: –Look for the overall pattern –Look for striking deviations from the pattern –Describe the pattern by the form (are there clusters of points?), direction and strength of the relationship –Outliers (points that fall outside the overall pattern) carry important information
Direction and strength of association –Positive association or positive relationship –Negative association or negative relationship –Strength of a relationship in a scatter plot is determined by how closely the points follow a clear pattern –We need a numerical summary to quantify the strength of the relationship. The plots on the next page illustrate this point.
Direction and strength of association
Sample Correlation
Notes about Correlation –Correlation makes no distinction between explanatory and response variables. It makes no difference which variable you call x and which variable you call y in calculating the correlation. –Since r uses the standardized values of the observations, r does not change when we change the units of measurement of x, y or both. –The correlation r has no unit of measurement; it is just a number.
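These properties are easy to check numerically. The sketch below assumes the common textbook definition of the sample correlation, the sum of the products of the standardized x and y values divided by n - 1 (the formula slide itself did not survive in this text). The data are made up for illustration, and `standardize` and `correlation` are hypothetical helper names.

```python
import statistics


def standardize(values):
    """Convert observations to standard scores using the sample mean and SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (divides by n - 1)
    return [(v - mean) / sd for v in values]


def correlation(x, y):
    """Sample correlation r: sum of products of standard scores, over n - 1."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Made-up illustrative data (not from the lecture).
heights_in = [64, 66, 68, 70, 72, 74]
weights_lb = [130, 145, 150, 170, 175, 190]

r = correlation(heights_in, weights_lb)

# Swapping the roles of x and y leaves r unchanged.
r_swapped = correlation(weights_lb, heights_in)

# Changing units (inches to cm, pounds to kg) leaves r unchanged.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]
r_metric = correlation(heights_cm, weights_kg)

print(f"r = {r:.4f}, swapped = {r_swapped:.4f}, metric units = {r_metric:.4f}")
```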
Notes about Correlation –Correlation requires that both variables be quantitative, so that it makes sense to do the arithmetic indicated by the formula for r. We can’t calculate a correlation between the incomes of a group of people and what city they live in, because city is a categorical variable. –Like the mean and standard deviation, the correlation r is strongly affected by a few outlying observations.
Notes about Correlation –The correlation r is a number that lies between -1 and +1. –Values of r near 0 suggest a very weak linear relationship. –The strength of the linear relationship increases as r moves away from 0 towards either -1 or 1. –Values of r close to -1 or 1 indicate that the points in the scatter plot all lie close to a straight line. –The extreme values r = -1 and r = 1 occur only in the case of a perfect linear relationship. In this case all points fall exactly on a straight line.
Notes about Correlation –Correlation measures the strength of the linear relationship between two variables. –Correlation does not describe curved relationships between two variables, no matter how strong they are. –It is possible to have a high correlation r when a nonlinear relationship exists. The correlation r is misleading in this case, so always look at the scatter plot.
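Both warnings can be illustrated with made-up data. In the sketch below, `pearson_r` is a hypothetical helper that computes the sample correlation from sums of deviations. A perfectly curved but symmetric pattern gives r of essentially 0, while a curved pattern that only rises still gives a high r.

```python
import math


def pearson_r(x, y):
    """Sample correlation computed from sums of deviations about the means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Perfect straight line: r is 1 (up to rounding).
x_sym = [-3, -2, -1, 0, 1, 2, 3]
r_line = pearson_r(x_sym, [2 * v + 1 for v in x_sym])

# Perfect curve (a parabola), symmetric about 0: r is essentially 0
# even though y is completely determined by x.
r_parabola = pearson_r(x_sym, [v ** 2 for v in x_sym])

# The same curve on positive x only: r is high although the
# relationship is not a straight line.
x_pos = list(range(1, 11))
r_half_parabola = pearson_r(x_pos, [v ** 2 for v in x_pos])

print(f"line: {r_line:.3f}  parabola: {r_parabola:.3f}  half parabola: {r_half_parabola:.3f}")
```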
Let’s use what we have learned A geyser is a hot spring that becomes unstable and erupts hot water and steam into the air. Perhaps the most famous of these is Wyoming's Old Faithful Geyser. Visitors to Yellowstone National Park most often visit Old Faithful to see it erupt. Consequently, it is of great interest to be able to predict the waiting time until the next eruption.
Example Consider a sample of 272 interval times between eruptions. The first few lines of the available data are:
Example…Putting this all together Consider a histogram to shed some light upon the nature of the intervals between eruptions What do we observe?
Let’s use what we have learned What does the scatter plot tell us?
Example…Putting this all together Consider two histograms for waiting time between eruptions What do we observe?
Example…Putting this all together So, what can we tell people who arrive at the park about waiting for the Old Faithful geyser?
Will learn how to do regression in the next few chapters … will do an even better job at prediction