CSC323 – Week 3
Outline
• Quiz
• Associations between two variables: scatter plots, correlation coefficient
• Linear regression analysis


Association between two variables
Example: University fees for the Big Ten universities
Data were collected to study the association between the percentage of students from out of state and the tuition paid by nonresident students (in thousands of dollars). Does tuition increase with the percentage of nonresident students?
University       Tuition (1,000$) (Y)   Nonresidents (%) (X)
Northwestern
Illinois         7.6                    8
Minnesota        8.7                    23
Ohio State       9.3                    9
Penn State
Purdue
Indiana
Iowa             8.6                    31
Wisconsin        9.1                    35
Michigan
Michigan State   10.5                   9

Example: Size of diamond and price of ring
The source of the data is a full-page advertisement placed in the Straits Times newspaper issue of February 29, 1992, by a Singapore-based retailer of diamond jewelry. The variables are the size of the diamond in carats (1 carat = 0.2 gram) and the price of ladies' rings (single diamond stone) in Singapore dollars.
[Table: Carats | Singapore dollars …]
How would you describe the association between the two variables?

Association between variables
Data are collected for the two variables on each individual/unit.
Two variables are associated if changes in one variable correspond to changes in the second variable.
If there is a strong association, knowing one variable helps predict the other. Example: number of programs running and CPU usage.
If the association is weak, information about one variable is not very useful in studying the other. Example: number of users and CPU usage.

Useful terminology
The following terms are often used:
Response variable: measures the outcome of the study (dependent variable).
Explanatory variable: explains or causes changes in the response variable (independent variable).
Can you identify this distinction in the examples shown earlier?
1) Tuition = response variable; nonresidents = explanatory variable
2) Carat = explanatory variable; price = response variable

Scatter plots: displaying data about two variables
Scatter plots show the relationship between two quantitative variables. The explanatory (independent) variable appears on the x-axis (horizontal axis) and the response (dependent) variable appears on the y-axis (vertical axis). Each observation is represented by a point in the plot.
[Scatter plot: tuition (y-axis) vs. nonresident students (x-axis); labeled points: NWU, UMich]

Interpreting scatter plots
1. Look for the overall pattern and for striking deviations.
2. Define the form, direction and strength of the relationship:
   a. Form: roughly linear if the points follow a straight line, or nonlinear…
   b. Direction: positive or negative?
   c. Strength: how closely the points follow a clear form.
3. Check for the presence of outliers, individual values that fall outside the overall pattern.
4. Two variables are positively (negatively) associated if an increase in one variable corresponds to an increase (decrease) in the other variable.

2000 Presidential Elections
Did the butterfly ballots confuse voters? Did voters for Al Gore instead cast their votes for other candidates? Bush spokesman Ari Fleischer stated on Nov. 9 that "Palm Beach County is a Pat Buchanan stronghold and that's why Pat Buchanan received 3,407 votes there." What level of support does Pat Buchanan enjoy in Palm Beach County? The published election results show the association between the vote totals for Pat Buchanan and the total population for Florida counties.

Is the association positive or negative? Is the form of the relationship almost linear?

Example: House data in Albuquerque (NM) in 1993
[Scatter plot; axes: selling price (100$) and annual taxes ($)]
Interpret the graph: form, direction and strength of the relationship.

Another example: The statistics of poverty and inequality
Data from the U.N.E.S.C.O. Demographic Year Book. For 97 countries in the world, data are given for birth rates and for an index of the Gross National Product.

The previous plot shows a non-linear association! Sometimes we can make it linear by applying a transformation to the variables. Possible transformations include "ln", "exp" and "sqrt". Here we consider ln(GNP), the natural log of GNP.
[Scatter plot; axes: birth rate (1,000 pop) and log G.N.P.]
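A minimal sketch of the effect of the ln transformation, using made-up numbers rather than the U.N.E.S.C.O. data. The names `gnp`, `birth_rate` and `pearson_r` are assumptions introduced for illustration; the birth rates are constructed to lie exactly on a logarithmic curve, so the improvement is more extreme than real data would show.

```python
from math import log, sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical values, NOT the data from the slides: the birth rate is
# built to fall exactly along a logarithmic curve in GNP.
gnp = [100, 300, 1000, 3000, 10000, 30000]
birth_rate = [80 - 7 * log(g) for g in gnp]

r_raw = pearson_r(gnp, birth_rate)                    # curved relationship
r_log = pearson_r([log(g) for g in gnp], birth_rate)  # linear after ln

print(f"r with GNP:     {r_raw:.3f}")
print(f"r with ln(GNP): {r_log:.3f}")
```

On these constructed values the second correlation is much closer to –1 than the first, which is what "making the association linear" looks like numerically.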

Measure of Linear Association
If there is a strong linear association between the variables, then the cloud of points on the scatter plot will be close to a line.
[Scatter plot; axes: birth rate (1,000 pop) and log G.N.P.]

The Correlation Coefficient r
The correlation coefficient r measures the direction and the strength of the linear relationship between two variables.
• It is a value between –1 and 1.
• The closer r is to 1 or –1, the stronger the linear association.
• Positive values of r imply a positive association; negative values imply a negative association.
• Values of r close to 0 imply a weak linear association.
It is defined as
$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
where X has average $\bar{x}$ and standard deviation $s_x$, Y has average $\bar{y}$ and standard deviation $s_y$, and n is the number of observations.
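The definition can be sketched directly in code: standardize each x and y value, multiply the pairs, and divide the sum by n – 1. This assumes the sample standard deviation (divisor n – 1); the function names and the five data points are hypothetical and not taken from the slides.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of the standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
print(f"r = {r:.3f}")
```

Points that lie exactly on a rising line give r = 1, and points on a falling line give r = –1, for example `correlation(x, [-v for v in x])`.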

Examples of correlation
Birth rate (1,000 pop) vs. log G.N.P.: r = …, negative association.
Selling price (100$) vs. annual taxes ($): r = 0.65, positive association.

Diamond rings data
[Scatter plot: diamond carats vs. price in US$]
[Summary table, N = 48; columns: average, s.d., min, max; rows: X = carat, Y = price in US$]
Strong positive association, r = …

Positive Correlation
In each plot there are 100 points. The correlation coefficient measures the amount of clustering around a line. If r is close to 1, then the points lie close to a straight line!!

Negative Correlation
As x increases, y tends to decrease. If r is close to –1, then the points lie close to a straight line!!

Guess the correlation Match the diagrams with the following correlations: – 0.93 – 0.75 –

Change of scale
These are the low and high temperatures in Boulder (CO) for the month of April. The first scatter plot uses degrees Fahrenheit and the second uses degrees centigrade. Notice that °C = 5/9 × (°F – 32). Are the correlations between low and high temperatures in the two graphs different?
First plot: r = 0.74. Second plot: r = ?
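The question can be checked numerically. The sketch below uses invented temperatures, not the Boulder data, and the helper names `pearson_r` and `to_celsius` are assumptions. It converts both variables with the formula from the slide and compares the two correlations.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def to_celsius(fahrenheit):
    """C = 5/9 * (F - 32), the conversion given on the slide."""
    return 5 / 9 * (fahrenheit - 32)

# Hypothetical daily temperatures in degrees Fahrenheit (not the Boulder data).
low_f = [30, 35, 28, 40, 45, 38, 33]
high_f = [55, 60, 50, 70, 68, 65, 52]

r_f = pearson_r(low_f, high_f)
r_c = pearson_r([to_celsius(t) for t in low_f],
                [to_celsius(t) for t in high_f])

print(f"Fahrenheit: r = {r_f:.4f}")
print(f"Centigrade: r = {r_c:.4f}")
```

The two values agree up to rounding error: r is computed from standardized values, and a change of units with a positive scale factor leaves those unchanged.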

Different correlations? In which diagram below is the correlation coefficient the largest? The smallest?

Outliers and nonlinear association How are the data sets different?

Plot the data: the nature of the association between x and y is very different. The correlation coefficient can be misleading in the presence of outliers or non-linear association. Check the scatter plot of the data.
Perfect association! Why is r not equal to 1? Outliers change the value of r. What would the value of r be without the outliers?
r = 0.82
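A small sketch of how a single outlier changes r. The data are hypothetical, not the data sets on the slide: ten points lying exactly on a line, plus one added point far from it.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: ten points exactly on the line y = 2x + 1.
x_clean = list(range(1, 11))
y_clean = [2 * x + 1 for x in x_clean]

# The same points plus one outlier at (11, 0), far below the line.
x_outlier = x_clean + [11]
y_outlier = y_clean + [0]

r_clean = pearson_r(x_clean, y_clean)
r_outlier = pearson_r(x_outlier, y_outlier)

print(f"without the outlier: r = {r_clean:.3f}")
print(f"with the outlier:    r = {r_outlier:.3f}")
```

Ten of the eleven points are perfectly associated, yet the single outlier pulls r well below 1, which is why the scatter plot should be checked before trusting r.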

Which of the following diagrams should be summarized by r? (1) (2) (3)

Ecological Correlations
Ecological correlations are based on rates or averages. They can be misleading as they tend to overstate the strength of the association. The following example deals with the relationship between income and education level for individuals in 3 states (A, B, C).
This shows the averages. The correlation is almost 1!!
This shows individual data. The correlation is now moderate. Variability within each state!!!
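The same effect can be reproduced with a toy data set. All numbers below are invented (years of education, income in thousands of dollars) and are not the values behind the slide's plots; they are chosen so that the three state averages fall exactly on a line while individuals vary within each state.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical (education, income) pairs for individuals in three states.
states = {
    "A": [(10, 32), (12, 24), (14, 34)],
    "B": [(12, 42), (14, 34), (16, 44)],
    "C": [(14, 52), (16, 44), (18, 54)],
}

# Correlation computed on individuals, ignoring state membership.
people = [person for group in states.values() for person in group]
r_individuals = pearson_r([edu for edu, _ in people],
                          [inc for _, inc in people])

# Ecological correlation: computed on the three state averages only.
avg_edu = [sum(edu for edu, _ in g) / len(g) for g in states.values()]
avg_inc = [sum(inc for _, inc in g) / len(g) for g in states.values()]
r_averages = pearson_r(avg_edu, avg_inc)

print(f"individuals:    r = {r_individuals:.3f}")
print(f"state averages: r = {r_averages:.3f}")
```

Averaging removes the variability within each state, so the correlation of the averages overstates how well education predicts income for an individual.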

Summary
• The correlation coefficient r varies between –1 and 1. If r = 0, there is no linear association between X and Y. If r = 1 or –1, the points in a scatter plot lie on a straight line.
• Positive r indicates positive association between X and Y. Negative r indicates negative association between X and Y.
• Both variables X and Y must be quantitative. The correlation between X and Y is the same as the correlation between Y and X.
• r does not change if we change the units of measurement for X and Y.
• The correlation measures only the linear relationship between two variables.
• r can be strongly affected by the presence of outliers.

Correlation does not mean Causation!!

The correlation between teachers' salaries and the consumption of alcohol over a period of years turned out to be almost … Do the teachers drink? Both variables moved together because both are influenced by a third variable (a confounding variable): the long-run growth in national income and population.
A "bad example" was published in The New York Times' weekly science supplement, "Science Times", on August 22, … It stated, "The experts have also developed startling evidence of the cat's renowned ability to survive, this time in the particular setting of New York City, where cats are prone at this time of year to fall from open windows in tall buildings. Researchers call the phenomenon feline high-rise syndrome." "Even more surprising, the longer the fall, the greater the chance of survival. Only one of 22 cats that plunged from above 7 stories died, and there was only one fracture among the 13 that fell more than 9 stories."
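The teachers' salaries example can be imitated with a toy simulation. Everything below is invented for illustration: a steadily growing third variable (`years`, standing in for long-run growth) drives two otherwise unrelated series. One common way to adjust for a third variable, not covered in these slides, is the first-order partial correlation; it is used here only to show that the association vanishes once the common driver is accounted for.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical confounder: a steadily growing index over ten years.
years = list(range(1, 11))

# Two series that both grow with the confounder, plus unrelated wiggles.
wiggle_a = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]
wiggle_b = [1, 1, -1, -1, 1, 1, -1, -1, 1, 1]
salaries = [2 * z + a for z, a in zip(years, wiggle_a)]
alcohol = [3 * z + b for z, b in zip(years, wiggle_b)]

r_xy = pearson_r(salaries, alcohol)
r_xz = pearson_r(salaries, years)
r_yz = pearson_r(alcohol, years)

# First-order partial correlation of salaries and alcohol given years.
r_partial = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(f"salaries vs alcohol:           r = {r_xy:.3f}")
print(f"after adjusting for the trend: r = {r_partial:.3f}")
```

In this constructed example the raw correlation is close to 1 even though neither series influences the other, and the adjusted value is essentially zero.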

The following graph displays the number of radios in the U.K. from 1924 to 1937 and the number of mental defectives per 10,000 people for the same years. A social scientist states: "As more people gave up intellectual pursuits like reading for listening to the radio, general atrophy of the brain set in and led to increased mental disability" ?!?!?!

Data mining
Search for patterns and associations that are hidden in very large databases. For instance:
• Market basket data: purchases recorded by the cash scanners of a national retail chain
• Web log data: logs of the visits to a certain website
Exploratory data analysis techniques are used to discover information from huge datasets! Because of the very large size of the datasets, efficient algorithms are necessary to "mine" the data. Data mining is cross-disciplinary: statistical methods made efficient by computer scientists!

Correlation is often used in data mining to construct "association rules", i.e. to learn about the associations among variables. Association is often confused with causation in data mining!
A supermarket manager observes that there is a strong positive correlation between the sales of hamburgers and hotdogs, and between the sales of hotdogs and barbecue sauce. He decides to sell hotdogs at a large discount, hoping to increase profit by simultaneously raising the price of the barbecue sauce. What is the causal model (cause & effect) assumed by the manager? Will the manager make money on this sale?