CSC323 – Week 3 Outline: Quiz; Associations between two variables (scatter plots, correlation coefficient); Linear regression analysis.


1 CSC323 – Week 3 Outline
- Quiz
- Associations between two variables: scatter plots, correlation coefficient
- Linear regression analysis

2 Association between two variables
Example: University fees for the Big Ten universities. Data were collected to study the association between the percentage of students from out of state and the tuition paid by nonresident students (in thousands of dollars). Does tuition increase with the percentage of nonresident students?

University        Tuition ($1,000) (Y)   Nonresidents (%) (X)
Northwestern      16.4                   72
Illinois          7.6                    8
Minnesota         8.7                    23
Ohio State        9.3                    9
Penn State        10.7                   18
Purdue            9.6                    27
Indiana           10.2                   29
Iowa              8.6                    31
Wisconsin         9.1                    35
Michigan          15.9                   30
Michigan State    10.5                   9

3 Example: Size of diamond and price of ring
The source of the data is a full-page advertisement placed in the Straits Times newspaper issue of February 29, 1992, by a Singapore-based retailer of diamond jewelry. The variables are the size of the diamond in carats (1 carat = 0.2 gram) and the price of ladies' rings (single diamond stone) in Singapore dollars.

Carats   Singapore dollars
0.17     355
0.16     328
0.17     350
0.18     325
0.25     642
…        …

How would you describe the association between the two variables?

4 Association between variables
Data are collected for the two variables on each individual/unit. Two variables are associated if changes in one variable correspond to changes in the second variable.
- If there is a strong association, knowing one variable helps predict the other. Example: number of programs running and CPU usage.
- If the association is weak, information about one variable is not very useful in studying the other. Example: number of users and CPU usage.

5 Useful terminology
The following terms are often used:
- Response variable (dependent variable): measures the outcome of the study.
- Explanatory variable (independent variable): explains or causes changes in the response variable.
Can you identify this distinction in the examples shown earlier?
1) Tuition = response variable; nonresidents = explanatory variable
2) Carat = explanatory variable; price = response variable

6 Scatter plots: displaying data about two variables
Scatter plots show the relationship between two quantitative variables. The independent variable appears on the x-axis (horizontal axis) and the dependent variable appears on the y-axis (vertical axis). Each observation is represented by a point in the plot.
[Scatter plot: tuition (y-axis) against nonresident students (x-axis); the points for NWU and UMich are labeled.]

7 Interpreting scatter plots
1. Look for the overall pattern and for striking deviations.
2. Define form, direction and strength of the relationship:
   a. Form: roughly linear if the points follow a straight line, or nonlinear…
   b. Direction: positive or negative?
   c. Strength: how closely the points follow a clear form.
3. Check for the presence of outliers, individual values that fall outside the overall pattern.
4. Two variables are positively (negatively) associated if an increase in one variable corresponds to an increase (decrease) in the other variable.

8 2000 Presidential Elections
Did the butterfly ballots confuse voters? Did voters for Al Gore instead cast their votes for other candidates? Bush spokesman Ari Fleischer stated on Nov. 9 that "Palm Beach County is a Pat Buchanan stronghold and that's why Pat Buchanan received 3,407 votes there." What level of support does Pat Buchanan enjoy in Palm Beach County? The published election results show the association between the vote totals for Pat Buchanan and the total population for Florida counties.

9 Is the association positive or negative? Is the form of the relationship almost linear?

10 Example: House data in Albuquerque (NM) in 1993
[Scatter plot: selling price ($100) and annual taxes ($).]
Interpret the graph: form, direction and strength of the relationship.

11 Another example: The statistics of poverty and inequality Data from U.N.E.S.C.O. 1990 Demographic Year Book. For 97 countries in the world, data are given for birth rates and for an index of the Gross National Product.

12 The previous plot shows a nonlinear association! Sometimes we can make it linear by applying a transformation to the variables. Possible transformations are, for example, "ln", "exp", "sqrt". Here we consider ln(GNP), the natural log of GNP.
[Scatter plot: birth rate (per 1,000 pop.) against log G.N.P.]
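The effect of a log transformation can be sketched with a few lines of Python. The numbers below are hypothetical (they are not the U.N.E.S.C.O. data), and the helper `correlation` is an illustrative function, not part of the course material. The response is constructed to be exactly linear in ln(GNP), so the correlation with the raw values is weaker than the correlation with the logged values.

```python
from math import log, sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical GNP index values that double at each step: 100, 200, ..., 12800.
gnp = [100 * 2 ** k for k in range(8)]
# Hypothetical birth rates built to be exactly linear in ln(GNP).
birth_rate = [70 - 6 * log(g) for g in gnp]

r_raw = correlation(gnp, birth_rate)                    # curved relationship
r_log = correlation([log(g) for g in gnp], birth_rate)  # straight line

print("r with raw GNP:", round(r_raw, 3))
print("r with ln(GNP):", round(r_log, 3))
```

On real data the transformed relationship would only be approximately linear, so the scatter plot should still be checked after transforming.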

13 Measure of Linear Association
If there is a strong linear association between the variables, then the cloud of points on the scatter plot will be close to a line.
[Scatter plot: birth rate (per 1,000 pop.) against log G.N.P.]

14 The Correlation Coefficient r
The correlation coefficient r measures the direction and the strength of the linear relationship between two variables.
- It is a value between –1 and 1.
- The closer r is to 1 or –1, the stronger the linear association is.
- Positive values of r imply a positive association; negative values imply a negative association.
- Values of r close to 0 imply weak linear association.
It is defined as

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

where X has average $\bar{x}$ and standard deviation $s_x$, Y has average $\bar{y}$ and standard deviation $s_y$ (sample standard deviations, computed with divisor n – 1), and n is the number of observations.
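As a minimal sketch, r can be computed by standardizing each variable and averaging the products. This assumes the sample standard deviation (divisor n – 1); the function names are illustrative. The data are the five diamond observations listed on the earlier slide, not the full data set, so the result will differ from the value reported for all the rings.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """r = sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# The five (carat, price) pairs shown in the diamond example.
carats = [0.17, 0.16, 0.17, 0.18, 0.25]
prices = [355, 328, 350, 325, 642]

r = correlation(carats, prices)
print("r for the five listed rings:", round(r, 3))
```

Swapping the arguments gives the same value, since the formula treats X and Y symmetrically.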

15 Examples of correlation
- Birth rate (per 1,000 pop.) vs. log G.N.P.: r = -0.74 (negative association)
- Selling price ($100) vs. annual taxes ($): r = 0.65 (positive association)

16 Diamond rings data
[Scatter plot: diamond carats vs. price in US$]

N = 48               Average    s.d.     Min    Max
X  Carat             0.20       0.056    0.12   0.35
Y  Price in US $     865.144    213.64   385    1879

Strong positive association: r = 0.989

17 Positive Correlation
In each plot there are 100 points. The correlation coefficient measures the amount of clustering around a line. If r is close to 1, then the points lie close to a straight line!

18 Negative Correlation
Negative correlation: as x increases, y tends to decrease. If r is close to –1, then the points lie close to a straight line!

19 Guess the correlation
Match the diagrams with the following correlations: –0.93, –0.75, –0.20, 0.27, 0.63, 1.0

20 Change of scale
These are the low and high temperatures in Boulder (CO) for the month of April 1996. The first scatter plot uses degrees Fahrenheit and the second plot uses degrees centigrade. Notice that °C = 5/9 × (°F – 32). Are the correlations between low and high temperatures in the two graphs different?
[First plot, Fahrenheit: r = 0.74. Second plot, centigrade: r = ?]
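The question can be explored with a short sketch. The temperatures below are hypothetical (they are not the Boulder readings) and `correlation` is an illustrative helper. Because converting Fahrenheit to centigrade is a linear change of units with a positive slope, the two correlations should agree up to rounding error.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def to_centigrade(f):
    """Convert degrees Fahrenheit to degrees centigrade."""
    return 5 / 9 * (f - 32)

# Hypothetical daily low and high temperatures in degrees Fahrenheit.
low_f = [28, 31, 35, 30, 40, 45, 38, 42]
high_f = [50, 58, 61, 49, 70, 72, 60, 75]

low_c = [to_centigrade(f) for f in low_f]
high_c = [to_centigrade(f) for f in high_f]

r_fahrenheit = correlation(low_f, high_f)
r_centigrade = correlation(low_c, high_c)

print("r in Fahrenheit:", round(r_fahrenheit, 6))
print("r in centigrade:", round(r_centigrade, 6))
```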

21 Different correlations? In which diagram below is the correlation coefficient the largest? The smallest?

22 Outliers and nonlinear association How are the data sets different?

23 Plot the data: the nature of the association between x and y is very different. The correlation coefficient can be misleading in the presence of outliers or nonlinear association. Check the scatter plot of the data.
[Scatter plots, annotated r = 0.82:]
- Perfect association! Why is r not equal to 1?
- Outliers change the value of r. What would the value of r be without the outliers?
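Both effects can be reproduced with small made-up data sets (these are not the data behind the slide's plots, and `correlation` is an illustrative helper). The first case adds a single outlier to points that lie exactly on a line; the second is a perfect but nonlinear association, a parabola, for which r measures no linear trend at all.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Ten points exactly on the line y = 2x + 1.
xs = list(range(1, 11))
ys = [2 * x + 1 for x in xs]
r_clean = correlation(xs, ys)

# The same points plus one hypothetical outlier far below the line.
r_with_outlier = correlation(xs + [11], ys + [2])

# A perfect nonlinear association: y = x^2 on x values symmetric about 0.
px = list(range(-3, 4))
py = [x ** 2 for x in px]
r_parabola = correlation(px, py)

print("line only:        ", round(r_clean, 3))
print("line plus outlier:", round(r_with_outlier, 3))
print("parabola:         ", round(r_parabola, 3))
```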

24 Which of the following diagrams should be summarized by r? (1) (2) (3)

25 Ecological Correlations
Ecological correlations are based on rates or averages. They can be misleading, as they tend to overstate the strength of the association. The following example deals with the relationship between income and education level for individuals in 3 states (A, B, C).
- [Plot of state averages] This shows the averages. The correlation is almost 1!
- [Plot of individual data] This shows individual data. The correlation is now moderate: there is variability within each state!
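A small sketch with made-up numbers shows how averaging hides within-group variability. The three "states" and their (education, income) pairs are hypothetical, chosen so that the state averages fall exactly on a line while individuals scatter widely around it; `correlation` is an illustrative helper.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical individuals: (years of education, income in $1,000).
states = {
    "A": [(10, 20), (12, 12), (14, 28), (11, 30), (13, 10)],
    "B": [(12, 30), (14, 22), (16, 38), (13, 40), (15, 20)],
    "C": [(14, 40), (16, 32), (18, 48), (15, 50), (17, 30)],
}

# Correlation computed on every individual.
education = [e for people in states.values() for e, _ in people]
income = [i for people in states.values() for _, i in people]
r_individuals = correlation(education, income)

# Ecological correlation: computed on the three state averages only.
avg_education = [sum(e for e, _ in p) / len(p) for p in states.values()]
avg_income = [sum(i for _, i in p) / len(p) for p in states.values()]
r_averages = correlation(avg_education, avg_income)

print("r for individuals:   ", round(r_individuals, 3))
print("r for state averages:", round(r_averages, 3))
```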

26 Summary
- The correlation coefficient r varies between –1 and 1. r = 0 means there is no linear association between X and Y. If r = 1 or –1, then the points in a scatter plot lie on a straight line.
- Positive r indicates positive association between X and Y. Negative r indicates negative association between X and Y.
- Both variables X and Y must be quantitative. The correlation coefficient between X and Y is the same as the correlation between Y and X.
- r does not change if we change the units of measurement for X and Y.
- The correlation measures only the linear relationship between two variables.
- r can be strongly affected by the presence of outliers.

27 Correlation does not mean Causation!!

28 The correlation between teachers' salaries and the consumption of alcohol over a period of years turned out to be almost 0.90. Do the teachers drink? Both variables moved together because both are influenced by a third variable (a confounding variable): the long-run growth in national income and population.
A "bad example" was published in The New York Times' weekly science supplement, "Science Times", on August 22, 1989. It stated, "The experts have also developed startling evidence of the cat's renowned ability to survive, this time in the particular setting of New York City, where cats are prone at this time of year to fall from open windows in tall buildings. Researchers call the phenomenon feline high-rise syndrome." "Even more surprising, the longer the fall, the greater the chance of survival. Only one of 22 cats that plunged from above 7 stories died, and there was only one fracture among the 13 that fell more than 9 stories."
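The confounding mechanism in the salaries example can be sketched with invented numbers. Here two series are each built from a common growing quantity (standing in for national income) plus their own unrelated fluctuations; neither series influences the other, yet their correlation is high. All values, names and coefficients are hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical confounder: an index of national income growing over 8 years.
national_income = [100 + 10 * year for year in range(8)]

# Independent fluctuations for each series (uncorrelated with each other).
salary_noise = [1, -1, 1, -1, 1, -1, 1, -1]
alcohol_noise = [1, 1, -1, -1, 1, 1, -1, -1]

# Each series depends on national income, not on the other series.
salaries = [0.3 * g + e for g, e in zip(national_income, salary_noise)]
alcohol = [0.05 * g + 0.2 * e for g, e in zip(national_income, alcohol_noise)]

r_series = correlation(salaries, alcohol)
r_noise = correlation(salary_noise, alcohol_noise)

print("r between salaries and alcohol:", round(r_series, 3))
print("r between their fluctuations:  ", round(r_noise, 3))
```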

29 The following graph displays the number of radios in the U.K. from 1924 to 1937 and the number of mental defectives per 10,000 people for the same years. A social scientist states: "as more people gave up intellectual pursuits like reading for listening to the radio, general atrophy of the brain set in and led to increased mental disability" ?!?!?!

30 Data mining
Search for patterns and associations hidden in very large databases. For instance:
- Market basket data: purchases recorded by the cash scanners of a national retail chain
- Web log data: logs of the visits to a certain website
Exploratory data analysis techniques are used to discover information in huge datasets! Because the datasets are very large, efficient algorithms are necessary to "mine" the data. Data mining is cross-disciplinary: statistical methods made efficient by computer scientists!

31 Correlation is often used in data mining to construct "association rules", i.e. to learn about the associations among variables. Association is often confused with causation in data mining! A supermarket manager observes that there is a strong positive correlation between the sales of hamburgers and hotdogs, and between the sales of hotdogs and barbecue sauce. He decides to sell hotdogs at a large discount, hoping to increase profit by simultaneously raising the price of the barbecue sauce. What is the causal model (cause and effect) that is assumed by the manager? Will the manager make money on this sale?

