Department of Mathematics

1 Department of Mathematics
Data Analysis: Linear Regression. Ernesto A. Diaz, Department of Mathematics, Redwood High School

2 Let us pause for a few moments…
What are we working on in this chapter?

3 Problem Statement If we have a scatter plot that seems “linear”, can we find an equation that generates similar data? How accurate will it be?

4 Regression One important branch of inferential statistics, called regression analysis, is used to compare quantities or variables, to discover relationships that exist between them, and to formulate those relationships in useful ways.

5 Linear Regression Once a scatter diagram has been produced, we can draw a curve that best fits the pattern exhibited by the sample points. The best-fitting curve for the sample points is called an estimated regression curve. If the points in the scatter diagram seem to lie approximately along a straight line, the relationship is assumed to be linear, and the line that best fits the data points is called the estimated linear regression.

6 Linear Regression Linear regression is the process of determining the linear relationship between two variables. If we assume that the best-fitting curve is a line, then the equation of that line will take the form y = ax + b, where a is the slope of the line and b is the y-coordinate of the y-intercept. To identify the estimated regression line, we must find the values of the “regression coefficients” a and b.

7 Regression, 1st approach

8 2nd Approach: Med-Med Line

9 How do we evaluate accuracy?
Root Mean Square Error (RMS)
Sum of Squares of Residuals (SSres)
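The two accuracy measures named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's own procedure: the data and the candidate line y = 2x + 1 are hypothetical, and the RMS error here divides by n (some texts divide by n - 2 instead).

```python
import math

def residuals(xs, ys, slope, intercept):
    """Vertical differences between each observed y and the line's prediction."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

def ss_res(xs, ys, slope, intercept):
    """Sum of squares of residuals (SSres)."""
    return sum(r * r for r in residuals(xs, ys, slope, intercept))

def rms_error(xs, ys, slope, intercept):
    """Root mean square error: square root of the mean squared residual.
    Assumption: divides by n; other conventions divide by n - 2."""
    return math.sqrt(ss_res(xs, ys, slope, intercept) / len(xs))

# Hypothetical data and candidate line, for illustration only
xs = [1, 2, 3, 4]
ys = [3, 5, 6, 9]
print(ss_res(xs, ys, 2, 1))     # only the point (3, 6) misses the line, by 1
print(rms_error(xs, ys, 2, 1))
```

A smaller SSres (or RMS error) means the candidate line sits closer to the data, which is how two candidate lines for the same scatter plot can be compared.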

10 3rd Approach: Least Squares
For each x-value in the data set, the corresponding y-value usually differs from the value it would have if the data point were exactly on the line. These differences are shown in the figure by vertical line segments. The most common procedure is to choose the line where the sum of the squares of all these differences is minimized. This is called the method of least squares, and the resulting line is called the least squares line.

11 Linear Regression Linear regression is the process of determining the linear relationship between two variables. The line of best fit (regression line or the least squares line) is the line such that the sum of the squares of the vertical distances from the line to the data points (on a scatter diagram) is a minimum.

12 Linear Regression Formulas
The least squares line (regression line) that provides the best fit to the data points (x1, y1), (x2, y2), …, (xn, yn) has the equation y = ax + b, where a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) and b = (Σy - aΣx) / n.
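The least squares coefficients are commonly computed from the sums Σx, Σy, Σxy and Σx². The sketch below assumes that standard form (slope a, intercept b, matching y = ax + b); the function name and the sample points are illustrative only.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x-values are equal; the slope is undefined")

    a = (n * sum_xy - sum_x * sum_y) / denom   # slope
    b = (sum_y - a * sum_x) / n                # y-intercept
    return a, b

# Hypothetical points lying exactly on y = 2x + 1
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Because the sample points lie exactly on a line, the function should recover that line's slope and intercept, which makes a convenient sanity check.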

13 Med-Med vs. Least Squares
The Median-Median Line is sometimes called the resistant line because it is not strongly influenced by one or two “bad” data points. The Least Squares Line uses every point in its calculation, so it is affected by outliers.
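The difference in resistance can be seen on a small example. The sketch below assumes the usual calculator-style Med-Med procedure (sort by x, split into three groups, take the median point of each group, use the outer medians for the slope and all three for the intercept); the slides do not spell out the procedure, so treat this as one common variant. The data are hypothetical: points on y = 2x with one outlier at x = 9.

```python
from statistics import median

def med_med_line(xs, ys):
    """Median-median line; returns (slope, intercept)."""
    pts = sorted(zip(xs, ys))
    n = len(pts)
    if n < 3:
        raise ValueError("need at least three points")
    base, extra = divmod(n, 3)
    # One leftover point goes to the middle group; two go to the outer groups.
    left_size = base + (1 if extra == 2 else 0)
    mid_size = base + (1 if extra == 1 else 0)
    groups = [pts[:left_size],
              pts[left_size:left_size + mid_size],
              pts[left_size + mid_size:]]
    # Median x and median y of each group (taken separately).
    meds = [(median(p[0] for p in g), median(p[1] for p in g)) for g in groups]
    (x1, y1), (x2, y2), (x3, y3) = meds
    slope = (y3 - y1) / (x3 - x1)
    intercept = (y1 + y2 + y3 - slope * (x1 + x2 + x3)) / 3
    return slope, intercept

def least_squares_line(xs, ys):
    """Least squares line; returns (slope, intercept)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    slope = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
    return slope, (sy - slope * sx) / n

# Hypothetical data: y = 2x for x = 1..9, except the last point is an outlier.
xs = list(range(1, 10))
ys = [2 * x for x in xs]
ys[-1] = 60  # "bad" data point (would be 18 on the line)

print("Med-Med:      ", med_med_line(xs, ys))
print("Least squares:", least_squares_line(xs, ys))
```

In this example the Med-Med line still follows the eight well-behaved points, while the least squares slope is pulled noticeably upward by the single outlier.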

14 Example 1: Regression Suppose that we wish to get an idea of how the number of hours preparing for a final exam relates to the score on the exam. Data is collected and shown below.
Hours: 1 2 3 4 5 6 7 8 9 10
Score: 50 62 74 70 86 78 90 96 94

15 Linear Regression The first step in analyzing these data is to graph the results as shown in the scatter diagram on the next slide.

16 Scatter Diagram

17 Linear Regression If we let x denote hours studying and y denote exam score in the data of the previous slide and assume that the best-fitting curve is a line, then the equation of that line will take the form y = mx + b, where m is the slope of the line and b is the y-coordinate of the y-intercept. To identify the estimated regression line, we must find the values of the “regression coefficients” m and b.

18 Example 1: Computing a Least Squares Line
Solution The equation is

19 Estimated Regression Line

20 Example: Med-Med vs. Best Fit
Hours: 1 2 3 4 5 6 7 8 9 10
Score: 50 62 74 70 86 78 90 96 94
Using Dobbie, find the estimated regression line using both methods.

21 Example 2: Predicting from a Regression Line
Use the result from the previous example to predict the exam score for a student who studied 6.5 hours.
I) Med-Med: Use the equation and replace x with 6.5. Based on the given data, the student should score about 82%.
II) Best Fit: Use the equation and replace x with 6.5. Based on the given data, the student should score about 81%.

22 Linear Correlation and Regression
13.8 Linear Correlation and Regression

23 Linear Correlation Linear correlation is used to determine whether there is a relationship between two quantities and, if so, how strong the relationship is. The linear correlation coefficient, r, is a unitless measure that describes the strength of the linear relationship between two variables. If the value is positive, as one variable increases, the other increases. If the value is negative, as one variable increases, the other decreases. The variable, r, will always be a value between –1 and 1 inclusive.

24 Scatter Diagrams A visual aid used with correlation is the scatter diagram, a plot of points (bivariate data). The independent variable, x, generally is a quantity that can be controlled. The dependent variable, y, is the other variable. The value of r is a measure of how far a set of points varies from a straight line. The greater the spread, the weaker the correlation and the closer the r value is to 0.

25 Correlation

26 Correlation

27 Linear Correlation Coefficient
The formula to calculate the correlation coefficient (r) is as follows: r = (nΣxy - ΣxΣy) / (√(nΣx² - (Σx)²) · √(nΣy² - (Σy)²))

28 Example: Words Per Minute versus Mistakes
There are five applicants applying for a job as a medical transcriptionist. The following shows the results of the applicants when asked to type a chart. Determine the correlation coefficient between the words per minute typed and the number of mistakes.

Applicant   Words per Minute   Mistakes
Ellen       24                 8
George      67                 11
Phillip     53                 12
Kendra      41                 10
Nancy       34                 9

29 Solution We will call the words typed per minute, x, and the mistakes, y. List the values of x and y and calculate the necessary sums.

x (WPM)    y (Mistakes)   x²            y²          xy
24         8              576           64          192
67         11             4489          121         737
53         12             2809          144         636
41         10             1681          100         410
34         9              1156          81          306
Σx = 219   Σy = 50        Σx² = 10,711  Σy² = 510   Σxy = 2,281

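30 Solution continued The n in the formula represents the number of pieces of data. Here n = 5. Substituting the sums from the previous slide: r = (5(2,281) - (219)(50)) / (√(5(10,711) - (219)²) · √(5(510) - (50)²)) = 455 / (√5594 · √50) ≈ 0.86.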

31 Solution continued Since 0.86 is fairly close to 1, there is a fairly strong positive correlation. This result implies that the more words typed per minute, the more mistakes made.
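The value of r for the typing data can be checked with a short script. This is a sketch that assumes the standard sum-based formula for the linear correlation coefficient; the function and variable names are illustrative, and the data are the five applicants from the example (Ellen, George, Phillip, Kendra, Nancy).

```python
import math

def correlation_coefficient(xs, ys):
    """Linear correlation coefficient r, computed from sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Data from the words-per-minute example
wpm = [24, 67, 53, 41, 34]
mistakes = [8, 11, 12, 10, 9]

r = correlation_coefficient(wpm, mistakes)
print(round(r, 2))  # 0.86
```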

32 Linear Regression Linear regression is the process of determining the linear relationship between two variables. The line of best fit (line of regression or the least squares line) is the line such that the sum of the squares of the vertical distances from the line to the data points is a minimum.

33 The Line of Best Fit Equation: y = mx + b, where m = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) and b = (Σy - mΣx) / n

34 Example Use the data in the previous example to find the equation of the line that relates the number of words per minute and the number of mistakes made while typing a chart. Graph the equation of the line of best fit on a scatter diagram that illustrates the set of bivariate points.

35 Solution
From the previous results, we know that m = (5(2,281) - (219)(50)) / (5(10,711) - (219)²) = 455/5594 ≈ 0.081. Now we find the y-intercept, b: b = (Σy - mΣx) / n = (50 - 0.081(219)) / 5 ≈ 6.452. Therefore the line of best fit is y = 0.081x + 6.452

36 Solution continued To graph y = 0.081x + 6.452, plot at least two points and draw the graph.

x     y
10    7.262
20    8.072
30    8.882
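The slope, intercept and plotted points for the typing example can be reproduced as follows. This sketch assumes the standard least squares formulas and, to match the slide's figures, rounds the slope to three decimals before computing the intercept; that rounding order is an inference from the numbers, not something the slides state. Carrying the unrounded slope through gives a slightly smaller intercept.

```python
# Data from the words-per-minute example
wpm = [24, 67, 53, 41, 34]
mistakes = [8, 11, 12, 10, 9]

n = len(wpm)
sum_x, sum_y = sum(wpm), sum(mistakes)
sum_xy = sum(x * y for x, y in zip(wpm, mistakes))
sum_x2 = sum(x * x for x in wpm)

# Slope from the sum-based formula
m_exact = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Assumption: round the slope first, then compute the intercept from it
m = round(m_exact, 3)
b = round((sum_y - m * sum_x) / n, 3)
print(f"y = {m}x + {b}")

# Intercept if the slope is not rounded first (differs in the second decimal)
b_exact = (sum_y - m_exact * sum_x) / n
print(round(b_exact, 3))

# Points used to draw the line
for x in (10, 20, 30):
    print(x, round(m * x + b, 3))
```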

37 Solution continued

