IT523-01N: DATA WAREHOUSING AND DATA MINING FINAL PROJECT INSTRUCTOR: DR. SHEILA FOURNIER- BONILLA ELEISHA BARNETT How Mpgs are Affected in Vehicles: A.

Presentation transcript:

IT523-01N: DATA WAREHOUSING AND DATA MINING FINAL PROJECT INSTRUCTOR: DR. SHEILA FOURNIER- BONILLA ELEISHA BARNETT How Mpgs are Affected in Vehicles: A Model Using WEKA Supervised and Unsupervised Analysis Tools

How Mpgs are Affected in Vehicles
THE MODEL: A DATASET OF 398 AUTOMOBILES WITH 8 ATTRIBUTES THAT COULD POSSIBLY AFFECT A VEHICLE’S GAS CONSUMPTION (MILES PER GALLON) PERFORMANCE

1908 Model T Ford? 1961 Chevrolet Corvette? Which Gets Better Gas Mileage?

The Attributes
Number of Cylinders
Engine Displacement
Horsepower
Weight
Acceleration
Model (model year)
Origin (where the car was made)
Class (luxury, sports, sedan, coupe, etc.)

PART Analysis I first used the WEKA Data Analyzer to run a PART rule classification of all 398 instances with cylinders as the output attribute. Many car manufacturers use the number of cylinders as an indicator of power and gas mileage: generally, the fewer the cylinders, the better the gas mileage but the lower the power, especially in terms of horsepower. Horsepower is a term whose original meaning is somewhat archaic: it indicated the number of horses it would take to put out the same amount of power as an engine.

PART Analysis The PART rule generator used engine displacement to generate the rules for cylinders. This is important because engine displacement plays a part in determining gas mileage. To explain this further, engine displacement is the volume swept by all the pistons inside the cylinders of an internal combustion engine in a single movement from top dead center to bottom dead center. It is commonly specified in cubic centimeters (cc), liters (l), or (mainly in North America) cubic inches (CID). Engine displacement does not include the total volume of the combustion chamber (Wikipedia, 2011).

PART Analysis As you can see, 6 rules were generated based on the given attributes and output. In general, the greater the displacement, the more cylinders a vehicle has and the higher its gas consumption. For example, the rule displacement > 70:4 (191.0/3.0) indicates a smaller engine, and therefore less horsepower and a higher miles per gallon (mpg) rating. Conversely, the rule displacement > 258:8 (104.0/1.0) indicates a larger engine, more horsepower, and a lower mpg.
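
To make the rule notation concrete, here is a minimal Python sketch that reads the trailing pair of numbers on a rule. It assumes the usual WEKA convention that the first number is the count of instances the rule covers and the second is how many of those it misclassifies. The function names `parse_rule_stats` and `rule_accuracy` are hypothetical helpers, not part of WEKA.

```python
import re

def parse_rule_stats(rule: str):
    """Return (covered, misclassified) from a rule ending in e.g. '(191.0/3.0)'.

    Assumption: the suffix follows the WEKA convention (covered/misclassified);
    a suffix with a single number is treated as zero misclassifications.
    """
    match = re.search(r"\(([\d.]+)(?:/([\d.]+))?\)\s*$", rule)
    if match is None:
        raise ValueError(f"no coverage suffix found in {rule!r}")
    covered = float(match.group(1))
    errors = float(match.group(2)) if match.group(2) else 0.0
    return covered, errors

def rule_accuracy(rule: str) -> float:
    """Share of the covered instances that the rule classifies correctly."""
    covered, errors = parse_rule_stats(rule)
    return (covered - errors) / covered

# The two rules quoted on the slide.
rules = [
    "displacement > 70:4 (191.0/3.0)",
    "displacement > 258:8 (104.0/1.0)",
]
for rule in rules:
    covered, errors = parse_rule_stats(rule)
    print(f"{rule}: {covered:.0f} covered, {errors:.0f} wrong, "
          f"accuracy {rule_accuracy(rule):.4f}")
```

Read this way, both quoted rules misclassify only a handful of the instances they cover.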

PART Analysis The number of correctly classified instances is 384 of 398, an accuracy rate of 96.4824%, with 14 instances incorrectly classified, an error rate of 3.5176%. It’s possible that the inaccuracies came from the odd European cars that have 3 and 5 cylinders and thus do not fit the usual profiles. In practice this applies only to the 3-cylinder cars, as 5-cylinder cars were not represented. The 3-cylinder cars were represented in 2 rules, origin = 1:4 (15.0/1.0) and displacement > 107:3 (4.0/1.0). The interesting item to note is that these 3-cylinder engines fall under the same displacement threshold as a smaller 6-cylinder engine (displacement > 107:6 (6.0/1.0)) and presumably have the same mpg rating.

J48 Decision Tree Analysis As this J48 decision tree shows, the analysis breaks the dataset down further to show how the origin of a vehicle might influence mpg. The data indicates that there is little merit to this, but we will examine it further in the cluster analysis. In the meantime, J48 bears out the same analysis as PART in finer detail and presents a slightly more accurate picture.

J48 Analysis In this case, 386 (96.9849%) instances are correctly classified and only 12 (3.0151%) instances are incorrectly classified. This sets our true positive (TP) rate at 1 versus a false positive (FP) rate of 0.003, which means that every instance of that class was correctly identified, supporting the rule IF displacement 156 AND cylinder <= 6 THEN low mpg. The TP and FP rates are calculated from the confusion matrix. For a given class, we add the correctly classified instances of that class to the instances of that class that were missed, which gives the total number of instances that actually belong to the class, and then divide the true positive count by that total. The FP rate is the number of instances wrongly assigned to the class divided by the number of instances that belong to other classes.
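
The calculation can be sketched in Python as follows. The confusion matrix below is hypothetical: the slide's actual matrix is not reproduced in this transcript, so the counts were chosen only so that they total 398 instances with 386 on the diagonal. Rows are assumed to be actual classes and columns predicted classes.

```python
def accuracy(matrix):
    """Fraction of all instances that lie on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_class_rates(matrix, class_index):
    """Return (TP rate, FP rate) for one class.

    Rows are actual classes and columns are predicted classes (an assumption
    about the layout; WEKA prints its matrix this way).
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[class_index][class_index]
    fn = sum(matrix[class_index]) - tp                  # actual class, predicted other
    fp = sum(row[class_index] for row in matrix) - tp   # other class, predicted this
    tn = total - tp - fn - fp
    return tp / (tp + fn), fp / (fp + tn)

# HYPOTHETICAL confusion matrix for three cylinder classes (4, 6, 8).
hypothetical_matrix = [
    [198, 6, 0],   # actual 4 cylinders
    [5, 80, 1],    # actual 6 cylinders
    [0, 0, 108],   # actual 8 cylinders
]

print(f"accuracy = {accuracy(hypothetical_matrix):.6f}")
tp_rate, fp_rate = per_class_rates(hypothetical_matrix, 2)
print(f"8-cylinder class: TP rate = {tp_rate:.3f}, FP rate = {fp_rate:.3f}")
```

A TP rate of 1 only says that no instance of that class was missed; the FP rate shows how often other classes were mistaken for it.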

Cluster Analysis In cluster analysis, we must decide if there are associations and if they are worth further study. In this case, we use a rough measure of attribute significance to accomplish this. Specifically, for each attribute, subtract the attribute means for the two clusters and divide the absolute value of this result by the domain standard deviation for the attribute. Computations near or greater than one indicate attributes that have been clearly differentiated by the clustering. If there are no such attributes, the clustering is of little interest.
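
As a rough illustration of this measure, the sketch below computes the score for one attribute from two clusters. The values are made up, not taken from the auto dataset, and the "domain standard deviation" is assumed here to be the population standard deviation over all instances in both clusters; a sample standard deviation would give a slightly different score.

```python
from statistics import mean, pstdev

def attribute_significance(cluster_a, cluster_b):
    """|mean(A) - mean(B)| divided by the standard deviation of the whole domain.

    Assumption: the domain is the union of both clusters and the population
    standard deviation is used.
    """
    domain = list(cluster_a) + list(cluster_b)
    spread = pstdev(domain)
    if spread == 0:
        return 0.0  # every value is identical, so nothing separates the clusters
    return abs(mean(cluster_a) - mean(cluster_b)) / spread

# HYPOTHETICAL attribute values for two clusters.
well_separated = attribute_significance([1, 2, 3], [7, 8, 9])
overlapping = attribute_significance([1, 5, 9], [2, 5, 8])

print(f"well separated: {well_separated:.2f}")
print(f"overlapping:    {overlapping:.2f}")
```

Scores near or above 1 mark attributes on which the two clusters clearly differ; scores near 0 mark attributes the clustering did not separate.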

Cluster Analysis As the next slide shows, none of the attribute scores came out at or near 1, so we must conclude that this cluster analysis is not worth exploring. However, as we will see in the final analysis, this may be a faulty line of reasoning.

Cluster Analysis
CYLINDER AS THE OUTPUT ATTRIBUTE
DISPLACEMENT = / = 0.46
HORSEPOWER = / = 0.36
WEIGHT = / = 0.44
ACCELERATION = / = 0.19
CLASS = /7.816 =

Linear Regression Analysis In our final analysis, we will look at linear regression. The purpose of regression analysis is to find the equation of the line that passes through the cluster of points with the least deviation of the points from the line. The deviation of a point from the line is called "error." Once I have this regression equation, I can use it to predict class. Simple linear regression is essentially the same as a bivariate correlation between the independent and dependent variables (Princeton, 2011).

Linear Regression Analysis I can use linear regression to predict values of one variable given values of another. If I plot the values on a graph with cylinders on the x axis and displacement on the y axis, for example, the result is a cluster of points that slopes upward, showing a linear relationship between cylinders and displacement.
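
A least-squares line of this kind can be sketched in a few lines of Python. The six (cylinders, displacement) pairs below are invented for illustration and are not rows from the dataset; `fit_line` is a hypothetical helper that applies the standard closed-form formulas for slope and intercept.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# HYPOTHETICAL data: cylinders on the x axis, displacement on the y axis.
cylinders = [4, 4, 6, 6, 8, 8]
displacement = [100, 120, 200, 230, 320, 350]

slope, intercept = fit_line(cylinders, displacement)
print(f"displacement = {intercept:.2f} + {slope:.2f} * cylinders")

# Use the fitted line to predict displacement for a 6-cylinder engine.
predicted = intercept + slope * 6
print(f"predicted displacement for 6 cylinders: {predicted:.1f}")
```

A positive slope corresponds to the upward-sloping cluster of points described on the slide.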

Linear Regression Analysis However, some very interesting results presented themselves here. While the cylinder/displacement relationship held true, following the slope upward, the plots indicate that there are other factors in determining mpg. The clusters grow stronger through horsepower, weight, and acceleration, weaken in model year and origin, and become strong again in class.

Linear Regression Analysis Due to incompatibility issues between the WEKA autompg.arff file and Excel, I was unable to copy and paste the data into Excel and run a LINEST analysis, which is why I ran the WEKA visualization instead. However, I was able to snip and paste the data into this presentation to show the instances and attributes used.
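
One possible workaround for the Excel incompatibility, offered only as a sketch, is to convert the ARFF text to CSV, which Excel opens directly. The function `arff_to_csv` below is a hypothetical helper using only the Python standard library. It assumes a simple dense ARFF file: attribute names without spaces, and no quoted values containing commas. The sample text is illustrative and is not the actual autompg.arff file.

```python
import csv
import io

def arff_to_csv(arff_text: str) -> str:
    """Convert simple dense ARFF text to CSV text with a header row.

    Assumptions: one @attribute per line, names contain no spaces, and data
    values contain no quoted commas. Sparse ARFF is not handled.
    """
    names = []
    rows = []
    in_data = False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])       # second token is the name
        elif lower.startswith("@data"):
            in_data = True
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return out.getvalue()

# ILLUSTRATIVE sample, not the real file.
sample = """% comment line
@relation autoMpg
@attribute cylinders {8,6,4}
@attribute displacement numeric
@attribute class numeric
@data
8,307,18
4,97,27
"""

csv_text = arff_to_csv(sample)
print(csv_text)
```

Saving the returned text to a .csv file would let Excel load the columns, after which LINEST could be run on them.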

Conclusion
WHAT CAN WE CONCLUDE FROM THESE ANALYSES?
ENGINE SIZE DOES PLAY A ROLE IN GASOLINE CONSUMPTION.
HOWEVER, OTHER ATTRIBUTES NEED TO BE CONSIDERED IN DETERMINING GAS MILEAGE OR MPG. THESE ATTRIBUTES INCLUDE WEIGHT, ACCELERATION, HORSEPOWER, AND CLASS OF VEHICLE.
IT IS PRUDENT TO USE MORE THAN ONE ANALYSIS TOOL.
WHILE NEITHER THE MODEL T NOR THE CORVETTE SHOWN IN SLIDE 3 WAS PART OF THE DATASET, THE MODEL T WINS AT 25 MPG VERSUS THE CORVETTE AT 8 MPG.

Conclusion
THE FORD MODEL T USED A 177 CUBIC INCH (2.9 L) INLINE 4-CYLINDER ENGINE. IT WAS PRIMARILY A GASOLINE ENGINE, BUT IT HAD MULTIFUEL ABILITY AND COULD ALSO BURN KEROSENE OR ETHANOL. IT PRODUCED 20 HP FOR A TOP SPEED OF 45 MPH.
THE CHEVROLET CORVETTE USED A 327 CU IN (5.36 L) V8 (8-CYLINDER) ENGINE AND WAS STRICTLY A GAS ENGINE. IT PRODUCED 340 HP FOR A TOP SPEED OF 130 MPH.

References
accessed 29 May 11
Roiger, R. J., & Geatz, M. W. (2003). Data Mining: A Tutorial-Based Primer. Boston, MA: Addison Wesley.
Marakas, G. M. (2003). Modern data warehousing, mining, and visualization: core concepts. Upper Saddle River, NJ: Prentice Hall.
The University of Waikato (WEKA), accessed 27 May 11
Barnett, Eleisha (2011). Photos courtesy of Eleisha Barnett.
accessed 30 May
accessed 30 May 11