Pittsburgh Data Jam 2016 Bringing Big Data Education and Awareness to Pittsburgh High School Students February 26, 2016.

Presentation transcript:

Pittsburgh Data Jam 2016 Bringing Big Data Education and Awareness to Pittsburgh High School Students February 26, 2016

Introductions  Saman Haqqi - President - Pittsburgh Dataworks   Brian Macdonald – Data Scientist – Oracle Corporation   Pitt Science Outreach  Margaret  Laura  Jenny  Jackie  Kyle  Chris

Mentors  Each team will be assigned a mentor  You can ask questions via email at any time  Copy everyone on your team  Copy your teacher  Pitt Science Outreach students  Send to all  Have a regularly scheduled call with your mentor  Don’t wait until right before presentations.

Data Analysis Workshop Today’s Goals  Identifying relevant variables  Depicting them graphically  Doing the analysis  Drawing conclusions  Making recommendations

What technology will you use?  Lots of tools are available  Keep it simple at the beginning  Use Excel  Tableau is also available  Many Others  R, SAS, Cognos, Oracle Business Intelligence, Google Apps, MATLAB, Python, Spotfire, QlikView

Data Analysis Process  A standard repeatable process to guide data analysis.  Used formally and informally  If you do analysis, you will do these steps.  Used for Big Data or not so Big Data  Becomes second nature as you do more analysis.  Is not about using a cool data analysis tool  Although they are extremely helpful.

The Data Analysis Process  Define your Problem  Identify Data  Plan your Analysis  Explore Data  Prepare Data  Model Data  Tell A Story  Make Recommendations  Determine What’s Next Today’s Focus In practice it looks like this

Basic Steps for Analysis  Data Exploration  Data Preparation  Build Models

Data Exploration Exploratory Data Analysis (EDA)  Goal is to get an understanding of what data you have  What are your variables  Basic Statistics  Graph Data  Look for missing values  Look for outliers  Will this data help you answer your question?

Basic Statistics  Goal is to get a basic understanding of your data  Mean (Average): Sum of values / Count of values  Median: Midpoint of the sorted values  Maximum, Minimum (Range)  Standard Deviation (σ) & Variance (σ^2): How spread out the values are compared to the mean  Quartiles: Nice buckets of the spread of the data
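The statistics on this slide can also be computed outside Excel. The following is a minimal sketch using Python's standard `statistics` module; the sample values are made up for illustration, and the population forms of variance and standard deviation are an assumption (the slide does not say whether the sample or population form is intended).

```python
import statistics

# Hypothetical sample: daily high temperatures in ºF (made-up values).
values = [61, 64, 66, 70, 72, 75, 79, 83]

mean = sum(values) / len(values)           # sum of values / count of values
median = statistics.median(values)         # midpoint of the sorted values
minimum, maximum = min(values), max(values)
value_range = maximum - minimum            # distance between the extremes
variance = statistics.pvariance(values)    # population variance (σ^2)
std_dev = statistics.pstdev(values)        # population standard deviation (σ)
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points

print("mean:", mean, "median:", median)
print("min:", minimum, "max:", maximum, "range:", value_range)
print("variance:", variance, "std dev:", std_dev)
print("quartiles:", q1, q2, q3)
```

The second quartile is the median, so comparing `q2` with `median` is a quick sanity check on the output.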

Demo - Statistics in Excel

Graphing Data  Helps visualize patterns in the data  Especially with large data sets.  gnip/locals/#12/ /  Spot exceptions  Use the best graph for the data types  Help tell your story

Demo - Graphing in Excel

Missing Values  Can have a large impact on basic statistics  Count the number of missing values for every variable (column)  Important to understand why data is missing  Data entry  Wasn’t collected  Isn’t relevant  Should you use the variable?  Should you fill in missing values?  Use mean, median, max, min, 0.  You need to determine the best method
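As a rough illustration of counting and filling missing values, here is a small Python sketch. The table, the column names, and the helper functions `count_missing` and `fill_missing` are all hypothetical; filling with the median is just one of the options listed on the slide, not a recommendation.

```python
import statistics

# Hypothetical table: each row is a record; None marks a missing value.
rows = [
    {"age": 16, "score": 88},
    {"age": None, "score": 92},
    {"age": 17, "score": None},
    {"age": 15, "score": 75},
    {"age": None, "score": 81},
]

def count_missing(rows):
    """Count missing (None) values for every column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

def fill_missing(rows, column, method=statistics.median):
    """Return new rows with missing values in `column` replaced by a
    summary (median by default) of the values that are present."""
    known = [row[column] for row in rows if row[column] is not None]
    fill_value = method(known)
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]

missing_counts = count_missing(rows)
filled = fill_missing(rows, "age")
print(missing_counts)
print([row["age"] for row in filled])
```

Passing `method=statistics.mean`, `max`, or `min` switches the fill strategy, which makes it easy to compare how each choice shifts the basic statistics.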

Outliers  Outliers are values at the extremes  Much larger or smaller than most of your data  May have many causes  Data Entry Error  Instrument Malfunction  Real Exceptional data  Is 140º F an Outlier?  Some are easy to spot within a single variable  Some are only found with multiple variables

Outliers  Need to decide how to treat Outliers  Is the variable ok to use? Do you question the validity of the data?  Remove them from your data set?  Keep them as is?  Change the value (i.e. make it less extreme)?  Infer the real meaning: a -90º F temperature in Miami is likely 90º F  Make sure you understand the implications  Document your decision making
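One common way to flag outliers within a single variable is the 1.5 × IQR rule. The slides do not name a specific method, so treat the rule, the helper `iqr_outliers`, and the temperature readings below as illustrative assumptions.

```python
import statistics

# Hypothetical temperature readings in ºF; 140 and -90 look suspicious.
temps = [88, 91, 90, 87, 140, 92, 89, -90, 90, 93]

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1                      # interquartile range
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

flagged = iqr_outliers(temps)
print(flagged)
```

Flagging a value is only the first step. Whether to remove it, keep it, or correct it (for example, reading -90 as 90) is still a judgment call that should be documented.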

Demo – Missing Values & Outlier Detection in Excel

One Last Thought on Exploring Data You must be observant  Count the number of F’s in the following sentence.  You will have 15 seconds FINISHED FILES ARE THE RESULT OF YEARS OF SCIENTIFIC STUDY COMBINED WITH THE EXPERIENCE OF YEARS.

Leave your assumptions at the door! FINISHED FILES ARE THE RESULT OF YEARS OF SCIENTIFIC STUDY COMBINED WITH THE EXPERIENCE OF YEARS.
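A quick way to check your count is to let a program do it. This short Python sketch counts the letter F in the sentence; readers often skip the F in each "OF", which is why a manual count tends to come up short.

```python
# The sentence from the slide, with the line-break hyphens joined.
sentence = (
    "FINISHED FILES ARE THE RESULT OF YEARS OF SCIENTIFIC STUDY "
    "COMBINED WITH THE EXPERIENCE OF YEARS."
)

f_count = sentence.count("F")              # every F, including those in "OF"
of_count = sentence.split().count("OF")    # how many times "OF" appears

print(f_count, of_count)
```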

Exploration Exercise  Using Excel  Sort  Filter  Summarize  Create Crosstabs  Charting
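The same exploration steps (sort, filter, summarize, crosstab) can be sketched in plain Python. The survey data and field names below are made up for illustration; in the workshop these steps are done in Excel.

```python
from collections import Counter

# Hypothetical survey rows (made-up data): how students travel to school.
students = [
    {"name": "Ana", "grade": 9, "travel": "bus", "minutes": 25},
    {"name": "Ben", "grade": 9, "travel": "walk", "minutes": 10},
    {"name": "Cy", "grade": 10, "travel": "bus", "minutes": 30},
    {"name": "Dee", "grade": 10, "travel": "car", "minutes": 15},
    {"name": "Eli", "grade": 9, "travel": "bus", "minutes": 20},
]

# Sort: longest commute first.
by_minutes = sorted(students, key=lambda s: s["minutes"], reverse=True)

# Filter: keep only the bus riders.
bus_riders = [s for s in students if s["travel"] == "bus"]

# Summarize: average commute time across all students.
average_minutes = sum(s["minutes"] for s in students) / len(students)

# Crosstab: number of students for each (grade, travel) combination.
crosstab = Counter((s["grade"], s["travel"]) for s in students)

print(by_minutes[0]["name"], len(bus_riders), average_minutes)
print(dict(crosstab))
```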

Basic Steps for Analysis  Data Exploration  Data Preparation  Build Models

Data Preparation  This step will fix any issues you found during data exploration  Fix missing values  Remove bad data  Create new variables  Add/Subtract/Multiply/Divide multiple variables  Ratios  Binning  Other functions like Square Root or Exponents  Anything else you feel is appropriate  Have fun and experiment. You cannot hurt the data.
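Creating new variables might look like the following Python sketch, which derives a ratio, a bin, and a square-root transform. The home records, the bin cut-offs, and the helper `size_bin` are hypothetical choices made only for this example.

```python
import math

# Hypothetical home sales (made-up values).
homes = [
    {"sale_price": 150000, "living_area": 1200},
    {"sale_price": 240000, "living_area": 2000},
    {"sale_price": 410000, "living_area": 3100},
]

def size_bin(area):
    """Binning: turn a numeric area into a category (cut-offs are assumed)."""
    if area < 1500:
        return "small"
    if area < 2500:
        return "medium"
    return "large"

# Build new rows so the original data is left untouched.
prepared = [
    {
        **home,
        "price_per_sqft": home["sale_price"] / home["living_area"],  # ratio
        "size_bin": size_bin(home["living_area"]),                   # binning
        "sqrt_area": math.sqrt(home["living_area"]),                 # transform
    }
    for home in homes
]

for row in prepared:
    print(row)
```

Working on a copy rather than the original rows is what makes experimenting safe: any new variable that turns out to be unhelpful can simply be dropped.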

Demo – Data Preparation

Preparation Exercise  Using Excel  Merge data  New Calculations  Fix Missing Data  Fix Outliers

Basic Steps for Analysis  Data Exploration  Data Preparation  Build Models

Explaining Insights  How do you know what you see is valid?  And not due to chance?  Correlation

Correlation  The degree to which two or more attributes or measurements on the same group of elements show a tendency to vary together  Positive when values increase together  Negative when one value increases as the other decreases

What can you tell me about this graph? Ice Cream consumption/capita Drownings

Does Ice Cream Consumption Cause Drowning?  Obviously not  Correlation does not imply Causation  One may cause the other, but correlation only describes how they vary together.  There may be other reasons, e.g. hot temperatures  Be very cautious with Causation  There are tests to determine causation

How do I know if variables are correlated?  R = Correlation Coefficient  Values between -1 & 1  Positive Correlation > 0 - As one variable increases, the other increases  Perfect Correlation = 1  Negative Correlation < 0 - As one variable increases, the other decreases  Perfect Negative Correlation = -1  0 = No correlation  Can be shown with a trend line  Understanding R and R^2

How do I know if variables are correlated?  R^2 = Coefficient of Determination  Tells how well one variable predicts the other variable  Values between 0 & 1  If R^2 = 0.850, 85% of the total variation in y can be explained by the linear relationship between x and y  R^2 is more commonly used  Understanding R and R^2
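To make R and R^2 concrete, here is a minimal Python sketch that computes the Pearson correlation coefficient by hand and squares it. For a straight-line fit with one x variable, R^2 is simply the square of R. The function name `pearson_r` and the data points are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient R for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # How x and y vary together, and how each varies on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)

# Hypothetical data points (made-up values).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_squared = r ** 2   # coefficient of determination for a one-variable line

print(round(r, 3), round(r_squared, 3))
```

With this data R is about 0.775 and R^2 is 0.6, meaning roughly 60% of the variation in y is explained by the linear relationship with x.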

Some Terminology  Independent Variable  These are the variables that you modify  In the trend equation they are the X values  Dependent Variable  These values depend on the values of the Independent variables.  In the trend equation they are the Y values  y = (Slope) x + (Intercept)  y is Living Area  x is Sale Price
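A trend line of the form y = slope × x + intercept can be fitted with ordinary least squares and then used to predict a value. The sketch below is a hand-rolled illustration; the helper names `fit_line` and `predict` and the data points are hypothetical and are not the numbers from the slide.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x   # the line passes through the means
    return slope, intercept

def predict(x, slope, intercept):
    """Apply the trend equation to a new x value."""
    return slope * x + intercept

# Hypothetical data points (made-up values).
xs = [1, 2, 3, 4, 5]   # independent variable
ys = [2, 4, 5, 4, 5]   # dependent variable

slope, intercept = fit_line(xs, ys)
print(round(slope, 3), round(intercept, 3))
print(round(predict(6, slope, intercept), 3))
```

Predictions are most trustworthy for x values near the range used to fit the line; extending far beyond that range assumes the same linear pattern continues.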

Demo – Modeling Data

Modeling Exercise  Using Excel  Create scatter plot  Show Coefficient of determination  Create a formula to predict a value

What did the Data Tell You  Did it support your initial question?  What conclusions can you make?  Make sure they are fact based  Check your bias  What is your story?  Is it compelling? Does x influence y?  Can it support actions to be taken?  If not, is there still some benefit?

What did the Data Tell You  What recommendations will you make?  Will you stand behind them?  If not, why not?  Can they really be implemented?  What is the value of implementing the recommendation?  What new questions would you ask?  To clarify your analysis?  To expand on your analysis?  Can better questions be asked?

And the most important item: Have Fun

Questions? Always ask questions!!!!

Timing  Introductions – 10 Minutes  Overview/Data exploration Lecture – 35 Minutes  Exploration Hands-on – 30 Minutes  Data Prep Lecture – 20 Minutes  Data Prep Hands-on – 25 Minutes  Data Modeling Lecture – 20 Minutes  Data Modeling Hands-on – 30 Minutes  Questions/Wrap Up – 10 Minutes  Total 3:00