Exam 3 Sample Decision Trees Cluster Analysis Association Rules Data Visualization SAS.



When to Use Which Analysis (D, C or A)?
–When someone gets an A in this class, what other classes do they get an A in?
–What predicts whether a company will go bankrupt?
–If someone upgrades to an iPhone, do they also buy a new case?
–Which party will win the election?
–Can we group our website visitors into types based on their online behaviors?
–Which customers will purchase our product?
–Can we identify different product markets based on customer demographics?


Decision Trees
Which is the root node? How many leaf nodes are there?


Probability of Purchase?
i) Female, 130 lbs, 12 ft?
ii) 120 lbs, 5 feet, male?
Best predictor variable?

Height
  <6’: Weight
    <150: Outcome 0 = 62%, 1 = 38%, n = 350
    >=150: Outcome 0 = 55%, 1 = 45%, n = 250
  >=6’: Weight
    <170: Outcome 0 = 40%, 1 = 60%, n = 150, split on Gender
      Male: Outcome 0 = 45%, 1 = 55%, n = 75
      Female: Outcome 0 = 35%, 1 = 65%, n = 75
    >=170: Outcome 0 = 60%, 1 = 40%, n = 250


Probability of Purchase? (same tree as above)
i) 5 ft 5 inches?
ii) 6 ft 5 inches, 190 lbs?
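
Reading a probability off the tree is just a walk from the root to a leaf. The sketch below hard-codes one reading of the tree and is not the course's own tool: it assumes the node boxes are listed left to right, so that the 55%/45% box is the "<6’, >=150" leaf and the 60%/40% box is the ">=6’, >=170" leaf. The function name and the inches-based height argument are hypothetical.

```python
def purchase_probability(height_in, weight_lbs, gender):
    """Return P(outcome = 1) for the leaf a customer falls into.

    Leaf values follow an assumed reading of the slide's tree;
    height is given in inches (6 ft = 72 in).
    """
    if height_in < 72:                       # Height < 6'
        return 0.38 if weight_lbs < 150 else 0.45
    if weight_lbs >= 170:                    # Height >= 6', Weight >= 170
        return 0.40
    # Height >= 6', Weight < 170: the only branch that splits on Gender
    return 0.65 if gender == "female" else 0.55


print(purchase_probability(60, 120, "male"))    # 5 ft, 120 lbs, male
print(purchase_probability(77, 190, "male"))    # 6 ft 5 in, 190 lbs
```

A customer described only by height (for example, "5 ft 5 inches") does not reach a leaf in this tree, so the function deliberately requires all three attributes.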

Decision Trees
What does it mean that Gender is only on the right side of the tree? Why is it not on both sides?
Based on the tree, which demographic is MOST likely to buy the product? Least likely to buy the product?

Decision Trees
What does it mean that Gender is only on the right side of the tree? Why is it not on both sides?
–Gender only has predictive/explanatory power for customers who are 6 feet or taller and weigh less than 170 lbs.
–That is, in other subsets of the population, it does no better than chance at predicting behavior.
Based on the tree, which demographic is MOST likely to buy the product? Least likely to buy the product?
–Most likely to buy (highest leaf-node probability of 1): 6 ft or taller, below 170 lbs, female (1 = 65% probability)
–Least likely to buy (highest leaf-node probability of 0): below 6 ft, below 150 lbs (0 = 62% probability)

Decision Trees
What statistics are used to determine splits for decision trees?
–Gini Coefficient, Chi-Square Statistic (p-value)
What does it mean when the Gini = 1?
What does it mean when the Chi-square is bigger?
What happens to the p-value as the Chi-square gets bigger?

Decision Trees
What statistics are used to determine splits for decision trees?
–Gini Coefficient, Chi-Square Statistic (p-value)
What does it mean when the Gini = 1?
–The predictor is no better than flipping a coin (you want a small Gini)
What does it mean when the Chi-square is bigger?
–The variable is better at predicting the outcome (you want a big Chi-square)
What happens to the p-value as the Chi-square gets bigger?
–The p-value gets smaller as the Chi-square gets bigger (you want a small p-value)
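
The split statistics can be computed by hand for a small example. The sketch below is illustrative only: the counts are hypothetical, and the Gini function uses the textbook impurity 1 - sum(p_i^2), which for a two-class node runs from 0 (pure) to 0.5 (a 50/50 coin flip). The slides' scale, where 1 means a coin flip, is assumed to be a rescaled version of the same idea. The p-value formula applies only to a chi-square statistic with 1 degree of freedom, as in a 2x2 table.

```python
import math


def gini(class_counts):
    """Gini impurity of one node: 1 - sum(p_i^2). Smaller is purer."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


def chi_square_2x2(table):
    """Pearson chi-square for a 2x2 table (rows = branches, columns = outcomes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


def p_value_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts of [outcome 0, outcome 1] in each branch of a candidate split
weak_split = [[52, 48], [48, 52]]
strong_split = [[80, 20], [30, 70]]

for name, table in [("weak", weak_split), ("strong", strong_split)]:
    stat = chi_square_2x2(table)
    print(name, "chi-square:", round(stat, 2), "p-value:", p_value_1df(stat))

print("Gini of a 50/50 node:", gini([50, 50]))
print("Gini of a pure node:", gini([100, 0]))
```

The stronger split produces the larger chi-square and the smaller p-value, which matches the direction described in the answers above.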

Clustering
What statistics do we care about in cluster analysis? What do they represent?
What happens to these statistics as the number of clusters is increased?
Why do we standardize data? Why do we eliminate outliers?

Clustering
What statistic do we care about in cluster analysis? What does it represent?
–Sum of Squared Errors, SSE (or Root Mean Square Std Dev.)
–Within SSE = cohesion, Between SSE = distinctiveness
What happens to these statistics as the number of clusters is increased?
–SSE goes down (both within and between)
–Clusters become more cohesive, but less distinct
Why do we standardize data? Why do we eliminate outliers?
–Standardize; otherwise variables with bigger values will have greater weighting
–Eliminate outliers because they can skew results
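
Standardization and within-cluster SSE are both short calculations. The sketch below uses made-up age and income values and hypothetical helper names; it is a hand-rolled illustration of the two ideas, not the procedure the course software runs. Each variable is converted to z-scores so that income (large numbers) does not swamp age (small numbers), and within SSE is the sum of squared distances from each point to its cluster centroid.

```python
import statistics


def standardize(values):
    """Convert raw values to z-scores (mean 0, population std dev 1)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def within_sse(clusters):
    """Sum of squared distances from every point to its own cluster centroid."""
    total = 0.0
    for points in clusters:
        centroid = [statistics.mean(dim) for dim in zip(*points)]
        for p in points:
            total += sum((x - c) ** 2 for x, c in zip(p, centroid))
    return total


# Hypothetical customers: age in years, income in dollars
ages = [25, 30, 45, 50]
incomes = [30000, 32000, 90000, 95000]
points = list(zip(standardize(ages), standardize(incomes)))

one_cluster = [points]
two_clusters = [points[:2], points[2:]]

print("Within SSE, 1 cluster:", within_sse(one_cluster))
print("Within SSE, 2 clusters:", within_sse(two_clusters))
```

Splitting the four customers into a younger/lower-income group and an older/higher-income group lowers the within SSE, which is the "more cohesive" effect of adding clusters.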

Clustering
What are the pros and cons of having only a few clusters (compared to having many clusters)?
What is bad about the below cluster analysis result? How would you improve it?

Clustering
What are the pros and cons of having only a few clusters (compared to having many clusters)?
–Easier to interpret/analyze, but they may be less informative
What is bad about the below cluster analysis result? How would you improve it?
–Clusters should be fairly round!
–Add more clusters.

Association Rules
How would you describe the following association rule?
–{Meat, Dairy} → {Vegetables}
How many items are in this item set?
What is (are) the antecedents? What are the consequents?
What are the statistics we care about when evaluating an association rule?

Association Rules
How would you describe the following association rule?
–{Meat, Dairy} → {Vegetables}
–When someone eats meat and dairy they also eat vegetables.
How many items are in this item set?
–This is a 3 item set.
What is (are) the antecedents? What are the consequents?
–Meat and Dairy are the antecedents, Vegetables is the consequent.
What are the statistics we care about when evaluating an association rule?
–Support count, Support Percent, Confidence and Lift

Association Rules
Do the following two rules have to have the same Confidence? The same Support? The same Lift?
–{Meat, Dairy} → {Vegetables}
–{Vegetables} → {Meat, Dairy}
What does Lift > 1 mean? Would you take action on such a rule?
–What about Lift < 1?
–What about Lift = 1?

Association Rules
Do the following two rules have to have the same Confidence? The same Support? The same Lift?
–{Meat, Dairy} → {Vegetables}
–{Vegetables} → {Meat, Dairy}
–Confidence: no. Support: yes. Lift: yes.
What does Lift > 1 mean? Would you take action on such a rule?
–Lift > 1: more co-purchase observed than chance would predict (+ association)
–Lift < 1: less co-purchase than chance predicts (- association)
–Lift = 1: chance explains the observed co-purchase (no apparent association)
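
The symmetry of Support and Lift, and the asymmetry of Confidence, can be checked on a small example. The baskets below are invented for illustration and the helper functions are a minimal sketch using the usual definitions: support is the share of baskets containing every item in the set, confidence is support(antecedent and consequent) / support(antecedent), and lift is confidence / support(consequent).

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """confidence(antecedent -> consequent) / support(consequent)."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Hypothetical market baskets
baskets = [
    {"Meat", "Dairy", "Vegetables"},
    {"Meat", "Dairy", "Vegetables"},
    {"Meat", "Dairy"},
    {"Vegetables"},
    {"Vegetables"},
    {"Bread"},
    {"Bread"},
    {"Bread"},
]

lhs, rhs = {"Meat", "Dairy"}, {"Vegetables"}
print("Support (same both ways):", support(baskets, lhs | rhs))
print("Confidence {Meat, Dairy} -> {Vegetables}:", confidence(baskets, lhs, rhs))
print("Confidence {Vegetables} -> {Meat, Dairy}:", confidence(baskets, rhs, lhs))
print("Lift, forward:", lift(baskets, lhs, rhs))
print("Lift, reverse:", lift(baskets, rhs, lhs))
```

In this data the two confidences differ because the antecedents appear in a different number of baskets, while the lift is the same in both directions and is above 1.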

Association Rules
What might you do as a manager if you saw a very high Lift and Confidence for the following rule about product purchase? Why would you do this?
–{Pasta} → {Orange Juice}

Association Rules
What might you do as a manager if you saw a very high Lift and Confidence for the following rule about product purchase? Why would you do this?
–{Pasta} → {Orange Juice}
–Encourage pasta buyers to see OJ (placement)
–Get them in and milk ‘em (discount pasta, premium OJ)
–Target market (advertise new OJ to pasta customers)

Association Rules What is the most reliable association rule below?

Association Rules
What is the most reliable association rule below?
–Rule 2: tied for best Lift (3.60), but has better Confidence!

Data Visualization Look at In-Class Exercise Answers...