SAS Homework 4 Review Clustering and Segmentation


SAS Homework 4 Review Clustering and Segmentation MIS2502 Data Analytics

SAS Homework 4 Review Clustering and Segmentation
Using the AAEM.DUNGAREE data set:
1. Explore the data set: SALESTOT and STOREID
2. Assign the ID role to STOREID
3. Set the SALESTOT role to Rejected
4. Add a Cluster node (Explore)
5. In Properties, select Internal Standardization => Standardize
6. Run and evaluate
7. Change the Segment Max property to 6
8. Add a Segment Profile node (Assess)

Set Up: Retail – looking for patterns in sales of types of jeans by store

Data Source - Edit Variables

Data Source – Explore: note the scale

Add Cluster Node, Standardize

Segments, Automatic: note the root mean square standard deviation

Change Number of Clusters to 6

Segments, Max 6: note the root mean square standard deviation

Segment Profile Node

Segment Profiles: the red outline is the overall distribution

Questions

How do the SALESTOT and STOREID distributions differ from the other variables' distributions (look at the histograms of each one)?

Assign STOREID a model role of ID and SALESTOT a model role of Rejected. Make sure that the remaining variables have the Input model role and the Interval measurement level. Based on the variable descriptions on page 1 and your answer to part [...], why do you think that the variable SALESTOT should be rejected?

Add a Cluster node to the diagram workspace and connect it to the Input Data node.

Select the Cluster node and select Internal Standardization => Standardization. Why is it important to standardize your inputs? (Hint: look at the range of the scales on the x axis of the histograms.)

Run the diagram from the Cluster node and examine the results. How many clusters are created? What might be a problem with having so many clusters?

What is the highest root mean squared standard deviation among the clusters? Two hints: Look at the Mean Statistics window. The root mean squared standard deviation means basically the same thing as the sum of squares error.
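The second hint can be made concrete with a small calculation. The sketch below is written in Python rather than SAS and assumes the root mean squared standard deviation of a cluster is the square root of the average, across variables, of the within-cluster sample variances. The function name `rms_std` and the data values are hypothetical, and the Cluster node's exact computation may differ in detail. Under this assumption, squaring the result and multiplying by (number of variables) x (cluster size - 1) gives the cluster's sum of squared errors, which is why the two measures move together.

```python
from math import sqrt
from statistics import variance

def rms_std(cluster_rows):
    """Root mean squared standard deviation of one cluster (assumed definition).

    cluster_rows: list of equal-length tuples, one tuple per store.
    Takes the sample variance of each variable within the cluster,
    averages those variances, and returns the square root.
    """
    columns = list(zip(*cluster_rows))
    variances = [variance(col) for col in columns]
    return sqrt(sum(variances) / len(variances))

# Hypothetical standardized values for two small clusters.
tight_cluster = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15)]
loose_cluster = [(-1.0, 2.0), (1.5, -0.5), (0.0, 1.0)]

# A lower value means the stores in the cluster are more alike.
print(rms_std(tight_cluster))
print(rms_std(loose_cluster))
```

The cluster with the highest value is the least cohesive one, which is what the question asks you to find in the Mean Statistics window.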

Distribution of Store Id

Distribution of SALESTOT. It does tell you that there are a handful of stores selling well below average. These 2 variables (STOREID and SALESTOT) aren't useful for the product mix analysis.

Why Standardize ? Note difference in range of numbers on x axis
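To illustrate why the range of the x axis matters, here is a minimal Python sketch (not SAS) of z-score standardization, one common form of standardization. The store values and the variable names `original_jeans` and `fashion_jeans` are hypothetical and chosen only so that the two variables sit on very different scales.

```python
from math import dist
from statistics import mean, pstdev

def standardize(column):
    """Return z-scores: (value - mean) / standard deviation."""
    m = mean(column)
    s = pstdev(column)
    return [(x - m) / s for x in column]

# Hypothetical sales for three stores on two very different scales.
original_jeans = [1200.0, 1500.0, 1800.0]  # values in the thousands
fashion_jeans = [10.0, 40.0, 20.0]         # values in the tens

raw = list(zip(original_jeans, fashion_jeans))
scaled = list(zip(standardize(original_jeans), standardize(fashion_jeans)))

# Raw distances from store 0 are driven almost entirely by the large-scale variable.
print(dist(raw[0], raw[1]), dist(raw[0], raw[2]))
# After standardizing, both variables contribute on a comparable footing.
print(dist(scaled[0], scaled[1]), dist(scaled[0], scaled[2]))
```

With these made-up numbers, store 1 looks closer to store 0 than store 2 does on the raw scale, but the order reverses once both variables are standardized. Clustering works from distances like these, so unstandardized inputs let the variable with the widest range decide the clusters.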

Segment Profile Node

Reading a Histogram

Look at the distribution as a whole, and then at the individual bars. For this distribution you would say that this segment sells fewer Original Jeans than average, and in a narrower range with less variability (not part of the question). You can say this because the segment's distribution is to the left of, and 'tighter' than, the overall distribution.

1) The red bars are the distribution of Original Jeans sales over all segments. By comparing the specific segment distribution (blue) to the overall distribution (red), you can make some observations about what makes this segment different with regard to Original Jeans sold.
2) Note that there are 8 ranges of standardized sales volumes on the x axis for the overall distribution (red). These are ordered from lowest (on the left) to highest (on the right). We established this earlier when looking at the individual segments.
3) Note that for ranges 3, 4 and 5, the overall distribution (red) shows that roughly 65% of stores sell in these volume ranges (11%, 23% and 31% respectively). You get this by reading the y axis.
4) Now look at the specific segment distribution (blue). For this segment, approximately 86% of the stores sell within volume ranges 3 and 4.
5) Conclusion: Overall, this segment has more stores selling Original Jeans in the lower volume ranges than the overall distribution does. Therefore, for this segment we can say that the stores sell fewer Original Jeans than average.
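The arithmetic behind reading the bars can be sketched in a few lines of Python. Only the percentages quoted in this walkthrough are used; the other ranges are left out rather than guessed, and the helper name `share_in_ranges` is hypothetical.

```python
# Percent of stores per volume range for the overall distribution (red),
# keyed by range number (1 = lowest volume). Only the quoted values are filled in.
overall_pct = {3: 11, 4: 23, 5: 31}

def share_in_ranges(pct_by_range, ranges):
    """Sum the percentage of stores falling in the given volume ranges."""
    return sum(pct_by_range[r] for r in ranges)

# Ranges 3, 4 and 5 together hold roughly 65% of all stores.
overall_share_3_to_5 = share_in_ranges(overall_pct, [3, 4, 5])
print(overall_share_3_to_5)  # 65

# For a like-for-like comparison with the segment, use ranges 3 and 4 only.
overall_share_3_4 = share_in_ranges(overall_pct, [3, 4])
segment_share_3_4 = 86  # approximate value read from the blue bars
print(overall_share_3_4, segment_share_3_4)  # 34 86
```

Across all stores, only about 34% fall in ranges 3 and 4, compared with about 86% for this segment, which is the numerical version of saying the segment's distribution sits to the left of the overall one.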

Segment Profiles (Original): the red outline is the overall distribution

In Class
Answer the questions about this output:
1. How many distinct customer groups (segments) are there?
2. Explain how the customers in cluster 1 differ from those in cluster 2.
3. What aspect of the customer data most differentiates cluster 1 from cluster 3?
4. Which cluster has the highest cohesion? In practical terms, what does that mean?

In Class – Evaluating Clustering Output
5. Is the root mean squared standard deviation of these clusters higher or lower than it was in the three cluster scenario? Why?
6. Is the distance to the nearest cluster higher or lower than in the three cluster scenario? Why?
7. Which scenario (#1 or #2) has higher cohesion among its clusters?
8. Which scenario (#1 or #2) has higher separation between its clusters?
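Cohesion and separation can both be computed directly from cluster members. The Python sketch below uses the within-cluster sum of squared errors as the cohesion measure (lower means more cohesive) and the distance from each centroid to its nearest neighboring centroid as the separation measure (higher means better separated). The points and function names are hypothetical and are not taken from the in-class output.

```python
from math import dist

def centroid(rows):
    """Mean of each variable across the rows of one cluster."""
    return tuple(sum(col) / len(col) for col in zip(*rows))

def within_sse(rows):
    """Sum of squared distances from each row to its cluster centroid."""
    c = centroid(rows)
    return sum(dist(r, c) ** 2 for r in rows)

def nearest_cluster_distances(clusters):
    """For each cluster, the distance from its centroid to the closest other centroid."""
    cents = [centroid(rows) for rows in clusters]
    return [
        min(dist(c, other) for j, other in enumerate(cents) if j != i)
        for i, c in enumerate(cents)
    ]

# Hypothetical three-cluster solution.
three = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
    [(5.0, 8.0), (5.0, 10.0)],
]
# Hypothetical two-cluster solution: the first two clusters merged into one.
two = [three[0] + three[1], three[2]]

print([within_sse(c) for c in three], nearest_cluster_distances(three))
print([within_sse(c) for c in two], nearest_cluster_distances(two))
```

In this toy example, merging two clusters raises the sum of squared errors of the merged cluster sharply, so fewer clusters tend to mean lower cohesion. The nearest-cluster distance shrinks slightly here because the merged centroid moves, so separation has to be checked case by case rather than assumed.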