Published by Jayson Davis. Modified over 9 years ago.
1
SAS Homework 4 Review Clustering and Segmentation
MIS2502 Data Analytics
2
SAS Homework 4 Review Clustering and Segmentation
Using the AAEM.DUNGAREE data set:
Explore the data set: SALESTOT and STOREID
Assign the ID role to STOREID
Set the SALESTOT role to Rejected
Add a Cluster node (Explore)
In Properties, select Internal Standardization => Standardize
Run and evaluate
Change the Segment Max property to 6
Add a Segment Profile node (Assess)
3
Set Up Retail – looking for patterns in sales of types of jeans by store
4
Data Source - Edit Variables
5
Data Source – Explore (note the scale)
6
Add Cluster Node, Standardize
7
Segments, Automatic (note the root mean square std deviation)
8
Change Number of Clusters to 6
9
Segments, Max 6 (note the root mean square std deviation)
10
Segment Profile Node
11
Segment Profiles (the red outline is the overall distribution)
12
Questions
How do the SALESTOT and STOREID distributions differ from the other variables’ distributions? (Look at the histogram of each one.)
Assign STOREID a model role of ID and SALESTOT a model role of Rejected. Make sure that the remaining variables have the Input model role and the Interval measurement level. Based on the variable descriptions on page 1 and your answer to the earlier part, why do you think that the variable SALESTOT should be rejected?
Add a Cluster node to the diagram workspace and connect it to the Input Data node.
Select the Cluster node and select Internal Standardization => Standardization. Why is it important to standardize your inputs? (Hint: look at the range of the scales on the X axis of the histograms.)
Run the diagram from the Cluster node and examine the results. How many clusters are created? What might be a problem with having so many clusters?
What is the highest root mean squared standard deviation among the clusters? Two hints: Look at the Mean Statistics window. The root mean squared standard deviation means basically the same thing as the sum of squares error.
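The hint about the root mean squared standard deviation can be made concrete with a minimal Python sketch. It assumes the statistic is the root mean square, across variables, of the within-cluster standard deviations, which is the cluster's sum of squared errors rescaled by cluster size and number of variables. The function name `rms_std` and the two small clusters are hypothetical and are not taken from the DUNGAREE data; the tool's own output in the Mean Statistics window remains the reference.

```python
from math import sqrt

def rms_std(cluster):
    """Root mean square, across variables, of the within-cluster std deviations.

    cluster: list of equal-length tuples, one tuple of standardized inputs per store.
    Assumption: sample variance (divide by n - 1) is used for each variable.
    """
    n = len(cluster)
    n_vars = len(cluster[0])
    if n < 2:
        return 0.0  # a single point has no spread
    centroid = [sum(row[j] for row in cluster) / n for j in range(n_vars)]
    # Sum of squared errors: squared distance of every point to the centroid.
    sse = sum((row[j] - centroid[j]) ** 2 for row in cluster for j in range(n_vars))
    return sqrt(sse / (n_vars * (n - 1)))

# Hypothetical clusters: same shape, but one is ten times more spread out.
tight = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2)]
loose = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]

print(rms_std(tight), rms_std(loose))
```

A lower value means the stores in a cluster sit closer to their centroid, which is what the questions refer to as higher cohesion.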
13
Distribution of Store Id
14
Distribution of SALESTOT
It does tell you that there are a handful of stores selling well below average. These two variables aren’t useful for the product mix analysis.
15
Why Standardize? Note the difference in the range of numbers on the x axis
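A small Python sketch shows what standardizing does to inputs on very different scales. It assumes a plain z-score (subtract the mean, divide by the standard deviation); the exact formula behind the Cluster node's Standardize option is not stated on the slides. The variable names and sales figures are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std dev, chosen here for simplicity
    if sigma == 0:
        return [0.0 for _ in values]  # constant input carries no information
    return [(v - mu) / sigma for v in values]

# Hypothetical unit sales of two jean types at the same five stores.
high_volume_type = [1200, 1500, 1800, 2100, 2400]
low_volume_type = [10, 20, 30, 40, 50]

z_high = standardize(high_volume_type)
z_low = standardize(low_volume_type)

print([round(z, 3) for z in z_high])
print([round(z, 3) for z in z_low])
```

Before standardizing, distances between stores would be driven almost entirely by the high-volume variable. After standardizing, both variables are on the same scale and contribute comparably to the clustering.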
16
Segment Profile Node
17
Reading a Histogram
Look at the distribution in total, and then the individual bars. For this distribution you would say that this segment sells fewer original jeans than average, and in a narrower range, with less variability (not part of the question). You can say this because the segment distribution is to the left of, and 'tighter' than, the overall distribution.
1) The red bars are the distribution of Original Jeans sales over all segments. By comparing the specific segment distribution (blue) to the overall distribution (red), you can make some observations about what makes this segment different with regard to Original Jeans sold.
2) Note that there are 8 ranges of standardized sales volumes on the x axis for the overall distribution (red). These are ordered from lowest (on the left) to highest (on the right). We established this earlier when looking at the individual segments.
3) Note that for ranges 3, 4 and 5, the overall distribution (red) shows that roughly 65% of stores sell in these volume ranges (11%, 23% and 31% respectively). You get this by reading the Y axis.
4) Now look at the specific segment distribution (blue). For this segment, approximately 86% of the stores sell within volume ranges 3 and 4.
5) Conclusion: Overall, this segment has more stores selling original jeans in lower volume ranges than the overall distribution. Therefore, for this segment we can say that the stores sell fewer Original Jeans than average.
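The comparison in steps 1 to 5 amounts to computing, for each volume range, the share of stores that fall in it, once for all stores and once for the segment. The Python sketch below illustrates that reading with hypothetical standardized sales values and four ranges instead of eight; `bin_shares` is an illustrative helper, not part of the SAS tool.

```python
def bin_shares(values, edges):
    """Fraction of values in each bin [edges[i], edges[i+1]).

    The last bin also includes its right edge. Values outside the
    edges are not counted in any bin.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return [c / len(values) for c in counts]

# Hypothetical standardized sales volumes (not the DUNGAREE data).
edges = [-2, -1, 0, 1, 2]
all_stores = [-1.5, -0.5, -0.5, 0.5, 0.5, 0.5, 1.5, 1.5]   # the "red" distribution
segment = [-1.5, -0.5, -0.5, 0.5]                          # the "blue" distribution

overall_shares = bin_shares(all_stores, edges)
segment_shares = bin_shares(segment, edges)

# Share of stores in the two lowest ranges, overall vs. this segment.
low_overall = sum(overall_shares[:2])
low_segment = sum(segment_shares[:2])
print(low_overall, low_segment)
```

In this made-up data the segment has a larger share of its stores in the lowest ranges than the overall distribution does, which is the same pattern the slide reads off the Y axis.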
18
Segment Profiles (the red outline is the overall distribution)
Original Segment Profiles red outline is the overall distribution
19
In Class Answer the questions about this output:
1. How many distinct customer groups (segments) are there?
2. Explain how the customers in cluster 1 differ from those in cluster 2.
3. What aspect of the customer data most differentiates cluster 1 from cluster 3?
4. Which cluster has the highest cohesion? In practical terms, what does that mean?
20
In Class – Evaluating Clustering Output
5. Is the root mean squared standard deviation of these clusters higher or lower than it was in the three cluster scenario? Why?
6. Is the distance to the nearest cluster higher or lower than in the three cluster scenario? Why?
7. Which scenario (#1 or #2) has higher cohesion among its clusters?
8. Which scenario (#1 or #2) has higher separation between its clusters?
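Separation can be explored with a short Python sketch that computes, for each cluster, the distance from its centroid to the closest other centroid. It assumes Euclidean distance on standardized inputs. The centroids are hypothetical and stand in for a three cluster and a six cluster scenario over the same region; they are not the values from the class output.

```python
from math import dist

def nearest_cluster_distances(centroids):
    """For each centroid, the Euclidean distance to the closest other centroid."""
    return [
        min(dist(c, other) for j, other in enumerate(centroids) if j != i)
        for i, c in enumerate(centroids)
    ]

# Hypothetical centroids: the six cluster case adds three centroids
# inside the region already covered by the three cluster case.
three = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
six = three + [(2.0, 0.0), (0.0, 1.5), (2.0, 1.5)]

print(nearest_cluster_distances(three))
print(nearest_cluster_distances(six))
```

With these made-up centroids, every nearest-cluster distance in the six cluster case is smaller than in the three cluster case. Splitting the same data into more clusters tends to place centroids closer together, so cohesion and separation usually have to be weighed against each other.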