K-Means Lab

Start with 1D Data

# input data: https://cse.sc.edu/~rose/587/CSV/state_income.csv
# 1) use wget to copy the dataset to the VM
# 2) use GUI import from Text (base) to import state_income.csv

# sort the data
tmp = sort(state_income$V2)

# visualize the sorted data
plot(tmp)

# create 4 clusters (at most 15 iterations)
kmeans = kmeans(tmp, 4, 15)

# visualize the cluster centers
points(kmeans$centers, col = 1:4, pch = 20)
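
If you prefer to skip the GUI import, the file can also be read directly from R. A minimal sketch, assuming the CSV was downloaded to the working directory and has no header row (so the columns get the default names V1, V2):

# read the downloaded file; header = FALSE keeps the default V1, V2 column names
state_income = read.csv("state_income.csv", header = FALSE)
str(state_income)   # check that V2 is the numeric income column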

Higher Dimensional Data

# input data: https://cse.sc.edu/~rose/587/CSV/iris.csv
# 1) use wget to copy the dataset to the VM
# 2) use GUI import from Text (base) to import iris.csv

# let's start with 2 dimensions: visualize the data
plot(iris[,1:2])

# create 4 clusters (at most 15 iterations)
kmeans = kmeans(iris[,1:2], 4, 15)

# visualize the cluster centers
points(kmeans$centers, col = 1:4, pch = 20)

# examine the kmeans object
kmeans

# note the clustering statistics in the printed output:
#   cluster sizes
#   cluster means
#   clustering vector
#   within-cluster sum of squares
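
The statistics in the printed output are also stored as components of the object, so they can be inspected individually:

# cluster sizes
kmeans$size
# cluster means (the centers)
kmeans$centers
# clustering vector (the cluster assigned to each row)
kmeans$cluster
# within-cluster sum of squares, one value per cluster
kmeans$withinss
# total within-cluster sum of squares
kmeans$tot.withinss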

Plot Results

# visualize the clusters
plot(iris[,1:2], col = kmeans$cluster)

# visualize the cluster centers
points(kmeans$centers, col = 1:4, pch = 20)

Best K?

# Is 4 the best number of clusters?
# explore different numbers of clusters
withinSumSqrs = numeric(20)
for (k in 1:20)
  withinSumSqrs[k] = sum(kmeans(iris[,1:2], centers = k)$withinss)

# visualize the within-cluster sum of squares
plot(1:20, withinSumSqrs, type = "b",
     xlab = "# Clusters", ylab = "Within sum of squares")
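
Because kmeans starts from randomly chosen centers, the curve can change a little from run to run. One way to make it more stable is sketched below; set.seed() and the nstart argument are standard, but the particular values here are only illustrative:

# fix the random seed and restart kmeans several times per k
set.seed(42)
withinSumSqrs = numeric(20)
for (k in 1:20)
  withinSumSqrs[k] = sum(kmeans(iris[,1:2], centers = k, nstart = 10)$withinss)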

Best K? The Elbow Method

Look for the "elbow" in the plot: the point where adding more clusters stops producing a large drop in WSS. We want a small k that still has a low WSS value. What does a low WSS value indicate?
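
To see exactly what WSS measures, you can recompute it by hand from the cluster assignments; a short sketch, assuming the kmeans object from the 2-dimensional run above is still in the workspace:

# squared distance from each point to the center of its assigned cluster
pts     = as.matrix(iris[,1:2])
centers = kmeans$centers[kmeans$cluster, ]   # one center row per data point
sqDist  = rowSums((pts - centers)^2)

sum(sqDist)            # total within-cluster sum of squares, computed by hand
kmeans$tot.withinss    # should match, up to floating-point rounding

A low WSS therefore means the points sit close to their assigned centers, i.e. the clusters are tight.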

Higher Dimensional Data

# let's consider 3 dimensions: visualize the data
plot(iris[,1:3])

# create 4 clusters (at most 15 iterations)
kmeans = kmeans(iris[,1:3], 4, 15)

# visualize the clusters
plot(iris[,1:3], col = kmeans$cluster)

Higher Dimensional Data

# examine a single 2D projection (Sepal.Length vs. Sepal.Width)
plot(iris[,c(1,2)], col = kmeans$cluster)

# visualize the cluster centers in the same projection
points(kmeans$centers[,c(1,2)], col = 1:4, pch = 20)

# examine another single 2D projection (Sepal.Length vs. Petal.Length)
plot(iris[,c(1,3)], col = kmeans$cluster)
points(kmeans$centers[,c(1,3)], col = 1:4, pch = 20)

Best K?

# Is 4 the best number of clusters?
# explore different numbers of clusters
withinSumSqrs = numeric(20)
for (k in 1:20)
  withinSumSqrs[k] = sum(kmeans(iris[,1:3], centers = k)$withinss)

# visualize the within-cluster sum of squares
plot(1:20, withinSumSqrs, type = "b",
     xlab = "# Clusters", ylab = "Within sum of squares")

Higher Dimensional Data

# finally, consider all 4 dimensions: visualize the data
plot(iris[,1:4])

# create 4 clusters (at most 15 iterations)
kmeans = kmeans(iris[,1:4], 4, 15)

# visualize the clusters
plot(iris[,1:4], col = kmeans$cluster)

Normalization?

# notice that the columns are not on the same scale
> summary(iris[,1:4])
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

# Which will have more influence: petal width or petal length?
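
One way to make this question concrete is to compare the spread of each column directly, since columns with a larger spread contribute larger squared distances to the Euclidean distance that k-means minimizes. A quick base-R sketch:

# range (max - min) of each column
sapply(iris[,1:4], function(x) diff(range(x)))

# variance of each column
sapply(iris[,1:4], var)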

Normalization?

# Let's shift and scale the data to be in the range [0,1]
# so that each dimension can exert equal influence.

# function to "shift" a vector:
# subtract the min value so that it ranges from 0 to ??
myShift = function(x) { x - min(x, na.rm=TRUE) }

# apply this function to one column
myShift(iris[,3])

# apply this function to multiple columns
as.data.frame(lapply(iris[,1:4], myShift))

# verify with summary statistics
summary(as.data.frame(lapply(iris[,1:4], myShift)))
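
To answer the "0 to ??" question for one column, just inspect the shifted values; this only uses the myShift function defined above:

# minimum and maximum of the shifted column
range(myShift(iris[,3]))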

Normalization?

# We still need to scale the data.

# function to compute the "scale" (range) of a vector
myScale = function(x) { max(x, na.rm=TRUE) - min(x, na.rm=TRUE) }

# apply this function to one column
myScale(iris[,3])

# apply this function to multiple columns
as.data.frame(lapply(iris[,1:4], myScale))

Normalization?

# Put it all together
# myShift = function(x) { x - min(x, na.rm=TRUE) }
# myScale = function(x) { max(x, na.rm=TRUE) - min(x, na.rm=TRUE) }
myNorm = function(x) { myShift(x) / myScale(x) }

# apply this function to a column
myNorm(iris[,3])

# apply this function to multiple columns
tmp = as.data.frame(lapply(iris[,1:4], myNorm))

# verify with summary statistics
summary(tmp)
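
A quick sanity check that every normalized column now runs from 0 to 1, using only the tmp data frame created above:

# min and max of each normalized column; each should be 0 and 1
sapply(tmp, range)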

Higher Dimensional Data

# visualize the "normalized" data
plot(tmp[,1:4])

# try from 1 to 20 clusters
withinSumSqrs = numeric(20)
for (k in 1:20)
  withinSumSqrs[k] = sum(kmeans(tmp[,1:4], centers = k)$withinss)

# visualize the within-cluster sum of squares
plot(1:20, withinSumSqrs, type = "b",
     xlab = "# Clusters", ylab = "Within sum of squares")
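
Once the elbow plot suggests a value of k, you can re-run the clustering on the normalized data and visualize it the same way as before. The sketch below uses kBest = 3 purely as a placeholder; substitute whatever value you read off your own elbow plot:

# placeholder: replace with the k suggested by your elbow plot
kBest = 3
kmeans = kmeans(tmp[,1:4], kBest, 15)

# visualize the clusters across all pairs of normalized dimensions
plot(tmp[,1:4], col = kmeans$cluster)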