Published by Dominic Golden. Modified over 9 years ago.
1
Cluster Analysis: Classifying the Exoplanets
2
Cluster Analysis
Simple idea, difficult execution.
Used for indexing large amounts of data in databases (a very hot skill to have: 70/hour).
“The best form of cluster analysis is ordination, because ordination is not a form of cluster analysis.” –Morgan Byron
–No formal definition of a cluster.
–Results are descriptive and subjective.
3
R Commands
library("scatterplot3d")
scatterplot3d(log(planets$mass), log(planets$period), log(planets$eccen),
              type = "h", angle = 55, scale.y = 0.7, pch = 16,
              y.ticklabs = seq(0, 10, by = 2), y.margin.add = 0.1)
–log() takes the logarithm of each data point.
–angle and scale.y set the viewing angle and the depth scale so that the plot looks like a box.
–pch is the symbol used for the data points.
–seq() generates the tick labels for the y axis.
–y.margin.add adds extra space between the tick labels and the axis label of the y axis.
4
Interpretation
No real insight after our first view of the data, but it looks neat.
5
R Commands
rge <- apply(planets, 2, max) - apply(planets, 2, min)
–Stores the range of each variable; the 2 indicates the column margin of the data matrix.
planet.dat <- sweep(planets, 2, rge, FUN = "/")
–Divides each element of the matrix by the range of its column.
n <- nrow(planet.dat)
wss <- rep(0, 10)
–Creates a vector of 10 zeros.
wss[1] <- (n - 1) * sum(apply(planet.dat, 2, var))
–This is the sum of squares of all the points about their mean, i.e. the within-group sum of squares if the data form 1 group.
for (i in 2:10) wss[i] <- sum(kmeans(planet.dat, centers = i)$withinss)
–For each number of groups from 2 to 10, runs kmeans and adds up the within-group sums of squares of all the groups.
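The range scaling with sweep() and the one-group sum of squares can be checked on a small example. This is a minimal sketch: the matrix `toy` and its values are invented for illustration and are not the planets data.

```r
# Toy data: 4 observations, 2 variables (values invented for illustration)
toy <- matrix(c(1, 2, 3, 6,
                10, 30, 20, 50),
              ncol = 2, dimnames = list(NULL, c("a", "b")))

# Range of each column (MARGIN = 2 means "by column")
rge <- apply(toy, 2, max) - apply(toy, 2, min)

# Divide every column by its own range
toy.dat <- sweep(toy, 2, rge, FUN = "/")

# After scaling, every column spans exactly one unit
scaled_range <- apply(toy.dat, 2, max) - apply(toy.dat, 2, min)

# One-group sum of squares, computed two ways
n <- nrow(toy.dat)
wss1 <- (n - 1) * sum(apply(toy.dat, 2, var))
centred <- sweep(toy.dat, 2, colMeans(toy.dat))  # default FUN is "-"
wss1_direct <- sum(centred^2)

print(scaled_range)
print(c(wss1, wss1_direct))
```

The two values printed last should agree, which shows that (n - 1) times the summed column variances is the sum of squared deviations from the column means.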
6
The K-Means Method
This method partitions the data so as to minimize a numerical criterion, often a notion of distance.
The criterion used in this analysis is the sum of squares of the data within each group: for a given number of groups, k-means looks for the partition with the lowest within-group SS.
Checking every possible partition is impractical, because the number of partitions increases very quickly as the number of groups and data points increases.
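To get a feel for how quickly the number of partitions grows, the count of ways to split n objects into exactly k non-empty groups (a Stirling number of the second kind) can be computed directly. This is an illustrative sketch; the helper name `n_partitions` is made up here and is not part of the slides.

```r
# Number of ways to split n objects into exactly k non-empty groups
# (Stirling number of the second kind), via the inclusion-exclusion formula
n_partitions <- function(n, k) {
  j <- 0:k
  sum((-1)^j * choose(k, j) * (k - j)^n) / factorial(k)
}

n_partitions(4, 2)    # 7
n_partitions(15, 3)   # 2375101
n_partitions(100, 5)  # roughly 6.6e67
```

Even 100 points in 5 groups give far too many partitions to enumerate, which is why an iterative search is used instead of an exhaustive one.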
7
The “Elbow”
An easy approach to choosing a good number of partitions is to look for the “elbow”, the sharpest bend in the graph of within-group sum of squares against number of groups.
–The sharpest bends appear to be at 3 and 5 groups.
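The elbow idea can be tried on data where the right answer is known. The sketch below uses synthetic data with three well-separated groups (not the planets data) and adds `nstart = 25`, which the slides do not use, so that each k-means fit is the best of 25 random starts.

```r
set.seed(42)  # k-means uses random starts, so fix the seed

# Synthetic data: three well-separated groups in two dimensions
sim <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

n <- nrow(sim)
wss <- numeric(10)
wss[1] <- (n - 1) * sum(apply(sim, 2, var))
for (i in 2:10) {
  # nstart = 25 reruns k-means from 25 random starts and keeps the best fit
  wss[i] <- sum(kmeans(sim, centers = i, nstart = 25)$withinss)
}

plot(1:10, wss, type = "b",
     xlab = "Number of groups", ylab = "Within groups sum of squares")

# Drop in WSS gained by each extra group; the elbow is where the drops level off
print(round(-diff(wss), 2))
```

With groups this cleanly separated, the curve should fall steeply up to 3 groups and flatten afterwards. Real data, such as the planets, usually give a less obvious bend.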
8
Number of planets in the groups
planet_kmeans3 <- kmeans(planet.dat, centers = 3)
–We chose to try 3 groups.
table(planet_kmeans3$cluster)
 1  2  3
14 53 34
ccent <- function(cl) {
  f <- function(i) colMeans(planets[cl == i, ])   # mean of each variable within cluster i
  x <- sapply(sort(unique(cl)), f)                # one column per cluster, in sorted label order
  colnames(x) <- sort(unique(cl))
  return(x)
}
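The same pattern, clustering on range-scaled data and then reporting cluster means on the original scale, can be run end to end on made-up data. In this sketch the data frame `raw` is synthetic, and `ccent` takes the data as an explicit second argument, which is a small variation on the slide's version (that one reads `planets` from the global environment).

```r
set.seed(1)

# Synthetic stand-in for the raw data: two groups, two variables
raw <- data.frame(
  x = c(rnorm(20, mean = 1), rnorm(20, mean = 10)),
  y = c(rnorm(20, mean = 5), rnorm(20, mean = 50))
)

# Cluster on range-scaled data, as in the slides
rge <- apply(raw, 2, max) - apply(raw, 2, min)
scaled <- sweep(raw, 2, rge, FUN = "/")
fit <- kmeans(scaled, centers = 2, nstart = 25)

# Cluster means on the original scale
ccent <- function(cl, dat) {
  f <- function(i) colMeans(dat[cl == i, , drop = FALSE])
  x <- sapply(sort(unique(cl)), f)
  colnames(x) <- sort(unique(cl))
  x
}

centres <- ccent(fit$cluster, raw)
print(centres)
print(table(fit$cluster))
```

The cluster labels (1, 2, ...) are arbitrary and can change between runs with different seeds, so compare clusters by their centres rather than by label.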
9
The results
> ccent(planet_kmeans3$cluster)
Cluster          1           2           3
mass      10.56786   1.6710566   2.9276471
period  1693.17201 427.7105892 616.0760882
eccen      0.36650   0.1219491   0.4953529
Number          14          53          34
10
Model-Based Clustering in brief
–The subjective decision or assumption is the number of clusters.
–Given that number, it becomes a problem of finding the partition that maximizes the likelihood under the assumed model.
11
Mclust function
Mclust finds an appropriate model AND the optimal number of groups.
–Not free?!! Needs a license agreement from the University of Washington.
R Commands:
–library("mclust")
–planet_mclust <- Mclust(planet.dat)
–plot(planet_mclust, planet.dat)
–print(planet_mclust)
The best model is of diagonal clusters of varying volume and shape, with 3 groups.
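A minimal sketch of the Mclust workflow on synthetic data is given below. It assumes the mclust package is installed (the block does nothing otherwise), and the two-group data are made up for illustration, so the model and number of groups it selects say nothing about the planets.

```r
# Runs only if the mclust package is installed
if (requireNamespace("mclust", quietly = TRUE)) {
  library("mclust")
  set.seed(7)

  # Synthetic data: two groups in two dimensions (not the planets data)
  sim <- rbind(
    matrix(rnorm(100, mean = 0), ncol = 2),
    matrix(rnorm(100, mean = 6), ncol = 2)
  )

  fit <- Mclust(sim)          # tries several covariance models and group counts
  print(fit$modelName)        # code of the covariance model chosen by BIC
  print(fit$G)                # number of groups chosen by BIC
  print(table(fit$classification))
}
```

In recent versions of mclust, the plot method selects the display through a `what` argument, for example `plot(fit, what = "BIC")`, rather than taking the data as a second argument.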
12
Homework
Spend 30 minutes attempting exercise 15.1 and send me what you get done.
Stick it to the Man! Then practice your air guitar.
zweihanderdawg@gmail.com