
Cluster Analysis Classifying the Exoplanets

Cluster Analysis
–Simple idea, difficult execution
–Used for indexing large amounts of data in databases (a very hot skill to have: 70/hour)
–“The best form of cluster analysis is ordination, because ordination is not a form of cluster analysis.” –Morgan Byron
–No formal definition of a cluster
–Results are descriptive and subjective

R Commands
library("scatterplot3d")
scatterplot3d(log(planets$mass), log(planets$period), log(planets$eccen),
    type = "h", angle = 55, scale.y = 0.7, pch = 16,
    y.ticklabs = seq(0, 10, by = 2), y.margin.add = 0.1)
–log() takes the log of each data point
–angle and scale.y set the viewing angle and the relative scale of the y axis so the plot looks like a box
–pch is the symbol used for the data points
–seq() generates the tick labels passed to y.ticklabs
–y.margin.add adds a bit of space between the y-axis tick labels and the axis label

Interpretation No real insight after our first view of the data, but it looks neat.

R Commands
rge <- apply(planets, 2, max) - apply(planets, 2, min)
–Stores the range of each variable; 2 indicates the column margin of the data matrix
planet.dat <- sweep(planets, 2, rge, FUN = "/")
–Divides each element in the matrix by the range of its column
n <- nrow(planet.dat)
wss <- rep(0, 10)
–Creates a vector of ten 0’s
wss[1] <- (n-1)*sum(apply(planet.dat, 2, var))
–This is the sum of squares of all the points about the overall mean, i.e. the within-group sum of squares if we put all the data in 1 group
for (i in 2:10) wss[i] <- sum(kmeans(planet.dat, centers = i)$withinss)
–For each number of groups from 2 to 10, runs kmeans and stores the total of the within-group sums of squares
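A minimal sketch of what the range scaling and the one-group sum of squares do. It runs on a simulated stand-in for the planets data: the data frame `toy` and its values are assumptions made up for illustration, not the real exoplanet measurements.

```r
# Simulated stand-in for the planets data (hypothetical values, not real measurements)
set.seed(1)
toy <- data.frame(mass   = rexp(20, rate = 0.3),
                  period = rexp(20, rate = 0.002),
                  eccen  = runif(20, 0, 0.9))

# Range of each column (margin 2 = columns), then divide each column by its range
rge <- apply(toy, 2, max) - apply(toy, 2, min)
toy.dat <- sweep(toy, 2, rge, FUN = "/")

# After scaling, every column spans exactly one unit
scaled_range <- apply(toy.dat, 2, max) - apply(toy.dat, 2, min)

# One-group sum of squares, two ways: the shortcut from the slide ...
n <- nrow(toy.dat)
wss1_shortcut <- (n - 1) * sum(apply(toy.dat, 2, var))

# ... and the direct definition: squared deviations from the column means
centered <- sweep(as.matrix(toy.dat), 2, colMeans(toy.dat), FUN = "-")
wss1_direct <- sum(centered^2)

print(scaled_range)
print(c(shortcut = wss1_shortcut, direct = wss1_direct))
```

The two sums of squares agree because var() divides the squared deviations by n-1, so multiplying by n-1 undoes the division.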

The K-Means Method
–This method partitions the data so as to minimize a numerical criterion, often based on a notion of distance.
–The criterion used in this analysis is the sum of squares of the data within each group: for a given number of groups, we look for the partition with the lowest within-group SS.
–Checking every possible partition is impractical, because the number of partitions increases very quickly as the number of groups and data points increases.
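To get a feel for how quickly the number of partitions grows, here is a small sketch that counts the ways to split n points into exactly k non-empty groups (a Stirling number of the second kind). The function name `count_partitions` is made up for this example and is not part of the original analysis.

```r
# Number of ways to split n points into exactly k non-empty groups,
# using the recurrence S(n, k) = k * S(n-1, k) + S(n-1, k-1)
count_partitions <- function(n, k) {
  S <- matrix(0, nrow = n + 1, ncol = k + 1)  # S[i + 1, j + 1] holds S(i, j)
  S[1, 1] <- 1                                # S(0, 0) = 1
  for (i in seq_len(n)) {
    for (j in seq_len(min(i, k))) {
      S[i + 1, j + 1] <- j * S[i, j + 1] + S[i, j]
    }
  }
  S[n + 1, k + 1]
}

count_partitions(10, 3)   # 9330

# Growth with the number of points, for 3 groups
sapply(c(10, 20, 50, 100), count_partitions, k = 3)
```

Even with only 3 groups the count grows roughly like 3^n / 6, which is why kmeans() searches iteratively rather than trying every partition.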

The “Elbow”
In choosing a good number of partitions, looking for the “elbow”, or the sharpest bend in the graph of within-group sum of squares against number of groups, is an easy approach.
–The sharpest bends look to be at 3 and 5 groups.
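The slide shows the graph but not the command that draws it. A sketch of one way to produce such a plot follows, on simulated data with three well-separated groups, so the elbow falls at 3 by construction. The matrix `sim.dat`, the `nstart = 20` setting and the axis labels are assumptions for this example, not part of the original analysis.

```r
# Simulated stand-in: three well-separated groups in three scaled variables
# (hypothetical data, not the planets measurements)
set.seed(42)
sim.dat <- rbind(matrix(rnorm(90, mean = 0, sd = 0.1), ncol = 3),
                 matrix(rnorm(90, mean = 1, sd = 0.1), ncol = 3),
                 matrix(rnorm(90, mean = 2, sd = 0.1), ncol = 3))

n <- nrow(sim.dat)
wss <- rep(0, 10)
wss[1] <- (n - 1) * sum(apply(sim.dat, 2, var))
for (i in 2:10) {
  # nstart = 20 reruns kmeans from several random starts and keeps the best fit
  wss[i] <- sum(kmeans(sim.dat, centers = i, nstart = 20)$withinss)
}

# How much the sum of squares falls each time one more group is added
drop <- -diff(wss)
print(round(drop, 2))

# Draw the elbow plot when running interactively
if (interactive()) {
  plot(1:10, wss, type = "b",
       xlab = "Number of groups", ylab = "Within groups sum of squares")
}
```

The within-group sum of squares keeps falling as groups are added, so the thing to look for is where the drops become small, not where the curve is lowest.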

Number of planets in the groups
planet_kmeans3 <- kmeans(planet.dat, centers = 3)
–We chose to try 3 groups
table(planet_kmeans3$cluster)
ccent <- function(cl) {
    f <- function(i) colMeans(planets[cl == i, ])
    x <- sapply(sort(unique(cl)), f)
    colnames(x) <- sort(unique(cl))
    return(x)
}
–f finds the mean of each variable for cluster i
–sapply applies f to each cluster label in sorted order, giving one column per cluster
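As a cross-check on ccent, the cluster means in the original units can also be recovered from the centers that kmeans() reports, by multiplying each scaled center by the column range. The sketch below uses a simulated stand-in (`toy`, hypothetical values) and a variant of ccent that takes the data as a second argument; that extra argument and `nstart = 20` are assumptions of this example.

```r
# Simulated stand-in for the planets data (hypothetical values)
set.seed(7)
toy <- data.frame(mass   = rexp(60, rate = 0.3),
                  period = rexp(60, rate = 0.002),
                  eccen  = runif(60, 0, 0.9))
rge <- apply(toy, 2, max) - apply(toy, 2, min)
toy.dat <- sweep(toy, 2, rge, FUN = "/")
toy_kmeans3 <- kmeans(toy.dat, centers = 3, nstart = 20)

# Number of points in each group
print(table(toy_kmeans3$cluster))

# Variant of ccent that takes the unscaled data explicitly
ccent <- function(cl, data) {
  f <- function(i) colMeans(data[cl == i, , drop = FALSE])
  x <- sapply(sort(unique(cl)), f)
  colnames(x) <- sort(unique(cl))
  x
}

# Cluster means in original units, computed two ways
means_by_hand <- ccent(toy_kmeans3$cluster, toy)
means_from_centers <- t(sweep(toy_kmeans3$centers, 2, rge, FUN = "*"))

print(means_by_hand)
print(means_from_centers)
```

Both matrices have one row per variable and one column per cluster, and they match because the scaling is a simple division of each column by a constant.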

The results
> ccent(planet_kmeans3$cluster)
Cluster mass period eccen Number

Model-Based Clustering in brief
–The subjective decision or assumption is the number of clusters.
–After that, it becomes a problem of finding the partition that maximizes the likelihood under the assumed model.
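To make "maximizing the likelihood" concrete, here is a simplified one-dimensional sketch. It scores two candidate partitions by a classification log-likelihood in which each group gets its own normal distribution. The data, the function name `partition_loglik` and the two candidate partitions are all assumptions of this example. Mclust itself fits a mixture model, which is related but not identical.

```r
# Simulated one-dimensional data with two groups (hypothetical)
set.seed(3)
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 5))

# Log-likelihood of a partition when each group is modelled as its own normal
# distribution, with mean and standard deviation estimated from that group
partition_loglik <- function(x, groups) {
  sum(sapply(split(x, groups), function(g) {
    sd_mle <- sqrt(mean((g - mean(g))^2))
    sum(dnorm(g, mean = mean(g), sd = sd_mle, log = TRUE))
  }))
}

sensible  <- ifelse(x < 2.5, 1, 2)   # split at the gap between the two groups
arbitrary <- rep(1:2, times = 50)    # alternating labels that ignore the values

ll_sensible  <- partition_loglik(x, sensible)
ll_arbitrary <- partition_loglik(x, arbitrary)
print(c(sensible = ll_sensible, arbitrary = ll_arbitrary))
```

The partition that follows the structure in the data gets the higher log-likelihood, and a model-based method searches for the partition (and model) that scores best.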

Mclust function
Mclust finds an appropriate model AND the optimal number of groups.
–Not Free?!! Need a license agreement from University of Washington.
R Commands:
–library("mclust")
–planet_mclust <- Mclust(planet.dat)
–plot(planet_mclust, planet.dat)
–print(planet_mclust)
The best model is of diagonal clusters of varying volume and shape with 3 groups
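A sketch of how one might read the chosen model and number of groups off the fitted object, assuming the mclust package is installed. It uses simulated data (`sim.dat`, hypothetical values) and skips the fit if the package is missing. In mclust's naming the three-letter code "VVI" stands for diagonal clusters of varying volume and shape.

```r
# Simulated stand-in with two separated groups (hypothetical data)
set.seed(11)
sim.dat <- rbind(matrix(rnorm(120, mean = 0, sd = 0.2), ncol = 3),
                 matrix(rnorm(120, mean = 2, sd = 0.2), ncol = 3))
colnames(sim.dat) <- c("mass", "period", "eccen")

fit <- NULL
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)
  fit <- Mclust(sim.dat)
  print(fit$modelName)              # code of the covariance model that was chosen
  print(fit$G)                      # number of groups that was chosen
  print(table(fit$classification))  # number of points in each group
}
```

On the real data, the same fields of planet_mclust would give the model and the number of groups reported on the slide.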

Homework Spend 30 minutes attempting exercise 15.1 and send me what you get done. Stick it to the Man! Then practice your air guitar