COMBO-17 Galaxy Dataset Colin Holden COSC 4335 April 17, 2012.



Contains data on 3,462 objects classified as galaxies in the Chandra Deep Field South, a patch of sky in the Fornax constellation. There are 65 columns of data in this dataset, ranging from luminosities in 10 different bands of the spectrum to size and brightness. However, the website notes that a vast majority of these attributes are redundant rather than independent. This analysis focuses on three main attributes of the dataset:
– Total R (red band) magnitude, a measure of the brightness of the galaxy. Magnitudes are an inverted logarithmic scale, so a galaxy with R=21 is 100 times brighter than one with R=26.
– ApDRmag, the difference between the total and aperture magnitude in the R band; a rough measure of the size of the galaxy.
– rsMAG, a magnitude used here as a rough measure of the galaxy's distance.
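The inverted logarithmic magnitude scale can be made concrete with a quick calculation: a difference of 5 magnitudes corresponds to a factor of 100 in brightness, since the flux ratio is 10^(0.4·Δm). A minimal sketch:

```python
# Brightness ratio implied by an astronomical magnitude difference.
# Magnitudes are inverted and logarithmic: smaller magnitude = brighter.
def brightness_ratio(m_faint, m_bright):
    """Return how many times brighter the m_bright object is."""
    return 10 ** (0.4 * (m_faint - m_bright))

# A galaxy with R=21 vs. one with R=26: 5 magnitudes = a factor of ~100.
print(brightness_ratio(26, 21))
```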

[Sample data table; numeric values lost in transcription. Columns: Nr, Rmag, ApDRmag, rsMAG, e.rsMAG, UbMAG, e.UbMAG, BbMAG, e.BbMAG, VnMAG, e.VbMAG, S280MAG, e.S280MAG, W420FE, e.W420FE]

At first glance, the data appeared to have some sort of linear relationship, so I started with the Pearson correlation coefficient to test for one. The calculated Pearson correlation coefficient was about [value lost in transcription]. The Pearson correlation coefficient assumes the data is normally distributed, which may not be the case here, but this was just a first step, and the data did seem to have a slightly linear relationship: the brightness of the galaxy seems to decrease as the size grows.
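As a sketch of this first step, the Pearson coefficient can be computed directly from its definition. The arrays below are hypothetical stand-ins for the Rmag (brightness) and ApDRmag (size) columns, not values from the actual dataset:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Hypothetical stand-ins for the Rmag and ApDRmag columns.
# Rmag rises (galaxy gets fainter) as the size measure shrinks,
# giving a strongly negative coefficient.
rmag = [21.3, 22.1, 23.0, 24.2, 25.1, 25.9]
apdrmag = [0.9, 0.7, 0.5, 0.4, 0.2, 0.1]
print(round(pearson_r(rmag, apdrmag), 3))
```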

K Means Clustering Attempt to break the data set into smaller subsets. The number of clusters was chosen to be 5. Had to limit the number of iterations to decide when to stop trying to improve the centroid of each cluster. Initial centroids were chosen to be the first 5 records.
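The procedure described above (K=5, first 5 records as initial centroids, a cap on iterations) can be sketched as a plain-NumPy implementation of Lloyd's algorithm; the data array here is a random stand-in for the galaxy attributes, not the real dataset:

```python
import numpy as np

def kmeans(X, k=5, max_iter=100):
    """Lloyd's algorithm with the slide's choices: the first k records
    serve as the initial centroids, and iteration is capped at max_iter."""
    centroids = X[:k].copy()
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # converged before hitting the cap
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # random stand-in for the galaxy attributes
labels, centroids = kmeans(X, k=5)
print(labels.shape, centroids.shape)
```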

Hierarchical Clustering Chose to stop at 5 clusters for comparison with the K-Means results. Proximity measured using Euclidean distance. Used Ward's method to determine cluster similarity when merging clusters. Computationally expensive.
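Assuming SciPy is available, the same setup (Euclidean distance, Ward's method, tree cut at 5 clusters) might look like the sketch below; X is again a random stand-in for the galaxy attributes:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))  # random stand-in for the galaxy attributes

# Agglomerative clustering with Euclidean distance and Ward's method,
# then cut the resulting tree into (at most) 5 flat clusters to match
# the K-Means run.
Z = linkage(X, method="ward", metric="euclidean")
labels = fcluster(Z, t=5, criterion="maxclust")
print(sorted(set(labels)))
```

The `linkage` step builds the full merge hierarchy from all pairwise distances, which is what makes agglomerative clustering computationally expensive as the number of observations grows.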

K Means with 3 Variables Wanted to see what kind of results would come from choosing 3 variables to cluster against. Same parameters as the previous K-Means runs. Chose brightness, size, and distance from Earth as the 3 variables. Difficult to present graphically.

[Table of observations with their assigned cluster (class) and distance to centroid; values lost in transcription. Columns: Observation, Class, Distance to centroid]

Conclusions Got to see how outliers affect the clustering algorithms, AHC vs. K-Means; K-Means was more sensitive to outliers. Also got to see how versatile cluster analysis can be, with many different options (value of K, number of attributes to compare, etc.).
– The abundance of options can also be a downfall of clustering, in that one small change can yield very different results.

Afterthoughts I would have run another K-Means clustering analysis after removing the outliers from my original data, to see how the clusters and their centroids differ. I would also have experimented with different values of K and compared the results.