COMBO-17 Galaxy Dataset Colin Holden COSC 4335 April 17, 2012
Contains data on 3,462 objects classified as galaxies in the Chandra Deep Field South, a patch of sky in the Fornax constellation. There are 65 columns of data in this dataset, ranging from luminosities in 10 different bands of the spectrum to size and brightness. However, the website notes that the vast majority of these attributes are redundant and not independent.
Focusing on three main attributes of this dataset:
– Total R (red band) magnitude is a measure of the brightness of the galaxy. Magnitudes are on an inverted logarithmic scale, so a galaxy with R=21 is 100 times brighter than one with R=26.
– ApDRmag is the difference between the total and aperture magnitudes in the R band. This is a rough measure of the size of the galaxy.
– rsMAG is the magnitude of the vector coming from the galaxy; roughly a vector measurement of distance.
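The inverted logarithmic scale can be checked with a few lines of Python. This is a minimal sketch that assumes the standard astronomical convention that a difference of 5 magnitudes corresponds to a factor of 100 in brightness; the function name `brightness_ratio` is illustrative and not part of the dataset or the original analysis.

```python
def brightness_ratio(mag_bright, mag_faint):
    """How many times brighter the object with the smaller magnitude is.

    Assumes the standard convention: 5 magnitudes = a factor of 100,
    i.e. ratio = 10 ** (0.4 * magnitude difference).
    """
    return 10 ** (0.4 * (mag_faint - mag_bright))


# The slide's example: R=21 compared with R=26.
ratio = brightness_ratio(21, 26)
print(ratio)
```

Smaller magnitudes mean brighter objects, which is why the difference is taken as faint minus bright.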
Data sample (table values not preserved). Columns: Nr, Rmag, ApDRmag, rsMAG, e.rsMAG, UbMAG, e.UbMAG, BbMAG, e.BbMAG, VnMAG, e.VbMAG, S280MAG, e.S280MA, W420FE, e.W420FE
At first glance, the data appeared to have some sort of linear relationship. Started with the Pearson correlation coefficient to test for such a relationship. The Pearson correlation coefficient calculated was about [value missing]. The Pearson correlation coefficient is usually interpreted under the assumption that the data are normally distributed, which may not be the case here, but this was just a first step and the data seem to have a slightly linear relationship. The brightness of the galaxy seems to decrease as the size grows.
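A minimal sketch of the Pearson calculation is shown here in plain Python. The function name `pearson_r` and the toy values are assumptions for illustration only; they are not the COMBO-17 data and do not reproduce the coefficient from the analysis. In practice the two inputs would be the brightness and size columns.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Sum of cross-products and the two sums of squares.
    cov = sum(a * b for a, b in zip(dx, dy))
    ss_x = sum(a * a for a in dx)
    ss_y = sum(b * b for b in dy)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


# Toy values (made up, NOT from the dataset).
x = [1.0, 2.0, 3.0, 4.0]
y = [8.0, 6.0, 4.0, 2.0]
print(pearson_r(x, y))
```

A value near +1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear association.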
K-Means Clustering
Attempt to break the data set into smaller data sets. Number of clusters was chosen to be 5. Had to cap the number of iterations spent refining the centroid of each cluster. Initial centroids were chosen to be the first 5 records.
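The setup described above (5 clusters, first 5 records as initial centroids, a cap on iterations) can be sketched in plain Python. This is an illustrative version of the standard Lloyd's algorithm, not the tool used for the original analysis; the function name `kmeans` and the default `max_iter=100` are assumptions.

```python
import math


def kmeans(points, k=5, max_iter=100):
    """K-Means (Lloyd's algorithm) seeded with the first k records.

    points: list of equal-length numeric tuples.
    max_iter: assumed cap on refinement passes (value is illustrative).
    Returns (labels, centroids).
    """
    if len(points) < k:
        raise ValueError("need at least k points")
    centroids = [tuple(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        # by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no assignment changed, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


# Toy 2-D points (made up) with k=2 to keep the example small.
toy = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centroids = kmeans(toy, k=2)
print(labels, centroids)
```

Seeding with the first k records makes the run deterministic, but the result then depends on the order of the records in the file.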
Hierarchical Clustering
Chose to stop at 5 clusters to allow comparison with the K-Means results. Proximity measured using Euclidean distance. Used Ward's method to determine cluster similarity when merging clusters. Computationally expensive.
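A naive sketch of agglomerative clustering with Ward's criterion is given here, assuming the usual formulation in which the cost of merging clusters A and B is |A||B| / (|A| + |B|) times the squared Euclidean distance between their centroids. The function name `ward_clusters` is hypothetical, and this is not the software used in the original analysis. The pairwise search repeated on every merge also illustrates why the method is computationally expensive.

```python
def ward_clusters(points, n_clusters=5):
    """Naive agglomerative clustering using Ward's merge criterion.

    Starts with every point in its own cluster and repeatedly merges the
    pair whose merge increases the within-cluster sum of squares least,
    stopping when n_clusters remain. Returns one label per point.
    """
    # Each cluster is (member indices, centroid).
    clusters = [([i], tuple(p)) for i, p in enumerate(points)]

    def merge_cost(a, b):
        (members_a, cen_a), (members_b, cen_b) = a, b
        na, nb = len(members_a), len(members_b)
        sq_dist = sum((x - y) ** 2 for x, y in zip(cen_a, cen_b))
        return na * nb / (na + nb) * sq_dist

    while len(clusters) > n_clusters:
        # Scan every pair of clusters: this is the expensive part.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (members_a, cen_a), (members_b, cen_b) = clusters[i], clusters[j]
        na, nb = len(members_a), len(members_b)
        # Centroid of the merged cluster is the size-weighted mean.
        merged = tuple(
            (na * x + nb * y) / (na + nb) for x, y in zip(cen_a, cen_b)
        )
        clusters[i] = (members_a + members_b, merged)
        del clusters[j]  # j > i, so index i is unaffected

    labels = [None] * len(points)
    for label, (members, _) in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Toy 2-D points (made up), reduced to 3 clusters.
toy = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
print(ward_clusters(toy, n_clusters=3))
```

Because an isolated point is costly to merge under this criterion, it tends to remain its own cluster until late in the process.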
K-Means with 3 Variables
Wanted to see what results clustering on 3 variables would yield. Same parameters as the previous K-Means run. Chose brightness, size, and distance from Earth as the 3 variables. Difficult to present graphically.
Results table (values not preserved). Columns: Observation, Class, Distance to centroid
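A table with these three columns can be produced from a set of points and centroids as sketched here. The function name `centroid_table`, the `Obs1`, `Obs2`, ... labels, 1-based class numbers, and the toy values are all assumptions for illustration; none of them come from the original results.

```python
import math


def centroid_table(points, centroids):
    """Build (observation, class, distance to centroid) rows.

    Each point is assigned to its nearest centroid by Euclidean distance.
    Classes are numbered from 1 (an assumed convention).
    """
    rows = []
    for i, p in enumerate(points, start=1):
        dists = [math.dist(p, c) for c in centroids]
        nearest = min(range(len(centroids)), key=dists.__getitem__)
        rows.append((f"Obs{i}", nearest + 1, dists[nearest]))
    return rows


# Toy 3-variable points and centroids (made up).
toy_points = [(0, 0, 0), (3, 4, 0)]
toy_centroids = [(0, 0, 0), (3, 4, 1)]
for obs, cls, dist in centroid_table(toy_points, toy_centroids):
    print(obs, cls, round(dist, 3))
```

Observations with an unusually large distance to their centroid are natural candidates to inspect as outliers.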
Conclusions
Got to see how outliers affect the clustering algorithms differently for AHC vs. K-Means. K-Means was more sensitive to outliers. Also got to see how versatile cluster analysis can be, with lots of different options, e.g. the value of K, the number of attributes to compare, etc.
– The many options can also be a downfall of clustering, in that one small change can yield very different results.
Afterthoughts
I would have run another K-Means clustering analysis after removing the outliers from my original data, to see how the clusters and their centroids differ. I would also have experimented with different values of K and compared the results.