Principal Components Analysis
Observation and Data Processing Techniques in Astrophysics
Section of Astrophysics, Astronomy & Mechanics
Antonios Karampelas, PhD
Glossary
Astrostatistics: A discipline used to process the vast amount of astronomical data.
Big Data: Large or complex data sets that are difficult to process using traditional data processing applications.
Data Mining: The computational process of discovering patterns in large data sets.
Machine Learning: A scientific discipline that explores the construction and study of algorithms that can learn from data.
Big Data: An alternative for Astrophysicists?
McKinsey Global Institute Report (1): Big data: The next frontier for innovation, competition, and productivity
Harvard Business Review (2): Data Scientist: The Sexiest Job of the 21st Century
Fortune (3): Big Data could generate millions of new jobs
(1) http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
(2) https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
(3) http://fortune.com/2013/05/21/big-data-could-generate-millions-of-new-jobs/
Data Mining/Machine Learning categories
1. Descriptive Data Mining / Unsupervised Machine Learning
Goal: discover patterns, trends, clusters, outliers.
Methods: Principal Components Analysis (PCA), Self-Organizing Map (SOM), K-means Clustering.
2. Predictive Data Mining / Supervised Machine Learning
Goal: parameterize, classify, recognize patterns.
Methods: Decision Trees (DT), Support Vector Machine (SVM), Artificial Neural Network (ANN).
Astrostatistics
[Figure: artwork by Sandbox Studio, Chicago with Kimberly Boustead]
Principal Components Analysis (PCA) (Karhunen-Loève transformation)
A linear orthogonal transformation to a new basis in which the data variance is highlighted.
New axes = Principal Components (PCs)
Data = linear combination of PCs
Very effective in: data compression, dimensionality reduction, noise extraction.
Applications: Astronomy, Biology, Graphology, Face Recognition, Fingerprint Recognition.
Indicative Astronomy Articles
SDSS: Yip et al. 2004, AJ
Gaia: Karampelas et al. 2012, A&A
Spitzer: Wang et al. 2011, MNRAS
2dF: Folkes et al. 1999, MNRAS
PCA procedure
1. Standardize the original data, if necessary.
2. Construct the variance-covariance matrix or the correlation matrix.
3. Determine the eigenvalues (λi) and eigenvectors (PCi) of the matrix. In the new basis the data covariance is eliminated, and the eigenvalues λi are the variances of the transformed data. PC1 corresponds to the largest eigenvalue (λ1) and summarizes the majority of the data variance. PC2 corresponds to λ2 and summarizes the majority of the remaining variance, and so on.
4. Determine the admixture coefficients αi (the projection of the data on the new axes).
Full data reconstruction: Data = α1PC1 + α2PC2 + … + αkPCk
Partial reconstruction is usually sufficient: Data ≈ α1PC1 + α2PC2 + … + α5PC5
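The four steps above can be sketched in Python with NumPy. This is a minimal illustration on made-up toy data, not the code used in the lecture; the function name `pca`, the `standardize` flag and the toy array are assumptions introduced here.

```python
import numpy as np

def pca(data, standardize=False):
    """Hypothetical helper following the four steps of the procedure."""
    x = np.asarray(data, dtype=float)
    mean = x.mean(axis=0)
    # Step 1: centre the data and, optionally, scale each feature to unit variance.
    scale = x.std(axis=0, ddof=1) if standardize else np.ones(x.shape[1])
    z = (x - mean) / scale
    # Step 2: covariance matrix (equals the correlation matrix if standardized).
    cov = np.cov(z, rowvar=False)
    # Step 3: eigenvalues and eigenvectors, sorted by decreasing eigenvalue.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    pcs = eigenvectors[:, order].T          # one PC per row
    # Step 4: admixture coefficients = projection of the data on the PCs.
    coefficients = z @ pcs.T
    return mean, scale, eigenvalues, pcs, coefficients

# Toy data (assumption): 200 objects with 3 strongly correlated features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
toy = np.hstack([latent, 2.0 * latent, -latent]) + 0.1 * rng.normal(size=(200, 3))

mean, scale, eigenvalues, pcs, coefficients = pca(toy)

# Full reconstruction: mean + sum over all PCs of alpha_i * PC_i.
reconstructed = mean + (coefficients @ pcs) * scale
explained = eigenvalues / eigenvalues.sum()
print("fraction of variance per PC:", np.round(explained, 4))
```

In this sketch the mean is subtracted before projecting, so it is added back in the reconstruction; the slide's formula omits that term for brevity.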
PCA procedure
Full reconstruction: Data = α1PC1 + α2PC2 + α3PC3 + … + αk-1PCk-1 + αkPCk
Leading PCs: widespread information. Trailing PCs: noise, no widespread information.
Partial reconstruction: Data ≈ α1PC1 + α2PC2 + α3PC3
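As a rough illustration of why dropping the trailing PCs can act as a noise filter, the sketch below builds hypothetical data from three smooth templates plus Gaussian noise, then compares full and partial reconstruction. All sizes, templates and noise levels are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_objects, n_pixels, n_signal = 300, 40, 3

# Hypothetical noise-free data: each object is a mixture of 3 smooth templates.
grid = np.linspace(0.0, 1.0, n_pixels)
templates = np.vstack([np.sin(2 * np.pi * (k + 1) * grid) for k in range(n_signal)])
clean = rng.normal(size=(n_objects, n_signal)) @ templates
noisy = clean + 0.2 * rng.normal(size=clean.shape)

# PCA of the noisy data via the covariance matrix.
mean = noisy.mean(axis=0)
centered = noisy - mean
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
pcs = eigenvectors[:, order].T              # one PC per row, largest variance first

def reconstruct(n_keep):
    """Rebuild the data from the first n_keep PCs only."""
    coeffs = centered @ pcs[:n_keep].T
    return mean + coeffs @ pcs[:n_keep]

def rms(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

full = reconstruct(n_pixels)       # all PCs: reproduces the noisy input
partial = reconstruct(n_signal)    # leading PCs only: most of the noise is dropped

print("rms(noisy   - clean):", rms(noisy, clean))
print("rms(partial - clean):", rms(partial, clean))
```

The number of PCs to keep is known here only because the toy data were built from three templates; for real spectra it has to be chosen, for example from the eigenvalue spectrum.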
PCA implementation (Data)
Data set: synthetic galaxy spectra (1) used for the Gaia Mission
Size: 7,160 spectra × 801 flux values
Waveband: 300 – 1,100 nm
Redshift: none
Spectral types: 4 (E, S, I, QSFG)
(1) Karampelas et al. 2012, Fioc & Rocca-Volmerange 1997, 1999, Le Borgne & Rocca-Volmerange 2002
PCA implementation (IDL)
; data: 7,160 spectra × 801 pixels (original data)
; result: 7,160 spectra × 801 pixels (admixture coefficients)
result = PCOMP(data, COEFFICIENTS=coefficients, EIGENVALUES=eigenvalues, VARIANCES=variances, /COVARIANCE)
; eigenvectors: 801 PCs × 801 pixels (the principal components)
m = 801
eigenvectors = coefficients / REBIN(eigenvalues, m, m)
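For readers without IDL, the array shapes on this slide can be mirrored in Python with NumPy. This is only a loose analogue, not a port of PCOMP, whose normalization conventions may differ; the random array below is a small stand-in for the 7,160 × 801 spectra, and the variable names are chosen here to echo the IDL keywords.

```python
import numpy as np

rng = np.random.default_rng(2)
n_spectra, n_pixels = 500, 20          # small stand-in for 7160 x 801
# Stand-in data (assumption): independent pixels with decreasing scatter.
data = rng.normal(size=(n_spectra, n_pixels)) * np.linspace(3.0, 0.5, n_pixels)

# Covariance-based PCA via the SVD of the mean-subtracted data.
centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

eigenvalues = s**2 / (n_spectra - 1)        # variances along each PC, descending
variances = eigenvalues / eigenvalues.sum() # fraction of total variance per PC
pcs = vt                                    # n_pixels x n_pixels, one PC per row
coefficients = centered @ pcs.T             # n_spectra x n_pixels admixture coefficients

print("PCs:", pcs.shape, " admixture coefficients:", coefficients.shape)
```

Using the SVD avoids forming the covariance matrix explicitly, which is convenient when the number of pixels is large.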
PCA implementation (PC1, PC2, PC3)
[Figure: spectra of PC1, PC2 and PC3. Marked lines: OIII, SIII, OII, Hα. Annotations: resembles the average spectrum; strong emission lines; dominant emission lines]
PCA implementation (PC4, PC5, PC6)
[Figure: spectra of PC4, PC5 and PC6. Annotations: strong emission lines; dominant emission lines]
PCA implementation (Reconstruction)
Spectrum = α1PC1 + α2PC2 + α3PC3 + α4PC4 + …
[Figure: the same sum drawn with each PC shown as a plotted spectrum]
PCA implementation (Projection to PC1/PC2)
[Figure: scatter plot of the admixture coefficients α1 and α2. Labelled groups: Early-type, Spiral, Irregular, QSFG]
PCA & Astrostatistics - 1
Yip et al. 2004, AJ: ≈ 170,000 SDSS galaxy spectra
[Figure: panels for PC1, PC2, PC3 and PC4. Annotations: outliers, red galaxies, blue galaxies, post-starburst galaxies]
PCA & Astrostatistics - 2
Folkes et al. 1999, MNRAS: ≈ 6,000 2dF galaxy spectra
PCA & Astrostatistics - 3
Bailer-Jones et al. 1998, MNRAS: 5,000 Michigan Spectral Survey stellar spectra
PCA & Astrostatistics - 4
Karampelas et al. 2012, A&A: ≈ 30,000 PEGASE synthetic galaxy spectra
PCA & Astrostatistics - 5
Karampelas et al. in preparation: ≈ 7,000 PEGASE synthetic galaxy spectra
Classification with PCA + Decision Trees
PCA & Face Recognition
Eigenfaces: http://graphics.cs.cmu.edu/courses/15-463/2004_fall/www/handins/brh/final/