Streaming Problems in Astrophysics


Streaming Problems in Astrophysics
Alex Szalay, Institute for Data-Intensive Engineering and Science, The Johns Hopkins University

Sloan Digital Sky Survey: "The Cosmic Genome Project"
- Started in 1992, finished in 2008; data is public
- 2.5 terapixels of images => 5 Tpx of sky
- 10 TB of raw data => 100 TB processed
- 0.5 TB catalogs => 35 TB in the end
- Database and spectrograph built at JHU (SkyServer)
- Now SDSS-3/4 data served from JHU

The Broad Impact of SDSS
- Changed the way we do astronomy; a remarkably fast transition seen in the community
- Sped up the first phase of exploration: wide-area statistical queries are easy, and multi-wavelength astronomy is now the norm
- SDSS earned the TRUST of the community
- Enormous number of projects, way beyond the original vision and expectations
- Many other surveys now follow; SDSS established expectations for data delivery
- Serves as a model for other communities of science

How Do We Prioritize?
- Data explosion: science is becoming data-driven
- It is becoming "too easy" to collect ever more data: robotic telescopes, next-generation sequencers, complex simulations
- How long can this go on? "Do I have enough data, or would I like to have more?" No scientist ever wanted less data…
- But: Big Data is synonymous with dirty data
- How can we decide how to collect data that is more relevant?

Statistical Challenges
- Data volume and computing power double every year; no polynomial algorithm can survive, only N log N
- Minimal-variance estimators scale as N^3, and they also optimize the wrong thing
- The problem today is not the statistical variance but the systematic errors => optimal subspace filtering (PCA)
- We need incremental algorithms, where computing is part of the cost function: what is the best estimator in a minute, a day, a week, a year?

Randomization and Sampling
- Many data sets contain lots of redundancy; random subsampling is an obvious choice
- Sublinear scaling: streaming algorithms (linear in the number of items drawn)
- How do we sample from highly skewed distributions?
- Sample in a linear transform space? By the central limit theorem the transformed coefficients are approximately Gaussian (random projections, FFT); see the sketch below
- Remap the PDF onto a Gaussian PDF
- Compressed sensing
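As a hedged illustration (not from the original slides): a minimal sketch of a sign random projection, which maps high-dimensional rows to a much lower dimension while approximately preserving pairwise distances; each projected coordinate is a sum of many independent terms, hence near-Gaussian by the central limit theorem. All names and sizes here are illustrative.

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project the rows of X (n x d) down to k dimensions.

    Entries of R are i.i.d. +-1/sqrt(k), so E[||x R||^2] = ||x||^2 and
    pairwise distances are preserved in expectation
    (a Johnson-Lindenstrauss-style guarantee).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    R = rng.choice([-1.0, 1.0], size=(d, k)) / np.sqrt(k)
    return X @ R

# Toy usage: compress 10,000 spectra of 4,000 bins each to 200 numbers apiece.
X = np.random.rand(10000, 4000)
Y = random_projection(X, k=200)
```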

Streaming PCA
- Initialization: eigensystem of a small, random subset; truncate at the p largest eigenvalues
- Incremental updates: update the mean and the low-rank A matrix; an SVD of A yields the new eigensystem (a minimal sketch follows below)
- Randomized sublinear algorithm!
Mishin, Budavari, Ahmad and Szalay (2012)
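A minimal sketch of the incremental idea, under the assumption of a generic rank-p SVD-update scheme; this is in the same spirit as, but not identical to, the Mishin et al. algorithm, and it ignores the mean-correction cross term for clarity.

```python
import numpy as np

class StreamingPCA:
    """Keep a rank-p model (mean, singular values S, basis V) and fold in
    one mini-batch at a time via an SVD of a small stacked matrix.
    Illustrative sketch only, not an exact incremental PCA.
    """
    def __init__(self, p):
        self.p = p
        self.mean = None
        self.S = None   # p singular values
        self.V = None   # p x d basis (rows are components)
        self.n = 0

    def partial_fit(self, X):
        m, d = X.shape
        # running mean update
        if self.mean is None:
            self.mean = X.mean(axis=0)
        else:
            self.mean = (self.n * self.mean + X.sum(axis=0)) / (self.n + m)
        self.n += m
        Xc = X - self.mean
        # stack the old rank-p summary diag(S) @ V with the new centered rows
        A = Xc if self.S is None else np.vstack([self.S[:, None] * self.V, Xc])
        # SVD of the small (p+m) x d matrix, then truncate back to rank p
        _, S, Vt = np.linalg.svd(A, full_matrices=False)
        self.S, self.V = S[:self.p], Vt[:self.p]

# Usage: model = StreamingPCA(p=5); then model.partial_fit(batch) per batch.
```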

Robust PCA
- PCA minimizes the σ_RMS of the residuals r = y − Py
- The quadratic form makes r² extremely sensitive to outliers
- Instead we optimize a robust M-scale σ² (Maronna 2005), implicitly given by (1/n) Σ_i ρ(r_i/σ) = δ for a bounded ρ
- Fits in with the iterative method! (see the fixed-point sketch below)
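A minimal sketch of solving the implicit M-scale equation by fixed-point iteration, assuming a Tukey bisquare ρ and δ = 0.5; these are common textbook choices, not necessarily the ones used in Maronna (2005) or in the talk.

```python
import numpy as np

def bisquare_rho(u, c=1.56):
    """Tukey bisquare rho, bounded at 1 (a common choice for M-scales)."""
    u = np.clip(np.abs(u) / c, 0.0, 1.0)
    return 1.0 - (1.0 - u**2) ** 3

def m_scale(r, delta=0.5, iters=50):
    """Solve (1/n) * sum_i rho(r_i / sigma) = delta for sigma.

    Uses the standard fixed-point iteration
    sigma_new^2 = sigma^2 * mean(rho(r / sigma)) / delta,
    which converges for a bounded, nondecreasing rho.
    """
    sigma = np.median(np.abs(r)) / 0.6745 + 1e-12  # robust starting point
    for _ in range(iters):
        sigma = sigma * np.sqrt(np.mean(bisquare_rho(r / sigma)) / delta)
    return sigma
```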

Eigenvalues in Streaming PCA
(Figure: eigenvalue spectra from streaming PCA, classic vs. robust.)

Principal Component Pursuit
- Low-rank approximation of the data matrix X
- Standard PCA works well if the noise distribution is Gaussian, but outliers can cause bias
- Principal component pursuit: with "sparse", spiky noise/outliers, try to minimize the number of outliers while keeping the rank low; an NP-hard problem
- The L1 trick turns it into a numerically feasible convex problem, solved with an Augmented Lagrange Multiplier method (the relaxed program is shown below)
E. Candès et al., "Robust Principal Component Analysis?", preprint, 2009; Abdelkefi et al., ACM CoNEXT Workshop (traffic anomaly detection)
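For reference, the convex relaxation from Candès et al. replaces rank with the nuclear norm and the outlier count with an L1 norm; for an n1 x n2 data matrix X they recommend the λ below:

```latex
\min_{L,S} \; \|L\|_{*} + \lambda \|S\|_{1}
\quad \text{subject to} \quad L + S = X,
\qquad \lambda = \frac{1}{\sqrt{\max(n_1, n_2)}}
```

Here ||L||_* is the nuclear norm (the sum of singular values) of the low-rank part L, and S collects the sparse outliers.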

Testing on Galaxy Spectra
- Slowly varying continuum + absorption lines; highly variable, "sparse" emission lines
- This is the simple version of PCP: the positions of the lines are known, but there are many of them, so automatic detection can be useful; spiky noise can bias standard PCA
- Data: streaming robust PCA implementation for a galaxy spectrum catalog (L. Dobos et al.); SDSS 1M galaxy spectra; morphological subclasses; robust averages + first few PCA directions

PCA
(Figure: PCA reconstruction of a galaxy spectrum and the residual.)

Principal Component Pursuit
(Figure: low-rank, sparse, and residual components; λ = 0.6/√n, ε = 0.03.)

Numerical Laboratories
- Similarities between turbulence/CFD, N-body, ocean circulation, and materials science
- At exascale, everything will be a Big Data problem
- Memory footprint will be >2 PB; with 5M timesteps => 10,000 exabytes/simulation: impossible to store
- Doing it all in-situ limits the scope of science
- How can we use streaming ideas to help?

LHC Streams
- The LHC has a single, very expensive data source; multiple experiments tap into the beamlines
- Each experiment uses in-situ hardware triggers to filter data: only 1 in 10M events is stored
- Not that the rest is garbage, it is just sparsely sampled
- The resulting "small subset" is analyzed many times off-line; this is still 10-100 PB
- Keeps a whole community busy for a decade or more

Exascale Simulation Analogy
- An exascale computer running a community simulation
- Many groups plug in their own in-situ "triggers", the equivalents of beamlines
- Keep very small subsets of the data, plus random samples from the field
- Immersive sensors following world lines or light cones
- Burst buffer of timesteps: save the precursors of events (see the sketch below)
- Sparse output streams are analyzed offline by the broader community
- Cover more parameter space and extract more realizations (UQ) using the saved resources
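A hedged sketch of the burst-buffer idea; this is entirely illustrative (the trigger condition, buffer length, and names are hypothetical, not from the talk): keep a ring buffer of recent timesteps, and when a trigger fires, emit the buffered precursor together with the triggering step while dropping everything else.

```python
import random
from collections import deque

def burst_buffer_stream(timesteps, trigger, precursor_len=16):
    """Yield (precursor, event) pairs from a stream of timesteps.

    A ring buffer holds the last `precursor_len` timesteps; when
    `trigger(step)` fires, the buffered history (the precursor of the
    event) is emitted with the event itself, and the rest is dropped.
    """
    buf = deque(maxlen=precursor_len)
    for step in timesteps:
        if trigger(step):
            yield list(buf), step
        buf.append(step)

# Toy usage with a synthetic scalar diagnostic per timestep
# (a real trigger would inspect the simulation field):
steps = [random.gauss(0, 1) for _ in range(10000)]
for precursor, event in burst_buffer_stream(steps, lambda s: s > 3.5):
    print(f"event {event:.2f} with {len(precursor)} precursor steps")
```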

Applications of ML to Turbulence
- Rényi divergence as a similarity measure between regions (definition below)
- Vorticity clustering, classification, anomaly detection
- With J. Schneider and B. Poczos (CMU)
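The formula itself did not survive the transcript; for reference, the standard definition of the Rényi-α divergence between densities p and q is:

```latex
D_{\alpha}(p \,\|\, q) = \frac{1}{\alpha - 1}
\log \int p(x)^{\alpha} \, q(x)^{1-\alpha} \, dx ,
\qquad \alpha > 0, \; \alpha \neq 1,
```

which recovers the Kullback-Leibler divergence in the limit α → 1.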

Summary
- Large data sets are here; we need new approaches => computable statistics
- It is all about systematic errors
- Streaming, sampling, and robust techniques; dimensional reduction (PCA, random projections, importance sampling)
- More data from fewer telescopes
- Large simulations present additional challenges (see the Braverman talk)
- Time-domain data is emerging, requiring fast triggers
- A new paradigm of analyzing large public data sets