Streaming Problems in Astrophysics
Alex Szalay
Institute for Data-Intensive Engineering and Science
The Johns Hopkins University
Sloan Digital Sky Survey
“The Cosmic Genome Project”
- Started in 1992, finished in 2008
- Data is public
- 2.5 Terapixels of images => 5 Tpx of sky
- 10 TB of raw data => 100 TB processed
- 0.5 TB catalogs => 35 TB in the end
- Database and spectrograph built at JHU (SkyServer)
- Now SDSS-3/4 data served from JHU
Statistical Challenges
- Data volume and computing power double every year; no polynomial algorithm can survive, only N log N
- Minimal variance estimators scale as N^3, and they also optimize the wrong thing
- The problem today is not the statistical variance but the systematic errors => optimal subspace filtering (PCA)
- We need incremental algorithms, where computing is part of the cost function: the best estimator in a minute, a day, a week, a year?
Randomization and Sampling
- Many data sets contain a lot of redundancy
- Random subsampling is an obvious choice
  - Sublinear scaling
  - Streaming algorithms (linear in the number of items drawn)
- How do we sample from highly skewed distributions?
- Sample in a linear transform space?
  - Central limit theorem -> approximately Gaussian
  - Random projections, FFT (see the sketch after this slide)
  - Remap the PDF onto a Gaussian PDF
  - Compressed sensing
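A minimal sketch of the subsampling and random-projection ideas above, in numpy; the data, sampling rate, and dimensions are illustrative placeholders, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n points in d dimensions (sizes are arbitrary here)
n, d, k = 100_000, 512, 32
X = rng.normal(size=(n, d))

# Random subsampling: keep a small fraction of the items (sublinear work downstream)
sample = X[rng.random(n) < 0.01]

# Random projection: a Gaussian matrix maps d -> k dimensions while approximately
# preserving pairwise distances (Johnson-Lindenstrauss style)
R = rng.normal(size=(d, k)) / np.sqrt(k)
X_proj = X @ R

print(sample.shape, X_proj.shape)
```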
Streaming PCA
- Initialization
  - Eigensystem of a small, random subset
  - Truncate at the p largest eigenvalues
- Incremental updates
  - Mean and the low-rank A matrix
  - SVD of A yields the new eigensystem
- Randomized sublinear algorithm! (see the sketch after this slide)
Mishin, Budavari, Ahmad and Szalay (2012)
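A rough numpy sketch of the incremental scheme described above: initialize from a small random subset, keep the p largest components, then fold each new block into a low-rank matrix whose SVD gives the updated eigensystem. This is a generic illustration under simplifying assumptions, not the Mishin et al. (2012) implementation.

```python
import numpy as np

def init_pca(X0, p):
    """Initialize from a small random subset: mean + top-p principal directions."""
    mean = X0.mean(axis=0)
    _, S, Vt = np.linalg.svd(X0 - mean, full_matrices=False)
    return mean, S[:p], Vt[:p]

def update_pca(mean, S, Vt, Xb, n_seen, p):
    """Fold a new block Xb into the current low-rank model.
    Approximate: the small rank-one correction from the shifting mean is omitted."""
    nb = Xb.shape[0]
    new_mean = (n_seen * mean + Xb.sum(axis=0)) / (n_seen + nb)
    # Low-rank matrix A: previous components (weighted by S) stacked with the
    # newly centered block; its SVD yields the updated eigensystem.
    A = np.vstack([S[:, None] * Vt, Xb - new_mean])
    _, S_new, Vt_new = np.linalg.svd(A, full_matrices=False)
    return new_mean, S_new[:p], Vt_new[:p], n_seen + nb

# Usage on a synthetic stream, processed in blocks
rng = np.random.default_rng(1)
stream = rng.normal(size=(10_000, 50))
mean, S, Vt = init_pca(stream[:200], p=5)   # small random subset for initialization
n_seen = 200
for start in range(200, len(stream), 500):
    mean, S, Vt, n_seen = update_pca(mean, S, Vt, stream[start:start + 500], n_seen, 5)
```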
Robust PCA
- PCA minimizes the σ_RMS of the residuals r = y − Py
- The quadratic loss (r²) is extremely sensitive to outliers
- We optimize a robust M-scale σ² instead (Maronna 2005)
  - Implicitly given by an M-estimating equation (written out below)
- Fits in with the iterative method!
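The implicit equation appeared as a figure on the original slide; for reference, the standard M-scale definition reads as follows (my notation: ρ is a bounded loss function and δ a tuning constant that sets the breakdown point).

```latex
% Robust M-scale sigma of the residuals r_1, ..., r_n, defined implicitly by
\[
  \frac{1}{n}\sum_{i=1}^{n} \rho\!\left(\frac{\lVert r_i \rVert}{\sigma}\right) = \delta
\]
```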
Eigenvalues in Streaming PCA
[figure: eigenvalue spectra, “Classic” vs. “Robust” panels]
Cyberbricks
- 36-node Amdahl cluster using 1200 W total
- Zotac Atom/ION motherboards
  - 4 GB of memory, N330 dual-core Atom, 16 GPU cores
- Aggregate disk space 148 TB (HDD+SSD)
- Blazing I/O performance: 18 GB/s
- Amdahl number = 1 for under $30K
- Using SQL+GPUs for machine learning:
  - 6.4B multidimensional regressions in 5 minutes over 1.2 TB
  - Ported the Random Forest module from R to SQL/CUDA
Szalay, Bell, Huang, Terzis, White (HotPower-09)
Numerical Laboratories
- Similarities between turbulence/CFD, N-body, ocean circulation and materials science
- At exascale everything will be a Big Data problem
  - Memory footprint will be >2 PB
  - With 5M timesteps => 10,000 exabytes per simulation
  - Impossible to store
- Doing everything in situ limits the scope of the science
- How can we use streaming ideas to help?
Cosmology Simulations
- Simulations are becoming an instrument in their own right
- Millennium DB is the poster child / success story
  - Built by Gerard Lemson (now at JHU)
  - 600 registered users, 17.3M queries, 287B rows
  - http://gavo.mpa-garching.mpg.de/Millennium/
  - Dec 2012 workshop at MPA: 3 days, 50 people
- Data size and scalability
  - PB data sizes, a trillion particles of dark matter
- Value-added services
  - Localized: rendering
  - Global: analytics
Halo Finding Algorithms
- 1974 SO, Press & Schechter
- 1985 FOF, Davis et al.
- 1992 DENMAX, Gelb & Bertschinger
- 1995 Adaptive FOF, van Kampen et al.
- 1996 IsoDen, Pfitzner & Salmon
- 1997 BDM, Klypin & Holtzman
- 1998 HOP, Eisenstein & Hut
- 1999 hierarchical FOF, Gottloeber et al.
- 2001 SKID, Stadel
- 2001 enhanced BDM, Bullock et al.
- 2001 SUBFIND, Springel
- 2004 MHF, Gill, Knebe & Gibson
- 2004 AdaptaHOP, Aubert, Pichon & Colombi
- 2005 improved DENMAX, Weller et al.
- 2005 VOBOZ, Neyrinck et al.
- 2006 PSB, Kim & Park
- 2006 6DFOF, Diemand et al.
- 2007 subhalo finder, Shaw et al.
- 2007 Ntropy-fofsv, Gardner, Connolly & McBride
- 2009 HSF, Maciejewski et al.
- 2009 LANL finder, Habib et al.
- 2009 AHF, Knollmann & Knebe
- 2010 pHOP, Skory et al.
- 2010 ASOHF, Planelles & Quilis
- 2010 pSO, Sutter & Ricker
- 2010 pFOF, Rasera et al.
- 2010 ORIGAMI, Falck et al.
- 2010 HOT, Ascasibar
- 2010 Rockstar, Behroozi
[three figures: before FOF, threshold, and final clusters]
The Halo-Finder Comparison Project [Knebe et al. 2011]
Memory issue
- All current halo finders require loading all the data into memory
- Each snapshot of a simulation with 10^12 particles requires ~12 terabytes of memory (e.g., 3 single-precision coordinates per particle: 3 × 4 bytes × 10^12 = 12 TB)
- To build a scalable solution we need to develop an algorithm with sublinear memory usage
Streaming Solution
- Our goal:
  - Reduce the halo-finding problem to one of the existing problems in the streaming setting
  - Apply ready-to-use algorithms
- Haloes ≈ heavy hitters?
  - To reduce the problem to heavy hitters we need to discretize the space
  - The naïve solution is a 3D mesh (see the sketch after this slide):
    - Each particle is replaced by its cell id
    - Heavy cells represent mass concentrations
    - The grid size is chosen according to the typical halo size
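A minimal illustration of the discretization step; the box size, cell size, and random positions are made up for the example. Each particle position is hashed to a 3D grid cell id, so the particle stream becomes a stream of cell ids whose heavy hitters mark dense regions.

```python
import numpy as np

BOX = 100.0          # simulation box size (illustrative units)
CELL = 0.5           # grid cell size, chosen near the typical halo scale
NGRID = int(BOX / CELL)

def cell_id(pos):
    """Map a particle position (x, y, z) to a single integer cell id."""
    i, j, k = (np.asarray(pos) / CELL).astype(int) % NGRID
    return int((i * NGRID + j) * NGRID + k)

# The particle stream is thereby reduced to a stream of cell ids:
rng = np.random.default_rng(2)
ids = [cell_id(p) for p in rng.uniform(0, BOX, size=(1000, 3))]
```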
Count Sketch
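The slide only names the sketch; below is a compact, textbook-style Count Sketch (in the spirit of Charikar et al.) for estimating cell counts, not the paper's GPU implementation. The depth and width are illustrative, and the heavy-hitter queue kept alongside the sketch is omitted.

```python
import random
from statistics import median

class CountSketch:
    """Basic Count Sketch: frequency estimates for integer items in sublinear memory."""

    P = 2_147_483_647  # Mersenne prime for the 2-universal hash families

    def __init__(self, depth=5, width=2048, seed=0):
        rnd = random.Random(seed)
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]
        # One (a, b) pair per row for the bucket hash and one for the +/-1 sign hash
        self.bucket = [(rnd.randrange(1, self.P), rnd.randrange(self.P)) for _ in range(depth)]
        self.sign = [(rnd.randrange(1, self.P), rnd.randrange(self.P)) for _ in range(depth)]

    def _h(self, a, b, x):
        return (a * x + b) % self.P

    def update(self, x, count=1):
        for row, table in enumerate(self.tables):
            a, b = self.bucket[row]
            c, d = self.sign[row]
            s = 1 if self._h(c, d, x) % 2 == 0 else -1
            table[self._h(a, b, x) % self.width] += s * count

    def estimate(self, x):
        ests = []
        for row, table in enumerate(self.tables):
            a, b = self.bucket[row]
            c, d = self.sign[row]
            s = 1 if self._h(c, d, x) % 2 == 0 else -1
            ests.append(s * table[self._h(a, b, x) % self.width])
        return median(ests)
```

Feeding the cell-id stream from the previous sketch into `update` and thresholding `estimate` then yields heavy-cell (halo-candidate) ids.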
Memory
- Memory is the most significant advantage of applying streaming algorithms
  - Dataset size: ~10^9 particles
  - Any in-memory algorithm: 12 GB
  - Pick-and-Drop: 30 MB
GPU acceleration
- One instance of the Pick-and-Drop algorithm can be fully implemented by a separate GPU thread
- The Count Sketch algorithm has two time-consuming procedures: evaluating the hash functions and updating the queue
  - The first one can be naively ported to the GPU
Zaoxing Liu, Nikita Ivkin, Lin F. Yang, Mark Neyrinck, Gerard Lemson, Alexander S. Szalay, Vladimir Braverman, Tamas Budavari, Randal Burns, Xin Wang, IEEE eScience Conference (2015)
Summary
- Large data sets are here
- Need new approaches => computable statistics
- It is all about the systematic errors
  - Streaming, sampling, robust techniques
  - Dimensional reduction (PCA, random projections, importance sampling)
- More data from fewer telescopes
- Large simulations present additional challenges
- Time-domain data emerging, requiring fast triggers
- New paradigm of analyzing large public data sets