Streaming Problems in Astrophysics


1 Streaming Problems in Astrophysics
Alex Szalay
Institute for Data-Intensive Engineering and Science
The Johns Hopkins University

2 Sloan Digital Sky Survey
“The Cosmic Genome Project”
Started in 1992, finished in 2008; data is public
2.5 terapixels of images => 5 Tpx of sky
10 TB of raw data => 100 TB processed
0.5 TB catalogs => 35 TB in the end
Database and spectrograph built at JHU (SkyServer)
SDSS-3/4 data now served from JHU

3 Statistical Challenges
Data volume and computing power double every year; no polynomial algorithm can survive, only N log N
Minimum-variance estimators scale as N^3, and they also optimize the wrong thing
The problem today is not statistical variance but systematic errors => optimal subspace filtering (PCA)
We need incremental algorithms, where computing is part of the cost function: what is the best estimator in a minute, a day, a week, a year?

4 Randomization and Sampling
Many data sets contain a lot of redundancy, so random subsampling is an obvious choice
Sublinear scaling
Streaming algorithms (linear in the number of items drawn)
How do we sample from highly skewed distributions?
Sample in a linear transform space:
Central limit theorem -> approximately Gaussian
Random projections, FFT
Remap the PDF onto a Gaussian PDF
Compressed sensing
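A minimal sketch of two of the ideas above, random subsampling and a Gaussian random projection; the data, sizes, and seed are illustrative assumptions, not from the talk:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((10_000, 512))       # toy data: N items in d dimensions

    # Random subsampling: work on a small random subset of the rows.
    sample = X[rng.choice(len(X), size=1_000, replace=False)]

    # Random projection d=512 -> k=32: each projected coordinate is a sum of
    # many independent terms, hence approximately Gaussian (central limit
    # theorem), and pairwise distances are roughly preserved for k ~ log N.
    k = 32
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    X_proj = sample @ R                          # shape (1_000, 32)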

5 Streaming PCA
Initialization: take the eigensystem of a small, random subset and truncate it at the p largest eigenvalues
Incremental updates: update the mean and the low-rank A matrix; the SVD of A yields the new eigensystem
Randomized sublinear algorithm! Mishin, Budavari, Ahmad & Szalay (2012)
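A minimal sketch of such a rank-p streaming PCA; the exponential forgetting weight gamma and the exact column update are my own simplifications, not necessarily the scheme of Mishin et al. (2012):

    import numpy as np

    def init_pca(X0, p):
        """Initialize from a small random subset X0 (n0 x d),
        truncating at the p largest eigenvalues."""
        mean = X0.mean(axis=0)
        _, s, Vt = np.linalg.svd(X0 - mean, full_matrices=False)
        A = Vt[:p].T * (s[:p] / np.sqrt(len(X0)))   # d x p low-rank factor
        return mean, A

    def update(mean, A, x, gamma=0.99):
        """Fold one new observation x (length d) into the eigensystem."""
        mean = gamma * mean + (1 - gamma) * x
        r = x - mean
        # Append the weighted residual as a new column, then re-truncate to
        # rank p: the SVD of the small matrix B yields the new eigensystem.
        B = np.hstack([np.sqrt(gamma) * A, np.sqrt(1 - gamma) * r[:, None]])
        U, s, _ = np.linalg.svd(B, full_matrices=False)
        p = A.shape[1]
        return mean, U[:, :p] * s[:p]   # eigenvectors U, eigenvalues s**2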

6 Robust PCA
PCA minimizes the RMS scale σ_RMS of the residuals r = y − Py
The quadratic formula makes r² extremely sensitive to outliers
Instead we optimize a robust M-scale σ² (Maronna 2005), implicitly given by (1/n) Σᵢ ρ(rᵢ²/σ²) = δ for a bounded loss ρ
This fits in with the iterative method!
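A hedged sketch of how such an implicitly defined M-scale can be solved by a fixed-point iteration; the bounded loss rho(t) = min(t, 1) and delta = 0.5 are illustrative choices, not necessarily those of Maronna or the talk:

    import numpy as np

    def m_scale(r, delta=0.5, iters=50):
        """Solve (1/n) sum_i rho(r_i^2 / sigma^2) = delta for sigma."""
        rho = lambda t: np.minimum(t, 1.0)     # bounded loss caps outliers
        sigma2 = np.median(r**2) / delta       # robust starting guess
        for _ in range(iters):
            # If the mean loss exceeds delta, sigma is too small: grow it.
            sigma2 *= np.mean(rho(r**2 / sigma2)) / delta
        return np.sqrt(sigma2)

    r = np.random.default_rng(1).standard_normal(10_000)
    r[:100] += 50.0                            # 1% gross outliers
    print(m_scale(r))                          # stays near 1, unlike the RMS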

7 Eigenvalues in Streaming PCA
[Figure: eigenvalue spectra from streaming PCA, classic vs. robust]

8 Cyberbricks
36-node Amdahl cluster using 1200 W total
Zotac Atom/ION motherboards: 4 GB of memory, N330 dual-core Atom, 16 GPU cores
Aggregate disk space 148 TB (HDD+SSD)
Blazing I/O performance: 18 GB/s
Amdahl number = 1 for under $30K
Using SQL+GPUs for machine learning: 6.4B multidimensional regressions in 5 minutes over 1.2 TB
Ported the Random Forest module from R to SQL/CUDA
Szalay, Bell, Huang, Terzis, White (HotPower '09)

9 Numerical Laboratories
Similarities between turbulence/CFD, N-body, ocean circulation, and materials science
At exascale everything will be a Big Data problem
Memory footprint will be >2 PB; with 5M timesteps => 10,000 exabytes per simulation
Impossible to store, and doing everything in situ limits the scope of science
How can we use streaming ideas to help?

10 Cosmology Simulations
Simulations are becoming instruments in their own right
The Millennium DB is the poster child / success story, built by Gerard Lemson (now at JHU): 600 registered users, 17.3M queries, 287B rows
Dec 2012 workshop at MPA: 3 days, 50 people
Data size and scalability: PB data sizes, a trillion particles of dark matter
Value-added services: localized rendering, global analytics

11 Halo finding algorithms
A timeline of halo finders, from The Halo-Finder Comparison Project [Knebe et al., 2011]:
1974 SO, Press & Schechter
1985 FOF, Davis et al.
1992 DENMAX, Gelb & Bertschinger
1995 Adaptive FOF, van Kampen et al.
1996 IsoDen, Pfitzner & Salmon
1997 BDM, Klypin & Holtzman
1998 HOP, Eisenstein & Hut
1999 hierarchical FOF, Gottloeber et al.
2001 SKID, Stadel
2001 enhanced BDM, Bullock et al.
2001 SUBFIND, Springel
2004 MHF, Gill, Knebe & Gibson
2004 AdaptaHOP, Aubert, Pichon & Colombi
2005 improved DENMAX, Weller et al.
2005 VOBOZ, Neyrinck et al.
2006 PSB, Kim & Park
2006 6DFOF, Diemand et al.
2007 subhalo finder, Shaw et al.
2007 Ntropy-fofsv, Gardner, Connolly & McBride
2009 HSF, Maciejewski et al.
2009 LANL finder, Habib et al.
2009 AHF, Knollmann & Knebe
2010 pHOP, Skory et al.
2010 ASOHF, Planelles & Quilis
2010 pSO, Sutter & Ricker
2010 pFOF, Rasera et al.
2010 ORIGAMI, Falck et al.
2010 HOT, Ascasibar
2010 Rockstar, Behroozi
[Figures: particle distribution before FOF, density threshold, and final clusters]

12 Memory issue
All current halo finders require loading the entire dataset into memory
A single time snapshot of a trillion-particle (~10^12) simulation requires ~12 terabytes of memory (see the check below)
To build a scalable solution we need to develop an algorithm with sublinear memory usage
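A back-of-envelope check of that figure, assuming only three float32 coordinates per particle (the storage layout is my assumption):

    n_particles = 10**12                # one snapshot, a trillion particles
    bytes_per_particle = 3 * 4          # x, y, z stored as float32
    print(n_particles * bytes_per_particle / 1e12, "TB")   # -> 12.0 TB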

13 Streaming Solution: haloes ≈ heavy hitters?
Our goal: reduce the halo-finding problem to one of the existing problems in the streaming setting, then apply ready-to-use algorithms
To make the reduction to heavy hitters we need to discretize the space; the naïve solution is a 3D mesh (see the sketch below):
Each particle is replaced by its cell id
Heavy cells represent mass concentrations
The grid size is chosen according to the typical halo size
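A minimal sketch of this discretization, assuming a periodic box; the box size and grid resolution below are illustrative, not from the talk:

    import numpy as np

    BOX = 500.0      # simulation box side, e.g. in Mpc/h (assumed)
    NGRID = 1024     # grid step chosen to match the typical halo size

    def cell_ids(pos):
        """Map an (N, 3) array of positions in [0, BOX) to flat cell ids,
        turning the particle stream into a stream of item ids whose
        heavy hitters are the dense cells, i.e. halo candidates."""
        ijk = np.floor(pos / BOX * NGRID).astype(np.int64) % NGRID
        return (ijk[:, 0] * NGRID + ijk[:, 1]) * NGRID + ijk[:, 2]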

14 Count Sketch
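This slide is a figure; to make the data structure concrete, here is a minimal Count Sketch (Charikar, Chen & Farach-Colton) in Python. The table sizes and the simple modular hash family are illustrative assumptions, and item ids are assumed to be smaller than the prime P:

    import numpy as np

    P = 2**31 - 1    # a Mersenne prime for the hash family

    class CountSketch:
        def __init__(self, depth=5, width=2**16, seed=0):
            rng = np.random.default_rng(seed)
            self.h = rng.integers(1, P, size=(depth, 2))   # bucket hashes
            self.g = rng.integers(1, P, size=(depth, 2))   # sign hashes
            self.width = width
            self.rows = np.arange(depth)
            self.table = np.zeros((depth, width), dtype=np.int64)

        def _buckets(self, x):
            b = (self.h[:, 0] * x + self.h[:, 1]) % P % self.width
            s = 1 - 2 * ((self.g[:, 0] * x + self.g[:, 1]) % P % 2)
            return b, s

        def add(self, x, count=1):
            b, s = self._buckets(x)
            self.table[self.rows, b] += s * count

        def estimate(self, x):
            # The median over rows suppresses hash collisions.
            b, s = self._buckets(x)
            return int(np.median(s * self.table[self.rows, b]))

Streaming the cell ids from the previous slide through add() while keeping a small heap of the largest estimate() values yields the heavy cells in a single pass.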

15 Memory
Memory is the most significant advantage of applying streaming algorithms
Dataset size: ~10^9 particles
Any in-memory algorithm: 12 GB; Pick-and-Drop: 30 MB
GPU acceleration: one instance of the Pick-and-Drop algorithm can run entirely in a separate GPU thread
The Count Sketch algorithm has two time-consuming procedures, evaluating the hash functions and updating the queue; the first can be naively ported to the GPU
Zaoxing Liu, Nikita Ivkin, Lin F. Yang, Mark Neyrinck, Gerard Lemson, Alexander S. Szalay, Vladimir Braverman, Tamas Budavari, Randal Burns, Xin Wang, IEEE eScience Conference (2015)
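The arithmetic behind these numbers, under the same illustrative assumptions as before (12 bytes per particle; one Count Sketch table of the size sketched earlier, which is even smaller than Pick-and-Drop's 30 MB):

    in_memory_gb = 10**9 * 12 / 1e9     # ~10^9 particles x 12 B -> 12 GB
    sketch_mb = 5 * 2**16 * 8 / 1e6     # 5 x 65536 int64 counters -> ~2.6 MB
    print(f"in-memory: {in_memory_gb:.0f} GB, sketch: {sketch_mb:.1f} MB")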

16 Summary
Large data sets are here
Need new approaches => computable statistics
It is all about systematic errors
Streaming, sampling, robust techniques
Dimensionality reduction (PCA, random projections, importance sampling)
More data from fewer telescopes
Large simulations present additional challenges
Time-domain data is emerging, requiring fast triggers
A new paradigm of analyzing large public data sets

