Streaming Problems in Astrophysics

Streaming Problems in Astrophysics
Alex Szalay, Institute for Data-Intensive Engineering and Science, The Johns Hopkins University

Sloan Digital Sky Survey: "The Cosmic Genome Project"
Started in 1992, finished in 2008; data is public
2.5 terapixels of images => 5 Tpx of sky
10 TB of raw data => 100 TB processed
0.5 TB catalogs => 35 TB in the end
Database and spectrograph built at JHU (SkyServer)
SDSS-3/4 data now served from JHU

Statistical Challenges
Data volume and computing power double every year; no polynomial algorithm can survive, only N log N
Minimal-variance estimators scale as N³, and they also optimize for the wrong thing
The problem today is not the statistical variance but systematic errors => optimal subspace filtering (PCA)
We need incremental algorithms, where computing is part of the cost function: what is the best estimator in a minute, a day, a week, a year?

Randomization and Sampling
Many data sets contain a lot of redundancy, so random subsampling is an obvious choice
Sublinear scaling; streaming algorithms (linear in the number of items drawn)
How do we sample from highly skewed distributions? Sample in a linear transform space:
Random projections, FFT: by the central limit theorem the transformed data is approximately Gaussian
Remap the PDF onto a Gaussian PDF
Compressed sensing
A sketch of the random-projection idea follows below.
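To make the random-projection idea concrete, here is a minimal NumPy sketch. The data, dimensions, and projection width are illustrative choices, not values from the talk: a dense Gaussian projection approximately preserves pairwise distances, which is what makes sampling in the transform space safe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n high-dimensional points (think galaxy spectra).
n, d, k = 1000, 4000, 64
X = rng.standard_normal((n, d))

# Johnson-Lindenstrauss style projection: a dense Gaussian matrix,
# scaled so pairwise distances are preserved in expectation.
R = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ R

# Check the distortion on a random pair of points.
i, j = rng.choice(n, size=2, replace=False)
orig = np.linalg.norm(X[i] - X[j])
proj = np.linalg.norm(Y[i] - Y[j])
print(f"original distance {orig:.2f}, projected {proj:.2f}")
```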

Streaming PCA
Initialization: eigensystem of a small, random subset, truncated at the p largest eigenvalues
Incremental updates: update the mean and the low-rank A matrix; an SVD of A yields the new eigensystem
A randomized, sublinear algorithm! Mishin, Budavari, Ahmad and Szalay (2012)
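A simplified sketch of the scheme outlined above, assuming rows are observations. This is not the published algorithm of Mishin et al. (2012), which includes refinements (e.g., weighting of old versus new data) omitted here; it only illustrates the init-then-update structure.

```python
import numpy as np

def init_pca(X0, p):
    """Initialize from a small random subset X0: eigensystem of the
    subset, truncated at the p largest eigenvalues."""
    mu = X0.mean(axis=0)
    _, s, Vt = np.linalg.svd(X0 - mu, full_matrices=False)
    A = Vt[:p].T * s[:p]            # d x p low-rank factor, columns scaled
    return mu, A, X0.shape[0]

def update_pca(mu, A, n, x, p):
    """Fold one new observation into the running mean and the low-rank
    A matrix; an SVD of the augmented A yields the new eigensystem."""
    n += 1
    mu = mu + (x - mu) / n                    # incremental mean
    r = x - mu                                # residual of the new point
    U, s, _ = np.linalg.svd(np.column_stack([A, r]), full_matrices=False)
    return mu, U[:, :p] * s[:p], n            # re-truncate at p components

# Illustrative usage on synthetic low-rank-ish data:
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 20))
mu, A, n = init_pca(X[:50], p=5)
for x in X[50:]:
    mu, A, n = update_pca(mu, A, n, x, p=5)
eigvecs, sing_vals, _ = np.linalg.svd(A, full_matrices=False)
```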

Robust PCA
PCA minimizes the σ_RMS of the residuals r = y − Py
The quadratic formula makes r² extremely sensitive to outliers
Instead we optimize a robust M-scale σ (Maronna 2005), given implicitly by (1/n) Σᵢ ρ(rᵢ/σ) = δ for a bounded ρ
This fits in with the iterative method! A sketch follows below.
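A minimal sketch of computing the M-scale by fixed-point iteration. The bounded ρ (Tukey bisquare), tuning constant c, and target δ are illustrative choices, not necessarily the ones used in the talk:

```python
import numpy as np

def rho_bisquare(t, c=1.56):
    """Bounded Tukey bisquare loss, normalized so rho -> 1 for large |t|."""
    u = np.clip(np.abs(t) / c, 0.0, 1.0)
    return 1.0 - (1.0 - u**2) ** 3

def m_scale(r, delta=0.5, tol=1e-8, max_iter=100):
    """Fixed-point iteration for the robust M-scale sigma solving
    mean(rho(r_i / sigma)) = delta."""
    sigma = np.median(np.abs(r)) / 0.6745 + 1e-12   # robust start (MAD)
    for _ in range(max_iter):
        s_new = sigma * np.sqrt(np.mean(rho_bisquare(r / sigma)) / delta)
        if abs(s_new - sigma) < tol * sigma:
            return s_new
        sigma = s_new
    return sigma

# 5% gross outliers barely move the M-scale, unlike the raw RMS:
rng = np.random.default_rng(2)
r = np.concatenate([rng.normal(0, 1, 950), rng.normal(0, 50, 50)])
print(m_scale(r), r.std())
```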

Eigenvalues in Streaming PCA (figure: classic vs. robust eigenvalue spectra)

Cyberbricks
36-node Amdahl cluster using 1200 W total
Zotac Atom/ION motherboards: 4 GB of memory, N330 dual-core Atom, 16 GPU cores
Aggregate disk space 148 TB (HDD+SSD)
Blazing I/O performance: 18 GB/s
Amdahl number = 1 for under $30K
Using SQL+GPUs for machine learning: 6.4B multidimensional regressions in 5 minutes over 1.2 TB
Ported the Random Forest module from R to SQL/CUDA
Szalay, Bell, Huang, Terzis, White (HotPower-09)

Numerical Laboratories
Similarities between turbulence/CFD, N-body, ocean circulation, and materials science
At exascale everything will be a Big Data problem
Memory footprint will be >2 PB; with 5M timesteps => 10,000 exabytes per simulation, impossible to store
Doing everything in situ limits the scope of the science
How can we use streaming ideas to help?

Cosmology Simulations
Simulations are becoming an instrument in their own right
The Millennium DB is the poster child/success story, built by Gerard Lemson (now at JHU): 600 registered users, 17.3M queries, 287B rows, http://gavo.mpa-garching.mpg.de/Millennium/
Dec 2012 workshop at MPA: 3 days, 50 people
Data size and scalability: PB data sizes, a trillion particles of dark matter
Value-added services: localized rendering, global analytics

Halo-Finding Algorithms
1974 SO Press & Schechter
1985 FOF Davis et al.
1992 DENMAX Gelb & Bertschinger
1995 Adaptive FOF van Kampen et al.
1996 IsoDen Pfitzner & Salmon
1997 BDM Klypin & Holtzman
1998 HOP Eisenstein & Hut
1999 hierarchical FOF Gottlöber et al.
2001 SKID Stadel
2001 enhanced BDM Bullock et al.
2001 SUBFIND Springel
2004 MHF Gill, Knebe & Gibson
2004 AdaptaHOP Aubert, Pichon & Colombi
2005 improved DENMAX Weller et al.
2005 VOBOZ Neyrinck et al.
2006 PSB Kim & Park
2006 6DFOF Diemand et al.
2007 subhalo finder Shaw et al.
2007 Ntropy-fofsv Gardner, Connolly & McBride
2009 HSF Maciejewski et al.
2009 LANL finder Habib et al.
2009 AHF Knollmann & Knebe
2010 pHOP Skory et al.
2010 ASOHF Planelles & Quilis
2010 pSO Sutter & Ricker
2010 pFOF Rasera et al.
2010 ORIGAMI Falck et al.
2010 HOT Ascasibar
2010 Rockstar Behroozi
(figure: three panels, showing the particles before FOF, the density threshold, and the final clusters)
The Halo-Finder Comparison Project [Knebe et al., 2011]

Memory Issue
All current halo finders require loading all the data into memory
Each time snapshot of a simulation with 10^12 particles requires 12 terabytes of memory (at 12 bytes per particle, e.g., three single-precision coordinates)
To build a scalable solution we need to develop an algorithm with sublinear memory usage

Streaming Solution: Haloes ≈ Heavy Hitters?
Our goal: reduce the halo-finding problem to one of the existing problems in the streaming setting, then apply ready-to-use algorithms
Haloes ≈ heavy hitters? To make the reduction to heavy hitters we need to discretize the space
The naïve solution is a 3D mesh: each particle is replaced by its cell id, heavy cells represent mass concentrations, and the grid size is chosen according to the typical halo size (see the sketch below)
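A sketch of the discretization step described above; the box size, grid resolution, and uniform test data are illustrative, not from the talk:

```python
import numpy as np

def cell_ids(pos, box_size, n_grid):
    """Map particle positions to flat 3D-mesh cell ids, turning the
    particle stream into an item stream for heavy-hitter algorithms."""
    ijk = np.floor(pos / box_size * n_grid).astype(np.int64) % n_grid
    return (ijk[:, 0] * n_grid + ijk[:, 1]) * n_grid + ijk[:, 2]

# Illustrative stream: 10^6 uniform particles in a 500 Mpc/h box,
# with the grid chosen so a cell is roughly the size of a typical halo.
rng = np.random.default_rng(3)
pos = rng.uniform(0.0, 500.0, size=(10**6, 3))
ids = cell_ids(pos, box_size=500.0, n_grid=256)
```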

Count Sketch
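The slide names the standard Count Sketch (Charikar, Chen and Farach-Colton). Below is a minimal, illustrative Python version, fed with the `ids` stream from the previous sketch; the hash construction, table sizes, and parameters are my own simplifications, not the implementation from the paper:

```python
import numpy as np

P = 2_147_483_647  # Mersenne prime for simple universal-style hashing

class CountSketch:
    """Minimal Count Sketch: d rows of w counters. Each item is hashed
    to one counter per row with a random +/-1 sign; its frequency is
    recovered as the median of the d signed counters."""
    def __init__(self, d=5, w=2**16, seed=0):
        rng = np.random.default_rng(seed)
        self.d, self.w = d, w
        self.C = np.zeros((d, w), dtype=np.int64)
        self.a, self.b = rng.integers(1, P, size=(2, d))   # bucket hash
        self.c, self.e = rng.integers(1, P, size=(2, d))   # sign hash

    def _hashes(self, x):
        # x must be a modest-sized integer (e.g., a mesh cell id) so the
        # products below stay within int64 range.
        bucket = (self.a * x + self.b) % P % self.w
        sign = 1 - 2 * ((self.c * x + self.e) % P % 2)
        return bucket, sign

    def update(self, x, count=1):
        bucket, sign = self._hashes(x)
        self.C[np.arange(self.d), bucket] += sign * count

    def estimate(self, x):
        bucket, sign = self._hashes(x)
        return int(np.median(sign * self.C[np.arange(self.d), bucket]))

# Feed the discretized particle stream; heavy cells ~ candidate haloes.
cs = CountSketch()
for cid in ids[:100_000]:
    cs.update(int(cid))
print(cs.estimate(int(ids[0])))
```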


Memory
Memory is the most significant advantage of applying streaming algorithms
Dataset size: ~10^9 particles
Any in-memory algorithm: 12 GB; Pick-and-Drop: 30 MB
GPU acceleration: one instance of the Pick-and-Drop algorithm can be fully implemented in a separate GPU thread
The Count Sketch algorithm has two time-consuming procedures, evaluating the hash functions and updating the queue; the first can be naively ported to the GPU
Zaoxing Liu, Nikita Ivkin, Lin F. Yang, Mark Neyrinck, Gerard Lemson, Alexander S. Szalay, Vladimir Braverman, Tamas Budavari, Randal Burns, Xin Wang, IEEE eScience Conference (2015)

Summary
Large data sets are here
We need new approaches => computable statistics
It is all about systematic errors
Streaming, sampling, robust techniques
Dimensional reduction (PCA, random projections, importance sampling)
More data from fewer telescopes
Large simulations present additional challenges
Time-domain data is emerging, requiring fast triggers
A new paradigm of analyzing large public data sets