Large Scale Computations in Astrophysics: Towards a Virtual Observatory
Alex Szalay
Department of Physics and Astronomy, The Johns Hopkins University
ACAT2000, Fermilab, Oct 19, 2000

Nature of Astronomical Data
Imaging
 –2D map of the sky at multiple wavelengths
Derived catalogs
 –subsequent processing of images
 –extracting object parameters (400+ per object)
Spectroscopic follow-up
 –spectra: more detailed object properties
 –clues to physical state and formation history
 –lead to distances: 3D maps
Numerical simulations
All inter-related!

Imaging Data

3D Maps

N-body Simulations

Trends
Future dominated by detector improvements
 [Figure: total area of 3m+ telescopes in the world in m², and total number of CCD pixels in Megapixels, as a function of time. Growth over 25 years is a factor of 30 in glass, 3000 in pixels.]
Moore's Law growth in CCD capabilities
Gigapixel arrays on the horizon
Improvements in computing and storage will track growth in data volume
Investment in software is critical, and growing

The Age of Mega-Surveys
The next generation mega-surveys and archives will change astronomy, due to
 –top-down design
 –large sky coverage
 –sound statistical plans
 –well controlled systematics
The technology to store and access the data is here: we are riding Moore's law
Data mining will lead to stunning new discoveries
Integrating these archives is for the whole community => Virtual Observatory

Ongoing Surveys
Large number of new surveys
 –multi-TB in size, 100 million objects or more
 –individual archives planned, or under way
Multi-wavelength view of the sky
 –more than 13 wavelength bands covered in 5 years
Impressive early discoveries
 –finding exotic objects by unusual colors: L,T dwarfs, high-z quasars
 –finding objects by time variability: gravitational microlensing
Surveys: MACHO, 2MASS, DENIS, SDSS, GALEX, FIRST, DPOSS, GSC-II, COBE, MAP, NVSS, ROSAT, OGLE...

The Necessity of the VO
Enormous scientific interest in the survey data
The environment to exploit these huge sky surveys does not exist today!
 –1 Terabyte at 10 Mbyte/s takes 1 day to transfer
 –Hundreds of intensive queries and thousands of casual queries per day
 –Data will reside at multiple locations, in many different formats
 –Existing analysis tools do not scale to Terabyte data sets
Acute need in a few years; the solution will not just happen

VO - The Challenges
Size of the archived data
 –40,000 square degrees is 2 trillion pixels
 –One band: 4 Terabytes
 –Multi-wavelength: tens of Terabytes
 –Time dimension: 10 Petabytes
Current techniques are inadequate
 –new archival methods
 –new analysis tools
 –new standards
Hardware/networking requirements
 –scalable solutions required
Transition to the new astronomy
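A quick back-of-envelope check of the sizes quoted above, as a short Python sketch; the pixel scale and bytes-per-pixel are assumptions (0.5 arcsec/pixel, 16-bit images), not values given on the slide:

```python
# Back-of-envelope check of the survey sizes quoted above.
# Assumptions (not from the slide): 0.5 arcsec/pixel sampling, 2 bytes/pixel.
ARCSEC_PER_DEG = 3600
PIXEL_SCALE = 0.5          # arcsec per pixel (assumed)
BYTES_PER_PIXEL = 2        # 16-bit images (assumed)

sky_area_deg2 = 40_000
pixels_per_deg2 = (ARCSEC_PER_DEG / PIXEL_SCALE) ** 2
total_pixels = sky_area_deg2 * pixels_per_deg2
one_band_bytes = total_pixels * BYTES_PER_PIXEL

print(f"pixels over 40,000 deg^2: {total_pixels:.2e}")   # ~2e12 pixels
print(f"one band: {one_band_bytes / 1e12:.1f} TB")       # ~4 TB
```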

VO: A New Initiative
Priority in the Astronomy and Astrophysics Survey
Enable new science not previously possible
Maximize the impact of large current and future efforts
Create the necessary new standards
Develop the software tools needed
Ensure that the community has the network and hardware resources to carry out the science

New Astronomy: Different!
Data "avalanche"
 –the flood of Terabytes of data is already happening, whether we like it or not
 –our present techniques of handling these data do not scale well with data volume
Systematic data exploration
 –will have a central role
 –statistical analysis of the "typical" objects
 –automated search for the "rare" events
Digital archives of the sky
 –will be the main access to data
 –hundreds to thousands of queries per day

Examples: Data Pipelines

Examples: Rare Events
Discovery of several new objects by SDSS & 2MASS
 –SDSS T-dwarf (June 1999)

Examples: Reprocessing
Gravitational lensing
 –28,000 foreground galaxies over 2,045,000 background galaxies in test data (McKay et al. 1999)

Examples: Galaxy Clustering
Shape of the fluctuation spectrum
 –cosmological parameters and initial conditions
 –the new surveys (SDSS) are the first with log N ~ 30
Starts with a query
Compute correlation function (see the sketch below)
 –all pairwise distances: N², N log N possible
Power spectrum
 –optimal: the Karhunen-Loeve transform
 –signal-to-noise eigenmodes
 –N³ in the number of pixels
Needs to be done many times over
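The pair-counting step is where the N² cost comes from. The sketch below is only an illustration (not the SDSS pipeline) of how a spatial tree brings pair counting close to N log N; the toy positions, sample size, and bin edges are made up.

```python
# Minimal sketch of two-point pair counting, the core of the correlation
# function step described above. Dataset and bins are illustrative only.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
galaxies = rng.uniform(0.0, 100.0, size=(20_000, 3))   # toy 3D positions (Mpc)
bins = np.logspace(-1, 1.5, 20)                        # separation bin edges

# Naive approach: all pairwise distances, O(N^2) in time and memory.
# d = np.linalg.norm(galaxies[:, None, :] - galaxies[None, :, :], axis=-1)

# Tree-based pair counting brings this close to O(N log N):
tree = cKDTree(galaxies)
cumulative = tree.count_neighbors(tree, bins)   # ordered pairs with d <= r
pairs_per_bin = np.diff(cumulative)             # DD counts per separation bin
print(pairs_per_bin)
```

The binned DD counts would then be combined with counts against a random catalog in a standard estimator such as Landy-Szalay, xi(r) = (DD - 2DR + RR) / RR, with the counts normalized to the numbers of pairs.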

Relation to the HEP Problem
Similarities
 –need to handle large amounts of data
 –data are located at multiple sites
 –data should be highly clustered
 –substantial amounts of custom reprocessing
 –need for a hierarchical organization of resources
 –scalable solutions required
Differences of Astro from HEP
 –data migration is in the opposite direction
 –the role of small queries is more important
 –relations between separate data sets (same sky)
 –data size is currently smaller: we can keep it all on disk

Data Migration Path
[Diagram: tiered architecture (Tier 0 through Tier 3) with a user portal; in HEP data migrates down from Tier 0, while in Astro it flows in the opposite direction, towards the portal.]

Queries are I/O Limited
In our applications there are few fixed access patterns
 –one cannot build indices for all possible queries
 –worst case scenario is a linear scan of the whole table
Increasingly large differences between random and sequential I/O
 –random access is controlled by seek time (5-10 ms): fewer than 1000 random I/Os per second
 –sequential I/O has improved dramatically: 100 MB/sec per SCSI channel; 215 MB/sec easily reached on a single 2-way Dell server
Often much faster to scan than to seek
Good layout => more sequential I/O
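The scan-versus-seek argument is easy to quantify. The rough model below uses the circa-2000 numbers quoted above, plus an assumed record size of 8 KB fetched per random seek (an assumption, not a slide value):

```python
# Rough timing model behind the "scan, don't seek" argument above.
TABLE_BYTES = 1e12            # 1 TB table
SEQ_RATE = 100e6              # 100 MB/s sequential, per SCSI channel
SEEK_TIME = 7.5e-3            # 5-10 ms per random I/O
RECORD_BYTES = 8e3            # assumed record size fetched per seek

scan_seconds = TABLE_BYTES / SEQ_RATE
seek_seconds = (TABLE_BYTES / RECORD_BYTES) * SEEK_TIME

print(f"sequential scan: {scan_seconds / 3600:.1f} hours")   # ~2.8 hours
print(f"random access:   {seek_seconds / 86400:.0f} days")   # ~11 days
```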

Distributed Archives
Networks are slower than disks
 –minimize data transfer
 –run queries locally (see the scatter-gather sketch below)
I/O will scale linearly with the number of nodes
 –a 1 GB/sec aggregate I/O engine can be built for <$100K
Non-trivial problems in
 –load balancing
 –query parallelization
 –queries across inhomogeneous data sources
These problems are not specific to astronomy
 –commercial solutions are around the corner
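A minimal scatter-gather sketch of "run queries locally, move only results"; the node names and the run_local_query helper are hypothetical placeholders, not part of any existing archive API:

```python
# Each node scans its own disks and returns only the (small) result set;
# the raw data never crosses the network. NODES and run_local_query()
# are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

NODES = ["node1.example.org", "node2.example.org", "node3.example.org"]

def run_local_query(node: str, query: str) -> list[dict]:
    """Placeholder: ship the query text to the node and collect its rows."""
    raise NotImplementedError

def distributed_query(query: str) -> list[dict]:
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        partials = pool.map(lambda n: run_local_query(n, query), NODES)
    # Merge step: the results are tiny compared to the tables scanned.
    return [row for part in partials for row in part]
```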

Geometric Approach
Main problem
 –fast, indexed searches of Terabytes in N-dimensional space
 –searches are not axis-parallel: simple B-tree indexing does not work
Geometric approach
 –use the geometric nature of the data
 –quantize data into containers of 'friends': objects of similar colors, close on the sky, clustered together on disk (see the sketch below)
 –containers represent a coarse-grained map of the data: a multidimensional index-tree (e.g. a k-d tree)
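One way to realize the "containers of friends" idea is a k-d-tree-style median split over the combined position and color attributes; the sketch below is illustrative only, with made-up attribute dimensions and container size:

```python
# Sketch of the container idea: recursively median-split objects in the
# combined (position, color) space until buckets are small enough, then
# store each bucket contiguously on disk. All sizes are illustrative.
import numpy as np

def build_containers(attrs: np.ndarray, idx: np.ndarray,
                     max_size: int = 1024, depth: int = 0) -> list[np.ndarray]:
    """Return a list of index arrays, one per container (k-d tree leaves)."""
    if len(idx) <= max_size:
        return [idx]                        # a container of "friends"
    dim = depth % attrs.shape[1]            # cycle through the dimensions
    order = idx[np.argsort(attrs[idx, dim])]
    mid = len(order) // 2
    return (build_containers(attrs, order[:mid], max_size, depth + 1) +
            build_containers(attrs, order[mid:], max_size, depth + 1))

# Toy data: 3 position coordinates + 5 colors per object.
rng = np.random.default_rng(1)
attrs = rng.normal(size=(100_000, 8))
containers = build_containers(attrs, np.arange(len(attrs)))
print(len(containers), "containers, first has", len(containers[0]), "objects")
```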

Geometric Indexing
"Divide and Conquer" partitioning of the (3 + N + M)-dimensional attribute space:
 –Hierarchical Triangular Mesh for the sky position (illustrated below)
 –split as a k-d tree
 –stored as an r-tree of bounding boxes
 –using regular indexing techniques
Attributes and their numbers:
 –Sky Position: 3
 –Multiband Fluxes: N = 5+
 –Other: M = 100+
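For the sky coordinates, the Hierarchical Triangular Mesh recursively subdivides the 8 faces of an octahedron into spherical triangles. The sketch below only illustrates the idea; real HTM libraries use different ID encodings and optimized point location:

```python
# Simplified sketch of HTM-style indexing: locate a unit vector in one of
# the 8 octahedron faces, then refine by splitting each spherical triangle
# into 4 children, building the ID from the path taken.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def htm_id(p, level: int = 8) -> str:
    """Return a string ID like 'N0312...' for the unit vector p."""
    verts = [np.array(v, dtype=float) for v in
             [(0, 0, 1), (1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0), (0, 0, -1)]]
    # The 8 initial spherical triangles (octahedron faces), oriented
    # counter-clockwise as seen from outside the sphere.
    faces = {"N0": (0, 1, 2), "N1": (0, 2, 3), "N2": (0, 3, 4), "N3": (0, 4, 1),
             "S0": (5, 2, 1), "S1": (5, 3, 2), "S2": (5, 4, 3), "S3": (5, 1, 4)}

    def inside(p, a, b, c):
        # p lies inside the spherical triangle if it is on the inner side of
        # the three great circles spanned by (a,b), (b,c) and (c,a).
        return (np.dot(np.cross(a, b), p) >= 0 and
                np.dot(np.cross(b, c), p) >= 0 and
                np.dot(np.cross(c, a), p) >= 0)

    for name, (i, j, k) in faces.items():
        a, b, c = verts[i], verts[j], verts[k]
        if inside(p, a, b, c):
            break
    for _ in range(level):                  # refine: 4 children per triangle
        ab, bc, ca = normalize(a + b), normalize(b + c), normalize(c + a)
        children = [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
        for digit, (ta, tb, tc) in enumerate(children):
            if inside(p, ta, tb, tc):
                name, (a, b, c) = name + str(digit), (ta, tb, tc)
                break
    return name

print(htm_id(normalize(np.array([0.3, 0.4, 0.87]))))   # e.g. 'N0...'
```

Objects whose IDs share a prefix fall in the same patch of sky, so sorting by the ID clusters neighbors together on disk.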

SDSS: Distributed Archive
[Diagram: the User Interface and Analysis Engine talk to the SX Engine, which queries a master Objectivity/RAID node and several slave Objectivity/RAID nodes that together form the Objectivity federation.]

Computing Virtual Data
Analyze large output volumes next to the database
 –send results only ('Virtual Data'): the system 'knows' how to compute the result (Analysis Engine); see the sketch below
Analysis has a different CPU-to-I/O ratio than the database
 –multilayered approach
Highly scalable architecture required
 –distributed configuration, scalable to data grids
Multiply redundant network paths between data-nodes and compute-nodes
 –'Data-wolf' cluster
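A toy sketch of the 'Virtual Data' idea: derived products are named, and the system either returns a materialized copy or recomputes the product next to the data. The registry, decorator, and power_spectrum recipe below are hypothetical, not an existing interface:

```python
# Minimal sketch: the archive ships results, not raw data, and it 'knows'
# how to (re)compute a named derived product on demand.
from typing import Callable

RECIPES: dict[str, Callable] = {}     # derivation name -> how to compute it
CACHE: dict[tuple, object] = {}       # (name, params) -> materialized result

def recipe(name: str):
    """Register a derivation so the system knows how to compute it."""
    def register(fn: Callable) -> Callable:
        RECIPES[name] = fn
        return fn
    return register

def virtual_data(name: str, **params):
    """Return the derived product, computing it next to the data only if it
    has not been materialized before; only the result is shipped."""
    key = (name, tuple(sorted(params.items())))
    if key not in CACHE:
        CACHE[key] = RECIPES[name](**params)   # runs on the analysis engine
    return CACHE[key]

@recipe("power_spectrum")
def power_spectrum(zmin: float, zmax: float):
    # Placeholder for the real server-side computation over the catalog.
    return {"zmin": zmin, "zmax": zmax, "pk": []}

result = virtual_data("power_spectrum", zmin=0.1, zmax=0.2)
```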

SDSS Data Flow

A Data Grid Node
Hardware requirements:
Large distributed database engines
 –with a few GByte/s aggregate I/O speed
High speed (>10 Gbit/s) backbones
 –cross-connecting the major archives
Scalable computing environment
 –with hundreds of CPUs for analysis
[Diagram: a database layer of Objectivity/RAID nodes with 2 GBytes/sec aggregate I/O, a compute layer of 200 CPUs, an interconnect layer at 1 Gbit/sec per node, and 10 Gbit/s links to other nodes.]

GriPhyN
GriPhyN (Grid Physics Network)
Goal: build a prototype Virtual Data Grid
Create ~20 Tier 2 centers for the data processing
Includes ATLAS, ALICE, CMS, LIGO and SDSS
Principals
 –Paul Avery, Ian Foster, Harvey Newman, Carl Kesselman
 –collaboration of 16 universities and laboratories
Phase 1 funding announced a week ago
 –NSF ITR Program: $11.9M
Total scope about $60M

SDSS in GriPhyN
Two Tier 2 nodes (FNAL + JHU)
 –testing the framework on real data in different scenarios
FNAL node
 –reprocessing of images: fast and full regeneration of catalogs from the images on disk
 –gravitational lensing, finer morphological classification
JHU node
 –statistical calculations, integrated with the catalog database
 –tasks require lots of data, can be run in parallel
 –various statistical calculations, likelihood analyses: power spectra, correlation functions, Monte Carlo

Clustering of Galaxies
Generic features of galaxy clustering:
 –self-organized clustering driven by long-range forces
 –these lead to clustering on all scales
 –clustering hierarchy: the distribution of galaxy counts is approximately lognormal
 –scenarios: 'top-down' vs 'bottom-up'

Clustering of Computers
Problem sizes have a lognormal distribution
 –multiplicative process
Optimal queuing strategy (see the toy simulation below)
 –run the smallest job in the queue
 –median scale set by local resources: the largest jobs never finish
Always need more computing
 –'infall' to larger clusters nearby
 –asymptotically long-tailed distribution of compute power
Short-range forces: supercomputers
Long-range forces: the onset of high-speed networking
Self-organized clustering of computing resources
 –the Computational Grid
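A toy simulation of the queuing argument: with lognormally distributed job sizes and a 'smallest job first' policy on a fixed local resource, most jobs finish but the largest essentially never run locally. All numbers are illustrative assumptions:

```python
# Toy illustration of the queuing argument above. Job sizes and the local
# capacity are made-up numbers, chosen only to show the long-tail effect.
import numpy as np

rng = np.random.default_rng(42)
jobs = np.sort(rng.lognormal(mean=0.0, sigma=2.0, size=10_000))  # CPU-hours
capacity = 24.0 * 365            # one year of a single local CPU (assumed)

done = np.cumsum(jobs) <= capacity      # smallest-job-first schedule
print(f"jobs finished in a year: {done.sum()} of {len(jobs)}")
print(f"largest finished job: {jobs[done].max():.1f} CPU-hours")
print(f"largest queued job:   {jobs.max():.1f} CPU-hours")
```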

Conclusions
Databases have become an essential part of astronomy: most data access will soon be via digital archives
Data at separate locations, distributed worldwide, evolving in time: move queries, not data!
Computations in both processing and analysis will be substantial: need to create a 'Virtual Data Grid'
Problems similar to HEP, with many commonalities, but the data flow is more complex
Interoperability of archives is essential: the Virtual Observatory is inevitable
www.voforum.org
www.sdss.org