Data-Intensive Computing in the Science Community Alex Szalay, JHU.

Slides:

Advertisements

Similar presentations

Microsoft Research Microsoft Research Jim Gray Distinguished Engineer Microsoft Research San Francisco SKYSERVER.

Advertisements

Trying to Use Databases for Science Jim Gray Microsoft Research

1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research

Designing Services for Grid-based Knowledge Discovery A. Congiusta, A. Pugliese, Domenico Talia, P. Trunfio DEIS University of Calabria ITALY

Scientific Collaborations in a Data-Centric World Alex Szalay The Johns Hopkins University.

The Big Picture Scientific disciplines have developed a computational branch Models without closed form solutions solved numerically This has lead to.

Using the Optimizer to Generate an Effective Regression Suite: A First Step Murali M. Krishna Presented by Harumi Kuno HP.

Data-Intensive Scientific Computing in Astronomy Alex Szalay The Johns Hopkins University.

Thursday, April 8 th Agenda  Finish Section 18.1: The Universe  Origin of the universe, red shift, big bang theory  In-Class Assignments Section 18.1.

László Dobos 1,2, Tamás Budavári 2, Nolan Li 2, Alex Szalay 2, István Csabai 1 1 Eötvös Loránd University, Budapest,

Petascale Data Intensive Computing Alex Szalay The Johns Hopkins University.

Astrophysics with Terabytes of Data Alex Szalay The Johns Hopkins University.

Petascale Data Intensive Computing for eScience Alex Szalay, Maria Nieto-Santisteban, Ani Thakar, Jan Vandenberg, Alainna Wonders, Gordon Bell, Dan Fay,

Bin Fu Eugene Fink, Julio López, Garth Gibson Carnegie Mellon University Astronomy application of Map-Reduce: Friends-of-Friends algorithm A distributed.

Database Software File Management Systems Database Management Systems.

Database Management: Getting Data Together Chapter 14.

Cosmology Astronomy 315 Professor Lee Carkner Lecture 22 "In the beginning the Universe was created. This has made a lot of people very angry and been.

1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.

Panel Summary Andrew Hanushevsky Stanford Linear Accelerator Center Stanford University XLDB 23-October-07.

Pan-STARRS: Learning to Ride the Data Tsunami María A. Nieto-Santisteban 1, Yogesh Simmhan 3, Roger Barga 3, Tamas Budávari 1, László Dobos 1, Nolan Li.

Astro-DISC: Astronomy and cosmology applications of distributed super computing.

SQL Server 2008 Overview Lubor Kollar, Group Program Manager.

Big Data Use Cases in the cloud Peter Sirota, GM Elastic

Sky Surveys and the Virtual Observatory Alex Szalay The Johns Hopkins University.

Alok 1Northwestern University Access Patterns, Metadata, and Performance Alok Choudhary and Wei-Keng Liao Department of ECE,

Systems analysis and design, 6th edition Dennis, wixom, and roth

Copyright © 2011, Oracle and/or its affiliates. All rights reserved.

Supported by the National Science Foundation’s Information Technology Research Program under Cooperative Agreement AST with The Johns Hopkins University.

László Dobos, Tamás Budavári, Alex Szalay, István Csabai Eötvös University / JHU Aug , 2008.IDIES Inaugural Symposium, Baltimore1.

Amdahl Numbers as a Metric for Data Intensive Computing Alex Szalay The Johns Hopkins University.

The McGraw-Hill Companies, Inc Information Technology & Management Thompson Cats-Baril Chapter 3 Content Management.

National Center for Supercomputing Applications Observational Astronomy NCSA projects radio astronomy: CARMA & SKA optical astronomy: DES & LSST access:

Alex Szalay, Jim Gray Analyzing Large Data Sets in Astronomy.

Theory in the German Astrophysical VO Summary: We show results of efforts done within the German Astrophysical Virtual Observatory (GAVO). GAVO has paid.

1 1 Slide Introduction to Data Mining and Business Intelligence.

Query optimization in relational DBs Leveraging the mathematical formal underpinnings of the relational model.

Ohio State University Department of Computer Science and Engineering 1 Cyberinfrastructure for Coastal Forecasting and Change Analysis Gagan Agrawal Hakan.

Victoria, May 2006 DAL for theorists: Implementation of the SNAP service for the TVO Claudio Gheller, Giuseppe Fiameni InterUniversitary Computing Center.

Future Directions of the VO Alex Szalay The Johns Hopkins University Jim Gray Microsoft Research.

1 Wenguang WangRichard B. Bunt Department of Computer Science University of Saskatchewan November 14, 2000 Simulating DB2 Buffer Pool Management.

Kevin Cox – SQL CAT Microsoft Corporation What are the largest SQL projects in the world? SESSION CODE: DAT305 Srik Raghavan –

LSST: Preparing for the Data Avalanche through Partitioning, Parallelization, and Provenance Kirk Borne (Perot Systems Corporation / NASA GSFC and George.

Science In An Exponential World Alexander Szalay, JHU Jim Gray, Microsoft Reserach Alexander Szalay, JHU Jim Gray, Microsoft Reserach.

Cosmology and extragalactic astronomy

Federated Discovery and Access in Astronomy Robert Hanisch (NIST), Ray Plante (NCSA)

1 Computing Challenges for the Square Kilometre Array Mathai Joseph & Harrick Vin Tata Research Development & Design Centre Pune, India CHEP Mumbai 16.

Data resource management

Extreme Data-Intensive Scientific Computing Alex Szalay The Johns Hopkins University.

Streaming Problems in Astrophysics

Pan-STARRS PS1 Published Science Products Subsystem Presentation to the PS1 Science Council August 1, 2007.

1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research

Reproducing the Observed Universe with Simulations Qi Guo Max Planck Institute for Astrophysics MPE April 8th, 2008.

Data and storage services on the NGS.

ORACLE & VLDB Nilo Segura IT/DB - CERN. VLDB The real world is in the Tb range (British Telecom - 80Tb using Sun+Oracle) Data consolidated from different.

Sensors Sensors Everywhere! IoT by Wire – The New Relevance Rob Milks – CEO B-LineLogic, Inc.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.

Amdahl’s Laws and Extreme Data-Intensive Computing Alex Szalay The Johns Hopkins University.

IPv6 Matrix Project - Page 1 IPv6 Matrix Project Tracking IPv6 connectivity Worldwide Dr. Olivier MJ.

Astronomy toolkits and data structures Andrew Jenkins Durham University.

Model-driven Data Layout Selection for Improving Read Performance Jialin Liu 1, Bin Dong 2, Surendra Byna 2, Kesheng Wu 2, Yong Chen 1 Texas Tech University.

Department of Computer Science Sir Syed University of Engineering & Technology, Karachi-Pakistan. Presentation Title: DATA MINING Submitted By.

Extreme Data-Intensive Computing in Science Alex Szalay The Johns Hopkins University.

There is an inherent meaning in everything. “Signs for people who can see.”

Jialin Liu, Surendra Byna, Yong Chen Oct Data-Intensive Scalable Computing Laboratory (DISCL) Lawrence Berkeley National Lab (LBNL) Segmented.

CLUSTER COMPUTING Presented By, Navaneeth.C.Mouly 1AY05IS037

BARC Scaleable Servers

Rick, the SkyServer is a website we built to make it easy for professional and armature astronomers to access the terabytes of data gathered by the Sloan.

Using Columnstore indexes in Azure DevOps Services. Lessons learned

Using Columnstore indexes in Azure DevOps Services. Lessons learned

Presentation transcript:

Data-Intensive Computing in the Science Community Alex Szalay, JHU

Emerging Trends Large data sets are here, solutions are not Scientists are “cheap” –Giving them SW is not enough –Need recipe for solutions Emerging sociological trends: –Data collection in ever larger collaborations (VO) –Analysis decoupled, off archived data by smaller groups Even HPC projects choking on IO Exponential data growth –> data will be never co-located “Data cleaning” is much harder than data loading

Reference Applicatons We (JHU) have several key projects –SDSS (10TB total, 3TB in DB, soon 10TB, in use for 6 years) –PanStarrs (80TB by 2009, 300+ TB by 2012) –Immersive Turbulence: 30TB now, 300TB next year, can change how we use HPC simulations worldwide –SkyQuery: perform fast spatial joins on the largest astronomy catalogs / replicate multi-TB datasets 20 times for much faster query performance (1Bx1B in 3 mins) –OncoSpace: 350TB of radiation oncology images today, 1PB in two years, to be analyzed on the fly –Sensor Networks: 200M measurements now, billions next year, forming complex relationships

PAN-STARRS PS1 –detect ‘killer asteroids’, starting in November 2008 –Hawaii + JHU + Harvard + Edinburgh + Max Planck Society Data Volume –>1 Petabytes/year raw data –Over 5B celestial objects plus 250B detections in DB –100TB SQL Server database –PS4: 4 identical telescopes in 2012, generating 4PB/yr

Cosmological Simulations Cosmological simulations have 10 9 particles and produce over 30TB of data (Millennium) Build up dark matter halos Track merging history of halos Use it to assign star formation history Combination with spectral synthesis Realistic distribution of galaxy types Hard to analyze the data afterwards -> need DB What is the best way to compare to real data?

Immersive Turbulence Unique turbulence database –Consecutive snapshots of a 1,024 3 simulation of turbulence: now 30 Terabytes –Hilbert-curve spatial index –Soon 6K 3 and 300 Terabytes –Treat it as an experiment, observe the database! –Throw test particles in from your laptop, immerse yourself into the simulation, like in the movie Twister New paradigm for analyzing HPC simulations! with C. Meneveau, S. Chen, G. Eyink, R. Burns, E. Perlman

advect backwards in time ! minus Not possible during DNS Sample code (gfortran 90!) running on a laptop

Commonalities Huge amounts of data, aggregates needed –But also need to keep raw data –Need for parallelism Requests enormously benefit from indexing Very few predefined query patterns –Everything goes…. –Rapidly extract small subsets of large data sets –Geospatial everywhere Data will never be in one place –Remote joins will not go away Not much need for transactions Data scrubbing is crucial

Continuing Growth How long does the data growth continue? High end always linear Exponential comes from technology + economics  rapidly changing generations –like CCD’s replacing plates, and become ever cheaper How many new generations of instruments do we have left? Are there new growth areas emerging? Note, that software is also an instrument –hierarchical data replication –Value added data materializedv –data cloning