CASJOBS: A WORKFLOW ENVIRONMENT DESIGNED FOR LARGE SCIENTIFIC CATALOGS Nolan Li, Johns Hopkins University.

What is CASJobs?
- Terabytes of scientific data
- Web-based system
- Data distribution
- Server-side analysis
- Optimize user work patterns
- Server-side user storage and programmability

Sloan Digital Sky Survey (SDSS)
- Astronomical survey
- Images (FITS) TB
- Other data products (masks, JPEG images, etc.) (DAS, FITS format) TB
- Catalogs (CAS, SQL database): 18 TB
- Data is public
- Delivery?

Database
- Bandwidth is expensive!
- 10 terabytes is big!
- So database it (SkyServer)
  - Partial delivery
  - Move work to data
- Scalability
  - Traffic++
  - Complexity++
  - Data++
- So…
  - Cap execution time
  - Cap results
  - Build something else

CASJobs
- Catalog Archive Server Jobs
- Server-side user storage and programmability: MyDB
- Hardware abstraction and long-term query portability: Contexts
- Complete, automatic query logging
- Scalable performance: controlled asynchronous query execution
- Data sharing: Groups

MyDB
- Server-side user database
- Intermediate storage
- Data import
- User programmable

SELECT * FROM DR4 WHERE a.objid = OR a.objid = OR a.objid = OR a.objid =

SELECT * FROM DR4 a, MyDB.MyTable b WHERE a.objid = b.objid
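The contrast between the two queries on this slide can be tried out locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for the catalog server. The table names `dr4_objects` and `my_table`, the `mag` column, and all object IDs are invented for illustration and are not SDSS values.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a tiny stand-in catalog and a user table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in for a DR4 catalog table (not the real SDSS schema).
    conn.execute("CREATE TABLE dr4_objects (objid INTEGER PRIMARY KEY, mag REAL)")
    conn.executemany(
        "INSERT INTO dr4_objects VALUES (?, ?)",
        [(1, 17.2), (2, 18.9), (3, 16.4), (4, 20.1), (5, 19.3)],
    )
    # Hypothetical stand-in for an ID list the user imported into MyDB.
    conn.execute("CREATE TABLE my_table (objid INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO my_table VALUES (?)", [(2,), (3,), (5,)])
    return conn


def select_with_or_chain(conn, objids):
    """Fetch rows by spelling out every object ID in the query text."""
    where = " OR ".join("a.objid = ?" for _ in objids)
    sql = f"SELECT a.objid, a.mag FROM dr4_objects a WHERE {where} ORDER BY a.objid"
    return conn.execute(sql, list(objids)).fetchall()


def select_with_join(conn):
    """Fetch the same rows by joining against the imported ID table."""
    sql = (
        "SELECT a.objid, a.mag FROM dr4_objects a, my_table b "
        "WHERE a.objid = b.objid ORDER BY a.objid"
    )
    return conn.execute(sql).fetchall()


conn = build_demo_db()
or_rows = select_with_or_chain(conn, [2, 3, 5])
join_rows = select_with_join(conn)
```

Both functions return the same rows. The join version keeps the query text the same size no matter how many IDs the user has imported, which is the point of holding the list server-side.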

Logging
- Automatically log all user queries
- Resubmit old queries
- Reconstruct database objects
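One way to picture the logging behaviour is an append-only list of submissions. The following is a minimal sketch under that assumption. The class names, fields, and in-memory storage are hypothetical and do not describe how CASJobs stores its log.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LoggedQuery:
    query_id: int
    user: str
    sql: str
    submitted_at: datetime


@dataclass
class QueryLog:
    """Hypothetical append-only log: every submission is recorded, none is edited."""
    entries: list = field(default_factory=list)

    def record(self, user, sql):
        """Log one submitted query and return its log entry."""
        entry = LoggedQuery(len(self.entries) + 1, user, sql, datetime.now(timezone.utc))
        self.entries.append(entry)
        return entry

    def history(self, user):
        """Return the user's queries in submission order."""
        return [e for e in self.entries if e.user == user]

    def resubmit(self, query_id):
        """Submit the text of an old query again; the resubmission is logged too."""
        old = self.entries[query_id - 1]
        return self.record(old.user, old.sql)

    def replay_script(self, user):
        """Return the user's SQL in order, e.g. to rebuild tables they created."""
        return [e.sql for e in self.history(user)]
```

Because the log keeps the full query text in order, replaying a user's history is enough to recreate derived tables, as long as the underlying catalog data is unchanged.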

Contexts
- Databases are identified by their data, not their location
- Queries are independent of hardware configuration

SELECT TOP 10 * FROM [server].[catalog].[user].MyTable

SELECT TOP 10 * FROM DR4.MyTable
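The idea behind contexts can be illustrated as a lookup that rewrites a context-qualified table name into a physical one just before execution. This is a sketch, not the CASJobs implementation. The mapping, the server and catalog names, and the `resolve_contexts` helper are all assumptions.

```python
import re

# Hypothetical mapping from context name to the place the data currently lives.
CONTEXT_LOCATIONS = {
    "DR4": "[server_a].[catalog_dr4].[dbo]",
}

# Matches two-part names such as DR4.MyTable or a.objid.
_QUALIFIED_NAME = re.compile(r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b")


def resolve_contexts(sql, locations):
    """Replace context-qualified table names with their physical location."""

    def substitute(match):
        context, table = match.group(1), match.group(2)
        prefix = locations.get(context)
        if prefix is None:
            # Not a known context (for example alias.column), so leave it alone.
            return match.group(0)
        return f"{prefix}.{table}"

    return _QUALIFIED_NAME.sub(substitute, sql)
```

If the DR4 data moves to another machine, only the mapping changes. The user's stored query text stays valid, which is what makes old queries portable over the long term.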

Quick Jobs
- Executes right away
- But not for very long
- Restricted memory usage
- For things like…
  - How many objects?
  - Table previews
  - Preliminary queries
  - System queries

Long Jobs
- Asynchronous
- Less restricted execution time
- Storage capped by MyDB size
- For things like…
  - Heavy IO
  - Heavy computation
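The split between quick and long jobs can be sketched as two entry points on one service: one runs immediately under a short time cap, the other only enqueues work and returns a job ID. Everything here is hypothetical, including the class names, the five-second default, and the `execute` callable that stands in for the database. A real server would cancel the query on the server side rather than check a clock between rows.

```python
import time
from collections import deque
from dataclasses import dataclass
from itertools import count

QUICK_TIME_LIMIT_S = 5.0  # hypothetical cap, not a CASJobs setting


class QuickJobTimeout(Exception):
    """Raised when a quick job exceeds its time cap."""


@dataclass
class Job:
    job_id: int
    sql: str
    status: str = "queued"  # becomes "finished" once a worker has run it
    result: object = None


class JobService:
    def __init__(self, execute):
        # execute is a stand-in for the database: callable(sql) -> iterable of rows.
        self._execute = execute
        self._ids = count(1)
        self._queue = deque()
        self.jobs = {}

    def run_quick(self, sql, time_limit_s=QUICK_TIME_LIMIT_S):
        """Run right away and give up if the time cap is reached."""
        deadline = time.monotonic() + time_limit_s
        rows = []
        for row in self._execute(sql):
            if time.monotonic() >= deadline:
                raise QuickJobTimeout(f"quick job exceeded {time_limit_s} s")
            rows.append(row)
        return rows

    def submit_long(self, sql):
        """Queue the query and return a job ID without waiting for results."""
        job = Job(next(self._ids), sql)
        self.jobs[job.job_id] = job
        self._queue.append(job)
        return job.job_id

    def run_next(self):
        """Run the oldest queued job; stands in for a background worker."""
        if not self._queue:
            return None
        job = self._queue.popleft()
        job.result = list(self._execute(job.sql))
        job.status = "finished"
        return job
```

The caller of `submit_long` polls `jobs[job_id].status` later, so heavy queries never hold a web request open.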

Groups
- Non-exclusive sets of CASJobs users
- Share data
- Keep more work at the data

SELECT * FROM myGroup.otherUser.theirTable
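A three-part name such as `myGroup.otherUser.theirTable` suggests a simple access rule: the reader must belong to the group, and the owner must have shared that table with it. The sketch below models that rule only. The `GroupRegistry` class and its methods are hypothetical and are not taken from CASJobs.

```python
from dataclasses import dataclass, field


@dataclass
class GroupRegistry:
    """Hypothetical registry of group membership and shared tables."""
    members: dict = field(default_factory=dict)  # group -> set of user names
    shared: dict = field(default_factory=dict)   # group -> set of (owner, table)

    def add_member(self, group, user):
        """Add a user to a group; a user may belong to any number of groups."""
        self.members.setdefault(group, set()).add(user)

    def share(self, group, owner, table):
        """Publish one of the owner's tables to a group the owner belongs to."""
        if owner not in self.members.get(group, set()):
            raise PermissionError(f"{owner} is not a member of {group}")
        self.shared.setdefault(group, set()).add((owner, table))

    def can_read(self, requester, qualified_name):
        """Check a group.owner.table name against membership and sharing."""
        parts = qualified_name.split(".")
        if len(parts) != 3:
            return False
        group, owner, table = parts
        is_member = requester in self.members.get(group, set())
        is_shared = (owner, table) in self.shared.get(group, set())
        return is_member and is_shared
```

Since the shared table is read in place, collaborators can join against it without downloading or re-uploading anything.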

Hardware
- Flexible configuration
- 1+ machine per context (non-exclusive)
- 1+ machine for MyDBs

Interface
- Web site
- Web services

Usage
- > two million jobs
- > 2200 users
- Astro deployments
  - Galaxy Evolution Explorer (GALEX)
  - Palomar Quest
  - Panoramic Survey Telescope and Rapid Response System (Pan-STARRS)[3]
- Non-astro deployments
  - AmeriFlux
  - Swiss Institute of Bioinformatics (ISB)