THOUGHTS ON DATA MANAGEMENT by Justin Burruss & David Schissel SWIM Workshop November 7-9, 2005 Oak Ridge, TN.


Key Points
Experimental data management has greatly improved over the years, to the benefit of experimentalists; the lessons learned can be applied to simulation data
  – What has been referred to as “every man for himself” data management does not scale
Simulation data management should provide long-term, searchable storage of important code runs, accessible in a secure, uniform way
  – SWIM leading to FSP should be a national resource
Comparison of theory and experiment is critical for progress
  – Comparing simulations to measured data during experiments is ideal

Outline
The “bad old days” of experimental data management
The value of a standard way of getting data
How experimentalists want to use simulation data
Ideas for simulation data management

Experimental data was unmanaged back in the “bad old days”
Each code had different output and input formats
To get data, you had to walk down the hall and ask for it
No standard API
  – Different visualization tools for each data format
  – Hard to compare data from different sources
No way to search or do data mining
On the upside, walking down the hallway to get data was a chance to socialize

The “walking down the hall” approach does not scale
As the number of different data formats and visualization tools increases, you spend more time figuring out how to get data, leaving less time to actually analyze and compare
Need an efficient way to share your data with others
When it was difficult to share data, you would print your plots and share those

Today data sharing is routine
Widespread collaboration is impossible without good data management

The experimental community primarily uses MDSplus as a standard data format
Much easier to get data because of a standard API
  – Application Programming Interface (API)
  – The set of functions that you call to interact with the application, in this case the data
  – Examples from MDSplus: MdsOpen opens a database; MdsValue evaluates an expression
Standard API means you can write general applications to read and work with MDSplus data of all types
  – Example: general visualization tools
One less new thing to learn when you can reuse the same visualization tool
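The value of a small, standard API can be sketched with a toy example. The code below is not MDSplus: `ToyDataStore`, `open_tree`, and `value` are hypothetical names that only mimic the open-then-evaluate pattern the slide attributes to MdsOpen and MdsValue, and the signal names and sample values are made up. The point is that `summarize` is a "general application" that works for any signal because it touches the data only through the two calls.

```python
# Toy stand-in for a data system with a two-call API (open, then evaluate).
# ToyDataStore, open_tree, and value are hypothetical names, not MDSplus calls.

class ToyDataStore:
    def __init__(self, trees):
        # trees maps (tree_name, shot) -> {signal_name: list of samples}
        self._trees = trees
        self._current = None

    def open_tree(self, tree, shot):
        """Select the tree/shot that later value() calls read from."""
        key = (tree, shot)
        if key not in self._trees:
            raise KeyError(f"no tree {tree!r} for shot {shot}")
        self._current = self._trees[key]

    def value(self, name):
        """Return the samples stored under a signal name."""
        if self._current is None:
            raise RuntimeError("call open_tree() first")
        return self._current[name]


def summarize(store, tree, shot, names):
    """A 'general tool': it works for any signal because it only uses the API."""
    store.open_tree(tree, shot)
    out = {}
    for name in names:
        samples = store.value(name)
        out[name] = (min(samples), max(samples), len(samples))
    return out


# Made-up sample data for illustration only.
store = ToyDataStore({("xyz", 123): {"ip": [0.0, 1.2, 1.1], "te": [0.5, 2.0, 1.8]}})
summary = summarize(store, "xyz", 123, ["ip", "te"])
print(summary)
```

A plotting tool written the same way would need no changes when a new signal or a new shot is added, which is the reuse the slide describes.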

Example: getting data with and without an API…
API, MANAGED DATA
Scientist #1: “My data is in MDSplus in XYZ shot 123”
Scientist #2: “No problem, I know the two commands it takes to get data from MDSplus”
  – MdsOpen
  – MdsValue
NO API, UNMANAGED
Scientist #1: “My data is somewhere in my home directory in my own special data format”
“Oh wait, I forgot I moved those files, they’re on that other computer now”
“OK, now follow these 15 steps to read data from my file format”
“Oh, wait, this is an old version, you need these extra 5 steps”
…etc.

A standard API can serve as a wrapper, thus leaving legacy systems in place behind the scenes
MDSplus has been used as a front end for other data systems
  – DIII-D (PTDATA), JET, SRB
Leave the old data systems in place, but allow them to be called from MDSplus through MdsValue
Can be secure via X.509
[Diagram: MDSplus as a front end to JET, DIII-D, SRB, other MDSplus servers, and future systems]
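The wrapper idea can be sketched as a small facade that routes one uniform call to several back ends. Everything here is an assumption made for illustration: the `Facade` class, the `prefix::signal` naming convention, and the two stand-in legacy readers are hypothetical and do not reflect how MDSplus actually dispatches to PTDATA, JET, or SRB.

```python
# Sketch of a facade: one value() call, several legacy back ends behind it.
# All names and the "prefix::signal" convention are hypothetical.

def read_legacy_a(signal, shot):
    # Stand-in for a legacy system with its own calling convention.
    return {"source": "legacy_a", "signal": signal, "shot": shot}


def read_legacy_b(shot_number, point_name):
    # A second legacy system; note the different argument order.
    return {"source": "legacy_b", "signal": point_name, "shot": shot_number}


class Facade:
    def __init__(self):
        self._backends = {}

    def register(self, prefix, reader):
        """reader must accept (signal, shot) and return the data."""
        self._backends[prefix] = reader

    def value(self, expression, shot):
        """Route 'prefix::signal' to the matching back end."""
        prefix, sep, signal = expression.partition("::")
        if not sep or prefix not in self._backends:
            raise KeyError(f"no back end for {expression!r}")
        return self._backends[prefix](signal, shot)


facade = Facade()
facade.register("a", read_legacy_a)
# Adapt the second system's argument order without touching its code.
facade.register("b", lambda signal, shot: read_legacy_b(shot, signal))

result_a = facade.value("a::density", 123)
result_b = facade.value("b::ip", 123)
print(result_a)
print(result_b)
```

The legacy readers stay untouched; only a thin adapter is registered for each one, so callers learn a single interface.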

A synthetic diagnostic: one way experimentalists would use simulation data
Want to be able to use general-purpose visualization tools to compare simulation and experiment
Simulation data in physical units
Must be rapid
  – Quick plots for comparison during tokamak experiments

Lessons learned from NIMROD storage in MDSplus
Experience with NIMROD revealed limitations in MDSplus
  – Too much data
MDSplus updated to accommodate larger “node” and “tree” sizes
Sending 100s of GB to TBs of data over a WAN is slow
  – Small chunks, single TCP stream with ACKs over high latency
MDSplus being updated with parallel I/O streams (GridFTP)
  – Will make WAN transfer faster
A bulk transfer method would further speed up MDSplus
  – i.e., send the whole “tree” (database), not pieces of the data
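A back-of-the-envelope model shows why small chunks with per-chunk acknowledgements are slow over a high-latency link, and why parallel streams and bulk transfer help. This is a deliberately simplified sketch: it assumes each stream sends one chunk and then waits one full round trip before sending the next, and every number used (64 KiB chunks, 70 ms round trip, a 1 Gb/s link, 100 GB of data, 8 streams) is an assumption chosen for illustration, not a measurement from the NIMROD work.

```python
def transfer_time_s(total_bytes, chunk_bytes, rtt_s, link_bytes_per_s, streams=1):
    """Time to move total_bytes when each stream sends a chunk, then waits for its ACK."""
    # One chunk costs its time on the wire plus one round trip for the ACK.
    per_chunk_s = chunk_bytes / link_bytes_per_s + rtt_s
    per_stream_rate = chunk_bytes / per_chunk_s
    # Parallel streams add up, but never beyond the link capacity.
    rate = min(streams * per_stream_rate, link_bytes_per_s)
    return total_bytes / rate


# Assumed, illustrative numbers.
TOTAL = 100e9          # 100 GB of simulation output
CHUNK = 64 * 1024      # 64 KiB per request
RTT = 0.070            # 70 ms round trip
LINK = 125e6           # 1 Gb/s expressed in bytes per second

single = transfer_time_s(TOTAL, CHUNK, RTT, LINK)
eight = transfer_time_s(TOTAL, CHUNK, RTT, LINK, streams=8)
bulk = transfer_time_s(TOTAL, TOTAL, RTT, LINK)  # whole "tree" as one chunk

for label, seconds in [("single stream", single), ("8 streams", eight), ("bulk", bulk)]:
    print(f"{label}: {seconds / 3600:.2f} h")
```

Under these assumptions the single stream is limited by latency rather than bandwidth and takes more than a day, eight streams cut that by a factor of eight, and sending the whole tree at once approaches the raw link speed.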

Some other system could be used if experimentalists could get to it through the familiar MDSplus interface
Experimentalists already familiar with MDSplus
Many visualization tools already exist
Reuse the standard MDSplus API
[Diagram: MDSplus interface in front of SimDB, JET, DIII-D, SRB, other MDSplus servers, and a future simulation database]

New storage system should store code run information
Experimental community augments MDSplus with relational database for tracking code runs and for shot summary data
Important because it allows for rapid queries across servers/trees/shots
  – Fewer file opens = faster
Identify each code run with a unique ID
The best “scratch” runs are “pasted” (aka “blessed”)
  – Not all runs must be archived permanently
Also store other metadata such as comments, “run type”, who ran the code, date started/completed, etc.
  – Discovery of data much easier
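A code-run catalog of this kind can be sketched with an in-memory SQLite database. The table layout, column names, and helper functions below are assumptions made for illustration and are not the schema used by the experimental community; the user names, dates, and comments are invented sample data.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE code_runs (
        run_id    TEXT PRIMARY KEY,   -- unique ID for each code run
        code      TEXT NOT NULL,      -- which code produced the run
        run_type  TEXT,
        username  TEXT,               -- who ran the code
        started   TEXT,               -- ISO 8601 timestamps
        completed TEXT,
        comments  TEXT,
        blessed   INTEGER NOT NULL DEFAULT 0  -- 1 = keep permanently
    )
    """
)


def record_run(code, run_type, username, started, completed, comments=""):
    """Insert a scratch run and return its generated unique ID."""
    run_id = uuid.uuid4().hex
    conn.execute(
        "INSERT INTO code_runs "
        "(run_id, code, run_type, username, started, completed, comments) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (run_id, code, run_type, username, started, completed, comments),
    )
    return run_id


def bless(run_id):
    """Mark a scratch run as worth archiving permanently."""
    conn.execute("UPDATE code_runs SET blessed = 1 WHERE run_id = ?", (run_id,))


# Invented sample records.
first = record_run("NIMROD", "scan", "alice", "2005-10-01T09:00", "2005-10-01T17:00",
                   "baseline case")
second = record_run("NIMROD", "test", "bob", "2005-10-02T09:00", "2005-10-02T09:30")
bless(first)

# One query finds the blessed runs without opening any data files.
blessed_runs = conn.execute(
    "SELECT run_id, username, comments FROM code_runs WHERE blessed = 1"
).fetchall()
print(blessed_runs)
```

Because the metadata lives in one relational table, searches by user, run type, or date are single queries rather than scans over many files.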

Simulation storage system must allow for retrieval many years after initial code run
Cannot have a situation where old data is lost forever
Must be able to get to old data
Can’t recreate old data if code versions change
Should have plenty of useful metadata for better searching
Try getting your data from these (slide image not preserved)

Simulation data management scheme should make data accessible by small institutions, too
Expensive solutions may preclude widespread collaboration

Conclusion: data management is important
Simulation data management means:
  – Important data is saved forever
  – Standard way to get to the data
  – Data is shared, organized, searchable
Experimentalists want to compare simulation with experiment
  – Must be able to do so rapidly
  – Need “real” units for data
Could use MDSplus, or provide an MDSplus façade
  – Improvements to MDSplus in progress

Aux slides…

MDSplus Security: host-based or certificate-based
Host-based is not particularly secure
  – Great for local access where you trust peers
Certificate-based is secure
  – Each user has their own certificate
  – Agree on a Certificate Authority
  – Works with delegated proxy certificates (MyProxy)
      Your “ID” is on a server
      No messing around with files
  – Authorization via Resource Oriented Authorization Manager
      Flexible, simple, free
      Empower stakeholders
      Easy web interface