Igor Gaponenko (on behalf of LCLS / PCDS)

An integral part of the LCLS Computing System.

Provides:
- Mid-term (1 year) storage for experimental data
- Long-term archival to tape
- Data export for off-SLAC usage
- Access to the data from the processing farms
- File catalogs and metadata
- Data privacy and security

- Accepts data from the 6 DAQ systems of the LCLS instruments
- Primary data formats: XTC and HDF5 (a sketch of reading a translated HDF5 file follows this list)
- Designed to cope with GB/s data rates
- Storage-centric architecture
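Users ultimately read the translated HDF5 files from the analysis farms. Below is a minimal sketch of inspecting such a file with h5py; the file name and dataset layout are hypothetical assumptions, since the actual structure depends on the instrument and on the translator configuration.

```python
# Minimal sketch: inspecting a translated LCLS HDF5 file with h5py.
# The file name below is hypothetical; real layouts depend on the
# instrument and on how the XTC-to-HDF5 translator was configured.
import h5py

def list_datasets(path):
    """Print the name, shape and dtype of every dataset in an HDF5 file."""
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(name, obj.shape, obj.dtype)
        f.visititems(visit)

if __name__ == "__main__":
    list_datasets("cxi_run0001.h5")  # hypothetical file name
```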

Quite different from HEP experiments:
- Small user teams (5-20 people)
- Short experiments (2-5 days, 3-10 shifts)
- Quick turnaround between experiments (within 1 hour)
- Data privacy is critical; data sharing is rare
- Variety of frameworks, tools, algorithms and methods
- Huge data rates (1.2 GB/s in CXI) and data volumes (PB+/year)
- No data reduction in ONLINE (yet)
- In many cases it is impossible to export all raw data, due to limited usable network bandwidth and a lack of storage resources at users' sites

Hence the design (see next slide)...

[Architecture diagram: the ONLINE DAQ systems (AMO DAQ, CXI DAQ, ..., MEC DAQ) feed the OFFLINE system - Lustre high-bandwidth cluster storage (2 PB, growing to 4 PB in Aug 2011), the iRODS file manager, XTC-to-HDF5 translation, MySQL databases, Apache web apps and services, analysis farms, and data export nodes (100 TB) - with long-term archival to HPSS in the SLAC Computer Center and data export to LCLS users over ESnet.]

- Automated data flows within the system
- File sizes limited to 100 GB
- MD5 checksums calculated at the source (DAQ) and recorded (a verification sketch follows this list)
- Duplicate data on HPSS (two "streams" of tapes)
- Manual data export (bbcp, sftp, etc.)
- Data retention policies (Lustre: 1 year, HPSS: 10 years)
- Privacy, security, access control:
  - Group-based authorization (one POSIX group per experiment)
  - Enforced file access control (ACLs, file ownership)
  - Single sign-on authentication for Web apps (WebAuth)
- Tools:
  - Web Portal for users (File Catalogs, e-Log, HDF5 Translation)
  - Various Web tools for internal/administrative use
  - Custom command-line applications
  - UNIX commands for direct (!) manipulation of data files
- Not everything can be automated: a substantial amount of human interaction is still required
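As one illustration of the integrity checks above, the sketch below recomputes a file's MD5 checksum in fixed-size chunks and compares it with the value recorded at the DAQ. This is a minimal, hypothetical example: the real checks run inside the automated data-flow machinery, and the recorded checksums live in the file catalog rather than being passed around as strings.

```python
# Minimal sketch of the checksum verification described above: recompute
# the MD5 of a file on Lustre and compare it with the value recorded by
# the DAQ. Looking up the recorded value (in the file catalog) is out of
# scope here and it is passed in as a plain string.
import hashlib

CHUNK = 64 * 1024 * 1024  # 64 MB reads; individual files are capped at 100 GB

def md5sum(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()

def verify(path, recorded_md5):
    """True if the on-disk copy still matches the checksum recorded at the DAQ."""
    return md5sum(path) == recorded_md5
```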

[Usage chart: 72 experiments, >300 users, 30,000 runs and 52,000 files to date across the AMO, SXR, XPP and CXI instruments; the plot marks the CXI start-up and the facility shutdown period.]

[Per-instrument breakdown chart: 31 experiments / 12,000 runs / 22,400 files; 4 experiments / 2,400 runs / 8,300 files; 20 experiments / 6,000 runs / 9,800 files; 17 experiments / 8,900 runs / 11,500 files.]

- Data compression to reduce sizes and I/O rates:
  - Compress images in the DAQ before migrating to Lustre
  - A compression ratio of ~0.6 is achievable for the most data-intensive experiments (CSPad at CXI)
  - Compress the HDF5 payload (the current algorithm limits the translator's performance; see the sketch after this list)
- More sophisticated data export tools are needed:
  - Data transfer over the WAN is bumpy and unpredictable
  - The current tools are too low-level, and recovery may not work
  - Various options under discussion (including GridFTP)
- Data (pre-)processing during HDF5 translation:
  - Background correction and hit finding
- Intermediate data definition language for XTC and HDF5:
  - New data types are introduced in ONLINE
  - A problem for the analysis framework(s) now in development
- Indexing of datagrams within XTC:
  - Only serial access to events is possible in the original XTC format
  - Required by the new analysis frameworks (in development)
- Sampling (1%) of archived (HPSS) files for verification
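As a rough illustration of the "compress the HDF5 payload" item, the sketch below writes detector images into a chunked, gzip-compressed dataset with h5py. The dataset name, chunk shape and compression level are assumptions made for the example, not the translator's actual settings; the point of the slide is precisely that the compression scheme must be chosen so it does not throttle the translator.

```python
# Minimal sketch: writing detector images as a chunked, gzip-compressed
# HDF5 dataset with h5py. Dataset name, chunking and compression level
# are illustrative assumptions, not the production translator settings.
import h5py
import numpy as np

def write_compressed(path, images):
    """Write a stack of images (n, rows, cols) with one-image-per-chunk storage."""
    data = np.asarray(images)
    with h5py.File(path, "w") as f:
        f.create_dataset(
            "images",                      # hypothetical dataset name
            data=data,
            chunks=(1,) + data.shape[1:],  # one image per chunk
            compression="gzip",
            compression_opts=4,            # moderate level: ratio vs. translation speed
            shuffle=True,                  # byte-shuffle filter often helps detector data
        )
```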