High Energy Physics Data Management Richard P. Mount Stanford Linear Accelerator Center DOE Office of Science Data Management Workshop, SLAC March 16-18, 2004

The Science (1)
Understand the nature of the Universe (experimental cosmology?)
– BaBar at SLAC (1999 on): measuring the matter-antimatter asymmetry
– CMS and Atlas at CERN (2007 on): understanding the origin of mass and other cosmic problems

The Science (2)
From the Fermilab Web: Research at Fermilab will address the grand questions of particle physics today.
– Why do particles have mass?
– Does neutrino mass come from a different source?
– What is the true nature of quarks and leptons? Why are there three generations of elementary particles?
– What are the truly fundamental forces?
– How do we incorporate quantum gravity into particle physics?
– What are the differences between matter and antimatter?
– What are the dark particles that bind the universe together?
– What is the dark energy that drives the universe apart?
– Are there hidden dimensions beyond the ones we know?
– Are we part of a multidimensional megaverse?
– What is the universe made of?
– How does the universe work?

Experimental HENP
– Large (500 – 2000 physicist) international collaborations
– 5 – 10 years of accelerator and detector construction
– 10 – 20 years of data-taking and analysis
– A countable number of experiments: Alice, Atlas, BaBar, Belle, CDF, CLEO, CMS, D0, LHCb, PHENIX, STAR …
BaBar at SLAC
– Measuring matter-antimatter asymmetry (why do we exist?)
– 500 physicists
– Data taking since 1999
– More data than any other experiment (but likely to be overtaken by CDF, D0 and STAR soon, and by Alice, Atlas and CMS later)

Hydrogen Bubble Chamber Photograph 1970 CERN Photo

UA1 Experiment, CERN 1982: Discovery of the W Boson (Nobel Prize 1983)

BaBar Experiment at SLAC
– Taking data since 1999
– Now at 1 TB/day and rising rapidly
– Over 1 PB in total
– Matter-antimatter asymmetry
– Understanding the origins of our universe

CMS Experiment: “Find the Higgs” ~10 PB/year by 2010

Characteristics of HENP Experiments 1980 – present
– Typical data volumes: 10000n tapes (1 ≤ n ≤ 20)
– Large, complex detectors
– Large (approaching worldwide) collaborations: 500 – 2000 physicists
– Long (10 – 20 year) timescales
– God does play dice: high statistics (large volumes of data) needed for precise physics

HEP Data Models
HEP data models are complex!
– Typically hundreds of structure types (classes)
– Many relations between them
– Different access patterns
Most experiments now rely on OO technology
– OO applications deal with networks of objects
– Pointers (or references) are used to describe relations
[Diagram: an Event object pointing to a TrackList, Tracker and Calorimeter; Tracks pointing to a HitList of Hits]
Dirk Düllmann/CERN
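
The kind of object network described on this slide can be sketched in a few lines of C++. This is a minimal illustration only, not the actual BaBar or CMS schema; all class and member names are invented for the example.

```cpp
// Minimal sketch (not a real HEP schema) of an OO event data model:
// an Event owns a TrackList whose Tracks reference their HitLists,
// with relations expressed as pointers/references.
#include <iostream>
#include <memory>
#include <vector>

struct Hit { double x, y, z; };

struct HitList { std::vector<Hit> hits; };

struct Track {
    double momentum;
    std::shared_ptr<HitList> hitList;      // Track -> HitList relation
};

struct TrackList { std::vector<Track> tracks; };

struct Event {
    long id;
    std::shared_ptr<TrackList> trackList;  // Event -> TrackList relation
};

int main() {
    Event ev{42, std::make_shared<TrackList>()};
    Track t{1.2, std::make_shared<HitList>()};
    t.hitList->hits.push_back({0.1, 0.2, 0.3});
    ev.trackList->tracks.push_back(t);
    std::cout << "Event " << ev.id << " has "
              << ev.trackList->tracks.size() << " track(s)\n";
}
```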

Today’s HENP Data Management Challenges
Sparse access to objects in petabyte databases:
– Natural object size 100 bytes – 10 kbytes
– Disk (and tape) non-streaming performance dominated by latency
– Approach – current: instantiate richer database subsets for each analysis application
– Approaches – possible:
  Abandon tapes (use tapes only for backup, not for data-access)
  Hash data over physical disks
  Queue and reorder all disk access requests (see the sketch below)
  Keep the hottest objects in (tens of terabytes of) memory
  etc.
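
One of the possible approaches listed above, queueing and reordering disk access requests, amounts to sorting a batch of pending reads by byte offset so the device services them in one sweep rather than seeking back and forth. The sketch below is a toy illustration of that idea under those assumptions; the struct and function names are hypothetical and do not come from any SLAC code base.

```cpp
// Sketch: collect small random object reads into a batch and reorder
// them by ascending byte offset before issuing them to the disk.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct ReadRequest {
    std::uint64_t offset;   // byte offset of the object on the device
    std::uint32_t length;   // object size, typically 100 B - 10 kB
};

// Reorder a batch of queued requests into ascending-offset order.
std::vector<ReadRequest> reorderBatch(std::vector<ReadRequest> batch) {
    std::sort(batch.begin(), batch.end(),
              [](const ReadRequest& a, const ReadRequest& b) {
                  return a.offset < b.offset;
              });
    return batch;
}

int main() {
    std::vector<ReadRequest> queue = {
        {7000000, 4096}, {1000, 512}, {3500000, 2048}};
    for (const auto& r : reorderBatch(queue))
        std::cout << "read " << r.length << " bytes at offset "
                  << r.offset << '\n';
}
```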

Today’s HENP Data Management Challenges
Millions of Real or Virtual Datasets:
– BaBar has a petabyte database and over 60 million “collections” (lists of objects in the database that somebody found relevant)
– Analysis groups or individuals create new collections of new and/or old objects
– It is nearly impossible to make optimal use of existing collections and objects
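
A “collection” in the sense used above is essentially a named list of references into the object database. The C++ sketch below illustrates that idea, and how a new collection can be derived from an old one without copying any objects; it is not BaBar's actual collection format, and all names and the selection cut are invented.

```cpp
// Sketch: a collection is a named list of object identifiers pointing
// into the event store; new collections reuse existing object ids.
#include <iostream>
#include <map>
#include <string>
#include <vector>

using ObjectId = unsigned long long;   // opaque reference into the database

struct Collection {
    std::string name;                  // e.g. "charm-skim-2003" (invented)
    std::vector<ObjectId> members;     // objects somebody found relevant
};

int main() {
    std::map<std::string, Collection> catalogue;   // millions in practice
    catalogue["charm-skim-2003"] =
        Collection{"charm-skim-2003", {101, 205, 999}};

    // Build a tighter collection from an old one without copying objects.
    Collection refined{"charm-skim-2003-tight", {}};
    for (ObjectId id : catalogue["charm-skim-2003"].members)
        if (id % 2 == 1) refined.members.push_back(id);   // stand-in cut
    std::cout << refined.name << ": " << refined.members.size()
              << " objects\n";
}
```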

Latency and Speed – Random Access

Storage Characteristics – Cost

Storage Hosted on Network        | Cost per PB ($M), net after RAID, hot spares etc. | Cost per GB/s ($M), Streaming | Cost per GB/s ($M), Random access to typically accessed objects | Object Size
Good Memory *                    | – | – | – | bytes
Cheap Memory                     | – | – | – | bytes
Enterprise SAN (maxed out)       | – | – | – | kbytes
High-quality fibrechannel disk * | – | – | – | kbytes
Tolerable IDE disk               | – | – | – | kbytes
Robotic tape (STK 9480C)         | – | – | – | Mbytes
Robotic tape (STK 9940B) *       | – | – | – | Mbytes

* Current SLAC choice

Storage-Cost Notes
– Memory costs per TB are calculated: cost of memory + host system
– Memory costs per GB/s are calculated: (cost of typical memory + host system) / (GB/s of memory in this system)
– Disk costs per TB are calculated: cost of disk + server system
– Disk costs per GB/s are calculated: (cost of typical disk + server system) / (GB/s of this system)
– Tape costs per TB are calculated: cost of media only
– Tape costs per GB/s are calculated: (cost of typical server + drives + robotics only) / (GB/s of this server + drives + robotics)
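
The arithmetic behind these notes is simply component cost divided by usable capacity or by delivered bandwidth. The sketch below spells it out for a disk-plus-server building block using placeholder prices and bandwidths; these numbers are assumptions for illustration, not the 2004 figures behind the table.

```cpp
// Sketch of the cost arithmetic above, with placeholder numbers.
#include <iostream>

int main() {
    // Hypothetical building block: one server plus its attached disk.
    const double serverPlusDiskCost = 20000.0;  // dollars (placeholder)
    const double usableCapacityTB   = 4.0;      // TB after RAID, hot spares
    const double deliveredGBps      = 0.05;     // GB/s from this unit

    // Cost per PB = (cost of disk + server) / usable capacity, scaled to PB.
    const double costPerPB  = serverPlusDiskCost / usableCapacityTB * 1000.0;
    // Cost per GB/s = (cost of disk + server) / delivered bandwidth.
    const double costPerGBs = serverPlusDiskCost / deliveredGBps;

    std::cout << "Cost per PB:   $" << costPerPB  << '\n'
              << "Cost per GB/s: $" << costPerGBs << '\n';
}
```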

Storage Issues
Tapes:
– Still cheaper than disk for low I/O rates
– Disk becomes cheaper at, for example, 300 MB/s per petabyte for randomly accessed 500 MB files (see the crossover sketch below)
– Will SLAC ever buy new tape silos?
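
One way to read the 300 MB/s-per-petabyte crossover figure is to model the total cost of a tier as a capacity term plus a bandwidth term and solve for the bandwidth per petabyte at which tape and disk cost the same. The sketch below does that with placeholder costs, not SLAC's actual numbers, and is only one possible reading of the claim.

```cpp
// Sketch: break-even bandwidth per petabyte between tape and disk,
// assuming total cost = (capacity cost) + (bandwidth cost).
// All cost figures are placeholders for illustration.
#include <iostream>

int main() {
    // $/PB for capacity and $/(GB/s) for bandwidth, per tier (placeholders).
    const double tapeCostPerPB   = 1.0e6, tapeCostPerGBps = 40.0e6;
    const double diskCostPerPB   = 5.0e6, diskCostPerGBps =  4.0e6;

    // Setting tape and disk total costs equal and solving for GB/s per PB.
    const double crossoverGBsPerPB =
        (diskCostPerPB - tapeCostPerPB) / (tapeCostPerGBps - diskCostPerGBps);

    std::cout << "Disk becomes cheaper above "
              << crossoverGBsPerPB * 1000.0 << " MB/s per petabyte\n";
}
```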

Storage Issues
Disks:
– Random access performance is lousy, independent of cost, unless objects are megabytes or more
– Google people say: “If you were as smart as us you could have fun building reliable storage out of cheap junk”
– My Systems Group says: “Accounting for TCO, we are buying the right stuff”

Generic Storage Architecture
[Diagram: Client – Disk Server – Tape Server]

SLAC-BaBar Storage Architecture
[Diagram: Client – Disk Server – Tape Server tiers connected by an IP Network (Cisco)]
– Clients: 1500 dual CPU Linux, 900 single CPU Sun/Solaris
– Disk servers: 120 dual/quad CPU Sun/Solaris, 300 TB Sun FibreChannel RAID arrays
– Tape servers: 25 dual CPU Sun/Solaris, 40 STK 9940B drives, 6 STK 9840A drives, 6 STK Powderhorn silos, over 1 PB of data
– Software: Objectivity/DB object database + HEP-specific ROOT software; HPSS + SLAC enhancements to Objectivity and ROOT server code
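
The access pattern implied by this architecture is a two-tier cache: clients request container files from a disk server, which stages missing files from the tape system before serving them. The sketch below is a highly simplified illustration of that flow; it is not the Objectivity, ROOT or HPSS server code, and all names are invented.

```cpp
// Sketch of a two-tier (disk cache in front of tape) access flow.
#include <iostream>
#include <set>
#include <string>

struct TapeSystem {
    std::string stage(const std::string& file) {
        std::cout << "staging " << file << " from tape\n";
        return file;                       // now resident on disk
    }
};

struct DiskServer {
    std::set<std::string> residentFiles;   // disk cache of container files
    TapeSystem& tape;

    std::string serve(const std::string& file) {
        if (!residentFiles.count(file))    // cache miss: go to the tape tier
            residentFiles.insert(tape.stage(file));
        std::cout << "serving " << file << " to client\n";
        return file;
    }
};

int main() {
    TapeSystem tape;
    DiskServer disk{{}, tape};
    disk.serve("events/run1234.db");       // miss: staged, then served
    disk.serve("events/run1234.db");       // hit: served from disk
}
```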

Quantitatively (1)
Volume of data per experiment:
– Today: 1 petabyte
– 2009: 10 petabytes
Bandwidths:
– Today: ~1 Gbyte/s (read)
– 2009 (wish): ~1 Tbyte/s (read)
Access patterns:
– Today: sparse iteration, 5 kbyte objects
– 2009 (wish): sparse iteration/random, 100 byte objects

Quantitatively (2)
File systems:
– Fundamental unit is an object (100 – 5000 bytes)
– Files are WORM containers, of arbitrary size, for objects (see the sketch below)
– File systems should be scalable, reliable, secure and standard
Transport and remote replication:
– Today: a data volume equivalent to ~100% of all data is replicated, more or less painfully, on another continent
– 2009 (wish): painless worldwide replication and replica management
Metadata management:
– Today: a significant data-management problem (e.g. 60 million collections)
– 2009 (wish): miracles
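
The “files are WORM containers for objects” point above can be pictured as an append-once byte store plus an (offset, length) index per object. The sketch below is purely illustrative of that layout; it is not an actual HEP container or file format.

```cpp
// Sketch: a write-once container file holding many small objects,
// located later through an (offset, length) index.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct IndexEntry { std::uint64_t offset; std::uint32_t length; };

struct WormContainer {
    std::string bytes;               // stands in for the on-disk file
    std::vector<IndexEntry> index;   // one entry per stored object

    std::size_t append(const std::string& object) {     // write once
        index.push_back({bytes.size(), (std::uint32_t)object.size()});
        bytes += object;
        return index.size() - 1;     // object id within this container
    }
    std::string read(std::size_t id) const {             // read many times
        const auto& e = index[id];
        return bytes.substr(e.offset, e.length);
    }
};

int main() {
    WormContainer c;
    std::size_t id = c.append("serialized event object");
    std::cout << c.read(id) << '\n';
}
```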

Quantitatively (3)
Heterogeneity and data transformation:
– Today: not considered an issue … 99.9% of the data are only accessible to and intelligible by the members of a collaboration
– Tomorrow: we live in terror of being forced to make data public (because it is unintelligible, and so the user-support costs would be devastating)
Ontology, Annotation, Provenance:
– Today: we think we know what provenance means
– 2009 (wish):
  Know the provenance of every object
  Create new objects and collections making optimal use of all pre-existing objects