Scientific Computing at SLAC. Richard P. Mount, Director: Scientific Computing and Computing Services. DOE Review, June 15, 2005.

Scientific Computing: the relationship between Science and the components of Scientific Computing

Scientific Computing: SLAC's goals for leadership in Scientific Computing; collaboration with Stanford and Industry

Scientific Computing: current SLAC leadership and recent achievements in Scientific Computing

SLAC Scientific Computing Drivers
BaBar (data-taking ends December 2008)
– The world's most data-driven experiment
– Data analysis challenges until the end of the decade
KIPAC
– From cosmological modeling to petabyte data analysis
Photon Science at SSRL and LCLS
– Ultrafast science, modeling and data analysis
Accelerator Science
– Modeling electromagnetic structures (PDE solvers in a demanding application)
The Broader US HEP Program (aka LHC)
– Contributes to the orientation of SLAC Scientific Computing R&D

SLAC-BaBar Computing Fabric
– IP network (Cisco) connecting the client, disk-server and tape-server tiers
– Client: 1700 dual-CPU Linux and 400 single-CPU Sun/Solaris nodes
– Disk server: 120 dual/quad-CPU Sun/Solaris nodes; ~400 TB of Sun FibreChannel RAID arrays
– Tape server: 25 dual-CPU Sun/Solaris nodes; 40 STK 9940B and 6 STK 9840A drives; 6 STK Powderhorn silos; over 1 PB of data
– Software: HEP-specific ROOT software (Xrootd) + Objectivity/DB object database; HPSS; SLAC enhancements to ROOT and Objectivity server code

BaBar Computing at SLAC
– Farm processors (5 generations, 3700 CPUs)
– Servers (the majority of the complexity)
– Disk storage (2+ generations, 400+ TB)
– Tape storage (40 drives)
– Network "backplane" (~26 large switches)
– External network

Rackable Intel P4 Farm (bought in 2003/4): 384 machines, 2 per rack unit, each with dual 2.6 GHz CPUs

Disks and Servers: 1.6 TB usable per tray, ~160 trays bought in 2003/4

Tape Drives
– 40 STK 9940B (200 GB) drives
– 6 STK 9840 (20 GB) drives
– 6 STK silos (capacity 30,000 tapes)

BaBar Farm-Server Network: ~26 Cisco 65xx switches

SLAC External Network (June 14, 2005)
– 622 Mbits/s to ESnet
– 1000 Mbits/s to Internet2
– ~300 Mbits/s average traffic
– Two 10 Gbits/s wavelengths to ESnet, UltraScience Net/NLR coming in July
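As a back-of-envelope check (arithmetic only, not a figure from the slides), the quoted ~300 Mbits/s average corresponds to roughly 3 TB of external traffic per day:

    # Back-of-envelope: traffic volume implied by the ~300 Mbit/s average
    # quoted on the slide (illustrative arithmetic only).
    avg_rate_bits_per_s = 300e6
    bytes_per_day = avg_rate_bits_per_s / 8 * 86_400
    print(f"~{bytes_per_day / 1e12:.1f} TB/day")          # ~3.2 TB/day
    print(f"~{bytes_per_day * 30 / 1e12:.0f} TB/month")   # ~97 TB/month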

Research Areas (1) (funded by DOE-HEP, DOE SciDAC and DOE-MICS)
Huge-memory systems for data analysis (SCCS Systems group and BaBar)
– Expected major growth area (more later)
Scalable data-intensive systems (SCCS Systems and Physics Experiment Support groups)
– "The world's largest database" (OK, not really a database any more)
– How to maintain performance with data volumes growing like "Moore's Law"?
– How to improve performance by factors of 10, 100, 1000, …? (intelligence plus brute force)
– Robustness, load balancing and troubleshootability in 1000-box systems
– Astronomical data analysis on a petabyte scale (in collaboration with KIPAC)

Research Areas (2) (funded by DOE-HEP, DOE SciDAC and DOE-MICS)
Grids and security (SCCS Physics Experiment Support, Systems and Security groups)
– PPDG: building the US HEP Grid – OSG
– Security in an open scientific environment
– Accounting, monitoring, troubleshooting and robustness
Network research and stunts (SCCS Network group – Les Cottrell et al.)
– Land-speed record and other trophies
Internet monitoring and prediction (SCCS Network group)
– IEPM: Internet End-to-End Performance Monitoring (~5 years)
– SLAC is the/a top user of ESnet and the/a top user of Internet2 (Fermilab doesn't do so badly either)
– INCITE: Edge-based Traffic Processing and Service Inference for High-Performance Networks

Research Areas (3) (funded by DOE-HEP, DOE SciDAC and DOE-MICS)
GEANT4: simulation of particle interactions in million- to billion-element geometries (SCCS Physics Experiment Support group – M. Asai, D. Wright, T. Koi, J. Perl …)
– BaBar, GLAST, LCD …
– LHC program
– Space
– Medical
PDE solving for complex electromagnetic structures (Kwok Ko's Advanced Computing Department + SCCS clusters)

Growing Competences
Parallel computing (MPI …)
– Driven by KIPAC (Tom Abel) and ACD (Kwok Ko)
– SCCS competence in parallel computing (= Alf Wachsmann currently)
– MPI clusters and SGI SSI system
Visualization
– Driven by KIPAC and ACD
– SCCS competence is currently experimental-HEP focused (WIRED, HEPREP …)
– (A polite way of saying that growth is needed)

The PetaCache Project: A Leadership-Class Facility for Data-Intensive Science
Richard P. Mount, Director, SLAC Computing Services; Assistant Director, SLAC Research Division
Washington DC, April 13, 2004

Technology Issues in Data Access
– Latency
– Speed/bandwidth
– (Cost)
– (Reliability)

Latency and Speed – Random Access

Latency and Speed – Random Access

The Strategy
There is significant commercial interest in an architecture including data-cache memory.
But: from interest to delivery will take 3-4 years.
And: applications will take time to adapt, not just codes but their whole approach to computing, to exploit the new architecture.
Hence: two phases.
1. Development phase (years 1, 2, 3)
– Commodity hardware taken to its limits
– BaBar as principal user, adapting existing data-access software to exploit the configuration
– BaBar/SLAC contribution to hardware and manpower
– Publicize results
– Encourage other users
– Begin collaboration with industry
2. Production-Class Facility (year 3 onwards)
– Optimized architecture
– Strong industrial collaboration
– Wide applicability

PetaCache: The Team
– David Leith, Richard Mount, PIs
– Randy Melen, project leader
– Bill Weeks, performance testing
– Andy Hanushevsky, xrootd
– Systems group members
– Network group members
– BaBar (Stephen Gowdy)

Development Machine Design Principles
– Attractive to scientists: big enough data-cache capacity to promise revolutionary benefits; 1000 or more processors
– Processor to (any) data-cache memory latency < 100 µs
– Aggregate bandwidth to data-cache memory > 10 times that to a similar-sized disk cache
– Data-cache memory should be 3% to 10% of the working set (approximately 10 to 30 terabytes for BaBar)
– Cost effective, but acceptably reliable: constructed from carefully selected commodity components
– Cost no greater than (cost of commodity DRAM) + 50%
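The sizing rule above implies a BaBar working set on the order of 300 TB; a quick check of that arithmetic, using only the percentages and the 10-30 TB range quoted on the slide:

    # Cache sizing from the slide: data-cache memory should be 3%-10% of the
    # working set, quoted as roughly 10-30 TB for BaBar.
    cache_tb_low, cache_tb_high = 10, 30
    frac_low, frac_high = 0.03, 0.10
    # Implied working-set size, consistent at both ends of the range:
    print(f"working set ~ {cache_tb_low / frac_low:.0f}-{cache_tb_high / frac_high:.0f} TB")
    # -> ~333 TB at the low end, ~300 TB at the high end, i.e. on the order of 300 TB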

Development Machine Design Choices
– Intel/AMD server mainboards with 4 or more ECC DIMM slots per processor
– 2 GByte DIMMs (4 GByte too expensive this year)
– 64-bit operating system and processor: favors Solaris and AMD Opteron
– Large (500+ port) switch fabric: large IP switches are most cost-effective
– Use of the ($10M+) BaBar disk/tape infrastructure, augmented for any non-BaBar use

Development Machine Deployment – Proposed Year 1 (schematic)
– 650 nodes, each with 2 CPUs and 16 GB memory
– Memory-interconnect and storage-interconnect switch fabrics (Cisco/Extreme/Foundry)
– > 100 disk servers, provided by BaBar
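Cross-checking this proposal against the design choices above (simple arithmetic on figures already given): two processors with four 2 GByte DIMM slots each give 16 GB per node, and 650 such nodes give roughly 10 TB of aggregate data-cache memory, the low end of the 10-30 TB target.

    # Year-1 proposal: aggregate data-cache memory from the slide's own figures.
    dimm_gb, slots_per_cpu, cpus_per_node = 2, 4, 2
    node_gb = dimm_gb * slots_per_cpu * cpus_per_node      # 16 GB per node
    nodes = 650
    print(f"{node_gb} GB/node x {nodes} nodes = {node_gb * nodes / 1024:.1f} TB")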

Development Machine Deployment – Currently Funded (schematic)
– Data servers (PetaCache, MICS funding): Sun V20z nodes, each with 2 Opteron CPUs and 16 GB memory; up to 2 TB total memory; Solaris; behind a Cisco switch
– Clients (existing HEP-funded BaBar systems): up to 2000 nodes, each with 2 CPUs and 2 GB memory; Linux; connected via Cisco switches

Latency (1) – Ideal: Client Application → Memory

Latency (2) – Current reality: Client Application → Data-Server-Client → OS → TCP Stack → NIC → Network Switches → NIC → TCP Stack → OS → Data Server → OS → File System → Disk

Latency (3) – Immediately practical goal: the same path, but with the data server reading from memory instead of the OS/file-system/disk path (Client Application → Data-Server-Client → OS → TCP Stack → NIC → Network Switches → NIC → TCP Stack → OS → Data Server → Memory)
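To see why removing the disk and file-system hops matters, here is a minimal latency-budget sketch; all component latencies are assumed illustrative values, not SLAC measurements, and the point is only that the millisecond-scale random disk access dominates everything else:

    # Illustrative latency budget (all numbers are assumptions for the sketch,
    # not measurements from the slides).
    PATH_DISK = {
        "client app + data-server-client": 10e-6,
        "client OS/TCP/NIC": 20e-6,
        "network switches": 10e-6,
        "server OS/TCP/NIC": 20e-6,
        "server file system": 50e-6,
        "disk seek + rotation": 8e-3,      # random access dominates
    }
    PATH_MEMORY = {k: v for k, v in PATH_DISK.items()
                   if k not in ("server file system", "disk seek + rotation")}
    PATH_MEMORY["server memory access"] = 1e-6

    for name, path in (("disk-backed", PATH_DISK), ("memory-backed", PATH_MEMORY)):
        total = sum(path.values())
        print(f"{name}: ~{total * 1e6:.0f} microseconds per random read")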

Latency Measurements (Client and Server on the same switch)
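A minimal sketch of how such a client-to-server round-trip measurement could be taken; the hostname, port and request format below are placeholders, not the actual test setup:

    import socket, time, statistics

    HOST, PORT = "data-server.example.org", 5000   # placeholder endpoint
    N = 1000

    samples = []
    with socket.create_connection((HOST, PORT)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # avoid Nagle delays
        for _ in range(N):
            t0 = time.perf_counter()
            s.sendall(b"GET 4096\n")     # placeholder request for a small block
            _ = s.recv(4096)             # read (part of) the reply
            samples.append(time.perf_counter() - t0)

    print(f"median round-trip: {statistics.median(samples) * 1e6:.0f} microseconds")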

Development Machine Deployment – Likely "Low-Risk" Next Step (schematic)
– Data servers: 80 nodes, each with 8 Opteron CPUs and 128 GB memory; up to 10 TB total memory; Solaris; behind a Cisco switch
– Clients: up to 2000 nodes, each with 2 CPUs and 2 GB memory; Linux; connected via a Cisco switch
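The same arithmetic for this step (again only the slide's own numbers): 80 nodes at 128 GB each give about 10 TB, the same aggregate data-cache memory as the Year-1 proposal but concentrated in far fewer, larger-memory servers.

    # "Low-risk" next step: fewer, fatter nodes with the same aggregate memory.
    nodes, node_gb = 80, 128
    print(f"{nodes} x {node_gb} GB = {nodes * node_gb / 1024:.0f} TB total data-cache memory")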

Development Machine – Complementary "Higher-Risk" Approach
Add flash-memory based subsystems
– Quarter to half the price of DRAM
– Minimal power and heat
– Persistent
– 25 µs chip-level latency (but hundreds of µs latency in consumer devices)
– Block-level access (~1 kbyte)
– Rated life of 10,000 writes for two-bit-per-cell devices (NB BaBar writes FibreChannel disks < 100 times in their entire service life)
Exploring the necessary hardware/firmware/software development with PantaSys Inc.
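A rough endurance check using the slide's numbers plus one assumed usage rate: at one full device rewrite per day (an assumption chosen purely for illustration), a 10,000-write rating lasts decades, and the read-mostly BaBar workload rewrites devices far less often than that.

    # Write-endurance arithmetic from the slide's numbers (the rewrite rate is an assumption).
    rated_writes = 10_000           # two-bit-per-cell flash, per the slide
    rewrites_per_day = 1            # assumed one full rewrite per day, for illustration
    print(f"{rated_writes / rewrites_per_day / 365:.0f} years at one full rewrite/day")

    babar_disk_rewrites = 100       # slide: < 100 full writes over a disk's service life
    print(f"BaBar workload: < {babar_disk_rewrites} full writes per device lifetime")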

Object-Serving Software
AMS and Xrootd (Andy Hanushevsky/SLAC)
– Optimized for read-only access
– Make 1000s of servers transparent to user code
– Load balancing
– Automatic staging from tape
– Failure recovery
Can allow BaBar to start getting benefit from a new data-access architecture within months, without changes to user code
Minimizes the impact of hundreds of separate address spaces in the data-cache memory
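A conceptual sketch, not xrootd's actual protocol or API, of the redirector idea that makes thousands of servers transparent to user code: the client asks a front end which server currently holds (or should stage) a file, and retries elsewhere on failure. All names and the selection policy here are illustrative.

    import random

    # Illustrative catalogue: which servers currently have each file in cache.
    CATALOGUE = {
        "/store/babar/run42.root": ["srv03", "srv17"],
        "/store/babar/run43.root": [],          # not staged yet
    }
    ALL_SERVERS = [f"srv{i:02d}" for i in range(1, 21)]

    def redirect(path):
        """Pick a server for `path`: prefer one that has it, else trigger staging."""
        holders = CATALOGUE.get(path, [])
        if holders:
            return random.choice(holders)       # simple load spreading
        target = random.choice(ALL_SERVERS)     # stand-in for "stage from tape here"
        CATALOGUE.setdefault(path, []).append(target)
        return target

    def open_with_failover(path, attempts=3):
        for _ in range(attempts):
            server = redirect(path)
            try:
                # Stand-in for a real open(); a failing server would raise OSError.
                return f"opened {path} on {server}"
            except OSError:
                CATALOGUE[path].remove(server)  # drop the failed replica and retry
        raise RuntimeError(f"could not open {path}")

    print(open_with_failover("/store/babar/run42.root"))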

Summary: SLAC's goals for leadership in Scientific Computing; collaboration with Stanford and Industry