HENP Computing at BNL
Torre Wenaus, STAR Software and Computing Leader, BNL
RHIC & AGS Users Meeting, Asilomar, CA, October 21, 1999
Content
- Bruce's talk; ATLAS; Linux; Mock Data Challenges; D0
- Focus on areas really changing the scale of HENP computing at BNL
- Mount's APOGEE talk; security; software 'attracting good people'
- ROOT; PHENIX's online threaded Objectivity; MySQL
- RIKEN computing center; ESnet; Open Science
Historical Perspective
- Prior to RHIC, BNL hosted many small to modest scale AGS experiments
- With RHIC, BNL moves into the realm of large collider detectors
  - A computing task at a scale similar to SLAC, Fermilab, CERN, etc.
  - Has required a dramatic change in the scale of HENP computing at BNL
- RHIC Computing Facility (RCF) established Feb 1997 to supply primary (non-simulation) RHIC computing needs
  - Successful operations in two 'Mock Data Challenge' production stress tests and in the summer 1999 engineering run
  - First physics run in early 2000
- Presence of RCF a strong factor in the selection of BNL as the principal US computing site for the CERN LHC ATLAS experiment
  - Requirements and computing plan similar to RCF
  - Will operate in close coordination with RCF
  - LHC and ATLAS operations begin in 2005
This Talk
- Will focus on the major growth of HENP computing as a BNL activity brought by these new programs
  - RHIC computing at BNL
  - ATLAS computing at BNL
- Brief mention of some other programs
- Conclusions
- Thanks to Bruce Gibbard, RHIC Computing Facility head, and others (indicated on slides) for materials
RHIC Computing at RCF
- Four experiments: PHENIX, STAR, PHOBOS, BRAHMS
  - 4:4:2:1 relative scales of computing task
- Aggregate raw data recording rate of ~60 MBytes/sec
- Annual raw data volume ~600 TBytes
  - NB: size of global WWW content estimated at 7 TBytes
- Event reconstruction: 13,000 SPECint95 (450 MHz PC = 18 SPECint95); see the sizing sketch after this list
- Event filtering (data mining) and physics analysis: 7,000 SPECint95
  - 'Mining' interesting data off tape for physics analyses: aggregate access rates of ~200 MBytes/sec
  - Iterative, interactive analysis of disk-based data by hundreds of users: aggregate access rates of ~1000 MBytes/sec
- Software development and distribution
  - Hundreds of developers; many 100k lines of code per experiment
  - RCF is the primary development and distribution (AFS) site
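A back-of-envelope check of what these CPU figures imply in farm terms, using only the numbers above (and ignoring I/O, scheduling and efficiency overheads, so this is a rough sketch rather than the facility's actual procurement arithmetic):

```latex
N_{\text{reco}} \approx \frac{13\,000\ \text{SPECint95}}{18\ \text{SPECint95/PC}} \approx 720\ \text{PCs},
\qquad
N_{\text{analysis}} \approx \frac{7\,000}{18} \approx 390\ \text{PCs}
```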
Computing Strategies
- Extensive use of community/commercial/commodity products
  - Hardware and software
  - Increasing use of open software (e.g. Linux, MySQL database)
- Exploit the 'embarrassingly parallel' nature of HENP computing
  - Farms of loosely coupled processors (Linux PCs on Ethernet)
  - Limited use of Sun machines for I/O intensive analysis
- Hierarchical storage management (disk + tape robot/shelf) and flexible partitioning of event data based on access characteristics
  - Optimize storage cost and access latency to interesting data
- Extensive use of OO software technologies
  - Adopted by all four RHIC experiments, ATLAS, other BNL HENP software efforts (e.g. D0), and virtually all other forthcoming experiments
  - Primarily C++; some Java
  - Object I/O: Objectivity commercial OO database and the ROOT community (CERN) developed tool
Event Data Storage and Management
- Major software challenge: event data storage and management
- ROOT: HENP community tool (from CERN)
  - Used by all RHIC experiments for event data storage (a minimal usage sketch follows this list)
- Objectivity: commercial object database
  - Used by PHENIX for the conditions database
  - RCF did the Linux port
- Relational databases (MySQL, ORACLE)
  - Many cataloguing applications in the experiments and RCF
  - MySQL adopted by STAR as a complement to ROOT for the event store, replacing Objectivity
- Grand Challenge Architecture
  - Managed access to HPSS-resident data, particularly for data mining
  - LBNL-led with ANL, BNL participation; deployment at RCF
- Particle Physics Data Grid: transparent wide-area data processing
  - US HENP 'Next Generation Internet' project, primarily LHC directed
  - RCF/RHIC will act as an early testbed
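To illustrate ROOT-based event storage as described above, here is a minimal sketch of writing event summaries to a ROOT TTree. The event class and branch names are hypothetical, simplified illustrations, not an experiment's real data model.

```cpp
// Minimal ROOT event-store sketch (hypothetical event class; not an
// experiment's real data model). Build against ROOT, e.g.:
//   g++ write_events.cxx $(root-config --cflags --libs)
#include "TFile.h"
#include "TTree.h"

// Hypothetical, simplified event summary; real RHIC events carry far more.
struct EventSummary {
  int   run;
  int   event;
  int   nTracks;
  float vertexZ;   // cm
};

int main() {
  TFile out("events.root", "RECREATE");        // one file of the event store
  TTree tree("EventTree", "Simplified event summaries");

  EventSummary evt;
  // One branch per field; ROOT handles the on-disk layout and compression.
  tree.Branch("run",     &evt.run,     "run/I");
  tree.Branch("event",   &evt.event,   "event/I");
  tree.Branch("nTracks", &evt.nTracks, "nTracks/I");
  tree.Branch("vertexZ", &evt.vertexZ, "vertexZ/F");

  for (int i = 0; i < 1000; ++i) {             // stand-in for real data taking
    evt.run     = 1;
    evt.event   = i;
    evt.nTracks = 4000 + (i % 100);
    evt.vertexZ = 0.1f * (i % 50);
    tree.Fill();
  }
  tree.Write();
  out.Close();
  return 0;
}
```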
ATLAS Computing at BNL
- A Toroidal LHC ApparatuS: one of 4 experiments at the LHC, a 14 TeV pp collider
- ATLAS computing at CERN estimated to be >10 times that of RHIC
  - Augmented by regional centers outside CERN, with a total scale similar to the CERN installation
- US ATLAS will have one primary 'Tier 1' regional center, at BNL
  - ~20% of the CERN facility; ~2x RCF
- BNL also manages the US ATLAS construction project; ~20% of the full ATLAS detector
- Simulation, data mining, physics analysis, and software development will be the primary missions of the BNL Tier 1 center
ATLAS: Commonality and Synergy with RHIC
- Qualitative requirements and Tier 1 quantitative requirements similar to RCF
  - Exploit economies of scale in hardware and software
  - Share technical expertise
  - Learn from and build on RHIC computing as a 'real world testbed'
- Commonality:
  - Complete coincidence of supported platforms: Intel/Linux processor farms, Sun/Solaris
  - Objectivity -- and shared concerns over Objectivity!
  - HPSS -- and shared concerns over HPSS!
  - Data mining, Grand Challenge
  - ROOT as an interim analysis tool
  - Particle Physics Data Grid
Current Status
- RHIC RCF
  - Hardware for first year physics in place, except for some tape store hardware (5 drives; IBM server upgrades)
  - Extensive testing and tuning to be done: performance, reliability, robustness
  - All year 1 requirements satisfied except for disk capacity (later augmentation an option; not critically needed now)
  - In production use by the experiments
  - Positive review by Technical Advisory Committee just concluded
- US ATLAS Tier 1 center
  - Initial facility in place, usage by US ATLAS ramping up
  - Operating out of RCF
  - ATLAS software installed and operating
  - More hardware on the way; further increases at the proposal stage
  - Dedicated manpower ramping up
Conclusions
- RHIC and RCF have brought BNL to the forefront of HENP computing
  - Computing scale, imminent operation, mainstream approaches and community involvement make RHIC computing an important testbed for today's technologies and a stepping stone to the next generation
  - Performance to date gives confidence for RHIC operations
  - Strong software efforts at BNL in the experiments
- BNL, as host of the US ATLAS Tier 1 center, will be a leading HENP computing center in the years to come
  - Leveraging the facilities, expertise and experience of RCF and the RHIC program
  - Facility installation to be complemented by a software development effort integrated with the local US ATLAS group
- Programs well supported by Brookhaven as part of an increased attention to scientific computing at the lab
- Lots of potential for involvement!
RIKEN QCDSP Parallel Computer
- Special purpose massively parallel machine based on DSPs for quantum field theory calculations
- 4D mesh with nearest neighbor connections
- 12,288 nodes, 600 Gflops; 192 motherboards with 64 processors each
- Custom designed and built; collaboration centered at Columbia
- RIKEN BNL Research Center
CDIC - Center for Data Intensive Computing
- Newly established BNL Center developing collaborative projects
- Close ties to SUNY at Stony Brook
- Some of the HENP projects proposed or begun:
  - RHIC Visualization: newly established collaboration with Stony Brook to develop dynamic 3D visualization tools for RHIC interactions and a 'beam's eye' view
  - RHIC Computing: proposed collaboration with IBM to use idle PC cycles for RHIC physics simulation (generator level)
  - Data Mining: new project studying the application of 'rough sets' data mining concepts to RHIC event classification and feature extraction
  - Accelerator Design: proposed parallel simulation of beam dynamics for accelerator design and optimization
Visualization
- RHIC Au-Au collision animation (QuickTime movie available on the web)
- PHENIX event simulation
ESnet Utilization
Open Software/Open Science Conference, BNL, Oct 2, 1999
- Educate scientists on open source projects
- Stimulate open source applications in science
- Present science applications to open source developers
HENP Computing Challenges (Craig Tull, LBNL)
STAR at RHIC
- RHIC: Relativistic Heavy Ion Collider at Brookhaven National Laboratory
  - Colliding Au-Au nuclei at 200 GeV/nucleon
  - Principal objective: discovery and characterization of the Quark Gluon Plasma
  - Additional spin physics program in polarized p-p
  - Engineering run 6-8/99; first year physics run 1/00
- STAR experiment
  - One of two large 'HEP-scale' experiments at RHIC, >400 collaborators each (PHENIX is the other)
  - Heart of the experiment is a Time Projection Chamber (TPC) drift chamber (operational), together with a Si tracker (year 2) and an electromagnetic calorimeter (staged over years 1-3)
  - Hadrons, jets, electrons and photons over large solid angle
The STAR Computing Task
- Data recording rate of 20 MB/sec; ~12 MB raw data per event (~1 Hz)
- ~4000+ tracks/event recorded in the tracking detectors (factor of 2 uncertainty in physics generators)
- High statistics per event permit event-by-event measurement and correlation of QGP signals such as strangeness enhancement, J/psi attenuation, high-Pt parton energy loss modifications in jets, and global thermodynamic variables (e.g. Pt slope correlated with temperature)
- 17M Au-Au events (equivalent) recorded in a nominal year (see the data-volume estimate after this list)
- Relatively few but highly complex events requiring large processing power
- Wide range of physics studies: ~100 concurrent analyses in ~7 physics working groups
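A rough consistency check, using only the per-event size and event count above, reproduces the ~200 TB/yr raw data volume quoted in the requirements slide later in the talk:

```latex
17\times10^{6}\ \text{events/yr} \times 12\ \text{MB/event}
\approx 2\times10^{8}\ \text{MB}
\approx 200\ \text{TB/yr}
```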
RHIC/STAR Computing Facilities
- Dedicated RHIC computing center at BNL, the RHIC Computing Facility
  - Data archiving and processing for reconstruction and analysis
  - Three production components: reconstruction (CRS) and analysis (CAS) services and the managed data store (MDS)
  - 10,000 (CRS) + 7,500 (CAS) SPECint95 CPU
  - ~50 TB disk, 270 TB robotic tape, 200 MB/s I/O bandwidth, managed by the High Performance Storage System (HPSS) developed by a DOE/commercial consortium (IBM et al.)
  - Current scale: ~2500 Si95 CPU, 3 TB disk for STAR
- Limited resources require the most cost-effective computing possible
  - Commodity Intel farms (running Linux) for all but I/O intensive analysis (Sun SMPs)
- Smaller outside resources:
  - Simulation and analysis facilities at outside computing centers
  - Limited physics analysis computing at home institutions
Implementation of RHIC Computing Model: Incorporation of Offsite Facilities
[Diagram of the RHIC computing model showing a T3E, an HPSS tape store and an SP2 linked with offsite facilities at Berkeley, Japan, MIT, and many universities. Credit: Doug Olson, LBNL]
HENP Computing: Today's Realities
- Very Large Data Volumes
- Large, Globally Distributed Collaborations
- Long Lived Projects (>15 years)
- Large (1-2M LOC), Complex Analyses
- Distributed, Heterogeneous Systems
- Very Limited Computing Manpower
- Most Computing Manpower are not Professionals
  - Not necessarily a bad thing! Good understanding of, and direct interest in, the problem among developers
- Reliance on Open and Commercial Software & Standards
- Evolving Computer Industry & Technology
Event Data Storage
- Management of Petabyte data volumes is arguably the most difficult task in HENP computing today
  - Solutions must map effectively onto OO software technology
- Intensive community effort in object database technology in the last 5 years
  - Focus on Objectivity, the only commercial product that scales to PBytes
  - Great early promise; strong potential to minimize in-house development and match well the OO architecture of experiments
  - Reality has been more difficult: development effort much greater than expected, and mixed results on scalability
- In parallel with Objectivity, community solutions have also been developed
  - Particularly the ROOT system from CERN, supporting I/O of C++ based object models
  - When complemented by a relational database, provides a robust and scalable solution that integrates well with experiment software (see the catalogue sketch after this list)
- The jury is still out
  - STAR and some other experiments have dropped Objectivity in favor of ROOT+RDBMS
  - BaBar at SLAC is in production with Objectivity, and is working through the problems
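To illustrate how a relational database can complement ROOT files in the way described above, here is a minimal catalogue-lookup sketch using the MySQL C API. The database, table and column names are hypothetical illustrations, not STAR's actual schema.

```cpp
// Minimal file-catalogue lookup sketch using the MySQL C API.
// The schema and connection parameters are hypothetical illustrations.
// Build e.g. with:  g++ catalogue.cxx -lmysqlclient
#include <mysql/mysql.h>
#include <cstdio>

int main() {
  MYSQL* db = mysql_init(NULL);
  // Connection parameters are placeholders.
  if (!mysql_real_connect(db, "dbhost", "reader", "passwd", "filecatalog",
                          0, NULL, 0)) {
    std::fprintf(stderr, "connect failed: %s\n", mysql_error(db));
    return 1;
  }

  // Ask the catalogue which ROOT files hold events from a given run range.
  const char* query =
      "SELECT filename, nevents FROM event_files "
      "WHERE run BETWEEN 1200 AND 1210 AND production = 'P99'";
  if (mysql_query(db, query) != 0) {
    std::fprintf(stderr, "query failed: %s\n", mysql_error(db));
    mysql_close(db);
    return 1;
  }

  MYSQL_RES* result = mysql_store_result(db);
  MYSQL_ROW row;
  while ((row = mysql_fetch_row(result)) != NULL) {
    // Each filename could then be opened with ROOT's TFile for analysis.
    std::printf("file=%s  nevents=%s\n", row[0], row[1]);
  }
  mysql_free_result(result);
  mysql_close(db);
  return 0;
}
```

This reflects the division of labour the slide describes: the RDBMS holds the catalogue and metadata, while the ROOT files hold the event objects themselves.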
Data Management
- Coupled to the event data storage problem, but distinct, is the problem of managing effective archiving and retrieval of the data
- A hierarchical storage management system is required, capable of managing
  - Terabytes of disk-resident rapid-access data
  - Petabytes of tape-resident data with medium latency access
- Industry offers very few solutions today
  - Only one has been identified: HPSS
  - Deployed at RCF (and many other sites), successfully but with caveats
    - Demands high manpower levels for development and 24x7 support
    - Still under development, particularly in HENP applications, with stability and robustness issues
- Community HENP solutions are under development in this area as well (Fermilab, DESY)
Distributed Computing
- In current generation experiments such as RHIC, and to a much greater degree in the next generation such as LHC, distributed computing is essential
  - Fully empowering physicists not at the experimental site to participate in development and analysis, with effective access to the data
  - Distributing the computing and data management task among several large sites: the central site can no longer afford to support computing on its own
- Near and long term efforts are underway to address the need
  - e.g. the NOVA project at BNL (Networked Object-based enVironment for Analysis): a small project to address immediate and near term needs (STAR/RHIC, ATLAS, possibly others)
  - Large, LHC directed projects such as the Particle Physics Data Grid project and the MONARC regional center modelling project
Computing Requirements
- Nominal year processing and data volume requirements:
- Raw data volume: 200 TB
- Reconstruction: 2800 Si95 total CPU, 30 TB DST data (see the worked estimate after this list)
  - 10x event size reduction from raw to reco
  - 1.5 reconstruction passes/event assumed
- Analysis: 4000 Si95 total analysis CPU, 15 TB micro-DST data
  - 1-1000 Si95-sec/event per MB of DST depending on the analysis: a wide range, from CPU-limited to I/O limited
  - ~100 active analyses, 5 passes per analysis
  - micro-DST volumes from 0.1 to several TB
- Simulation: 3300 Si95 total including reconstruction, 24 TB
- Total nominal year data volume: 270 TB
- Total nominal year CPU: 10,000 Si95
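The DST volume follows directly from the figures above (a simple consistency check, not an independent estimate):

```latex
V_{\text{DST}} \approx \frac{200\ \text{TB}}{10} \times 1.5\ \text{passes} = 30\ \text{TB}
```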
STAR Computing Facilities: RCF
- Data archiving and processing for reconstruction and analysis (not simulation; done offsite)
- General user services (email, web browsing, etc.)
- Three production components: reconstruction and analysis services (CRS, CAS) and the managed data store (MDS)
- Nominal year scale:
  - 10,000 (CRS) + 7,500 (CAS) SPECint95 CPU
    - Intel farms running Linux for almost all processing; limited use of Sun SMPs for I/O intensive analysis
    - Cost-effective, productive, well-aligned with the HENP community
  - ~50 TB disk, 270 TB robotic tape, 200 MB/s, managed by HPSS
- Current scale (when new procurements are in place):
  - ~2500 Si95 CPU, 3 TB disk for STAR
  - ~8 TB of data currently in HPSS
Computing Facilities
- Dedicated RHIC computing center at BNL, the RHIC Computing Facility
- Data archiving and processing for reconstruction and analysis
  - Simulation done offsite
- 10,000 (reco) + 7,500 (analysis) Si95 CPU
  - Primarily Linux; some Sun for I/O intensive analysis
- ~50 TB disk, 270 TB robotic tape, 200 MB/s, managed by HPSS
- Current scale (STAR allocation, ~40% of total):
  - ~2500 Si95 CPU
  - 3 TB disk
- Support for (a subset of) physics analysis computing at home institutions
Mock Data Challenges
- MDC1: Sep/Oct '98
  - >200k (2 TB) events simulated offsite; 170k reconstructed at RCF (goal was 100k)
  - Storage technologies exercised (Objectivity, ROOT)
  - Data management architecture of the Grand Challenge project demonstrated
  - Concerns identified: HPSS, AFS, farm management software
- MDC2: Feb/Mar '99
  - New ROOT-based infrastructure in production
  - AFS improved; HPSS improved but still a concern
  - Storage technology finalized (ROOT)
  - New problem area, STAR program size, addressed in new procurements and OS updates (more memory, swap)
- Both data challenges:
  - Effective demonstration of productive, cooperative, concurrent (in MDC1) production operations among the four experiments
  - Bottom line verdict: the facility works, and should perform in physics data taking and analysis
Offline Software Environment
- Current software base is a mix of Fortran (55%) and C++ (45%), from ~80%/20% (~95%/5% in non-infrastructure code) in 9/98
  - New development, and all post-reco analysis, in C++
- Framework built over ROOT, adopted 11/98
  - Origins in the 'Makers' of ATLFAST (see the sketch after this list)
  - Supports legacy Fortran codes and table (IDL) based data structures developed in the previous StAF framework without change
  - Deployed in offline production and analysis in our 'Mock Data Challenge 2', 2-3/99
- Post-reconstruction analysis: C++/OO data model 'StEvent'
  - The StEvent interface is 'generic C++'; analysis codes are unconstrained by ROOT and need not (but may) use it
- Next step: migrate the OO data model upstream to reco
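To illustrate the 'Maker' style of framework module mentioned above, here is a hypothetical, simplified sketch: each processing step implements a common lifecycle that the framework drives once per job and once per event. The class and method names are illustrative only, not the actual STAR or ATLFAST framework API.

```cpp
// Hypothetical sketch of a 'Maker'-style framework module. Names are
// illustrative, not the real framework classes.
#include <cstdio>
#include <memory>
#include <vector>

class Maker {                          // common interface the framework calls
public:
  virtual ~Maker() {}
  virtual int Init()   { return 0; }   // once, before the event loop
  virtual int Make()   = 0;            // once per event
  virtual int Finish() { return 0; }   // once, after the event loop
};

class TrackCountMaker : public Maker { // toy example of one processing step
  long total_ = 0;
public:
  int Make() override {
    int nTracks = 4000;                // stand-in for real reconstruction output
    total_ += nTracks;
    return 0;
  }
  int Finish() override {
    std::printf("processed %ld tracks in total\n", total_);
    return 0;
  }
};

int main() {
  std::vector<std::unique_ptr<Maker>> chain;
  chain.push_back(std::make_unique<TrackCountMaker>());

  for (auto& m : chain) m->Init();
  for (int event = 0; event < 100; ++event)   // simplified event loop
    for (auto& m : chain) m->Make();
  for (auto& m : chain) m->Finish();
  return 0;
}
```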
Initial RHIC DB Technology Choices
- A RHIC-wide Event Store Task Force in Fall '97 addressed data management alternatives
  - Requirements formulated by the four experiments
  - Objectivity and ROOT were the 'contenders' put forward
- STAR and PHENIX selected Objectivity as the basis for data management
  - Concluded that only Objectivity met the requirements of their event stores
- ROOT selected by the smaller experiments and seen by all as an analysis tool with great potential
- Issue for the two larger experiments: where to draw the dividing line between Objectivity and ROOT in the data model and data processing
Event Store Requirements -- And Fall '97 View
Requirements: STAR 8/99 View (My Version)
RHIC Data Management: Factors For Evaluation
My perception of changes in the STAR view from '97 to now are shown. Factors compared for Objectivity vs. ROOT+MySQL:
- Cost
- Performance and capability as data access solution
- Quality of technical support
- Ease of use, quality of documentation
- Ease of integration with analysis
- Ease of maintenance, risk
- Commonality among experiments
- Extent, leverage of outside usage
- Affordable/manageable outside RCF
- Quality of data distribution mechanisms
- Integrity of replica copies
- Availability of browser tools
- Flexibility in controlling permanent storage location
- Level of relevant standards compliance, e.g. ODMG
- Java access
- Partitioning DB and resources among groups
Object Database: Storage Hierarchy vs User View
- The user deals only with an 'object model' of his own design; storage details are hidden (see the sketch below)
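As a conceptual illustration of this point (not Objectivity's or ROOT's actual API; all names here are hypothetical), application code works with its own object model while a storage layer hides files, pages and tape behind an interface:

```cpp
// Conceptual sketch: user code sees only its own object model; where and how
// the objects live (memory, disk, tape) is hidden behind an abstract store.
// All names are hypothetical, not a real product API.
#include <cstdio>
#include <map>
#include <string>

struct Event {                 // the user's own object model
  int    run;
  int    id;
  double vertexZ;
};

class EventStore {             // storage details live behind this interface
public:
  virtual ~EventStore() {}
  virtual void  put(const std::string& key, const Event& e) = 0;
  virtual Event get(const std::string& key) const = 0;
};

// One possible backend; could equally be a file, a database, or tape staging.
class InMemoryStore : public EventStore {
  std::map<std::string, Event> data_;
public:
  void  put(const std::string& key, const Event& e) override { data_[key] = e; }
  Event get(const std::string& key) const override { return data_.at(key); }
};

int main() {
  InMemoryStore store;
  store.put("run1/evt42", Event{1, 42, 3.5});   // user never sees pages or tapes
  Event e = store.get("run1/evt42");
  std::printf("run=%d id=%d vertexZ=%.1f\n", e.run, e.id, e.vertexZ);
  return 0;
}
```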
ATLAS and US ATLAS
- One of two large HEP experiments at CERN's Large Hadron Collider (LHC)
  - Proton-proton collider; 14 TeV in the center of mass
  - 1 billion events/year
- Principal objective: discovery and characterization of physics 'beyond the Standard Model': Higgs, Supersymmetry, ...
- Startup 2005+
- Brookhaven hosts the US Project Office for US contributions to ATLAS
  - ~$170M; about 20% of the project
- Brookhaven recently selected as host lab for US ATLAS Computing and site of the US Regional Center
  - Extension of the RHIC Computing Facility
  - US ATLAS Computing projected to grow to ~$15M/yr
Conclusions
- HENP is (unfortunately!) still pushing the envelope in the scale of the data processing and management tasks of present and next generation experiments
- The HENP community has looked to the commercial and open software worlds for tools and approaches, with strong successes in some areas (OO programming), qualified successes in others (HPSS), and the jury still out on some (object databases)
- Moore's Law and the rise of Linux have made provisioning CPU cycles less of an issue
- The community has converged on OO as the principal tool to make software development tractable
  - But solutions to data storage and management are much less clear
- A need on the rise is distributed computing, but internet-driven growth in capacities and technologies will be a strong lever
- Developments within the HENP community continue to be important, either as fully capable solutions or as interim solutions pending further commercial/open software developments
Conclusions
- The circumstances of STAR:
  - Startup this year
  - Slow start in addressing event store implementation and C++ migration
  - Large base of legacy software
  - Extremely limited manpower and computing resources
- These drive us to very practical and pragmatic data management choices which leverage existing STAR strengths:
  - Beg, steal and borrow from the community
  - Deploy community and industry standard technologies
  - Isolate implementation choices behind standard interfaces, to revisit and re-optimize in the future
- Component and standards-based software greatly eases integration of new technologies
  - Preserving compatibility with existing tools for selective and fall-back use
  - While efficiently migrating legacy software and legacy physicists
- After some course corrections, we have a capable data management architecture for startup that scales to STAR's data volumes... but Objectivity is no longer in the picture.