1
STAR COMPUTING
STAR Computing Status, Out-source Plans, Residual Needs
Torre Wenaus, STAR Computing Leader, BNL
RHIC Computing Advisory Committee Meeting, BNL, October 11, 1999
2
Outline
- Status of STAR and STAR Computing
- Out-source plans
- Residual needs
- Conclusions
Notes:
- MDC2 participation slide
- dismal RCF/BNL networking performance
- increasing reliance on LBL for functional security configs
- ftp to sol not possible
- cannot open X windows from my home internet provider
3
The STAR Computing Task
- STAR trigger system reduces the 10 MHz collision rate to ~1 Hz recorded to tape
- Data recording rate of 20 MB/sec; ~12 MB raw data size per event
- ~4000+ tracks/event recorded in the tracking detectors (factor of 2 uncertainty in physics generators)
- High statistics per event permit event-by-event measurement and correlation of QGP signals such as strangeness enhancement, J/psi attenuation, high-pT parton energy loss modifications in jets, and global thermodynamic variables
- 17M Au-Au events (equivalent) recorded in a nominal year
- Relatively few but highly complex events requiring large processing power
- Wide range of physics studies: ~100 concurrent analyses foreseen in ~7 physics working groups
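As a rough consistency check (assuming a nominal RHIC year of about 10^7 seconds of data taking, a figure not stated on this slide), the quoted rates reproduce the raw data volume and event count used in the requirements slides that follow:

\[
20\ \mathrm{MB/s} \times 10^{7}\ \mathrm{s} \approx 2\times10^{14}\ \mathrm{bytes} = 200\ \mathrm{TB},
\qquad
\frac{200\ \mathrm{TB}}{12\ \mathrm{MB/event}} \approx 1.7\times10^{7}\ \mathrm{events}.
\]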
4
Manpower
- Planned dedicated database person never hired (funding); databases consequently late, but we are now transitioning from an interim to our final database
- Very important development in the last 6 months: big new influx of postdocs and students into computing and related activities
- Increased participation and pace of activity in:
  - QA
  - online computing
  - production tools and operations
  - databases
  - reconstruction software
- Still missing an online/general computing systems support person
  - Open position cancelled due to lack of funding
  - Shortfall continues to be made up by the local computing group
5
Some of our Youthful Manpower
A partial list of the young students and postdocs now active in aspects of the software:
Dave Alvarez (Wayne, SVT); Lee Barnby (Kent, QA and production); Jerome Baudot (Strasbourg, SSD); Selemon Bekele (OSU, SVT); Marguerite Belt Tonjes (Michigan, EMC); Helen Caines (Ohio State, SVT); Manuel Calderon (Yale, StMcEvent); Gary Cheung (UT, QA); Laurent Conin (Nantes, database); Wensheng Deng (Kent, production); Jamie Dunlop (Yale, RICH); Patricia Fachini (Sao Paolo/Wayne, SVT); Dominik Flierl (Frankfurt, L3 DST); Marcelo Gameiro (Sao Paolo, SVT); Jon Gangs (Yale, online); Dave Hardtke (LBNL, calibrations, DB); Mike Heffner (Davis, FTPC); Eric Hjort (Purdue, TPC); Amy Hummel (Creighton, TPC, production); Holm Hummler (MPG, FTPC); Matt Horsley (Yale, RICH); Jennifer Klay (Davis, PID); Matt Lamont (Birmingham, QA); Curtis Lansdell (UT, QA); Brian Lasiuk (Yale, TPC, RICH); Frank Laue (OSU, online); Lilian Martin (Subatech, SSD); Marcelo Munhoz (Sao Paolo/Wayne, online); Aya Ishihara (UT, QA); Adam Kisiel (Warsaw, online, Linux); Frank Laue (OSU, calibration); Hui Long (UCLA, TPC); Vladimir Morozov (LBNL, simulation); Alex Nevski (RICH); Sergei Panitkin (Kent, online); Caroline Peter (Geneva, RICH); Li Qun (LBNL, TPC); Jeff Reid (UW, QA); Fabrice Retiere (calibrations); Christelle Roy (Subatech, SSD); Dan Russ (CMU, trigger, production); Raimond Snellings (LBNL, TPC, QA); Jun Takahashi (Sao Paolo, SVT); Aihong Tang (Kent); Greg Thompson (Wayne, SVT); Fuquian Wang (LBNL, calibrations); Robert Willson (OSU, SVT); Richard Witt (Kent); Gene Van Buren (UCLA, documentation, tools, QA); Eugene Yamamoto (UCLA, calibrations, cosmics); David Zimmerman (LBNL, Grand Challenge)
6
STAR Computing Requirements
Nominal year processing and data volume requirements:
- Raw data volume: 200 TB
- Reconstruction: 2800 Si95 total CPU, 30 TB DST data
  - 10x event size reduction from raw to reco
  - 1.5 reconstruction passes/event (only!) assumed
- Analysis: 4000 Si95 total analysis CPU, 15 TB micro-DST data
  - 1-1000 Si95-sec/event per MB of DST depending on the analysis
  - Wide range, from CPU-limited to I/O-limited
  - ~100 active analyses, 5 passes over the data per analysis
  - micro-DST volumes from 0.1 to several TB
- Simulation: 3300 Si95 total including reconstruction, 24 TB
- Total nominal year data volume: 270 TB
- Total nominal year CPU: 10,000 Si95
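The totals on this slide follow directly from the component numbers:

\[
2800 + 4000 + 3300 \approx 10{,}000\ \mathrm{Si95},
\qquad
200 + 30 + 15 + 24 \approx 270\ \mathrm{TB}.
\]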
7
Current Status of Requirements
Simulation requirements review
8
STAR Software Environment
- Current software base a mix of C++ (55%) and Fortran (45%)
  - Rapid evolution from ~20%/80% in 9/98
  - New development, and all physics analysis, in C++
- Framework built over the CERN-developed ROOT toolkit, adopted 11/98
  - Supports legacy Fortran codes and data structures without change
  - Deployed in offline production and analysis in Mock Data Challenge 2, 2-3/99
- Post-reconstruction: C++/OO data model 'StEvent' implemented (see the sketch below)
  - Next step: migrate the OO data model upstream to reconstruction
- Core software development team: ~7 people
  - Almost entirely BNL based; BNL leads STAR's software and computing effort
  - A small team given the scale of the experiment and the data management/analysis task
- Strong reliance on leveraging community, commercial, and open software
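A minimal sketch of the style of C++/OO access an StEvent-like data model gives analysis code. The class and accessor names below are illustrative assumptions, not the actual StEvent interface.

// Illustrative only: hypothetical event/track classes, not the real StEvent API.
#include <vector>
#include <cstdio>

struct Track {                 // hypothetical reconstructed-track summary
    int    charge;
    double pt;                 // transverse momentum (GeV/c)
};

struct Event {                 // hypothetical event header + track container
    int                runNumber;
    std::vector<Track> tracks;
};

// A typical analysis loop: iterate over tracks, apply cuts, count survivors.
void analyze(const Event& evt) {
    int nAccepted = 0;
    for (const Track& t : evt.tracks)
        if (t.pt > 0.2) ++nAccepted;   // example kinematic cut
    std::printf("run %d: %d of %zu tracks pass cuts\n",
                evt.runNumber, nAccepted, evt.tracks.size());
}

int main() {
    Event evt;
    evt.runNumber = 1;
    evt.tracks = { {1, 0.5}, {-1, 0.15}, {1, 1.2} };   // toy input
    analyze(evt);
    return 0;
}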
9
Status Relative to November Workshop Decisions
- ROOT-based production software by MDC2; legacy codes will continue to be supported
- C++/OO data model for the DST, design unconstrained by ROOT
- Analysis codes in C++; unconstrained by ROOT (templates etc.)
- ROOT-based analysis framework and tools
- Choice between Objectivity and ROOT for the data store from MDC2
  - Objectivity for event store metadata, "all being well"
  - All was not well! Concerns over manpower, viability, outside experience (BaBar), and most importantly the success of MySQL and ROOT I/O provided a good alternative which does not compromise the data model
- Will use ROOT for micro-DSTs in year 1
  - Extension of the data model to persistent form was presumed overambitious; it was, but our hyperindustrious core team (largely Yuri, Victor) did it anyway. Persistent StEvent as the basis for micro-DSTs is now an option.
10
From MDC2 to the May Workshop
- MDC2: ROOT-based production chain and I/O, StEvent (OO/C++ data model) in place and in use
  - Statistics suffered from software and hardware problems and the short MDC2 duration
  - Software objectives were met, as was the objective of an active physics analysis and QA program; 5pm QA meetings instituted
- Post-MDC2: round of infrastructure modifications in light of MDC2 experience
  - 'Maker' scheme, incorporating packages in the framework and interfacing to I/O, revised (see the sketch after this slide)
  - Robustness of multi-branch I/O (multiple file streams) improved
  - XDF-based I/O maintained as a stably functional alternative
  - Rapid change brought a typical user response of Aaaargh!
- Development completed by the late May computing workshop
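A minimal sketch of the 'Maker' idea: each package is wrapped in a module with a common lifecycle (initialize, process one event, finish) so the framework can chain modules and route I/O through them. The class and method names here are illustrative assumptions, not the actual StMaker/StChain interface.

// Illustrative Maker-style module chain (hypothetical names).
#include <vector>
#include <string>
#include <cstdio>

class Maker {                          // common lifecycle every package implements
public:
    explicit Maker(const std::string& name) : fName(name) {}
    virtual ~Maker() {}
    virtual int Init()   { return 0; } // once, before the event loop
    virtual int Make()   = 0;          // once per event
    virtual int Finish() { return 0; } // once, after the event loop
    const std::string& Name() const { return fName; }
private:
    std::string fName;
};

class Chain {                          // framework: owns and drives the Makers
public:
    void Add(Maker* m) { fMakers.push_back(m); }
    int  Init()   { for (Maker* m : fMakers) if (m->Init())   return 1; return 0; }
    int  Event()  { for (Maker* m : fMakers) if (m->Make())   return 1; return 0; }
    int  Finish() { for (Maker* m : fMakers) if (m->Finish()) return 1; return 0; }
private:
    std::vector<Maker*> fMakers;
};

class TpcClusterMaker : public Maker { // example package wrapped as a Maker
public:
    TpcClusterMaker() : Maker("tpc_cluster") {}
    int Make() override {
        std::printf("%s: processing one event\n", Name().c_str());
        return 0;
    }
};

int main() {
    TpcClusterMaker tpc;
    Chain chain;
    chain.Add(&tpc);
    chain.Init();
    for (int i = 0; i < 3; ++i) chain.Event();  // toy 3-event loop
    return chain.Finish();
}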
11
Status Relative to Engineering Run Goals Established in May
- Stabilize the offline environment as seen by users
  - End of rapid infrastructure development in May and subsequent debugging have improved stability, but the issue hasn't gone away
  - Shift of focus to consolidation: debugging, usability improvements, documentation, user-driven enhancements, developing and responding to QA
- Support for DAQ data in offline from raw files through analysis. Done.
- Provide transparent I/O supporting all file types. Done.
  - 'Universal interface' to I/O, StIOMaker, developed to provide transparent, backwards-compatible I/O for all formats
  - Future (from September) extension to the Grand Challenge
- Stably functional data storage
  - ROOT I/O debugging proceeded into June/July; now appears stably functional. XDF output continues to provide a backup.
- Deploy and migrate to persistent StEvent
  - Deployed; migration to persistent StEvent in progress since the meeting. Period of coexistence to allow people to migrate on their own timescale.
12
Status Relative to May Goals (2)
- Populated calibration/parameter database in use via a stable interface
  - Not there! Still a work in progress!
- Resume (in persistent form!) the productive QA/analysis operation seen in MDC2
  - Big success; thanks to Trainor, Ullrich, Turner and the many participants
- Extend the production DB to real data logging, multi-component events, and the file catalog needs of the Grand Challenge
  - Done, but much work still to come.
13
Real Data Processing
- StDaqLib from the DAQ group provides real data decoding
- StIOMaker provides real data access in the offline framework
- St_tpcdaq_Maker interfaces to the TPC codes
- Current status from Iwona:
  - Can read and analyze zero-suppressed data all the way to the DST
  - DST from real data (360 events) read into StEvent analysis
  - Bad channel suppression implemented and tested
  - First-order alignment worked out (~1 mm); the rest to come from residuals analysis
  - For monitoring purposes we also read and analyze unsuppressed data
  - 75% of the TPC read out; 10,000 cosmics with no field and several runs with field on
  - Working towards regular production and QA reconstruction of real data
14
Release Management and Stability
- Separate infrastructure development release .dev introduced to insulate users
- .dev migrated to dev weekly (Friday); standard test suite run over the weekend
- QA evaluation by the core group, librarian, and QA team on Monday
- dev to new migration decision taken at the Tuesday computing meeting
  - Migration happens weekly unless QA results delay it
- Better-defined criteria for dev -> new migration to be instituted soon
- Objective: a stable, current new version for broad use
  - Currently too heavy a reliance on the unstable dev version
  - Version usage to date (pro artificially high; it is the default environment):
    .dev   2729
    dev   18503
    new    7641
    old     662
    pro   57697
15
Event Store and Data Management
- Success of ROOT-based event data storage from MDC2 on relegated Objectivity to a metadata management role, if any
  - ROOT provides storage for the data itself
  - We can use a simpler, safer tool in the metadata role without compromising our data model, and avoid the complexities and risks of Objectivity
- MySQL adopted to try this (relational DB, open software, widely used, very fast, but not a full-featured heavyweight like Oracle)
  - Wonderful experience so far: excellent tools, very robust, extremely fast
  - Scalability must be tested, but multiple servers can be used as needed to address scalability needs
  - Not taxing the tool, because metadata, not large-volume data, is stored
- Metadata management (multi-component file catalog) implemented for simulation data and extended to real data
  - Basis of tools managing data locality and retrieval in development (see the sketch below)
- Event-level tags implemented for real data to test scalability and the event tag function (but probably not the full TagDB, which is better suited to ROOT files)
- Will integrate with the Grand Challenge in September
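A minimal sketch of how a production tool might query such a MySQL file catalog through the standard MySQL C API. The host, credentials, and the fileCatalog table/column names are hypothetical, not STAR's actual schema.

// Sketch of a file-catalog lookup via the MySQL C API (hypothetical schema).
#include <mysql.h>
#include <cstdio>

int main() {
    MYSQL* conn = mysql_init(NULL);
    if (!mysql_real_connect(conn, "db.example.org", "reader", "",
                            "star_catalog", 0, NULL, 0)) {
        std::fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
        return 1;
    }

    // Find the archive paths of all DST files for one simulation run.
    const char* query =
        "SELECT fileName, hpssPath, sizeBytes "
        "FROM fileCatalog WHERE runId = 1234 AND component = 'dst'";
    if (mysql_query(conn, query)) {
        std::fprintf(stderr, "query failed: %s\n", mysql_error(conn));
        mysql_close(conn);
        return 1;
    }

    MYSQL_RES* res = mysql_store_result(conn);
    MYSQL_ROW  row;
    while ((row = mysql_fetch_row(res)) != NULL)
        std::printf("%s  %s  %s bytes\n", row[0], row[1], row[2]);

    mysql_free_result(res);
    mysql_close(conn);
    return 0;
}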
16
ROOT Status
- ROOT is with us to stay! No major deficiencies or obstacles found; no post-ROOT visions contemplated
- ROOT community growing: Fermilab Run II, ALICE, MINOS
  - We are leveraging community developments
- First US ROOT workshop at FNAL in March
  - Broad participation: >50 from all major US labs and experiments
  - ROOT team present; heeded STAR priority requests
    - I/O improvements: robust multi-stream I/O and schema evolution
    - Standard Template Library support
    - Both emerging in subsequent ROOT releases
  - FNAL participation in development and documentation
    - ROOT guide and training materials just released
- The framework is based on ROOT; application codes need not depend on ROOT (neither is it forbidden to use ROOT in application codes).
17
QA
- Peter Jacobs appointed QA Czar by John Harris
- With Kathy Turner, Tom Trainor and many others, enormous strides have been made in QA, with much more to come
- QA process in the semi-weekly 3pm 902 meetings very effective
- Suite of histograms and other QA measures in continuous development
- Automated tools managing production and extraction of QA measures from test and production running being developed
- The QA process, as it evolves, will be integrated into release procedures: very soon for dev -> new, with new -> pro to come
- Acts as a very effective driver for debugging and development of the software, engaging a lot of people
18
Documentation and Tutorials
- Greater attention to documentation needed and getting underway
  - Documentation and code navigation tools developed
  - Infrastructure documentation and ease of use a priority as part of the 'consolidation' effort
  - Many holes to be filled in application code documentation as well
  - Entry-level material, user guides, and detailed reference manuals needed
  - Standards in place; encouragement (enforcement!) needed to get developers to document their code and users to contribute to entry-level material
- Gene Van Buren appointed Software Documentation Coordinator -- thank you Gene!
  - Oversees documentation activities
  - Provides a prioritized list of needs
  - Makes sure the documentation gets written and checked; let Gene know what exists
  - Implements easy web-based access; contributes to entry-level documentation
- We must all support this effort!
- Monthly tutorial program ongoing; thanks to Kathy Turner for organizing and to the other presenters for participating
19
Computing Facilities
- Broad local and remote usage of STAR software:
  RCF 35532, BNL Sun 10944, BNL Linux 6422, Desktop 3255, HP 422, LBL 9500, Rice 3116, Indiana 17760
- Enabled by an AFS-based environment; AFS an indispensable tool
  - But stressed when heavily used for data access
- Agreement reached with RCF to allow read-only access to RCF NFS data disks from STAR BNL computers
  - Urgently needed to avoid total dependence on AFS for distributed access to (limited volumes of) data
  - Trial period during August, followed by evaluation of any problems
- Linux
  - New RCF farm deployments soon; cf. Bruce's talk
  - New in STAR: 6 dual 500 MHz/18 GB machines (2 arrived), 120 GB disk (arrived)
    - For development, testing, online monitoring, services (web, DB, ...)
  - STAR Linux manageable to date thanks largely to Pavel Nevski's self-education as a Linux expert
  - STAR environment provided for Linux desktop boxes
20
Post Engineering Run Priorities
- Extending data management tools (MySQL DB + disk file management + HPSS file management + multi-component ROOT files)
- Complete schema evolution, in collaboration with the ROOT team
- Management of CAS processing and data distribution, both for mining and for individual-physicist-level analysis
- Integration and deployment of the Grand Challenge
- Completion of the DB: integration of slow control as a data source, completion of online integration, extending beyond the TPC
- Extend and apply the OO data model (StEvent) to reconstruction
- Continued QA development
- Improving and better integrating visualization tools
- Reconstruction and analysis code development directed at:
  - addressing QA results and year 1 code completeness and quality
  - responding to the real-data outcomes of the engineering run
21
STAR Computing Facilities: Offsite
- pdsf at LBNL/NERSC
  - Virtually identical configuration to RCF: Intel/Linux farm, limited Sun/Solaris, HPSS-based data archiving
  - Current (10/99) scale relative to STAR RCF: CPU ~50% (1200 Si95), disk ~85% (2.5 TB)
  - Long-term goal: resources ~equal to STAR's share of RCF
    - Consistent with the long-standing plan that RCF hosts 50% of the experiments' computing facilities: simulation and some analysis offsite
    - Ramp-up currently being planned
- Cray T3Es at NERSC and the Pittsburgh Supercomputing Center
  - STAR Geant3-based simulation ported to the T3Es and used to generate ~7 TB of simulated data in support of the Mock Data Challenges and software development
- Physics analysis computing at home institutions
  - Processing of micro-DSTs and DST subsets
22
PDSF in STAR
Probably the only new tidbit is that we now know our FY2000 allocation for other NERSC resources: 210,000 MPP hours and 200,000 SRUs. A discussion of what SRUs are, though, would just be confusing.
23
STAR at RCF
- Overall: RCF is a well-designed and effective facility for STAR data processing and management
- Sound choices in overall architecture, hardware, software
- Well aligned with the HENP community
  - Community tools and expertise easily exploited
  - Valuable synergies with non-RHIC programs, notably ATLAS
    - Growing ATLAS involvement within the STAR BNL core computing group
- Production stress tests have been successful and instructive
- On schedule; facilities have been there when needed
- RCF interacts effectively and consults appropriately, and is generally responsive to input
- Mock Data Challenges have been highly effective in exercising, debugging and optimizing RCF production facilities
24
Grand Challenge
What does the Grand Challenge do for you?
- Optimizes access to the HPSS tape store
  - The GC Architecture allows users to stage files out of HPSS onto local disk (the GC disk cache) without worrying about HPSS usernames and passwords, tape numbers, or file names
- Improves data access for individual users
  - Allows event access by query: present a query string to the GCA (e.g. NumberLambdas>1) and receive an iterator over the events which satisfy the query as their files are extracted from HPSS (see the sketch below)
  - Pre-fetches files so that "the next" file is requested from HPSS while you are analyzing the data in your first file
- Coordinates data access among multiple users
  - Coordinates ftp requests so that a tape is staged only once per set of queries which request files on that tape
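A minimal sketch of the usage pattern described above: submit a query, then iterate over the matching events as their files are staged from HPSS. The gca:: class and method names are invented for illustration and the bodies are stubs that fake a 3-event result so the sketch runs standalone; this is not the actual Grand Challenge API.

// Illustrative pseudo-API for query-driven event access (hypothetical names).
#include <string>
#include <cstdio>

namespace gca {

struct Event { int id; };

class EventIterator {              // hands back events as their files would
public:                            // be staged out of HPSS
    explicit EventIterator(int n) : fLeft(n), fNext(0) {}
    bool next(Event& e) {          // false when the query is exhausted
        if (fLeft == 0) return false;
        --fLeft; e.id = fNext++;
        return true;
    }
private:
    int fLeft, fNext;
};

class Query {
public:
    explicit Query(const std::string& predicate) : fPredicate(predicate) {}
    EventIterator execute() {      // real system: triggers staging/pre-fetch
        return EventIterator(3);   // stub: pretend 3 events matched
    }
private:
    std::string fPredicate;
};

} // namespace gca

int main() {
    gca::Query query("NumberLambdas>1");   // event-level tag predicate
    gca::EventIterator it = query.execute();
    gca::Event evt;
    while (it.next(evt))                   // analyze this event while the
        std::printf("processing event %d\n", evt.id);  // next file is fetched
    return 0;
}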
25
STAR C OMPUTING Torre Wenaus, BNL RHIC Computing Advisory Committee 10/98 Grand Challenge in STAR
26
GC Implementation Plan
- Interface GC client code to the STAR framework
  - Already runs on Solaris and Linux
  - Needs integration into framework I/O management
  - Needs connections to the STAR MySQL DB
- Apply the GC index builder to STAR event tags
  - Interface is defined
  - Has been used with non-STAR ROOT files
  - Needs connection to STAR ROOT files and the MySQL DB
- (New) manpower for implementation now available; needs to come up to speed
27
Mock Data Challenges
- MDC1: Sep/Oct '98
  - >200k (2 TB) events simulated offsite; 170k reconstructed at RCF (goal was 100k)
  - Storage technologies exercised (Objectivity, ROOT)
  - Data management architecture of the Grand Challenge project demonstrated
  - Concerns identified: HPSS, AFS, farm management software
- MDC2: Feb/Mar '99
  - New ROOT-based infrastructure in production
  - AFS improved, HPSS improved but still a concern
  - Storage technology finalized (ROOT)
  - New problem area, STAR program size, addressed in new procurements and OS updates (more memory, swap)
- Both data challenges:
  - Effective demonstration of productive, cooperative, concurrent (in MDC1) production operations among the four experiments
  - Bottom line verdict: the facility works, and should perform in physics datataking and analysis
28
Principal Concerns
- RCF manpower
  - Understaffing directly impacts:
    - Depth of support/knowledge base in crucial technologies, e.g. AFS, HPSS
    - Level and quality of user and experiment-specific support
    - Scope of RCF participation in software; much less central support/development effort in common software than at other labs (FNAL, SLAC)
      - e.g. ROOT is used by all four experiments, but there is no RCF involvement
      - Exacerbated by very tight manpower within the experiment software efforts
      - Some generic software development supported by LDRD (NOVA project of the STAR/ATLAS group)
  - The existing overextended staff is getting the essentials done, but the data flood is still to come
- Computer/network security
  - Careful balance required between ensuring security and providing a productive and capable development and production environment
  - Closely coupled to overall lab computer/network security; a coherent site-wide plan, as non-intrusive as possible, is needed
  - In considerable flux in recent months, with the jury still out on whether we are in balance
29
HPSS Transfer Failures
- During MDC2, in certain periods up to 20% of file transfers to HPSS failed dangerously
  - Transfers seem to succeed: no errors, and the file is seemingly visible in HPSS with the right size, but on reading we find the file is not readable
  - John Riordan has the list of errors seen during reading
- In reconstruction we can guard against this by reading the file back (see the sketch below), but it would be a much more serious problem for DAQ data, which we cannot afford to read back from HPSS to check its integrity.
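A minimal sketch of the read-back guard mentioned above, under the assumption that the archived copy can be staged back to a local path: re-read the file and compare its byte count and a simple checksum against the original before declaring the transfer good. The file paths and checksum choice are illustrative; a real check might use a stronger hash.

// Post-transfer read-back check: compare size and an additive checksum of the
// original file and the copy staged back from the archive (illustrative paths).
#include <cstdio>
#include <cstdint>
#include <fstream>

struct FileDigest {
    std::uint64_t bytes = 0;
    std::uint64_t sum   = 0;
    bool          ok    = false;
};

FileDigest digest(const char* path) {
    FileDigest d;
    std::ifstream in(path, std::ios::binary);
    if (!in) return d;                       // unreadable file -> not ok
    char buf[65536];
    while (in.read(buf, sizeof(buf)) || in.gcount() > 0) {
        std::streamsize n = in.gcount();
        d.bytes += static_cast<std::uint64_t>(n);
        for (std::streamsize i = 0; i < n; ++i)
            d.sum += static_cast<unsigned char>(buf[i]);
        if (in.eof()) break;
    }
    d.ok = true;
    return d;
}

int main() {
    FileDigest original = digest("dst_run1234.root");          // local output
    FileDigest readback = digest("staged/dst_run1234.root");   // staged back

    if (!original.ok || !readback.ok ||
        original.bytes != readback.bytes || original.sum != readback.sum) {
        std::fprintf(stderr, "transfer verification FAILED - resend file\n");
        return 1;
    }
    std::printf("transfer verified: %llu bytes\n",
                (unsigned long long)original.bytes);
    return 0;
}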
30
Public Information and Documentation Needed
- A clear list of the services RCF provides, the level of support for these services, the resources allocated to each experiment, and the personnel responsible for support:
  - rcas: LSF, ...
  - rcrs: CRS software, ...
  - AFS: stability, home directories, ...
  - Disks: inside/outside access
  - HPSS
  - experiment DAQ/online interface
- Web-based information is very incomplete
  - e.g. information on planned facilities for year one and after (largely a restatement of the first point)
31
Current STAR Status at RCF
- Computing operations during the engineering run fairly smooth, apart from very severe security disruptions
- Data volumes small, and the direct DAQ -> RCF data path not yet commissioned
- Effectively using the newly expanded Linux farm
- Steady reconstruction production on CRS; transition to year 1 operation should be smooth
- Analysis software development and production underway on CAS
  - Tools managing analysis operations under development
  - In urgent need of disk; expected imminently
- Integration of Grand Challenge data management tools into production and physics analysis operations to take place this fall
- Based on status to date, we expect STAR and RCF to be ready for whatever RHIC Year 1 throws at us.
32
RCF Support
- We found planning and cooperation with RCF and the other experiments before and during MDC1 to be very effective
- Thanks to the RCF staff for excellent MDC1 support around the clock, seven days a week! Particular thanks to...
  - HPSS: John Riordan
  - AFS on the CRS cluster: Tim Sailer, Alex Lenderman, Edward Nicolescu
  - CRS scripts: Tom Throwe
  - LSF support: Razvan Popescu
  - All other problems: Shigeki Misawa
33
Simulations and Data Sinking
- Cray T3E at the Pittsburgh Supercomputing Center (PSC) became available for STAR last March, in addition to NERSC, for MDC-directed simulation
- Major effort since then to port the simulation software to the T3E
  - CERNLIB, Geant3, GSTAR, and parts of STAF ported in a joint BNL/LBNL/PSC project
  - Production management and data transport (via network) infrastructure developed; production capability reached in August
- 90k CPU hours used at NERSC (exhausted our allocation); 50k CPU hours (only!) allocated in FY99
- Production continuing at PSC, working through a ~450k hour allocation through February
  - Prospects for continued use of PSC beyond February completely unknown; dependent on DOE
- T3E data transferred to RCF via the network and migrated to HPSS
  - 50 GB/day average transfer rate; 100 GB peak
34
Reconstruction
- CRS-based reconstruction production chain: TPC, SVT, global tracking
- Two MDC1 production reconstruction phases
  - Phase 1 (early September)
    - Production ramp-up
    - Debugging infrastructure and application codes
    - Low throughput; bounded by STAR code
  - Phase 2 (from September 25)
    - New production release fixing phase 1 problems
    - STAR code robust; bounded by HPSS and resources
    - HPSS: size of the HPSS disk cache (46 GB); number of concurrent ftp processes
    - CPU: from 20 CPUs in early September to 56 CPUs in late October
35
Production Summary
- Total on HPSS: 3.8 TB
- Geant simulation: 2.3 TB total
  - NERSC: 0.6 TB
  - PSC: 1.6 TB
  - RCF: ~0.1 TB
- DSTs:
  - XDF: 913 GB in 2569 files
  - Objectivity: 50 GB
  - ROOT: 8 GB
36
Reconstruction Production Rate
- 3/98 requirements report: 2.5 kSi95 sec/event = 70 kB/sec
- Average rate in MDC1: 75-100 kB/sec
- Good agreement!
37
DST Generation
- Production DST is the STAF-standard (XDR-based) XDF format
- Objectivity only in a secondary role, and presently incompatible with Linux-based production
  - Direct mapping from IDL-defined DST data structures to Objectivity event store objects
  - Objectivity DB built using the BaBar event store software as a basis
  - Includes the tag database info used to build the Grand Challenge query index
- XDF to Objectivity loading performed in a post-production Solaris-based step
  - 50 GB (disk space limited) Objectivity DST event store built and used by the Grand Challenge to exercise data mining
38
Data Mining and Physics Analysis
- Data mining in MDC1 by the Grand Challenge team; great progress, yielding an important tool for STAR and RHIC (Doug's talk)
- We are preparing to load the full 1 TB DST data set into a second-generation Objectivity event store, to be used for physics analysis via the Grand Challenge software supporting hybrid non-Objectivity (ROOT) event components
  - 'Next week' for the last month! Soon...
- Little physics analysis activity in MDC1
  - Some QA evaluation of the DSTs
  - Large MDC1 effort in preparing CAS software by the event-by-event physics group, but they missed MDC1 when it was delayed a month; they could only be here and participate in August
- DST evaluation and physics analysis has ramped up since MDC1 in most physics working groups and will play a major role in MDC2
39
Validation and Quality Assurance
- Simulation
  - Generation of evaluation plots and standard histograms for all data is an integral part of simulation production operations
- Reconstruction
  - Leak checking and table integrity checking built into the framework; effective in debugging and making the framework leak-tight (no memory growth)
  - Code acceptance and integration into the production chain performed centrally at BNL; log file checks, table row counts
  - TPC and other subsystem responsibles involved in assessing and debugging problems
- DST-based evaluation of upstream codes and physics analysis play a central role in QA, but were limited in MDC1 (lack of manpower)
- Dedicated effort underway for MDC2 to develop a framework for validation of reconstruction codes and DSTs against reference histograms
40
MDC1 Outcomes and Coming Activity
- Computing decisions directing development for MDC2 and year 1
- MDC1 followed by an intensive series of computing workshops addressing (mainly infrastructure) software design and implementation choices in light of MDC1 and MDC1-directed development work
  - Objective: stabilize the infrastructure for year 1 by MDC2; direct post-MDC2 attention to physics analysis and reconstruction development
- Many decisions ROOT-related; summarized in the ROOT decision diagram
  - Productive visit from Rene Brun
- C++/OO decisions: DST and post-DST physics analysis will be C++ based
- Event Store: a hybrid Objectivity/ROOT event store will be employed; ROOT used for (at least) post-DST micro and nano DSTs, and the tag database
41
Revisited Computing Requirements
- STAR Offline Computing Requirements Task Force Report
  - P. Jacobs et al, March 1998, STAR Note 327, 62pp
  - Physics motivation, data sets, reconstruction, analysis, simulation
  - Addresses the fully instrumented detector in a nominal year
- STAR Offline Computing Requirements for RHIC Year One
  - T. Ullrich et al, November 1998, STAR Note xxx, 5pp
  - MDC1 experience shows no significant deviation from SN327; conclude that the numbers remain valid and the estimates remain current
  - This new report is an addendum addressing year 1
  - Detector setup: TPC, central trigger barrel, 10% of the EMC, and (probably) the FTPC and RICH
  - 20 MB/sec rate to tape; central event size 16 +- 4 MB
    - No data compression during the first year
42
Year 1 Requirements (2)
- RHIC running
  - STAR's requirements do not depend on the luminosity profile: we can saturate on central events at 2% of nominal RHIC luminosity
  - Rather, they depend on the actual duty factor profile
  - 4M central Au-Au events at 100 GeV/A per beam expected from the official estimated machine performance folded with the STAR duty factor
    - 25% of a nominal year
    - Total raw data volume estimate: 60 TB
- DST production
  - MDC1 experience confirms the estimate of 2.5 kSi95 sec/event for reconstruction
  - Assuming a reprocessing factor of 1.5 (a very lean assumption for year 1!) yields 15M kSi95 sec (see the arithmetic below)
    - 630 Si95 units average for the year, assuming a 75% STAR/RCF duty factor
    - Factoring in the machine duty factor, we can expect the bulk of the data to arrive late in the year
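The quoted numbers are mutually consistent (assuming one year is approximately 3.15 x 10^7 seconds, a value not stated on the slide):

\[
4\times10^{6}\ \mathrm{events} \times 16\ \mathrm{MB/event} \approx 64\ \mathrm{TB} \approx 60\ \mathrm{TB},
\]
\[
4\times10^{6} \times 2.5\ \mathrm{kSi95\,s} \times 1.5 = 1.5\times10^{7}\ \mathrm{kSi95\,s},
\qquad
\frac{1.5\times10^{10}\ \mathrm{Si95\,s}}{0.75 \times 3.15\times10^{7}\ \mathrm{s}} \approx 630\ \mathrm{Si95}.
\]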
43
Year 1 Requirements (3)
- To arrive at a reasonable number, we simulated a range of RHIC performance profiles
  - Obtained 1.5-2 times the average value to achieve year 1 throughput with at most a 10% backlog at year end: ~1200 Si95
- DST data volume estimated at 10 TB
  - Data reduction factor of 6 rather than 10 (SN327); additional information retained on the DST for reconstruction evaluation and debugging, and for detector performance studies
- Data mining and analysis
  - Projects able to participate with year 1 data, scaled by event count relative to the SN327 nominal year
  - 12M kSi95 sec total CPU needs; 2.3 TB microDST volume
  - 480 Si95 units assuming a 75% STAR/RCF duty factor
  - Analysis usage is similarly peaked late in the year, following QA, reconstruction validation, and calibration
    - Need estimated at twice the average: ~1000 Si95 units
44
Year 1 Requirements (4)
- Simulations
  - Used to derive acceptance and efficiency corrections, and signal backgrounds due to physics or instrumental response
  - Largely independent of the data sample size; no change relative to nominal year requirements foreseen
  - Data mining and analysis of simulations will peak late in year 1, with the real data
  - Simulation CPU assumed to be available offsite
  - At RCF, early-year CPU not yet needed for reconstruction/analysis can be applied to simulation reconstruction; no additional request for reconstruction of simulated data
  - 24 TB simulated data volume
- Year 1 requirements summarized in the report table, p. 5
45
Comments on Facilities
- All in all, RCF facilities performed well
  - HPSS gave the most trouble, as expected, but it too worked
  - Tape drive bottleneck will be addressed by MDC2
- A robust, flexible batch system like LSF is well motivated on CAS
  - Many users and usage modes
- Need for such a system on CRS not resolved
  - Limitations encountered in the script-based system: control in prioritizing jobs, handling of different reco job types and output characteristics, control over data placement and movement, job monitoring
  - Addressable through script extensions? Complementary roles for LSF and in-house scripts?
  - Attempt to evaluate LSF in reconstruction production incomplete; limited availability in MDC1
  - Would like to properly evaluate LSF (as well as next-generation scripts) on CRS in MDC2 as a basis for an informed assessment of its value (weighing cost also, of course)
- Flexible, efficient, robust application of computing resources important
  - Delivered CPU cycles (not just available cycles) is the bottom line
46
MDC2 Objectives
- Full year 1 geometry (add RICH, measured B field, ...)
- Geometry, parameters, calibration constants from the database
- Use of the DAQ form of raw data
- New C++/OO TPC detector response simulation
- Reconstruction production in ROOT
  - TPC clustering
- All year 1 detectors on the DST (add EMC, FTPC, RICH, trigger)
- ROOT-based job monitor (and manager?)
- Much more active CAS physics analysis program
- C++/OO transient data model in place, in use
- Hybrid Objectivity/ROOT event store in use
  - Independent storage of multiple event components
- Quality assurance package for validation against reference results
47
Conclusions
- MDC1 a success from STAR's point of view
  - Our important goals were met
  - The outcome would have been an optimist's scenario last spring
- MDCs a very effective tool for guiding and motivating effort
- Good collaboration with RCF and the other experiments
- Demonstrated a functional computing facility in a complex production environment
  - All data processing stages (data transfer, sinking, production reconstruction, analysis) successfully exercised concurrently
- Good agreement with the expectations of our computing requirements document
- Clear (but ambitious) path to MDC2
- Post-MDC2 objective: a steady ramp to the commissioning run and the real data challenge, in a stable software environment