1 Proposal to BNL/DOE to Use ACCRE Farm for PHENIX Real Data Reconstruction in 2008 - 2010
Charles Maguire for the VU PHENIX group
Carie Kennedy, Paul Sheldon, and Alan Tackett for ACCRE
November 13, 2007 Analysis Meeting

2 Executive Summary
- A proposal to BNL/DOE is being made to fund PHENIX data reconstruction at Vanderbilt's ACCRE farm during 2008 - 2010
- Two funding scenarios are envisioned, with 3-year total costs of $200K and $300K respectively, depending on the scope of the work at ACCRE
- This proposal builds on the experience gained at Vanderbilt in Run6 and Run7 doing near-real-time data reconstruction
- A concurrent proposal is being submitted to the CMS-HI collaboration in a competition to site their U.S. compute center at ACCRE
  - The CMS-HI compute center would be ~5 times larger than the PHENIX $300K proposal
  - Time scale, competing with MIT and Iowa bids, to be decided Feb. '08
- PHENIX will gain
  - The benefit of VU-subsidized costs and time-leveraged computing at ACCRE
  - Efficient use of manpower in a large PHENIX group, and great service from ACCRE
  - Advantages in keeping pace with CMS-HEP's technological breakthroughs
  - If DOE invests in ACCRE for CMS-HI, PHENIX may share upgrades (tech solutions)

3 Computing and Research at VU
ACCRE: Advanced Computing Center for Research and Education
- $8.7M capital project initially funded by the Vanderbilt Provost's office in 2004
- $2.0M subsequent infusions from NIH and NSF
- Currently ACCRE has more than 1600 CPUs, with space and power to grow to 4000 CPUs
- Additional CPUs at ACCRE will be purchased as part of new faculty-hire start-up funding packages
- Expecting in 2008 a dedicated Internet2 connection speed of 2.5 Gbits/second
- ACCRE is implementing a dedicated 10 Gbits/second link to the ESNET POP in Nashville
- ACCRE has a fast internal network for disk I/O, plus a tape archive facility
- Vanderbilt University will now deliver 1 Gbit/second to faculty desktop machines as justified
RHI and HEP Research at Vanderbilt
- The PHENIX group at VU is currently at 10 members: 3 faculty, 3 post-docs, 4 graduate students
- 2 of the students are currently deputy production managers (part of the Run7 reco team at ACCRE)
- The group formally joined CMS-HI as of May 2007
- Anticipating an eventual 40% group FTE presence in CMS-HI (the compute center is a major factor)
- The VU HEP group has been in CMS since 2006; Paul Sheldon is the overall group leader for all CMS at Vanderbilt
- The HEP group supports the installation of CMSSW at Vanderbilt for both HEP and HI use
- The HEP group also leads the NSF-funded REDDNet project to deploy 500 TBytes of disk
- REDDNet will make use of the L-Store toolkit for high-performance disk I/O (it couples to ROOT I/O)

4 ACCRE Organization
Steering Committee: Paul Sheldon (Chair), Dave Piston, Ron Schrimpf
Internal Advisory Committee: Dennis Hall (University) and Jeff Balser (VUMC), Co-Chairs
ACCRE Staff Management Team
- Technical Director: Alan Tackett
- Finance/User Management: Carie Lee Kennedy
- Education/Outreach: Rachel Gibbons
- Technical Staff: Mathew Binkley, Bobby Brown, Kevin Buterbaugh, Laurence Dawson, Santiago de Ledesma, Andrew Lyons, Kelly McCauley
- Support: Gretchen Green
First External Advisory Committee: Paul Avery, Dick Landis, Doug Post
Faculty Study Group (short-term): Marylyn Ritchie, Chair
Faculty Advisory Group (ongoing): Robert Weller, Chair

5 Near-Real Time Run 7 PHENIX Data Reconstruction at ACCRE, April - June 2007
[Workflow diagram: 45 TB of disk and 200 CPUs continuously available for Run7 reconstruction]
1. PRDFs (raw data files) transferred by GridFTP from RCF (the RHIC Computing Facility) to VU at 30 MBytes/sec, ~770 GBytes per cycle
2. nanoDSTs (reconstruction output) returned to RCF by GridFTP at 23 MBytes/sec
3. FDT transfers at 45 MBytes/sec in each direction between the GridFTP server and the ACCRE farm
4. Dedicated GridFTP server ("Firebird") with 4.4 TB of buffer disk space
Reconstruction at ACCRE: 200 jobs/cycle, 12 hours/job
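(A derived note, not on the original slide: at 30 MBytes/sec, the ~770 GBytes of PRDFs per cycle take roughly 770,000 MB / 30 MB/s ≈ 26,000 seconds, i.e. about 7 hours of inbound transfer, which fits comfortably within the 12-hour reconstruction time of each cycle.)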

6 Run7 Data I/O and Reconstruction Experience
I/O processes were completely automated (impossible to do otherwise)
- Over 100 Perl scripts were written at VU to automate and monitor the activity
- These scripts coordinated computer operations on 4 different systems
- 5810 PRDF file segments were automatically transferred from BNL to the local server, containing 30.3 TBytes = ~275M events
- The GridFTP receiving server at Vanderbilt was completely stable after defective vendor-supplied disk-management software was replaced in mid-April
- Saw sustained 100 MBytes/second I/O on this server (input from BNL + output to BNL)
Reconstruction (also completely automated and web-monitored)
- PRDFs began arriving in April, but the final reconstruction build was not ready until June 4, mostly due to the special circumstances of Run7: four new detector subsystems were brought on-line for Run 7 reconstruction
- There was plenty of disk space at ACCRE to archive the PRDFs until June; no PRDFs were actually deleted until August 1
- The main difficulty started in mid-June after the first sets of reco output arrived at RCF: GridFTP transfers to NFS disk systems, which had suddenly become very busy, began to fail
- Wrote "fault-tolerant" scripts to re-start GridFTP from the point of transfer failure
- Wrote a "horse race" competition script to locate the currently least busy destination disk at RCF (a simplified sketch of both ideas follows this list)
- Processed in 3 weeks the PRDF files that had been received over 11 weeks
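The VU Perl scripts themselves are not reproduced in this transcript. As a minimal illustrative sketch only, with hypothetical file paths, host names, and a plain globus-url-copy invocation (the production scripts certainly differed, and restarted partial transfers rather than repeating whole files), the "horse race" selection plus a fault-tolerant retry could look like this:

#!/usr/bin/perl
# Minimal sketch, NOT the actual VU production scripts: all paths, host
# names, and options here are illustrative assumptions.
use strict;
use warnings;
use Time::HiRes qw(time);

# Candidate destination areas at RCF (hypothetical URLs).
my @candidates = map { "gsiftp://rcf-gridftp.example.bnl.gov/phenix/run7/data$_/" } 1 .. 4;

# "Horse race": time a small probe transfer to each candidate and keep
# the disk whose probe finishes first, i.e. the least busy one right now.
my ($winner, $best);
for my $dst (@candidates) {
    my $t0 = time;
    next if system("globus-url-copy", "file:///tmp/probe_1MB.dat", $dst) != 0;
    my $dt = time - $t0;
    ($winner, $best) = ($dst, $dt) if !defined $best || $dt < $best;
}
die "no destination disk responded to the probe\n" unless defined $winner;

# Fault-tolerant transfer: retry the real nanoDST transfer to the winning
# disk with a growing pause between attempts instead of failing outright.
my $src       = "file:///accre/buffer/nanoDST_run7_seg0001.root";   # hypothetical
my $max_tries = 5;
for my $try (1 .. $max_tries) {
    if (system("globus-url-copy", $src, $winner) == 0) {
        print "transfer succeeded on attempt $try to $winner\n";
        last;
    }
    die "giving up after $max_tries attempts\n" if $try == $max_tries;
    warn "attempt $try failed, pausing before retry\n";
    sleep 60 * $try;
}

The point of both ideas is the one made on the slide: a single busy NFS disk or one failed transfer should never stall the whole pipeline.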

7 Proposal Scenarios
Proposed to reconstruct 25-30% of the PHENIX Min Bias data by 2010
- The first scenario ($200K total) covers only the Central Arm data, done in near-real time
- The second scenario ($300K total) will do both Central and Muon Arm data
  - The PRDFs are already at ACCRE, so only more CPUs are needed, not much more disk
- The needed number of CPUs and disk (tape) space is scaled from the Run7 Au+Au project
Alternate scenarios are also possible (not in the proposal document)
- Alternate scenarios depend on decisions of the production managers at BNL
- Not adamant on doing one kind of project (the "factory manager/home office" analogy)
- Do near-real-time Level2 reco (as we did in Run6), if it is not done elsewhere
- Do quick reco of 10% of the data, then wait ~2 months to do a repass and more
- Do traditional slow playback of data stored on tape, ...
Important to us to have a multi-year plan and commitment
- Some disk/tape resources can be "lent" by ACCRE in the first year and paid for in the second year
- If we are selected as the CMS-HI compute center, then we want to know, and plan well for, what the scope of our responsibilities to both projects will be during the same year
- Out-year plans could still be modified based on prior-year experiences

8 Proposal Costs
Time Scales
- Assume an 8-10% growth per year in scope, i.e. Run 8 at about times Run7
- ACCRE will allow us to time-leverage the acquired CPUs
  - For example, if the PHENIX work will all be done in 4 months (near-real time), then we can use at least three times as many CPUs as in our nominal fair-share quota
  - At any given time there is always the opportunity to use more CPUs if they are not being used by others exhausting their fair-share quotas
Capital Costs
- Total cost to purchase 110 processors (204 in scenario 2) is $68,200 ($126,480)
- Total cost to purchase 70 TBytes (77 in scenario 2) is $49,000 ($53,900)
Operating Costs
- Operating costs (power, air-conditioning, 24/7 service support) are charged on an FTE basis according to the amount and kind of hardware purchased by a group
- The FTE amount for scenario 1 in the 3rd year is 0.40 persons
- The FTE amount for scenario 2 in the 3rd year is 0.54 persons
- The FTE salary charge at VU for this category is $65,000/FTE plus fringe
- Overhead is added at 53.5%, giving a 3rd-year support cost of $103K ($141K for scenario 2)
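(A derived consistency check, not on the original slide: the totals imply $68,200 / 110 = $126,480 / 204 = $620 per CPU, and $49,000 / 70 = $53,900 / 77 = $700 per TByte, i.e. the same unit prices in both scenarios.)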

9 Proposal Scenario 1 (Central Arm reco only)

10 Proposal Scenario 2 (Add Muon Arm reco)

11 Proposal Additional Details
Half FTE to Develop Globus Grid Middleware for PHENIX
- The other half of the support comes from Vanderbilt
- A real person (already identified) who is expert in software
- An early project is to develop Grid-based simulation job submission
- Will work with ACCRE techs to advance the IBP depot disk system
  - IBP: Internet Backplane Protocol, designed for Tier3 sites in CMS
  - Enables remote disk access in a distributed university environment
- Half-FTE cost is $32.5K per year + fringe and overhead
Relationship to the CMS-HI Compute Center Proposal
- No double counting of hardware resources (CPUs, disk, tape)
- RHIC and LHC run out of phase (winter shutdowns for the LHC)
- Intense input transfer rates from RHIC and LHC will come in different seasons
- The 10 Gbits/second specified for CMS should vastly exceed PHENIX's need
- CMS-HI ramp-up rate projected as 25%, 50%, 100% in successive years
- Raw data volume for CMS-HI pegged at 300 TBytes (~1 month of running)

12 Advantages For PHENIX
Run 6 and Run 7 Experiences
- Early looks at the performance of the detectors; good PR for PHENIX to BNL and DOE
- Near-real-time decisions possible in principle
- Pressure to get reconstruction libraries in order sooner rather than later
Funding considerations
- DOE may be disposed to give additional money to university groups if a good case can be made (and I think it can here) that a significant subsidy is being made for operations costs
- There is sufficient manpower at Vanderbilt to carry out this work; there are cost savings in bringing the CPUs to the available manpower, instead of the manpower to the CPUs
- The training of students in this work will give (has already given) benefit to PHENIX in preparing future deputy (or full) production managers
It is the future model of large-scale data reconstruction and analysis
- The idea of shared computing responsibilities is central to the LHC computing model
- Fast network speeds, up to 10 Gbits/second, no longer mandate a centralized facility
- If the CMS-HI compute center is located at ACCRE, there will be additional technical support able to work on common problems of data I/O and management
- Even if the CMS-HI compute center is not at ACCRE, ACCRE will still reap the advances in technology brought about by the CMS-HEP group here

13 Three Year Planning Proposal 2008-2010: Budget Policies
DOE Treats RHIC Computing as a Single Line Item in BNL's Budget
- DOE will not fund university groups separately to do computing for PHENIX or STAR
- It is BNL management's decision how to disburse the RHIC computing line item
- BNL management listens to recommendations from PHENIX and STAR
- Hence PHENIX must first approve a recommendation to fund computing at ACCRE
ACCRE Will Not Provide PHENIX with "Free Computing" After 2007
- Strict Federal indirect-cost overhead rules apply (after ACCRE's start-up "grace" period)
- ACCRE cannot give away to PHENIX/DOE what it will charge NIH and NSF researchers
- All research groups are treated with the same accounting rules
Is There a Net Gain for PHENIX Computing at a University?
- Yes: Vanderbilt is not charging the real operations cost of computing, so a subsidy is in effect
- Vanderbilt is committed long term to a 50% share of total support costs at ACCRE
- A university provost will always tell you that he/she loses money on research
- Research at a major university is a loss-leader to attract the best students and faculty, to build the university's reputation, and thence to attract more endowment contributions
- That, and generating Ph.D.'s, is why DOE is happy to see new faculty lines in groups

14 Executive Summary Again
- A proposal to BNL/DOE is being made to fund PHENIX data reconstruction at Vanderbilt's ACCRE farm during 2008 - 2010
- Two funding scenarios are envisioned, with 3-year total costs of $200K and $300K respectively, depending on the scope of the work at ACCRE
  - Alternate scope scenarios are possible year-by-year depending on PHENIX production needs
  - A 3-year plan optimizes resource allocations and readiness for the second- and third-year efforts
- This proposal builds on the experience gained at Vanderbilt in Run6 and Run7 doing near-real-time data reconstruction
- A concurrent proposal is being submitted to the CMS-HI collaboration in a competition to site their U.S. compute center at ACCRE
  - That proposal will cite the PHENIX efforts in Run6 and Run7 as advantages over the two competitors
  - The CMS-HI compute center would be ~5 times larger than the PHENIX $300K proposal
  - Time scale, competing with MIT and Iowa bids, to be decided Feb. '08
- PHENIX will gain
  - The benefit of VU-subsidized costs and time-leveraged computing at ACCRE
  - Efficient use of now-expert manpower in a large PHENIX group, and great service from ACCRE
  - Advantages in keeping pace with CMS-HEP's technological breakthroughs
  - If DOE invests in ACCRE for CMS-HI, PHENIX may share upgrades (tech solutions)

