Proposed 2007 Acquisition
Don Holmgren
LQCD Project Progress Review
May 25-26, 2006, Fermilab


Type of Hardware

Cluster versus BlueGene/L discussion:
- Based on the BG/L single rack at MIT and results reported from KEK, performance on LQCD codes is about 25% of peak for assembly language codes and less than 10% of peak for C/C++ codes, or 2.1 – 5.4 $/Mflops at $1.5M/rack (see the sketch below).
- Cluster price/performance history:
  - "Pion" 1st half: $1.15/Mflops
  - "Pion" 2nd half: $0.94/Mflops
  - "6N": $0.86/Mflops
  - "Kaon" (projected): $0.69/Mflops
- We will pursue BG/L discussions with IBM, but an Infiniband cluster in FY07 is at this point the most cost-effective choice.
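As a rough cross-check of the BG/L $/Mflops range (a sketch only; the roughly 2.8 Tflops per-rack peak assumed here is not a figure from the slide):

```c
/* Hypothetical cross-check of the BG/L price/performance range quoted above.
 * The ~2.8 Tflops per-rack peak is an assumed figure, not taken from the slide. */
#include <stdio.h>

int main(void)
{
    double rack_cost = 1.5e6;  /* $ per BG/L rack (from the slide)          */
    double rack_peak = 2.8e6;  /* Mflops peak per rack (assumption)         */
    double eff_asm   = 0.25;   /* efficiency of assembly-coded LQCD kernels */
    double eff_c     = 0.10;   /* efficiency of C/C++ code                  */

    printf("assembly: $%.1f/Mflops\n", rack_cost / (eff_asm * rack_peak)); /* ~2.1 */
    printf("C/C++:    $%.1f/Mflops\n", rack_cost / (eff_c   * rack_peak)); /* ~5.4 */
    return 0;
}
```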

Location

- Project cluster expertise exists at JLab and FNAL, but not at BNL.
- Last year, the review committee recommended at most one cluster per year, alternating sites.
- This year, we placed a small analysis cluster at JLab, leveraging the 5th-year SciDAC prototype; it is large enough for allocated science tasks and for building Infiniband expertise at sufficient scale.
- We are also installing a large capability cluster at FNAL ("Kaon") suitable for a mixture of configuration generation and analysis computing.

Location, cont'd

I recommended JLab to the LQCD Executive Committee as the site for FY07 deployment for the following reasons:
- The next procurement should start as early as possible, with an RFI in September or October.
- Fermilab will have just finished integrating "Kaon" by the end of September. Operational issues may remain for several months.
- "Pion" plus "Kaon" will represent the bulk of the US LQCD analysis computing capacity for much of FY2007, plus significant configuration generation capability. It is critical that FNAL deliver this capacity competently and not be distracted by another large procurement.

Location, cont'd

Additional reasons:
- The successful deployment of "6N" at JLab established that Infiniband cluster expertise has been sufficiently developed, though at smaller scale.
- Since configuration jobs can't span heterogeneous clusters, there is no physics advantage, for this type of computing, to putting the FY07 machine next to "Kaon".
- Distributing capacity at the two sites mitigates the consequences of site-related outages: a significant event will not disable all LQCD analysis capacity.
- We must ensure that expertise is developed and maintained at both sites, and we must foster shared development.

Location, cont'd

Drawbacks to deployment at JLab:
- Significant experience with delivering I/O to analysis computing (distributed file system access via dCache) exists at FNAL. The project must plan for establishing this expertise at JLab, including consideration of dCache and other alternatives.
- FNAL has larger existing mass storage capacity, for example, the availability of shared tape drives. We will have to understand needs and budget appropriately at JLab (and at FNAL).

Location, cont'd

Discussions regarding the FY07 location were held with:
- the LQCD Executive Committee
- site managers

The LQCD Executive Committee approved the JLab site recommendation at a meeting on March 29.

Design Issues for FY07

The obvious hardware candidates are:
- Intel Woodcrest
  - 1333 MHz FSB, FBDIMM technology
  - Lower power consumption
  - Lower latency SSE (all instructions now 1 cycle; see the sketch below)
  - Benchmarking in April showed a significant performance boost on DWF relative to "Dempsey" and "Opteron"; less of an advantage on MILC
- Intel single socket 1333 MHz ("Conroe")
  - Same microarchitecture as Woodcrest
  - Better per-socket memory bandwidth?
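For illustration, the SSE kernels these benchmarks exercise look roughly like the following single-precision saxpy (a minimal sketch, not code from the project; the real DWF and MILC kernels vectorize SU(3) matrix-vector products, but the mulps/addps instruction mix is similar):

```c
/* Minimal, hypothetical SSE illustration: y[i] += a * x[i] in single precision. */
#include <xmmintrin.h>

void saxpy_sse(float a, const float *x, float *y, int n)
{
    __m128 va = _mm_set1_ps(a);          /* broadcast scalar to 4 lanes */
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
        _mm_storeu_ps(y + i, vy);
    }
    for (; i < n; i++)                   /* scalar remainder */
        y[i] += a * x[i];
}
```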

Processor Candidates, cont'd

- AMD "Socket F" (available July/August)
  - Transition of Opteron memory technology from DDR to DDR2
  - DDR2 either 667 (matches Intel 1333) or 800

Design Issues

Observations from Intel "Dempsey" and "Woodcrest" platforms:
- In-cache performance is very strong, with 8 MB total available (2 MB L2 per core). However, the problem sizes we run use much more memory than this per core, so out-of-cache performance is what matters (see the sketch below).
- Neither the "Blackford" nor the "Greencreek" chipset delivers better total memory bandwidth than the current Opteron.
- All FBDIMM slots must be populated to maximize performance (8 dual-rank FBDIMMs); this drives up cost and power consumption.
- Memory bandwidth must improve from what we've observed on early platforms.
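Out-of-cache bandwidth observations like these are typically checked with a STREAM-style triad loop; a minimal sketch follows (not the official STREAM benchmark that produced the figures quoted in this review; array size and names are illustrative):

```c
/* Rough STREAM-triad-style bandwidth probe (sketch only).
 * Arrays are sized well beyond the 8 MB of L2 so DRAM, not cache, is measured. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N (32 * 1024 * 1024)    /* 256 MB per array, three arrays */

static double wall(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }

    double t0 = wall();
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];             /* triad: two loads + one store per element */
    double secs = wall() - t0;

    double bytes = 3.0 * N * sizeof(double);  /* counted the way STREAM counts */
    printf("triad ~ %.2f GB/s (check %.1f)\n", bytes / secs / 1e9, a[N / 2]);
    free(a); free(b); free(c);
    return 0;
}
```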

Design Issues

Opteron observations (from a dual 280 system):
- Aggregate performance increases at larger problem sizes using naïve MPI (one process per core), indicating that message-passing overheads are affecting performance.
- This suggests that a multithreaded approach, either implicitly via OpenMP or explicitly via threaded code, will boost performance (see the sketch below). But implementation is tricky because of the NUMA architecture.
- SSE codes developed for Intel are slower (in terms of cycles) on Opteron.
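A minimal sketch of the multithreaded alternative described above, assuming one MPI rank per socket with OpenMP threads inside it and first-touch allocation for NUMA locality (hypothetical skeleton, not project code; LOCAL_VOL and field are illustrative names):

```c
/* Hypothetical hybrid MPI + OpenMP skeleton: one MPI rank per socket,
 * OpenMP threads within the rank.  First-touch initialization in the same
 * parallel pattern as the compute loop keeps pages on the local NUMA node. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define LOCAL_VOL (1 << 20)   /* sites per rank (illustrative size) */

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *field = malloc(LOCAL_VOL * sizeof *field);

    /* first touch: pages land on the NUMA node of the thread that touches them */
    #pragma omp parallel for
    for (int i = 0; i < LOCAL_VOL; i++)
        field[i] = 0.0;

    /* compute loop reuses the same thread-to-data mapping */
    #pragma omp parallel for
    for (int i = 0; i < LOCAL_VOL; i++)
        field[i] = 2.0 * field[i] + 1.0;

    /* halo exchange etc. would be done by the master thread (MPI_THREAD_FUNNELED) */
    MPI_Barrier(MPI_COMM_WORLD);

    free(field);
    MPI_Finalize();
    return 0;
}
```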

Design Issues

- For either Intel or AMD, dealing with multicore will be necessary to maximize performance.
- Software development is out of scope. If the LQCD SciDAC-2 proposal is not funded, multicore optimizations will have to come from other sources (base programs, university contributions).

Design Issues

Infiniband DDR, and Infinipath:
- Fermilab's "Kaon" will be the first test of DDR.
  - The major issue is cable length and related communications reliability. Low-cost optical fiber solutions are expected; we will test prototypes in Q4.
  - We will have to draw from "Kaon" experience, as soon as it is available, to understand design issues, for example oversubscription.
- Infinipath looks promising for scaling at smaller message sizes (see the sketch below). We have to understand:
  - the price/performance tradeoff
  - operational issues
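Interconnect comparisons of this kind usually rest on a small latency/bandwidth microbenchmark; a minimal ping-pong sketch follows (not the project's actual benchmark suite; iteration count and message sizes are illustrative) that exposes the small-message regime where Infinipath is expected to do well:

```c
/* Minimal MPI ping-pong sketch for comparing interconnects (e.g. Infiniband DDR
 * vs Infinipath) across message sizes.  Run with exactly 2 ranks, one per node. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 1000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int bytes = 8; bytes <= (1 << 20); bytes *= 2) {
        char *buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double half_rtt = (MPI_Wtime() - t0) / (2.0 * iters);   /* one-way time */
        if (rank == 0)
            printf("%8d bytes  %8.2f us  %8.2f MB/s\n",
                   bytes, half_rtt * 1e6, bytes / half_rtt / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```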

Prototyping

If SciDAC-2 LQCD is funded (July):
- Procure a dual Socket F cluster (16 nodes) in August and include an Infiniband vs. Infinipath comparison (Fermilab).
- Procure the best price/performance Intel cluster (16 nodes) in August (JLab):
  - Woodcrest-based, though only if the FBDIMM chipset issues are resolved
  - or single socket 1066 or 1333 MHz FSB systems

Prototyping

If no SciDAC-2 funding:
- Socket F Opteron testing at the AMD Devcenter.
- Intel Woodcrest testing at TACC (Dell tie-in to the Texas Advanced Computing Center).
- Intel single socket testing at ? (likely APPRO).
- Would also need to devote some budget to buying single nodes.

Performance Estimates

- Woodcrest's 1333 MHz FSB gives roughly a 25% theoretical boost over 1066.
- If the FBDIMM chipset issues are resolved, theoretical throughput should be 21 GB/sec, with achievable throughput of perhaps 10 GB/sec.
  - This would double "Dempsey" out-of-cache performance (Dempsey "stream" = 4.5 GB/sec).
- Can the FBDIMM issues be resolved in a timely fashion?
- If resolved, a Woodcrest system might sustain as much as 8 Gflops on asqtad and over 10 Gflops on DWF.

Performance

AMD "Socket F":
- DDR2-800 would give a doubling of memory bandwidth, DDR2-667 a 67% increase.
- If floating point on the cores can keep up, a single "Socket F" box could sustain ~8.4 Gflops (DDR2-667) to ~10 Gflops (DDR2-800) on asqtad.

Performance Scaling

Assuming similar factors:
- 0.73 for scaling from a single node to a 64-node run
- 1.19 for the asqtad/DWF average

Then (see the sketch below):
- An 8-10 Gflops asqtad box for $2650 including Infiniband will deliver $0.31-$0.38/Mflops.
- For $1.5M, that is 3.9 – 4.8 Tflops.
- The "Deploy" milestone is 3.1 Tflops, which corresponds to a 6.35 Gflops asqtad box.
- "Kaon" nodes are 4.44 Gflops, so a factor of 1.43 in price/performance is needed (May to March with a 21-month halving time gives only 1.39).
- Revise the milestone downwards?
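A sketch of the arithmetic behind these bullets, with the formulas inferred from the quoted factors (the 10-month May-to-March interval is an inference, not stated on the slide):

```c
/* Reproduces the price/performance arithmetic quoted above (inferred formulas). */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double node_cost = 2650.0;    /* $ per node, including Infiniband   */
    double scaling   = 0.73;      /* single node -> 64-node run         */
    double mix       = 1.19;      /* asqtad/DWF average factor          */
    double budget    = 1.5e6;     /* FY07 hardware budget, $            */

    for (double gf = 8.0; gf <= 10.0; gf += 2.0) {
        double delivered = gf * 1000.0 * scaling * mix;   /* Mflops per node */
        double ppm       = node_cost / delivered;         /* $/Mflops        */
        printf("%4.0f Gflops node: $%.2f/Mflops -> %.1f Tflops for $1.5M\n",
               gf, ppm, budget / ppm / 1e6);   /* ~$0.31-$0.38 and ~3.9-4.8 Tflops */
    }

    /* milestone check: 3.1 Tflops from $1.5M corresponds to a 6.35 Gflops asqtad box;
       "Kaon" nodes are 4.44 Gflops, so a factor 6.35/4.44 = 1.43 is required, while
       10 months (May -> March) of a 21-month halving time gives only 2^(10/21) = 1.39 */
    printf("required %.2f vs. expected %.2f\n", 6.35 / 4.44, pow(2.0, 10.0 / 21.0));
    return 0;
}
```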

Schedule

- RFI: Sept/Oct
  - Draw from Socket F, Woodcrest, Conroe
- RFP: release as soon as the budget allows
  - Aim to issue in November/December
  - If under a C.R. and partial funding is available, issue the RFP with an option to buy additional hardware
- Integration: begin in March
- Release milestone: June 30

FY08/FY09

- For planning purposes, Fermilab needs a commitment to the FY08/FY09 system locations.
- The FNAL directorate strongly supports putting both the FY08 and FY09 systems at FNAL.
- The budget profile allows for a large purchase ($1.5M) in FY08 and a smaller purchase ($0.7M) in FY09.
- If a mechanism can be found, there are clear advantages to combining the smaller FY09 acquisition with that from FY08.

FY08/FY09 cont'd

Disadvantages of a small FY09 purchase:
- Because of Moore's Law, we would expect faster hardware to be available in FY09. However, faster hardware could not be integrated with an FY08 system in the sense of jobs spanning both sets of hardware.
- A larger capability machine would result from a combined FY08/FY09 purchase, and integrated physics production would be greater.
- Procurement requires manpower. A combined purchase would allow for a reduction in budgeted effort for FY09 (that is, a shift of budget from deployment effort to hardware).

FY08/FY09 cont'd

- Compares the separate FY08 plus FY09 Tflops deployments to a combined FY08/FY09 purchase (6.1 Tflops).
- Crossover takes 32 months.