Proposed 2007 Acquisition Don Holmgren LQCD Project Progress Review May 25-26, 2006 Fermilab
Type of Hardware: Cluster versus BlueGene/L discussion. Based on the BG/L single rack at MIT and results reported from KEK, performance on LQCD codes is about 25% of peak for assembly language codes and less than 10% of peak for C/C++ codes, or $2.1 – $5.4/Mflops ($1.5M/rack). Cluster price/performance history: “Pion” 1st half, $1.15/Mflops; “Pion” 2nd half, $0.94/Mflops; “6N”, $0.86/Mflops; “Kaon” (projected), $0.69/Mflops. We will pursue BG/L discussions with IBM, but at this point an Infiniband cluster is the most cost-effective choice for FY07.
Location: Project cluster expertise exists at JLab and FNAL, but not at BNL. Last year, the review committee recommended at most one cluster per year, alternating sites. This year, we placed a small analysis cluster at JLab, leveraging the 5th-year SciDAC prototype, large enough for allocated science tasks and for building Infiniband expertise at sufficient scale. We are also installing a large capability cluster at FNAL (“Kaon”) suitable for a mixture of configuration generation and analysis computing.
Location, cont’d: I recommended JLab to the LQCD Executive Committee as the site for FY07 deployment for the following reasons. The next procurement should start as early as possible, with an RFI in September or October. Fermilab will have just finished integrating “Kaon” by the end of September, and operational issues may remain for several months. “Pion” plus “Kaon” will represent the bulk of the US LQCD analysis computing capacity for much of FY2007, plus significant configuration generation capability. It is critical that FNAL deliver this capacity competently and not be distracted by another large procurement.
Location, cont’d: Additional reasons. The successful deployment of “6N” at JLab established that Infiniband cluster expertise has been sufficiently developed, though at smaller scale. Since configuration jobs can’t span heterogeneous clusters, there is no physics advantage, for this type of computing, in putting the FY07 machine next to “Kaon”. Distributing capacity across the two sites mitigates the consequences of site-related outages, in that a significant event will not disable all LQCD analysis capacity. We must ensure that expertise is developed and maintained at both sites. Also, we must foster shared development.
Location, cont’d: Drawbacks to deployment at JLab. Significant experience with delivering I/O to analysis computing (distributed file system access via dCache) exists at FNAL. The project must plan for establishing expertise at JLab, including consideration of dCache and other alternatives. FNAL also has larger existing mass storage capacity, for example, availability of shared tape drives. We will have to understand needs and budget appropriately at JLab (and at FNAL).
Location, cont’d: Discussions regarding FY07 location were held with the LQCD Executive Committee and the site managers. The LQCD Executive Committee approved the JLab site recommendation at a meeting on March 29.
Design Issues for FY07: The obvious hardware candidates are as follows. Intel Woodcrest: 1333 MHz FSB, FBDIMM technology; lower power consumption; lower latency SSE (all instructions now 1 cycle); benchmarking in April showed a significant performance boost on DWF relative to “Dempsey” and “Opteron”, with less of an advantage on MILC. Intel single socket 1333 MHz (“Conroe”): same microarchitecture as Woodcrest; better per socket memory bandwidth?
Processor Candidates, cont’d: AMD “Socket F” (available July/August): transition of Opteron memory technology from DDR to DDR2; DDR2 either 667 (matches Intel 1333) or 800.
Design Issues: Observations from Intel “Dempsey” and “Woodcrest” platforms. In-cache performance is very strong, with 8MB total available (2MB L2 per core). However, we would run at MB per core. Neither the “Blackford” nor the “Greencreek” chipset delivers better total memory bandwidth than current Opteron. All FBDIMM slots must be populated to maximize performance (8 dual rank FBDIMMs); this drives up cost and power consumption. Memory bandwidth must improve from what we’ve observed on early platforms.
Design Issues: Opteron observations (from dual 280 system). Aggregate performance increases at larger problem sizes using naïve MPI (one process per core), indicating that message passing overheads are affecting performance. This suggests that a multithreaded approach, either implicitly via OpenMP or explicitly via threaded code, will boost performance, but implementation is tricky because of the NUMA architecture. SSE codes developed for Intel are slower (in terms of cycles) on Opteron.
Design Issues: For either Intel or AMD, dealing with multicore will be necessary to maximize performance. Software development is out of scope. If the LQCD SciDAC-2 proposal is not funded, multicore optimizations will have to come from other sources (base programs, university contributions).
Design Issues: Infiniband (DDR) and Infinipath. Fermilab “Kaon” will be the first test of DDR. The major issue is cable length and the related communications reliability issues. Low cost optical fiber solutions are expected in We will test prototypes in Q4. We will have to draw from “Kaon” experience, as soon as it is available, to understand design issues, for example, oversubscription. Infinipath looks promising for scaling at smaller message sizes. We have to understand the price/performance tradeoff and operational issues.
Prototyping: If SciDAC-2 LQCD is funded (July): Procure a dual Socket F cluster (16 nodes) in August and include an Infiniband vs. Infinipath comparison (Fermilab). Procure the best price/performance Intel cluster (16 nodes) in August (JLab): Woodcrest-based, though only if FBDIMM chipset issues are resolved; or single socket 1066 or 1333 MHz FSB systems.
Prototyping: If no SciDAC-2 funding: Socket F Opteron testing at AMD Devcenter. Intel Woodcrest testing at TACC (Dell tie-in to the Texas Advanced Computing Center). Intel single socket testing at ? (likely APPRO). We would also need to devote some budget to buying single nodes.
Performance Estimates: Woodcrest at % theoretical boost over 1066. If FBDIMM chipset issues are resolved, theoretical throughput should be 21 GB/sec, with achievable throughput of perhaps 10 GB/sec. This would double “Dempsey” out-of-cache performance (Dempsey “stream” = 4.5 GB/sec). Can FBDIMM issues be resolved in a timely fashion? If resolved, a Woodcrest system might sustain as much as 8 Gflops on asqtad and over 10 Gflops on DWF.
Performance: AMD “Socket F”: DDR2-800 would give a doubling of memory bandwidth, DDR2-667 a 67% increase. If floating point on the cores can keep up, a single “Socket F” box could sustain ~8.4 Gflops (DDR2-667) to ~10 Gflops (DDR2-800) on asqtad.
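The bandwidth ratios on this slide can be checked with a short sketch. It assumes the comparison baseline is DDR-400 memory on the current Opteron (the slide does not state the baseline) and that asqtad performance scales linearly with memory bandwidth, which is the slide's "if floating point can keep up" condition. The function names are illustrative only.

```python
# Sketch only: BASELINE_DDR_MTS = 400 is an assumption, not stated on the slide.
BASELINE_DDR_MTS = 400.0


def bandwidth_ratio(ddr2_mts, baseline_mts=BASELINE_DDR_MTS):
    """Ratio of DDR2 transfer rate to the assumed DDR baseline."""
    return ddr2_mts / baseline_mts


def implied_baseline_gflops(projected_gflops, ddr2_mts):
    """Baseline asqtad Gflops implied by a projection, assuming
    performance scales linearly with memory bandwidth."""
    return projected_gflops / bandwidth_ratio(ddr2_mts)


if __name__ == "__main__":
    for mts, projected in ((667, 8.4), (800, 10.0)):
        print(f"DDR2-{mts}: bandwidth x{bandwidth_ratio(mts):.2f}, "
              f"implied baseline {implied_baseline_gflops(projected, mts):.2f} Gflops")
```

Under these assumptions both projections imply a current baseline of roughly 5 Gflops per box; that baseline is inferred here, not quoted from the slides.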
Performance Scaling: Assuming similar factors: 0.73 for scaling from single node to 64-node run; 1.19 for asqtad/DWF average. Then: An 8-10 Gflop asqtad box for $2650 including Infiniband will deliver $0.31-$0.38/Mflop. For $1.5M, that is 3.9 – 4.8 Tflops. The “Deploy” milestone is 3.1 Tflops; a 6.35 Gflop asqtad box corresponds to 3.1 Tflops. “Kaon” nodes are 4.44 Gflop, so we need a factor of 1.43 in price/performance (May to March with a 21 month halving time gives 1.39). Revise milestone downwards?
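The arithmetic on this slide can be reproduced with a minimal sketch. All numeric inputs come from the slide; the function names are illustrative, the node count is treated as a fractional value for simplicity, and the 10-month interval (May to March) is an assumed reading of the parenthetical.

```python
# Sketch of the slide's price/performance arithmetic. Inputs are from the slide.
SCALING_64_NODE = 0.73      # single node -> 64-node run
ASQTAD_DWF_AVERAGE = 1.19   # asqtad/DWF average factor
NODE_COST_USD = 2650.0      # per node, including Infiniband
BUDGET_USD = 1.5e6


def effective_gflops(asqtad_gflops):
    """Sustained per-node Gflops after applying both scaling factors."""
    return asqtad_gflops * SCALING_64_NODE * ASQTAD_DWF_AVERAGE


def dollars_per_mflop(asqtad_gflops):
    """Price/performance of one node in $/Mflop."""
    return NODE_COST_USD / (effective_gflops(asqtad_gflops) * 1000.0)


def cluster_tflops(asqtad_gflops):
    """Tflops bought with the full budget (fractional node count)."""
    nodes = BUDGET_USD / NODE_COST_USD
    return nodes * effective_gflops(asqtad_gflops) / 1000.0


def moore_factor(months, halving_months=21.0):
    """Price/performance gain expected after `months`, given a halving time."""
    return 2.0 ** (months / halving_months)


if __name__ == "__main__":
    for gflops in (8.0, 10.0, 6.35):
        print(f"{gflops:5.2f} Gflop box: ${dollars_per_mflop(gflops):.2f}/Mflop, "
              f"{cluster_tflops(gflops):.1f} Tflops for $1.5M")
    print(f"required factor over Kaon nodes: {6.35 / 4.44:.2f}")
    print(f"expected factor over 10 months:  {moore_factor(10):.2f}")
```

Computed without intermediate rounding, the 10 Gflop case comes out near 4.9 Tflops; the slide's 4.8 Tflops appears to follow from dividing $1.5M by the rounded $0.31/Mflop.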
Schedule: RFI: Sept/Oct; draw from Socket F, Woodcrest, Conroe. RFP: release as soon as budget allows; aim to issue in November/December; if under C.R. and partial funding is available, issue RFP with option to buy additional hardware. Integration: begin in March. Release milestone: June 30.
FY08/FY09: For planning purposes, Fermilab needs a commitment to FY08/FY09 system locations. The FNAL directorate strongly supports putting both FY08 and FY09 systems at FNAL. The budget profile allows for a large purchase ($1.5M) in FY08 and a smaller purchase ($0.7M) in FY09. If a mechanism can be found, there are clear advantages to combining the smaller FY09 acquisition with that from FY08.
FY08/FY09 cont’d: Disadvantages of a small FY09 purchase. Because of Moore’s Law, we would expect faster hardware to be available in FY09. However, faster hardware could not be integrated with an FY08 system in the sense of jobs spanning both sets of hardware. A larger capability machine would result from a combined FY08/FY09 purchase, and integrated physics production would be greater. Procurement requires manpower; a combined purchase would allow for a reduction in budgeted effort for FY09 (that is, a shift of budget from deployment effort to hardware).
FY08/FY09 cont’d Compares: FY Tflop + FY Tflop to FY08/FY =6.1 Tflop Crossover takes 32 months