Slide 1: Jefferson Lab 2006 LQCD Analysis Cluster
Chip Watson, Jefferson Lab High Performance Computing
May 25-26, 2006 LQCD Computing Review

Slide 2: Analysis Cluster Goals
Additional capacity for data analysis
- Existing clusters were oversubscribed; additional capacity was needed primarily for data analysis
- Goal: 400 GFlops, deployed as early in the year as possible (funding constrained)
Extend Infiniband experience to Jefferson Lab
- Anticipating that the FY 2007 machine might be sited at Jefferson Lab, the project wanted Jefferson Lab staff to gain the necessary experience with Infiniband

Slide 3: Strategy
As presented last year, the plan was to replicate the Fermilab "pion" cluster (at reduced scale) at JLab, combining funds from the last year of SciDAC (prototype hardware) and the first year of the LQCD Computing project.
One promising new alternative (the Intel dual-core Pentium-D) was evaluated, leading to a decision to take the first step toward dual-core CPUs with this cluster.

Slide 4: SciDAC Prototyping
Under the SciDAC project, the collaboration has for the last 5 years been evaluating a series of commodity technologies for clusters optimized for LQCD. As a continuation of this work, the Pentium-D was evaluated in the fall of 2005. Expectations were that dual-core scaling would be poor, as it had been for the last 4 years with dual-CPU Xeons. We were pleasantly surprised!

Slide 5
- DWF scaling to the 2nd core is quite good
- Total performance is also very good

Slide 6
At small local volumes, the additional communication overhead of running 2 processes hurts dual-core performance. Multi-threading the code should further improve this, as the surface-to-volume sketch below illustrates.
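A rough surface-to-volume estimate makes this concrete. The sketch below is only an illustration with hypothetical local lattice sizes (none are taken from the benchmarks on these slides): splitting a node's sub-lattice between two MPI processes adds an internal boundary, so the communication surface per lattice site grows, and the penalty matters most when the local volume is already small. A multi-threaded code would keep one process (and one surface) per node while still using both cores.

```python
# Illustrative only: compare the communication surface per site for one
# process owning an L^4 sub-lattice versus two processes each owning half
# of it. Lattice sizes are hypothetical, chosen to show the trend.

def volume(dims):
    v = 1
    for d in dims:
        v *= d
    return v

def surface_sites(dims):
    # Two faces per direction; each face has volume/extent sites.
    v = volume(dims)
    return sum(2 * v // d for d in dims)

for L in (4, 6, 8, 12):
    whole = (L, L, L, L)        # one process per node (e.g. a threaded code)
    half = (L, L, L, L // 2)    # two processes per node, one sub-lattice each
    print(f"L={L:2d}  surface/volume: one process "
          f"{surface_sites(whole) / volume(whole):.2f}, "
          f"two processes {surface_sites(half) / volume(half):.2f}")
```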

Slide 7: Decision Process
Option 1: stay with "pion" for deployment
- Pro: known platform, lower risk
- Con: give up some potential performance gain
Option 2: switch to Pentium-D for deployment
- Pro: 22% increase in performance (at fixed cost), plus a start on experience with dual core
- Con: slight risk in using a new chip and motherboard
Note: same HCA; the 2 cores can be used in multi-process mode, so no software changes are anticipated initially.

Slide 8: Decision Process & Plan Change Control
- Cost neutral: node cost is ~10% higher, but we can buy fewer nodes
- Slightly increased risk
- ~20% improvement in performance and price/performance (see the sketch below)
Large-scale prototype first
- Use JLab SciDAC funds, plus JLab theory group funds, to build a prototype at a scale of 128 nodes
- Once operational, use project funds to add an additional 128+ nodes
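The cost-neutrality argument behind the change request can be written out as a short calculation. This is a sketch, not the project's official cost model; the ~32% per-node performance advantage below is an assumed illustrative figure, while the ~10% node-cost increase and the ~20% net gain are the numbers quoted on the slide.

```python
# At a fixed budget, aggregate performance scales as
# (nodes you can afford) x (performance per node) = budget / node_cost * node_perf.

budget = 1.0                      # normalized, fixed project budget
pion_cost, pion_perf = 1.0, 1.0   # reference "pion" node (normalized)

dual_cost = 1.10 * pion_cost      # ~10% higher node cost (from the slide)
dual_perf = 1.32 * pion_perf      # assumed per-node gain (illustrative)

def aggregate(node_cost, node_perf):
    return budget / node_cost * node_perf

gain = aggregate(dual_cost, dual_perf) / aggregate(pion_cost, pion_perf)
print(f"net improvement at fixed cost: {100 * (gain - 1):.0f}%")   # ~20%
```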

Slide 9: Timeline
- Sept 2005: procured Infiniband fabric for 130 nodes (end-of-year funds)
- Oct 2005: procured two Pentium-D nodes for testing
- November 2005: two-node performance data were very encouraging (Streams, Chroma DWF application, MILC staggered); decision taken to use SciDAC + base funds to procure a 128+ node cluster for a large-scale test of the dual-core platform (approved by the SciDAC oversight committee)
- December 2005: procurement of 140 nodes, half from SciDAC funds, half from the JLab theory group (this project decided to await full-scale tests)
- Jan/Feb 2006: cluster installed at JLab. Based upon good performance (and behavior), a Change Request was submitted to the LQCD Computing Project Change Control Board recommending procurement of dual-core nodes instead of "pion" nodes
- Feb 2006: Change Request approved; additional 140 nodes ordered
- March/April 2006: machine commissioned (tested with several versions of IB and the Linux OS)
- May 1, 2006: start of production running

Slide 10: Cluster Details
280-node 2006 Infiniband cluster "6n"
- Dell Pentium-D dual-core nodes
- 1 GByte DDR2-667 memory (800 MHz FSB)
- 80 GB SATA disk
- IPMI for node monitoring and control (a hung node can be rebooted from home)
- Mellanox PCI-Express 4x "memfree" cards (same as FNAL)
- 24-port Flextronics switches (same as FNAL)
- 35 nodes per rack; 17 or 18 nodes per leaf switch (2 leaf switches per rack)
- Spine switch built from 5 of the 24-port switches (modular and fault tolerant); see the oversubscription sketch below
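The oversubscription figures quoted on the next slide follow from this port allocation. The sketch below assumes one uplink from each leaf switch to each of the five spine switches; that specific wiring is an assumption consistent with the description, not something stated on the slide.

```python
# Leaf/spine oversubscription estimate for the 6n fabric, under the
# assumption of one uplink per leaf to each of the 5 spine switches.

leaf_ports = 24
spine_switches = 5
uplinks_per_leaf = spine_switches          # assumed wiring

for nodes_on_leaf in (17, 18):
    assert nodes_on_leaf + uplinks_per_leaf <= leaf_ports
    raw = nodes_on_leaf / uplinks_per_leaf
    print(f"{nodes_on_leaf} nodes on a leaf, {uplinks_per_leaf} uplinks -> "
          f"raw oversubscription {raw:.1f}:1")

# With a "smart" job layout that keeps most nearest-neighbor traffic on a
# single leaf switch, the effective ratio seen by a job drops to roughly 2:1.
```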

Slide 11: 6n Cluster
- 2:1 oversubscription with a smart job layout, else 3.5:1
- Up to 2.5 GFlops per node for DWF, 2.3 for staggered
- A single job on the combined resources achieved > 600 GFlops ($0.8 / MFlops)
- Operational since May 1

Slide 12: Results
480 GFlops added to the project's analysis capacity
- Project procurement: 140 nodes, $250K plus spares and installation
- 2.3 GFlops per node (average, sustained) => $0.80 / MFlops (see the check below)
- SciDAC (70 nodes, ~0.16 TFlops) + Project (140 nodes, 0.32 TFlops) yields 480 GFlops managed by this project
Two staff members (Balint Joo and Jie Chen) have now gained proficiency with Infiniband
- Balint Joo manages the fabric (upgrades firmware, runs diagnostics)
- Balint identified an error in the latest IB-2 stack
- Jie Chen localized it to bad memory management in the eager send protocol when the shared receive queue is enabled (the default for PCI-Express platforms); we have now disabled this mode
- A bug report was submitted; a bug fix was received last week
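As a sanity check, the quoted price/performance follows directly from the numbers on this slide.

```python
# Price/performance check using the slide's figures: 140 project nodes for
# about $250K, sustaining about 2.3 GFlops per node on average.

nodes = 140
cost_dollars = 250_000
gflops_per_node = 2.3

total_gflops = nodes * gflops_per_node
dollars_per_mflops = cost_dollars / (total_gflops * 1000)

print(f"total sustained: {total_gflops:.0f} GFlops")          # ~322 GFlops (~0.32 TFlops)
print(f"price/performance: ${dollars_per_mflops:.2f}/MFlops") # ~$0.78, i.e. ~$0.80/MFlops
```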

Slide 13: Summary
- The JLab cluster "6n" is now in production
- Price/performance: $0.80 / MFlops
- JLab is now running over 1000 nodes, with Infiniband in the mix
- JLab is ready to host the FY 2007 cluster (lots of space!)

Slide 14: (Backup Slides)

Slide 15: Architecture Diagram

Slide 16: Operational Data
- Approximately 50 users, 20% active
- Most jobs on the gigE mesh clusters are 128-node (4x4x8); see the sketch below
- Uptime for the year was below 90%, due to installation of a new wing (and computer room); currently doing > 90%
- All first-year allocations have now been delivered (except for one small project that didn't use its time); most projects will end the first "year" with 15% extra CPU cycles, since the May-June friendly running period was not formally allocated
- 60 terabytes are now in the JLab silo (< 5% of the resource); the project pays only for tape, not slots nor a share of the robotics
- Many detailed views of the computing systems, batch system, and resource usage are available via the web
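For context on the 128-node job geometry, the sketch below checks the 4x4x8 machine grid and shows how a lattice would be divided across it. The global lattice size used here is hypothetical (the slides do not give one), and keeping the time direction entirely on each node is just one common choice, assumed for illustration.

```python
# Hypothetical example of partitioning a 4D lattice over the 4 x 4 x 8
# gigE mesh used for 128-node jobs. The lattice size is illustrative only.

machine_grid = (4, 4, 8)                 # from the slide: 128-node jobs
global_lattice = (24, 24, 32, 64)        # hypothetical (x, y, z, t)

nodes = 1
for n in machine_grid:
    nodes *= n
assert nodes == 128

# Spatial directions are split across the 3D mesh; time stays local (an assumption).
local = [global_lattice[i] // machine_grid[i] for i in range(3)] + [global_lattice[3]]
print(f"{nodes} nodes, local sub-lattice: "
      f"{local[0]} x {local[1]} x {local[2]} x {local[3]}")   # 6 x 6 x 4 x 64
```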