LQCD Clusters at JLab
Chip Watson, Jie Chen, Robert Edwards, Ying Chen, Walt Akers
Jefferson Lab

Jefferson Lab SciDAC Prototype Clusters

The SciDAC project is funding a sequence of cluster prototypes which allow us to track industry developments and trends, while also deploying critical compute resources.

Myrinet + Pentium 4
– 128 nodes, each a single 2.0 GHz P4 Xeon (Summer 2002)
– 64 Gbytes memory

Gigabit Ethernet Mesh + Pentium 4 (an alternative, cost-effective cluster design now being evaluated)
– 256 nodes (8x8x4), each a single 2.66 GHz P4 Xeon (Fall 2003)
– 64 Gbytes memory

And of course, Infiniband appears as a strong future choice.

128 Node Myrinet Cluster at JLab (photo): Myrinet interconnect, 2 GHz P4 nodes, 1U, 256 MB memory per node.

2002 Myrinet Cluster Performance

Each Myrinet cluster node delivers ~600 Mflops for the DWF inverter on large problems (16^4), and sustains this performance down to 2x4^3, yielding ~75 Gflops for the 128-node cluster (rising as more SSE code is integrated into SZIN and Chroma; 150 Gflops on Wilson-Dirac).

Small-problem (cache-resident) performance on a single processor is 3x the memory-bandwidth-constrained rate. This cache boost offsets network overhead. Memory bandwidth is not enough for a 2nd processor.

Network characteristics:
– … MB/s bandwidth
– 8 µs RTT/2 latency
– 4 µs software overhead for send+receive

256 Node GigE Mesh at JLab (mesh diagram: red = X links, green = Y, grey = Z, yellow = I/O).

2003 GigE Cluster Performance

The gigE cluster nodes deliver ~700 Mflops/node for the inverter on large problems (16^4), but performance degrades for small problems due to the not-yet-optimized gigE software. Nodes are barely faster than the earlier cluster despite a 33% faster bus, reflecting a lower-quality chipset implementation.

Network characteristics:
– … MB/s bandwidth on 1 link, … MB/s on 3 links
– 19 µs RTT/2 latency (12 µs effective latency at interrupt level, planned for fast global sums)
– 13 µs software overhead for send+receive (plan to reduce this)

The Wilson inverter is 12% faster, proportional to Streams (memory bandwidth).

GigE Application-to-Application Latency (plot): 18.5 µs. Interrupt-to-interrupt latency is 12.5 µs; this will be used to accelerate global sums.

GigE point-to-point bandwidth (plot).

Host Overheads of M-VIA (plot): send overhead Os ~ 6 µs, receive overhead Or ~ 6 µs, 500 MB/s.

GigE Aggregated Bandwidths

              Bi-directional bandwidth   Latency            Cost
Myrinet/GM    260 MB/s (480 MB/s)        9 µs (7 µs)
GigE/VIA      440 MB/s                   18.5 µs (12.5 µs)  75 x 6

Note: memory bandwidth saturation; the copy in the interrupt routine starves the benchmark task for cycles.

GigE Evaluation & Status

The gigE mesh reduced the system cost by 30%, allowing JLab to build a 256 node cluster instead of a 128 node cluster (the largest 2^N partition) within the SciDAC + NP matching budget.

Software development was hard!
– The initial VIA low-level code development was quick, but QMP took longer (message segmentation, multi-link management; sketched below)
– A strange VIA driver bug was only recently found (it occurs only when a running process consumes all of physical memory)
– One known bug in VIA finalize at job end sometimes hangs a node
– The system is perhaps only now becoming stable

Hardware seems fairly reliable (assuming all or most hangs are due to the pesky VIA bug, now deceased)
– However, IPMI is faulty; we had to disable the SMBus on the gigE chips
– A handful of early card and cable failures; since then failures have been modest: about 1 disk failure a month across the 2 clusters' 400 nodes, with power supply failures less frequent
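The multi-link management mentioned above can be pictured as round-robin striping of one logical message across the gigE links. The Python sketch below is not the actual QMP or M-VIA code; it is a minimal illustration of the segmentation idea, and the 32 KB segment size, 3-link count, and the segment/reassemble helper names are all assumptions made for this sketch.

    # Illustrative only: stripe a message over several gigE links by cutting
    # it into fixed-size fragments and distributing them round-robin.
    SEG_SIZE = 32 * 1024        # assumed segment size (bytes)
    NUM_LINKS = 3               # gigE links used per neighbor, as on this cluster

    def segment(message, num_links=NUM_LINKS, seg_size=SEG_SIZE):
        """Split a message into per-link lists of (sequence, payload) fragments."""
        frags = [[] for _ in range(num_links)]
        for seq, off in enumerate(range(0, len(message), seg_size)):
            frags[seq % num_links].append((seq, message[off:off + seg_size]))
        return frags

    def reassemble(frags):
        """Inverse of segment(): merge per-link fragments back into one buffer."""
        ordered = sorted(frag for link in frags for frag in link)
        return b"".join(payload for _, payload in ordered)

    if __name__ == "__main__":
        msg = bytes(range(256)) * 2000                 # ~512 KB test message
        assert reassemble(segment(msg)) == msg         # round trip is lossless
        nfrag = sum(len(link) for link in segment(msg))
        print("striped over", NUM_LINKS, "links in", nfrag, "fragments")

In the real driver the fragments would be posted to separate NICs and reassembled on the receive side; the sketch only shows the bookkeeping that makes a multi-link exchange look like a single message to the application.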

Modeling Cluster Performance

Model curves include CPU in-cache and out-of-cache performance, PCI and link bandwidth, latency, etc. A moderately simple model predicts cluster Wilson-operator performance quite well, and the inverter can be modeled the same way.

Reminder: we can effectively model cluster performance using node and network characteristics.
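As a concrete illustration of the kind of model described here, the Python sketch below estimates per-node sustained performance as compute time (in-cache or memory-bound rate, chosen by local volume) plus halo-exchange time (bandwidth, latency, software overhead). It is not the JLab model from the slide: every constant, flop count, and byte count is an assumed placeholder, and the function name is invented for this sketch.

    # Toy node+network model for a Wilson/DWF-style kernel (illustration only).
    def wilson_model(L,                       # local lattice is L^4 sites per node
                     flops_in_cache=1800e6,   # assumed flop/s when data fit in cache
                     flops_memory=600e6,      # assumed flop/s when memory-bound
                     cache_bytes=512 * 1024,  # assumed usable cache per node
                     bandwidth=100e6,         # assumed network bytes/s per node
                     latency=19e-6,           # one-way latency (s), from the gigE slide
                     overhead=13e-6,          # send+recv software overhead (s)
                     mesh_dims=3):            # lattice is cut along 3 machine axes
        vol = L ** 4
        flops_per_site = 1320                 # rough Wilson-dslash flop count per site
        bytes_per_site = 24 * 4               # one single-precision spinor per site
        cpu = flops_in_cache if vol * bytes_per_site < cache_bytes else flops_memory
        t_compute = vol * flops_per_site / cpu
        halo_bytes = 2 * mesh_dims * L ** 3 * 12 * 4   # half-spinor per face site
        n_messages = 2 * mesh_dims                     # one message per face
        t_comm = halo_bytes / bandwidth + n_messages * (latency + overhead)
        t_total = t_compute + t_comm          # no compute/comm overlap in this toy
        return vol * flops_per_site / t_total / 1e6    # sustained Mflop/s per node

    for L in (4, 8, 16):
        print(f"L={L:2d}: ~{wilson_model(L):5.0f} Mflop/s per node (toy model)")

The point is only that node and network characteristics are enough to predict the trend with problem size, which is what the model curves on the slide do far more carefully.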

GigE Mesh Cluster Efficiency

2003:
– 533 MHz FSB
– 900 Mflops out of cache
– 256 nodes
– Production running is near 7^4 x 16, where efficiency is ~85%

2004:
– 800 MHz FSB
– 1500+ Mflops out of cache
– 512 nodes
– Vectorize in the 5th dimension to lower the number of messages

Even though the hypothetical 2004 cluster is bigger and has 66% faster nodes, its efficiency improves because:
a) a faster memory bus: the copies that run the protocol become cheaper (modest effect)
b) fewer messages (better algorithm): a big effect, since it dilutes the software overhead (see the sketch below)
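To make point (b) concrete, the sketch below compares per-iteration communication cost when each of the Ls (5th-dimension) slices of a domain-wall field exchanges its 4D halo separately versus when all Ls slices are packed into one message per face. The overhead and latency values are borrowed loosely from the gigE numbers earlier in the talk; the bandwidth and local volume are assumptions, not the 2004 design figures.

    # Same halo data, fewer and larger messages: per-message software overhead
    # and latency are paid faces*Ls times versus faces times.
    def comm_time(n_messages, total_bytes,
                  overhead=13e-6,     # software send+recv overhead per message (s)
                  latency=19e-6,      # per-message latency (s)
                  bandwidth=100e6):   # assumed bytes/s per node
        return n_messages * (overhead + latency) + total_bytes / bandwidth

    L, Ls, faces = 4, 16, 6                  # local 4^4 lattice, Ls = 16, 3D mesh
    face_bytes = L ** 3 * 12 * 4             # half-spinor per face site, single precision
    total = faces * Ls * face_bytes          # halo traffic per iteration either way

    per_slice  = comm_time(faces * Ls, total)   # Ls separate messages per face
    vectorized = comm_time(faces, total)        # one packed message per face

    print(f"per-slice messaging  : {per_slice * 1e6:7.1f} µs/iteration")
    print(f"vectorized in 5th dim: {vectorized * 1e6:7.1f} µs/iteration")

With these assumed numbers the vectorized scheme roughly halves the communication time for a small local volume, which is exactly the regime where per-message overhead dominates.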

256 Node Infiniband Cluster Efficiency

GigE vs Infiniband:
– At 16^4, no difference in efficiency
– At 4^4, Infiniband does 15% better, at 30% higher cost
– Only need a bandwidth of … MB/sec for well structured code (good I/O and compute overlap; see the sketch below)

Infiniband cost can be trimmed by not building a full fat tree: use 20:4 instead of 16:8 on the edge switches (20:4 has been assumed in the cost model).
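The "well structured code" bullet is about overlapping communication with computation: if the halo exchange is hidden behind local compute, a node only needs enough bandwidth to keep the exchange shorter than the compute, rather than enough to make it negligible. The sketch below illustrates this with assumed numbers (the slide's own bandwidth figure is not reproduced here), so it is a reasoning aid, not the JLab analysis.

    # With perfect overlap an iteration costs max(t_compute, t_comm) rather than
    # their sum, so extra bandwidth beyond the point where t_comm < t_compute
    # buys nothing. All numbers below are assumptions for illustration.
    def iteration_time(t_compute, t_comm, overlapped):
        return max(t_compute, t_comm) if overlapped else t_compute + t_comm

    t_compute = 150e-6                 # assumed local compute time per iteration (s)
    halo_bytes = 6 * 4 ** 3 * 12 * 4   # assumed halo traffic per iteration (bytes)

    for bw_mb in (50, 100, 200, 400):  # candidate per-node bandwidths (MB/s)
        t_comm = halo_bytes / (bw_mb * 1e6)
        no_overlap = iteration_time(t_compute, t_comm, overlapped=False)
        overlap    = iteration_time(t_compute, t_comm, overlapped=True)
        print(f"{bw_mb:3d} MB/s: no overlap {no_overlap * 1e6:6.1f} µs, "
              f"overlapped {overlap * 1e6:6.1f} µs per iteration")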

What about “SuperClusters”?

A supercluster is one big enough to run LQCD in cache. For domain wall, this would be a 2x4x4x4x16 local volume (about 1 MB single precision). At this small volume, cluster efficiency drops to around 30% (gigE) or 40% (Infiniband), but CPU performance goes up by 3x, yielding performance BETTER than for 16^5 on the Infiniband cluster.

Performance for clusters thus has 2 maxima: one at large problems, and one at the cache “sweet spot”.

Cluster size? For 32^3x48x16 one needs 6K processors. Too big.

Summer 2004 JLab Cluster

512 node 8x8x8 GigE mesh: $1.60-$1.25 / Mflops (estimated, problem-size dependent)
– 2U rackmount
– On node failure, segment into plane pairs

384 node Infiniband (plus a few spares): $2.10-$1.60 / Mflops (estimated)
– Pedestal instead of rackmount to save cost
– Can run as 384 nodes if the problem has a factor of 3
– Fault tolerance, since spare nodes are in the same fabric
– More flexibility in scheduling small jobs

If these estimates are borne out by firm quotes, and if operational experience on the existing gigE mesh is good, then GigE is the favored solution; the expectation is that FNAL will deliver the complementary solution as Infiniband prices drop further.

JLab Computing Environment

1. Red Hat 9 compute and interactive nodes
2. 5 TB disk pool
   – Auto-migrate to silo (JSRM daemon); < 100 GB/day is OK, more if a dedicated drive is purchased
   – Pin / unpin, permanent / volatile
3. PBS batch system, with separate servers per cluster
4. Two interactive nodes per cluster
5. Some unfriendly features (to be fixed): 2-hop ssh from offsite, scp (must go through the JLab gateway)

For More Information

– Lattice QCD Web Server / Home Page: The Lattice Portal at JLab
– High Performance Computing at JLab