The QCDOC Project: Overview and Status
Norman H. Christ
DOE LGT Review, May 24-25, 2005

Outline
– Project goals
– QCDOC collaboration
– Architecture
– Software
  – Operating system
  – Run-time environment
  – Programming environment
– Construction and packaging
– Construction status
– Final bring-up issues
– Application performance
– Future plan

Project Goals
– Lattice QCD provides the only first-principles window into the non-perturbative phenomena of QCD.
– All significant errors are controlled and can be reduced with faster computers or better algorithms.
– The simple formulation enables a targeted computer architecture.
– Regular space-time description: easily mounted on a parallel computer (see the sketch below).
[Photo: the 0.4 Tflops QCDSP machine]
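How a regular space-time lattice maps onto a parallel machine can be made concrete with a small sketch. The following is illustrative only (not project code), with made-up lattice and partition sizes: each node receives an identical local sub-volume, which is why the problem mounts so naturally on a machine like QCDOC.

    #include <stdio.h>

    /* Illustrative only: split a global 4-D lattice evenly across a 4-D node grid. */
    int main(void) {
        int global[4] = {24, 24, 24, 64};  /* example global lattice (x, y, z, t) */
        int nodes[4]  = {4, 4, 4, 8};      /* example machine partition           */
        int local[4], sites = 1, i;

        for (i = 0; i < 4; i++) {
            if (global[i] % nodes[i] != 0) {
                printf("dimension %d does not divide evenly\n", i);
                return 1;
            }
            local[i] = global[i] / nodes[i];
            sites *= local[i];
        }
        printf("local volume %d x %d x %d x %d = %d sites per node\n",
               local[0], local[1], local[2], local[3], sites);
        return 0;
    }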

Project Goals (cont'd)
– Massively parallel machine capable of strong scaling: use many nodes on a small problem (see the estimate below).
  – Large inter-node bandwidth.
  – Small communications latency.
– $1 per sustained Mflops cost/performance.
– Low power, easily maintained, modular design.
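A rough estimate (my own, not from the slides) of why strong scaling puts a premium on bandwidth and latency: with a local volume of $\ell^4$ sites per node, the arithmetic per node scales with the volume while nearest-neighbor communication scales with the surface,

    $$\frac{\text{communication}}{\text{computation}} \;\sim\; \frac{8\,\ell^{3}}{\ell^{4}} \;=\; \frac{8}{\ell},$$

so spreading a fixed problem over more nodes shrinks $\ell$, raises the communication fraction, and turns the many small surface messages into a latency as well as a bandwidth problem.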

QCDOC Collaboration (people)
– Columbia (DOE): Norman Christ, Saul Cohen*, Calin Cristian*, Zhihua Dong, Changhoan Kim*, Ludmila Levkova*, Sam Li*, Xiaodong Liao*, Guofeng Liu*, Meifeng Lin*, Robert Mawhinney, Azusa Yamaguchi
– BNL (SciDAC): Robert Bennett, Chulwoo Jung, Konstantin Petrov, David Stampf
– UKQCD (PPARC): Peter Boyle, Mike Clark, Balint Joo
– RBRC (RIKEN): Shigemi Ohta, Tilo Wettig
– IBM: Dong Chen, Alan Gara; design groups in Yorktown Heights, NY, Rochester, MN, and Raleigh, NC
* CU graduate student

QCDOC Collaboration (money)

Institution / funding source   Design and prototyping   Large installations
Columbia / DOE                 $500K                    $1M (UKQCD)
RBRC / RIKEN                   $400K                    $5M
UKQCD / PPARC                  $1M                      $5.2M
BNL / DOE                      --                       $5.1M

Personnel and site preparation costs are not included.

QCDOC Architecture
– IBM-fabricated, single-chip node [50 million transistors, 5 Watt, 1.3 cm x 1.3 cm].
– Processor:
  – PowerPC 32-bit RISC core.
  – 64-bit, 1 Gflops floating point unit.
– Memory per node: 4 Mbyte on-chip, plus an external DIMM of up to 2 Gbyte.
– Communications network (see the folding sketch below):
  – 6-dimensional, supporting lower-dimensional partitions.
  – Global sum/broadcast functionality.
  – Multiple DMA engines; minimal processor overhead.
– Ethernet connection to each node: booting, I/O, host control.
– ~7-8 Watt and 15 in^3 per node.
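To make the "6-dimensional network supporting lower-dimensional partitions" concrete, here is an illustrative sketch of the idea of folding machine dimensions; the real remapping is performed by the QCDOC run-time system, and the grid sizes and coordinates below are made up.

    #include <stdio.h>

    /* Sketch: fold two of the six machine dimensions into two of the
     * remaining four, so a 6-D machine torus presents a 4-D grid with the
     * same total number of nodes to the application.  (The production
     * mapping, including preservation of torus neighbor links, is handled
     * by the QCDOC system software, not by this toy code.) */
    int main(void) {
        int mach_size[6]  = {4, 4, 4, 4, 2, 2};   /* made-up 6-D machine grid     */
        int mach_coord[6] = {3, 1, 0, 2, 1, 0};   /* made-up coordinate of a node */
        int app_size[4], app_coord[4];

        /* Fold machine dimensions 4 and 5 into application dimensions 0 and 1. */
        app_size[0]  = mach_size[0] * mach_size[4];
        app_coord[0] = mach_coord[0] + mach_size[0] * mach_coord[4];
        app_size[1]  = mach_size[1] * mach_size[5];
        app_coord[1] = mach_coord[1] + mach_size[1] * mach_coord[5];
        app_size[2]  = mach_size[2];  app_coord[2] = mach_coord[2];
        app_size[3]  = mach_size[3];  app_coord[3] = mach_coord[3];

        printf("application torus %d x %d x %d x %d, this node at (%d, %d, %d, %d)\n",
               app_size[0], app_size[1], app_size[2], app_size[3],
               app_coord[0], app_coord[1], app_coord[2], app_coord[3]);
        return 0;
    }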

Software Environment
– Lean kernel on each node:
  – Protected kernel mode and address space.
  – RPC support for host access.
  – NFS access to NAS disks (/pfs).
  – Normal Unix services, including stdout and stderr.
– Threaded host kernel:
  – Efficient performance on an 8-processor SMP host.
  – User shell (qsh) with extended commands.
  – Host file system (/host).
  – Simple remapping of the 6-D machine to a (6-n)-D torus.
– Programming environment:
  – POSIX-compatible, open-source libc.
  – gcc and xlc compilers.
– SciDAC standards:
  – Level-1 QMP message-passing protocol (example below).
  – Level-2 parallelized linear algebra, QDP and QDP++.
  – Efficient level-3 inverters: Wilson/clover, domain wall fermions, ASQTAD, p4 (underway).
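For concreteness, here is a minimal sketch of a level-1 QMP program that initializes message passing and performs a global sum. It is my own example, not project code; the calls follow the published SciDAC QMP C interface, but the exact signatures should be checked against the installed qmp.h.

    #include <stdio.h>
    #include <qmp.h>

    int main(int argc, char *argv[]) {
        QMP_thread_level_t provided;

        /* Bring up the level-1 message-passing layer. */
        if (QMP_init_msg_passing(&argc, &argv, QMP_THREAD_SINGLE, &provided)
                != QMP_SUCCESS) {
            fprintf(stderr, "QMP initialization failed\n");
            return 1;
        }

        /* Each node contributes its node number; the global reduction is the
           kind of operation that maps onto QCDOC's hardware global sum. */
        double value = (double)QMP_get_node_number();
        QMP_sum_double(&value);

        if (QMP_is_primary_node())
            printf("sum over %d nodes = %g\n", QMP_get_number_of_nodes(), value);

        QMP_finalize_msg_passing();
        return 0;
    }

The same global reduction appears in every conjugate-gradient iteration, which is one reason the level-3 inverters benefit from the hardware global-sum network.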

Network Architecture
[Diagram legend: red boxes are nodes, blue boxes are mother boards, red lines are communications links, green lines are Ethernet connections, green boxes are Ethernet switches, and pink boxes are host CPU processors.]

QCDOC Chip: 50 million transistors, 0.18 micron, 1.3 x 1.3 cm die, 5 Watt

Daughter board (2 nodes)

Mother board (64 nodes)

512-Node Machine

UKQCD Machine (12,288 nodes / 10 Tflops)

Brookhaven Installation: RBRC (right) and DOE (left) 12K-node QCDOC machines

Project Status
– UKQCD: 13,312 nodes, $5.2M, 3-5 Tflops sustained.
  – Installed in Edinburgh 12/04.
  – Running production at 400 MHz with 100% reproducibility.
– RBRC: 12,288 nodes, $5M, 3-5 Tflops sustained.
  – Installed at BNL 2/05.
  – 1/3 in production with 100% reproducibility.
  – 1/3 performing physics tests.
  – 1/3 speed sorting 420.
– DOE: 12,288 nodes, $5.1M, 3-5 Tflops sustained.
  – Installed at BNL 4/05.
  – 1/2 performing physics tests.
  – 1/2 being debugged.
– Price/performance of ~$1/Mflops (see the estimate below).
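Taking the slide's own figures at face value, the quoted price/performance follows directly; for example, for the UKQCD machine,

    $$\frac{\$5.2\text{M}}{5\times 10^{6}\ \text{Mflops (sustained)}} \approx \$1.0/\text{Mflops}, \qquad \frac{\$5.2\text{M}}{3\times 10^{6}\ \text{Mflops}} \approx \$1.7/\text{Mflops},$$

which brackets the ~$1/Mflops target depending on where in the 3-5 Tflops sustained range a given application lands.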

Final Bring-up Issues
– FPU errors:
  – Lowest two bits infrequently incorrect (not seen at 400 MHz).
  – Remove slow nodes at 432 MHz and run at 400 MHz.
– Serial communication errors:
  – Induced by Ethernet activity.
  – 0.25/month at 400 MHz per 1K nodes.
  – Further reduced by PLL tuning.
  – Protected by hardware checksums with no performance loss.
– Parallel disk system:
  – 24 Tbyte RAID servers.
  – 512 nodes achieve 12 Mbytes/sec.
  – Installed 05/05?
– Larger machine partitions:
  – Three 4096-node partitions assembled.
  – Expect to run as node machines.
– Spares:
  – 1% non-functioning daughter boards.
  – 1.5% non-functioning mother boards.
  – ~18 mother boards for small jobs and code development.

Application Performance (double precision)

1024-node machine:

Fermion action   Local volume   Dirac performance   CG performance
Wilson                                              32%
Wilson           4^4            44%                 38%
Clover           4^4            54%                 47.5%
DWF                                                 42%
ASQTAD           4^4            42%                 40%

4096-node machine (UKQCD): DWF, 24^3 x 64, RHMC (local volume 6 x 6 x 6 x 2 x 8):
CG: 1.1 Tflops (34%); complete code: 0.95 Tflops (29%) (consistency check below).
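As a consistency check of the 4096-node figures (my own arithmetic, assuming the quoted percentages are taken relative to a double-precision peak of 0.8 Gflops per node at 400 MHz, i.e. two floating-point results per cycle):

    $$4096 \times 0.8\ \text{Gflops} \approx 3.28\ \text{Tflops peak}, \qquad \frac{1.1}{3.28} \approx 34\%, \qquad \frac{0.95}{3.28} \approx 29\%,$$

in agreement with the sustained fractions quoted on the slide.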

QCDOC Summary
– Present DOE QCDOC machine use:
  – Alpha users developing code on 1-mother-board machines.
  – MILC (staggered 2+1 flavor) using a 1K-node machine.
  – JLab (DWF 2+1 flavor) using a 1K-node machine.
  – RBC/BNL (QCD thermodynamics) using two 1K-node machines.
  – RBC (DWF) using a 1K-node machine.
– 4K-node machine being debugged for MILC use.
– Most of the machine in production by early June?

The Future: QCDOC++
– Reduced feature size and increased integration permit many nodes per chip (the multi-core trend).
– QCDOC → QCDOC++:
  – Clock speed (GHz): 0.4 → 1?
  – Integration (nodes/chip): 1 → 64?
  – Performance (Gflops): 0.4 → 2?
  – Inter-node comms (Gbyte/sec): 1 → 10?
  – On-chip memory (Mbytes/chip): 4 → 32?
– Target: $0.01/Mflops price/performance (1/100 x QCDOC).
– A 100x speed-up permits roughly a 2x increase in problem size per chip (see the estimate below).
– Design starts in 2006 (with off-project support).
– The target is ambitious, with risk.
– May provide a candidate production machine in 2009.
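One way to read the "100x speed-up, 2x problem size" estimate (my own gloss, using a standard lattice-QCD rule of thumb rather than anything stated on the slide): doubling the number of lattice sites in each of the four directions, for example by halving the lattice spacing, multiplies the simulation cost by roughly the sixth to seventh power of that factor,

    $$2^{6} = 64 \;\lesssim\; \text{cost ratio} \;\lesssim\; 2^{7} = 128,$$

so a hundredfold gain in sustained performance buys approximately a doubling of the linear problem size.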