Protein Explorer: A Petaflops Special-Purpose Computer System for Molecular Dynamics Simulations
David Gobaud
Computational Drug Discovery, Stanford University
7 March 2006

Outline
- Overview
- Background
- Delft Molecular Dynamics Processor
- GRAPE
- Protein Explorer Summary
- MDGRAPE-3 Chip
  - Force Calculation Pipeline
  - J-Particle Memory and Control Units
- System Architecture
- Software
- Cost
- Questions

Overview
Protein Explorer
- Petaflops special-purpose computer system for molecular dynamics simulations
- High-precision screening for drug design
- Large-scale simulations of huge proteins/complexes
- PC cluster with special-purpose engines that perform the most time-consuming calculations
- A dedicated LSI, the MDGRAPE-3 chip, performs force calculations at 165 Gflops or higher
- ETA: 2006

Background
- PCs are universal machines
  - Serve various applications
  - Hardware can be designed independently of the applications
- Obstacles to high performance
  - Memory bandwidth bottleneck
  - Heat dissipation problem
- These can be overcome by developing specialized architectures

Delft Molecular Dynamics Processor (DMDP)
- Pioneered high-performance special-purpose systems
- Was not able to achieve effective cost-performance
  - Demanded too much time and money during development
  - Speed of development is a crucial factor in cost-performance because electronic device technology continues to advance rapidly
- Almost all calculations were performed by the DMDP, making the hardware very complex

GRAPE (GRAvity PipE)
- One of the most successful attempts to develop high-performance special-purpose systems
- Specialized for simulations of classical particles
- Most time is spent calculating long-range forces (gravitational, Coulomb, and van der Waals), so the special hardware performs only these calculations
- The hardware is therefore very simple and cost-effective

GRAPE (GRAvity PipE)
- In 1995, the first machine to break the teraflops barrier in nominal peak performance
- Since 2001 the performance leader has been the Molecular Dynamics Machine at RIKEN, at 78 Tflops
- In 2002 the University of Tokyo completed the 64-Tflop GRAPE-6
- Protein Explorer was launched based on the 2002 University of Tokyo success

Protein Explorer Summary
- Host PC cluster with special-purpose boards attached
- The boards calculate only non-bonded forces
- Very simple hardware and software
  - No detailed knowledge of the hardware is needed to write programs
- Communication time between the host and the boards is proportional to the number of particles
- Calculation time is proportional to
  - N^2 for direct summation of long-range forces
  - N*Nc for short-range forces, where Nc is the average number of particles within the cutoff radius
- Communication requirement is about 0.25 bytes per 1000 operations (see the sketch below)
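
As a rough illustration of the scaling claims above, the minimal Python sketch below compares the O(N^2) direct-summation operation count, the O(N*Nc) cutoff count, and the O(N) host-board traffic. The values Nc = 100, ~30 operations per pairwise interaction, and 40 bytes per particle record are assumptions made only for illustration.

```python
# Rough cost model for the scaling claims above (illustrative only).
BYTES_PER_PARTICLE = 40        # assumed size of one particle record sent to the board

def cost_model(n_particles, nc_avg=100, ops_per_pair=30):
    direct_ops = n_particles * n_particles * ops_per_pair    # N^2 direct summation
    cutoff_ops = n_particles * nc_avg * ops_per_pair         # N * Nc with a cutoff radius
    traffic = n_particles * BYTES_PER_PARTICLE               # host<->board traffic ~ O(N)
    return direct_ops, cutoff_ops, traffic

for n in (10_000, 100_000, 1_000_000):
    direct_ops, cutoff_ops, traffic = cost_model(n)
    ratio = traffic / (direct_ops / 1000)                    # bytes per 1000 operations
    print(f"N={n:>9,}  direct ops={direct_ops:.1e}  cutoff ops={cutoff_ops:.1e}  "
          f"traffic={traffic/1e6:.2f} MB  {ratio:.3f} bytes/1000 ops")
```

Because the operation count grows as N^2 while the traffic grows only as N, the bytes-per-operation ratio shrinks as the system gets larger, so the host-board link does not become the bottleneck.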

MDGRAPE-3 Chip - Force Calculation Pipeline
- 3 subtractor units
- 6 adder units
- 8 multiplier units
- 1 function-evaluation unit
- Performs ~33 equivalent operations per cycle when calculating the Coulomb force
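
To make the pipeline's structure concrete, here is a minimal Python sketch of the arithmetic performed for one i-j Coulomb interaction. The mapping of statements to subtractor, adder, multiplier, and function-evaluation units is indicative only, and the positions and charges are made up.

```python
def coulomb_pair_force(xi, xj, qj):
    """Force contribution of one j-particle on particle i (prefactor omitted),
    written to mirror the pipeline stages listed above."""
    # Subtractor units: displacement vector
    dx, dy, dz = xj[0] - xi[0], xj[1] - xi[1], xj[2] - xi[2]
    # Multiplier and adder units: squared distance
    r2 = dx * dx + dy * dy + dz * dz
    # Function-evaluation unit: g(r2) = r2**-1.5 for the Coulomb force
    g = r2 ** -1.5
    # Multiplier units: scale the displacement by the charge and g(r2)
    s = qj * g
    return s * dx, s * dy, s * dz

# The pipeline accumulates this contribution over all j-particles for each i-particle.
xi = (0.0, 0.0, 0.0)
fx = fy = fz = 0.0
for xj, qj in [((1.0, 0.0, 0.0), 1.0), ((0.0, 2.0, 0.0), -1.0)]:
    dfx, dfy, dfz = coulomb_pair_force(xi, xj, qj)
    fx, fy, fz = fx + dfx, fy + dfy, fz + dfz
print(fx, fy, fz)
```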

MDGRAPE-3 Chip - Force Calculation Pipeline

- Most operations are done in 32-bit single-precision floating-point format
- Force accumulation uses an 80-bit fixed-point format
  - Can be converted to 64-bit double-precision floating point
- Coordinates are stored in a 40-bit fixed-point format
  - Makes implementation of periodic boundary conditions easy
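
One way to see why fixed-point coordinates make periodic boundaries easy: if positions are stored as fixed-point fractions of the box length, two's-complement wraparound of a coordinate difference yields the minimum-image displacement automatically. A minimal Python sketch, using a 40-bit word as on the chip and an arbitrary box size of my own choosing:

```python
BITS = 40                      # coordinate word width from the slide
MOD = 1 << BITS

def encode(x, box):
    """Store a coordinate as a 40-bit fixed-point fraction of the box length."""
    return int((x % box) / box * MOD) & (MOD - 1)

def minimum_image_delta(xi_fixed, xj_fixed, box):
    """Fixed-point subtraction with wraparound; reinterpreting the result as a
    signed 40-bit value gives the minimum-image displacement directly."""
    d = (xj_fixed - xi_fixed) & (MOD - 1)
    if d >= MOD // 2:          # two's-complement sign fold
        d -= MOD
    return d / MOD * box

box = 100.0
xi, xj = encode(1.0, box), encode(99.0, box)
print(minimum_image_delta(xi, xj, box))   # ~ -2.0 (minimum image), not +98.0
```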

MDGRAPE-3 Chip - Force Calculation Pipeline
Function Evaluator
- The most important part of the pipeline
- Allows calculation of an arbitrary smooth function
- Has a memory unit containing a table of polynomial coefficients and exponents, and a hardwired pipeline for fourth-order polynomial evaluation
- Interpolates an arbitrary smooth function g(x) using segmented fourth-order polynomials evaluated by Horner's method
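
A minimal Python sketch of the same idea in software: the function's domain is split into segments, a fourth-order polynomial is fitted per segment, and each lookup evaluates its segment's polynomial by Horner's method. The segment count, fitting procedure, and the example function 1/sqrt(x) are illustrative choices, not the chip's actual table contents.

```python
import numpy as np

def build_table(g, x_min, x_max, n_segments=64, order=4):
    """Fit one fourth-order polynomial per segment of [x_min, x_max)."""
    edges = np.linspace(x_min, x_max, n_segments + 1)
    table = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        xs = np.linspace(lo, hi, 16)
        coeffs = np.polyfit(xs - lo, g(xs), order)   # highest power first
        table.append((lo, coeffs))
    return table, (x_max - x_min) / n_segments

def evaluate(table, seg_width, x_min, x):
    """Look up the segment for x and evaluate its polynomial by Horner's method."""
    idx = int((x - x_min) / seg_width)
    lo, coeffs = table[idx]
    dx = x - lo
    y = 0.0
    for c in coeffs:              # Horner: ((((c4*dx + c3)*dx + c2)*dx + c1)*dx + c0)
        y = y * dx + c
    return y

table, width = build_table(lambda x: 1.0 / np.sqrt(x), 0.5, 4.0)
print(evaluate(table, width, 0.5, 2.0), 1.0 / np.sqrt(2.0))   # approximation vs exact
```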

MDGRAPE-3 Chip - J-Particle Memory and Control Units
- 20 force calculation pipelines
- j-Particle Memory Unit
  - The chip's "main memory": holds 32,768 bodies
  - 6.6 Mbits, built from static RAM
- Cell-Index Controller
  - Controls the j-particle memory by generating its addresses (see the sketch below)
- Force Summation Unit
- Master Controller
  - Manages timings and the inputs/outputs of the chip
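
For intuition about the cell-index controller's job, here is a minimal software analogue in Python: particles are binned into cells of roughly the cutoff size, and for a given cell the controller's role corresponds to generating the addresses (indices) of the j-particles in that cell and its 26 neighbours, so only nearby particles are streamed out of j-particle memory. The cell size, box size, and data layout here are assumptions made for illustration.

```python
import itertools

def build_cell_index(positions, box, cell_size):
    """Bin particle indices into cubic cells of side >= the cutoff radius."""
    n_cells = int(box // cell_size)
    cells = {}
    for idx, (x, y, z) in enumerate(positions):
        key = (int(x / cell_size) % n_cells,
               int(y / cell_size) % n_cells,
               int(z / cell_size) % n_cells)
        cells.setdefault(key, []).append(idx)
    return cells, n_cells

def j_particle_addresses(cell, cells, n_cells):
    """Addresses of all j-particles in `cell` and its 26 neighbours
    (with periodic wraparound) -- what the cell-index controller generates."""
    cx, cy, cz = cell
    addresses = []
    for dx, dy, dz in itertools.product((-1, 0, 1), repeat=3):
        key = ((cx + dx) % n_cells, (cy + dy) % n_cells, (cz + dz) % n_cells)
        addresses.extend(cells.get(key, []))
    return addresses

positions = [(1.0, 1.0, 1.0), (2.0, 1.5, 0.5), (9.0, 9.0, 9.0)]
cells, n_cells = build_cell_index(positions, box=10.0, cell_size=2.5)
print(j_particle_addresses((0, 0, 0), cells, n_cells))   # indices of nearby j-particles
```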

MDGRAPE-3 Chip
- 2 virtual pipelines per physical pipeline
- Physical bandwidth of the j-particle memory unit is 2.5 Gbytes/s, but the effective (virtual) bandwidth will reach 100 Gbytes/s, since each fetched j-particle is reused across the 20 physical × 2 virtual = 40 pipelines
- 340 arithmetic units and 20 function-evaluator units working simultaneously
- 165 Gflops at 250 MHz (20 pipelines × ~33 operations/cycle × 250 MHz)

MDGRAPE-3 Chip

- Chip fabricated by Hitachi
- 6 million gates, 10 Mbits of memory
- Chip size is ~220 mm^2
- Dissipates ~20 watts at a core voltage of +1.2 V
- 0.12 W/Gflop, much better than a 3-GHz Pentium 4 at ~14 W/Gflop

System Architecture
- Host PC cluster using Itanium or Opteron CPUs
- 256 nodes, 512 CPUs in total
- Peak performance per node is 3.96 Tflops; the total reaches a petaflop
- Requires a 10-Gbit/s network: InfiniBand, 10-Gbit Ethernet, or a future Myrinet
- Network topology will be a 2D hyper-crossbar
- Each node has 24 MDGRAPE-3 chips, connected via two PCI-X buses at 133 MHz
- A 19-inch rack houses 6 nodes; 43 racks in total
- Power dissipation ~150 kW; footprint ~100 m^2
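
A quick back-of-the-envelope check of the totals quoted above, using only figures from these slides:

```python
# Back-of-the-envelope check of the system totals, using only slide figures.
chip_gflops = 165          # MDGRAPE-3 peak at 250 MHz
chips_per_node = 24
nodes = 256

node_tflops = chip_gflops * chips_per_node / 1000    # 3.96 Tflops per node
system_pflops = node_tflops * nodes / 1000           # ~1.01 Pflops for the full system
watts_per_gflop = 20 / chip_gflops                    # ~0.12 W/Gflop per chip

print(node_tflops, system_pflops, round(watts_per_gflop, 3))
```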

System Architecture

Protein Explorer Board

Software
- Very easy to create programs for
- All computational abilities are provided in a library
- No special knowledge of the device is needed
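
The slides do not show the library's actual API, so the sketch below is only a hypothetical illustration of the programming model they describe: the host passes particle data to a single library call and gets forces back, without touching board internals. The function name is invented, and the stand-in stub simply performs the O(N^2) Coulomb sum on the CPU (sign and physical prefactor omitted).

```python
import numpy as np

# Hypothetical stand-in for the board library described on the slide: the host
# hands over particle data and receives forces, with no knowledge of the
# hardware. On the real system this call would be backed by the MDGRAPE-3 boards;
# here it is just an O(N^2) CPU stub.
def calculate_forces(positions, charges):
    n = len(positions)
    forces = np.zeros_like(positions)
    for i in range(n):
        d = positions - positions[i]                 # displacements to all j-particles
        r2 = (d * d).sum(axis=1)
        r2[i] = np.inf                               # skip the self-interaction
        forces[i] = (charges[:, None] * d / r2[:, None] ** 1.5).sum(axis=0)
    return forces

# Host-side code looks like ordinary NumPy plus one library call.
positions = np.random.rand(1000, 3)
charges = np.random.choice([-1.0, 1.0], size=1000)
forces = calculate_forces(positions, charges)
print(forces.shape)
```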

Cost
- $20 million including labor
- Less than $10/Gflop
- At least ten times better than general-purpose computers, even when compared with the relatively cheap BlueGene/L (~$140/Gflop)

Questions
- What is Myrinet?
- What is a two-dimensional hyper-crossbar network topology?
- How does this compare to massive distributed computing? Advantages? Disadvantages?