
SS-FQ02-W: 1  Stanford Streaming Supercomputer (SSS) Fall Quarter 2002 Wrapup Meeting

Bill Dally, Computer Systems Laboratory, Stanford University
December 10, 2002

SS-FQ02-W: 2  Overview

Where we are today
– First year goal was met: demonstrated feasibility on a single node
– Feedback from the site visit team was very positive
– Potential for a big impact on scientific computing
– But still much to do!
Key FY03 goals
– Get long-term software infrastructure in place
  Select approach, implement baseline Brook to SSS compiler
– Multi-node versions that scale
  Language, compiler, simulator
– Tackle hard problems: 3-D, irregular neighborhoods / sparse matrix solve
  Language support, numerics support, evaluate on simulator
– Refine architecture
  Cluster organization, aspect ratio, register organization, memory organization
– Industrial partner
  Start serious discussions, outreach to build support, close partner in '04

SS-FQ02-W: 3  But first, let's review our overall goal

Exploit the capabilities of VLSI to realize cost-effective scientific computing.

SS-FQ02-W: 4  The big picture

VLSI technology enables us to put TeraOPS on a chip
– Conventional general-purpose architectures cannot exploit this
– The problem is bandwidth
Streams expose locality and concurrency
– Perform operations in record order, not operation order as with vectors (see the sketch after this slide)
– Enables compiler optimization at a larger scale than scalar processing
A stream architecture achieves high arithmetic intensity
– Intensity = arithmetic rate / bandwidth
– Bandwidth hierarchy, compound stream operations
A Streaming Supercomputer is feasible
– 100 GFLOPS (64-bit) on a chip, 1 TFLOPS single-board computer, PFLOPS systems
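A minimal C sketch of the record-order idea (not SSS or Brook code; the Record type and update kernels are invented for illustration). The vector-style version makes one pass over the data per operation, while the stream-style version applies all operations to each record while it is local, so the intermediate value never returns to memory; that is the sense in which arithmetic intensity (operations per word of memory traffic) rises.

```c
#include <stddef.h>

typedef struct { float x, v, f; } Record;   /* hypothetical stream record */

/* Operation order (vector style): each loop streams the whole array
 * through memory, so the updated velocity is written back and re-read. */
void update_vector_order(Record *r, size_t n, float dt) {
    for (size_t i = 0; i < n; i++) r[i].v += r[i].f * dt;   /* pass 1 */
    for (size_t i = 0; i < n; i++) r[i].x += r[i].v * dt;   /* pass 2 */
}

/* Record order (stream style): all operations on one record run while it
 * is local; the intermediate velocity stays in a register, so the same
 * arithmetic is done with less memory traffic. */
void update_record_order(Record *r, size_t n, float dt) {
    for (size_t i = 0; i < n; i++) {
        float v = r[i].v + r[i].f * dt;
        r[i].x += v * dt;
        r[i].v  = v;
    }
}
```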

SS-FQ02-W: 5  Review: What is the SSS Project About?

Exploit streams to give a 100x improvement in performance/cost for scientific applications vs. 'cluster' supercomputers
– From 100 GFLOPS PCs to TFLOPS single-board computers to PFLOPS supercomputers
Use a layered programming system to simplify development and tuning of applications
– Stream languages
– Streaming virtual machine
Demonstrated feasibility of streaming scientific computing in year 1
Refine architecture and programming system in year 2
– Demonstrate realistic applications (3D, irregular)
– Build a usable compiler
– Resolve architecture questions: aspect ratio, conditional execution, sparse clusters, register organization, memory system, etc.
Build a prototype and demonstrate CITS applications in years 3-6
– With industrial and government partners
– Broaden our base of support

SS-FQ02-W: 6  Software Infrastructure

Compiler
– Decide on flow from Brook -> SVM -> SSS
– Select base compiler: ORC, Gnu, SUIF, Tendra, others…
– "Spike" a simple program from Brook -> SSS (see the sketch after this slide)
– Optimizations
SVM Simulator
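As a rough illustration of what such a "spike" could exercise, the sketch below shows a trivial candidate program in plain C: a saxpy kernel wrapped in explicit block-sized stream loads and stores, mimicking the memory / SRF / cluster bandwidth hierarchy the compiler flow has to target. The SRF_BLOCK size and the blocking structure are assumptions for illustration only, not the actual Brook, SVM, or SSS interfaces.

```c
#include <stddef.h>

/* Source level, shown as plain C: y = a*x + y over stream elements. */
static void saxpy_kernel(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++) y[i] = a * x[i] + y[i];
}

/* Lowered form: explicit stream loads/stores around the kernel, one
 * SRF-sized block at a time (block size is a hypothetical placeholder). */
void saxpy_spike(float a, const float *x_mem, float *y_mem, size_t n) {
    enum { SRF_BLOCK = 1024 };
    float x_srf[SRF_BLOCK], y_srf[SRF_BLOCK];
    for (size_t base = 0; base < n; base += SRF_BLOCK) {
        size_t len = (n - base < SRF_BLOCK) ? (n - base) : SRF_BLOCK;
        for (size_t i = 0; i < len; i++) {           /* stream load  */
            x_srf[i] = x_mem[base + i];
            y_srf[i] = y_mem[base + i];
        }
        saxpy_kernel(a, x_srf, y_srf, len);          /* kernel       */
        for (size_t i = 0; i < len; i++)             /* stream store */
            y_mem[base + i] = y_srf[i];
    }
}
```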

SS-FQ02-W: 7  3-D Applications

StreamFLO
StreamFEM
StreamMD / Gromacs

SS-FQ02-W: 8  Irregular Grids

Need an application
Brook support for variable degree
Architecture / run-time support
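One plausible representation for variable-degree neighborhoods is a compressed (CSR-style) layout, sketched below in C. The IrregularGrid type and its field names are hypothetical, not part of Brook or the SSS run-time; the point is only that the degree of each node falls out of the offset array rather than a fixed record width.

```c
#include <stddef.h>

typedef struct {
    size_t  n_nodes;      /* number of grid nodes or elements            */
    size_t *row_start;    /* n_nodes + 1 offsets into the neighbor array */
    int    *neighbors;    /* concatenated neighbor indices, all degrees  */
} IrregularGrid;

/* Gather-style traversal: degree of node i = row_start[i+1] - row_start[i]. */
float sum_neighbor_values(const IrregularGrid *g, const float *value, size_t i) {
    float acc = 0.0f;
    for (size_t k = g->row_start[i]; k < g->row_start[i + 1]; k++)
        acc += value[g->neighbors[k]];
    return acc;
}
```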

SS-FQ02-W: 9  Multi-Node Execution

Brook support
Manual partitioning for first step
Simple application on SVM simulator
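A sketch of what "manual partitioning" could mean as a first step: a block decomposition of N stream elements over P nodes computed by hand rather than by the compiler or run-time. The helper below is illustrative only; boundary (halo) exchange and load balancing are left to the application.

```c
#include <stddef.h>

typedef struct { size_t begin, end; } Block;   /* half-open range owned by one node */

/* Even block partition: the first (n_elements % n_nodes) nodes get one extra element. */
Block partition_block(size_t n_elements, size_t n_nodes, size_t node_id) {
    size_t base = n_elements / n_nodes;
    size_t rem  = n_elements % n_nodes;
    Block b;
    b.begin = node_id * base + (node_id < rem ? node_id : rem);
    b.end   = b.begin + base + (node_id < rem ? 1 : 0);
    return b;
}
```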

SS-FQ02-W: 10  Industrial Partner

Candidates
– Cray, IBM, Sun, HP, SGI, Intel
Initial discussion
– Present SSS project and results to date
– Discuss collaboration models
– Identify next steps

SS-FQ02-W: 11  Outreach

National Labs
– Los Alamos
– Livermore
– Sandia
Other Government
– NASA
– DARPA
– DoD (Charlie Holland)
– AFOSR
User communities

SS-FQ02-W: 12  Software Fall 02 Goals

Brook
– Multi-node issues: synchronization primitives, data partitioning
– Variable length records
SVM
– Multi-node simulator
– Performance numbers for 3 apps
Compilation
– Pick new infrastructure & design compiler (Reservoir)
– Generate SVM code from Brook (StreamC to SVM)
– SVM to {SMP, graphics, SSS} (SVM is SMP)
Run-time (software services)
– Identify issues
Issues
– Variable length records? With stencils?

SS-FQ02-W: 13  Software Win 02 Goals

Brook
– Carefully define the semantics of the operators
– Work on the "views of memory" abstraction
– Support for partitioning, shared memory, naming, fitting into the stream abstraction
– Support for irregular neighborhoods
– Multithreaded version (Christos)
– Concrete winter goals [Ian/Frank]
  Review of the language [Pat]
  Partitioning (UPC)
  Multi-node / multi-threaded version
  Irregular support, with an application
  PPoPP paper
  MD on BRT

SS-FQ02-W: 14  Software Win 02 Goals

SVM
– Finish prototype single-node implementation [Done]
– Compiler issue
– Implement multi-node version with a multi-node app
  Start with one that runs on one processor [Francois]
  Multithreaded on SMP, on SGI [+]
  Cluster version [++]
– SVM to simulator path [Mattan] – not an intermediate between Brook and SSS

SS-FQ02-W: 15  Software Win 02 Goals (3 of 3)

Start regular meetings
Compiler
– Decide on flow from Brook -> SVM -> SSS [Mattan]
  Requirements
– Select base compiler [Jayanth]
  ORC, Gnu, SUIF, Tendra, others…
– "Spike" a simple program from Brook -> SSS [Mattan/Jayanth ++]
– Brook to Nvidia
– Optimizations [Spring]
Run time
– Write a white paper

SS-FQ02-W: 16  Application Fall 02 Goals

StreamMD
– Migrate to Gromacs
StreamFLO
– Complete
– 3D
StreamFEM
– 3D
– Sparse LA
Scalability – multiple nodes
Look at Sierra, Purple benchmarks: ppm, sweep3D

SS-FQ02-W: 17  Application Win 02 Goals

StreamFLO [Fatica]
– Partitioned version; scalable
– Convert to 3D
StreamFEM [Barth]
– Partitioned version; scalable
– Convert to 3D
– Sparse LA
StreamMD [Eric/student]
– Migrate to GROMACS [Vijay Pande/Michael Levitt groups]
– Redo inner (force) and outer (neighbor) loops (see the sketch after this slide)
– Partitioned version; scalable
– Finish port to NV30: build cluster and
Model applications [Ron/Frank]
– Model PDEs with sparse matrix solves
An irregular application [Ron/Frank]
Look at Sierra, Purple benchmarks: ppm, sweep3D [delay]
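For reference, the inner (force) and outer (neighbor) loop structure mentioned for StreamMD typically looks like the generic neighbor-list sketch below. This is not GROMACS or StreamMD code; force_pair is a placeholder for the real nonbonded interaction, and the CSR-style neighbor arrays are assumed for illustration.

```c
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;

/* Placeholder pair interaction: returns the displacement, not a physical force. */
static Vec3 force_pair(Vec3 ri, Vec3 rj) {
    Vec3 f = { ri.x - rj.x, ri.y - rj.y, ri.z - rj.z };
    return f;
}

/* Outer loop over atoms, inner loop over each atom's neighbor list. */
void compute_forces(const Vec3 *pos, Vec3 *force, size_t n_atoms,
                    const size_t *nbr_start, const size_t *nbr_list) {
    for (size_t i = 0; i < n_atoms; i++) {                        /* outer: per atom   */
        Vec3 fi = { 0.0f, 0.0f, 0.0f };
        for (size_t k = nbr_start[i]; k < nbr_start[i + 1]; k++) { /* inner: per pair  */
            Vec3 f = force_pair(pos[i], pos[nbr_list[k]]);
            fi.x += f.x; fi.y += f.y; fi.z += f.z;
        }
        force[i] = fi;
    }
}
```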

SS-FQ02-W: 18  Architecture Fall 02 Goals

Simulator
– Multi-node working
– Indexable SRF
– Scalar processor
Point Studies
– Conditionals
– Aspect ratio
– Indexable SRF
– Add & Store (remote ops in general; see the sketch after this slide)
– Iterative operations & extended precision
– Network
Spec
– Flesh out I/O
App studies
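The "Add & Store" point study concerns remote operations; a common example of this class is scatter-add, where each stream element is added into an indexed (possibly remote) location rather than overwriting it. The sequential C sketch below shows only the intended semantics, not the SSS mechanism; on the machine this would be a remote read-modify-write.

```c
#include <stddef.h>

/* Scatter-add semantics: accumulate values into indexed destinations. */
void scatter_add(float *dest, const size_t *index, const float *value, size_t n) {
    for (size_t i = 0; i < n; i++)
        dest[index[i]] += value[i];   /* read-modify-write at a (possibly remote) address */
}
```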

SS-FQ02-W: 19  Architecture Win 02 Goals

Single-Node Simulator [Jung-Ho, Knight]
– 64-bit support, MULADD, scalar processor
Multi-Node Simulator [Jung-Ho, Abhishek]
– Network model
– Multi-node mechanisms
Point Studies
– Aspect ratio: SSE vs. VLIW
– Conditional execution [Mattan/Ujval]
– Sparse clusters
– SRF organization [Nuwan]
– Cache alternatives [Jung Ho]
– Add and store study [Jung Ho]
– I/O
– Iterative operations [Francois]

SS-FQ02-W: 20  Special Win 02 Goals

Fix website [Pat]
– Public and private websites
Name that computer
– Mississippi
– Axios
– Submit names to Mattan
– Bill, Pat, Bill to choose
Project Party

SS-FQ02-W: 21  Winter Quarter Meeting Schedule

1/7    Ron                 Anything
1/14   Francois/Mattan     What is SVM
1/21   Fatica              3D Flo
1/28   Pat                 RTSL partitioning
2/4    Bill Carlson [Pat]  UPC
2/11   Francois/Ian        Discussion of targets SSS/CG/MPI
2/18   Tim B.              Irregular grid
2/25   Mattan              Compilation Infrastructure
3/4    Jung Ho             Add & Store
3/11   Bill                Wrapup

SS-FQ02-W: 22  Papers

Arch
– Indexable SRFs (Nuwan)
– Streaming Supercomputer Overview (Tim K.)
– Streaming on conventional CPUs (Mattan)
– Conditionals (Ujval)
– Remote Ops (Jung Ho)
– Aspect Ratio (?)
– Data parallel (SSE) vs. ILP (VLIW)
Software
– Design of Brook (Ian)
– Data parallel programming on graphics HW (Pat)
– Brook to CG Compiler
Apps
– Gromacs
– StreamFEM (Tim²)
Overview (Bill and Pat)