Stanford Streaming Supercomputer (SSS)
Winter Quarter 2002-2003 Wrapup Meeting
Bill Dally, Computer Systems Laboratory, Stanford University
March 11, 2003

Slide 2: Year 2 Overview
Where we are today
– First-year goal was met: demonstrated feasibility on a single node
– Feedback from the site visit team was very positive
– Potential for a big impact on scientific computing
– But still much to do!
Key FY03 goals
– Get long-term software infrastructure in place: select an approach, implement a baseline Brook-to-SSS compiler
– Multi-node versions that scale: language, compiler, simulator
– Tackle hard problems: 3-D, irregular neighborhoods / sparse matrix solve (language support, numerics support, evaluation on the simulator)
– Refine the architecture: cluster organization, aspect ratio, register organization, memory organization
– Industrial partner: start serious discussions, outreach to build support, close a partner in '04

Slide 3: Some Concerns
We're doing a great job – but...
Losing a bit of focus and momentum
– Tooling on the detail
– Need to take a step back and reexamine the big picture
Need to raise our outside profile
– Publish: overview paper, Brook paper
– Generate some more convincing evidence of advantages: need a control for bandwidth measures
– Update the web page
– Visit the labs

Slide 4: Let's Review Our Overall Goal
Exploit the capabilities of VLSI to realize cost-effective scientific computing.

Slide 5: Review – What is the SSS Project About?
Exploit streams to give a 100x improvement in performance/cost for scientific applications vs. 'cluster' supercomputers
– From 100 GFLOPS PCs, to TFLOPS single-board computers, to PFLOPS supercomputers
Use a layered programming system to simplify development and tuning of applications
– Stream languages (see the Brook sketch after this slide)
– Streaming virtual machine
Demonstrated the feasibility of streaming scientific computing in year 1
Refine the architecture and programming system in year 2
– Demonstrate realistic applications (3D, irregular)
– Build a usable compiler
– Resolve architecture questions: aspect ratio, conditional execution, sparse clusters, register organization, memory system, etc.
Build a prototype and demonstrate CITS applications in years 3-6
– With industrial and government partners
– Broaden our base of support
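To make the "stream languages" layer concrete, here is a minimal Brook-style sketch of an elementwise kernel. The kernel name, stream sizes, and data movement are illustrative only, not taken from these slides.

// Minimal Brook sketch: an elementwise SAXPY kernel over 1-D streams.
// All names and sizes here are illustrative.
kernel void saxpy(float a, float x<>, float y<>, out float result<>) {
  // The body runs once per stream element; the compiler and runtime
  // decide how elements map onto clusters and the stream register file.
  result = a * x + y;
}

void run(float a, float *xdata, float *ydata, float *rdata) {
  float x<1024>, y<1024>, r<1024>;  // streams of 1024 floats
  streamRead(x, xdata);             // copy memory -> stream
  streamRead(y, ydata);
  saxpy(a, x, y, r);                // implicit loop over all elements
  streamWrite(r, rdata);            // copy stream -> memory
}

The intent of the layering described above is that source like this stays unchanged whether the back end targets the SVM, a GPU, or SSS hardware.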

Slide 6: Industrial Partner Update
Candidates
– Cray, IBM, Sun, HP, SGI, Intel
Initial discussion
– Present the SSS project and results to date
– Discuss collaboration models
– Identify next steps
Met with Cray, Sun, and SGI
– They listened politely, but little traction
– Need more convincing evidence
– Need to address the programming issue: we have to provide a path for legacy codes

Slide 7: Outreach
National labs
– Los Alamos
– Livermore
– Sandia
Other government
– NASA
– DARPA
– DoD (Charlie Holland)
– AFOSR
User communities

Slide 8: Software Winter 02 Goals – Brook
– Carefully define the semantics of the operators: no progress
– Work on the "views of memory" abstraction: API proposed; will write up for the next SW meeting
– Support for partitioning, shared memory, naming, fitting into the stream abstraction: adopting UPC (see the sketch after this slide); will write up for the next SW meeting
– Support for irregular neighborhoods: failed to find an application
– Multithreaded version (Christos): have a simple model for multi-node, written up
– (NEW) Preliminary Brooktran spec
– Concrete winter goals [Ian/Frank]
  – Review of the language [Pat]
  – Partitioning (UPC)
  – Multi-node / multi-threaded version
  – Irregular support, with an application
  – PPoPP paper
  – MD on BRT
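Since the partitioning direction is "adopting UPC," here is a minimal standard-UPC sketch of a block-distributed array with an owner-computes loop. This is plain UPC, not SSS-specific code; the array names, block size, and length are hypothetical.

#include <upc.h>

#define N 1024

// Block-distributed shared arrays: elements are dealt out to the
// UPC threads in contiguous blocks of 64.
shared [64] double a[N], b[N];

void scale(double s) {
  int i;
  // Owner-computes loop: iteration i executes on the thread with
  // affinity to a[i], so the accesses below are local.
  upc_forall (i = 0; i < N; i++; &a[i]) {
    a[i] = s * b[i];
  }
  upc_barrier;  // synchronize before other threads read the results
}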

Slide 9: Brook Spring 03 Goals
– Refine the semantics of the operators: new version of the spec
– Implement the views-of-memory API (UPC)
– Find an application for irregular structures: Dijkstra, incomplete LU; dynamic structure
– Start switching to the new compiler
– Brooktran spec/implementation: implemented in Open64
– Concern: have lost metacompiler support

Slide 10: Software Winter 02 Goals – SVM
– Spec has evolved: consensus between MIT, Texas, Stanford, and USC
– Implement the multi-node version: no progress
– SVM-to-simulator path: no progress
– Multi-thread

Slide 11: SVM Spring 03 Goals
– Spec is complete – and supports SSS
– Revise the single-node simulator
– Multi-node simulator (preliminary)

Slide 12: Software Winter 02 Goals (3 of 3)
Start regular meetings [Done]
Compiler
– Decide on the flow from Brook -> SVM -> SSS [Mattan]: done
– Select a base compiler [Jayanth] (ORC, GNU, SUIF, TenDRA, others): done
– "Spike" a simple program from Brook -> SSS [Mattan/Jayanth ++]: started; modified the front end, operating on WHIRL
– Brook to NVIDIA
– Optimizations [Spring]
Run time
– Write a white paper

Slide 13: Compiler Spring 03 Goals
– Complete the feasibility study
– Brook-to-C path: parse Brook, generate C (see the sketch after this slide)
– Optimizations: see Mattan's document
– Need to generate SVM code by mid-summer
– Parse Brooktran [Alan, Fatica, Jayanth]
– Kernel scheduler MULADD [Das]
– SVM to SSS [Francois – long term – need a plan]
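As an illustration of what a baseline Brook-to-C path produces, here is a hedged sketch: the elementwise saxpy kernel from slide 5 lowered to plain C, with the stream abstraction flattened into arrays and an explicit loop. This shows the general technique only; it is not the actual compiler's output, and a real path would also emit runtime calls for stream allocation and transfer.

#include <stddef.h>

/* Hypothetical lowering of
 *   kernel void saxpy(float a, float x<>, float y<>, out float result<>)
 * to plain C: the implicit per-element kernel invocation becomes an
 * explicit loop over the stream's elements.
 */
static void saxpy_lowered(float a, const float *x, const float *y,
                          float *result, size_t n) {
  for (size_t i = 0; i < n; i++) {
    result[i] = a * x[i] + y[i];
  }
}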

Slide 14: Application Winter 02 Goals
StreamFLO [Fatica]
– Base version is complete
– Not running on the simulator
– Early start on the 3D version; partitioning is waiting on the API definition
StreamFEM [Barth]
– Waiting on the spec for partitioning
– 3D arithmetic kernels done
– Tridiagonal in Brook
StreamMD [Eric/student]
– Ported GROMACS to the NV30; benchmarks
  – Performance is dependent on the number of registers
  – Doesn't work with the Cg compiler
Model applications [Ron/Frank]
– Started
Look at Sierra and Purple benchmarks: ppm, sweep3D [delay]

Slide 15: Application Spring 03 Goals
StreamFLO [Fatica]
– Parse Brooktran – Fortran to WHIRL [Alan, Fatica]
– Partitioned version – multi-node UPC
– 3D version
StreamFEM [Barth]
– Simulate 3D
– Sparse LUD
– Partitioned version
StreamMD [Eric/student]
– Hand-tune the NV30 assembly code
– GROMACS in Brook
Model applications [Ron/Frank]
– C implementations of adaptive structures
Look at Sierra and Purple benchmarks: ppm, sweep3D [delay]

Slide 16: Architecture Winter 02 Goals
Single-node simulator [Jung-Ho, Knight]
– 64-bit support, MULADD, scalar processor: not yet
Multi-node simulator [Jung-Ho, Abhishek]
– Network model, multi-node mechanisms: not yet
Point studies
– Aspect ratio (SSE vs. VLIW): planning
– Conditional execution [Mattan/Ujval]: started
– Sparse clusters
– SRF organization [Nuwan]: complete
– Cache alternatives [Jung Ho]
– Add-and-store study [Jung Ho]: started
– I/O
– Iterative operations [Francois]: planned

Slide 17: Architecture Spring 03 Goals
Multi-node simulator
Point studies
– Aspect ratio [Tim]
– Conditional execution [Mattan/Ujval]
– Sparse clusters [delay]
– SRF organization [Nuwan]: refine
– Cache alternatives [Jung Ho]
– Add-and-store study [Jung Ho]
– I/O [?]
– Iterative operations [Francois]
64-bit [delay]
Scalar processor [delay]

Slide 18: Special Winter 02 Goals
Fix website [Pat]
– Public and private websites
Name that computer
– Mississippi
– Axios
– Submit names to Mattan
– Bill, Pat, Bill to choose
Project party [Mattan – Pat's house]

Slide 19: Name Resolution
From now on, the SSS is called Merrimac.

Slide 20: Spring Quarter Meeting Schedule
Date   Presenter(s)   Topic
4/1    Fedkiw         Party
4/8    Alan, Fatica   Brooktran
4/15   Kapasi         Conditionals
4/22   Fatica         StreamFLO update
4/29   –              Review prep
5/6    –              Review prep
5/13   Tim, Tim       StreamFEM 3D
5/20   Ian, Pat       Brook specification
5/27   Mattan         Bandwidth comparison
6/3    Jayanth        Compiler
6/10   Bill           Wrapup

Slide 21: Papers
Architecture
– Indexable SRFs (Nuwan)
– Streaming Supercomputer overview (Tim K.)
– Streaming on conventional CPUs (Mattan)
– Conditionals (Ujval)
– Remote ops (Jung Ho)
– Aspect ratio (?)
– Data parallel (SSE) vs. ILP (VLIW)
Software
– Design of Brook (Ian)
– Data-parallel programming on graphics HW (Pat)
– Brook-to-Cg compiler
Apps
– GROMACS
– StreamFEM (Tim²)
Overview (Bill and Pat)