SSS Software Update Ian Buck Mattan Erez August 2002.

Slides:



Advertisements
Similar presentations
Network II.5 simulator ..
Advertisements

Compiler Support for Superscalar Processors. Loop Unrolling Assumption: Standard five stage pipeline Empty cycles between instructions before the result.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
Implementation of 2-D FFT on the Cell Broadband Engine Architecture William Lundgren Gedae), Kerry Barnes (Gedae), James Steed (Gedae)
Inpainting Assigment – Tips and Hints Outline how to design a good test plan selection of dimensions to test along selection of values for each dimension.
Operating Systems Lecture 10 Issues in Paging and Virtual Memory Adapted from Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard. Zhiqing.
Chapter 3 Loaders and Linkers
Chapter 10 Introduction to Arrays
Efficient Solutions to the Replicated Log and Dictionary Problems
Chapter 19: Network Management Business Data Communications, 4e.
CISC October Goals for today: Foster’s parallel algorithm design –Partitioning –Task dependency graph Granularity Concurrency Collective communication.
Inline Assembly Section 1: Recitation 7. In the early days of computing, most programs were written in assembly code. –Unmanageable because No type checking,
L13: Review for Midterm. Administrative Project proposals due Friday at 5PM (hard deadline) No makeup class Friday! March 23, Guest Lecture Austin Robison,
OS2-1 Chapter 2 Computer System Structures. OS2-2 Outlines Computer System Operation I/O Structure Storage Structure Storage Hierarchy Hardware Protection.
June 11, 2002 SS-SQ-W: 1 Stanford Streaming Supercomputer (SSS) Spring Quarter Wrapup Meeting Bill Dally, Computer Systems Laboratory Stanford University.
StreamMD Molecular Dynamics Eric Darve. MD of water molecules Cutoff is used to truncate electrostatic potential Gridding technique: water molecules are.
Silberschatz, Galvin and Gagne  Operating System Concepts Chapter 2: Computer-System Structures Computer System Operation I/O Structure Storage.
1 Concurrent and Distributed Systems Introduction 8 lectures on concurrency control in centralised systems - interaction of components in main memory -
ECE669 L5: Grid Computations February 12, 2004 ECE 669 Parallel Computer Architecture Lecture 5 Grid Computations.
Scripting Languages For Virtual Worlds. Outline Necessary Features Classes, Prototypes, and Mixins Static vs. Dynamic Typing Concurrency Versioning Distribution.
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Parallel Programming in C with MPI and OpenMP Michael J. Quinn.
Analysis and Performance Results of a Molecular Modeling Application on Merrimac Erez, et al. Stanford University 2004 Presented By: Daniel Killebrew.
CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.
Efficient Parallelization for AMR MHD Multiphysics Calculations Implementation in AstroBEAR.
Computer-System Structures
Big Kernel: High Performance CPU-GPU Communication Pipelining for Big Data style Applications Sajitha Naduvil-Vadukootu CSC 8530 (Parallel Algorithms)
© John A. Stratton, 2014 CS 395 CUDA Lecture 6 Thread Coarsening and Register Tiling 1.
1 ES 314 Advanced Programming Lec 2 Sept 3 Goals: Complete the discussion of problem Review of C++ Object-oriented design Arrays and pointers.
Shekoofeh Azizi Spring  CUDA is a parallel computing platform and programming model invented by NVIDIA  With CUDA, you can send C, C++ and Fortran.
L15: Putting it together: N-body (Ch. 6) October 30, 2012.
Computer Systems Overview. Page 2 W. Stallings: Operating Systems: Internals and Design, ©2001 Operating System Exploits the hardware resources of one.
Lecture 8 – Stencil Pattern Stencil Pattern Parallel Computing CIS 410/510 Department of Computer and Information Science.
Computer Architecture and Operating Systems CS 3230: Operating System Section Lecture OS-7 Memory Management (1) Department of Computer Science and Software.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Chapter 2: Computer-System Structures
DCE (distributed computing environment) DCE (distributed computing environment)
CCGrid 2014 Improving I/O Throughput of Scientific Applications using Transparent Parallel Compression Tekin Bicer, Jian Yin and Gagan Agrawal Ohio State.
Chapter 4 Storage Management (Memory Management).
ICS 145B -- L. Bic1 Project: Main Memory Management Textbook: pages ICS 145B L. Bic.
The Vesta Parallel File System Peter F. Corbett Dror G. Feithlson.
_______________________________________________________________CMAQ Libraries and Utilities ___________________________________________________Community.
CUDA - 2.
8 1 Chapter 8 Advanced SQL Database Systems: Design, Implementation, and Management, Seventh Edition, Rob and Coronel.
CCGrid 2014 Improving I/O Throughput of Scientific Applications using Transparent Parallel Compression Tekin Bicer, Jian Yin and Gagan Agrawal Ohio State.
Lecture 4 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South Carolina Department of.
System design : files. Data Design Concepts  Data Structures  A file or table contains data about people, places or events that interact with the system.
Precomputation- based Prefetching By James Schatz and Bashar Gharaibeh.
INTRODUCTION TO GIS  Used to describe computer facilities which are used to handle data referenced to the spatial domain.  Has the ability to inter-
Silberschatz, Galvin and Gagne  2002 Modified for CSCI 399, Royden, Operating System Concepts Operating Systems Lecture 4 Computer Systems Review.
Project18’s Communication Drawing Design By: Camilo A. Silva BIOinformatics Summer 2008.
Processes and Virtual Memory
An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-body Algorithm By Martin Burtscher and Keshav Pingali Jason Wengert.
Euro-Par, 2006 ICS 2009 A Translation System for Enabling Data Mining Applications on GPUs Wenjing Ma Gagan Agrawal The Ohio State University ICS 2009.
CS320n – Elements of Visual Programming Assignment Help Session.
AUTO-GC: Automatic Translation of Data Mining Applications to GPU Clusters Wenjing Ma Gagan Agrawal The Ohio State University.
April 24, 2002 Parallel Port Example. April 24, 2002 Introduction The objective of this lecture is to go over a simple problem that illustrates the use.
Parallel Computing Presented by Justin Reschke
Aggregator Stage : Definition : Aggregator classifies data rows from a single input link into groups and calculates totals or other aggregate functions.
Hello world !!! ASCII representation of hello.c.
Chapter 2: Computer-System Structures
Module 11: File Structure
INTRODUCTION TO GEOGRAPHICAL INFORMATION SYSTEM
Main Memory Management
Main Memory Background Swapping Contiguous Allocation Paging
GENERAL VIEW OF KRATOS MULTIPHYSICS
Introduction to Data Structures
Chapter 2: Computer-System Structures
Parallel Programming in C with MPI and OpenMP
Chapter 2: Computer-System Structures
Presentation transcript:

SSS Software Update Ian Buck Mattan Erez August 2002

SSS Software Update2 Outline Applications StreamMD StreamFlo Compilation Compiler Status Multinode notes Results (what we want)

August 2002SSS Software Update3 StreamMD Overview Water simulation Gridded for accelerated force calculation Only adjacent grid cells interact Complex force calculation Tens of instructions per word transferred Status Reference C++ version working Mostly complete Brook version

August 2002SSS Software Update4 StreamMD Details Grid structure The grid cells are a stream Each stream record holds a number of molecules A separate stream holds the number of molecules in each cell Gridding operation For every molcl – compute the appropriate cell Use a GatherAdd operation to assign locations Each molcl cell’s num_molcls is atomically incremented The resulting indices are returned in a stream The return stream is used to scatter the molecules Force calculation Redundant forces are calculated assuming max number of molecules per cell

August 2002SSS Software Update5 StreamFlo 2D multi-grid solver Data structure Data is stored in a 1D stream which holds all the levels Each level is selected using domain and then shaped to a 2D stream Operation In each level a stencil is used to calculate the flux A group/stencil is used to transfer the field up/down the hierarchy

August 2002SSS Software Update6 Compiling StreamMD Gridding operation supported by hardware stream Fetch&Op Force calculation

August 2002SSS Software Update7 Compiling StreamFlo Simply need to generate code for the different stream operators used StreamShape StreamStencil StreamDomain/StreamMerge StreamSetLength The general approach is to delay the operation until required by a kernel Register the operator parameters in a compile time and/or runtime structure and wait

August 2002SSS Software Update8 StreamShape Register shape in runtime table Dimension (currently 2D only) 1D streams are not tracked Size M x N

August 2002SSS Software Update9 StreamStencil (1) Map the 2D stencil into 1D streams Suitable for both StreamC/KernelC and the SVM Minimize cost Replication (re-reading stream elements) Communication Utilize LRF and Scratchpad Current implementation relies on specific order f execution in the stream unit

August 2002SSS Software Update10 StreamStencil (2) stencil cluster 0 cluster 1 cluster 2 num_rows communication …Index stream replication (boundary values)

August 2002SSS Software Update11 StreamStencil (3) Communication to computation ratio proportional to row length num rows Replication only for boundary values Number of rows determined by amound of local storage Scratchpad LRF

August 2002SSS Software Update12 StreamGroup Handled exactly like streamStencil No need for communication No need for boundary conditions num_rows == group_length (since no comm)

August 2002SSS Software Update13 StreamDomain Domain use: Another domain – propagate boundaries (do nothing) Kernel input – generate an index stream for the domain and gather the domain (unless 1D) Kernel output – wait for StreamMerge and then use index stream to copy result back Stencil input – generate the stencil index stream starting at the appropriate offset Problems – can only work if the domain portion of the source stream is not modified

August 2002SSS Software Update14 Compiler Status Most features implemented Parts of streamStencil and SetLength are currently being coded Annoyances Can’t handle multiple.br files No automatic type casting in kernels 1 != 1. != 1.0 == 1f -1 doesn’t work (use 0 – 1) … No global variables allowed in kernels Shaky error reporting Some editing of.h files required

August 2002SSS Software Update15 Compiler Status (2) Deficiencies Can’t handle many stream functions – if the function changes the properties of the stream (length, shape, stencil, domain, …) Can’t handle arrays as record fields or kernel params Vec3 must be.x,.y, and.z instead of [0], [1], and [2] No optimizations We did our best implementing efficient strip-mining and stream operators Strip parameters must be inserted manually No merging/splitting of kernels No multi-node support Compiles to StreamC/KernelC and not SVM

August 2002SSS Software Update16 Multi-Node Notes Simple partitioning Both apps are basically regular meshes Simple synchronization Very obvious synch points between force/flux calculations and integration/multi-grid iterations StreamMD requires refstream and user synchronizations Will use C library mutex type synchronization

August 2002SSS Software Update17 Results (what we need) Single node Run-time LRF/SRF/MEM bandwidth Cluster utilization Stall time (due to memory/host) Multi node All of the above Network BW Synch stalls Comparison Need to identify, define,and collect base line numbers for the 100:1 comparison

August 2002SSS Software Update18

August 2002SSS Software Update19

August 2002SSS Software Update20 StreamSetLength New length < old length Rename all future references to new_stream Derive new_stream from old_stream New length > old length Rename all future references to new_stream Create new_stream with the new length Copy old data to the beginning of the new stream