The Triumph of Hope over Experience*?
Bill Gropp
*Samuel Johnson

Where Was Parallel I/O?
- No parallel I/O results were presented (independent I/O by processes in the same parallel "job" doesn't count)
- No application wants to generate a zillion files, proportional to the number of processes or nodes (the file-per-process pattern is sketched below)
  - They do it because they've given up getting what they want
  - "Only one parallel file system works and I can't buy it"
- Is this bad? Avoid I/O?
  - But you need to output something (why else compute?)
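To make the complaint concrete, here is a minimal sketch (not from the slides; the file-name scheme and contents are illustrative) of the file-per-process pattern that produces one output file per rank. The MPI-IO sketch after the next slide shows the shared-file alternative.

```c
/* Hypothetical sketch of the "zillion files" pattern: every rank opens its
 * own file, so the file count grows with the job size. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char name[64];
    snprintf(name, sizeof(name), "output.%06d.dat", rank);

    FILE *f = fopen(name, "w");          /* one file per process */
    if (f) {
        fprintf(f, "data from rank %d\n", rank);
        fclose(f);
    }

    MPI_Finalize();
    return 0;
}
```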

Realistic (Application-centric) I/O Benchmarks
- POSIX I/O essentially specifies sequential consistency
  - Understandable choice for the masses
  - Disastrous choice for parallel performance
- The solution is not "no consistency" (e.g., NFS); it is a precise, relaxed consistency model
  - Such definitions exist: MPI-IO provides an API (sketched below)
  - Parallel file systems (incl. PVFS) provide implementations that work today
- We have a parallel I/O benchmark suite
  - It is in pretty rough shape
  - But any working parallel file system (with MPI-IO) should be able to run it
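As a minimal sketch of the API referred to above (the file name and block size are my own illustrative choices, not from the slides), each rank writes its block of a single shared file with a collective MPI-IO call; the MPI-IO layer and the underlying parallel file system are then free to optimize under the relaxed consistency model.

```c
/* Hypothetical sketch: all ranks write one shared file via collective MPI-IO,
 * instead of one file per process. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK 1024  /* doubles per rank; illustrative */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(BLOCK * sizeof(double));
    for (int i = 0; i < BLOCK; i++)
        buf[i] = rank;                  /* stand-in for real output data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset; the collective call lets the
     * MPI-IO layer and the parallel file system optimize the access. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, BLOCK, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```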

Predictions
- SOS9: parallel (not independent) I/O results (out on a limb)
- POSIX I/O will stop being a requirement for high-performance file systems
- Parallel I/O will be required of any programming model/environment
  - Just kidding!

Programming Models
- You can always create a special-purpose language for your application
  - It will (given enough effort) be superior to all other languages for your application
  - It may be a disaster for others
  - It might be a disaster for you when the next machine comes out

Why Was MPI Successful?
- It addresses all of the following issues:
  - Portability
  - Performance
  - Simplicity and symmetry
  - Modularity
  - Composability
  - Completeness
- Most current proposals for new languages address only a subset
  - Two features often missing: modularity (e.g., library support) and completeness (e.g., I/O)

Predictions
- We will be complaining about MPI
  - That's OK; people are still complaining about Fortran
  - We'll be complaining because most of the important applications are using MPI
- We will be complaining about memory latency and bandwidth
  - The ratio of memory latency to CPU cycle time will continue to increase
  - The ratio of MPI latency to memory latency will be smaller, both relative to today and relative to the memory/CPU ratio
- Applications will begin to use MPI Put/Get

MPI Put/Get
- Part of MPI-2 (MPI = MPI-1 + MPI-2)
- Can be implemented efficiently on a wide range of platforms (e.g., does not require cache coherence with respect to network DMA)
- Designed to fit into the MPI model (with full generality)
- Q: I've heard that the rules are too complex
  - A: There are simple rules that are sufficient for most programmers (see the sketch below)
- Q: I've heard that RMA is slower than point-to-point
  - A: Sadly, few implementations have gotten this right. MPICH2 demonstrates how to do it; shame will do the rest.
- Q: But what about xxx?
  - A: Some xxx are a misunderstanding of the standard, and some xxx are in fact limitations of the standard. The right fix, though, is to improve the standard, not start over.
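A minimal sketch of the "simple rules" style of one-sided communication, using active-target (fence) synchronization: each rank puts its rank number into its right neighbor's window. The neighbor topology and variable names are illustrative assumptions, not from the slides.

```c
/* Hypothetical sketch of MPI-2 one-sided communication with fence
 * synchronization: MPI_Put into a neighbor's exposed window. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int recv_val = -1;                   /* exposed to remote MPI_Put */
    MPI_Win win;
    MPI_Win_create(&recv_val, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int right = (rank + 1) % size;
    int send_val = rank;

    /* The first fence opens the epoch; all Puts are complete at the
     * closing fence. This is the simplest of the RMA synchronization
     * models and is sufficient for many programs. */
    MPI_Win_fence(0, win);
    MPI_Put(&send_val, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    /* recv_val now holds the rank of the left neighbor. */
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```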