Benjamin Perry and Martin Swany, University of Delaware, Department of Computer and Information Sciences

 Background
 The problem
 The solution
 The results
 Conclusions and future work

 MPI programs communicate via MPI data types
 MPI data types are usually modeled after native data types (see the sketch below)
 Payloads are often arrays of MPI data types
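For concreteness, here is a minimal sketch of that modeling in C. The struct and its fields are hypothetical; MPI_Type_create_struct is the modern spelling of the MPI_Type_struct call the slides reference.

```c
#include <mpi.h>
#include <stddef.h>

/* Hypothetical native type that payload arrays are built from. */
struct particle {
    double pos[3];   /* offset  0, 24 bytes */
    double vel[3];   /* offset 24, 24 bytes */
    int    id;       /* offset 48 */
};

/* Mirror the native layout with an MPI derived datatype. */
void build_particle_type(MPI_Datatype *newtype)
{
    int          blocklens[3] = {3, 3, 1};
    MPI_Aint     displs[3]    = {offsetof(struct particle, pos),
                                 offsetof(struct particle, vel),
                                 offsetof(struct particle, id)};
    MPI_Datatype types[3]     = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};

    MPI_Type_create_struct(3, blocklens, displs, types, newtype);
    MPI_Type_commit(newtype);
}
```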

 The sending MPI library packs the payload into a contiguous block
 The receiving MPI library unpacks the payload into its original form
 Non-contiguous blocks incur a copy penalty
 SPMD programs, particularly in homogeneous environments, can use optimized packing

The problem

 Users model MPI types after native types
 Some fields do not need to be transmitted
 Users often replace dead fields with a gap in the MPI type so it stays aligned with the native type (see the example below)
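Continuing the hypothetical sketch above: if vel never needs to be transmitted, the user describes only pos and id and leaves a hole where vel sits, which is exactly the gap pattern at issue.

```c
/* Same native struct, but vel is dead on the wire: the MPI type
 * describes only pos and id, leaving a gap where vel lives. */
void build_gapped_particle_type(MPI_Datatype *newtype)
{
    int          blocklens[2] = {3, 1};
    MPI_Aint     displs[2]    = {offsetof(struct particle, pos),
                                 offsetof(struct particle, id)};
    MPI_Datatype types[2]     = {MPI_DOUBLE, MPI_INT};

    /* pos ends at offset 24 but id starts at offset 48, so the
     * type is non-contiguous: arrays of it must be packed
     * element by element on every send and receive. */
    MPI_Type_create_struct(2, blocklens, displs, types, newtype);
    MPI_Type_commit(newtype);
}
```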

 Smaller payload... but
 The MPI type is non-contiguous
◦ Copy penalty during packing and unpacking
 Multi-core machines and high-performance networks feel the cost, depending on payload size
 Multi-core machines are becoming ubiquitous
◦ SPMD applications are ideal for these platforms

The solution

 Applies only to SPMD applications
 Static analysis to locate MPI data types
◦ MPI_Type_struct()
 Build an internal representation of the MPI data type
◦ The MPI data type is defined via a library call at runtime
◦ Its parameters give the base types, consecutive-instance counts (block lengths), and displacements
◦ Def/use analysis recovers the static definition

 Look for gaps in the displacement array
◦ A field's extent is the size of its base type multiplied by its consecutive-instance count; a gap is any displacement that jumps past the previous field's end (see the sketch below)
 Match the MPI type to a native type
◦ Analyze the types of the payload
◦ The MPI type must be a subset of the native data structure
◦ All sends and receives using the MPI type handle must also share the same base types
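A minimal sketch of that gap test, written here as runtime C over the (block lengths, displacements, types) triple for readability; the actual pass performs the same arithmetic statically over the LLVM IR.

```c
/* Returns 1 if the datatype definition contains a hole between two
 * consecutive fields.  Assumes displacements are in ascending order. */
int has_gap(int count, const int blocklens[],
            const MPI_Aint displs[], const MPI_Datatype types[])
{
    for (int i = 0; i + 1 < count; i++) {
        int size;
        MPI_Type_size(types[i], &size);
        /* End of field i = displacement + base size * block length. */
        MPI_Aint end = displs[i] + (MPI_Aint)size * blocklens[i];
        if (displs[i + 1] > end)
            return 1;   /* gap between field i and field i + 1 */
    }
    return 0;
}
```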

 Perform the transformation on both the MPI type and the native type
◦ Adjust the parameters in MPI_Type_struct
◦ Relocate non-transmitted fields to the bottom of the type
 End goal: improve library packing performance for large arrays
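Applied to the hypothetical example, the transformation would rewrite both definitions so that the transmitted fields form one contiguous run:

```c
/* After the transformation: the dead field moves to the bottom of the
 * native type, making the transmitted fields contiguous. */
struct particle_opt {
    double pos[3];   /* offset  0 */
    int    id;       /* offset 24: now adjacent to pos */
    double vel[3];   /* relocated non-transmitted field */
};

void build_optimized_particle_type(MPI_Datatype *newtype)
{
    int          blocklens[2] = {3, 1};
    MPI_Aint     displs[2]    = {offsetof(struct particle_opt, pos),
                                 offsetof(struct particle_opt, id)};
    MPI_Datatype types[2]     = {MPI_DOUBLE, MPI_INT};

    /* No gap between pos and id, so the library can copy the
     * transmitted region of each element as a single block. */
    MPI_Type_create_struct(2, blocklens, displs, types, newtype);
    MPI_Type_commit(newtype);
}
```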

 Safety checks
◦ Casts to another type
◦ Address-of, except when used to compute a displacement (see the example below)
◦ Non-local types
 Profitability checks
◦ Sends/receives inside loops
◦ Large arrays of MPI types in sends/receives
◦ Cache-miss cost: preserve locality by adjusting the native type when it appears in loops
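As an illustration of the address-of hazard (hypothetical code): once a field's address is used for anything beyond computing a displacement, relocating that field would silently change behavior, so the pass must reject the transformation.

```c
/* Unsafe pattern: pointer arithmetic from a field's address assumes
 * the original layout, so relocating vel would break this code. */
void unsafe_use(struct particle *p)
{
    double *v = &p->vel[0];      /* field address escapes */
    int *id = (int *)(v + 3);    /* assumes id sits right after vel;
                                    false once vel moves to the end */
    *id = 0;
}
```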

The results

 LLVM compiler pass
 Open MPI
 Intel Core 2 Quad, 2.4 GHz
 Ubuntu
 Control: sending the un-optimized data type (with its gap) at payloads of various sizes
 Tested: rearranging the gap in the MPI type and the native type, at payloads of various sizes

Conclusions and future work

 MPI data types are modeled after native data types
 Users introduce gaps, making data non-contiguous and costly to pack on fast networks
 We detect this scenario at compile time
 We fix it when the transformation is safe and profitable
 Greatly improves multi-core performance; InfiniBand also receives a boost

 Data type fission with user-injected gaps (sketched below)
◦ Separate transmitted fields from non-transmitted fields
◦ Completely eliminates the data copy during packing
 Data type fission with unused fields
◦ Analyze the receiving end to see which fields are actually used
◦ Cull unused fields from the data type, then perform fission
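A sketch of what user-injected-gap fission could look like for the running example (hypothetical; the slides leave this as future work):

```c
/* Fission splits the native type so transmitted fields get their own
 * struct; the wire type is then fully contiguous and packing needs
 * no per-field copies at all. */
struct particle_wire  { double pos[3]; int id; };  /* sent over MPI  */
struct particle_local { double vel[3]; };          /* stays resident */
```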

Questions?