Presentation transcript:

PIPES: A Language and Compiler for Task-based Programming on Distributed-Memory Clusters
CDSC/InTrans Review, Oct 24-25, 2016
Student name: Martin Kong, OSU/Rice
Faculty names: Louis-Noel Pouchet, P. Sadayappan, Vivek Sarkar
Dept. of Computer Science, OSU / Rice / CSU

PIPES is a macro-dataflow language and compiler for specifying parallel and distributed algorithms.

Main contributions:
- PIPES language: a dataflow-inspired language derived from CnC/DFGL, enriched with constructs for task placement/scheduling and for communication specifications.
- PIPES compiler: optimizes the polyhedral subset of PIPES for automatic task coarsening and coalescing, and translates the result into Intel CnC C++ runtime tuners that implement the specified mapping (a tuner sketch appears below).

Example: SSYR2K
SSYR2K can be seen as a sequence of two GEMM calls: GEMM(C, B, trans(A)); GEMM(C, A, trans(B)) (a CBLAS sketch appears below).

Intel CnC is a powerful runtime system...
- It implements the semantics of Concurrent Collections (CnC).
- Only a "task" graph is needed as input.
- The runtime decides the scheduling, placement, and communication policies.

...but to obtain high performance, one needs to be able to:
- Specify (partial) task placement and communication strategies.
- Adapt the granularity of tasks to the target machine.
- Auto-tune the implementation for maximum performance.

PIPES: a language + compiler to exploit CnC-like runtimes
- Compact, expressive language to describe task dataflow, communications, etc.
- Advanced analysis and transformation of the task graph (e.g., coarsening).
- Automatic code generation of Intel CnC runtime tuners.

Example: SGEMM with Cannon's algorithm (a serial sketch of the tile schedule appears below)
- Research cluster at OSU; peak single-precision performance: ~1200 GF/s for 8 nodes.
- Problem: single precision, 8000x8000 matrices.
- Various coarsening factors explored via auto-tuning; the best found per process count is reported in the performance plot.
- Intel MKL is used in the task bodies.

M. Kong, L.-N. Pouchet, P. Sadayappan, and V. Sarkar, "PIPES: A Language and Compiler for Task-based Programming on Distributed-Memory Clusters," IEEE/ACM Supercomputing (SC'16), 2016.
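As a concrete rendering of the SSYR2K decomposition stated above, here is a minimal sketch using the standard CBLAS interface (MKL is mentioned on the slide, but the function name, row-major layout, and argument names here are illustrative assumptions, not code from the paper):

```cpp
// Sketch: SSYR2K as two SGEMM calls, per the slide's identity.
// A and B are n-by-k, C is n-by-n, all row-major. Unlike SSYR2K proper,
// which updates only one triangle, this computes the full (symmetric) C.
#include <mkl_cblas.h>  // or <cblas.h> with any generic CBLAS

void ssyr2k_via_gemm(int n, int k, float alpha,
                     const float* A, const float* B,
                     float beta, float* C) {
  // GEMM(C, B, trans(A)):  C = beta*C + alpha * B * A^T
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
              n, n, k, alpha, B, k, A, k, beta, C, n);
  // GEMM(C, A, trans(B)):  C += alpha * A * B^T
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
              n, n, k, alpha, A, k, B, k, 1.0f, C, n);
}
```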
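The slide emphasizes that Intel CnC needs only a task graph as input, and that PIPES generates runtime tuners to control the mapping. Below is a minimal hand-written sketch of such a specification: a step collection prescribed by a tag collection, plus a step tuner whose compute_on() places tasks on processes. The step body, tag type, and round-robin placement are illustrative assumptions; a PIPES-generated tuner would encode the placement chosen by the compiler and auto-tuner.

```cpp
// Minimal Intel CnC (icnc) sketch. For distributed runs, main() must create
// a CnC::dist_cnc_init<gemm_context> object before using the context.
#include <cnc/dist_cnc.h>

struct gemm_context;

// One task instance is executed per prescribed tag (here: a tile id).
struct gemm_step {
  int execute(const int& tile, gemm_context& c) const;
};

// Placement tuner: a stand-in for the tuners PIPES generates.
struct placement_tuner : public CnC::step_tuner<> {
  int compute_on(const int& tile, gemm_context&) const {
    return tile % numProcs();  // round-robin task placement
  }
};

struct gemm_context : public CnC::context<gemm_context> {
  CnC::step_collection<gemm_step, placement_tuner> steps;
  CnC::tag_collection<int> tags;
  CnC::item_collection<int, float> items;
  gemm_context() : steps(*this), tags(*this), items(*this) {
    tags.prescribes(steps, *this);  // each tag put spawns one step instance
  }
};

int gemm_step::execute(const int& tile, gemm_context& c) const {
  float in;
  c.items.get(tile, in);          // data dependence; replayed if unavailable
  c.items.put(tile + 1, in + 1);  // produce an item for a downstream task
  return CnC::CNC_Success;
}
```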
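For the SGEMM example, the Cannon schedule can be summarized in a few loops. The following is a serial, shared-memory sketch of the tile schedule, not the distributed implementation from the paper: q, b, and the layout are illustrative, and in the distributed PIPES/CnC version each tile multiply below would be one task instance, with k = (i+j+s) mod q reflecting the cyclic shifts of the A and B tiles across the process grid.

```cpp
// Serial sketch of Cannon's tile schedule for C = A*B on a q x q tile grid.
// Matrices are dense n x n (n = q*b), row-major; C must be zero-initialized.
#include <vector>

void cannon_sgemm(int q, int b, const std::vector<float>& A,
                  const std::vector<float>& B, std::vector<float>& C) {
  const int n = q * b;
  for (int s = 0; s < q; ++s)        // q rounds of "multiply, then shift"
    for (int i = 0; i < q; ++i)      // tile row
      for (int j = 0; j < q; ++j) {  // tile column
        int k = (i + j + s) % q;     // tile aligned here after the s-th shift
        for (int ii = 0; ii < b; ++ii)
          for (int kk = 0; kk < b; ++kk)
            for (int jj = 0; jj < b; ++jj)
              C[(i*b + ii)*n + j*b + jj] +=
                  A[(i*b + ii)*n + k*b + kk] * B[(k*b + kk)*n + j*b + jj];
      }
}
```

For each tile (i,j), k = (i+j+s) mod q takes every value in 0..q-1 exactly once across the q rounds, so the loop nest accumulates the full sum over k, matching the standard correctness argument for Cannon's algorithm.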