CnC: A Dependence Programming Model

Presentation transcript:

CnC: A Dependence Programming Model
Kath Knobe kath.knobe@rice.edu
Zoran Budimlić zoran@rice.edu

Motivation for the talk (not motivation for CnC)
Someone hears about CnC. To learn more, they check out an implementation. It wasn't designed for what they want, so they walk away.
Our "marketing department" needs to make some changes to how we present/write/... about CnC. Here's a possible different slant.
You know most of this, but I'm looking for:
Feedback/discussion on the problem
Feedback/discussion on the story here
Recommendations for changing the presentation
This seems like the right group.

Performance: Eigensolver
Used on:
DARPA: UHPC project at Intel
DOE: X-Stack project at Intel
DreamWorks: "How to Train Your Dragon 2"
Concurrent Collections version: Aparna Chandramowlishwaran, a GaTech first-year grad student interning at Intel
Compared against multithreaded Intel MKL, by Intel's MKL team
Baseline: Intel 2-socket x 4-core Nehalem @ 2.8 GHz + Intel Math Kernel Library 10.2
(performance chart omitted)

Cholesky
CnC version: Aparna Chandramowlishwaran, a GaTech first-year grad student interning at Intel
Compared against PLASMA: Jack Dongarra's group at Oak Ridge NL
Intel 2-socket x 4-core Nehalem @ 2.8 GHz + Intel MKL 10.2
(performance chart omitted)

Outline
Motivation
CnC
Wide array of existing CnC tuning approaches

Outline
Motivation
CnC
Wide array of existing CnC tuning approaches

Styles of languages/models
Serial languages and parallel languages:
Orderings are a mix of required / tuning / arbitrary.
To make a modification you must distinguish among these. That takes time and leads to errors.
Too few dependences: errors. Too many dependences: lose performance.
Dependence languages:
Required orderings only. No tuning orderings, no arbitrary orderings.
Make a modification => the ordering requirements are clear. No time lost, no errors.
CnC is a dependence programming model. The dependences include data and control dependences. Both are explicit, at the same level, and distinct.

Some relevant current approaches
Start with an explicitly serial or an explicitly parallel program and modify it for some specific target or optimization goal: you have to undo, or trip over, earlier tunings.
Start with source code and automatically uncover the parallelism: we've been there.
Start with the white board drawing: it's how we think about our apps, and the dependences are explicit, but it's not executable.

An actual white board sketch: LULESH, a shock hydrodynamics app

A CnC graph of flow of data for LULESH

Unlike an actual white board sketch, CnC is a formal model:
It is precise.
It is executable.
It has enough information to prove properties about the program, to analyze the program, and to optimize the program, at the graph level, without access to computation code.

Outline
Motivation
CnC
  Goals that support exascale needs
  CnC details via an example app
Wide array of existing CnC tuning approaches

Needs for exascale: software engineering and tuning
Separation of various activities (ways of thinking):
The details of the computation within the computation steps
The dependences among steps
The runtime
The tuning
This separation leads to improved reuse.
The domain expert (physicist, economist, biochemist, ...) doesn't need to know about architecture issues or tuning goals.
The tuning expert (computer scientist) is given only dependences. This hides irrelevant low-level computation details. Each new tuning starts from just the dependences; use any style of tuning, just obey the dependences. The computer scientist doesn't need to know about physics, economics, or biochemistry.
Same or different people; different activities. They communicate at the level of the graph, not about physics or about parallel performance.

Tuning
Tuning consumes the bulk of the time and energy.
There is no CnC-specific approach to tuning. CnC is not a parallel programming model.
Only one requirement: the tuned app must obey the domain spec semantics.
A single domain spec supports a wide range of tuning goals and styles:
Faulty components
Distinct tuning goals (time, memory, energy, ...)
Many different styles of runtimes (static / dynamic)
Many different styles of tuning (static / dynamic)
The domain spec isn't modified by tunings. Tunings may refer to the domain spec but are isolated from it.

CnC simplifies tuning by separation of concerns
It separates the details of the computations and data from the dependences among them.
The domain spec is isolated from tuning: it explicitly represents the ordering requirements.
No orderings for tuning and no arbitrary orderings.

Outline
Motivation
CnC
  Goals that support exascale needs
  CnC details via an example app
Wide array of existing CnC tuning approaches

Cholesky factorization: the Cholesky, Trisolve, and Update steps

Cholesky CnC domain spec 1: white board drawing
Collections are just sets of instances.
Computation step collections: (C: iter), (T: iter, row), (U: iter, row, col)
Data item collections: [C: iter], [T: iter, row], [U: iter, row, col]

Cholesky CnC domain spec 1: white board drawing
By default, computation steps are deterministic.
By default, data items are dynamic single assignment.
(C: iter) [C: iter] (T: iter, row) [T: iter, row] (U: iter, row, col) [U: iter, row, col]

Cholesky CnC domain spec 2: I/O
(C: iter) [C: iter] (T: iter, row) [T: iter, row] (U: iter, row, col) [U: iter, row, col]

Cholesky CnC domain spec 3: Distinguish instances
Tags are just identifiers. They require an equality operator.
Often integers: iteration, row, col.
(C: iter) [C: iter] (T: iter, row) [T: iter, row] (U: iter, row, col) [U: iter, row, col]

Cholesky CnC domain spec 3: Distinguish instances
Tags are just identifiers. They require an equality operator.
Often integers: iteration, row, col. But not necessarily: e.g., the representation of a Sudoku board.
(C: iter) [C: iter] (T: iter, row) [T: iter, row] (U: iter, row, col) [U: iter, row, col]
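Since a tag only needs an equality operator (plus hashing, in a practical implementation), any immutable value can serve. A minimal sketch in Python, modeling item and tag collections as plain dicts and sets (illustrative only, not a CnC API):

```python
# Integer-tuple tags, as in the Cholesky spec: (iter, row, col)
update_items = {}                       # models [U: iter, row, col]
update_items[(0, 1, 1)] = 2.5           # one item instance, keyed by its tag
assert (0, 1, 1) in update_items

# A non-numeric tag: a (partial) Sudoku board, encoded as a tuple of
# tuples so it is immutable and supports equality/hashing.
board = ((5, 3, 0), (6, 0, 0), (0, 9, 8))
solver_tags = set()                     # models a control tag collection
solver_tags.add(board)                  # prescribes one solver step instance
assert board in solver_tags
```

The only requirement the model places on `board` is that two equal boards compare equal, so the runtime can tell step and item instances apart.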

Cholesky CnC domain spec 4: exact set of instances (control)
Control tag collections: <I: iter>, <IR: iter, row>, <IRC: iter, row, col>
(C: iter) [C: iter] (T: iter, row) [T: iter, row] (U: iter, row, col) [U: iter, row, col]

Cholesky CnC domain spec 4: exact set of instances (control)
Computation step instance lifecycle: in / compute / out / done.
No cyclic dependences on a single instance.
<I: iter> (C: iter) <IR: iter, row> [C: iter] (T: iter, row) <IRC: iter, row, col> [T: iter, row] (U: iter, row, col) [U: iter, row, col]

Cholesky CnC domain spec 4: exact set of instances (control)
The graph contains both data dependences and control dependences.
<I: iter> (C: iter) <IR: iter, row> [C: iter] (T: iter, row) <IRC: iter, row, col> [T: iter, row] (U: iter, row, col) [U: iter, row, col]

Cholesky CnC domain spec
Optional: dependence functions, e.g.
(C: iter) consumes [U: iter-1, iter, iter]
(C: iter) produces [C: iter]
<I: iter> (C: iter) <IR: iter, row> [C: iter] (T: iter, row) <IRC: iter, row, col> [T: iter, row] (U: iter, row, col) [U: iter, row, col]
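At element granularity these dependence functions can be executed directly. Below is a hedged Python sketch, not the Intel CnC API: the function name `cholesky_cnc` is illustrative, and the three item collections are modeled as write-once dicts keyed by the tags from the spec. Each step body consumes and produces exactly the instances the dependence functions name.

```python
import math

def cholesky_cnc(A):
    """Element-granularity Cholesky organized as the (C), (T), (U) steps."""
    n = len(A)
    U = {}                        # [U: iter, row, col] -- updated matrix values
    C = {}                        # [C: iter]           -- diagonal of L
    T = {}                        # [T: iter, row]      -- below-diagonal of L
    for i in range(n):            # seed iteration -1 with the input matrix
        for j in range(i + 1):
            U[(-1, i, j)] = float(A[i][j])
    for k in range(n):
        # (C: iter) consumes [U: iter-1, iter, iter], produces [C: iter]
        C[k] = math.sqrt(U[(k - 1, k, k)])
        # (T: iter, row) consumes [C: iter] and [U: iter-1, row, iter]
        for i in range(k + 1, n):
            T[(k, i)] = U[(k - 1, i, k)] / C[k]
        # (U: iter, row, col) consumes [T: iter, row], [T: iter, col],
        # and [U: iter-1, row, col], produces [U: iter, row, col]
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):
                U[(k, i, j)] = U[(k - 1, i, j)] - T[(k, i)] * T[(k, j)]
    # Assemble the lower-triangular factor L from the C and T collections.
    L = [[0.0] * n for _ in range(n)]
    for k in range(n):
        L[k][k] = C[k]
        for i in range(k + 1, n):
            L[i][k] = T[(k, i)]
    return L
```

For example, `cholesky_cnc([[4, 2, 2], [2, 5, 3], [2, 3, 6]])` yields the factor `[[2, 0, 0], [1, 2, 0], [1, 1, 2]]`. Because every item is written exactly once, any execution order that respects the consumes/produces relations gives the same answer, which is the point of the dependence spec.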

Semantics of CnC: specifies a partial order of execution
tag available => step controlReady
item available => step dataReady
controlReady and dataReady => step ready
step ready => step executed
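These transition rules can be modeled in a few lines. The mini-runtime below is an illustrative sketch, not any shipping CnC runtime; the class and method names are hypothetical. A step instance executes only once it is controlReady (its prescribing tag is available) and dataReady (every item it consumes is available).

```python
class StepInstance:
    def __init__(self, tag, needs, body):
        self.tag = tag            # prescribing control tag
        self.needs = set(needs)   # keys of the items this step consumes
        self.body = body          # callable(items) -> dict of produced items
        self.executed = False

class MiniRuntime:
    def __init__(self):
        self.tags = set()         # available control tags
        self.items = {}           # available data items (single assignment)
        self.steps = []           # all step instances

    def put_tag(self, tag):
        self.tags.add(tag)
        self._schedule()

    def put_item(self, key, value):
        assert key not in self.items, "dynamic single assignment violated"
        self.items[key] = value
        self._schedule()

    def _schedule(self):
        progress = True
        while progress:           # keep running steps that just became ready
            progress = False
            for s in self.steps:
                control_ready = s.tag in self.tags
                data_ready = s.needs <= self.items.keys()
                if not s.executed and control_ready and data_ready:
                    s.executed = True
                    for k, v in s.body(self.items).items():
                        assert k not in self.items
                        self.items[k] = v
                    progress = True
```

Note that puts may arrive in any order: a step whose tag arrives before its data simply waits in the controlReady state, which is exactly the partial-order semantics above.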

Outline
Motivation
CnC
Wide array of existing CnC tuning approaches

Tuning
Tuning starts with the domain spec: just the required orderings.
Wide range of tuning approaches:
Existing static tunings
Existing dynamic tunings
Possible future tunings

Static tuning
Totally static schedule across time and space
Automatic conversion of element-level data and computation to tiled versions; polyhedral tiling
Distribution functions across nodes in a cluster (dynamic within each node), for data items or for computation steps
Minimize runtime overhead by eliminating redundant attribute propagation
PIPES for distributed apps (static and dynamic)
Contributors: Vivek Sarkar, Carl Offner, Alex Nelson, Alina Sbirlea, Jun Shirako, Frank Schlimbach, Zoran Budimlić, Kath Knobe, Tiago Cogumbreiro, Martin Kong, Yuhan Peng

Dynamic tuning
Runtimes based on OCR, TBB, Babel, and Haskell
Basic runtime: determines when a step is ready; tracks state changes and executes READY steps
Workflow coordination
Choice among CPU, GPU, FPGA: a dynamic compromise between static step preference and dynamic device availability
Hierarchical affinity groups for locality
Memory usage: dynamic single assignment is higher level; it identifies values, not places. Mapping data values to memory is tuning.
Garbage collection: get-count; inspector/executor
Contributors: Vivek Sarkar, Nick Vrvilo, Zoran Budimlić, Frank Schlimbach, Shams Imam, Ryan Newton, the Rice team, Yves Vandriessche, Alina Sbirlea, Mike Burke, Sanjay Chatterjee, Kath Knobe, Dragos Sbirlea
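The get-count idea can be sketched as follows (the class and method names are hypothetical, not the Intel CnC interface): the tuning declares how many times each item will be consumed, and the runtime frees the item after its last get. This is legal precisely because dynamic single assignment guarantees the value can never change.

```python
class GetCountItems:
    """Item collection that reclaims an item after its declared last get."""

    def __init__(self):
        self._values = {}
        self._remaining = {}      # gets left before the item can be freed

    def put(self, key, value, get_count):
        assert key not in self._values, "dynamic single assignment violated"
        self._values[key] = value
        self._remaining[key] = get_count

    def get(self, key):
        value = self._values[key]
        self._remaining[key] -= 1
        if self._remaining[key] == 0:     # last consumer: reclaim storage
            del self._values[key]
            del self._remaining[key]
        return value
```

For example, after `put("t", 3.5, get_count=2)` the item survives the first `get("t")` and is freed by the second. The counts themselves come from the tuning (e.g., from the consumes dependence functions), so the domain spec is untouched.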

Possible future tunings
Out-of-core computations, based on hierarchical affinity groups
Checkpoint-continue: automatic continuous asynchronous checkpointing
Collaborative runtimes (under development)
Inspector/executor for computation
Demand-driven execution
Speculative execution
Contributors: Zoran Budimlić, Kath Knobe

Isolation: Tuning is isolated from the domain spec
Critical for software engineering for exascale, and critical for ease of tuning.
Just data and control dependences; no arbitrary constraints (except for grain ...).
The CnC domain spec is totally flexible wrt tuning: more approaches in mind now, new architectures, new tuning goals, new tuning approaches.

Tuning grain size: Hierarchy (Kath Knobe, Nick Vrvilo)
Each computation step and data item can be decomposed.
Hierarchy adds more tuning flexibility: choose the best hierarchy, choose the best grain.
Different tunings can be applied at different levels in the hierarchy.

Conclusion
CnC is a dependence language: control and data dependences are explicit, distinct, and at the same level.
Tuning is separate. There is no CnC-specific tuning.
This isolation makes tuning easier and more flexible.

Thanks to ...
DARPA: UHPC. DOE: X-Stack.
Cambridge Research Lab (DEC / Compaq / HP / Intel): Carl Offner, Alex Nelson
Intel: Frank Schlimbach, James Brodman, Mark Hampton, Geoff Lowney, Vincent Cave, Kamal Sharma, X-Stack TG team
Rice: Vivek Sarkar, Zoran Budimlić, Mike Burke, Sanjay Chatterjee, Nick Vrvilo, Martin Kong, Tiago Cogumbreiro
Reservoir: Rich Lethin, Benoit Meister
UC Irvine: Aparna Chandramowlishwaran (Intel intern)
Google: Alina Sbirlea (Rice), Dragos Sbirlea (Rice)
UCSD: Laura Carrington, Pietro Cicotti
GaTech: Rich Vuduc, Hasnain Mandviwala (Intel intern), Kishore Ramachandran
Indiana: Ryan Newton (Intel)
Purdue: Milind Kulkarni, Chenyang Liu
PNNL: John Feo, Ellen Porter
Facebook: Nicolas Vasilache (Reservoir)
Micron: Kyle Wheeler (xxNL)
Dreamworks: Martin Watt
Two Sigma: Shams Imam (Rice), Sagnak Tasirlar (Rice)

Discuss CnC related topics
Intel CnC on Intel's WhatIf site: http://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc
Available open source: https://icnc.github.io
Rice: https://wiki.rice.edu/confluence/display/HABANERO/CNC
New apps, optimizations, runtime, tuning, ...: send mail to kath.knobe@rice.edu, zoran@rice.edu, or frank.schlimbach@intel.com
CnC'16 workshop: Sept 27-28, 2016, co-located with LCPC'16 in Rochester, NY. https://cncworkshop2016.github.io/
To get on the mailing list send mail to

Thank you!