CS 267 Spring 2008, Horst Simon, UC Berkeley. May 15, 2008. Code Generation Framework for Process Network Models onto Parallel Platforms. Man-Kit Leung, Isaac Liu, Jia Zou.


CS 267 Spring 2008, Horst Simon, UC Berkeley. May 15, 2008. Code Generation Framework for Process Network Models onto Parallel Platforms. Man-Kit Leung, Isaac Liu, Jia Zou. Final Project Presentation.

Leung, Liu, Zou (2/18), CS267 Sp 08 Final Presentation, UC Berkeley

Outline
- Motivation
- Demo
- Code Generation Framework
- Application and Results
- Conclusion

Motivation
Parallel programming is difficult:
- Functional correctness
- Performance debugging + tuning (basically, trial & error)
Code generation as a tool:
- Systematically explore the implementation space
- Rapid development / prototyping
- Optimize performance
- Maximize (programming) reusability
- Correct-by-construction [E. Dijkstra '70]
- Minimize human errors (bugs)
- Eliminates the need for low-level testing
- Because, otherwise, manual coding is too costly, especially for multiprocessor/distributed platforms

Higher-Level Programming Model
[Figure: two source actors feeding a sink actor through implicit buffers]
A Kahn Process Network (KPN) is a distributed model of computation (MoC) in which a group of processing units are connected by communication channels to form a network of processes.
- The communication channels are FIFO queues.
- "The Semantics of a Simple Language for Parallel Programming" [GK '74]
Properties: deterministic, inherently parallel, expressive.
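As a concrete illustration of these semantics, a single Kahn channel can be sketched with a POSIX pipe: a read on an empty pipe blocks, which mirrors exactly the KPN blocking-read rule that makes execution deterministic. The actor names below are illustrative, not taken from the framework.

```c
/* One KPN channel sketched as a POSIX pipe (the FIFO queue).
   Illustrative only; the generated code uses its own buffers. */
#include <unistd.h>

/* Source actor: put `count` tokens onto the channel, in order. */
void source_actor(int writeEnd, int count) {
    for (int token = 1; token <= count; token++)
        write(writeEnd, &token, sizeof token);
}

/* Sink actor: consume `count` tokens and return their sum. A read on
   an empty channel would block, so arrival order is always the FIFO
   order the source produced, regardless of scheduling. */
int sink_actor(int readEnd, int count) {
    int sum = 0, token;
    for (int i = 0; i < count; i++) {
        read(readEnd, &token, sizeof token);
        sum += token;
    }
    return sum;
}
```

Because every channel is a blocking FIFO, the sink's result depends only on the data, not on how the two actors are interleaved, which is the determinism property cited above.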

MPI Code Generation Workflow
Given a (KPN) model:
- Partitioning (mapping): analyze and annotate the model, assume weights on edges and nodes, generate cluster info (buffers and grouping).
- Code generation: generate MPI code (SIMD: Single Instruction, Multiple Data).
- Executable: execute the code and obtain execution statistics for tuning.

Demo
The codegen facility is in the Ptolemy II nightly release.

Role of Code Generation
[Figure: models in Ptolemy II pass through partitioning (mapping) and code generation to produce an executable]
Platform-based design [AS '02]

Implementation Space for a Distributed Environment
- Mapping: number of logical processing units vs. number of cores / processors
- Network costs: latency, throughput
- Memory constraint: communication buffer size
- Minimization metrics: costs, power consumption, ...

Partition
- Node and edge weights are used as abstractions, annotated on the model.
- From the model, the input file to Chaco is generated.
- After Chaco produces the output file, the partitions are automatically annotated back onto the model.

Multiprocessor Architectures
Shared memory vs. message passing: we want to generate code that runs on both kinds of architectures.
- Message passing: the Message Passing Interface (MPI) as the implementation
- Shared memory: a Pthread implementation available for comparison; UPC and OpenMP as future work

Pthread Implementation

void* Actor1(void* arg) { ... }
void* Actor2(void* arg) { ... }

void Model(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, Actor1, NULL);
    pthread_create(&t2, NULL, Actor2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}

MPI Code Generation
- Local buffers
- MPI send/recv with MPI tag matching
- KPN scheduling: determine when actors are safe to fire
- Actors can't block other actors on the same partition
- Termination based on a firing count
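The scheduling rule above can be sketched as a per-partition loop: an actor fires only when it has an input token, room in its output buffer, and has not reached its firing count, so a blocked actor is simply skipped and never stalls its neighbors. All names and the one-token-per-firing simplification are illustrative, not the framework's actual generated code.

```c
/* Sketch of the safe-to-fire check and a non-blocking scheduler pass.
   Illustrative names; a firing consumes and produces one token. */
#include <stdbool.h>

#define MAX_FIRINGS 5   /* termination is based on a firing count */

typedef struct {
    int inputTokens;    /* tokens waiting on the input channel */
    int outputFree;     /* free slots in the outgoing buffer   */
    int firings;        /* how many times this actor has fired */
} Actor;

bool safe_to_fire(const Actor *a) {
    return a->inputTokens > 0 && a->outputFree > 0
        && a->firings < MAX_FIRINGS;
}

/* One pass over the partition: try every actor once. Unready actors
   are skipped rather than waited on, so no actor blocks another. */
int schedule_pass(Actor *actors, int n) {
    int fired = 0;
    for (int i = 0; i < n; i++) {
        if (safe_to_fire(&actors[i])) {
            actors[i].inputTokens--;   /* consume one token */
            actors[i].outputFree--;    /* produce one token */
            actors[i].firings++;
            fired++;
        }
    }
    return fired;
}
```

A pass that fires nothing while tokens remain would indicate the partition is waiting on remote data (or has terminated), which is where the MPI receive checks come in.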

Sample MPI Program

int main(int argc, char** argv) {
    ...
    if (rank == 0) { Actor0(); Actor1(); }
    if (rank == 1) { Actor2(); }
    ...
}

void ActorN(void) {
    MPI_Irecv(input);                     /* post nonblocking receive */
    if (hasInput && !sendBufferFull) {    /* safe-to-fire check */
        output = localCalc();             /* local computation */
        MPI_Isend(1, output);             /* nonblocking send */
    }
}

Application

Execution Platform

Preliminary Results
[Table: runtime (ms) vs. number of cores for the MPI and Pthread versions at 500, 1000, 2500, and 5000 iterations]

Conclusion & Future Work
Conclusion:
- Framework for code generation to parallel platforms
- Generates scalable MPI code from Kahn Process Network models
Future work:
- Target more platforms (UPC, OpenMP, etc.)
- Additional profiling techniques
- Support more partitioning tools
- Improve performance of the generated code

Acknowledgments
Edward Lee, Horst Simon, Shoaib Kamil, John Kubiatowicz, the Ptolemy II developers, and NERSC.

Extra slides

Why MPI
- Message passing: good for distributed (shared-nothing) systems
- Very generic and easy to set up: setup (i.e., mpicc, etc.) is required only on one "master"; worker nodes only need SSH
- Flexible (explicit): nonblocking + blocking send/recv
- Con: requires explicit syntax modification (as opposed to OpenMP, Erlang, etc.); solution: automatic code generation

Actor-oriented design: a formalized model of concurrency
Object-oriented vs. actor-oriented:
- Actor-oriented design hides the state of each actor and makes it inaccessible from other actors.
- The emphasis of data flow over control flow leads to conceptually concurrent execution of actors.
- The interaction between actors happens in a highly disciplined way.
- Threads and mutexes become an implementation mechanism instead of part of the programming model.

Pthread implementation
- Each actor runs as a separate thread.
- Implicit buffers: each buffer has a read and a write count, a capacity, and a condition variable that sleeps and wakes up threads.
- A global notion of scheduling exists at the OS level.
- All actors being in blocking-read mode implies the model should terminate.
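One such implicit buffer can be sketched as a bounded FIFO with exactly the pieces named above: read/write counts, a capacity, and condition variables that put a thread to sleep on an empty read or a full write. This is a minimal sketch with illustrative names, not the framework's actual buffer code.

```c
/* Sketch of one implicit KPN buffer: blocking read (empty) and
   blocking write (full), using pthread condition variables. */
#include <pthread.h>

#define CAPACITY 8

typedef struct {
    int data[CAPACITY];
    int readCount, writeCount;              /* total tokens read / written */
    pthread_mutex_t lock;
    pthread_cond_t notEmpty, notFull;
} Buffer;

void buffer_put(Buffer *b, int token) {
    pthread_mutex_lock(&b->lock);
    while (b->writeCount - b->readCount == CAPACITY)
        pthread_cond_wait(&b->notFull, &b->lock);   /* sleep: buffer full */
    b->data[b->writeCount++ % CAPACITY] = token;
    pthread_cond_signal(&b->notEmpty);              /* wake a blocked reader */
    pthread_mutex_unlock(&b->lock);
}

int buffer_get(Buffer *b) {
    pthread_mutex_lock(&b->lock);
    while (b->writeCount == b->readCount)
        pthread_cond_wait(&b->notEmpty, &b->lock);  /* sleep: buffer empty */
    int token = b->data[b->readCount++ % CAPACITY];
    pthread_cond_signal(&b->notFull);               /* wake a blocked writer */
    pthread_mutex_unlock(&b->lock);
    return token;
}
```

The "all actors in blocking-read mode implies termination" rule corresponds to every thread being asleep in buffer_get with no writer left to signal notEmpty.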

MPI Implementation
Mapping of actors to cores is needed; this is a classic graph partitioning problem:
- Nodes: actors
- Edges: messages
- Node weights: computation on each actor
- Edge weights: amount of messages communicated
- Partitions: processors
Chaco was chosen as the graph partitioner.
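Emitting the partitioner's input from the annotated model can be sketched as below: one vertex per actor carrying its computation weight, one undirected edge per channel carrying its communication weight. The header flags and exact layout are Chaco-specific and should be checked against the Chaco manual; all names and the small fixed edge array are illustrative simplifications.

```c
/* Sketch: serialize the annotated actor graph into a Chaco/Metis-style
   text graph. Header: vertex count, edge count, weight flags (verify
   the flag digits against the Chaco manual). Then one line per vertex:
   vertex weight followed by (neighbor, edge weight) pairs; vertices
   are numbered from 1. Returns the number of characters written. */
#include <stdio.h>

typedef struct { int neighbor; int weight; } Edge;       /* 1-based neighbor */
typedef struct { int weight; Edge edges[4]; int degree; } ActorNode;

int write_graph(char *out, size_t cap,
                const ActorNode *nodes, int n, int nedges) {
    int len = snprintf(out, cap, "%d %d 11\n", n, nedges);
    for (int v = 0; v < n; v++) {
        len += snprintf(out + len, cap - len, "%d", nodes[v].weight);
        for (int e = 0; e < nodes[v].degree; e++)
            len += snprintf(out + len, cap - len, " %d %d",
                            nodes[v].edges[e].neighbor,
                            nodes[v].edges[e].weight);
        len += snprintf(out + len, cap - len, "\n");
    }
    return len;
}
```

Chaco's output (one partition number per vertex) would then be read back the same way and annotated onto the corresponding actors, as the Partition slide describes.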

Partition Profiling
Challenge: provide the user with enough information so that node and edge weights can be annotated and modified to achieve load balancing.
- Solution 1: static analysis
- Solution 2: simulation
- Solution 3: dynamic load balancing
- Solution 4: profile the current run and feed the information back to the user