1 A Case for Teaching Parallel Programming to Freshmen
Arvind
Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology
Workshop on Directions in Multicore Programming Education, Washington D.C., March 8, 2009

2 One view of parallel programming
- Multicores are coming (have come)
- Performance gains are no longer automatic and transparent
- Most programmers have never written a parallel program
- Different models for exploiting parallelism, depending upon the application: data parallel, threads, TM, Map-Reduce, ...
- How to migrate my software? How to get performance? How to educate my programmers?
It is all about performance.

3 Another view of parallel programming
- Every gadget is concurrent and reactive
- Many weakly interrelated tasks happen concurrently: cell phones play music, receive calls, and browse the web
- Hitherto independent programs are required to interact: what should the music player do when you are browsing the web?
- Ambiguous specs: it is not clear a priori what a user wants
- The infrastructure is a parallel database for processing queries and commands
  - scalable infrastructure to deal with ever-increasing queries
  - the database is more than just records; many streams of data are constantly being fed in
  - each interaction requires many queries and transactions
- Parallelism is obvious, but interactions between modules can be complex even when infrequent
Even though the substrate is multicore, performance is a secondary issue.

4 My take
- Modeling, simulating, and programming parallel and concurrent systems is a more fundamental problem than how to make use of multicores efficiently
- Freshman teaching should focus on composing parallel programs; sequential programming should be taught (perhaps) as a way of writing the modules to be composed
- Within a few years multicores will be viewed as a transparent way of simplifying and speeding up parallel programs (not very different from the way we used to view computers with faster clocks)

5 The remainder of the talk
- Parallel programming can be simpler than sequential programming for inherently parallel computations
- Some untested ideas on what we should teach freshmen

6 Parallel programming can be easier than sequential programming

7 H.264 Video Decoder
May be implemented in hardware or software depending upon the requirements.
Pipeline: Compressed Bits -> NAL unwrap -> Parse + CAVLC -> Inverse Quant / Transformation -> Inter Prediction / Intra Prediction -> Deblock Filter -> Frames, with Ref Frames feeding back into Inter Prediction.
Different requirements for different environments:
- QVGA 320x240p (30 fps)
- DVD 720x480p
- HD DVD 1280x720p (60-75 fps)

8 Sequential code from ffmpeg (about 20K lines of C out of 200K)

extern int createdOutput, stallFromInterPred; /* set by the try_* stage routines */

void h264decode() {
  int stage = S_NAL;
  while (!eof()) {
    createdOutput = 0;
    stallFromInterPred = 0;
    switch (stage) {
      case S_NAL:     try_NAL();
                      stage = (createdOutput) ? S_Parse : S_NAL; break;
      case S_Parse:   try_Parse();
                      stage = (createdOutput) ? S_IQIT : S_NAL; break;
      case S_IQIT:    try_IQIT();
                      stage = (createdOutput) ? S_Parse : S_Inter; break;
      case S_Inter:   try_Inter();
                      stage = (createdOutput) ? S_IQIT : S_Intra;
                      stage = (stallFromInterPred) ? S_Deblock : S_Intra; break;
      case S_Intra:   try_Intra();
                      stage = (createdOutput) ? S_Inter : S_Deblock; break;
      case S_Deblock: try_deblock();
                      stage = S_Intra; break;
    }
  }
}

The programmer is forced to choose a sequential order of evaluation and write the code accordingly (non-trivial).

9 Price of obscuring the parallelism
- Program structure is difficult to understand
- Packets are kept and modified in a global heap (nothing to do with the logical structure)
- Unscrambling the over-specified control structure for parallelization is beyond the capability of current compiler techniques
- Thread-level data parallelism?

10 Pthreads
A (p)thread for each block, but there is no control over the mapping.

int main() {
  pthread_create(NAL);
  pthread_create(Parse);
  pthread_create(IQIT);
  pthread_create(Interpred);
  pthread_create(Intrapred);
  pthread_create(Deblock);
}

The threads (NAL, Parse, IQ/IT, Inter-predict, Intra-predict, Deblock, some of them sleeping at any moment) are mapped onto the processors by the OS.
This is an implementation model.
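The slide's calls are shorthand; the real pthread_create takes four arguments. A minimal runnable sketch, with hypothetical stage functions (nal_stage and friends, my names) standing in for the decoder blocks:

#include <pthread.h>

/* Hypothetical stage bodies; each would loop, consuming the output of the
   previous stage and feeding the next. */
void *nal_stage(void *arg)     { (void)arg; /* ... */ return NULL; }
void *parse_stage(void *arg)   { (void)arg; /* ... */ return NULL; }
void *iqit_stage(void *arg)    { (void)arg; /* ... */ return NULL; }
void *inter_stage(void *arg)   { (void)arg; /* ... */ return NULL; }
void *intra_stage(void *arg)   { (void)arg; /* ... */ return NULL; }
void *deblock_stage(void *arg) { (void)arg; /* ... */ return NULL; }

int main(void) {
    void *(*stages[6])(void *) = { nal_stage, parse_stage, iqit_stage,
                                   inter_stage, intra_stage, deblock_stage };
    pthread_t tid[6];
    for (int i = 0; i < 6; i++)
        pthread_create(&tid[i], NULL, stages[i], NULL); /* OS picks the mapping */
    for (int i = 0; i < 6; i++)
        pthread_join(tid[i], NULL);
    return 0;
}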

11 StreamIt (Amarasinghe & Thies): a more natural expression using filters

bit -> frame pipeline H264Decode {
  add NAL();
  add Parse();
  add IQIT();
  add feedbackloop {
    join roundrobin;
    body pipeline {
      add InterPredict();
      add IntraPredict();
      add Deblock();
    }
    split roundrobin;
  }
}

Given the required rates, the StreamIt compiler can do a great job of generating efficient code. Feedback is problematic!

12 Functional languages (pH)
Natural expression of parallelism, but too general.

do_H264 :: Stream Chunk -> Stream Frame
do_H264 inputStream =
  let fMem :: IStructFrameMem MacroBlock
      fMem = makeIStructureMemory
      nalStream     = nal inputStream
      parseStream   = parse nalStream
      iqitStream    = iqit parseStream
      interStream   = inter iqitStream fMem
      intraStream   = intra interStream
      deblockStream = deblock intraStream fMem
  in deblockStream

The language does not provide any hints, to either the programmer or the compiler, about the level of granularity at which the parallelism should be considered. FLs provide a solid base for building domain-specific parallel languages.

13 An idea we are testing: hardware-design-inspired parallel programming

14 Hardware-design inspiration
- Hardware is all about parallelism, but there is no virtualization of resources
  - if one asks for two adders, one gets two adders; if one needs to do more than two additions at a time, the adders are time-multiplexed explicitly
- Two-level compilation model
  - one can do a design with n adders, but at some stage of compilation n must be specified (instantiated) to generate hardware; each instantiation of n results in a different design
  - analogy: in software one may want to instantiate different code for a different problem size or a different machine configuration (see the sketch below)
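A minimal C sketch of that software analogy (mine, not from the talk): the "design" is written in terms of a parameter N_ADDERS, a concrete value is fixed at compile time (e.g. cc -DN_ADDERS=4), and work beyond the instantiated adders is time-multiplexed explicitly.

#include <stdio.h>

/* "Design-time" parameter; each compile-time choice yields a different instance. */
#ifndef N_ADDERS
#define N_ADDERS 2
#endif

static int adder(int a, int b) { return a + b; }  /* one "physical" adder */

/* With only N_ADDERS adders, a larger batch of additions is done in
   "cycles" of N_ADDERS at a time: explicit time multiplexing. */
void add_batch(const int *a, const int *b, int *out, int n) {
    for (int base = 0; base < n; base += N_ADDERS)
        for (int i = base; i < base + N_ADDERS && i < n; i++)
            out[i] = adder(a[i], b[i]);
}

int main(void) {
    int a[] = {1, 2, 3, 4, 5}, b[] = {10, 20, 30, 40, 50}, out[5];
    add_batch(a, b, out, 5);
    for (int i = 0; i < 5; i++) printf("%d ", out[i]);
    printf("\n");
    return 0;
}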

15 H.264 in Bluespec

module mkH264 (IH264);
  // Instantiate the modules
  Nal nal <- mkNalUnwrap();
  ...
  DeblockFilter deblock <- mkDeblockFilter();
  FrameMemory frameB <- mkFrameMemoryBuffer();

  // Connect the modules
  mkConnection(nal.out, parse.in);
  mkConnection(parse.out, iqit.in);
  ...
  mkConnection(deblock.mem_client, frameB.mem_writer);
  mkConnection(inter_pred.mem_client, frameB.mem_reader);

  interface in = nal.in;        // Input goes straight to NAL
  interface out = deblock.out;  // Output from deblock
endmodule

- Modularity and dataflow are obvious
- No sharing of resources
- No time-multiplexing issue if each module is mapped on a separate core

16 H.264 Decoder in Bluespec (Elliott Fleming, Chun Chieh Lin)
Pipeline: Compressed Bits -> NAL unwrap -> Parse + CAVLC -> Inverse Quant / Transformation -> Inter Prediction / Intra Prediction -> Deblock Filter -> Frames, with Ref Frames.
- Behaviors of modules are composable
- Each module can be refined separately
- Any module can be compiled in SW
- 8K lines of Bluespec; decodes ...; area 4.4 mm sq (180 nm)
Are there ideas worth carrying over to parallel SW?

17 What should we teach freshmen?

18 General guidelines
- Make it easy to express the parallelism present in the application
  - no unnecessary sequentialization
  - no forced grouping of logically separate memories
- Separate and deemphasize the issue of restructuring code for better sequential performance

19 Topics
- Finite state machines
  - choose problems that have a natural solution as an FSM
  - show composition and interaction of parallel FSMs
- Dataflow networks with unbounded and bounded edges
  - show programming of nodes in a sequential language with blocking sends and receives (see the sketch after this list)
- Types, modularity, data structures, etc. are important topics but orthogonal to parallelism; these topics should be taught all the time
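For the dataflow topic, here is a minimal sketch (mine, not from the talk) of a bounded edge with blocking send and receive, built on pthreads; chan_send and chan_recv are hypothetical names, not an established library. The node itself is ordinary sequential code.

#include <pthread.h>
#include <stdio.h>

#define CAP 4  /* bounded edge: channel capacity */

typedef struct {
    int buf[CAP], head, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} chan_t;

void chan_init(chan_t *c) {
    c->head = c->count = 0;
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->not_full, NULL);
    pthread_cond_init(&c->not_empty, NULL);
}

void chan_send(chan_t *c, int v) {          /* blocks while the edge is full */
    pthread_mutex_lock(&c->lock);
    while (c->count == CAP) pthread_cond_wait(&c->not_full, &c->lock);
    c->buf[(c->head + c->count++) % CAP] = v;
    pthread_cond_signal(&c->not_empty);
    pthread_mutex_unlock(&c->lock);
}

int chan_recv(chan_t *c) {                  /* blocks while the edge is empty */
    pthread_mutex_lock(&c->lock);
    while (c->count == 0) pthread_cond_wait(&c->not_empty, &c->lock);
    int v = c->buf[c->head];
    c->head = (c->head + 1) % CAP; c->count--;
    pthread_cond_signal(&c->not_full);
    pthread_mutex_unlock(&c->lock);
    return v;
}

chan_t in_edge, out_edge;

/* A dataflow node: plain sequential code around blocking receive/send. */
void *doubler_node(void *arg) {
    (void)arg;
    for (;;) chan_send(&out_edge, 2 * chan_recv(&in_edge));
    return NULL;
}

int main(void) {
    chan_init(&in_edge); chan_init(&out_edge);
    pthread_t t; pthread_create(&t, NULL, doubler_node, NULL);
    for (int i = 1; i <= 5; i++) chan_send(&in_edge, i);
    for (int i = 0; i < 5; i++) printf("%d\n", chan_recv(&out_edge));
    return 0;
}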

20 Some challenges
- No appropriate language or tools
- Need to think up new illustrative problems from the ground up: Fibonacci, "Hello world", and matrix multiply won't do

21 Takeaway
- Parallel programming is not a special topic in programming; parallel programming is programming
- Sequential and parallel programming can be introduced together
- Parallel thinking is as natural as sequential thinking
Thanks

22 Zero-cost parameterization
Example: OFDM-based protocols (MAC-standard-specific blocks, with potential reuse).
TX chain: Scrambler -> FEC Encoder -> Interleaver -> Mapper -> Pilot & Guard Insertion -> IFFT -> CP Insertion -> S/P -> D/A, driven by a TX Controller.
RX chain: A/D -> Synchronizer -> FFT -> Channel Estimator -> De-Mapper -> De-Interleaver -> FEC Decoder -> De-Scrambler, driven by an RX Controller.
- Different algorithms: FEC may be Convolutional, Reed-Solomon, or Turbo
- Different throughput requirements: WiFi 0.25 MHz, WiMAX 0.03 MHz, WUSB 128pt 8 MHz
- Reusable algorithms with different parameter settings: scrambler polynomials WiFi x^7+x^4+1, WiMAX x^15+x^14+1, WUSB x^15+x^14+1
85% reusable code between WiFi and WiMAX; from WiFi to WiMAX in 4 weeks ((Alfred) Man Chuek Ng, ...)
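To make "reusable algorithm with different parameter settings" concrete, here is a small sketch (mine, not from the talk) of an additive Fibonacci-LFSR scrambler in C whose feedback taps are a parameter; scramble_bit is a hypothetical name, and the x^15+x^14+1 completion of the slide's truncated WiMAX/WUSB polynomials is the standard 802.16-style scrambler, taken here as an assumption.

#include <stdio.h>
#include <stdint.h>

/* Feedback taps as a bitmask: bit k-1 stands for the x^k term of the
   polynomial.  The same routine then serves WiFi's x^7+x^4+1 and the
   x^15+x^14+1 scrambler with only a parameter change. */
typedef struct { uint32_t state, taps; } scrambler_t;

static int parity(uint32_t v) { int p = 0; for (; v; v >>= 1) p ^= v & 1; return p; }

static int scramble_bit(scrambler_t *s, int in_bit) {
    int fb = parity(s->state & s->taps);  /* XOR of the tapped delay stages */
    s->state = (s->state << 1) | fb;      /* shift the feedback back in */
    return in_bit ^ fb;                   /* additive scrambling */
}

int main(void) {
    scrambler_t wifi  = { 0x7F,   (1u << 6)  | (1u << 3)  }; /* x^7 + x^4 + 1   */
    scrambler_t wimax = { 0x7FFF, (1u << 14) | (1u << 13) }; /* x^15 + x^14 + 1 */
    const int bits[8] = {1, 0, 1, 1, 0, 0, 1, 0};
    for (int i = 0; i < 8; i++) printf("%d", scramble_bit(&wifi,  bits[i]));
    printf("\n");
    for (int i = 0; i < 8; i++) printf("%d", scramble_bit(&wimax, bits[i]));
    printf("\n");
    return 0;
}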