Slide 1: A Case for Teaching Parallel Programming to Freshmen
Arvind
Computer Science & Artificial Intelligence Lab., Massachusetts Institute of Technology
Workshop on Directions in Multicore Programming Education, Washington D.C., March 8, 2009
Slide 2: One view of parallel programming
- Multicores are coming (have come)
  - Performance gains are no longer automatic and transparent
  - Most programmers have never written a parallel program
- Different models for exploiting parallelism, depending upon the application
  - Data parallel, threads, TM (transactional memory), Map-Reduce, …
- How to migrate my software
- How to get performance
- How to educate my programmers
It is all about performance
Slide 3: Another view of parallel programming
- Every gadget is concurrent and reactive
  - Many weakly interrelated tasks happen concurrently
    - Cell phones: playing music, receiving calls, web browsing
  - Hitherto independent programs are required to interact
    - What should the music player do when you are browsing the web?
  - Ambiguous specs: it is not clear a priori what a user wants
- Infrastructure is a parallel database for processing queries and commands
  - Scalable infrastructure to deal with ever-increasing queries
  - The database is more than just records: many streams of data are constantly being fed in
  - Each interaction requires many queries and transactions
- Parallelism is obvious, but interactions between modules can be complex even when infrequent
Even though the substrate is multicore, performance is a secondary issue
Slide 4: My take
- Modeling, simulating, and programming parallel and concurrent systems is a more fundamental problem than how to make use of multicores efficiently
- Freshman teaching should focus on composing parallel programs; sequential programming should be taught (perhaps) as a way of writing the modules to be composed
- Within a few years multicores will be viewed as a transparent way of simplifying and speeding up parallel programs (not very different from the way we used to view computers with faster clocks)
Slide 5: The remainder of the talk
- Parallel programming can be simpler than sequential programming for inherently parallel computations
- Some untested ideas on what we should teach freshmen
Slide 6: Parallel programming can be easier than sequential programming
Slide 7: H.264 Video Decoder
May be implemented in hardware or software depending upon...
[Block diagram: Compressed Bits in, Frames out; blocks: NAL unwrap, Parse + CAVLC, Inverse Quant Transformation, Intra Prediction, Inter Prediction, Deblock Filter, Ref Frames]
Different requirements for different environments
- QVGA 320x240p (30 fps)
- DVD 720x480p
- HD DVD 1280x720p (60-75 fps)
Slide 8: Sequential code from ffmpeg
    void h264decode() {
        int stage = S_NAL;
        while (!eof()) {
            createdOutput = 0;
            stallFromInterPred = 0;
            switch (stage) {
            case S_NAL:
                try_NAL();
                stage = createdOutput ? S_Parse : S_NAL;
                break;
            case S_Parse:
                try_Parse();
                stage = createdOutput ? S_IQIT : S_NAL;
                break;
            case S_IQIT:
                try_IQIT();
                stage = createdOutput ? S_Parse : S_Inter;
                break;
            case S_Inter:
                try_Inter();
                stage = createdOutput ? S_IQIT : S_Intra;
                /* As on the slide: this second assignment overrides the first. */
                stage = stallFromInterPred ? S_Deblock : S_Intra;
                break;
            case S_Intra:
                try_Intra();
                stage = createdOutput ? S_Inter : S_Deblock;
                break;
            case S_Deblock:
                try_deblock();
                stage = S_Intra;
                break;
            }
        }
    }
[Diagram blocks: NAL, Parse, IQ/IT, Inter-Predict, Intra-Predict, Deblocking]
20K lines of C out of 200K
The programmer is forced to choose a sequential order of evaluation and write the code accordingly (non-trivial)
Slide 9: Price of obscuring the parallelism
- Program structure is difficult to understand
- Packets are kept and modified in a global heap (nothing to do with the logical structure)
- Unscrambling the over-specified control structure for parallelization is beyond the capability of current compiler techniques
Thread-level data parallelism?
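Slide 10: Pthreads
A (p)thread for each block, but there is no control over mapping
    #include <pthread.h>

    /* Each stage is assumed to be a function of the form void *stage(void *arg). */
    int main(void) {
        pthread_t t[6];
        pthread_create(&t[0], NULL, NAL, NULL);
        pthread_create(&t[1], NULL, Parse, NULL);
        pthread_create(&t[2], NULL, IQIT, NULL);
        pthread_create(&t[3], NULL, Interpred, NULL);
        pthread_create(&t[4], NULL, Intrapred, NULL);
        pthread_create(&t[5], NULL, Deblock, NULL);
        for (int i = 0; i < 6; i++)
            pthread_join(t[i], NULL);  /* wait for every stage to finish */
        return 0;
    }
[Diagram: Processors running the NAL, Parse, IQ/IT, Interpredict, Intrapr and DeBlk threads, with the remaining threads sleeping]
This is an implementation model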
Slide 11: StreamIt (Amarasinghe & Thies)
A more natural expression using filters
    bit -> frame pipeline H264Decode {
        add NAL();
        add Parse();
        add IQIT();
        add feedbackloop {
            join roundrobin;
            body pipeline {
                add InterPredict();
                add IntraPredict();
                add Deblock();
            }
            split roundrobin;
        }
    }
[Diagram blocks: NAL, Parse, IQ/IT, Inter-Predict, Intra-Predict, Deblocking]
Given the required rates, the StreamIt compiler can do a great job of generating efficient code
Feedback is problematic!
Slide 12: Functional languages (pH)
Natural expression of parallelism, but too general
    do_H264 :: Stream Chunk -> Stream Frame
    do_H264 inputStream =
      let fMem :: IStructFrameMem MacroBlock
          fMem          = makeIStructureMemory
          nalStream     = nal inputStream
          parseStream   = parse nalStream
          iqitStream    = iqit parseStream
          interStream   = inter iqitStream fMem
          intraStream   = intra interStream
          deblockStream = deblock intraStream fMem
      in deblockStream
- The language does not provide any hints about the level of granularity at which parallelism should be considered, by either the programmer or the compiler
- Functional languages provide a solid base for building domain-specific parallel languages
Slide 13: An idea we are testing: hardware-design inspired parallel programming
Slide 14: Hardware-design inspiration
- Hardware is all about parallelism, but there is no virtualization of resources
  - If one asks for two adders then one gets two adders; if one needs to do more than two additions at a time, the adders are time-multiplexed explicitly
- Two-level compilation model
  - One can do a design with n adders, but at some stage of compilation n must be specified (instantiated) to generate hardware. Each instantiation of n results in a different design
  - Analogy: in software one may want to instantiate different code for a different problem size or a different machine configuration
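The two points above (explicit time multiplexing and two-level instantiation) can be sketched in software. This is only an illustration under stated assumptions, not code from the talk: `schedule_additions` and `instantiate` are hypothetical names, and a "cycle" is modeled simply as a batch of at most n additions.

```python
def schedule_additions(pairs, n_adders):
    """Time-multiplex a list of additions onto n_adders physical adders.

    Returns one list of sums per cycle; each cycle uses at most n_adders adders.
    """
    if n_adders < 1:
        raise ValueError("need at least one adder")
    cycles = []
    for start in range(0, len(pairs), n_adders):
        batch = pairs[start:start + n_adders]
        # Within one cycle, every adder works on its own pair of operands.
        cycles.append([a + b for a, b in batch])
    return cycles


def instantiate(n_adders):
    """Second compilation level (hypothetical): fix n and get a concrete design."""
    def design(pairs):
        cycles = schedule_additions(pairs, n_adders)
        sums = [s for cycle in cycles for s in cycle]
        return sums, len(cycles)  # results plus the number of cycles used
    return design


pairs = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
two_adder_design = instantiate(2)
five_adder_design = instantiate(5)
print(two_adder_design(pairs))
print(five_adder_design(pairs))
```

Both designs compute the same five sums; the two-adder instance needs three cycles and the five-adder instance needs one, which is the sense in which each choice of n yields a different design.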
Slide 15: H.264 in Bluespec
    module mkH264 (IH264);
       // Instantiate the modules
       Nal nal <- mkNalUnwrap();
       ...
       DeblockFilter deblock <- mkDeblockFilter();
       FrameMemory frameB <- mkFrameMemoryBuffer();

       // Connect the modules
       mkConnection(nal.out, parse.in);
       mkConnection(parse.out, iqit.in);
       ...
       mkConnection(deblock.mem_client, frameB.mem_writer);
       mkConnection(inter_pred.mem_client, frameB.mem_reader);

       interface in = nal.in;        // Input goes straight to NAL
       interface out = deblock.out;  // Output from deblock
    endmodule
- Modularity and dataflow are obvious
- No sharing of resources
- No time-multiplexing issue if each module is mapped onto a separate core
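The instantiate-then-connect style of the Bluespec module might carry over to software roughly as follows. This is a sketch under assumptions, not the talk's method: each module is a thread, each connection is a bounded `queue.Queue`, and `make_module`, `run_pipeline` and the arithmetic stage functions are hypothetical stand-ins for the decoder stages.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def make_module(func, inbox, outbox):
    """Wrap a sequential function as a module with one input and one output edge."""
    def run():
        while True:
            item = inbox.get()        # blocking receive
            if item is DONE:
                outbox.put(DONE)      # forward end-of-stream downstream
                return
            outbox.put(func(item))    # blocking send when the edge is full
    return threading.Thread(target=run, daemon=True)


def run_pipeline(funcs, inputs, capacity=2):
    """Instantiate one module per function, connect them, and collect the output."""
    # One bounded edge before, between, and after the modules.
    edges = [queue.Queue(maxsize=capacity) for _ in range(len(funcs) + 1)]
    modules = [make_module(f, edges[i], edges[i + 1]) for i, f in enumerate(funcs)]

    def feed():
        for x in inputs:
            edges[0].put(x)
        edges[0].put(DONE)

    feeder = threading.Thread(target=feed, daemon=True)
    for t in modules + [feeder]:
        t.start()

    results = []
    while True:
        item = edges[-1].get()
        if item is DONE:
            break
        results.append(item)
    for t in modules + [feeder]:
        t.join()
    return results


# Hypothetical three-stage pipeline standing in for NAL -> Parse -> IQ/IT.
print(run_pipeline([lambda x: x + 1, lambda x: x * 2, str], range(5)))
```

Each module owns its state and touches only its own edges, so no resources are shared, and the order of results is preserved because every edge is a FIFO with a single producer and a single consumer.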
Slide 16: H.264 Decoder in Bluespec
Elliott Fleming, Chun Chieh Lin
[Block diagram: Compressed Bits in, Frames out; blocks: NAL unwrap, Parse + CAVLC, Inverse Quant Transformation, Intra Prediction, Inter Prediction, Deblock Filter, Ref Frames]
- Behaviors of modules are composable
- Each module can be refined separately
- Any module can be compiled in SW
- 8K lines of Bluespec
- Decodes 1080p @ 70 fps
- Area: 4.4 mm² (180 nm)
Are there ideas worth carrying over to parallel SW?
Slide 17: What should we teach freshmen?
Slide 18: General guidelines
- Make it easy to express the parallelism present in the application
  - No unnecessary sequentialization
  - No forced grouping of logically separate memories
- Separate and deemphasize the issue of restructuring code for better sequential performance
Slide 19: Topics
- Finite state machines
  - Choose problems that have a natural solution as an FSM
  - Show composition and interaction of parallel FSMs
- Dataflow networks with unbounded and bounded edges
  - Show programming of nodes in a sequential language with blocking sends and receives
- Types, modularity, data structures, etc. are important topics but orthogonal to parallelism; these topics should be taught all the time
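As one possible classroom illustration of the first topic (composition and interaction of parallel FSMs), the sketch below composes two small machines synchronously: on every event both compute their next state from the current states. The phone and music player, their state names, and the events are all hypothetical, loosely borrowed from the cell-phone scenario earlier in the talk.

```python
def phone_next(state, event):
    """Phone FSM: IDLE -> RINGING -> IN_CALL -> IDLE."""
    table = {
        ("IDLE", "incoming"): "RINGING",
        ("RINGING", "answer"): "IN_CALL",
        ("RINGING", "hangup"): "IDLE",
        ("IN_CALL", "hangup"): "IDLE",
    }
    return table.get((state, event), state)


def player_next(state, event, phone_state):
    """Music player FSM; it also observes the phone's current state."""
    if phone_state != "IDLE":
        # A ringing or active call pauses playback.
        return "PAUSED_BY_CALL" if state in ("PLAYING", "PAUSED_BY_CALL") else state
    if state == "PAUSED_BY_CALL":
        return "PLAYING"  # resume once the phone is idle again
    table = {("STOPPED", "play"): "PLAYING", ("PLAYING", "stop"): "STOPPED"}
    return table.get((state, event), state)


def step(states, event):
    """Synchronous composition: both FSMs read the old states, then update together."""
    phone, player = states
    return (phone_next(phone, event), player_next(player, event, phone))


def run(events, start=("IDLE", "STOPPED")):
    trace = [start]
    for event in events:
        trace.append(step(trace[-1], event))
    return trace


for states in run(["play", "incoming", "answer", "hangup", "tick"]):
    print(states)
```

Because each machine sees the other's state from before the step, the player reacts one event after the phone changes state. Making such interaction timing visible is the kind of question a composed-FSM exercise can raise.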
Slide 20: Some challenges
- No appropriate language or tools
- Need to think up new illustrative problems from the ground up
  - Fibonacci, “Hello world”, matrix multiply won’t do
Slide 21: Takeaway
- Parallel programming is not a special topic in programming
  - Parallel programming is programming
- Sequential and parallel programming can be introduced together
  - Parallel thinking is as natural as sequential thinking
Thanks
Slide 22: Zero-cost parameterization
Example: OFDM-based protocols
[Block diagram of an OFDM transceiver; blocks: MAC, TX Controller, RX Controller, Scrambler, FEC Encoder, Interleaver, Mapper, Pilot & Guard Insertion, IFFT, CP Insertion, D/A, A/D, S/P, Synchronizer, FFT, Channel Estimator, De-Mapper, De-Interleaver, FEC Decoder, De-Scrambler; legend: standard specific, potential reuse]
- Different algorithms: Convolutional, Reed-Solomon, Turbo
- Different throughput requirements: WiFi: 64pt @ 0.25MHz; WiMAX: 256pt @ 0.03MHz; WUSB: 128pt @ 8MHz
- Reusable algorithm with different parameter settings: WiFi: x^7 + x^4 + 1; WiMAX: x^15 + x^14 + 1; WUSB: x^15 + x^14 + 1
- 85% reusable code between WiFi and WiMAX
- From WiFi to WiMAX in 4 weeks
(Alfred) Man Chuek Ng, …
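The "reusable algorithm with different parameter settings" idea can be illustrated with the scrambler polynomials listed on this slide. The sketch below is an assumption-laden illustration, not the Bluespec design from the talk: `lfsr_scramble` is a hypothetical name, the seeds are arbitrary, and no claim is made of bit-exact conformance with any of the standards.

```python
def lfsr_scramble(bits, taps, seed):
    """Additive scrambler: XOR the data with an LFSR keystream.

    taps: exponents of the polynomial without the constant term,
          e.g. (7, 4) for x^7 + x^4 + 1.
    seed: initial register contents; seed[i - 1] is the bit delayed by i steps.
    """
    degree = max(taps)
    if len(seed) != degree:
        raise ValueError("seed length must equal the polynomial degree")
    state = list(seed)
    out = []
    for b in bits:
        feedback = 0
        for t in taps:
            feedback ^= state[t - 1]
        out.append(b ^ feedback)
        state = [feedback] + state[:-1]  # shift the feedback bit into the register
    return out


# One algorithm, two parameter settings (polynomials taken from the slide).
WIFI_TAPS = (7, 4)     # x^7 + x^4 + 1
WIMAX_TAPS = (15, 14)  # x^15 + x^14 + 1

data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
wifi_seed = [1] * 7                      # arbitrary nonzero seed
wimax_seed = [1, 0, 0, 1, 0, 1, 0, 1] + [0] * 7  # arbitrary nonzero seed

scrambled = lfsr_scramble(data, WIMAX_TAPS, wimax_seed)
restored = lfsr_scramble(scrambled, WIMAX_TAPS, wimax_seed)
print(scrambled)
print(restored == data)
```

Because the keystream depends only on the register state, applying the same function with the same taps and seed a second time descrambles the data, so the scrambler and descrambler blocks can share one parameterized implementation.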