Published by Della Ramsey. Modified over 9 years ago.
1 Compiling with multicore
Jeehyung Lee
15-745, Spring 2009
2 Papers
Automatic Thread Extraction with Decoupled Software Pipelining
  Fully automatic
  Fine-grained pipelining
A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs
  Semi-automatic
  Coarse-grained pipelining
3 First paper
Automatic Thread Extraction with Decoupled Software Pipelining
Guilherme Ottoni, Ram Rangan, Adam Stoler, and David August
Princeton University
4 What is the paper about?
Despite the increasing use of multiprocessors, many single-threaded applications do not benefit from them
Let the compiler automatically extract threads and exploit latent pipeline parallelism
Extract non-speculative, truly decoupled threads through Decoupled Software Pipelining (DSWP)
5 Why decoupled pipelining?
Example: linked list traversal
6 Why decoupled pipelining?
DOACROSS
Critical path: iterations * (LD latency + communication latency)
7 Why decoupled pipelining?
DSWP
Critical path: iterations * LD latency
One-way pipelining keeps the communication latency off the critical path
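The two critical-path formulas from slides 6 and 7 can be compared with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function names and the cycle counts in the example are assumptions chosen for illustration, and the DSWP formula follows the slide in ignoring the one-time cost of filling the pipeline.

```python
def doacross_critical_path(iterations, ld_latency, comm_latency):
    """DOACROSS: each iteration's pointer load waits for the previous
    pointer to arrive from another core, so both latencies add up."""
    return iterations * (ld_latency + comm_latency)


def dswp_critical_path(iterations, ld_latency):
    """DSWP: the pointer-chasing loads all stay on one core, so only the
    load latency repeats (pipeline fill time is ignored, as on the slide)."""
    return iterations * ld_latency


# Hypothetical numbers: 1000 iterations, 3-cycle load, 10-cycle inter-core hop.
doacross_cycles = doacross_critical_path(1000, 3, 10)
dswp_cycles = dswp_critical_path(1000, 3)
print(doacross_cycles, dswp_cycles)
```

With these assumed numbers the DOACROSS path is dominated by communication, which is the point the slides make: the longer the inter-core latency, the larger the gap.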
8 DSWP
The flow of data (dependences) among cores is acyclic
Inter-core queues decouple the threads
Efficiency + high tolerance for latency
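The decoupling idea can be sketched for the linked-list example with two threads and a bounded queue. This is an illustrative sketch, not the paper's generated code: `Node`, `traverse`, `update`, `run_pipeline`, and the queue size are all hypothetical names and values, and Python's `queue.Queue` stands in for the hardware inter-core queue. Python threads will not show a real speedup here; only the one-way structure is the point.

```python
import queue
import threading


class Node:
    def __init__(self, val, next=None):
        self.val = val
        self.next = next


_DONE = object()  # sentinel that tells the consumer the list has ended


def traverse(head, q):
    """Stage 1: pointer chasing only. It never waits on stage 2's results."""
    node = head
    while node is not None:
        q.put(node)
        node = node.next
    q.put(_DONE)


def update(q):
    """Stage 2: the per-node work, fed through the queue."""
    while True:
        node = q.get()
        if node is _DONE:
            break
        node.val += 1


def run_pipeline(head, queue_size=32):
    # Data flows one way (stage 1 -> stage 2), so the dependence is acyclic.
    q = queue.Queue(maxsize=queue_size)
    stages = [
        threading.Thread(target=traverse, args=(head, q)),
        threading.Thread(target=update, args=(q,)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()


def build_list(values):
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def to_values(head):
    out = []
    while head is not None:
        out.append(head.val)
        head = head.next
    return out
```

Because the queue buffers several nodes, a stall in either stage is absorbed until the queue fills or drains, which is where the latency tolerance comes from.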
9 DSWP Algorithm
Build dependence graph
Find strongly connected components (SCCs)
Create DAG of SCCs
Partition DAG
Split code into partitions
Add flows to partitions
10 Build dependence graph
Include every traditional dependence (data, control, and memory), plus extensions
11 Find SCCs
SCC: a set of instructions that form a dependence cycle in the loop
Instructions in the same SCC cannot be parallelized
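One standard way to find SCCs is Tarjan's algorithm; the slides do not say which algorithm the paper's compiler uses, so treat this as a generic sketch. The dependence graph below is hypothetical, loosely modeled on the linked-list loop: the instruction names and edges are assumptions for illustration.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. `graph` maps each node to its successors."""
    index = {}       # discovery order of each node
    lowlink = {}     # smallest index reachable from the node's subtree
    stack = []
    on_stack = set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of an SCC: pop the whole component.
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            sccs.append(frozenset(component))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


# Hypothetical dependence graph (edge u -> v means v depends on u).
deps = {
    "next":   ["next", "branch", "load"],   # ptr = ptr->next feeds itself
    "branch": ["next", "load", "add", "store"],  # loop-exit test controls all
    "load":   ["add"],
    "add":    ["store"],
    "store":  [],
}
sccs = strongly_connected_components(deps)
```

In this assumed graph the pointer update and the loop-exit branch depend on each other, so they land in one SCC; the load, add, and store each form a singleton SCC.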
12 Create DAG of SCCs
Merge the instructions within each SCC into a single node and update the dependence arrows
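Merging each SCC into one node is the usual graph condensation. A minimal sketch, assuming the SCCs have already been computed and are passed in as a list; the function name `condense` and the example graph are hypothetical.

```python
def condense(graph, sccs):
    """Collapse each SCC to one node; keep only edges between different SCCs.

    `graph` maps instruction -> successors; `sccs` is a list of sets.
    Returns a dict mapping SCC index -> set of successor SCC indices.
    """
    component_of = {}
    for i, scc in enumerate(sccs):
        for instr in scc:
            component_of[instr] = i

    dag = {i: set() for i in range(len(sccs))}
    for src, dsts in graph.items():
        for dst in dsts:
            a, b = component_of[src], component_of[dst]
            if a != b:  # edges inside an SCC disappear with the merge
                dag[a].add(b)
    return dag


# Hypothetical dependence graph and its SCCs (assumed, for illustration).
deps = {
    "next":   ["next", "branch", "load"],
    "branch": ["next", "load", "add", "store"],
    "load":   ["add"],
    "add":    ["store"],
    "store":  [],
}
sccs = [{"next", "branch"}, {"load"}, {"add"}, {"store"}]
dag = condense(deps, sccs)
```

Since every cycle is contained in some SCC, the resulting graph has no cycles, which is what makes the next step (partitioning) possible.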
13 Partition DAG
Partition the DAG nodes into n partitions (n <= number of processors)
Use a heuristic to maximize load balance
  Decide the number of partitions (threads)
  Fill partitions starting from partition 1, taking nodes from the top of the DAG
  When a partition is full (estimated by number of cycles), move on to the next partition
Find the best number of threads and its partitioning
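The fill-from-the-top heuristic can be sketched as a greedy pass over a topological order of the DAG. This is a simplified reading of the slide, not the paper's exact heuristic: the "full" threshold (total cycles divided by n), the tie-breaking order, and the cycle estimates in the example are all assumptions, and the outer search over different thread counts is left out.

```python
def topological_order(dag):
    """Kahn's algorithm; ties are broken by sorted node order."""
    indegree = {node: 0 for node in dag}
    for dsts in dag.values():
        for dst in dsts:
            indegree[dst] = indegree.get(dst, 0) + 1
    ready = sorted(node for node, d in indegree.items() if d == 0)
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for dst in sorted(dag.get(node, ())):
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return order


def partition_dag(dag, cycles, n):
    """Fill partition 1 from the top of the DAG, then partition 2, and so on.

    A partition counts as full once its estimated cycles reach total / n
    (an assumed threshold). Nodes are assigned in topological order, so
    every cross-partition edge points to a later partition.
    """
    target = sum(cycles.values()) / n
    partitions = [[]]
    load = 0
    for node in topological_order(dag):
        if load >= target and len(partitions) < n:
            partitions.append([])
            load = 0
        partitions[-1].append(node)
        load += cycles[node]
    return partitions


# Hypothetical DAG of SCCs with assumed per-node cycle estimates.
dag = {0: {1, 2, 3}, 1: {2}, 2: {3}, 3: set()}
cycles = {0: 5, 1: 2, 2: 1, 3: 2}
partitions = partition_dag(dag, cycles, 2)
```

Running this for each candidate n and keeping the partitioning with the best estimated balance would correspond to the slide's last step.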
14 Split code and insert flows (done!)
For each partition, insert the basic blocks relevant to the SCC nodes it contains
Add code for the dependence flows
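Deciding where flows are needed amounts to finding dependence edges whose endpoints ended up in different threads. A minimal sketch under assumed inputs: `find_flows`, the dependence graph, and the thread assignment are hypothetical, and the real compiler also has to emit the matching produce and consume code, which is not shown.

```python
def find_flows(graph, thread_of):
    """Return sorted (producer instruction, from thread, to thread) triples.

    Each triple stands for one value that must be sent through an
    inter-thread queue: produced once in the source thread and consumed
    in the destination thread.
    """
    flows = set()
    for src, dsts in graph.items():
        for dst in dsts:
            if thread_of[src] != thread_of[dst]:
                flows.add((src, thread_of[src], thread_of[dst]))
    return sorted(flows)


# Hypothetical dependence graph and partitioning (assumed for illustration).
deps = {
    "next":   ["next", "branch", "load"],
    "branch": ["next", "load", "add", "store"],
    "load":   ["add"],
    "add":    ["store"],
    "store":  [],
}
thread_of = {"next": 0, "branch": 0, "load": 1, "add": 1, "store": 1}
flows = find_flows(deps, thread_of)
```

In this example two values cross from thread 0 to thread 1: the pointer and the branch outcome. The second one reflects that control dependences need flows too, so the consumer thread knows when the loop ends.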
15 Result
19.4% speedup on important benchmark loops, 9.2% overall
When core bandwidth is halved
  Single-threaded code slows down by 17.1%
  DSWP code is still slightly faster than single-threaded code running on a full-bandwidth core
Promising enabler for Thread-Level Parallelism (TLP)?
16 Second paper
A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs
William Thies, Vikram Chandrasekhar, and Saman Amarasinghe
MIT
17 What is the paper about?
Despite increasing uses of multiprocessors, many single threaded… (Repeated)
Coarse-grained pipelining is more desirable, but is especially hard with obfuscated C code
Let people define the pipeline, and learn the practical dependences at run time
18 What is the paper about?
Despite increasing uses of multiprocessors, many single threaded… (Repeated)
Coarse-grained pipelining is more desirable, but is especially hard with obfuscated C code
Let people define the stages, and learn the practical dependences at run time
…for streaming applications
19 Interface
Add annotations in the body of the top-level loop
20 Dynamic analysis
The system creates a stream graph according to the annotations
How does it find the dependences?
21 Dynamic analysis
Streaming applications tend to have a fixed pattern of data flow (stable flow) among pipeline stages
22 Dynamic analysis
Run the application on training inputs and record every relevant store-load pair that crosses a pipeline boundary
This gives the practical dependences
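The store-load recording can be sketched over a memory trace. This is an illustrative model, not the paper's tool: the trace format (stage, operation, address), the function name, and the stage names and addresses in the example are all assumptions, and the sketch ignores the distinction between dependences within one loop iteration and across iterations.

```python
def practical_dependences(trace):
    """Find store-load pairs that cross a pipeline-stage boundary.

    `trace` is a sequence of (stage, op, address) events in execution
    order, with op being "store" or "load". Returns a set of
    (producer stage, consumer stage, address) triples.
    """
    last_writer = {}  # address -> stage that most recently stored to it
    edges = set()
    for stage, op, address in trace:
        if op == "store":
            last_writer[address] = stage
        elif op == "load":
            producer = last_writer.get(address)
            if producer is not None and producer != stage:
                edges.add((producer, stage, address))
    return edges


# Hypothetical trace from one training run of a three-stage pipeline.
trace = [
    ("read",   "store", 0x10),
    ("decode", "load",  0x10),   # crosses read -> decode
    ("decode", "store", 0x20),
    ("decode", "load",  0x20),   # same stage: not a pipeline dependence
    ("write",  "load",  0x20),   # crosses decode -> write
]
edges = practical_dependences(trace)
```

The result only reflects what the training inputs exercised, which is why the approach relies on the stable data flow of streaming applications noted on the previous slide.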
23 Interface
The program shows a complete stream graph
The user decides whether this pipelining is acceptable
If yes, done! Otherwise, redo the annotations
Iterate until satisfied
24 Actual pipelining
When compiled, the annotation macros emit code that forks the original program once for each pipeline stage
25 Result
Average 2.78x speedup, max 3.89x, on a 4-core machine
Seems unsound but practical (?)