Fast Paths in Concurrent Programs
Wen Xu, Princeton University
Sanjeev Kumar, Intel Labs
Kai Li, Princeton University
Intel Labs & Princeton University

Concurrent Programs
- Message-passing style: processes & channels (e.g., streaming languages)
  [Figure: processes P1-P4 connected by channels C1-C3, partitioned across Processor 1 and Processor 2]
- Uniprocessors: programming convenience
  ─ Embedded devices
  ─ Network software stack
  ─ Media processing
- Multiprocessors: exploit parallelism by partitioning the processes
- Problem: compile a concurrent program to run efficiently on a uniprocessor
Compiling Concurrent Programs
- Process-based approach: keep the processes separate and context switch between them
  ─ Small executable (the sum of the processes)
  ─ Significant overhead
- Automata-based approach: treat each process as a state machine and combine the state machines
  ─ Small overhead
  ─ Large executables (potentially exponential)
- One study compared the two approaches and found that, relative to the process-based approach, the automata-based approach generates code that is
  ─ Twice as fast
  ─ 2-3 orders of magnitude larger
- Neither approach is satisfactory
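The contrast between the two approaches can be sketched for a single producer/consumer pair. Everything below (the one-slot channel, the step functions, the scheduler loop) is invented for illustration; it is not the paper's ESP code.

```c
/* Process-based: each process keeps its own state; a tiny scheduler
 * "context switches" between them through a one-slot channel. */
struct chan     { int full, val; };
struct producer { int i; };
struct consumer { int i; };

/* Run one process until it blocks on the channel; return 1 if it made progress. */
static int producer_step(struct producer *p, struct chan *c,
                         const int *in, int n) {
    if (p->i >= n || c->full) return 0;   /* done, or channel occupied */
    c->val = in[p->i++] * 2;              /* compute and send */
    c->full = 1;
    return 1;
}

static int consumer_step(struct consumer *q, struct chan *c, int *out) {
    if (!c->full) return 0;               /* blocked on empty channel */
    out[q->i++] = c->val;                 /* receive */
    c->full = 0;
    return 1;
}

int run_processes(const int *in, int *out, int n) {
    struct chan c = {0, 0};
    struct producer p = {0};
    struct consumer q = {0};
    /* Scheduler loop: every iteration is a context switch. */
    while (producer_step(&p, &c, in, n) | consumer_step(&q, &c, out))
        ;
    return q.i;
}

/* Automata-based: the two state machines are combined at compile time,
 * so the channel disappears into plain data flow -- no switches, but
 * merged code can grow with the product of the state spaces. */
int run_merged(const int *in, int *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * 2;               /* producer's step fused with consumer's */
    return n;
}
```

Both versions compute the same result; the difference is that the process-based one pays a scheduling step per item while the merged one is straight-line code.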
Our Work
- Goal: compile concurrent programs
  ─ Automated using a compiler
  ─ Low overhead
  ─ Small executable size
- Approach: combine the two approaches
  ─ Use the process-based approach to handle all cases
  ─ Use the automata-based approach to speed up the common cases
Outline
- Motivation
- Fast Paths
- Fast Paths in Concurrent Programs
- Experimental Evaluation
- Conclusions
Fast Paths
- Path: a dynamic execution path in the program
- Fast path (hot path): a well-known technique
  ─ Identify commonly executed paths (hot paths)
  ─ Specialize and optimize them (fast paths)
- Two components
  ─ A predicate that specifies the fast path
  ─ Optimized code to execute the fast path
- Compilers can be used to automate it, mostly for sequential programs
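The two components can be sketched in a few lines; the `copy` routine, its size threshold, and the overlap test below are invented for illustration, not taken from the paper.

```c
#include <string.h>

/* Baseline: a general copy that handles any size and overlapping buffers. */
static void copy_general(char *dst, const char *src, size_t n) {
    memmove(dst, src, n);
}

/* Fast path: short, non-overlapping copies are assumed to be the common case. */
void copy(char *dst, const char *src, size_t n) {
    if (n <= 8 && (dst + n <= src || src + n <= dst)) {  /* predicate */
        for (size_t i = 0; i < n; i++)                   /* optimized code */
            dst[i] = src[i];
    } else {
        copy_general(dst, src, n);                       /* fall back to baseline */
    }
}
```

The predicate must be cheap enough that testing it does not eat the savings of the specialized code.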
Manually Implementing Fast Paths
- Needed to achieve good performance in concurrent programs
  ─ Start: insert code that identifies the common case and transfers control to the fast-path code
  ─ Extract and optimize the fast-path code manually
  ─ Finish: patch up state and return control at the end of the fast path
- Obvious drawbacks
  ─ Difficult to implement correctly
  ─ Difficult to maintain
Outline
- Motivation
- Fast Paths
- Fast Paths in Concurrent Programs
- Experimental Evaluation
- Conclusions
Our Approach
[Figure: control flow with three steps ─ (1) Test: decide whether to enter the fast path; (2) Optimized Code: the automata-based fast path, e.g. "a = c; b = c * d; d = 3; if (c > 0) c++;"; (3) Abort?: fall back to the baseline, process-based code, e.g. "a = b; b = c * d; d = 0; if (c > 0) c++;"]
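The three-step flow on this slide (test, optimized code, abort check) might look like the following sketch. The `state` struct, the `b == c` entry condition, and the abort condition are all invented for illustration; the slide's own constants differ.

```c
struct state { int a, b, c, d; };

/* Baseline (process-based) code, always correct. */
static void baseline(struct state *s) {
    s->a = s->b;
    s->b = s->c * s->d;
    s->d = 0;
    if (s->c > 0) s->c++;
}

/* Returns 1 if the fast path completed, 0 if the baseline ran instead. */
int try_fast_path(struct state *s) {
    if (s->b != s->c) {            /* (1) test: fast path assumes b == c */
        baseline(s);
        return 0;
    }
    struct state saved = *s;       /* cheap here; lets an abort patch up state */
    s->a = s->c;                   /* (2) optimized: a = b folded to a = c */
    s->b = s->c * s->d;
    s->d = 0;
    if (s->c <= 0) {               /* (3) abort: rare case, restore and
                                      hand control back to the baseline */
        *s = saved;
        baseline(s);
        return 0;
    }
    s->c++;
    return 1;
}
```

Whichever way control flows, the observable result matches what the baseline alone would produce.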
Specifying Fast Paths
- A concurrent program has multiple processes
- Fast paths are specified with regular expressions over
  ─ Statements
  ─ Conditions (optional)
  ─ Synchronization (optional)
- Support early abort
- Advantages: powerful and compact
- Example hint:

    fastpath example {
      process first {
        statement A, B, C, D, #1;
        start A ? (size < 100);
        follows B ( C D )*;
        exit #1;
      }
      process second { ... }
      process third { ... }
    }
Extracting Fast Paths
- Automata-based approach to extract fast paths
  ─ A fast path involves a group of processes
  ─ The compiler keeps track of the execution point of each involved process
  ─ On exit, control is returned to the appropriate location in each of the processes
- Baseline: concurrent; fast path: sequential code
- Fairness on the fast path
  ─ Embed scheduling decisions in the fast path, avoiding scheduling/fairness overhead there
  ─ Rely on the baseline code for fairness; it is always taken a fraction of the time
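One way to picture the bookkeeping: the sequential fast path records, for each involved process, the baseline location where it should resume. Everything below (the resume labels, the packet queue, the `group` struct) is a hypothetical sketch, not the ESP compiler's actual representation.

```c
/* Labels naming points in the baseline code where a process can resume. */
enum point { AT_RECV, AT_SEND, AT_WORK };

struct proc  { enum point resume; };
struct group { struct proc sender, receiver; };

/* Fast path for "sender hands one packet to receiver": executes both
 * processes' common steps as straight-line sequential code, then records
 * where each process stands so the process-based baseline can take over. */
int fast_path_transfer(struct group *g, int *queue, int *len, int pkt) {
    if (*len >= 4) {                     /* uncommon case: queue full, abort */
        g->sender.resume   = AT_SEND;    /* sender is still trying to send */
        g->receiver.resume = AT_RECV;
        return 0;                        /* baseline scheduler resumes both */
    }
    queue[(*len)++] = pkt;               /* sender's step, inlined */
    g->sender.resume   = AT_WORK;        /* sender moved past its send */
    g->receiver.resume = AT_RECV;        /* receiver waits for the next packet */
    return 1;
}
```

On both exit and abort the per-process resume points are consistent, which is what lets the concurrent baseline pick up where the sequential fast path left off.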
Optimization on Fast Paths
- Enabling traditional optimizations on fast paths
  ─ Generate and optimize the baseline code
  ─ Generate the fast-path code; fast paths have exit/entry points to the baseline code
  ─ Use data-flow information from the baseline code at those exit/entry points to seed the analysis and optimize the fast-path code
- Speeding up fast paths using lazy execution
  ─ Delay operations that are not needed when the fast path executes to its end
  ─ Such operations can be performed if the fast path is aborted
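Lazy execution can be sketched as deferring work the fast path itself never reads. The accounting example below is invented for illustration, under the assumption that only the baseline code consults the running total.

```c
struct acct {
    int total;        /* needed only by the baseline's accounting code */
    int pending;      /* lazily deferred additions to total */
};

/* Fast path: just remember the charge; never touch total. */
void charge_fast(struct acct *a, int amount) {
    a->pending += amount;
}

/* On abort (or at a fast-path exit point into the baseline), catch up
 * the delayed work so the baseline sees consistent state. */
void settle(struct acct *a) {
    a->total  += a->pending;
    a->pending = 0;
}
```

The fast path saves one field update per charge; the cost moves to the rare abort/exit transition, where `settle` runs once.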
Outline
- Motivation
- Fast Paths
- Fast Paths in Concurrent Programs
- Experimental Evaluation
- Conclusions
Experimental Evaluation
- Implemented the techniques in the ESP compiler, which supports concurrent programs
- Two classes of programs
  ─ Filter programs
  ─ VMMC firmware
- Answer three questions
  ─ Programming effort (annotation complexity) needed
  ─ Size of the executable
  ─ Performance
Filter Programs
- Well-defined structure; streaming applications
- Use the filter programs of Proebsting et al.
- Good for evaluating our technique: concurrency overheads dominate
- Experimental setup: 2.66 GHz Pentium 4, 1 GB memory, Linux 2.4; 4 versions of the code
- Annotation complexity
  ─ Program sizes: 153, 125, 190, 196 lines
  ─ Annotation sizes: 7, 7, 10, 10 lines
[Figure: filter pipeline of processes P1-P4 connected by channels C1-C3]
Filter Programs (Cont'd)
[Charts: executable size and performance for Programs 1-4]
- Better performance than both prior approaches
- Relatively small executable
VMMC Firmware
- Firmware for a gigabit network (Myrinet)
- Experimental setup: measure network performance (latency & bandwidth) between two machines connected with Myrinet
- 3 versions of the firmware
  ─ Concurrent C version with manual fast paths
  ─ Process-based code without fast paths
  ─ Process-based code with compiler-extracted fast paths
- Annotation complexity (3 fast paths)
  ─ Fast-path specifications: 20, 14, and 18 lines
  ─ Manual fast paths in C: 1100 lines total
VMMC Firmware (Cont'd)
[Charts: latency vs. message size (in bytes), and generated code size in assembly instructions]
Outline
- Motivation
- Fast Paths
- Fast Paths in Concurrent Programs
- Experimental Evaluation
- Conclusions
Conclusions
- Fast paths in concurrent programs, evaluated using filter programs and the VMMC firmware
- Process-based approach handles all cases
  ─ Keeps executable size reasonable
- Automata-based approach handles only the common cases (fast paths)
  ─ Avoids the high overhead of the process-based approach
  ─ Often outperforms the automata-based code
Questions?