Fast Paths in Concurrent Programs. Wen Xu, Princeton University; Sanjeev Kumar, Intel Labs; Kai Li, Princeton University.

Intel Labs & Princeton University

Slide 2: Concurrent Programs
 Message-Passing Style
   Processes & Channels
   E.g., streaming languages
   [diagram: processes P1–P4 connected by channels C1–C3, partitioned across Processor 1 and Processor 2]
 Uniprocessors
   Programming convenience
    ─ Embedded devices
    ─ Network software stack
    ─ Media processing
 Multiprocessors
   Exploit parallelism
   Partition processes
Problem: compile a concurrent program to run efficiently on a uniprocessor.

Slide 3: Compiling Concurrent Programs
 Process-based approach
   Keep the processes separate; context-switch between them
   Small executable: the sum of the processes
   Significant overhead
 Automata-based approach
   Treat each process as a state machine and combine the state machines
   Small overhead
   Large executables: potentially exponential in size
 One study compared the two approaches and found that, relative to the process-based approach, the automata-based approach generates code that is
  ─ Twice as fast
  ─ 2–3 orders of magnitude larger
 Neither approach is satisfactory

Slide 4: Our Work
 Our goal: compile concurrent programs
   Automated using a compiler
   Low overhead
   Small executable size
 Our approach: combine the two approaches
   Use the process-based approach to handle all cases
   Use the automata-based approach to speed up the common cases

Slide 5: Outline
 Motivation
 Fast Paths
 Fast Paths in Concurrent Programs
 Experimental Evaluation
 Conclusions

Slide 6: Fast Paths
 Path: a dynamic execution path in the program
 Fast path (or hot path): a well-known technique
   Identify commonly executed paths (hot paths)
   Specialize and optimize them (fast paths)
 Two components
   A predicate that specifies the fast path
   Optimized code to execute the fast path
 Compilers can be used to automate the technique
   So far, mostly in sequential programs

Slide 7: Manually Implementing Fast Paths
 Done to achieve good performance in concurrent programs
   Start: insert code that identifies the common case and transfers control to the fast-path code
   Extract and optimize the fast-path code by hand
   Finish: patch up state and return control at the end of the fast path
 Obvious drawbacks
   Difficult to implement correctly
   Difficult to maintain

Slide 8: Outline
 Motivation
 Fast Paths
 Fast Paths in Concurrent Programs
 Experimental Evaluation
 Conclusions

Slide 9: Our Approach
 Structure of the generated code:
  1. Test for the common case
  2. If the test succeeds, run the optimized fast-path code (automata-based)
  3. Abort? If the fast path must abort, fall back to the baseline code (process-based)
 Baseline (process-based):
   a = b; b = c * d; d = 0; if (c > 0) c++;
 Fast path (automata-based), optimized:
   a = c; b = c * d; d = 3; if (c > 0) c++;

Slide 10: Specifying Fast Paths
 A concurrent program has multiple processes; a fast-path hint covers each involved process
 Regular expressions over
   Statements
   Conditions (optional)
   Synchronization (optional)
 Supports early abort
 Advantages: powerful and compact
 Hint example:

   fastpath example {
     process first {
       statement A, B, C, D, #1;
       start A ? (size < 100);
       follows B ( C D )*;
       exit #1;
     }
     process second { ... }
     process third { ... }
   }

Slide 11: Extracting Fast Paths
 The automata-based approach is used to extract fast paths
   A fast path involves a group of processes
   The compiler keeps track of the execution point of each involved process
   On exit, control is returned to the appropriate location in each process
 Baseline: concurrent; fast path: sequential code
 Fairness on the fast path
   Embed scheduling decisions in the fast path
    ─ Avoids scheduling/fairness overhead on the fast path
   Rely on the baseline code for fairness
    ─ It is always taken some fraction of the time

Slide 12: Optimization on the Fast Path
 Enabling traditional optimizations on fast paths
   Generate and optimize the baseline code
   Generate the fast-path code (fast paths have exit/entry points into the baseline code)
   Use data-flow information from the baseline code at the exit/entry points to seed the analysis and optimize the fast-path code
 Speeding up the fast path using lazy execution
   Delay operations that the fast path itself does not need until its end
   Such delayed operations can still be performed if the fast path is aborted

Slide 13: Outline
 Motivation
 Fast Paths
 Fast Paths in Concurrent Programs
 Experimental Evaluation
 Conclusions

Slide 14: Experimental Evaluation
 Implemented the techniques in the ESP compiler, which supports concurrent programs
 Two classes of programs
   Filter programs
   VMMC firmware
 The evaluation answers three questions
   How much programming effort (annotation complexity) is needed?
   How large is the executable?
   How good is the performance?

Slide 15: Filter Programs
 Well-defined structure; streaming applications
 We use the filter programs of Proebsting et al.
   Good for evaluating our technique: concurrency overheads dominate
   [diagram: filter pipeline of processes P1–P4 connected by channels C1–C3]
 Experimental setup
   2.66 GHz Pentium 4, 1 GB memory, Linux 2.4
   4 versions of the code
 Annotation complexity
   Program sizes: 153, 125, 190, 196 lines
   Annotation sizes: 7, 7, 10, 10 lines

Slide 16: Filter Programs (cont'd)
 [charts: executable size and performance for Programs 1–4]
 Better performance than both prior approaches
 Relatively small executable

Slide 17: VMMC Firmware
 Firmware for a gigabit network (Myrinet)
 Experimental setup
   Measure network performance (latency and bandwidth) between two machines connected with Myrinet
   3 versions of the firmware
    ─ Concurrent C version with manual fast paths
    ─ Process-based code without fast paths
    ─ Process-based code with compiler-extracted fast paths
 Annotation complexity (3 fast paths)
   Fast-path specifications: 20, 14, and 18 lines
   Manual fast paths in C: 1100 lines total

Slide 18: VMMC Firmware (cont'd)
 [charts: latency vs. message size in bytes; generated code size in assembly instructions]

Slide 19: Outline
 Motivation
 Fast Paths
 Fast Paths in Concurrent Programs
 Experimental Evaluation
 Conclusions

Slide 20: Conclusions
 Fast paths in concurrent programs
   Evaluated using filter programs and VMMC firmware
 The process-based approach handles all cases
   Keeps the executable size reasonable
 The automata-based approach handles only the common cases (the fast paths)
   Avoids the high overhead of the process-based approach
   Often outperforms purely automata-based code

Slide 21: Questions?
