Using Loop Perforation to Dynamically Adapt Application Behavior to Meet Real-Time Deadlines Henry Hoffmann, Sasa Misailovic, Stelios Sidiroglou, Anant.

Presentation transcript:

Using Loop Perforation to Dynamically Adapt Application Behavior to Meet Real-Time Deadlines
Henry Hoffmann, Sasa Misailovic, Stelios Sidiroglou, Anant Agarwal, and Martin Rinard
CSAIL, Massachusetts Institute of Technology, Cambridge, MA 02139

Outline
Introduction/Motivation
  – Problem
  – Solution: Loop Perforation
Loop Perforation
  – Finding Loops to Perforate
  – Controlling Perforation Dynamically
Experiments
  – Using Perforation to Adapt to Faults
Conclusion

Problem
Program is too slow
Misses real-time deadlines

Solution: Loop Perforation
Perforate: to make a hole through an object or structure
Loop perforation:
  – Do not execute all iterations
  – Skip some instead
Process:
  – Profile the program
  – Find the loops that take the most time
  – Perforate those loops
Original loop: for (i = 0; i < n; i++) { … }
Perforated loop: for (i = 0; i < n; i += 2) { … }
A perforated program:
  – Consumes fewer computational resources
  – Runs faster, consumes less energy, or both
  – Can meet its real-time deadlines!
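The slide elides the loop body, so here is a minimal sketch of what a perforated loop could look like for one concrete body. The function name `mean_perforated`, the `stride` parameter, and the choice of an averaging loop are assumptions made for illustration; they are not taken from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of x[0..n-1], visiting only every `stride`-th element.
 * stride == 1 is the original loop; stride == 2 corresponds to the
 * perforated loop shown on the slide (i += 2).
 * Hypothetical helper, not part of the authors' system. */
double mean_perforated(const double *x, size_t n, size_t stride) {
    assert(n > 0 && stride > 0);
    double sum = 0.0;
    size_t visited = 0;
    for (size_t i = 0; i < n; i += stride) {
        sum += x[i];
        visited++;          /* count only the iterations actually executed */
    }
    return sum / (double)visited;
}
```

For the input 1, 2, ..., 8 the unperforated call (`stride` 1) returns 4.5 and the perforated call (`stride` 2) returns 4.0: half the iterations, and a result that is close but not identical.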

Loop Perforation (cont’d)
[Figure labels: Maintain Acceptable Quality of Service; Don’t Perforate; Perforate; Increase Speed]
Q: Won’t perforation change the result?
A: Yes, so we target applications that have a range of acceptable outputs.

Static vs. Dynamic Perforation
Static loop perforation
  – Speeds up an application for some QoS loss
  – Allows applications to be repurposed, e.g., a broadcast video encoder can be transitioned to video conferencing
Dynamic loop perforation
  – Allows full QoS unless something bad happens
  – When something bad happens, the system adapts to maintain speed
Determine which loops to perforate using profiling
Our implemented system supports both static and dynamic perforation; this talk focuses on dynamic perforation.


A Perforating Compiler
Inputs (responsibility of the user, provided as input to the perforating compiler):
  – C/C++ program
  – Representative inputs
  – QoS metric and bound (QoS bound: the maximum acceptable loss of QoS)
Perforating compiler:
  – Steps: profile program, find costly loops, perforate, analyze QoS
  – Maximizes speedup for the QoS bound
  – Discards loops which cause slowdown, unacceptable QoS loss, or dynamic errors in Valgrind
Result: set of perforatable loops
  – Speeds up the application given the QoS bound
  – Perforation may be dynamic
This process is discussed in detail in: Misailovic, Sidiroglou, Hoffmann, Rinard. Quality of Service Profiling. To Appear, ICSE
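The three discard criteria on this slide can be read as a filter over per-loop trial results. The sketch below assumes each candidate loop has already been perforated and measured on the training inputs; the `LoopTrial` record, its field names, and both functions are hypothetical and only illustrate the filtering rule, not the compiler's actual search.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one trial: a single loop perforated and
 * measured on the representative (training) inputs. */
typedef struct {
    double speedup;        /* baseline time / perforated time */
    double qos_loss;       /* measured QoS loss, 0.0 = output unchanged */
    bool   dynamic_error;  /* true if the dynamic checker reported an error */
} LoopTrial;

/* Keep a loop only if none of the slide's three discard rules applies. */
bool is_perforatable(const LoopTrial *t, double qos_bound) {
    assert(qos_bound >= 0.0);
    if (t->dynamic_error)        return false;  /* dynamic errors */
    if (t->speedup <= 1.0)       return false;  /* slowdown or no gain */
    if (t->qos_loss > qos_bound) return false;  /* unacceptable QoS loss */
    return true;
}

/* Count the loops that survive the filter. */
size_t count_perforatable(const LoopTrial *trials, size_t n, double qos_bound) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (is_perforatable(&trials[i], qos_bound)) kept++;
    return kept;
}
```

Treating a speedup of exactly 1.0 as a discard is a choice made here; the slide only says loops that cause a slowdown are dropped.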

Use PARSEC Benchmarks to Test Approach
PARSEC benchmarks* represent emerging workloads
We pick seven benchmark applications for which we can define a QoS metric:
  – x264 (H.264 video encoding)
  – bodytrack (human movement tracking)
  – swaptions (financial analysis)
  – ferret (content-based similarity search)
  – canneal (engineering: circuit place & route)
  – blackscholes (financial analysis)
  – streamcluster (online approx. of k-means)
We augment the benchmark suite with additional data sets and divide them into:
  – Training (about 25% of inputs)
  – Production (remaining 75% of inputs)
*

Performance/QoS Tradeoffs for PARSEC Benchmarks

Dynamically Controlling Perforation
Application registers a heartbeat using the Application Heartbeats API*
Runtime monitors the heartbeat
Heartbeat too slow?
  – Increase perforation to trade QoS for increased performance
Heartbeat too fast?
  – Decrease perforation to reclaim QoS
[Figure labels: Application (Loop 1, Loop 2, Loop i); Heartbeat API; Perforation Selection; Runtime Monitor]
*Hoffmann, Eastep, Santambrogio, Miller, Agarwal. Application Heartbeats for Software Performance and Health. PPoPP
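The feedback rule on this slide (too slow: perforate more; too fast: perforate less) might be sketched as a one-step controller. Everything here is an assumption for illustration: the `PerfController` struct, the integer perforation levels, and `adjust_perforation` are hypothetical, and the code does not call or imitate the real Application Heartbeats API.

```c
#include <assert.h>

/* Hypothetical controller state; not the authors' runtime. */
typedef struct {
    int level;      /* current perforation level, 0 = no perforation */
    int max_level;  /* most aggressive level still within the QoS bound */
} PerfController;

/* One control step: compare the observed heart rate (beats per second)
 * with the target window and move the perforation level by one step. */
int adjust_perforation(PerfController *c, double heart_rate,
                       double min_rate, double max_rate) {
    assert(min_rate <= max_rate);
    if (heart_rate < min_rate && c->level < c->max_level) {
        c->level++;   /* too slow: trade QoS for performance */
    } else if (heart_rate > max_rate && c->level > 0) {
        c->level--;   /* too fast: reclaim QoS */
    }
    return c->level;  /* inside the window: leave the level unchanged */
}
```

A single-step adjustment keeps the sketch simple; a real controller could equally choose the level in one jump from a measured speedup table.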


Evaluation Methodology
Two applications (from the PARSEC benchmark suite):
  – x264 (media application, performs H.264 video encoding)
  – bodytrack (computer vision application, tracks a body through a scene)
Two changing environments:
  – Core failure: during execution, 3 of 8 cores fail
  – Frequency scaling: during execution, clock frequency rises and falls
For each app and scenario:
  – Goal: keep performance within 0.95x to 1.1x that of the system with no failures
  – Measure:
      baseline performance (no failure)
      performance with failure and no perforation
      performance with failure and dynamic perforation
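As a rough back-of-the-envelope check on these scenarios, the sketch below computes how much speedup perforation would have to supply and tests whether a measured rate falls in the 0.95x to 1.1x goal window. It assumes performance scales linearly with core count or clock frequency, which the slides do not claim; both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Speedup needed to return to baseline after a resource drop,
 * ASSUMING performance is proportional to the resource (cores or GHz). */
double required_speedup(double resources_before, double resources_after) {
    assert(resources_before > 0.0 && resources_after > 0.0);
    return resources_before / resources_after;
}

/* True if measured performance lies within 0.95x to 1.1x of baseline,
 * the goal window stated in the methodology. */
bool within_goal(double perf, double baseline) {
    assert(baseline > 0.0);
    double ratio = perf / baseline;
    return ratio >= 0.95 && ratio <= 1.10;
}
```

Under that linear-scaling assumption, losing 3 of 8 cores calls for about 8/5 = 1.6x, and a drop from 2.53 GHz to 1.6 GHz calls for about 1.58x, so the two scenarios stress the system to a similar degree.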

x264 Core Loss Experiment
Lose 3 of 8 cores

bodytrack Core Loss Experiment
Lose 3 of 8 cores

bodytrack Results (Core Failure)
Maintains track on head, chest, and legs despite loss of 37.5% of compute

x264 Frequency Scaling Experiment
Frequency drops (2.53 GHz → 1.6 GHz)
Frequency rises (1.6 GHz → 2.53 GHz)

bodytrack Frequency Scaling Experiment
Frequency drops (2.53 GHz → 1.6 GHz)
Frequency rises (1.6 GHz → 2.53 GHz)

bodytrack Results (Frequency Scaling)
Perforation allows app to maintain track while frequency is low. When frequency rises again, high-quality track is reestablished.

Conclusion
Presented loop perforation
  – Speeds up programs by making performance/QoS tradeoffs
  – Showed as much as 2x speedup for 5% degradation in QoS
Presented dynamic loop perforation
  – Allows the system to detect performance loss and respond by perforating loops
  – Maintains performance in a changing environment
  – Can respond to any environmental change that affects performance
More detail on dynamic perforation available in: Hoffmann, Misailovic, Sidiroglou, Agarwal, Rinard. Using Code Perforation to Improve Performance, Reduce Energy Consumption, and Respond to Failures. MIT-CSAIL-TR August,

Backup

Perforatable Loops in PARSEC Benchmarks
[Figure: axis label "Number of loops"]

x264, Training

x264, Production

x264 Encoder
[Figure labels: Uncompressed Video Frame Sequence; Compressed Video Stream]

Motion Estimation
[Figure labels: Reference Frame; Current Frame]
All perforated loops are in the motion estimation computation

x264 Loop Nest
Sum of Hadamard transformed differences loop nest (computes match metric between cur and ref blocks)
    short temp[4][4];
    for (i = 0; i < h; i += 4) {
        for (j = 0; j < w; j += 4) {
            element_wise_subtract(temp, cur, ref, cs, rs);
            hadamard_transform(temp, 4);
            value += sum_abs_matrix(temp, 4);
        }
        cur += 4*cs; ref += 4*rs;
    }
    return value;

Perforated x264 Loop Nest
Sum of Hadamard transformed differences loop nest (computes match metric between cur and ref blocks)
    short temp[4][4];
    for (i = 0; i < h; i += 8) {
        for (j = 0; j < w; j += 8) {
            element_wise_subtract(temp, cur, ref, cs, rs);
            hadamard_transform(temp, 4);
            value += sum_abs_matrix(temp, 4);
        }
        cur += 4*cs; ref += 4*rs;
    }
    return value;
Perforation effect
  – New block match metric
  – Uses block with best match (as measured by metric)
  – New metric works fine
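The slide code relies on helpers that are not shown, so the following is a self-contained sketch of a sum of absolute Hadamard-transformed differences with a configurable block step. It is not x264's implementation: the function names are hypothetical, the transform is unnormalized, and, unlike the slide, each 4x4 block is addressed directly from `i` and `j` rather than by advancing the `cur` and `ref` pointers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* 4-point Hadamard butterfly, in place, on v[0], v[s], v[2s], v[3s]. */
static void hadamard4(int *v, int s) {
    int a = v[0] + v[s],       b = v[0] - v[s];
    int c = v[2 * s] + v[3 * s], d = v[2 * s] - v[3 * s];
    v[0] = a + c; v[s] = b + d; v[2 * s] = a - c; v[3 * s] = b - d;
}

/* Match metric for one 4x4 block; cs and rs are the row strides. */
static int satd4x4(const uint8_t *cur, int cs, const uint8_t *ref, int rs) {
    int t[16];
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            t[4 * y + x] = cur[y * cs + x] - ref[y * rs + x];
    for (int y = 0; y < 4; y++) hadamard4(t + 4 * y, 1);  /* transform rows */
    for (int x = 0; x < 4; x++) hadamard4(t + x, 4);      /* transform columns */
    int sum = 0;
    for (int k = 0; k < 16; k++) sum += abs(t[k]);
    return sum;
}

/* step == 4 visits every 4x4 block; step == 8 is the perforated variant
 * and visits every other block in each dimension. */
int satd_perforated(const uint8_t *cur, int cs, const uint8_t *ref, int rs,
                    int w, int h, int step) {
    assert(step > 0);
    int value = 0;
    for (int i = 0; i + 4 <= h; i += step)
        for (int j = 0; j + 4 <= w; j += step)
            value += satd4x4(cur + i * cs + j, cs, ref + i * rs + j, rs);
    return value;
}
```

The perforated metric sums over fewer blocks, so its absolute value shrinks, but motion estimation only compares metric values between candidate blocks, which is consistent with the slide's point that the new metric still picks a good match.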

Why Not Just Skip Motion Estimation?
Runs 6.8 times faster
But encoded video is 3.55 times bigger!

bodytrack Training

bodytrack Production

bodytrack
Particle method
Annealing layers
Dispersed particles
Compute with particles

bodytrack
Next annealing layer
Particle dispersion affected by previous layer
Continue until done with annealing layers

bodytrack Loop
    for (i = 0; i < layers; i++) {
        disperse particles for layer
        do particle computation
    }

Perforated bodytrack Loop
    for (i = 0; i < layers; i += 2) {
        disperse particles for layer
        do particle computation
    }
Perforation effect
  – Perform fewer annealing layers
  – Perform less work, finish faster

Other Perforated Loops in bodytrack
Concepts
  – bodytrack maintains a probabilistic model of where body parts are in the previous frame
  – Reads image data from 4 cameras
  – Performs image processing to get information about where it thinks the body is in the current frame
  – Computes a probabilistic model for the current frame
Many perforated loops in error calculations
  – Between the probabilistic model from the previous frame
  – And image data from the current frame
  – Used to obtain the probabilistic model for the current frame

Perforated Image Quality
Panning camera