High Performance Stream Processing for Mobile Sensing Applications

Presentation transcript:

High Performance Stream Processing for Mobile Sensing Applications
Graduate Research Symposium 2015
Farley Lai
Advisor: Dr. Octav Chipara

Introduction: Mobile Sensing Applications (MSAs)
[Diagram: the Speaker Identification pipeline — Speech Recording (sensing), then Voice Detection, Feature Extraction, Speaker Models, and HTTP Upload (stream processing)]
MSAs are an emerging class of … An example application is Speaker Identification, which has a sensing phase and a stream-processing phase. The sensing phase records speech from the microphone; the stream-processing phase involves voice detection and feature extraction, and the extracted features may be uploaded … While sensing can be straightforward, stream processing can be arbitrarily complex and compute-intensive. Moreover, these applications are expected to run in the background for long periods and deliver continuous results. Therefore, it is essential to achieve high-performance real-time processing and efficient resource management.

A Model for Stream Applications
StreamIt – a simple stream language from MIT: static schedule, pass-by-value semantics across FIFO channels.
[Stream graph: Source → Duplicate splitter → (LPF1 | LPF2) → Round-Robin joiner → Subtract → Sink; a pipeline containing a split-join]
Let's take a closer look at the application model. We use the StreamIt language from MIT to facilitate our optimization. In this model, stream processing is represented as a pipeline. Here is a band-pass filter program that allows only data samples in a particular frequency range to pass. At a high level, the program is a pipeline of filters connected through FIFO channels. A filter is basically a function with at most one input channel and one output channel. The only way to access its input channel is through the peek() and pop() operations; the only way to access its output channel is through the push() operation. The pipeline may contain the split-join construct, which allows the data stream to branch: here, the splitter duplicates its input and the joiner combines its inputs in round-robin fashion. Program execution follows a static schedule. The schedule for this program has an init phase that executes once and a steady phase that may repeat forever; each phase specifies the order and number of filter invocations:
INIT PHASE: (Source,3) (DUP,3) (LPF1,1) (LPF2,1) (RR,1) (Sub,1) (Sink,1)
STEADY PHASE: (Source,1) (DUP,1) (LPF1,1) (LPF2,1) (RR,1) (Sub,1) (Sink,1)
Though this model is simple, the pass-by-value semantics across FIFO channels may be inefficient: a filter may not reuse its input memory for its output.
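To make these semantics concrete, here is a minimal Python sketch of the band-pass pipeline under the pass-by-value model. The Channel class, filter bodies, coefficients, and schedule driver are illustrative assumptions, not StreamIt's actual runtime.

from collections import deque

class Channel:
    """FIFO channel; pass-by-value means every push stores its own copy."""
    def __init__(self):
        self._q = deque()
    def push(self, v):
        self._q.append(v)
    def pop(self):
        return self._q.popleft()
    def peek(self, i):
        return self._q[i]

def source(out, state=[0.0]):
    # Produce one sample per firing (mutable default used as a counter).
    state[0] += 1.0
    out.push(state[0])

def duplicate(inp, out_a, out_b):
    # Duplicate splitter: copy each input sample to both branches.
    v = inp.pop()
    out_a.push(v)
    out_b.push(v)

def lpf(inp, out, coeff):
    # Low-pass filter work function: peek 3, pop 1, push 1.
    s = sum(inp.peek(k) * coeff[k] for k in range(3))
    inp.pop()
    out.push(s)

def round_robin(in_a, in_b, out):
    # Round-robin joiner: alternate one sample from each branch.
    out.push(in_a.pop())
    out.push(in_b.pop())

def subtract(inp, out):
    # Band-pass output: difference of the two low-pass outputs.
    a, b = inp.pop(), inp.pop()
    out.push(a - b)

def sink(inp):
    print(inp.pop())

# Wire up: Source -> Duplicate -> (LPF1 | LPF2) -> RoundRobin -> Subtract -> Sink
c = [Channel() for _ in range(7)]
COEFF1, COEFF2 = [0.3, 0.4, 0.3], [0.1, 0.2, 0.1]  # illustrative coefficients

# INIT PHASE: (Source,3) (DUP,3) (LPF1,1) (LPF2,1) (RR,1) (Sub,1) (Sink,1)
for _ in range(3):
    source(c[0])
for _ in range(3):
    duplicate(c[0], c[1], c[2])
lpf(c[1], c[3], COEFF1); lpf(c[2], c[4], COEFF2)
round_robin(c[3], c[4], c[5]); subtract(c[5], c[6]); sink(c[6])

# STEADY PHASE: every filter fires once per iteration; may repeat forever
for _ in range(4):
    source(c[0]); duplicate(c[0], c[1], c[2])
    lpf(c[1], c[3], COEFF1); lpf(c[2], c[4], COEFF2)
    round_robin(c[3], c[4], c[5]); subtract(c[5], c[6]); sink(c[6])

Note how every push appends a fresh value: a filter's output never reuses its input's storage, which is exactly the inefficiency ESMS targets.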

The Memory Management Challenge
Workload: memory-intensive operations on data streams, e.g., windowing, splitting, or appending.
Goal: implement stream operations efficiently — reduce the memory footprint and reduce the number of memory accesses.
Challenges: capture component memory behaviors; avoid unnecessary memory copies; exploit data sharing between components.
ESMS: Efficient Static Memory management for Streaming — reduces data memory usage by up to 96%, with up to 8.7X speedup.
That is why we take on the memory management challenge for stream processing: we observed that many stream operations, such as windowing, splitting, and appending, are memory intensive. Therefore, the goal is to implement stream operations efficiently; we would like to reduce the memory footprint and the number of memory accesses. To achieve this, we need to address the following challenges. First, we need to capture component memory behaviors. We also need to avoid unnecessary memory copies and exploit data sharing. Hence, we propose ESMS, which reduces data memory usage by up to 96% and improves performance by up to 8.7X. Our optimization consists of component analysis for each filter, whole-program analysis across filters, and layout generation.
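Reusing the Channel class from the sketch above, here is a hedged illustration of why windowing is memory-intensive under pass-by-value semantics; this window filter is a hypothetical example, not taken from ESMS.

def window(inp, out, size, hop):
    # Emit a sliding window of `size` samples, then advance by `hop`.
    # Under pass-by-value, all `size` elements are copied to the output
    # channel on every firing, even though size - hop of them were already
    # emitted in the previous window. ESMS aims to let consumers share
    # such elements in place instead of copying them.
    for k in range(size):
        out.push(inp.peek(k))
    for _ in range(hop):
        inp.pop()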

Component Analysis of LPF (1)
Low Pass Filter (LPF) work function:

work pop 1 push 1 peek 3 {
    float sum = 0;
    sum += peek(0) * coeff[0];
    sum += peek(1) * coeff[1];
    sum += peek(2) * coeff[2];
    pop();
    push(sum);
}

CFG: Entry → sum = 0 → sum += peek(0) * coeff[0] → sum += peek(1) * coeff[1] → sum += peek(2) * coeff[2] → pop() → push() → Exit
The goal of the component analysis is to capture the live range of each I/O element within one filter invocation. A live range tells us when an element is produced and when it is no longer used. Let's go through the low-pass filter from the band-pass filter program. The filter's work function computes a linear combination of its input elements with pre-computed coefficients. To facilitate the analysis, we convert the function to a CFG so that we can traverse the graph and process the statements.

Component Analysis of LPF (2)
Each stream operation is labeled with a monotonic counter (MC) in program order: peek(0): MC 0, peek(1): MC 1, peek(2): MC 2, pop(): MC 3, push(): MC 4.
STATE: LIN = { in[0]: [0,0], in[1]: [0,0], in[2]: [0,0] }, LOUT = ∅
Update rule at a stream operation with counter MC touching element i: LIN[i] = LIN[i] ⊔ [MC, MC]
In the beginning, each stream operation is labeled with an MC to record its order. The input element live ranges are initialized to the interval [0,0], and the output element live range is initialized to empty. Since we are only concerned with the stream operations — push(), peek(), and pop() — we can go directly to the first statement, peek(0), where LIN[0] = LIN[0] ⊔ [0,0] = [0,0].

Component Analysis of LPF (3)
After peek(1) at MC 1: LIN = { in[0]: [0,0], in[1]: [0,1], in[2]: [0,0] }, LOUT = ∅

Component Analysis of LPF (4)
After peek(2) at MC 2: LIN = { in[0]: [0,0], in[1]: [0,1], in[2]: [0,2] }, LOUT = ∅

Component Analysis of LPF (5)
After pop() at MC 3, which consumes input element 0: LIN = { in[0]: [0,3], in[1]: [0,1], in[2]: [0,2] }, LOUT = ∅

Component Analysis of LPF (6)
After push() at MC 4: LOUT = { out[0]: [4,4] }. The live ranges for one LPF invocation are now complete: in[0]: [0,3], in[1]: [0,1], in[2]: [0,2], out[0]: [4,4].
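The walkthrough above can be summarized in code. Below is a hedged Python sketch of the component analysis over a straight-line work function; ESMS itself traverses a CFG, and the operation encoding and function names here are illustrative assumptions.

def join(r, mc):
    """Extend an interval live range to cover monotonic counter mc."""
    if r is None:                 # empty live range
        return (mc, mc)
    lo, hi = r
    return (min(lo, mc), max(hi, mc))

def component_analysis(ops, peek_count):
    """ops: sequence of ('peek', i), ('pop',), or ('push',) in program order."""
    lin = {i: (0, 0) for i in range(peek_count)}  # input element live ranges
    lout = {}                                     # output element live ranges
    popped = pushed = 0
    for mc, op in enumerate(ops):
        if op[0] == 'peek':
            idx = popped + op[1]                  # peek is relative to the head
            lin[idx] = join(lin[idx], mc)
        elif op[0] == 'pop':
            lin[popped] = join(lin[popped], mc)
            popped += 1
        elif op[0] == 'push':
            lout[pushed] = join(lout.get(pushed), mc)
            pushed += 1
    return lin, lout

# LPF work function: peek(0), peek(1), peek(2), pop(), push()
lin, lout = component_analysis(
    [('peek', 0), ('peek', 1), ('peek', 2), ('pop',), ('push',)], peek_count=3)
print(lin)   # {0: (0, 3), 1: (0, 1), 2: (0, 2)}
print(lout)  # {0: (4, 4)}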

Whole Program Analysis
Extends the live ranges to cover all I/O elements across the schedule, tagging each with a (phase, invocation, MC) triple.

Element      Start (producer's output)   End (consumer's input)
(LPF1, O0)   [4, 4]                      [0, 3] as (Subtract, I0)
(LPF2, O0)   [4, 4]                      [0, 4] as (Subtract, I1)

With component analysis, the live ranges for one filter invocation are captured. The next step is to extend them to cover all I/O elements through whole-program analysis. A data element's live range usually starts as some filter's output and ends as another filter's input, so it has a start live range and an end live range. The whole-program analysis relates the two and extends each live range by adding the schedule phase number and the filter invocation index to the original MC. In this way, it is straightforward to check whether two live ranges overlap:

Element      Start        End
(LPF1, O0)   (0, 6, 4)    (0, 9, 3)
(LPF2, O0)   (0, 7, 4)    (0, 9, 4)

Overlap: 6 < 7 < 9 — LPF2's output is produced (invocation 7) before LPF1's output is last consumed (invocation 9), so the two live ranges overlap.
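Because Python tuples compare lexicographically, the extended (phase, invocation, MC) triples support a direct overlap test. A minimal sketch with illustrative names, assuming two live ranges overlap unless one ends strictly before the other starts:

def overlaps(a, b):
    """a, b: (start, end) pairs of (phase, invocation, MC) triples.
    Tuples compare lexicographically, so phase dominates, then
    invocation index, then MC within the invocation."""
    (start_a, end_a), (start_b, end_b) = a, b
    return not (end_a < start_b or end_b < start_a)

lpf1_o0 = ((0, 6, 4), (0, 9, 3))
lpf2_o0 = ((0, 7, 4), (0, 9, 4))
print(overlaps(lpf1_o0, lpf2_o0))  # True: (0,7,4) falls inside LPF1's range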

Band Pass Filter Layout (1)
Memory layout (Initialization):
0: Source:O0
1: Source:O1
2: Source:O2

Live ranges:
Element        Start       End
(Source, O0)   (0, 0, 0)   (0, 7, 3)
(Source, O1)   (0, 1, 0)   (1, 3, 3)
(Source, O2)   (0, 2, 0)   (1, 6, 1)
(LPF1, O0)     (0, 6, 4)   (0, 9, 3)
(LPF2, O0)     (0, 7, 4)   (0, 9, 4)
(Subtract, O0) (0, 9, 2)   (0, 10, 0)

Next, with the live-range information for the entire program, it is straightforward to generate the memory layout. The layout starts empty; we begin with the initialization phase and simulate the schedule once. First, the Source filter executes three times and produces three output elements. Since their live ranges overlap, they occupy three distinct memory locations: 0, 1, and 2.

Band Pass Filter Layout (2)
Memory layout (Initialization):
0: Source:O0
1: Source:O1
2: Source:O2
3: LPF1:O0

Next, we can skip the splitter and joiner because they do not produce new elements, so we get to LPF1. Its output (LPF1, O0), live over (0, 6, 4) to (0, 9, 3), overlaps (Source, O0), live over (0, 0, 0) to (0, 7, 3), so it cannot reuse location 0 and is placed in a new location, 3.

Band Pass Filter Layout (3)
Memory layout (Initialization):
0: Source:O0, LPF2:O0
1: Source:O1
2: Source:O2
3: LPF1:O0

Next, LPF2 fires. (Source, O0) is last used at (0, 7, 3), just before (LPF2, O0) is produced at (0, 7, 4) within the same invocation; the ranges do not overlap, so LPF2's output reuses location 0.

Band Pass Filter Layout (4)
Memory layout (Initialization):
0: Source:O0, LPF2:O0
1: Source:O1
2: Source:O2
3: LPF1:O0
4: Subtract:O0

Finally, Subtract fires. Its output (Subtract, O0) is produced at (0, 9, 2), before its inputs are last read at (0, 9, 3) and (0, 9, 4); since the live ranges overlap, the output takes a new location, 4.

Band Pass Filter Layout (5)
Memory layout:
Location  Initialization        Steady
0         Source:O0, LPF2:O0    Source:O1
1         Source:O1             Source:O2
2         Source:O2             Source:O3
3         LPF1:O0
4         Subtract:O0

Live ranges (steady phase):
Element        Start       End
(Source, O1)   (0, 1, 0)   (1, 3, 3)
(Source, O2)   (0, 2, 0)   (1, 6, 1)
(Source, O3)   (1, 0, 0)   (1, 6, 2)
(LPF1, O1)     (1, 2, 4)   (1, 5, 3)
(LPF2, O1)     (1, 3, 4)   (1, 5, 4)
(Subtract, O1) (1, 5, 2)   (1, 6, 0)

Next is the steady phase. Since Source output elements O1 and O2 are still live, they are copied and shifted to the beginning of the layout so that each filter's memory accesses stay the same across steady-state iterations; the new Source:O3 then lands in location 2. The remaining placements follow the same procedure, so we move quickly through them.

Band Pass Filter Layout (6)
Memory layout (Steady): location 3 now also holds LPF1:O1.
(LPF1, O1), live over (1, 2, 4) to (1, 5, 3), overlaps (Source, O1), which ends at (1, 3, 3), so it cannot take location 0; it reuses location 3, where LPF1:O0 is already dead.

Band Pass Filter Layout (7)
Memory layout (Steady): location 0 now also holds LPF2:O1.
(Source, O1) is last used at (1, 3, 3), just before (LPF2, O1) is produced at (1, 3, 4), so LPF2's steady-phase output reuses location 0.

Band Pass Filter Layout (8)
Lastly, location 4 is reused for Subtract:O1, live over (1, 5, 2) to (1, 6, 0).

Memory layout (final):
Location  Initialization        Steady
0         Source:O0, LPF2:O0    Source:O1, LPF2:O1
1         Source:O1             Source:O2
2         Source:O2             Source:O3
3         LPF1:O0               LPF1:O1
4         Subtract:O0           Subtract:O1

The resulting memory layout size decreases from 14 to 7 compared with the original StreamIt model.
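The layout construction above amounts to a first-fit assignment over the extended live ranges. Here is a hedged Python sketch reproducing the initialization-phase placements; the names and the first-fit policy are illustrative, and it omits the steady-phase copy-and-shift step.

def assign_slots(elements):
    """elements: list of (name, start, end), where start/end are
    (phase, invocation, MC) triples. Returns name -> memory location."""
    slots = []                      # slots[i] = (occupant name, occupant end)
    layout = {}
    for name, start, end in sorted(elements, key=lambda e: e[1]):
        for i, (_, occ_end) in enumerate(slots):
            if occ_end < start:     # current occupant is dead: reuse its slot
                slots[i] = (name, end)
                layout[name] = i
                break
        else:                       # no reusable slot: open a new one
            slots.append((name, end))
            layout[name] = len(slots) - 1
    return layout

init_elems = [
    ('Source:O0',   (0, 0, 0), (0, 7, 3)),
    ('Source:O1',   (0, 1, 0), (1, 3, 3)),
    ('Source:O2',   (0, 2, 0), (1, 6, 1)),
    ('LPF1:O0',     (0, 6, 4), (0, 9, 3)),
    ('LPF2:O0',     (0, 7, 4), (0, 9, 4)),
    ('Subtract:O0', (0, 9, 2), (0, 10, 0)),
]
print(assign_slots(init_elems))
# {'Source:O0': 0, 'Source:O1': 1, 'Source:O2': 2,
#  'LPF1:O0': 3, 'LPF2:O0': 0, 'Subtract:O0': 4}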

Evaluation: Memory Usage on Intel x86_64
ESMS removes the channel buffer allocations for splitters and joiners and enables more data reuse.
[Figures: code size and data size reductions per benchmark, ESMS vs. StreamIt cache optimization]
Now that it is clear how ESMS works, how useful is it? Here are the memory usage reductions on the Intel platform. ESMS may use different strategies to handle cases where the output element cannot reuse the input memory. The right figure shows a data size reduction of 45% to 96% compared with the StreamIt cache optimization, because ESMS eliminates the channel buffer allocations for splitters and joiners and enables more data reuse; the saved memory can be used to buffer more sensor data. The left figure shows a code size reduction of 73% on average, because location sharing prevents unnecessary memory copies of shared elements. Overall, ESMS reduces both the channel buffer sizes and the number of memory operations from splitters, joiners, and reordering filters.

Evaluation: Speedup on Intel x86_64
Average speedups for AA, AoC, IP, and CacheOpt are 3, 3.1, 3, and 1.07; performance improves by eliminating unnecessary memory operations and reducing cache/memory references. Baseline: StreamIt.
Finally, we evaluate the performance speedup against the baseline StreamIt. Overall, the average speedup of ESMS is about 3X, while the average speedup of the StreamIt cache optimization is merely 1.07X. The StreamIt cache optimization is not applicable to our macro benchmarks because it runs out of memory under large fine-grained FFT settings. To sum up, ESMS improves performance by removing unnecessary memory operations and reducing the number of cache/memory references through a smaller working set. Since MSAs are supposed to run continuously, this speedup lowers CPU utilization, which can extend battery life and make the system more responsive.

Conclusions and Future Work
ESMS is effective for stream languages and points toward a predictable performance/energy model for stream processing.
Effective: it captures whole-program memory behavior, exploits reuse opportunities, and achieves significant performance improvements.
Predictable: many optimization configurations exist for single-, dual-, and quad-core targets. We plan to let programmers specify real-time constraints such as latency and have the compiler search for the configuration with the least power consumption, which we believe will be useful for long-term deployments of MSAs.

Mobile Sensing Laboratory
Finally, here are our Mobile Sensing Lab members. The lab is led by Dr. Chipara, with Dr. Leo, Dr. Marjan, and Dr. Behnam, and Ph.D. students Shabih, Ryan, and me (Farley). Please feel free to talk to them at the symposium. Now, I think it's time to take your questions.

Thank You Questions?