WCET-Aware Dynamic Code Management on Scratchpads for Software-Managed Multicores Yooseong Kim 1,2, David Broman 2,3, Jian Cai 1, Aviral Shrivastava 1,2.

Presentation transcript:

WCET-Aware Dynamic Code Management on Scratchpads for Software-Managed Multicores
Yooseong Kim (1,2), David Broman (2,3), Jian Cai (1), Aviral Shrivastava (1,2)
(1) Arizona State University, (2) University of California, Berkeley, (3) Linköping University
RTAS 2014, Berlin, Germany

Timing is important
- Timing constraints – meet the deadline!
- For absolute timing guarantees: system-level timing analysis, plus Worst-Case Execution Time (WCET) analysis for individual tasks
- Reducing the WCET can help meet deadlines
[Figure: schedules of tasks τ1, τ2, τ3 against a deadline D]
This work is about analyzing and optimizing the WCET of a program.

Software-Managed Multicores (SMM)
- No direct access to main memory: all code and data must be loaded into the SPM before being executed or accessed
- Isolation among cores – good for real-time systems
- Example: IBM Cell BE
[Figure: each core has a private SPM and a DMA engine connected to main memory]
Software-managed multicores cannot directly access main memory.

SPM Management: Static vs. Dynamic
Static management
- Contents are loaded only once, at program loading time
- Good when everything fits in the scratchpad; bad when it doesn't – locality is limited
Dynamic management
- Contents are brought in and out at runtime by DMA operations
- DMA transfers take time
[Figure: the small SPM address range mapped onto the much larger main-memory address space]
Dynamic management involves DMA transfers; we try to minimize their impact on the WCET.

Dynamic Management on Traditional Setups vs. SMMs
Traditional architectures with scratchpads
- Only frequently accessed code and data are placed in the SPM; the rest stays in main memory
- The goal is to exploit more locality – SPM management is an optimization
SMM architectures
- Anything that is accessed must be loaded into the SPM
- SPM management is a MUST
[Figure: functions A, B, C, D moving between main memory and the SPM over time]
Dynamic management is essential to execute a program on SMMs.

Dynamic Code Management
- Load program code on demand at runtime
- Granularity: basic blocks or functions?
- All previous approaches to optimizing the WCET work at the basic-block level; some basic blocks are left in main memory, so they are not applicable to SMMs
- Function-level approaches are applicable to both SMMs and traditional architectures (a runtime loading sketch follows)
[Figure: a basic-block-level mapping of blocks v0–v5 vs. a function-level mapping of functions f0–f3]
Previous WCET optimization techniques are not usable on SMMs, whereas our approach is usable on any architecture.
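Conceptually, function-level management keeps one resident function per region and issues a DMA transfer whenever the function about to run is not resident. A minimal sketch of that runtime check, assuming a hypothetical dma_load primitive and region table (not the paper's runtime API):

loaded = {}                          # region -> function currently resident there

def ensure_loaded(f, mapping, dma_load):
    r = mapping[f]
    if loaded.get(r) != f:           # f is not resident: it was evicted or never loaded
        dma_load(f, r)               # DMA the function's code into its region
        loaded[r] = f

def managed_call(caller, callee, mapping, dma_load, run_callee):
    ensure_loaded(callee, mapping, dma_load)   # load the callee at the call
    run_callee()
    ensure_loaded(caller, mapping, dma_load)   # reload the caller at the return

# Toy usage: f0 and f1 share region R1, so returning to f0 triggers a reload.
mapping = {"f0": "R1", "f1": "R1"}
managed_call("f0", "f1", mapping,
             lambda f, r: print("DMA load", f, "into", r),
             lambda: print("running f1"))
# prints: DMA load f1 into R1 / running f1 / DMA load f0 into R1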

Function-Level Dynamic Code Management
- Load the callee at a call (and the caller at a return)
- Function-to-region mapping M: F → R
- A region is an abstraction of the SPM address space: each region represents a unique SPM address range, the size of a region is the size of the largest function mapped to it, and |R| ≤ |F|
- Example: giving every function its own region (M(f1) = R1, M(f2) = R2, M(f3) = R3) may not fit in the SPM and is then not feasible; mapping f3 into R2 instead (M(f3) = R2) can make the mapping feasible (see the feasibility sketch below)
Function-level management needs a function-to-region mapping.
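A mapping is feasible when the regions, each sized by the largest function mapped to it, fit in the SPM together. A minimal feasibility check, with illustrative function sizes and SPM size (not the paper's numbers):

def mapping_is_feasible(mapping, func_size, spm_size):
    """mapping: function -> region; func_size: function -> code size."""
    region_size = {}
    for f, r in mapping.items():
        # A region must be as large as the largest function mapped to it.
        region_size[r] = max(region_size.get(r, 0), func_size[f])
    # Regions are laid out back-to-back in the SPM address space.
    return sum(region_size.values()) <= spm_size

func_size = {"f1": 6, "f2": 4, "f3": 5}   # hypothetical sizes
spm = 12

print(mapping_is_feasible({"f1": "R1", "f2": "R2", "f3": "R3"}, func_size, spm))  # False: 6+4+5 > 12
print(mapping_is_feasible({"f1": "R1", "f2": "R2", "f3": "R2"}, func_size, spm))  # True: 6+max(4,5) <= 12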

Mapping for ACET ≠ Mapping for WCET
- Example: f1 branches and calls either f2 (path 1, probability 0.3) or f3 (path 2, probability 0.7)
- Mapping A puts f1 and f2 in region R1 and f3 in R2: path 1 must reload f1 and costs 14, path 2 costs 6 + 2 = 8
- Mapping B puts f1 and f3 in region R1 and f2 in R2: path 1 costs 10 + 1 = 11, and path 2, which must reload f1, also costs 11
- ACET: A = 14·0.3 + 8·0.7 = 9.8 vs. B = 11·0.3 + 11·0.7 = 11 → A is better for the average case
- WCET: A = max(14, 8) = 14 vs. B = max(11, 11) = 11 → B is better for the worst case (computed in the small sketch below)
A mapping affects the execution time by changing function reloads. In this paper, we find a mapping for the WCET.
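The trade-off can be checked with a few lines of arithmetic; the path costs and probabilities below are the illustrative numbers from the slide:

# ACET vs. WCET of the two candidate mappings in the example above.
paths = {"A": [14, 8], "B": [11, 11]}   # per-path execution time under each mapping
prob = [0.3, 0.7]                        # probabilities of path 1 and path 2

for m, costs in paths.items():
    acet = sum(c * p for c, p in zip(costs, prob))
    wcet = max(costs)
    print(m, "ACET =", acet, "WCET =", wcet)
# A: ACET = 9.8,  WCET = 14  -> best on average
# B: ACET = 11.0, WCET = 11  -> best in the worst case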

Overview of Our Approach
- Interference analysis: what is the worst-case scenario of function reloads?
- Integer linear programming (ILP): optimal, but not scalable
- A heuristic: sub-optimal, but scalable

Notation: func(v) and cc_v
- func(v) – the function that basic block v belongs to
- cc_v – 1 when v is entered through a call or a return, i.e., at the points where func(v) may have to be (re)loaded; 0 otherwise
- Example: f0 contains v0, v1 (which calls f1), and v3 (executed after the return); f1 contains v2. Then func(v0) = func(v1) = func(v3) = f0 and func(v2) = f1, while cc_v0 = cc_v1 = 0 and cc_v2 = cc_v3 = 1.

Interference Analysis
- What causes a function to be reloaded? The loading of other functions into the same region
- IS(v) – the set of all functions that may have been loaded since the last time func(v) was loaded
- Example: IS(v3) = {f1}. If f0 and f1 share the same region, f0 could have been evicted by f1, so we assume f0 has to be reloaded at v3
Using a mapping and the interference sets, we can find the worst-case function reloading scenario (a minimal sketch follows).
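As a sketch of how the interference sets are used, the following reload test assumes pre-computed IS, func, and cc annotations; the helper and the example values (taken from the running example) are illustrative:

def must_reload(v, IS, func, cc, mapping):
    """Conservatively assume a reload of func(v) at v if v is a load point
    (cc[v] == 1) and some interfering function shares func(v)'s region."""
    if cc[v] == 0:
        return False
    region = mapping[func[v]]
    return any(mapping[g] == region for g in IS[v])

func = {"v0": "f0", "v1": "f0", "v2": "f1", "v3": "f0"}
cc   = {"v0": 0, "v1": 0, "v2": 1, "v3": 1}
IS   = {"v0": set(), "v1": set(), "v2": {"f0"}, "v3": {"f1"}}

print(must_reload("v3", IS, func, cc, {"f0": "R1", "f1": "R1"}))  # True: f1 may evict f0
print(must_reload("v3", IS, func, cc, {"f0": "R1", "f1": "R2"}))  # False: separate regions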

ILP Formulation (1): Finding the WCEP
- W_v – the WCET from v to the end of the program; v_s – the source node
- For all (v, w) in E: W_v ≥ W_w + C_v, i.e., take the maximum, over all paths starting at v, of the sum of the costs of the vertices on the path
- C_v = n_v · comp(v) + L_v – the cost of v: n_v executions of computation cost comp(v), plus the function loading cost L_v
- If a loading occurs at v, L_v is the DMA cost of loading func(v); otherwise L_v = 0
- Objective function: minimize W_{v_s}
The objective is to minimize the sum of the C_v's of the vertices on the WCEP.
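For reference, the same constraints written out as displayed math (a direct transcription of the slide's formulation):

\begin{align*}
  \text{minimize} \quad & W_{v_s} \\
  \text{s.t.} \quad & W_v \ge W_w + C_v \qquad \forall (v,w) \in E \\
  & C_v = n_v \cdot \mathit{comp}(v) + L_v \\
  & L_v = \begin{cases}
      \text{DMA cost of loading } \mathit{func}(v) & \text{if a load occurs at } v, \\
      0 & \text{otherwise.}
    \end{cases}
\end{align*}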

ILP Formulation (2): Function Loading Cost
- For all f in F and r in R: L_v ≥ n_v · cc_v · i_{f,v} · M_{func(v),f,r} · DMA_{func(v)}
- DMA_{func(v)} – the DMA cost of loading func(v)
- i_{f,v} – 1 when f is in the interference set IS(v); M_{f,g,r} – 1 if both f and g are mapped to region r; so the product is nonzero only when func(v) needs to be reloaded at v
- The ILP explores all possible mapping choices through the M_{f,g,r} variables
- The minimizing objective function finds the mapping that minimizes the function loading cost on the WCEP (an executable sketch follows)
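A minimal, end-to-end sketch of the formulation using the PuLP library (pip install pulp). The toy CFG, costs, sizes, and the standard linearization of the region-pairing variables are illustrative assumptions, not the paper's exact model:

from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum

# Illustrative inlined CFG: f0 (v0, v1, v3) calls f1 (v2) at v1; v3 runs after the return.
succ = {"v0": ["v1"], "v1": ["v2"], "v2": ["v3"], "v3": []}
func = {"v0": "f0", "v1": "f0", "v2": "f1", "v3": "f0"}
cc   = {"v0": 0, "v1": 0, "v2": 1, "v3": 1}
n    = {v: 1 for v in func}                   # execution counts (from loop bounds)
comp = {"v0": 3, "v1": 2, "v2": 5, "v3": 2}   # computation costs
IS   = {"v0": set(), "v1": set(), "v2": {"f0"}, "v3": {"f1"}}   # interference sets
DMA  = {"f0": 4, "f1": 3}                     # DMA cost of loading each function
size = {"f0": 6, "f1": 5}                     # function sizes
SPM  = 8
F, R = ["f0", "f1"], ["r0", "r1"]

prob = LpProblem("wcet_aware_mapping", LpMinimize)
X = {(f, r): LpVariable(f"X_{f}_{r}", cat=LpBinary) for f in F for r in R}           # f mapped to r
Y = {(f, g, r): LpVariable(f"Y_{f}_{g}_{r}", cat=LpBinary) for f in F for g in F for r in R}
W = {v: LpVariable(f"W_{v}", lowBound=0) for v in func}   # WCET from v to the end
L = {v: LpVariable(f"L_{v}", lowBound=0) for v in func}   # function loading cost at v
S = {r: LpVariable(f"S_{r}", lowBound=0) for r in R}      # region sizes

prob += W["v0"]                                            # minimize the WCET from the source
for f in F:
    prob += lpSum(X[f, r] for r in R) == 1                 # each function gets exactly one region
    for r in R:
        prob += S[r] >= size[f] * X[f, r]                  # a region is as large as its largest function
        for g in F:
            prob += Y[f, g, r] >= X[f, r] + X[g, r] - 1    # Y = 1 when f and g share region r
prob += lpSum(S.values()) <= SPM                           # the regions must fit in the SPM

for v in func:
    for f in IS[v]:                                        # i_{f,v} = 1 exactly for f in IS(v)
        for r in R:
            prob += L[v] >= n[v] * cc[v] * DMA[func[v]] * Y[func[v], f, r]
    cost_v = n[v] * comp[v] + L[v]                         # C_v = n_v * comp(v) + L_v
    for w in succ[v]:
        prob += W[v] >= W[w] + cost_v                      # longest-path (WCEP) constraints
    if not succ[v]:
        prob += W[v] >= cost_v                             # sink vertex

prob.solve()
print("WCET estimate:", W["v0"].value())
print("mapping:", {f: r for (f, r), x in X.items() if x.value() == 1})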

Our Heuristic
- The number of mapping solutions increases exponentially with the number of functions
- Instead, search a reasonably limited solution space by merging and partitioning regions
- Cost function: the cost of the longest path (the WCET estimate)
- Iterative and sub-optimal – the problem has no optimal substructure (an illustrative sketch follows)
[Figure: regions holding f0, f1, f2 being merged and partitioned across iterations 0, 1, 2]
Our heuristic finds the best mapping within a limited solution space.
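The exact merge/partition moves are not spelled out on the slide; the following greedy, merge-only variant is an illustrative simplification of the idea: start with one region per function, then repeatedly merge the pair of regions whose merge hurts the WCET estimate the least until the mapping fits in the SPM.

from itertools import combinations

def region_footprint(regions, size):
    # Each region is as large as its largest function; regions are placed back-to-back.
    return sum(max(size[f] for f in r) for r in regions)

def heuristic_mapping(funcs, size, spm_size, wcet_estimate):
    """wcet_estimate(regions) evaluates the longest-path cost under a candidate mapping."""
    regions = [frozenset([f]) for f in funcs]            # iteration 0: one region per function
    while region_footprint(regions, size) > spm_size:
        if len(regions) == 1:
            raise ValueError("largest function does not fit in the SPM")
        best = None
        for a, b in combinations(regions, 2):            # try merging every pair of regions
            trial = [r for r in regions if r not in (a, b)] + [a | b]
            cost = wcet_estimate(trial)
            if best is None or cost < best[0]:
                best = (cost, trial)
        regions = best[1]                                 # keep the cheapest merge, iterate
    return regions

# Toy usage with a stand-in cost function (a real run would call the WCET analysis):
sizes = {"f0": 6, "f1": 4, "f2": 5}
print(heuristic_mapping(list(sizes), sizes, 12, lambda regs: len(regs)))
# -> f0 and f1 end up sharing one region, f2 keeps its own (footprint 6 + 5 = 11 <= 12)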

Implementation Overview
Toolflow: the program is turned into an inlined CFG (inlined CFG generation); interference analysis over the inlined CFG, together with loop bounds, yields the interference sets; ILP generation (using the SPM size and function sizes) or the heuristic then produces a mapping solution and a WCET estimate via the ILP solver; finally, DMA instruction insertion produces the final program.
ILP 1 – for finding a mapping. ILP 2 – for finding the WCET only.

Experimental Setup
- Comparison with three previous mapping techniques: FMUM & FMUP [Jung et al., ASAP 2010] and SDRM [Pabalkar et al., HiPC], all optimized for the average case
- Benchmarks from the MiBench suite and the Mälardalen WCET suite
- Loop bounds obtained by profiling
- Results verified by simulation with the gem5 simulator

Results: WCET Estimates
- The heuristic performs as well as the ILP
- Elapsed time: heuristic < 1 sec for all benchmarks; ILP ~100 min for susan and > 10 days for adpcm
- The ILP solution did not improve after a few minutes, so a time-limited ILP (< 20 min) can also serve as a heuristic
- For one benchmark, no reload occurs regardless of the mapping because of its call pattern
[Chart: WCET estimates of the competing techniques across the benchmarks]

Summary
- SMMs are a promising architecture for real-time systems, but they need comprehensive dynamic management
- Function-level dynamic management requires a function-to-region mapping, and the mapping for ACET ≠ the mapping for WCET
- This is the first mapping technique tuned for the WCET, with up to 80% improvement
- Future work: prefetching by asynchronous DMA, comparison with caches

Thank you!

Scratchpads, an Alternative to Caches
- The number of cores keeps increasing
- Caches: coherence does not scale well to many cores; transparency makes programming easy but WCET analysis difficult
- Scratchpads (SPM): simple, so scalable; ~30% less area and power [Banakar et al., CODES+ISSS 2002]; explicitly managed – more predictable behavior
[Figure: a core with an SPM and DMA engine next to main memory]
Scratchpads can be a good fit for real-time embedded systems.