Memory-Aware Application Mapping on Coarse-Grained Reconfigurable Arrays
5th International Conference on High Performance Embedded Architectures and Compilers (HiPEAC 2010)
Yongjoo Kim, Jongeun Lee*, Aviral Shrivastava**, Jonghee Yoon and Yunheung Paek
Software Optimization And Restructuring (SO&R), Department of Electrical Engineering, Seoul National University, Seoul, Korea
* Embedded Systems Research Lab, ECE, Ulsan Nat'l Institute of Science & Tech, Ulsan, Korea
** Compiler and Microarchitecture Lab, Center for Embedded Systems, Arizona State University, Tempe, AZ, USA
Coarse-Grained Reconfigurable Array (CGRA) — SO&R and CML Research Group
- High computation throughput
- Low power consumption and scalability
- High flexibility with fast configuration
[Table: power efficiency (MIPS, mW, MIPS/mW) of an embedded processor (XScale) vs. TI TMS320C6x-series DSPs (VLIW); CGRAs show 10~100 MIPS/mW]
Coarse-Grained Reconfigurable Array (CGRA)
- Array of PEs (processing elements) with a mesh-like interconnect network
- Each PE can operate on the results of its neighbor PEs
- Executes computation-intensive kernels
Application mapping in CGRA
- Mapping a DFG (dataflow graph) onto the PE-array mapping space
- Must satisfy several conditions:
  - Nodes must be mapped to PEs with the right functionality
  - Data transfer between nodes must be guaranteed by the interconnect
  - Resource consumption should be minimized for performance
CGRA execution & data mapping
t_c: computation time, t_d: data transfer time
[Figure: PE array with configuration memory; local memory with four banks (Bk1–Bk4), each split into two buffers (buf1/buf2); DMA between local memory and main memory]
- Double buffering: the DMA fills one buffer while the PE array computes on the other
- Total runtime = max(t_c, t_d)
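The double-buffering model above can be sketched as a one-line cost function; this is an illustrative model of the slide's formula, and the function name is ours, not from the talk.

```python
def tile_runtime(t_c, t_d):
    """Steady-state cost of one tile under double buffering: while the
    PE array computes on one buffer, the DMA fills the other with the
    next tile, so computation and data transfer overlap and each tile
    costs max(t_c, t_d) rather than t_c + t_d."""
    return max(t_c, t_d)
```

For a memory-bound tile (t_d > t_c) the transfer time dominates, which is exactly the bottleneck the rest of the talk attacks.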
The performance bottleneck: data transfer
- Many multimedia kernels have a larger t_d than t_c
- On average, t_c accounts for just 22% of the total time (100% = t_c + t_d)
- Most applications are memory-bound
Computation mapping & data mapping
- Duplicating arrays increases data transfer time
[Example: LD S[i] and LD S[i+1] feeding an add are mapped to PEs connected to different banks, forcing array S to be duplicated in both banks of the local memory]
Contributions of this work
- First approach to consider computation mapping and data mapping together
  - balances t_c and t_d
  - minimizes duplicate arrays (maximizes data reuse)
  - balances bank utilization
- Simple yet effective extension
  - a set of cost functions
  - can be plugged into existing compilation frameworks, e.g., EMS (edge-centric modulo scheduling)
Application mapping flow
DFG → Performance Bottleneck Analysis (produces the DCR) and Data Reuse Analysis (produces the DRG) → Memory-aware Modulo Scheduling → Mapping
Preprocessing 1: performance bottleneck analysis
- Determines whether it is computation or data transfer that limits overall performance
- Calculates the DCR (data-transfer-to-computation time ratio): DCR = t_d / t_c
- DCR > 1: the loop is memory-bound
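The bottleneck test is a direct ratio check; a minimal sketch (function names are ours):

```python
def dcr(t_d, t_c):
    """Data-transfer-to-computation time ratio: DCR = t_d / t_c."""
    return t_d / t_c

def is_memory_bound(t_d, t_c):
    """DCR > 1 means data transfer, not computation, limits the loop."""
    return dcr(t_d, t_c) > 1.0
```

With the average split reported earlier (t_c = 22% of total), a typical kernel has DCR = 78/22 ≈ 3.5 and is firmly memory-bound.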
Preprocessing 2: data reuse analysis
- Finds the amount of potential data reuse
- Creates a DRG (Data Reuse Graph): nodes correspond to memory operations, and edge weights approximate the amount of reuse
- An edge weight is estimated as TS − rd, where TS is the tile size and rd is the reuse distance in iterations
[Example DRG nodes: S[i], S[i+1], S[i+5], D[i], D[i+10], R[i], R2[i]]
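The DRG construction above can be sketched as follows; this is our reading of the slide, with the reference encoding (array name, constant offset) and all function names being our assumptions.

```python
def reuse_weight(tile_size, reuse_distance):
    """Edge weight TS - rd, clamped at 0: a reuse distance longer than
    the tile means the reused element has left the tile already."""
    return max(tile_size - reuse_distance, 0)

def build_drg(refs, tile_size):
    """refs: list of (array_name, offset) pairs, e.g. ('S', 1) for S[i+1].
    Returns {(i, j): weight} edges between references to the same array,
    where the reuse distance is the offset difference in iterations."""
    edges = {}
    for i, (a1, o1) in enumerate(refs):
        for j, (a2, o2) in enumerate(refs):
            if j <= i or a1 != a2:
                continue  # one edge per pair, same-array refs only
            w = reuse_weight(tile_size, abs(o1 - o2))
            if w > 0:
                edges[(i, j)] = w
    return edges
```

For example, with a tile size of 8, S[i] and S[i+1] get an edge of weight 7, while S[i] and S[i+5] get weight 3: nearer reuses are worth more.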
Application mapping flow (cont.)
The DCR and DRG are used for cost calculation during memory-aware modulo scheduling.
Mapping with data reuse opportunity cost (DROC)
[Example: arrays A[i] and A[i+1] reside in Bank1 and B[i] in Bank2; placing a load on a PE row connected to the wrong bank forfeits the reuse]
- Memory-unaware cost alone can pick a placement that loses reuse
- New total cost = memory-unaware cost + data reuse opportunity cost (DROC)
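A minimal sketch of the DROC term, under our reading of the slide: if placing a memory operation separates it from a DRG neighbor already committed to another bank, the reuse on that edge is lost (the array would have to be duplicated), so the edge weight is charged as a penalty. All names here are ours.

```python
def droc(op, bank, drg_edges, bank_of):
    """Data reuse opportunity cost of placing memory op 'op' on 'bank'.
    drg_edges: {(u, v): weight} from the DRG.
    bank_of:   {op: bank} for already-placed memory operations."""
    cost = 0
    for (u, v), w in drg_edges.items():
        if op not in (u, v):
            continue
        other = v if op == u else u
        if other in bank_of and bank_of[other] != bank:
            cost += w  # reuse forfeited: array must be duplicated
    return cost
```

Adding this penalty to the memory-unaware cost steers the scheduler toward placements that keep reuse partners on the same bank.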
BBC (Bank Balancing Cost)
- Prevents allocating all data to just one bank
- BBC(b) = β × A(b)
  - β: the base balancing cost (a design parameter)
  - A(b): the number of arrays already mapped onto bank b
[Example with β = 10: Bank1 already holds A[i] and A[i+1], so mapping candidate B[i] to Bank1 incurs a higher balancing cost than mapping it to the emptier Bank2]
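The balancing term is a direct transcription of the slide's formula; a minimal sketch (the dictionary encoding of bank occupancy is our assumption):

```python
def bbc(bank, arrays_on_bank, beta=10):
    """Bank balancing cost BBC(b) = beta * A(b), where A(b) is the
    number of arrays already mapped onto bank b and beta is a design
    parameter (the base balancing cost)."""
    return beta * arrays_on_bank.get(bank, 0)
```

With β = 10 and two arrays already on Bank1, BBC(Bank1) = 20 while BBC(Bank2) = 0, so, other costs being equal, the next array goes to Bank2.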
Application mapping flow (cont.)
After memory-aware modulo scheduling, a Partial Shutdown Exploration step is added before the final mapping.
Partial Shutdown Exploration
- For a memory-bound loop, performance is often limited by the memory bandwidth rather than by computation, so computation resources are in surplus
- Explore partial shutdown of PE rows and memory banks to find the configuration with the minimum EDP (energy-delay product)
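The exploration itself is a small exhaustive search over row/bank counts; a sketch of that loop, where the per-configuration runtime/energy model `evaluate` is caller-supplied (an assumption of this sketch, since the talk obtains these numbers by rescheduling the loop on each reduced architecture):

```python
from itertools import product

def partial_shutdown_explore(rows, banks, evaluate):
    """Enumerate configurations with r active PE rows and m active
    memory banks (1 <= r <= rows, 1 <= m <= banks) and return the
    tuple (edp, r, m) minimizing the energy-delay product.
    evaluate(r, m) must return (runtime, energy) for that config."""
    best = None
    for r, m in product(range(1, rows + 1), range(1, banks + 1)):
        runtime, energy = evaluate(r, m)
        edp = runtime * energy
        if best is None or edp < best[0]:
            best = (edp, r, m)
    return best
```

For a 4×4 CGRA with 4 banks this is only 16 candidate configurations, so the exhaustive search is cheap.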
Example of partial shutdown exploration
[Table: candidate configurations (active PE rows × active memory banks) for a kernel containing LD S[i], LD S[i+1], LD D[i], ST R[i], compared by t_c, t_d, runtime (R), energy (E), and the product R×E]
Experimental Setup
- A set of loop kernels from MiBench, multimedia, and SPEC 2000 benchmarks
- Target architecture
  - 4×4 heterogeneous CGRA (4 memory-accessible PEs)
  - 4 memory banks, each connected to one row
  - Each PE connected to its four neighbors and four diagonal neighbors
- Compared mapping flows
  - Ideal: memory-unaware mapping + single-bank memory architecture
  - MU: memory-unaware mapping (EMS*) + multi-bank memory architecture
  - MA: memory-aware mapping + multi-bank memory architecture
  - MA + PSE: MA + partial shutdown exploration
* Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures, Hyunchul Park et al., PACT '08
Runtime comparison
Compared with MU, MA reduces runtime by 30%.
Energy consumption comparison
MA + PSE shows a 47% reduction in energy consumption.
Conclusion
- CGRAs provide very high power efficiency while remaining software-programmable
- While previous solutions focused on computation speed, we also consider data transfer to achieve higher performance
- We proposed an effective heuristic that takes the memory architecture into account
- It achieves a 62% reduction in energy-delay product, which factors into 47% and 28% reductions in energy consumption and runtime, respectively (0.53 × 0.72 ≈ 0.38)
Thank you for your attention!