LCTES 2010, Stockholm, Sweden
Operation and Data Mapping for CGRAs with Multi-Bank Memory
Yongjoo Kim, Jongeun Lee*, Aviral Shrivastava**, and Yunheung Paek
** Compiler and Microarchitecture Lab, Center for Embedded Systems, Arizona State University, Tempe, AZ, USA
* High Performance Computing Lab, UNIST (Ulsan National Institute of Science & Technology), Ulsan, Korea
Software Optimization And Restructuring, Department of Electrical Engineering, Seoul National University, Seoul, Korea

Coarse-Grained Reconfigurable Array (CGRA)  (SO&R and CML Research Group)
 High computation throughput
 High power efficiency
 High flexibility with fast reconfiguration
[Table: Category / Processor / MIPS / mW / MIPS-per-mW for VLIW (Itanium), GPP (Athlon 64 FX), GP MP (Intel Core 2 Duo), Embedded (XScale), DSP (TI TMS320C), MP (Cell PPEs), DSP-VLIW (TI TMS320C); the numeric columns were lost in the transcript]
 CGRAs achieve roughly 10~100 MIPS/mW

Coarse-Grained Reconfigurable Array (CGRA)
 Array of PEs with a mesh-like interconnection network
 Each PE can operate on the results of its neighboring PEs
 Executes the computation-intensive kernels of an application
 Components: PE array, local memory, configuration memory

Execution Model
 CGRA works as a coprocessor
 Offloads the burden of the main processor
 Accelerates compute-intensive kernels
[Figure: main processor and CGRA connected to main memory through a DMA controller]

Memory Issues
 Feeding data to a large number of PEs is very difficult
  Irregular memory accesses; the miss penalty is very high
  Without a cache, the compiler has full responsibility
 A large local memory helps
 Multi-bank memory gives high throughput, but
  Memory access freedom is limited
  Dependence handling and reuse opportunities become harder
[Figure: DFG with load S[i], load D[i], compute nodes (-, +, *), and store R[i], mapped to a PE array fed by a 4-bank local memory]

MBA (Multi-Bank with Arbitration)

Contributions
 Previous work: a hardware solution using a load-store queue (more hardware, same compiler)
 Our solution: a compiler technique using conflict-free scheduling

                             MBA                 MBAQ
 Memory-unaware scheduling   Baseline            Previous work [Bougard08]
 Memory-aware scheduling     Proposed            Evaluated

How to Place Arrays
 Interleaving
  Balanced use of all banks
  Spreads out bank conflicts
  Harder to analyze access behavior
 Sequential
  Easy-to-analyze access behavior
  Unbalanced use of banks
[Figure: a 4-element array placed on a 3-bank memory]
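The two placements can be sketched as address-to-bank functions (a minimal sketch; word-granularity interleaving and the bank size of 4 are illustrative assumptions, not details from the slides):

```python
def bank_of(addr, n_banks, bank_size, interleaved):
    """Map a word address to a bank index under the two placements."""
    if interleaved:
        # consecutive words go to consecutive banks
        return addr % n_banks
    # sequential: fill one bank completely before moving to the next
    return addr // bank_size

# The slide's 4-element array, starting at address 0, on a 3-bank memory:
interleaved = [bank_of(a, 3, 4, True) for a in range(4)]   # spread over banks
sequential  = [bank_of(a, 3, 4, False) for a in range(4)]  # all in one bank
```

With interleaving the four elements land on banks [0, 1, 2, 0]; sequentially they all land on bank 0, which is easy to analyze but leaves the other banks idle.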

Hardware Approach (MBAQ + Interleaving)
 A DMQ of depth K can tolerate up to K instantaneous conflicts
 The DMQ cannot help if the average conflict rate exceeds one per cycle
 Interleaving spreads bank conflicts out
 NOTE: load latency is increased by K-1 cycles
 How can we improve on this with a compiler approach?
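The depth-K tolerance claim can be illustrated with a tiny single-bank simulation (a sketch under simplifying assumptions: the bank retires one request per cycle, and a request that finds the queue full stalls the pipeline; `simulate_bank` is a hypothetical helper, not from the paper):

```python
def simulate_bank(requests_per_cycle, depth):
    """Count stall events for one bank behind a DMQ of the given depth.
    requests_per_cycle[i] = accesses issued to this bank in cycle i."""
    pending, stalls = 0, 0
    for n in requests_per_cycle:
        accepted = min(n, depth - pending)   # queue absorbs what fits
        stalls += n - accepted               # overflow requests stall
        pending += accepted
        if pending:
            pending -= 1                     # bank serves one request/cycle
    return stalls

# A burst of 4 simultaneous accesses is absorbed by a depth-4 DMQ,
# but a sustained rate of 2 accesses/cycle eventually overflows it.
burst_stalls = simulate_bank([4, 0, 0, 0], depth=4)
sustained_stalls = simulate_bank([2] * 10, depth=4)
```

The burst causes no stalls, while the sustained overload does, matching the slide's point that the DMQ handles instantaneous but not average conflict rates above one.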

Operation & Data Mapping: Phase Coupling
 CGRA mapping = operation mapping + data mapping
 The two phases are coupled: the data mapping decides which operation schedules are conflict-free
[Figure: 2x2 PE array with two banks behind arbitration logic; with arrays A and B both placed in Bank1 and C in Bank2, scheduling the loads of A[i] and B[i] in the same cycle causes a bank conflict]

Our Approach
 Main challenge
  Operation mapping and data mapping are inter-dependent problems
  Solving them simultaneously is extremely hard, so we solve them sequentially
 Application mapping flow
  Pre-mapping
  Array analysis and array clustering
  Conflict-free scheduling
  If array clustering fails, or if scheduling fails, go back and retry

Conflict-Free Scheduling
 Our array clustering heuristic bounds the total per-iteration access count of the arrays included in a cluster
 Conflict-free scheduling
  Treat memory banks, or the memory ports to the banks, as resources
  Record the cycle at which each memory operation is mapped
  Prevent two memory operations belonging to the same cluster from being mapped to the same cycle
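The resource check at the heart of the scheduler can be sketched as follows (illustrative; in modulo scheduling two operations collide when their cycles are equal modulo the II, which is the check sketched here):

```python
def conflict_free(mem_ops, ii):
    """mem_ops: list of (cluster, schedule_cycle) for memory operations.
    True iff no two operations of the same cluster share a cycle mod II,
    i.e. no bank is accessed twice in the same steady-state cycle."""
    used = set()
    for cluster, cycle in mem_ops:
        slot = (cluster, cycle % ii)
        if slot in used:
            return False        # same bank, same cycle: bank conflict
        used.add(slot)
    return True

# Loads of A and C share a cluster, so they must land in different
# cycles mod II; a load of B in another cluster is free to overlap.
ok = conflict_free([("C1", 1), ("C1", 2), ("C2", 2)], ii=2)
bad = conflict_free([("C1", 1), ("C1", 3)], ii=2)
```

In the second call the two cluster-C1 operations collide (1 mod 2 == 3 mod 2), so the scheduler would reject that placement and try another cycle.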

Conflict-Free Scheduling Example
 Cluster1 (Bank1): A[i], C[i]; Cluster2 (Bank2): B[i]
[Figure: 2x2 PE array with two banks behind arbitration logic; the modulo schedule places the loads of A[i] and C[i] in different cycles, so each bank serves at most one access per cycle]

Array Clustering
 The array mapping affects performance in at least two ways
  Array size: concentrating arrays in a few banks decreases bank utilization
  Array access count: each array is accessed a certain number of times per iteration; if ∑_{A∈∁} Acc_L^A > II'_L, there can be no conflict-free scheduling (∁: array cluster, II'_L: the current target II of loop L)
 It is therefore important to spread out both array sizes and array accesses

Array Clustering
 Pre-mapping: find the MII for array clustering
 Array analysis: a priority heuristic decides which array to place first
  Priority(A) = Size_A / SzBank + ∑_L Acc_L^A / II'_L
 Cluster assignment: a cost heuristic decides which cluster an array is assigned to
  Cost(∁, A) = Size_A / SzSlack_∁ + ∑_L Acc_L^A / AccSlack_L^∁
 Start from the highest-priority array
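The priority heuristic can be sketched directly from the formula (the access counts, II' targets, array sizes of 1, and bank size of 4 are taken from the appendix example; the helper name is mine):

```python
def priority(size, sz_bank, accesses, targets):
    """Priority(A) = Size_A/SzBank + sum over loops of Acc_L^A / II'_L.
    accesses: {loop: per-iteration access count}, targets: {loop: II'}."""
    return size / sz_bank + sum(a / targets[l] for l, a in accesses.items())

targets = {"L1": 3, "L2": 5}          # II' of the two loops in the example
accs = {"A": {"L1": 1}, "B": {"L1": 3},
        "C": {"L1": 2, "L2": 2}, "D": {"L1": 3, "L2": 2}, "E": {"L2": 3}}
prio = {n: priority(1, 4, a, targets) for n, a in accs.items()}
order = sorted(prio, key=prio.get, reverse=True)   # placement order
```

This reproduces the appendix numbers: D scores 1/4 + 3/3 + 2/5 = 1.65 and is placed first, followed by C, B, E, and A.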

Experimental Setup
 Loop kernels from MiBench and other multimedia benchmarks
 Target architecture
  4x4 heterogeneous CGRA (4 load-store PEs)
  4 local memory banks with arbitration logic (MBA)
  DMQ depth of 4
 Experiment 1: baseline vs. hardware approach vs. compiler approach
 Experiment 2: MAS + MBA vs. MAS + MBAQ

                             MBA                 MBAQ
 Memory-unaware scheduling   Baseline            Hardware approach
 Memory-aware scheduling     Compiler approach

Experiment 1
 MAS shows a 17.3% runtime reduction

Experiment 2
 Stall-free conditions
  MBA: at most one access to each bank in every cycle
  MBAQ: at most N accesses to each bank in every N consecutive cycles
 The DMQ is unnecessary with memory-aware mapping
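Both stall-free conditions reduce to a sliding-window check over per-cycle bank access counts (a sketch; MBA is the special case N = 1, and wrap-around across iteration boundaries is ignored here for brevity):

```python
def stall_free(bank_access_counts, n):
    """True iff every window of n consecutive cycles contains at most n
    accesses to this bank (MBAQ condition; n = 1 is the MBA condition)."""
    c = bank_access_counts
    return all(sum(c[i:i + n]) <= n for i in range(len(c) - n + 1))

mba_ok = stall_free([1, 1, 0, 1], n=1)    # never two accesses in one cycle
mba_bad = stall_free([2, 0, 0, 0], n=1)
mbaq_ok = stall_free([2, 0, 2, 0], n=2)   # MBAQ absorbs short bursts
mbaq_bad = stall_free([3, 0, 0, 0], n=2)
```

Note how the burst [2, 0, ...] violates the MBA condition but satisfies the MBAQ one, which is exactly the freedom the queue buys the scheduler.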

Conclusion
 Bank conflicts are a real problem in realistic memory architectures
 Considering data mapping as well as operation mapping is crucial
 We propose a compiler approach
  Conflict-free scheduling
  Array clustering heuristic
 Compared to the hardware approach
  Simpler, faster architecture with no DMQ
  Performance improvement: up to 40%, 17% on average
 A compiler heuristic can make the DMQ unnecessary

Thank you for your attention!

Appendix

Array Clustering Example

Per-iteration access counts:
 Loop 1 (II' = 3): A: 1, B: 3, C: 2, D: 3
 Loop 2 (II' = 5): C: 2, D: 2, E: 3

Priorities (all array sizes 1, bank size 4):
 A = 1/4 + 1/3 = 0.58
 B = 1/4 + 3/3 = 1.25
 C = 1/4 + 2/3 + 2/5 = 1.32
 D = 1/4 + 3/3 + 2/5 = 1.65
 E = 1/4 + 3/5 = 0.85
Placement order: D (1.65), C (1.32), B (1.25), E (0.85), A (0.58)

Greedy assignment over banks B1-B3:
 D: Cost(B1,D) = Cost(B2,D) = Cost(B3,D) = 1/4 + 3/3 + 2/5 = 1.65 → B1
 C: Cost(B1,C) = infeasible; Cost(B2,C) = Cost(B3,C) = 1/4 + 2/3 + 2/5 = 1.32 → B2
 B: Cost(B1,B) = Cost(B2,B) = infeasible; Cost(B3,B) = 1/4 + 3/3 = 1.25 → B3
 E: Cost(B1,E) = Cost(B2,E) = 1/3 + 3/3 = 1.33; Cost(B3,E) = 1/3 + 3/5 = 0.93 → B3

 If array clustering fails, increase the II and try again
 The II that results from array clustering is called MemMII
 MemMII is determined by the number of accesses to each bank per iteration and the memory access throughput per cycle
 MII = max(ResMII, RecMII, MemMII)
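The whole greedy assignment above can be sketched end to end (a reconstruction under my reading of the slides: each bank's slack starts at the bank size and at each loop's II', and a bank is infeasible for an array when either slack would be exceeded; function and variable names are mine):

```python
def assign(arrays, n_banks, bank_size, targets):
    """arrays: [(name, size, {loop: acc})] in descending priority order.
    Returns {name: bank index}, or None if clustering fails (increase II)."""
    sz_slack = [bank_size] * n_banks
    acc_slack = [dict(targets) for _ in range(n_banks)]
    placement = {}
    for name, size, accs in arrays:
        best, best_cost = None, None
        for b in range(n_banks):
            if size > sz_slack[b] or any(a > acc_slack[b][l]
                                         for l, a in accs.items()):
                continue        # cost "X" (infeasible) on the slide
            c = size / sz_slack[b] + sum(a / acc_slack[b][l]
                                         for l, a in accs.items())
            if best_cost is None or c < best_cost:
                best, best_cost = b, c
        if best is None:
            return None
        placement[name] = best
        sz_slack[best] -= size           # consume size slack
        for l, a in accs.items():
            acc_slack[best][l] -= a      # consume per-loop access slack
    return placement

# The example: banks B1-B3 (indices 0-2), all array sizes 1.
arrays = [("D", 1, {"L1": 3, "L2": 2}), ("C", 1, {"L1": 2, "L2": 2}),
          ("B", 1, {"L1": 3}), ("E", 1, {"L2": 3}), ("A", 1, {"L1": 1})]
placement = assign(arrays, n_banks=3, bank_size=4, targets={"L1": 3, "L2": 5})
```

Under these assumptions the sketch reproduces the slide's outcome: D goes to B1, C to B2, B to B3, and E to B3 (cost 0.93 beats 1.33 on the other banks), with A landing in the remaining feasible bank B2.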

Memory-Aware Mapping
 The goal is to minimize the effective II
  One expected stall per iteration effectively increases the II by 1
  The optimal solution should have no expected stalls: if there is an expected stall in an optimal schedule, one can always find another schedule of the same length with no expected stall
 Stall-free conditions
  At most one access to each bank in every cycle (for MBA)
  At most N accesses to each bank in every N consecutive cycles (for MBAQ)

Application Mapping in CGRA
 Mapping a DFG onto the PE-array mapping space
 Must satisfy several conditions
  Nodes must be mapped onto PEs that have the required functionality
  Data transfer between nodes must be guaranteed
  Resource consumption should be minimized for performance

How to Place Arrays
 Interleaving
  Guarantees a balanced use of all the banks
  Randomizes memory accesses to each bank ⇒ spreads bank conflicts around
 Sequential
  Bank conflicts are predictable at compile time
[Figure: a size-4 array assigned to local memory starting at address 0x00, spanning Bank1 and Bank2]

Proposed Scheduling Flow
 DFG → Pre-mapping → Array analysis → Cluster assignment (array clustering) → Conflict-aware scheduling
 If cluster assignment fails, or if scheduling fails, go back and retry


Conflict-Free Scheduling Example
 Cluster1 (Bank1): A[i], C[i]; Cluster2 (Bank2): B[i]
[Figure: modulo schedule across PE0-PE3 with bank columns CL1 and CL2; the load of A is placed at cycle 1 and the load of B at cycle 2, so neither bank is accessed twice in the same cycle]

Conflict-Free Scheduling with DMQ
 In conflict-free scheduling, the MBAQ architecture can be used to relax the mapping constraint
 Several conflicts can be permitted within the range of the added memory-operation latency