Loop Dissevering: A Technique for Temporally Partitioning Loops in Dynamically Reconfigurable Computing Platforms. 10th Reconfigurable Architectures Workshop (RAW 2003).

Presentation transcript:

Loop Dissevering: A Technique for Temporally Partitioning Loops in Dynamically Reconfigurable Computing Platforms
10th Reconfigurable Architectures Workshop (RAW 2003), Nice, France, April 22, 2003, held in conjunction with the 17th Annual Intl. Parallel & Distributed Processing Symposium (IPDPS 2003)
João M. P. Cardoso, University of Algarve, Faro / INESC-ID, Lisboa, Portugal

Motivation

    for(int i=0; i<8; i++)
      for(int j=0; j<8; j++)
        CosTrans[j+8*i] = CosBlock[i+8*j];

    for(int i=0; i<8; i++)
      for(int j=0; j<8; j++) {
        TempBlock[i+j*8] = 0;
        for(int k=0; k<8; k++)
          TempBlock[i+j*8] += InIm[i+k*8] * CosTrans[k+j*8];
      }

How to map sets of computational structures requiring more resources than available?

Motivation

How to map sets of computational structures requiring more resources than available?
→ Temporal Partitioning

Motivation

How to map sets of computational structures requiring more resources than available?
→ Temporal Partitioning

Other motivations for partitioning computations in time:
- each design is simpler
- may lead to better performance!
- amortize some configuration time by overlapping execution stages
- use of smaller reconfigurable arrays to implement complex applications

For more info: see Cardoso and Weinhardt, DATE 2003

Motivation

How to map sets of computational structures requiring more resources than available?
→ Temporal Partitioning

Computational structures for each loop or set of nested loops are implemented in a single partition.
But, what to do with a loop requiring more resources than available?

Outline
- Motivation
- Configure-Execute Paradigm (execution stages)
- Target Architecture
  - PACT XPP Architecture
  - XPP Configuration Flow
- XPP-VC Compilation Flow
- Temporal Partitioning of Loops
- Experimental Results
- Conclusions & Future Work

Configure-Execute Paradigm (Execution Stages)

Execution stages per configuration: Fetch (f), Configure (c), Compute (comp).

[Timing diagram, over time, comparing four cases:]
- the program in a single configuration
- two configurations, without on-chip context planes and without partial reconfiguration
- with partial reconfiguration
- with on-chip context planes (see the timing sketch below)
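To make the overlap concrete, here is a minimal back-of-the-envelope timing model (mine, not from the paper): the stage durations are made-up constants, and the "overlapped" case corresponds to the best case above (on-chip context planes), where fetch and configure of the next configuration are fully hidden under the current computation.

    /* Illustrative timing model of the configure-execute stages (assumed numbers). */
    #include <stdio.h>

    #define NCONF 2

    static const double f[NCONF]    = {100, 100};   /* fetch (cycles)     */
    static const double c[NCONF]    = { 50,  50};   /* configure (cycles) */
    static const double comp[NCONF] = {400, 400};   /* compute (cycles)   */

    int main(void) {
        /* No overlap: every stage of every configuration is serialized. */
        double t_serial = 0;
        for (int i = 0; i < NCONF; i++)
            t_serial += f[i] + c[i] + comp[i];

        /* Context planes (best case): fetch+configure of configuration i+1
           is hidden under comp[i]; only the part that does not fit adds to
           the critical path. */
        double t_overlap = f[0] + c[0];
        for (int i = 0; i < NCONF; i++) {
            double next   = (i + 1 < NCONF) ? f[i + 1] + c[i + 1] : 0;
            double hidden = (next < comp[i]) ? next : comp[i];
            t_overlap += comp[i] + (next - hidden);
        }

        printf("fully serialized: %.0f cycles\n", t_serial);
        printf("overlapped (context planes): %.0f cycles\n", t_overlap);
        return 0;
    }

With these numbers the overlapped schedule hides the whole 150-cycle fetch+configure of the second configuration; partial reconfiguration falls somewhere in between, since only part of the next configuration can be loaded while the array is busy.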

PACT XPP Architecture (briefly)

Coarse-grained array (X × Y):
- Processing elements (PEs): compute typical ALU operations
- Two columns of SRAMs (Ms)
- I/O ports for data streaming

PACT XPP Architecture (briefly)

- Ready/ack. protocol for each programmable interconnection
- Flow of data (pre-foundry parameterized bit-widths)
- Flow of events (1-bit lines)

PACT XPP Architecture (briefly)

Dynamically reconfigurable:
- On-chip configuration cache (CC) and configuration manager (CM)
- Partial reconfiguration (only the used resources are configured)

[Block diagram: the CM fetches configurations into the CC and configures the array; the array signals requests through CMPort0 / CMPort1 (summarized in the declarations below).]
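For readers who prefer code to block diagrams, the organization above can be summarized with a few illustrative declarations; every type and field name here is invented and does not correspond to PACT's tools or APIs.

    /* Rough structural model of the XPP organization described above;
       all names are invented for illustration and are not PACT's API. */
    #include <stddef.h>

    enum alu_op { OP_ADD, OP_SUB, OP_MUL, OP_SHIFT /* typical ALU operations */ };

    struct pe {                       /* processing element of the coarse-grained array */
        enum alu_op op;               /* configured operation                            */
        /* data links use a ready/ack protocol; 1-bit event lines carry control */
    };

    struct xpp_array {
        size_t rows, cols;            /* X x Y array of PEs                     */
        struct pe *pes;               /* rows * cols processing elements        */
        int *sram_left, *sram_right;  /* two columns of internal SRAMs          */
        /* plus I/O ports for data streaming */
    };

    struct config_manager {
        size_t cache_bytes;           /* on-chip configuration cache (CC)        */
        int cmport0, cmport1;         /* request ports driven by the running array */
        /* partial reconfiguration: only the resources a configuration uses
           are (re)configured */
    };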

XPP Configuration Flow

- Uses 3 stages to execute each configuration: Fetch (f), Configure (c), Compute (comp)
- The array may request the next configuration
- The configuration manager accepts requests and proceeds without intervention from the external host (a small software model follows below)

Example request chain handled by the CM:

    c0;
    If(CMPort0) then c1;
    If(CMPort1) then c2;

[Diagram: the CM fetches c0, c1, c2 into the configuration cache (CC) and configures the array; the running configuration raises its request on CMPort0 / CMPort1.]
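A small software model of this request chain, for illustration only: the functions and types below (cm_fetch, cm_configure, array_compute, ...) are invented and do not correspond to PACT's XDS tools.

    /* Illustrative software model of the configuration flow above.
       All functions and types are invented for illustration. */
    #include <stdio.h>

    enum conf_id { C0, C1, C2, NO_CONF };

    /* Stubs standing in for the real fetch/configure machinery. */
    static void cm_fetch(enum conf_id c)     { printf("fetch     c%d\n", c); }
    static void cm_configure(enum conf_id c) { printf("configure c%d\n", c); }

    /* The running configuration returns the request it raises on CMPort0/1. */
    static enum conf_id array_compute(enum conf_id c) {
        printf("compute   c%d\n", c);
        switch (c) {
        case C0: return C1;   /* request on CMPort0 -> c1 */
        case C1: return C2;   /* request on CMPort1 -> c2 */
        default: return NO_CONF;
        }
    }

    int main(void) {
        /* The CM chains configurations without host intervention. */
        enum conf_id next = C0;
        while (next != NO_CONF) {
            cm_fetch(next);        /* load into the configuration cache */
            cm_configure(next);    /* (partially) configure the array   */
            next = array_compute(next);
        }
        return 0;
    }

The point is that, once c0 is started, the chain c0 → c1 → c2 is driven entirely by the CM reacting to CMPort requests, with no host intervention.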

XPP-VC Compilation Flow

C program → Preprocessing + Dependence Analysis → TempPart (Temporal Partitioning) → MODGen (Module Generation, with pipelining) + Control Code Generation (Reconfiguration) → NML file → xmap → XPP Binary Code

- TempPart: partitions the program and generates reconfiguration statements, which are executed by the Configuration Manager
- MODGen: maps a C subset to NML (PACT proprietary structural language with reconfiguration primitives)

For more info: see Cardoso and Weinhardt, FPL 2002

Temporal Partitioning

- One partition for each node in the Hierarchical Task Graph (HTG) TOP level
- Merge adjacent nodes if the combination of both can be mapped to the XPP device and if the merge does not degrade overall performance
- If an HTG node is too large, create a separate partition for each node of the inner-HTG and call the algorithm recursively (see the code sketch below)

[HTG example with nodes start, Loop 1 ... Loop 4, end and array variables x, coef, tmp, y.]
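A compact sketch of that merge strategy, under the assumption of simple resource/performance predicates; the types and the estimators fits_on_device() and degrades_performance() are invented placeholders, not the XPP-VC code.

    /* Illustrative sketch of the partitioning pass described above (not the
       compiler's actual code). Types and estimators are invented placeholders. */
    typedef struct htg_node {
        struct htg_node *next;    /* next node at this HTG level  */
        struct htg_node *inner;   /* inner HTG (NULL for leaves)  */
        int partition;            /* assigned partition id        */
    } htg_node;

    /* Stub estimators; the real pass would query XPP resource/performance models. */
    static int fits_on_device(const htg_node *a, const htg_node *b) {
        (void)a; (void)b; return 1;
    }
    static int degrades_performance(const htg_node *a, const htg_node *b) {
        (void)a; (void)b; return 0;
    }

    static int next_id = 0;

    /* One partition per HTG node; merge adjacent nodes while the combination
       still fits and does not hurt performance; recurse into nodes too large
       for the device. */
    void partition_level(htg_node *n) {
        for (; n; n = n->next) {
            if (n->inner && !fits_on_device(n, NULL)) {
                partition_level(n->inner);
                continue;
            }
            n->partition = next_id++;
            while (n->next && fits_on_device(n, n->next)
                           && !degrades_performance(n, n->next)) {
                n->next->partition = n->partition;
                n = n->next;
            }
        }
    }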

Temporal Partitioning of Loops

What to do when loops in the program cannot be mapped due to the lack of enough resources?

- Software/reconfigware approach: the control of the loop stays in software; the inner code sections migrate to reconfigware, each one mapped to a single configuration
- Loop Distribution: transforms a loop into two or more loops; each loop has the same iteration-space traversal as the original loop; the inner statements of the original loop are split among the loops
- Loop Dissevering: transforms a loop into a set of configurations; the cyclic behavior is implemented by the configuration flow

Temporal Partitioning of Loops (example for Loop Distribution / Loop Dissevering)

    …
    for(nx=0; nx<X_DIM_BLK; nx++)
      for(ny=0; ny<Y_DIM_BLK; ny++) {
        for(i=0; i<N; i++)
          for(j=0; j<N; j++) {            // Inner Loop 1
            tmp = 0;
            for(k=0; k<N; k++)
              tmp += X[i+ny*N][k+nx*N] * CosBlock[j][k];
            TempBlock[i][j] = tmp;
          }
        // to be partitioned here
        for(i=0; i<N; i++)
          for(j=0; j<N; j++) {            // Inner Loop 2
            tmp = 0;
            for(k=0; k<N; k++)
              tmp += TempBlock[k][j] * CosBlock[i][k];
            Y[i+ny*N][j+nx*N] = tmp;
          }
      }
    …

Loop Distribution

    …
    for(nx=0; nx<X_DIM_BLK; nx++)              // Conf. 1
      for(ny=0; ny<Y_DIM_BLK; ny++)
        for(i=0; i<N; i++)
          for(j=0; j<N; j++) {                 // Inner Loop 1
            tmp = 0;
            for(k=0; k<N; k++)
              tmp += X[i+ny*N][k+nx*N] * CosBlock[j][k];
            TempBlock[i+ny*N][j+nx*N] = tmp;   // TempBlock expanded
          }

    for(nx=0; nx<X_DIM_BLK; nx++)              // Conf. 2
      for(ny=0; ny<Y_DIM_BLK; ny++)
        for(i=0; i<N; i++)
          for(j=0; j<N; j++) {                 // Inner Loop 2
            tmp = 0;
            for(k=0; k<N; k++)
              tmp += TempBlock[k+ny*N][j+nx*N] * CosBlock[i][k];
            Y[i+ny*N][j+nx*N] = tmp;
          }
    …

[Diagram: the two loop nests become two configurations (Conf. 1 and Conf. 2), executed one after the other from begin to end.]

Loop Distribution

- Cannot be applied to all loops: it must not break cycles in the dependence graph of the original loop
- Requires auxiliary array variables:
  - for each loop-independent flow dependence of a scalar variable (known as scalar expansion; see the sketch below)
  - and for each control dependence at the place where we want to partition the loop
- Expansion of some arrays
- But it preserves the software-pipelining potential, and may improve parallelization, cache hit/miss ratio, etc.
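A minimal sketch of scalar expansion, assuming a simple one-dimensional loop (this example is mine, not from the slides): the scalar tmp carries a loop-independent flow dependence from the first statement to the second, so splitting the loop requires expanding tmp into an array.

    /* Minimal, hypothetical example of scalar expansion enabling loop
       distribution (not from the slides). */
    #define N 64
    int a[N], b[N], c[N];

    void original(void) {
        int tmp;
        for (int i = 0; i < N; i++) {
            tmp = a[i] + 1;         /* defines tmp                  */
            b[i] = tmp * c[i];      /* uses tmp (flow dependence)   */
        }
    }

    void distributed(void) {
        int tmp_x[N];               /* tmp expanded into an array       */
        for (int i = 0; i < N; i++)
            tmp_x[i] = a[i] + 1;    /* loop 1: same iteration space     */
        for (int i = 0; i < N; i++)
            b[i] = tmp_x[i] * c[i]; /* loop 2: reads the expanded tmp   */
    }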

Loop Dissevering

            …
            nx = 0; write nx;
    L1:     read nx;
            If(nx >= X_DIM_BLK) goto Finish;
            ny = 0; write ny;
    L3:     read ny; read nx;
            If(ny >= Y_DIM_BLK) goto L4;
            for(i=0; i<N; i++)
              for(j=0; j<N; j++) {      // Inner Loop 1
                …
                TempBlock[i][j] = tmp;
              }
            read ny; read nx;
            for(i=0; i<N; i++)
              for(j=0; j<N; j++) {      // Inner Loop 2
                …
                Y[i+ny*N][j+nx*N] = tmp;
              }
            ny++; write ny; goto L3;
    L4:     nx++; write nx; goto L1;
    Finish: …

[Diagram: the code is split into five configurations (Conf. 1 – Conf. 5) that request each other cyclically through the configuration flow; the write/read statements communicate the scalars nx and ny between configurations.]

Loop Dissevering

- Applicable to every loop
- Relies only on a configuration manager to execute complex loops
- May relieve the host microprocessor to execute other tasks
- No array or scalar expansion (only scalar communication)

But, besides furnishing feasible mappings, is it worth applying?
- Does it lead to efficient solutions (in terms of performance)?
- What are the improvements if the architecture can switch between configurations in a few clock cycles?

Experimental Results

Compared architectures (both with runtime support for partial reconfiguration):
- ARCH-A: word-grained partial reconfiguration
- ARCH-B: context planes, with switching between contexts in a few clock cycles

Experimental Results

Benchmarks (table columns: benchmark, description, #LoC, #loops, #loops after loop distribution):
- DCT: 8×8 Discrete Cosine Transform on an image
- BPIC: binary pattern image coding
- Life: Conway's game of life algorithm

Experimental Results (resource savings)

Using loop dissevering: compared to implementations without loop dissevering, only 44% (DCT), 66% (BPIC), and 85% (Life) of the resources are used.

[Table per benchmark: #configs and #PEs without loop dissevering, #configs and #PEs with loop dissevering, and the resulting #PEs ratio.]

Experimental Results (speedups)

[Chart: DCT speedups on Architecture A (ARCH-A, word-grained partial reconfiguration) vs. Architecture B (ARCH-B, context planes).]

Experimental Results (speedups)

[Chart: Life speedups when applying loop dissevering.]

The benefits of ARCH-B become negligible when the partitions in the loop compute for long times.

Conclusions

- Temporal Partitioning + Loop Dissevering guarantees the mapping of theoretically unlimited computational structures
- Loop Dissevering and Loop Distribution may lead to performance enhancements and savings of resources
- Loop Dissevering:
  - applicable to every loop
  - performance-efficient implementations may require fast reconfiguration
  - the resultant performance may decrease:
    - when innermost loops are partitioned (no more potential for loop pipelining)
    - when each active partition computes for short times (does not amortize the reconfiguration time)

Future Work

- More study on the impact of Loop Dissevering and Loop Distribution
- Understand the impact of the number of context planes, configuration cache size, etc.
- Evaluate loop partitioning when mapping to FPGAs
- Automatic implementation of Loop Distribution
- Methods to decide between Loop Dissevering and Loop Distribution

Acknowledgments (in the paper)

Part of this work was done when the author was with PACT XPP Technologies, Inc., Munich, Germany. We gratefully acknowledge the support of all the members of PACT XPP Technologies, Inc., especially the help of Daniel Bretz, Armin Strobl, and Frank May regarding the XDS tools. A special thanks to Markus Weinhardt for the fruitful discussions about loop dissevering and the XPP-VC compiler.