R. Arce-Nazario, M. Jimenez, and D. Rodriguez Electrical and Computer Engineering University of Puerto Rico – Mayagüez.

Presentation transcript:

R. Arce-Nazario, M. Jimenez, and D. Rodriguez Electrical and Computer Engineering University of Puerto Rico – Mayagüez

2 Motivation and Objective
- Discrete Signal Transforms (DSTs)
  - DFT, DCT: many applications
  - Hardware accelerated, but at a high area cost
- Distributed (dedicated) hardware architectures (DHAs)
  - Cost-effective
  - Partitioning plays a key role
- Objective: use inherent properties of DSTs to improve their hardware partitioning onto distributed hardware architectures.
[Figure: DST, Partitioning, DHA]

3 Previous Work
- Automated partitioning of DSTs to DHAs
  - DSTs treated as any other algorithm/benchmark [Srinivasan01][Bringmann00]
  - Converted to a high-level or structural DFG and treated as such.
- Manual partitioning and automated code generation
  - DST-specific properties exploited [Kumhom01]
  - New formulations developed to exploit architectural features [VanLoan92]
  - SPIRAL and FFTW: code generation platforms that explore the space of equivalent algorithms ([Pueschel05], [Frigo05])
- [Arce05]: automated partitioning methodology that incorporates DST features and formulation exploration

4 Partitioning Methodology
[Figure: block diagram of the methodology. Inputs: KPA DST formulation, architectural description. Blocks: Formulation Manipulator, Formulation to DFG, Partition/Placement, Estimators, Heuristic Control. Labels: KPA formulation, DFG, hypergraph representation, cost and indicators, rule selection. Output: high-level partition solution.]

5 DSTs – General Concepts
- General formula for a d-dimensional DST
- Essentially a vector-matrix multiplication
- Fast versions exist, using divide-and-conquer techniques
  - Highly regular
  - Highly connected
- Rules can be applied at the formulation level: permutation, index-set, ...
- The α's determine the type of transform, e.g. DFT:
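The formulas on this slide did not survive in the transcript. As a reference point only, the one-dimensional case of a transform defined by coefficients α can be written as the matrix-vector product below. This is the standard textbook form, not necessarily the d-dimensional formula the authors showed; the symbols x, X, N, k and n are assumptions, and the sign of the DFT exponent varies between conventions.

```latex
\begin{align}
X[k] &= \sum_{n=0}^{N-1} \alpha_{k,n}\, x[n], \qquad k = 0, \dots, N-1 \\
\alpha_{k,n} &= e^{-2\pi i\, kn / N} \qquad \text{(DFT)}
\end{align}
```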

6 Kronecker Algebra
- Compact framework for the formulation of DSTs
  - Multidimensional, e.g.
  - Fast versions of DSTs
- Governed by well-known rules and properties
- Formulation 'implies' structure
[Figure: F4 structure composed of F2 and W blocks]
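To make "formulation implies structure" concrete, the sketch below builds the textbook Cooley-Tukey factorization F_N = (F_m ⊗ I_n) T (I_m ⊗ F_n) L for N = m·n and checks it numerically against the dense DFT matrix. The helper names (`dft_matrix`, `kron`, `twiddle`, `stride_perm`, `cooley_tukey`) are hypothetical and are not taken from the authors' tool; the forward DFT is assumed to use a negative exponent.

```python
import cmath

def dft_matrix(n):
    """Dense n-point DFT matrix F_n with entries w^(r*c), w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (r * c) for c in range(n)] for r in range(n)]

def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]

def kron(a, b):
    """Kronecker product a (x) b of two matrices given as lists of rows."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def twiddle(m, n):
    """Diagonal twiddle matrix: entry (i*n + j) is w_N^(i*j), with N = m*n."""
    size = m * n
    w = cmath.exp(-2j * cmath.pi / size)
    t = [[0] * size for _ in range(size)]
    for i in range(m):
        for j in range(n):
            t[i * n + j][i * n + j] = w ** (i * j)
    return t

def stride_perm(m, n):
    """Stride permutation: output position i*n + j reads input position j*m + i."""
    size = m * n
    p = [[0] * size for _ in range(size)]
    for i in range(m):
        for j in range(n):
            p[i * n + j][j * m + i] = 1
    return p

def cooley_tukey(m, n):
    """(F_m (x) I_n) * T * (I_m (x) F_n) * L, which should equal F_{m*n}."""
    left = kron(dft_matrix(m), identity(n))
    right = kron(identity(m), dft_matrix(n))
    return matmul(matmul(matmul(left, twiddle(m, n)), right), stride_perm(m, n))

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

if __name__ == "__main__":
    for m, n in [(2, 2), (2, 4), (4, 2)]:
        print(m, n, max_abs_diff(cooley_tukey(m, n), dft_matrix(m * n)) < 1e-9)
```

Each factor maps onto one piece of the dataflow graph: the two Kronecker factors are the computational stages, and the stride permutation is the inter-stage wiring.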

7 Target topology
- Similar to existing platforms in the market and academia:
  - Annapolis Micro Systems (Wildforce)
  - Gidel (PROC20KE)
  - Berkeley Emulation Engine (BEE), being proposed as a cost-effective alternative to traditional high-performance computing systems.
[Figure: M0/D0, M1/D1, ..., Mk-1/Dk-1, Crossbar]

8 Partitioning Methodology
[Figure: block diagram of the methodology. Inputs: KPA DST formulation, architectural description. Blocks: Formulation Manipulator, Formulation to DFG, Partition/Placement, Estimators, Heuristic Control. Labels: KPA formulation, DFG, hypergraph representation, cost and indicators, rule selection. Output: high-level partition solution.]

9 DST properties in our methodology
- Graph considerations incorporated into the partitioning/placement process
- Exploration of equivalent formulations
[Figure: Partition/Placement block]

10 Graph partitioning considerations
- Focus on horizontal partitioning schemes (SIMD-like implementation)
  - Initial solution = balanced horizontal linear partitioning
  - Scheduling consideration: swap nodes from the same computational stage.
- Kernighan-Lin bipartitioning
- Heterogeneous channel k-way partitioning
[Figure: M0/D0, M1/D1, ..., Mk-1/Dk-1, Crossbar]
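The points above can be illustrated with a small sketch: a radix-2 butterfly graph stands in for the DST dataflow graph, the initial solution assigns contiguous blocks of rows to each device, and swap candidates are restricted to node pairs in the same stage. The graph model and every function name here are assumptions made for illustration; this is not the authors' partitioner and does not implement the full Kernighan-Lin pass.

```python
from itertools import combinations

def butterfly_dfg(n):
    """Edges of a radix-2 butterfly graph with n rows (n a power of two).

    Nodes are (stage, row). Each node feeds the next stage through a
    straight edge and a cross edge to the row whose bit `stage` is flipped.
    """
    stages = n.bit_length() - 1
    edges = []
    for s in range(stages):
        for r in range(n):
            edges.append(((s, r), (s + 1, r)))
            edges.append(((s, r), (s + 1, r ^ (1 << s))))
    return edges

def horizontal_partition(n, k):
    """Balanced horizontal partition: rows are split into k contiguous blocks."""
    stages = n.bit_length() - 1
    rows_per_part = n // k
    return {(s, r): r // rows_per_part
            for s in range(stages + 1) for r in range(n)}

def cut_size(edges, part):
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])

def same_stage_swaps(part):
    """Candidate swaps: node pairs in the same stage but in different parts."""
    return [(u, v) for u, v in combinations(sorted(part), 2)
            if u[0] == v[0] and part[u] != part[v]]

if __name__ == "__main__":
    edges = butterfly_dfg(8)
    for k in (2, 4):
        part = horizontal_partition(8, k)
        print(k, cut_size(edges, part), len(same_stage_swaps(part)))
```

Restricting swaps to a single stage keeps every device working on the same stage at the same time, which is what makes the result SIMD-like, and it shrinks the move set that the refinement loop has to evaluate.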

11 Formulation exploration
[Figure: exploration loop. Blocks: Formulation Manipulator, Formulation to DFG, Partition/Placement, Heuristic Control. Labels: KPA formulation, DFG, cost and indicators, rule selection.]
- Formulation Manipulator
  - Applies permutation and factorization rules to the Kronecker formulation of DSTs to obtain equivalent formulations
  - The number of possible reformulations grows exponentially with DST size
- Heuristic control method: first answer these questions
  - Do reformulations have an effect on solution quality?
  - How can we effectively explore the equivalent formulation space to find more apt formulations?
- Experiments: gain an understanding of algorithmic-level effects on solution quality and convergence.

12 Measuring quality of solution
- Cost defined in terms of the 'weight' of channel i and the required communications through channel i
- Example: W01 = W12 = W23 = 1, WXBAR = 2
[Figure: two diagrams of devices D0, D1, D2, D3]
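The cost formula itself is missing from the transcript. One plausible reading of the surviving text is a weighted sum: each channel's weight multiplied by the number of communications routed through it. The sketch below assumes that form. The function name and the communication counts are hypothetical; only the channel weights come from the slide's example.

```python
def partition_cost(weights, communications):
    """Assumed cost: sum over channels of weight * required communications."""
    return sum(weights[channel] * count
               for channel, count in communications.items())

# Channel weights from the slide's example.
weights = {"01": 1, "12": 1, "23": 1, "XBAR": 2}

# Hypothetical communication counts per channel, for illustration only.
communications = {"01": 4, "12": 0, "23": 4, "XBAR": 8}

if __name__ == "__main__":
    print(partition_cost(weights, communications))
```

Under this reading, giving the crossbar a higher weight than the neighbor-to-neighbor channels penalizes placements that send traffic between non-adjacent devices.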

13 Experiment #1 – Inter-stage permutations
- Since Cooley-Tukey's FFT, several common formulations have become available (Pease formulation shown on the slide).
- Experiment: several sizes of 5 common formulations were partitioned.
- Inter-stage permutations (ISPs) have an effect on solution quality, yet there is no clear winning formulation.
[Figure: results for the Stockham, Tr. Stockham, Cooley-Tukey, G. Sande, and Pease formulations]

14 Experiment #2 – Granularity
- Granularity: the weight of the nodes for the various computational stages of the transform.
[Figure: coarser to finer granularity]

15 Experiment #2 – Granularity
- Decomposition rules: large DST = combination of smaller DSTs, analogous to node clustering
- Effect of topology: ring vs. linear, 57% cost reduction
- Finest granularity is not necessarily best.
* Multiple formulations achieved the best cost. Coarsest granularity is shown.

16 Experiment #3 – Breakdown strategy
- Breakdown strategy: the order and divisors with which a transform is decomposed.
- Split trees: a common graphical representation of a breakdown strategy.
- Example: two split trees, (a) and (b), for a DFT of size 64.
[Figure: split trees (a) and (b)]

17 Experiment #3 – Results
Procedure:
- Exhaustive generation of split trees for DFT sizes n = 16 to 256.
- Formulations partitioned for various topologies
- Observation of split tree decisions that lead to 'partition friendly' formulations
- Generation of n > 256 formulations using rules.
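The exhaustive generation step in the procedure above can be sketched as a short recursion. The sketch assumes binary split trees in which a node of size n is either left unsplit (a leaf) or split into an ordered factor pair (a, b) with a·b = n, and each child is expanded the same way. The authors' enumeration may use different rules, and the name `split_trees` is hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def split_trees(n):
    """All binary split trees for a transform of size n.

    A tree is either the integer n (leaf, no further decomposition) or a
    tuple (n, left, right) whose subtrees have sizes a and b with a * b == n.
    """
    trees = [n]
    for a in range(2, n):
        if n % a == 0:
            b = n // a
            for left in split_trees(a):
                for right in split_trees(b):
                    trees.append((n, left, right))
    return tuple(trees)

if __name__ == "__main__":
    for size in (16, 32, 64):
        print(size, len(split_trees(size)))
    # One of the trees for size 64: split as 8 x 8, then each 8 as 2 x 4.
    print((64, (8, 2, 4), (8, 2, 4)) in split_trees(64))
```

Even under these simple assumptions the number of trees grows quickly with n, which is consistent with exhaustive generation being used only up to a bounded size and rules being used beyond it.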

18 Conclusions and Future Work
- Methodology for partitioning of DSTs to DHAs:
  - DST graph considerations
  - Formulation exploration
- Graph considerations
  - Generation of a linear initial partition: provides better results than a random one.
  - Limitation of node moves: faster convergence time.
- Exploration at the algorithmic level: experiments
  - Isolated features such as permutations and granularity
    - An effect was evidenced, but it is hard to establish a relation to solution quality.
    - Coarse granularity = better convergence, good solution quality
  - Breakdown strategy: 'partition friendly' formulations generated.
- Current work:
  - Experimentation with DCTs.
  - Experimentation with other properties, to define an overall exploration strategy

19 Acknowledgements
- Puerto Rico Experimental Program to Stimulate Competitive Research (PR-EPSCoR)
- WALSAIP - Wide-Area Large Scale Automated Information Project
- Puerto Rico NASA Space Grant
QUESTIONS?