1 MOTIVATION AND OBJECTIVE  Discrete Signal Transforms (DSTs) –DFT, DCT: major performance component in many applications –Hardware accelerated but at.

Slides:



Advertisements
Similar presentations
D ARMSTADT, G ERMANY - 11/07/2013 A Framework for Effective Exploitation of Partial Reconfiguration in Dataflow Computing Riccardo Cattaneo ∗, Xinyu Niu†,
Advertisements

Multi-dimensional Packet Classification on FPGA: 100Gbps and Beyond
Presentation of Designing Efficient Irregular Networks for Heterogeneous Systems-on-Chip by Christian Neeb and Norbert Wehn and Workload Driven Synthesis.
1 An Adaptive GA for Multi Objective Flexible Manufacturing Systems A. Younes, H. Ghenniwa, S. Areibi uoguelph.ca.
Implementation of 2-D FFT on the Cell Broadband Engine Architecture William Lundgren Gedae), Kerry Barnes (Gedae), James Steed (Gedae)
1 of 14 1 /23 Flexibility Driven Scheduling and Mapping for Distributed Real-Time Systems Paul Pop, Petru Eles, Zebo Peng Department of Computer and Information.
Selectivity-Based Partitioning Alkis Polyzotis UC Santa Cruz.
Sharing Aggregate Computation for Distributed Queries Ryan Huebsch, UC Berkeley Minos Garofalakis, Yahoo! Research † Joe Hellerstein, UC Berkeley Ion Stoica,
Common Subexpression Elimination Involving Multiple Variables for Linear DSP Synthesis 15 th IEEE International Conference on Application Specific Architectures.
Process Scheduling for Performance Estimation and Synthesis of Hardware/Software Systems Slide 1 Process Scheduling for Performance Estimation and Synthesis.
Architectural-Level Prediction of Interconnect Wirelength and Fanout Kwangok Jeong, Andrew B. Kahng and Kambiz Samadi UCSD VLSI CAD Laboratory
Scheduling with Optimized Communication for Time-Triggered Embedded Systems Slide 1 Scheduling with Optimized Communication for Time-Triggered Embedded.
Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two-Term Common Subexpressions IEEE/ACM Asia South Pacific Design Automation.
Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott.
Trend towards Embedded Multiprocessors Popular Examples –Network processors (Intel, Motorola, etc.) –Graphics (NVIDIA) –Gaming (IBM, Sony, and Toshiba)
End-to-End Design of Embedded Real-Time Systems Kang G. Shin Real-Time Computing Laboratory EECS Department The University of Michigan Ann Arbor, MI
1 of 14 1 / 18 An Approach to Incremental Design of Distributed Embedded Systems Paul Pop, Petru Eles, Traian Pop, Zebo Peng Department of Computer and.
Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)
Torino (Italy) – June 25th, 2013 Ant Colony Optimization for Mapping, Scheduling and Placing in Reconfigurable Systems Christian Pilato Fabrizio Ferrandi,
Domain decomposition in parallel computing Ashok Srinivasan Florida State University COT 5410 – Spring 2004.
Tradeoff Analysis for Dependable Real-Time Embedded Systems during the Early Design Phases Junhe Gan.
Decentralised Coordination of Mobile Sensors School of Electronics and Computer Science University of Southampton Ruben Stranders,
Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields Yong-Joong Kim Dept. of Computer Science Yonsei.
An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.
Optimal Parallelogram Selection for Hierarchical Tiling Authors: Xing Zhou, Maria J. Garzaran, David Padua University of Illinois Presenter: Wei Zuo.
03/12/20101 Analysis of FPGA based Kalman Filter Architectures Arvind Sudarsanam Dissertation Defense 12 March 2010.
CPACT – The Conditional Parameter Adjustment Cache Tuner for Dual-Core Architectures + Also Affiliated with NSF Center for High- Performance Reconfigurable.
A Thermal-Aware Mapping Algorithm for Reducing Peak Temperature of an Accelerator Deployed in a 3D Stack A Thermal-Aware Mapping Algorithm for Reducing.
A performance analysis of multicore computer architectures Michel Schelske.
1 SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices Gangyi Zhu, Yi Wang, Gagan Agrawal The Ohio State University.
Network Aware Resource Allocation in Distributed Clouds.
1 SOC Test Architecture Optimization for Signal Integrity Faults on Core-External Interconnects Qiang Xu and Yubin Zhang Krishnendu Chakrabarty The Chinese.
Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:
Energy saving in multicore architectures Assoc. Prof. Adrian FLOREA, PhD Prof. Lucian VINTAN, PhD – Research.
Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies.
June 21, 2007 Minimum Interference Channel Assignment in Multi-Radio Wireless Mesh Networks Anand Prabhu Subramanian, Himanshu Gupta.
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
Carnegie Mellon Generating High-Performance General Size Linear Transform Libraries Using Spiral Yevgen Voronenko Franz Franchetti Frédéric de Mesmay Markus.
Performance Enhancement of Video Compression Algorithms using SIMD Valia, Shamik Jamkar, Saket.
4.2.1 Programming Models Technology drivers – Node count, scale of parallelism within the node – Heterogeneity – Complex memory hierarchies – Failure rates.
TEMPLATE DESIGN © Hardware Design, Synthesis, and Verification of a Multicore Communication API Ben Meakin, Ganesh Gopalakrishnan.
Tao Lin Chris Chu TPL-Aware Displacement- driven Detailed Placement Refinement with Coloring Constraints ISPD ‘15.
R. Arce-Nazario, M. Jimenez, and D. Rodriguez Electrical and Computer Engineering University of Puerto Rico – Mayagüez.
Design of a High-Throughput Low-Power IS95 Viterbi Decoder Xun Liu Marios C. Papaefthymiou Advanced Computer Architecture Laboratory Electrical Engineering.
Kronecker Products-based Regularized Image Interpolation Techniques
MILAN: Technical Overview October 2, 2002 Akos Ledeczi MILAN Workshop Institute for Software Integrated.
DIPARTIMENTO DI ELETTRONICA E INFORMAZIONE Novel, Emerging Computing System Technologies Smart Technologies for Effective Reconfiguration: The FASTER approach.
Task Graph Scheduling for RTR Paper Review By Gregor Scott.
A Memory-hierarchy Conscious and Self-tunable Sorting Library To appear in 2004 International Symposium on Code Generation and Optimization (CGO ’ 04)
Optimal Reverse Prediction: Linli Xu, Martha White and Dale Schuurmans ICML 2009, Best Overall Paper Honorable Mention A Unified Perspective on Supervised,
Outline Motivation and Contributions Related Works ILP Formulation
Physically Aware HW/SW Partitioning for Reconfigurable Architectures with Partial Dynamic Reconfiguration Sudarshan Banarjee, Elaheh Bozorgzadeh, Nikil.
Requirement Engineering with URN: Integrating Goals and Scenarios Jean-François Roy Thesis Defense February 16, 2007.
Ontologies for the Semantic Web Prepared By: Tseliso Molukanele Rapelang Rabana Supervisor: Associate Professor Sonia Burman 20 July 2005.
Improving System Availability in Distributed Environments Sam Malek with Marija Mikic-Rakic Nels.
Custom Reduction of Arithmetic in Linear DSP Transforms S. Misra, A. Zelinski, J. C. Hoe, and M. Püschel Dept. of Electrical and Computer Engineering Carnegie.
Fang Fang James C. Hoe Markus Püschel Smarahara Misra
Dynamo: A Runtime Codesign Environment
CDSC/InTrans Review Oct , 2016 Student names: Martin Kong, OSU/Rice
Non-additive Security Games
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Mapping Techniques Dr. Xiao Qin Auburn University.
Bank-aware Dynamic Cache Partitioning for Multicore Architectures
An Improved Split-Row Threshold Decoding Algorithm for LDPC Codes
Algorithms for Budget-Constrained Survivable Topology Design
Unit 2: Fundamentals of Computer Systems
A Robotic Cloud Advisory Service
A Framework for Testing Query Transformation Rules
 = N  N matrix multiplication N = 3 matrix N = 3 matrix N = 3 matrix
Search-Based Approaches to Accelerate Deep Learning
Presentation transcript:

1 MOTIVATION AND OBJECTIVE  Discrete Signal Transforms (DSTs) –DFT, DCT: major performance component in many applications –Hardware accelerated but at high area cost –Example: 4P DFT formulation and others  Distributed (dedicated) hardware architectures (DHAs) –Partitioning plays key role in performance and resource optimization.  Partitioning beyond multi-device –Multi-core GPPs: IBM CELL BE, Intel Core Duo –System-on-Chip: Network-On-Chip  Need tools to aid in partition exploration, mapping, implementation  design automation  Can we improve partitioning by introducing high-level DST considerations? DST Partitioning Distributed Hardware Architecture

2 DMAGIC METHODOLOGY DMAGIC = DST Mapping using Algorithmic and Graph Interaction and Computation 2 Formulation Manipulator Formulation To DFG Heuristic Control Partition/ Placement Estimators High-level partition solution KPA Formulation DFG Cost and Indicators Rule Selection KPA DST Formulation Architectural Description HypergraphDST Formulation

TOOLS Weighted DFG with level information Operator matrices: Identity, Transform, Permutation, Unitary, Unitary Transpose, Twiddle KPA operations: Tensor product (  ), Direct Sum (  ), Matrix Multiplication KTG Graph partitioning/placement algorithm Kronecker to Graph Tool  P/P inspired by Kernighan Lin bipartition heuristic  Extended to k-way partitioning for heterogeneous channels  Cost function sensible to DHA main concerns  DST graphical considerations heuristics weight of channel i required comm. through i Cost Function 3

FORMULATION EXPLORATION 4 Use DST rules to explore space of equivalent formulations in search for one that better suits the target architecture. Combinatorial explosion of the solution space. Find rules amenable to hardware implementation. Conducted experiments to assess the impact of transformations on partition quality. Results used to devise exploration strategy. Objective: Challenges: Approach: Experiments: 4 Effect of inter-column permutations Have effect on solution quality, yet hard to establish heuristic. Effect of node granularity Effect of breakdown strategy using DST decomposition rule Greedy ‘top-down’ formulation exploration using breakdown strategy Observations on exhaustive small sized transforms

5 RESULTS  Latency Reduction: Average = 23.3%, Peak = 34.1%  Run-Time Reduction: Average = 98.0%, Peak = 99.9%

6 CONCLUSIONS AND CONTRIBUTIONS  Multiple opportunities to improve partitioning by taking advantage of DST and DHA features. –Graph level: regularity of permutations and operations. –Graph partitioning and area estimator can be made more sensible to DHA and DST concerns –Algorithmic level –Reformulation has significant impact on partition quality –Improvements over generic methodology  Latency reduction (23.3% avg, 34.1% max), Runtime (98.8% avg, 99.9% max)  Tools: –KTG: automated and extensible methodology for conversion of Kronecker product algebra (KPA) formulations into DFGs –Architectural model and high-level estimator for the implementation of distributed DSTs –Graph partitioning heuristic for k-way for DSTs to DHAs  New heuristic for exploring DST formulation space  New arbitrary decomposition DCT formulation