MOBY-DIC final workshop, Noordwijkerhout (NL), August 23, 2012. Circuit implementations. Marco Storace.



Normal forms: PWA generic (PWAG)
-) more focused on accuracy of the corresponding digital circuit implementations
-) meets the requirements of ‘direct’ synthesis methods
-) the architecture requires a Finite State Machine
[Figures: example of a generic PWA function; corresponding (irregular) domain partition]

Normal forms: PWA Simplicial (PWAS)
-) more focused on speed of the corresponding digital circuit implementations
-) suitable for both ‘indirect’ and ‘direct’ synthesis methods
[Figures: example of a PWAS function over ($x_1$, $x_2$); corresponding (simplicial) domain partition]

Normal forms: PWA multi-resolution hyper-rectangular (mPWAR)
-) suitable for both ‘indirect’ and ‘direct’ synthesis methods
-) can represent even discontinuous PWA functions
-) the architecture requires a Finite State Machine
[Figures: example of an mPWAR function; corresponding domain partition]

Normal forms: PWA single-resolution hyper-rectangular (sPWAR)
-) simpler digital architecture w.r.t. mPWAR, resulting in faster evaluation time
-) can represent even discontinuous PWA functions
[Figures: example of an sPWAR function; corresponding domain partition]

Point location problem
All normal forms → PWA functions = sets of affine functions defined over subregions of a given domain.
From a computational standpoint, we have to solve 2 problems.
1) Point location problem: find the polytope a given input point belongs to. The chosen normal form → higher or lower computational effort (e.g., a regular partition decreases the required computational effort). Circuit point of view: this problem can be a bottleneck for some normal forms and for large input dimensions.
2) Computation of the affine function defined over a given polytope. This requires just a memory + products and sums → the problem is trivial from a circuit point of view, but the required memory can become very large for high input dimensions.
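The two problems above can be sketched in software as follows. This is only an illustrative sketch, not the circuit described in the slides: the half-space encoding of the polytopes, the brute-force linear scan, and the names `locate` and `evaluate` are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple

# Hypothetical encoding: a polytope is a list of half-spaces (h, k), meaning h . x <= k.
HalfSpace = Tuple[Sequence[float], float]
Polytope = List[HalfSpace]
# One affine map per polytope: rows of F_j and entries of g_j, so f(x) = F_j x + g_j.
Affine = Tuple[Sequence[Sequence[float]], Sequence[float]]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def locate(x: Sequence[float], polytopes: List[Polytope]) -> int:
    """Problem 1 (point location): index of the first polytope containing x."""
    for j, poly in enumerate(polytopes):
        if all(dot(h, x) <= k for h, k in poly):
            return j
    raise ValueError("x lies outside the partitioned domain")


def evaluate(x: Sequence[float], polytopes: List[Polytope],
             affine: List[Affine]) -> List[float]:
    """Problem 2: read the coefficients of the located polytope, then products and sums."""
    F, g = affine[locate(x, polytopes)]
    return [dot(row, x) + gi for row, gi in zip(F, g)]


# 1-D toy example on [-1, 1]: f(x) = -x for x <= 0 and f(x) = 2x for x >= 0.
polytopes = [
    [([1.0], 0.0), ([-1.0], 1.0)],   # -1 <= x <= 0
    [([-1.0], 0.0), ([1.0], 1.0)],   #  0 <= x <= 1
]
affine = [([[-1.0]], [0.0]), ([[2.0]], [0.0])]
```

The linear scan makes the cost of step 1 grow with the number of polytopes, which is why the choice of normal form (and of search structure) matters for the circuit.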

PWAG: Some details
Reference (hyperrectangular) domain scaled to: $S_C = \{x \in \mathbb{R}^n : -1 \le x_i < 1,\ i = 1,\dots,n\}$
PWAG function $f$ defined over $S_C$: $f(x) = f_j' x + g_j$ for any $x \in \Omega_j$
$S_C$ is partitioned into generic polytopic regions $\Omega_j$ ($j = 1,\dots,L_G$), such that $S_C = \bigcup_{j=1}^{L_G} \Omega_j$ and $\Omega_j \cap \Omega_k = \emptyset$ for any $j \ne k$

Point location strategy for PWAG
Many algorithms have been proposed to solve this problem (e.g., see works by T.A. Johansen et al.), but not all of them are suitable for circuit implementation.
A quite efficient algorithm that solves the point location problem and is suitable for circuit implementation is based on a binary search tree.
The tree is computed off-line based on the domain partition: each non-leaf node ↔ a partition edge and each leaf node ↔ one of the $L_G$ polytopes.
The tree is explored on-line: the search complexity (→ computation time) depends mainly on the maximum depth of the tree, which in turn depends on $L_G$ and on the shapes of the polytopes.

Point location strategy for PWAG (continued)
Also, the tree exploration requires the evaluation of the partition edges corresponding to non-leaf nodes, in order to continue the search in the tree branch containing the leaf node a given input belongs to.
This kind of data structure is implemented in the circuit through a Finite State Machine.
Main challenge: to minimize the tree depth (→ minimize the time required by the on-line tree exploration) through the off-line optimization. Also relevant is keeping the total number of nodes at a minimum, as this would decrease the circuit dimensions.
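A minimal software sketch of the on-line tree exploration is given below. The classes `Node` and `Leaf`, the hyperplane encoding `h . x = k` of a partition edge, and the returned depth counter are assumptions made for illustration; the off-line tree construction is not shown.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple, Union


@dataclass
class Leaf:
    region: int                      # index j of one of the L_G polytopes


@dataclass
class Node:
    h: Sequence[float]               # partition edge, encoded as the hyperplane h . x = k
    k: float
    below: Union["Node", Leaf]       # branch taken when h . x <= k
    above: Union["Node", Leaf]       # branch taken when h . x >  k


def locate(x: Sequence[float], tree: Union[Node, Leaf]) -> Tuple[int, int]:
    """Descend the tree; return (region index, number of edges evaluated)."""
    node, depth = tree, 0
    while isinstance(node, Node):
        side = sum(hi * xi for hi, xi in zip(node.h, x)) - node.k
        node = node.below if side <= 0 else node.above
        depth += 1
    return node.region, depth


# Toy 2-D partition: region 0 is x1 <= 0; the half x1 > 0 is split again by x2 = 0.
tree = Node([1.0, 0.0], 0.0,
            below=Leaf(0),
            above=Node([0.0, 1.0], 0.0, below=Leaf(1), above=Leaf(2)))
```

Each loop iteration evaluates one partition edge, so the number of iterations is bounded by the maximum tree depth d, the quantity the off-line optimization tries to minimize.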

Digital architectures
There are basically 2 architectures:
A) mainly serial (simpler, but slower)
B) mainly parallel (faster, but more complex)
Both of them can have 3 input acquisition methods (with increasing complexity and speed):
-) serial bit-wise (at each clock cycle one bit of all input components is read)
-) serial component-wise (at each clock cycle a whole input component is read)
-) parallel (all components are read together in one clock cycle).
Then we have 6 possible combinations, allowing one to choose the best trade-off between speed and size of the circuit.

Digital architectures
[Block diagram of the PWAG architecture, with the following annotations]
-) three possible acquisition methods
-) binary search tree
-) contains a memory and either a Multiply and Accumulate (MAC) block (serial architecture) or a bank of multipliers and adders (parallel architecture)
-) can add delays to meet the sampling times

Latency times
n: dimension of the input (i.e., number of components of the input vector x)
m: dimension of the output (i.e., number of components of the PWA function f)
d: maximum depth of the binary search tree
b: number of bits used to code the input

Input method   PWAG(A) serial        PWAG(B) parallel
bitwise        d(n+2)+m(n+2)+b+2     2d+2m+b+3
comp-wise      d(n+2)+m(n+2)+n+2     2d+2m+n+3
parallel       d(n+2)+m(n+2)+3       2d+2m+4
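The latency expressions of the PWAG table can be evaluated with a small helper such as the one below. The function name `pwag_latency` and the dictionary layout are illustrative choices; the formulas are the ones listed in the table, and the result is a number of clock cycles.

```python
def pwag_latency(n: int, m: int, d: int, b: int) -> dict:
    """Latency in clock cycles for the 6 PWAG combinations (input method, architecture)."""
    serial_core = d * (n + 2) + m * (n + 2)    # common part of architecture A
    parallel_core = 2 * d + 2 * m              # common part of architecture B
    return {
        ("bitwise", "A"): serial_core + b + 2,
        ("bitwise", "B"): parallel_core + b + 3,
        ("comp-wise", "A"): serial_core + n + 2,
        ("comp-wise", "B"): parallel_core + n + 3,
        ("parallel", "A"): serial_core + 3,
        ("parallel", "B"): parallel_core + 4,
    }


# Illustrative values only: n = 3 inputs, m = 1 output, tree depth d = 12, b = 12 bits.
latencies = pwag_latency(n=3, m=1, d=12, b=12)
for (method, arch), cycles in sorted(latencies.items()):
    print(f"PWAG({arch}), {method} input: {cycles} clock cycles")
```

The serial architecture pays for every tree level with n+2 cycles, so its latency is dominated by the product d(n+2), while the parallel one grows only as 2d.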

The other normal forms
When d and/or n become too large, the PWAG circuit solution can be either too slow (→ impossible to meet the sampling times) or too complex (→ impossible to implement it in a given FPGA board).
In this case, we can resort to the other normal forms, which can be used to approximate the PWAG function through a PWA controller (PWAS, mPWAR or sPWAR) that is usually faster and/or simpler.

PWAS: Some details
Reference (hyperrectangular) domain scaled to: $S_C = \{x \in \mathbb{R}^n : 0 \le x_i < m_i,\ i = 1,\dots,n\}$
For circuit reasons, we choose $m_i = 2^{p_i} - 1$, with $p_i$ a positive integer. Often $m_i = m_S\ \forall i$.
[Figure: simplicial partition of the domain, with a simplex and a vertex labelled]

Point location strategy for PWAS
Main idea: the regular partition of the domain makes the point location problem quite a simple task, owing to the Kuhn lemmas.
Drawback: the number of coefficients (i.e., size of the memory) equals the number of vertices → curse of dimensionality. The effects of this problem can be reduced by adding a pre-scaler block that allows non-uniform simplicial partitions.
Main challenge: to find a uniform or non-uniform simplicial partition that best approximates a given PWAG function while keeping the total number of nodes at a minimum, as this would decrease the circuit dimensions.

Digital architectures
There is no need for a Finite State Machine to solve the point location problem. The main operation required to this end is sorting the decimal (i.e., fractional) parts of the input components in ascending order.
Again, there are basically 2 architectures, A) serial and B) parallel, with the 3 input acquisition methods described in the PWAG case. Then we have 6 possible combinations, allowing one to choose the best trade-off between speed and size of the circuit.
The only function of the Output block is to wait for some clock cycles in order to meet the correct sampling times, since the PWAS architectures have latencies depending only on n, m, and b.
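The role of the sorter can be illustrated with the following sketch of a scalar PWAS evaluation on a unit-step simplicial grid. It assumes the standard construction in which sorting the fractional parts selects the n+1 vertices of the simplex containing x together with their interpolation weights; the dictionary `weights` (vertex → stored coefficient) and the function name `pwas_eval` are hypothetical, and no pre-scaler is modelled.

```python
import math
from typing import Dict, Sequence, Tuple


def pwas_eval(x: Sequence[float], weights: Dict[Tuple[int, ...], float]) -> float:
    """Interpolate the vertex values of the simplex containing x (scalar output)."""
    base = [math.floor(xi) for xi in x]               # integer parts
    frac = [xi - bi for xi, bi in zip(x, base)]       # decimal (fractional) parts
    order = sorted(range(len(x)), key=lambda i: frac[i])   # ascending sort

    vertex = [bi + 1 for bi in base]   # start from the "upper" corner of the hypercube
    prev, value = 0.0, 0.0
    for i in order:
        value += (frac[i] - prev) * weights[tuple(vertex)]
        vertex[i] -= 1                 # move to the next vertex of the simplex
        prev = frac[i]
    value += (1.0 - prev) * weights[tuple(vertex)]    # last vertex: the integer parts
    return value


# Toy check: sample an affine function on the vertices of a 3 x 3 grid (m_1 = m_2 = 3);
# a PWAS function with these coefficients must reproduce it exactly.
weights = {(i, j): 2.0 * i - 3.0 * j + 1.0 for i in range(4) for j in range(4)}
```

The n+1 weights are non-negative and sum to 1, so the computation reduces to n+1 memory reads followed by products and sums, with no tree to explore.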

Digital architectures
[Block diagram of the PWAS architecture, with the following annotations]
-) 3 acquisition methods
-) contains a sorter
-) may contain a pre-scaler (→ non-uniform simplicial partition)
-) contains a memory and either a Multiply and Accumulate (MAC) block (serial architecture) or a bank of multipliers and adders (parallel architecture)

Latency times
n: dimension of the input (i.e., number of components of the input vector x)
m: dimension of the output (i.e., number of components of the PWA function f)
b: number of bits used to code the input

Input method   PWAS(A) serial   PWAS(B) parallel
bitwise        m(n+3)+b+1       b+3
comp-wise      m(n+3)+n+1       n+3
parallel       m(n+3)+2         4

PWAR: Some details
Reference domain: $S_C = \{x \in \mathbb{R}^n : -1 \le x_i < 1,\ i = 1,\dots,n\}$
Single-resolution (sPWAR)
Each coordinate axis is divided into $m_C$ (= $2^r$ for circuit reasons) subintervals of the same length $2^{(1-r)}$.
Multi-resolution (mPWAR)
r levels of refinement of $S = S^{(0)}$. At the first level $S^{(1)}$ each coordinate axis is divided into 2 identical subintervals of unitary length. Then only some hypercubes are split further, as opposed to the single-resolution case. The choice of the hypercubes to be further refined depends on the level of detail required for the PWA function in a certain region of the domain. At level $S^{(r)}$ the subintervals’ length is $2^{(1-r)}$.

PWAR: Some details
[Figures: mPWAR and sPWAR domain partitions, r = 3]

Point location strategy for mPWAR
Main idea: to build off-line an orthogonal search tree such that the real-time search complexity is logarithmic with respect to the number of regions.
Dimension of the tree: scales with n.
Depth of the tree (→ time required by the on-line exploration): depends on the refinement level r.
Main challenge: to minimize the orthogonal tree depth by tuning the parameter r (→ minimize the time required by the on-line tree exploration). Also relevant is keeping the total number of nodes at a minimum, as this would decrease the circuit dimensions.
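One possible reading of the orthogonal search tree is sketched below, assuming that each refined hypercube is halved along every axis (giving $2^n$ children, as in a quadtree for n = 2). This structure, the classes `Split` and `Leaf`, and the bit-tuple child keys are assumptions made for illustration and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple, Union


@dataclass
class Leaf:
    region: int                       # index of one of the L_M regions


@dataclass
class Split:
    # One child per combination of "lower half / upper half" along each axis.
    children: Dict[Tuple[int, ...], Union["Split", Leaf]]


def locate(x: Sequence[float], tree: Union[Split, Leaf]) -> Tuple[int, int]:
    """Return (region index, refinement level reached) for x in [-1, 1)^n."""
    lower = [-1.0] * len(x)           # lower corner of the current hypercube
    width = 2.0                       # edge length of the current hypercube
    node, level = tree, 0
    while isinstance(node, Split):
        width /= 2.0
        key = []
        for i, xi in enumerate(x):
            bit = 1 if xi >= lower[i] + width else 0   # which half along axis i
            lower[i] += bit * width
            key.append(bit)
        node = node.children[tuple(key)]
        level += 1
    return node.region, level


# Toy 2-D example: only the upper-right hypercube is refined a second time.
fine = Split({(0, 0): Leaf(3), (0, 1): Leaf(4), (1, 0): Leaf(5), (1, 1): Leaf(6)})
tree = Split({(0, 0): Leaf(0), (0, 1): Leaf(1), (1, 0): Leaf(2), (1, 1): fine})
```

In this sketch the number of loop iterations never exceeds the refinement level r, and no partition edge has to be stored, since each decision is a comparison against the midpoint of the current hypercube.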

Digital architecture
mPWAR: architecture very similar to the PWAG case; the only difference is in the Finite State Machine that addresses the memory.

Digital architecture
sPWAR: there is no need for a binary search tree; the architecture is more similar to the PWAS case (also in this case the point location problem is solved by exploiting the regular partition).
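How a regular partition removes the need for a search tree can be seen in the sketch below: with $2^r$ equal subintervals per axis on $[-1, 1)$, the region index is obtained by direct arithmetic on each input component. The function names, the dictionary `coeffs` (region index → affine coefficients) and the scalar output are assumptions for illustration.

```python
import math
from typing import Dict, Sequence, Tuple


def spwar_locate(x: Sequence[float], r: int) -> Tuple[int, ...]:
    """Index of the hyper-rectangle containing x, one integer per axis."""
    m = 2 ** r                                    # subintervals per axis, length 2^(1-r)
    index = []
    for xi in x:
        if not -1.0 <= xi < 1.0:
            raise ValueError("input outside the reference domain [-1, 1)")
        index.append(math.floor((xi + 1.0) * m / 2.0))
    return tuple(index)


def spwar_eval(x: Sequence[float], r: int,
               coeffs: Dict[Tuple[int, ...], Tuple[Sequence[float], float]]) -> float:
    """Evaluate the affine function F . x + g stored for the located region."""
    F, g = coeffs[spwar_locate(x, r)]
    return sum(Fi * xi for Fi, xi in zip(F, x)) + g


# Toy 1-D example with r = 3 (8 subintervals of length 0.25); one region is filled in.
coeffs = {(4,): ([2.0], 1.0)}
```

With a fixed-point input code, the same index would presumably correspond to the most significant bits of each component, which is consistent with the constant latency reported for sPWAR.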

Comparisons between architectures
PWAS and sPWAR: suitable for small-scale problems (higher speed and simpler digital structure). No FSM required. sPWAR: also discontinuous functions. Both heavily affected by “curse of dimensionality” (memory size grows exponentially with input dimension n).
PWAG: implements every generic PWA function. Latency and number of parameters to be stored grow very fast for problems with a highly irregular domain partitioning.
mPWAR: simpler structure (→ shorter computation times) and great memory saving (no information about the edges). Drawback: dimensions of the FSM, mainly for high-dimension problems requiring a deep level of refinement.

Comparisons between architectures
Input method: parallel

Architecture   Latency (# ck cycles)   Memory size         Max # of nodes
PWAG(A)        d(n+2)+m(n+2)+3         (E+L_G)(n+1)        …
PWAG(B)        2d+2m+4                 …                   …
PWAS(A)        m(n+3)+2                (m_S+1)^n           NO tree
PWAS(B)        4                       (m_S+1)^n (n+1)     NO tree
mPWAR(A)       r+n+1                   L_M(n+1)            …
mPWAR(B)       r+1                     …                   …
sPWAR(A)       n+2                     (m_r)^n (n+1)       NO tree
sPWAR(B)       2                       …                   …

n: number of input variables (n = dim(x))
L_G, L_M: number of regions (PWAG, mPWAR, resp.)
E: number of edges defining the domain partition (PWAG)
d, r: maximum depth of the (binary, orthogonal) tree (PWAG, mPWAR, resp.)
m_S, m_r = 2^r: number of subintervals per dimension (PWAS, sPWAR)
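The memory-size column can be explored numerically with the sketch below. Only the expressions that are legible in the table are included; the function name `memory_size`, the interpretation of the result as a number of stored coefficients, and the parameter values used in the example are assumptions.

```python
def memory_size(n: int, E: int, L_G: int, m_S: int, L_M: int, m_r: int) -> dict:
    """Memory size per architecture, using the expressions of the comparison table."""
    return {
        "PWAG(A)": (E + L_G) * (n + 1),
        "PWAS(A)": (m_S + 1) ** n,
        "PWAS(B)": (m_S + 1) ** n * (n + 1),
        "mPWAR(A)": L_M * (n + 1),
        "sPWAR(A)": m_r ** n * (n + 1),
    }


# Illustrative values: n = 3, E = 249, L_G = 256, uniform m_S = 15, L_M = 1548, m_r = 16.
sizes = memory_size(n=3, E=249, L_G=256, m_S=15, L_M=1548, m_r=16)

# Curse of dimensionality: PWAS(A) memory for a fixed m_S = 15 and growing n.
pwas_growth = [(n, (15 + 1) ** n) for n in range(1, 7)]
```

For fixed resolution the PWAS and sPWAR entries grow exponentially with n, while the PWAG and mPWAR entries grow linearly in n for a given number of edges and regions.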

Comparisons between architectures
Solution of the MPC problem: PWAG function u*(x)
Approximation with PWAS and PWAR functions u(x)
2 kinds of error computed:
-) maximum error: $\max_x |u^*(x) - u(x)|$
-) relative error: $\sum_j |u^*(x_j) - u(x_j)| / \sum_j |u^*(x_j)|$ (computed on a grid of samples)

Double integrator
Function   Max. error   Rel. error
PWAS       …            …
mPWAR      …            …
sPWAR      …            …

3D example
Function   Max. error   Rel. error
PWAS       …            …
mPWAR      …            …
sPWAR      …            …
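The two error measures can be computed as in the following sketch. Here both are evaluated on the same finite set of samples (for the maximum error this is an assumption), and the toy functions `u_star` and `u` are placeholders, not the control laws of the benchmarks.

```python
from typing import Callable, Iterable, List


def max_error(u_star: Callable[[float], float], u: Callable[[float], float],
              samples: Iterable[float]) -> float:
    """max over the samples of |u*(x) - u(x)|."""
    return max(abs(u_star(x) - u(x)) for x in samples)


def relative_error(u_star: Callable[[float], float], u: Callable[[float], float],
                   samples: Iterable[float]) -> float:
    """sum_j |u*(x_j) - u(x_j)| divided by sum_j |u*(x_j)|."""
    samples = list(samples)
    numerator = sum(abs(u_star(x) - u(x)) for x in samples)
    denominator = sum(abs(u_star(x)) for x in samples)
    return numerator / denominator


# Placeholder 1-D functions: a saturated linear law and a cruder approximation of it.
def u_star(x: float) -> float:
    return max(-1.0, min(1.0, -2.0 * x))


def u(x: float) -> float:
    return -x


grid: List[float] = [-1.0, -0.5, 0.0, 0.5, 1.0]
```

On this grid the approximation is exact at three samples and off by 0.5 at the other two, giving a maximum error of 0.5 and a relative error of 0.25.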

Comparisons between architectures
Benchmark example: 3D example (Mayne & Rakovic, Int. J. Robust and Nonlinear Control, 2003)
Domain $S = \{-10 \le x_{1,3} \le 10,\ -5 \le x_2 \le 5\}$
Constraints on the control: $-1 \le u \le 1$
Matlab Multi-Parametric Toolbox → PWAG control law u*(x): L_G = 256 polytopes, E = 249 edges, d = 12
PWAS approximation: m_S1 = 31, m_S2 = 7, m_S3 = 15 (→ … simplices & 4096 coefficients)
mPWAR approximation: r = 5, L_M = 1548 regions
sPWAR approximation: m_r1 = 32, m_r2 = 8, m_r3 = 16 (→ 4096 regions & coefficients)
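The counts quoted for the PWAS and sPWAR approximations can be cross-checked from the per-dimension subdivisions, as sketched below. The variable names are illustrative, and the simplex count relies on the assumption that every hyper-rectangle of the grid is split into n! simplices, as in the standard simplicial partition; that figure is not stated in the slide text.

```python
from math import factorial, prod

n = 3
m_S = (31, 7, 15)      # PWAS subdivisions per dimension
m_r = (32, 8, 16)      # sPWAR subintervals per dimension

# PWAS stores one coefficient per vertex: (m_S1 + 1)(m_S2 + 1)(m_S3 + 1).
pwas_vertices = prod(m + 1 for m in m_S)

# Each m_Si has the form 2^p_i - 1; recover the exponents p_i.
p = [(m + 1).bit_length() - 1 for m in m_S]

# sPWAR regions: product of the subintervals per dimension.
spwar_regions = prod(m_r)

# Assumption: n! simplices per hyper-rectangle of the PWAS grid.
pwas_simplices = factorial(n) * prod(m_S)

print(pwas_vertices, spwar_regions, p, pwas_simplices)
```

Both products give 4096, matching the numbers of PWAS coefficients and sPWAR regions reported for the benchmark.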

Comparisons between architectures
Performances on a Xilinx Spartan III FPGA (xc3s200).
Latency: obtained as (# of clock cycles needed to perform the PWA computation) × 50 ns (i.e., the period of a clock frequency suitable for all architectures after the post-synthesis simulation on the FPGA). Some architectures can work at higher frequencies.
Used b = 12 bits to represent the data in all cases.
Data obtained with older PWAG and PWAS architectures (→ not always consistent with the data reported in previous slides).
Power consumption for all architectures: on the order of 60 mW

Comparisons between architectures

Architecture   Occup. slices   Latency [µs]   Mem. size [kb]
PWAG           53%             …              …
PWAS(A)        12%             …              …
PWAS(B)        12%             …              …
mPWAR(A)       79%             …              …
mPWAR(B)       78%             …              …
sPWAR(A)       2%              …              …
sPWAR(B)       2%              …              …

One can choose among the architectures, if any, that fulfill the system constraints (available FPGA board, sampling times, etc.).

Conclusions
-) Many circuit solutions to implement explicit MPC control systems
-) Proposed architectures completely reconfigurable and suitable for FPGA implementation (→ customisation of the HW at an attractive price even in low quantities!)
-) Architectures particularly attractive for fast, small-size and low-power applications
-) For large-scale productions → ASIC solutions!