LabVIEW Real Time for High Performance Control Applications

Slides:



Advertisements
Similar presentations
Enhanced matrix multiplication algorithm for FPGA Tamás Herendi, S. Roland Major UDT2012.
Advertisements

SE263 Video Analytics Course Project Initial Report Presented by M. Aravind Krishnan, SERC, IISc X. Mei and H. Ling, ICCV’09.
A Centralized Scheduling Algorithm based on Multi-path Routing in WiMax Mesh Network Yang Cao, Zhimin Liu and Yi Yang International Conference on Wireless.
Heterogeneous Computing and Real-Time Math for Plasma Control
SKELETON BASED PERFORMANCE PREDICTION ON SHARED NETWORKS Sukhdeep Sodhi Microsoft Corp Jaspal Subhlok University of Houston.
Computer Graphics Recitation The plan today Least squares approach  General / Polynomial fitting  Linear systems of equations  Local polynomial.
Using the Trigger Test Stand at CDF for Benchmarking CPU (and eventually GPU) Performance Wesley Ketchum (University of Chicago)
A TCP/IP transport layer for the DAQ of the CMS Experiment Miklos Kozlovszky for the CMS TriDAS collaboration CERN European Organization for Nuclear Research.
1 LabVIEW DSP Test Integration Toolkit. 2 Agenda LabVIEW Fundamentals Integrating LabVIEW and Code Composer Studio TM (CCS) Example Use Case Additional.
Embedded Runtime Reconfigurable Nodes for wireless sensor networks applications Chris Morales Kaz Onishi 1.
- Modal Analysis for Bi-directional Optical Propagation
Edward Freeman CCLRC ESDG Optical Data Acquisition Development EID forum 12th October 2005 By Edward Freeman.
1 Iterative Integer Programming Formulation for Robust Resource Allocation in Dynamic Real-Time Systems Sethavidh Gertphol and Viktor K. Prasanna University.
An Exact Algorithm for Difficult Detailed Routing Problems Kolja Sulimma Wolfgang Kunz J. W.-Goethe Universität Frankfurt.
An FFT for Wireless Protocols Dr. J. Greg Nash Centar ( HAWAI'I INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES Mobile.
Deep Learning Overview Sources: workshop-tutorial-final.pdf
Evolutionary Computation Evolving Neural Network Topologies.
1 CHAP 3 WEIGHTED RESIDUAL AND ENERGY METHOD FOR 1D PROBLEMS FINITE ELEMENT ANALYSIS AND DESIGN Nam-Ho Kim.
Unstructured Meshing Tools for Fusion Plasma Simulations
CPU Central Processing Unit
These slides are based on the book:
Machine Learning Supervised Learning Classification and Regression
M. Bellato INFN Padova and U. Marconi INFN Bologna
NFV Compute Acceleration APIs and Evaluation
Roy Taragan Shaham Kenat
Multiple-Layer Networks and Backpropagation Algorithms
Algorithms and Programming
Introduction to Parallel Computing: MPI, OpenMP and Hybrid Programming
Part 3 Linear Programming
Xing Cai University of Oslo
Parallel Plasma Equilibrium Reconstruction Using GPU
Hydra: Leveraging Functional Slicing for Efficient Distributed SDN Controllers Yiyang Chang, Ashkan Rezaei, Balajee Vamanan, Jahangir Hasan, Sanjay Rao.
Large-scale Machine Learning
Scientific requirements and dimensioning for the MICADO-SCAO RTC
Chilimbi, et al. (2014) Microsoft Research
I. E. Venetis1, N. Nikoloutsakos1, E. Gallopoulos1, John Ekaterinaris2
SCADA for Remote Industrial Plant
IEEE NPSS Real Time Conference 2009
Computing and Compressive Sensing in Wireless Sensor Networks
Parallel Processing - introduction
Enabling machine learning in embedded systems
ABSTRACT   Recent work has shown that sink mobility along a constrained path can improve the energy efficiency in wireless sensor networks. Due to the.
第 3 章 神经网络.
Parallel Algorithm Design using Spectral Graph Theory
ElasticTree Michael Fruchtman.
CS 286 Computer Organization and Architecture
A Brief Introduction of RANSAC
Real-Time Ray Tracing Stefan Popov.
Embedded OpenCV Acceleration
Data Structures and Algorithms in Parallel Computing
Part 3 Linear Programming
Artificial Neural Network & Backpropagation Algorithm
Support for ”interactive batch”
CLUSTER COMPUTING.
GENERAL VIEW OF KRATOS MULTIPHYSICS
Artificial Intelligence Chapter 3 Neural Networks
The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.
Hybrid Programming with OpenMP and MPI
Die Stacking (3D) Microarchitecture -- from Intel Corporation
Artificial Intelligence Chapter 3 Neural Networks
Computer Evolution and Performance
Network Processors for a 1 MHz Trigger-DAQ System
Artificial Intelligence Chapter 3 Neural Networks
Artificial Intelligence Chapter 3 Neural Networks
Part 3 Linear Programming
NVMe.
6- General Purpose GPU Programming
2019/9/14 The Deep Learning Vision for Heterogeneous Network Traffic Control Proposal, Challenges, and Future Perspective Author: Nei Kato, Zubair Md.
Artificial Intelligence Chapter 3 Neural Networks
Multiprocessors and Multi-computers
Presentation transcript:

LabVIEW Real Time for High Performance Control Applications Aljosa Vrancic, Lothar Wenzel, LabVIEW R&D, National Instruments Case studies: Matrix-vector Multiplication & PDE Solver Tool Most of the basic computational algorithms for high-performance computing are designed to have the best average performance for the most general case as they are used in off-line calculations: the goal is for calculation to end in as little time as possible for an arbitrary set of inputs but each step in general does not have a strict time deadline. As a result, they do not scale well into hard real-time embedded environments that have much stricter per-iteration timing constraints, often in a 1 ms range. We found the following approach to work much better for real-time environments: Divide algorithm into two steps: Off-line calculation – it is acceptable for this step to be expensive, as it is done up front On-line calculation – this step uses input and off-line calculation data to compute outputs deterministically We applied this approach to matrix-vector multiplication and PDE solver problems that are at the heart of many control applications. PDEs are used to describe the system and matrix-vector multiplication is used to apply the model to sensor data in order to generate actuator data. (Non)Linear Elliptic PDE Solver LabVIEW Graphical Development Environment Tokamak Fusion Reactor Structured data flow programming Compiled graphical development Target desktop, mobile, industrial, and embedded Thousands of out-of-the box mathematics and signal processing routines Seamless connectivity to millions of I/O devices Grad-Shafranov PDE Magnetohydrodynamics Shape Control Matrix-Vector Multiplication in real-time Custom FPGA-based Communication Protocol Application Layers Currently: 3-12 ms, 39x63 grid Goal: 1ms Geometry-aware algorithm: Cast PDE into a set of equations: Mx – b = f(x) Use iterative algorithm: x(n+1) = M-1{f[x(n)] + b} Reduce problem for each iteration by finding a set of points (pink) that, when calculated, exactly split the problem into two or more smaller independent sub-problems (blue and green) Calculate the exact solution for the points using corresponding rows of M-1 Repeat for each sub-problem E-ELT Telescope Layer 1 ESO (European Southern Observatory) Five mirrors (M1 through M5) National Instruments involvement: Data Acquisition M1 – 42m primary segmented mirror M4 – adaptive mirror Off-line calculations Data distribution CPU affinity feature for cache control (keep fixed data in cache) Read sensor data, distribute across CPUs (machines). Collate and update outputs Program FPGA Call dll Distributed Real-Time FPGA Network topology: ring Network speed: up to 110 MB/s (cable dependent) Configurable: RX and TX parameters, data sink, data source, data recombine Development time: 2 weeks Throughput/Latency for 80MB/s network setup Calculating the exact solution for these grid points splits grid into two independent sections of smaller size Graph representation of solution grid points E-ELT Telescope: M1 Mirror Layer 2 Packet length (f32s) Throughput (MB/s) First byte latency per node (ms) 8 46 1 16 57 1.5 32 65 2.5 64 70 4.5 C code Memory management data alignment Iterate over sub-problems Problem: For a 1000x1000 grid, M dimensions are M(1,000,000 x 1,000,000) Calculating M-1 directly may be impossible Solution: Solve the problem for the unit boundary vector using any method (for example, iteration over smaller subsets of the problem for which M-1 can be calculated directly). Values for the solution corresponding to the pink points represent elements in the M-1 rows corresponding to the unit boundary vector. Repeat the process for all unit vectors. 984 hexagonal segments 6 sensors/3 actuators per segment 1 ms control cycle Math tricks to reduce problem to 3k x 3k symmetric matrix by 3k vector multiplication New multiplication algorithm needed 750 ms (worst case) Dell T7400 (2x2.6GHz 12MB L2 Quad Core Xeon) Matrix restructured off-line in L1/L2 cache optimized sub-problems and loaded into the CPU caches GPU (dual Tesla): 850 ms FPGA Block Diagram Layer 3 Assembly code Hand coded SSE2 assembly for solving sub-problems L2 & L1 cache optimized PC PC Memory DMA FPGA Interaction between solution grid points (represents matrix M) Grid points calculated with reduced set of boundary conditions Minimum Grid Size Layer 4 Custom communication protocol CPU of-load for data recombination Fully hand-shaked, with retries Slower nodes will gate data delivery DMA for data access directly from memory Receive Block L2 Cache Threshold Transmit Block 111x63 grid (6993 non-linear equations with 6993 unknowns) 7th order RHS polynomial One iteration: < 250 ms 10-5 error: 4 iterations starting from 0s Total time for solution: <1 ms Dell 7400T (2x2.6GHz Quad Core Xeons) Matrix fits Matrix too big + Hold time Packet Size f32 ADD Hold time Packet Size Retry LabVIEW Targets Linear PDE Benchmarks (ms) Portable FPGA Embedded Controllers PC Handheld Industrial Controllers (PXI) Sensor Vision System DSP/MPU Grid Size Laplace General Elliptic Liner PDE Spectral New Method (CPUs) CPUs 1 8 2 4 31x31 115 39 73 40 32 34 72 63x63 495 96 84 100 66 61 85 127x127 2130 363 128 381 211 157 129 255x255 10321 1498 296 6073 2866 2408 308 319x319 16594 2589 459 1416 511x511 45748 10883 965 29195 15171 13873 11814 E-ELT Telescope: M4 Mirror Problem Grid Grid points calculated with full set of boundary conditions 6000-9000 actuators 1ms control cycle Math: 9k x 15k matrix by 15k vector multiplication Both CPU bandwidth and cache size limited distributed computation required Custom FPGA-based communication protocol: Deterministic communication On-the-fly data recombination Boundary Conditions Geometry-aware algorithm applied to a rectangular PDE grid Multi-blade setup for distributed calculation ni.com