TI Information – Selective Disclosure
Implementation of Linear Algebra Libraries for Embedded Architectures Using BLIS
September 28, 2015
Devangi Parikh

Presentation transcript:

Implementation of Linear Algebra Libraries for Embedded Architectures Using BLIS
September 28, 2015
Devangi Parikh, Francisco Igual Peña, Murtaza Ali

Outline
– TI Embedded Processors
– Library Development Strategy
– TI LINALG library
– BLIS on C66x
– Testing
– Performance
Picture credit: HP

TI Embedded Processors

Generations of TI Multicore Processors
KeyStone architecture:
– Lowers development effort
– Speeds time to market
– Leverages TI's investment
– Optimal software reuse

TI 66AK2H12 SoC (KeyStone II architecture)
Cores:
– 4 ARM Cortex-A15 cores at 1.0 GHz
  4 MB shared L2 cache
  32 GFLOPS single precision, 8 GFLOPS double precision
– 8 C66x DSP cores at 1.0 GHz
  32 KB L1 scratch/cache and 1 MB L2 scratch/cache each
  128 GFLOPS single precision, 32 GFLOPS double precision
Memory:
– 8 GB DDR3 DRAM (external)
– 6 MB shared SRAM/L3
Interfaces:
– 2x Gigabit Ethernet (~100 MB/s)
– 4x SRIO (~400 MB/s)
– 2x HyperLink (~1 GB/s)

Library Development Strategy

Development Philosophy
User view:
– Embedded Linux running on the ARM
– Standard GCC tool chain
– Simply link to a TI-provided library with an ARM-callable API to accelerate applications using multiple ARM cores, DSP cores, and processors as appropriate
– Use TI-provided tools and examples to write new applications and libraries that use multiple ARM cores, DSP cores, and processors to accelerate performance
Using multiple cores on a single processor:
– OpenMP for shared-memory parallelization across ARM cores
– OpenCL or the OpenMP Accelerator model for heterogeneous acceleration with multiple DSP cores
Using multiple processors:
– Open MPI over Ethernet, SRIO, or HyperLink

ARM + OpenCL DSP Acceleration

TI LINALG library

CBLAS
– Uses BLIS (BLAS-like Library Instantiation Software) for the underlying BLAS computations
Advantages of BLIS over traditional BLAS libraries:
– Portable across architectures
– Generalized matrix storage
– Easy to use (BLAS and CBLAS compatibility layers)
– Code reuse
– Allows us to bring BLIS into embedded processing markets

Single-Threaded Applications
– Support for the standard CBLAS and CLAPACK APIs
– CBLAS runs on either the available ARM or DSP cores
– Support for single-core and multi-core CBLAS computation
– Automatically chooses between ARM and DSP cores for compute based on problem size
– User can override through environment variables
– CBLAS calls to the DSP are blocking
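The size-based dispatch described above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the threshold value and the `TI_CBLAS_OFFLOAD` variable name are assumptions made for the example.

```python
import os

# Illustrative sketch of the automatic ARM/DSP dispatch for a CBLAS call.
# OFFLOAD_THRESHOLD and the TI_CBLAS_OFFLOAD variable name are assumed here,
# not taken from the TI library.
OFFLOAD_THRESHOLD = 256  # assumed dimension above which the DSP wins

def choose_device(m, n, k, env=os.environ):
    """Return 'dsp' or 'arm' for a GEMM of size m x k x n."""
    override = env.get("TI_CBLAS_OFFLOAD")  # user override, per the slide
    if override in ("arm", "dsp"):
        return override
    # Automatic decision: offload only when the problem is large enough
    # that DSP throughput outweighs the data-movement cost.
    if min(m, n, k) >= OFFLOAD_THRESHOLD:
        return "dsp"
    return "arm"
```

In the real library the call is blocking either way; only the compute device changes.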

Multi-Threaded Applications
– Applications can make BLAS calls from multiple threads
– ARM compute supports up to four threads:
  (# of application threads) x (# of CBLAS ARM compute threads) = 4
– DSP compute calls are enqueued in the OpenCL command queue
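One reading of the thread-budget formula above is that the four ARM compute threads are divided evenly among the application threads; the helper below is an illustrative sketch of that reading, not a TI API.

```python
# Sketch of the constraint (# app threads) x (# CBLAS ARM threads) = 4:
# each application thread gets an equal share of the four ARM compute threads.
# This interpretation is an assumption made for illustration.
def cblas_arm_threads(num_app_threads, total=4):
    if num_app_threads < 1 or num_app_threads > total:
        raise ValueError("application threads must be between 1 and 4")
    return total // num_app_threads
```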

Compute-to-Data-Movement Ratio
– Level 3 BLAS operations are compute bound (C/D > 1)
– The automatic offloading decision is available only for Level 3 BLAS operations

Theoretical values, for vector length N and matrix size NxN:

          Data Movement (D)   Compute (C)   C/D
Level 1   3N                  N             1/3
Level 2   N^2 + 3N            2N^2 + N      (2N+1)/(N+3)
Level 3   4N^2                2N^3          N/2
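The table above can be checked with a few lines of arithmetic. The sketch below evaluates the theoretical C/D ratio per BLAS level and shows why only Level 3 (where the ratio grows with N) justifies automatic offload.

```python
from fractions import Fraction

# Theoretical compute (C) over data movement (D) per BLAS level,
# for vector length N and N x N matrices, as in the table above.
def compute_to_data_ratio(level, n):
    n = Fraction(n)
    if level == 1:
        return n / (3 * n)                        # C = N,        D = 3N
    if level == 2:
        return (2 * n * n + n) / (n * n + 3 * n)  # C = 2N^2 + N, D = N^2 + 3N
    if level == 3:
        return (2 * n ** 3) / (4 * n * n)         # C = 2N^3,     D = 4N^2
    raise ValueError(level)
```

Levels 1 and 2 stay bounded near constant ratios, so data movement dominates; Level 3's ratio is N/2, which exceeds 1 for any realistic matrix size.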

Offload Strategy
– Automatic offloading decision available only for Level 3 BLAS operations
– Tuning: for each Level 3 operation, find the matrix sizes for which execution on the DSP is faster
  Performed offline
  Sweep matrix sizes, e.g. (m, k, n) for xGEMM
  For each combination of (m, k, n), benchmark DSP execution and ARM execution
  Generate an offload lookup table based on the benchmarking results
– Making the offloading decision for each Level 3 function:
  Configuration through environment variables
  Offload lookup table obtained through tuning
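The offline tuning flow above can be sketched as a sweep that compares per-size timings and records the winner. The timing models below are stand-in analytic estimates (assumed peak rates and a fixed offload latency), not measurements from the 66AK2H12.

```python
from itertools import product

# Stand-in timing models; real tuning benchmarks actual kernel runs.
def arm_time(m, k, n):
    return 2 * m * k * n / 32e9               # assumed 32 GFLOPS ARM peak

def dsp_time(m, k, n):
    overhead = 1e-4                           # assumed fixed offload latency
    return overhead + 2 * m * k * n / 128e9   # assumed 128 GFLOPS DSP peak

def build_offload_table(sizes):
    """Map each (m, k, n) in the sweep to True when the DSP run is faster."""
    return {(m, k, n): dsp_time(m, k, n) < arm_time(m, k, n)
            for m, k, n in product(sizes, repeat=3)}
```

At runtime the dispatch consults this table: small GEMMs stay on the ARM because the fixed offload overhead dominates, while large ones go to the DSP.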

BLIS on C66x

BLIS High-Performance GEMM

C66x High-Performance GEMM
– BLIS is designed for cache-based architectures; the C66x is a DMA-based architecture
– Integrate DMA capabilities into BLIS to obtain high performance on the C66x
– Overlap data movement through the various levels of memory with computation by using the DMA
– The blocking parameters (MC, KC, NC, MR, NR) for each datatype (S single, D double, C single complex, Z double complex) are selected so that the ping-pong buffers fill the available SRAM
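The loop structure that the MC/KC/NC/MR/NR parameters control can be sketched in pure Python. This is the general shape of the BLIS layered GEMM (five loops around a micro-kernel), with tiny illustrative block sizes; on the C66x the real values are chosen to fill on-chip SRAM, and packing plus DMA transfers happen at the marked loops.

```python
# Illustrative block sizes only; real values are per-datatype and SRAM-sized.
NC, KC, MC, MR, NR = 4, 4, 4, 2, 2

def gemm_blocked(A, B, C):
    """C += A @ B using the BLIS-style five-loop blocking (lists of lists)."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, NC):                 # 5th loop: NC-wide panels of B/C
        for pc in range(0, k, KC):             # 4th loop: rank-KC update (pack B here)
            for ic in range(0, m, MC):         # 3rd loop: MC-tall panels of A/C (pack A here)
                for jr in range(jc, min(jc + NC, n), NR):      # 2nd loop
                    for ir in range(ic, min(ic + MC, m), MR):  # 1st loop
                        # micro-kernel: update an MR x NR tile of C
                        for i in range(ir, min(ir + MR, m)):
                            for j in range(jr, min(jr + NR, n)):
                                acc = 0.0
                                for p in range(pc, min(pc + KC, k)):
                                    acc += A[i][p] * B[p][j]
                                C[i][j] += acc
    return C
```

Each (i, j) entry accumulates one KC-long partial sum per pass of the 4th loop, so the result equals the unblocked product.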

DMA Integration Goals
– Flexible: the user or library developer must be able to select when and where to transfer data for an operation
– Transparent: the user need not be aware of the DMA usage, but can manage the DMA if desired
– Integrated into the control tree mechanism
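The ping-pong buffering that hides DMA latency can be sketched abstractly: while the kernel computes on one on-chip buffer, the DMA fills the other. In this simulation the "DMA" is just a list copy; only the alternation pattern is the point.

```python
# Simulation of double (ping-pong) buffering: overlap the fetch of block i+1
# with the computation on block i. On the C66x the copies would be DMA
# transfers into SRAM; here they are plain list copies.
def process_blocks(blocks, compute):
    results = []
    buffers = [None, None]           # the ping-pong pair (simulated SRAM)
    buffers[0] = list(blocks[0])     # prefetch the first block
    for idx in range(len(blocks)):
        cur, nxt = idx % 2, (idx + 1) % 2
        if idx + 1 < len(blocks):
            buffers[nxt] = list(blocks[idx + 1])  # "DMA in" the next block
        results.append(compute(buffers[cur]))     # compute on the current one
    return results
```

With real DMA the transfer and the compute run concurrently, so each iteration pays only max(transfer, compute) time rather than their sum.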

GEMM Control Tree Definitions

Memory Buffers

C66x Data Movement for Level 3 BLIS
(figure: movement of the A, B, and C matrices through the memory hierarchy)

C66x High-Performance GEMM

Algorithmic Variants for GEMM

Testing

BLIS Test Suite
– Suitable for larger matrix sizes, performance benchmarks, and selective functionality tests
– Customizable: can sweep over BLAS routines with all possible permutations of the available options
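The "all possible permutations" sweep amounts to a Cartesian product over the options a routine accepts. The sketch below enumerates illustrative GEMM test cases; the option lists are examples, not the test suite's actual configuration format.

```python
from itertools import product

# Enumerate GEMM test cases over datatype, transpose options, and sizes.
# The option values are illustrative, mirroring the sweep idea above.
def gemm_test_cases(sizes):
    transposes = ("n", "t", "c")   # no-transpose / transpose / conjugate-transpose
    dtypes = ("s", "d", "c", "z")  # the four BLAS/BLIS precisions
    return [(dt, ta, tb, m)
            for dt, ta, tb, m in product(dtypes, transposes, transposes, sizes)]
```

The combinatorial growth is why sweeping every option quickly yields test counts in the hundreds of thousands, as on the next slides.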

BLAS Test Suite
– Suitable for corner cases (zero matrix dimensions, near-underflow and near-overflow valued matrices) and smaller matrix sizes
– Not customizable
– Total tests: 239,052

CLAPACK Test Suite
– Suitable for corner cases (zero matrix dimensions, near-underflow and near-overflow valued matrices) and smaller matrix sizes
– Not customizable
– Types of tests: 83
– Total tests: 3,073,466

Performance

SGEMM
– Single-precision general matrix-matrix multiplication
– Obtained on a TI 66AK2H12 SoC at a 1 GHz clock
– Theoretical peak DSP performance: 128 GFLOPS
– Theoretical peak ARM performance: 32 GFLOPS

DGEMM
– Double-precision general matrix-matrix multiplication
– Obtained on a TI 66AK2H12 SoC at a 1 GHz clock
– Theoretical peak DSP performance: 32 GFLOPS
– Theoretical peak ARM performance: 8 GFLOPS

Thanks!