TI Information – Selective Disclosure


An Implementation of GEMM for DMA-enabled Architectures
Devangi Parikh, Will Leven
September 19, 2016
TI / Embedded Processing / Processors / Silicon Development / Machine Learning Lab

Outline
- TI Embedded Processors
- TI LINALG Library
- Application Use Case
- GEMM on C66x
- Performance

TI Embedded Processors

5 Generations of TI Multicore Processors
The Keystone architecture:
- Lowers development effort
- Speeds time to market
- Leverages TI's investment
- Enables optimal software reuse

TI 66AK2H12 SoC
Keystone II architecture

Cores
- 4 ARM Cortex-A15s at 1.0 GHz
  - 4 MB shared L2 cache
  - 32 Gflops single precision, 8 Gflops double precision
- 8 C66x DSPs at 1.0 GHz
  - 32 kB L1 scratch/cache and 1 MB L2 scratch/cache each
  - 128 Gflops single precision, 32 Gflops double precision

Memory
- 8 GB DDR3 DRAM (external)
- 6 MB shared SRAM (MSMC / L3)

Interfaces

TI LINALG Library

Dense Linear Algebra: LINALG
- Supports the standard CBLAS and CLAPACK APIs
- CBLAS runs on either the available ARM or DSP cores
- Uses BLIS (BLAS-like Library Instantiation Software) for the underlying BLAS computations

Data Movement for Level 3 BLAS on C66x
The input and output matrices (A, B, C) are moved efficiently through the different levels of memory using DMA and packing routines that keep the computation efficient. Level 3 BLAS computations require:
- 4.5 MB of MSMC memory
- 768 KB of L2 scratchpad memory
- 28 KB of L1D scratchpad memory

Single Precision GEMM Performance
- Single precision general matrix-matrix multiplication (SGEMM)
- Obtained on a TI 66AK2H12 SoC at a 1 GHz clock
- Theoretical peak DSP performance: 128 GFLOPS
- Theoretical peak ARM performance: 32 GFLOPS

Application Use Case

CNN Applications

Extended BLAS for TIML
Input and output final locations (any permutation of the parameters can be provided):
- DRAM
- MSMC / L3
- L2

Fuse GEMM/GEMV with other operations:
- Fully connected layer (with or without H pre-packed / stored optimally, with or without V special structure):
  Y = ReLU(H*X + V)
  y = ReLU(H*x + v)
- Convolutional layer (with or without H pre-packed / stored optimally, with or without V special structure; need to form Xfilter):
  Y = ReLU(H*Xfilter + V)
  Y = pool(ReLU(H*Xfilter + V))

Data movement: EDMA
Working space: may assume only L1 is available for scratchpad buffers

GEMM On C66x

GEMM Building Blocks
C66x SGEMM ukernel:
- MR_S = 4, NR_S = 8
- KC_S > 384 (to get > 90% performance from the ukernel)

Packing routines:
- 8xK to pack matrix B
- 4xK to pack matrix A

Memory required for packing (available working space: 28 KB of L1):
- 1 micro-panel of B = 12 KB
- 1 micro-panel of A = 6 KB

Performance Analysis

Performance limitations:
- MC and NC have to be very small to fit panels of A and B in L1
- KC has to be reduced to fit more micro-panels of A and B
- Expensive loops (the 5th and 3rd loops around the ukernel) iterate a large number of times

Custom implementation:
- GEMM building blocks: ukernel (> 90% performance)
- Streamlined implementation: aims to reduce function calls and other code generalization
- Use DMA to pack A
- Pack the next micro-panel while computing on the current micro-panel

Operation    % of total cycles
Ukernel      ~42.5
Packing A    ~17.2
Packing B    ~2.0
Overhead     ~38.3

M    MC   N    NC   K    KC   MR   NR
256  16   –    –    198  –    4    8

GEMM on C66x

Single Precision GEMM Performance

Preliminary results (clock 983 MHz):
- L2:   11.03 GFLOPS (70%)
- MSMC:  9.49 GFLOPS (59%)

Operation    % of total cycles
Ukernel      ~84.0
Packing A    ~7.7
Packing B    ~2.7
Overhead     ~5.6

M    KC   N    MR   K    NR
224  398  –    4    –    8

Thank you!

Backup

Summary
Our previous implementation of dense linear algebra libraries for TI DSPs assumes that all on-chip memory is available as working space for moving data efficiently through the various levels of memory with DMA and packing routines. However, this assumption prevents applications from keeping their own frequently used data in on-chip memory. In this talk, we describe an implementation of GEMM that uses a limited amount of working space and DMA to pack matrices, freeing up most of the on-chip memory for the application's use.