MPI Software Technology, Inc. VSIPL for Diverse Architectures

VSIPL for Diverse Architectures (Pentium 4 to DSPs)
Anthony Skjellum, Brian Chase, Wenhao Wu
MPI Software Technology, Inc.
September 23, 2003
© 2001-2003 MPI Software Technology, Inc.

Prelude – VSI/Pro Overview: Current VSIPL platform support
Status: G4 / Altivec
- Widely used worldwide
- Domestic production computing adoption picking up
- Helps untie programs from specific vendors
- Optimizing for the G4 is a major part of our expertise
- Porting to different PPC environments is also key expertise
- Dealing with C/C++ toolchains is a major expertise
- Key high-performance optimizations for more advanced users (e.g., Rader’s algorithm and other NTT-motivated improvements) are at the cusp of the newest release efforts
- Complete version for image processing also released
- Customers have started asking for non-G4/Altivec alternatives!

MSTI’s Strategy

Availability on Different Processors    Operating System / Development Tools
Core+     G4 / Altivec                  VxWorks, MercuryOS, LynxOS, Linux, MacOSX
Core      P4 / SSE                      Windows, Linux, VxWorks
CoreLite  TI DSP C67 family             Code Composer toolset

Why P4?
- Higher clock speed, 3 GHz or more
- COTS technology enables cost-effective solutions
- Anticipated lower-power versions from Intel and third parties in future
- Not all embedded systems equally power/heat constrained even now
- Double precision 4-way vectorization useful
- Future winner in Gflop/Watt? Gflop/$?

Why C67
- Specially designed architecture for DSP applications:
  - very deep pipeline
  - very long instruction word (VLIW) architecture
  - streaming data
- Better GFlops per $ than G4 / Altivec

Product Exploration/Results - P4
- Full Core profile support for Windows, Linux, and VxWorks
- Optimized FFT performance for SSE registers (performance graph later)
- Optimized matrix library easily achievable also
- Can equal or beat MKL (Intel commercial library) in significant aspects of overall performance… more tuning possible

Porting Experience for C67
What we achieved in 1 month:
- VSI/Pro Core Lite profile is completely ported for TI C67
- We have C6711-optimized complex-to-complex in-place and out-of-place, forward and inverse FFTs: vsip_ccfftop_f(), vsip_ccfftip_f()
- C6711 150 MHz CPU: 29300 cycles for a 1024-element FFT
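To put the cycle count above in perspective, the following minimal sketch converts it to elapsed time and a nominal throughput. It assumes the common 5·N·log2(N) flop-count convention for complex FFTs, which the slides do not state, and the function names are illustrative only.

```python
import math

def fft_time_us(cycles, clock_hz):
    """Elapsed time in microseconds for a given cycle count at clock_hz."""
    return cycles / clock_hz * 1e6

def nominal_fft_mflops(n, cycles, clock_hz):
    """Nominal MFLOPS, ASSUMING the 5*N*log2(N) flop-count convention."""
    flops = 5 * n * math.log2(n)
    seconds = cycles / clock_hz
    return flops / seconds / 1e6

# Figures from the slide: 29300 cycles, 1024 points, 150 MHz C6711.
t = fft_time_us(29300, 150e6)
r = nominal_fft_mflops(1024, 29300, 150e6)
print(f"{t:.1f} us, {r:.1f} MFLOPS")  # 195.3 us, 262.1 MFLOPS
```

Under that convention, the reported figure corresponds to roughly 195 microseconds per 1024-point transform.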

Issues
- C side: Straightforward
- C++ side: Strict on template support
- VLIW assembly side: No hand-tuned assembly code in the library yet… next step before product release

C67 Operating Systems Example
Various OS platforms: SPARK (Small Portable Adjustable Real-time Kernel), OSE, Diamond Thread XABS GmbH Jena

FFT optimization for P4
- Algorithm was engineered for the architecture to minimize problems arising from the scarcity of registers and lower cache associativity.
- The algorithm is auto-sort DIF (decimation in frequency), efficient not only on power-of-two sizes.
- The key functions are written in assembler supported by highly optimized C and C++ code, using SSE.
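As an illustration of what an auto-sort DIF transform looks like, here is a minimal Python sketch of a radix-2 Stockham-style FFT. It is not the VSI/Pro implementation: it assumes "auto-sort" refers to a Stockham-type formulation, handles only power-of-two lengths (the slide says the real algorithm is not limited to those), and uses plain Python rather than SSE assembly. The name `stockham_fft` is hypothetical.

```python
import cmath

def stockham_fft(x):
    """Radix-2 decimation-in-frequency FFT in Stockham (auto-sort) form.

    Each pass writes its results already permuted into a second buffer,
    so the output is in natural order with no bit-reversal step.
    ASSUMPTION: len(x) is a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    src = [complex(v) for v in x]
    dst = [0j] * n
    length, stride = n, 1          # sub-transform length and element stride
    while length > 1:
        half = length // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / length)   # twiddle factor
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src        # ping-pong between the two buffers
        length, stride = half, stride * 2
    return src

def naive_dft(x):
    """O(N^2) reference DFT used only to check the sketch."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

data = [1, 2, 3, 4, 0, -1, -2, -3]
fast, ref = stockham_fft(data), naive_dft(data)
print(max(abs(a - b) for a, b in zip(fast, ref)) < 1e-9)
```

The two-buffer ping-pong trades extra memory for unit-stride, in-order output, which is one reason this style suits vector units.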

FFT performance on P4
RCFFT: Comparison with MKL…

FFT performance on P4
Interleaved in-place CCFFT: Comparison

FFT performance on P4
Split in-place CCFFT

FFT optimization for DSP
- Using Radix-2 and Radix-4 algorithms.
- Also using cache splitting (the L1 cache is 4 KB, so the splitting is needed for sizes > 256 for in-place FFT).
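The 256-point threshold above can be sanity-checked with a rough working-set estimate. This sketch assumes single-precision complex data (8 bytes per point, suggested by the `_f` function suffix) and a hypothetical twiddle table of N/2 complex entries sharing the cache with the data. The slides do not give the actual table size, so treat the numbers as illustrative.

```python
COMPLEX_FLOAT_BYTES = 8      # two 32-bit floats per complex point (assumed)
L1_BYTES = 4 * 1024          # 4 KB L1 cache, as stated on the slide

def working_set_bytes(n, twiddles=None):
    """Bytes touched by an in-place FFT: data plus twiddle table.

    ASSUMPTION: the twiddle table holds n // 2 complex entries unless
    another count is passed in.
    """
    if twiddles is None:
        twiddles = n // 2
    return (n + twiddles) * COMPLEX_FLOAT_BYTES

def needs_splitting(n):
    """True when the estimated working set exceeds the L1 cache."""
    return working_set_bytes(n) > L1_BYTES

for n in (128, 256, 512, 1024):
    print(n, working_set_bytes(n), needs_splitting(n))
# 128 1536 False
# 256 3072 False
# 512 6144 True
# 1024 12288 True
```

Under these assumptions, 256 points is the largest power-of-two size whose data and twiddles fit in 4 KB together, which is consistent with the slide's threshold.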

PRELIMINARY FFT performance on C67
TI DSP C6711 150 MHz DSK

Future Plans for C67
- Release the Core Lite profile library for the C67 platform
- Explore releasing Core profile library
- Explore possibilities of partnering with OS vendors such as OSE Systems

P4-related Issues, I
- Single-precision optimization not a big concern outside embedded computing…
- Good free libraries (e.g., FFTW, LAPACK) and MKL exist as alternatives
- Academic basic kernels for matrix multiplication (non-ATLAS) are now mature enough to use with small code size, but these are not open source/redistributable (e.g., libgoto)
- Several universities working on better free libraries
- Code bloat an issue for certain library architectures when considering embedded (e.g., ATLAS code size)
- The merger of free libraries and free VSIPL has been tried, but is not as good as a fully optimized library (e.g., VSIPL ERI, VSIPL Ref Implementation upgrade)
- Demand for commercial VSIPL for P4 remains a question

P4-related Issues, II
- Distinct flavors of P4 (e.g., Athlon) have distinctively different optimal libraries
  - Cache architecture
  - Instruction decode differences
  - TLB and other memory issues
  - Register file differences (e.g., 16 vs 8)
- Strong potential that future embedded P4 clones will also have different optimal choices in their hardware configurations

Why we think it is useful to have commercial VSIPL on P4 and C67
- Shows a true performance portability story between diverse architectures, not just different G4/Altivec OS’s and vendors
- Allows system designers to work with the assumption of low software porting cost, and explore other aspects of design alternatives
- Processors are getting harder to program
- The precise mix of optimizations required for embedded is not a strong emphasis of free libraries per se

Conclusions
- Demand for VSIPL for non-G4 platforms is TBD… appears promising but not well developed
- Opportunities to achieve extremely high performance on clearly different architectures now evident
- Proof of concept may help drive adoption
- Technical hurdles involving hand-optimization remain for key inner kernels on each new platform, but do not require massive coding in assembly language if handled correctly
- C/C++ toolchain always an issue for new processor + OS combinations