ARM Cortex-A9 performance in HPC applications Kurt Keville, Clark Della Silva, Merritt Boyd ARM gaining market share in embedded systems and SoCs Current.

Slides:



Advertisements
Similar presentations
Accelerators for HPC: Programming Models Accelerators for HPC: StreamIt on GPU High Performance Applications on Heterogeneous Windows Clusters
Advertisements

Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,
A Complete GPU Compute Architecture by NVIDIA Tamal Saha, Abhishek Rawat, Minh Le {ts4rq, ar8eb,
Computer Abstractions and Technology
Monte-Carlo method and Parallel computing  An introduction to GPU programming Mr. Fang-An Kuo, Dr. Matthew R. Smith NCHC Applied Scientific Computing.
Evaluation of Message Passing Synchronization Algorithms in Embedded Systems 1 Evaluation of Message Passing Synchronization Algorithms in Embedded Systems.
GPU System Architecture Alan Gray EPCC The University of Edinburgh.
Appendix A — 1 FIGURE A.2.2 Contemporary PCs with Intel and AMD CPUs. See Chapter 6 for an explanation of the components and interconnects in this figure.
HPCC Mid-Morning Break High Performance Computing on a GPU cluster Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery.
FSOSS Dr. Chris Szalwinski Professor School of Information and Communication Technology Seneca College, Toronto, Canada GPU Research Capabilities.
Embedded Systems Programming
Some Thoughts on Technology and Strategies for Petaflops.
1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 19, 2011 Emergence of GPU systems and clusters for general purpose High Performance Computing.
Antenna Baseline Measurement System 29 January 2010 Mark Phillips, Research Assistant Ohio University, Athens OH Research Intern Naval Research Laboratory.
Introduction to ARM Architecture, Programmer’s Model and Assembler Embedded Systems Programming.
Instruction Set Architecture (ISA) for Low Power Hillary Grimes III Department of Electrical and Computer Engineering Auburn University.
Embedded Systems Programming
Heterogeneous Computing Dr. Jason D. Bakos. Heterogeneous Computing 2 “Traditional” Parallel/Multi-Processing Large-scale parallel platforms: –Individual.
Prardiva Mangilipally
© 2009 Acehub Vista Sdn. Bhd Introduction to ARM ® Processors.
Cosc 2150: Computer Organization Chapter 10: Embedded Systems.
Camera Aided Robot Progress Report.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
HPCC Mid-Morning Break Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery Introduction to the new GPU (GFX) cluster.
Beagleboard and Friends Nathan Gough. Hardware – OMAP3  Based around Texas Instruments OMAP3530 “Applications Processor”  OMAP3 Platform:  Arm Cortex-A8.
Embedded Operating System Design October 4, 2012 Doug Kelly.
Exploiting Disruptive Technology: GPUs for Physics Chip Watson Scientific Computing Group Jefferson Lab Presented at GlueX Collaboration Meeting, May 11,
1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Dec 31, 2012 Emergence of GPU systems and clusters for general purpose High Performance Computing.
1 Introduction to ARM A15 Linux DSP Platform Software Apps Team 04/19/2013 1TI Confidential - NDA Restrictions.
Implementation of Parallel Processing Techniques on Graphical Processing Units Brad Baker, Wayne Haney, Dr. Charles Choi.
Porting the physical parametrizations on GPUs using directives X. Lapillonne, O. Fuhrer, Cristiano Padrin, Piero Lanucara, Alessandro Cheloni Eidgenössisches.
High Performance Computing Processors Felix Noble Mirayma V. Rodriguez Agnes Velez Electric and Computer Engineer Department August 25, 2004.
Taking the Complexity out of Cluster Computing Vendor Update HPC User Forum Arend Dittmer Director Product Management HPC April,
1 Buses and types of computer. Paul Strickland Liverpool John Moores University.
Multi-Core Development Kyle Anderson. Overview History Pollack’s Law Moore’s Law CPU GPU OpenCL CUDA Parallelism.
ICAL GPU 架構中所提供分散式運算 之功能與限制. 11/17/09ICAL2 Outline Parallel computing with GPU NVIDIA CUDA SVD matrix computation Conclusion.
Lecture 6. VFP & NEON in ARM
Why it might be interesting to look at ARM Ben Couturier, Vijay Kartik Niko Neufeld, PH-LBC SFT Technical Group Meeting 08/10/2012.
Succeeding with Technology Chapter 2 Hardware Designed to Meet the Need The Digital Revolution Integrated Circuits and Processing Storage Input, Output,
Sparse Matrix-Vector Multiply on the Keystone II Digital Signal Processor Yang Gao, Fan Zhang and Dr. Jason D. Bakos 2014 IEEE High Performance Extreme.
Co-Processor Architectures Fermi vs. Knights Ferry Roger Goff Dell Senior Global CERN/LHC Technologist |
Computer Architecture Lecture 24 Parallel Processing Ralph Grishman November 2015 NYU.
GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.
Chapter 1 Computer Abstractions and Technology. Chapter 1 — Computer Abstractions and Technology — 2 The Computer Revolution Progress in computer technology.
3/12/2013Computer Engg, IIT(BHU)1 CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.
Parallel Computers Today Oak Ridge / Cray Jaguar > 1.75 PFLOPS Two Nvidia 8800 GPUs > 1 TFLOPS Intel 80- core chip > 1 TFLOPS  TFLOPS = floating.
“Processors” issues for LQCD January 2009 André Seznec IRISA/INRIA.
Emergence of GPU systems for general purpose high performance computing ITCS 4145/5145 © Barry Wilkinson GPUIntro.ppt Oct 30, 2014.
Heterogeneous Processing KYLE ADAMSKI. Overview What is heterogeneous processing? Why it is necessary Issues with heterogeneity CPU’s vs. GPU’s Heterogeneous.
A next-generation many-core processor with reliability, fault tolerance and adaptive power management features optimized for embedded.
Michael Opdenacker, Community Manager SophiaConf, July 2011 Linaro Engineering resources for the ARM Linux community.
Effects of Limiting Numerical Precision on Neural Networks
Itanium® 2 Processor Architecture
TI Information – Selective Disclosure
Scientific requirements and dimensioning for the MICADO-SCAO RTC
PRESENTATION ON ARM PROCESSORS
Inc. 32 nm fabrication process and Intel SpeedStep.
Low Power processors in HEP
Lynn Choi School of Electrical Engineering
Texas Instruments TDA2x and Vision SDK
MeeGo on Development Boards
NVIDIA Fermi Architecture
CPU and Embedded Systems
ClearSpeed CSX620 Overview
Robert Reed | University of the Witwatersrand
Objectives Describe how common characteristics of CPUs affect their performance: clock speed, cache size, number of cores Explain the purpose and give.
Introduction to Single Board Computer
CSE 502: Computer Architecture
Multicore and GPU Programming
Presentation transcript:

ARM Cortex-A9 performance in HPC applications Kurt Keville, Clark Della Silva, Merritt Boyd ARM gaining market share in embedded systems and SoCs Current processors include the ARM9 series, the Cortex-A8, and the Cortex-A9 ~15 Billion ARM chips shipped to date Thumb / Thumb-2 More efficient instruction encoding, better code density Higher performance for select applications VFPv3 Floating point co-processor NEON SIMD Extensions Up to 4x 32-bit floating point operations per instruction No double precision

ARM Cortex-A8 and Cortex-A9 Uses the ARMv7-A architecture Thumb-2, NEON, VFPv3 Systems TI BeagleBoard (1GHz)‏ Genesi Efika-MX (800MHz)‏ Gumstix Overo Earth (600MHz)‏ Uses ARMv7-A architecture Thumb-2, NEON, VFPv3 Available in single and dual core packages, quads upcoming (A6, Tegra 3)‏ Systems TI PandaBoard (dual 1GHz)‏ NuFront NuSmart (dual 1.2GHz)‏

TI PandaBoard Results 3.0 Gflop/s SP NEON, 1.2 Gflop/s DP (HPL)‏ Power consumption Idle: ~4 Watts / board, Full load: ~7.5 Watts / board Software & Hardware Challenges Most libraries assume x86 / x86-64, No precompiled binaries (unavailable or unoptimized), Compiler support immature (-mcpu=cortex-a9, -mhard-float)‏ Limited RAM on some systems, Low-quality networking hardware and software, Few possibilities for expansion, Reliability issues Energy Efficiency 2 Gflop/s / Watt gets you #1 on Green500 PandaBoard is $175, and 18 square inches.4 Gflop/s / Watt, Gflop/s / $, and Gflop/s / square inch

Looking Ahead : Embedded GPUs Most SoCs include a GPU, e.g. PVR SGX 540 (PandaBoard)‏ Potential for mixed CPU-GPU computation OpenCL support, pending release of drivers on TI SoCs, available for Apple Hardware ARM Cortex-A15 with PVR series 6 GPU Much more powerful and better suited for computation Tegra 3 & 4 Potential for Cuda Support