This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit


This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.

Lecture 11 - Portability and Optimizations

Laura Gheorghe, Petre Eftime

Portability - Different Operating Systems
- Portability on different operating systems
- Each operating system has its own system API and libraries
- Possible solutions:
  - Standard libraries (C, C++, etc.)
    - Implement a certain standard
    - Standard C library implemented after the ANSI C standard
  - Interaction with the OS
    - Wrappers around the API (system calls, library calls)
    - Identify the OS and make the appropriate calls
  - Use compiler macros
    - Identify the OS
    - __ANDROID__, __linux__, _WIN32, __MACH__, etc.
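The compiler-macro approach listed above can be sketched as follows. This is a minimal illustration, not code from the lecture: the function name target_os_name is hypothetical, and only the four macros named on the slide are tested. __ANDROID__ is checked first because Android toolchains also define __linux__.

```c
/* Return a short name for the OS this file is being compiled for.
 * The function name is hypothetical; the macros are the ones listed above. */
const char *target_os_name(void)
{
#if defined(__ANDROID__)        /* must precede __linux__: Android defines both */
    return "android";
#elif defined(__linux__)
    return "linux";
#elif defined(_WIN32)
    return "windows";
#elif defined(__MACH__)         /* Mach-based systems such as macOS and iOS */
    return "mach";
#else
    return "unknown";
#endif
}
```

A wrapper layer would typically branch on the same macros to pick the system call or library call that exists on each platform.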

Portability - Different Architectures
- Portability on different hardware platforms
- Each architecture has a certain ABI
- ABI:
  - Describes data type size, alignment, calling convention, dealing with system calls, binary format of object files, etc.
  - Also depends on the OS
- Solution: compile for a certain ABI with the appropriate toolchain
- Cross-compilation:
  - Compile for a different architecture than the one we are on
  - Select the appropriate toolchain for the target architecture and OS
  - Toolchains with correct headers, libraries and ABI
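To make the "data type size, alignment" part of an ABI concrete, the sketch below exposes a few values that the toolchain fixes at compile time. The struct and function names are hypothetical and chosen only for illustration; the same source can report different numbers when built for different ABIs.

```c
#include <limits.h>
#include <stddef.h>

/* The padding inserted before 'value' is decided by the target ABI. */
struct sample {
    char   tag;
    double value;
};

/* Width of a pointer in bits for the ABI being compiled for. */
size_t abi_pointer_bits(void)
{
    return sizeof(void *) * CHAR_BIT;
}

/* Size of 'long' in bytes; it differs between common 32-bit and 64-bit ABIs. */
size_t abi_long_bytes(void)
{
    return sizeof(long);
}

/* Offset of the double inside struct sample, i.e. its alignment in a struct. */
size_t abi_double_offset(void)
{
    return offsetof(struct sample, value);
}
```

For example, the 32-bit x86 Linux ABI places the double at offset 4, whereas the ARM EABI places it at offset 8, which is one reason object files built for different ABIs cannot be mixed.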

Android toolchain
- Can generate a standalone toolchain for an Android version and ABI
- Easily integrate with the build system for other platforms
- $NDK/build/tools/make-standalone-toolchain.sh --platform=android- --arch= --install-dir=
  - ARCHITECTURE can be x86, ARM (default) or MIPS
- Contains C++ STL library with exceptions and RTTI

Optimizations - Compiler
- Identify performance problems using profiling (systrace, ddms, traceview)
- Compilers can generate optimized code
- APP_OPTIM - differentiates between debug and release versions
  - Defined in Application.mk
  - For release versions it uses -O2 and defines NDEBUG
  - NDEBUG disables assertions and can be used to remove debugging code
- The compiler may perform (implicit) vectorization => increased performance
- The compiler might not vectorize even where it would be appropriate, so it is sometimes necessary to optimize by hand (but check your algorithm first)
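The effect of NDEBUG can be illustrated with a small sketch. The DEBUG_LOG macro and the average function are hypothetical names introduced here; only assert and NDEBUG come from the C standard library. In a release build, where NDEBUG is defined, both the assertion and the log call compile to nothing.

```c
#include <assert.h>
#include <stdio.h>

/* Debug-only logging: present in debug builds, removed when NDEBUG is defined. */
#ifndef NDEBUG
#define DEBUG_LOG(msg) fprintf(stderr, "debug: %s\n", (msg))
#else
#define DEBUG_LOG(msg) ((void)0)
#endif

/* Integer average of 'count' values; the precondition is checked in debug builds only. */
int average(const int *values, int count)
{
    assert(values != NULL && count > 0);   /* compiled out when NDEBUG is defined */
    DEBUG_LOG("computing average");

    long sum = 0;
    for (int i = 0; i < count; i++)
        sum += values[i];
    return (int)(sum / count);
}
```

Because the assertion disappears in release builds, it should guard against programming errors only and must not contain code with side effects the program relies on.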

Optimizations - Libraries
- Libraries can provide highly optimized functions
- Some are architecture dependent
- Math:
  - Eigen
    - C++ template library for linear algebra
  - ATLAS
    - Linear algebra routines
    - C, Fortran
- Image and signal processing:
  - Intel Integrated Performance Primitives
    - Multimedia processing, data processing, and communications applications
  - OpenCV
    - Computational efficiency, real-time applications
    - C++, C, Python and Java
- Threading:
  - Intel Threading Building Blocks
    - C++ library for creating high performance, scalable parallel applications

Low level optimizations
- The algorithm is already optimized, the compiler does not optimize properly, and no optimized libraries are available => low level optimizations
- Use (explicit) vectorization
  - Intrinsic compiler functions or assembly
- Not all CPUs have the same capabilities
- At compile time:
  - Build different versions of libraries for each architecture
  - In the Makefile, depending on the ABI
- At runtime:
  - Execute a certain piece of code only on some architectures
  - Choose specific optimizations based on CPU features at runtime
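One common way to choose an optimization at runtime is to select a function pointer once and call through it afterwards. The sketch below assumes a hypothetical flag, cpu_has_fast_path, that stands in for a real CPU feature query, and add_unrolled is only a placeholder for a hand-optimized routine; all names are illustrative.

```c
#include <stddef.h>

typedef void (*add_fn)(float *dst, const float *a, const float *b, size_t n);

/* Portable fallback that works on every CPU. */
static void add_scalar(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Placeholder for a hand-optimized version (here simply unrolled by four). */
static void add_unrolled(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = a[i]     + b[i];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)              /* remaining elements */
        dst[i] = a[i] + b[i];
}

/* Pick an implementation once, based on a feature flag detected at runtime. */
add_fn select_add(int cpu_has_fast_path)
{
    return cpu_has_fast_path ? add_unrolled : add_scalar;
}
```

The caller stores the result of select_add at startup, so the feature check is not repeated inside hot loops.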

CPU Features Library
- cpufeatures library on Android
- Identifies processor type and attributes
- Make optimizations at runtime according to the processor
- Main functions:
  - android_getCpuFamily
    - ANDROID_CPU_FAMILY_ARM, ANDROID_CPU_FAMILY_X86, etc.
  - android_getCpuFeatures
    - Returns a set of bits, each representing an attribute
    - Floating point, NEON, instruction set, etc.
  - android_getCpuCount
    - Number of cores
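A hedged sketch of how the three functions above might be wrapped follows. The wrapper names can_use_neon and worker_thread_count are hypothetical, and the non-Android branches are an assumption added so that the file also compiles on a development host, where the cpufeatures header is not available.

```c
#if defined(__ANDROID__)
#include <cpu-features.h>   /* provided by the NDK cpufeatures module */
#endif

/* Returns 1 if NEON-optimized routines may be used on this device, else 0. */
int can_use_neon(void)
{
#if defined(__ANDROID__)
    return android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
           (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0;
#else
    return 0;   /* assumption: non-Android builds take the portable path */
#endif
}

/* Suggests a worker count based on the number of cores, never less than 1. */
int worker_thread_count(void)
{
#if defined(__ANDROID__)
    int cores = android_getCpuCount();
    return cores > 0 ? cores : 1;
#else
    return 1;   /* assumption: single worker when the core count is unknown */
#endif
}
```

The feature bits are tested with a bitwise AND because android_getCpuFeatures returns several attributes packed into one value.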

ABI: x86
- Android NDK supports 4 ABIs: x86, armeabi, armeabi-v7a, mips
- x86 supports the instruction set called 'x86' or 'IA-32'
- Includes:
  - Pentium Pro instruction set
  - MMX, SSE, SSE2 and SSE3 instruction set extensions
- Code optimized for Atom CPU
- Follows the standard Linux x86 32-bit calling convention

ABI: armeabi
- Supports at least the ARMv5TE instruction set
- Follows the little-endian ARM GNU/Linux ABI
  - Least significant byte at the smallest address
- No support for hardware-assisted floating point computations
  - FP operations through software functions in the libgcc.a static library
- Does not support NEON
- Supports Thumb-1
  - Instruction set with a compact 16-bit encoding for a subset of the ARM instruction set
  - Used when you have a small amount of memory
  - Android generates Thumb code by default
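The phrase "least significant byte at the smallest address" can be checked directly in code. The following is a small sketch with a hypothetical function name; it inspects the first byte of a known integer.

```c
#include <stdint.h>

/* Returns 1 on a little-endian target, 0 on a big-endian one. */
int is_little_endian(void)
{
    const uint32_t probe = 1;   /* bytes are 01 00 00 00 on little-endian */
    /* Reading an object through an unsigned char pointer is always allowed. */
    return *(const unsigned char *)&probe == 1;
}
```

On the Android ABIs described here the function is expected to return 1, since they all follow little-endian conventions.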

ABI: armeabi-v7a
- Extends armeabi to include instruction set extensions
- Supports at least the ARMv7A instruction set
- Follows the little-endian ARM GNU/Linux ABI
- Supports VFPv3-D16
  - 16 dedicated 64-bit floating point registers provided by the CPU
- Supports Thumb-2
  - Extends Thumb with 32-bit instructions
  - Covers more operations

ABI: armeabi-v7a
- Supports NEON
  - 128-bit SIMD architecture extension for the ARM Cortex™-A series
  - Accelerates multimedia and signal processing: video encode/decode, 2D/3D graphics, image/sound processing
- Set LOCAL_ARM_NEON to true in Android.mk
  - All sources are compiled with NEON support
  - Use NEON GCC intrinsics in C/C++ code or NEON instructions in assembly code
- Add the .neon suffix to sources in LOCAL_SRC_FILES
  - Compile only those files with NEON support
  - LOCAL_SRC_FILES := foo.c.neon bar.c
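As an illustration of NEON intrinsics in C code, the sketch below adds two float arrays four elements at a time. The function name add_floats and the HAVE_NEON macro are hypothetical; the intrinsics vld1q_f32, vaddq_f32 and vst1q_f32 come from arm_neon.h. The scalar loop is an assumption added so that the same file still compiles when NEON support is not enabled.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* dst[i] = a[i] + b[i] for i in [0, n). */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if HAVE_NEON
    /* Process four floats per iteration in one 128-bit NEON register. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar loop: handles the tail, or everything when NEON is unavailable. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

A file like this would be listed with the .neon suffix in LOCAL_SRC_FILES, or built with LOCAL_ARM_NEON set to true, so that the NEON branch is the one compiled for armeabi-v7a.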

NEON

    # define a static library containing our NEON code
    ifeq ($(TARGET_ARCH_ABI),armeabi-v7a)
    include $(CLEAR_VARS)
    LOCAL_MODULE := neon-example
    LOCAL_SRC_FILES := neon-example.c
    LOCAL_ARM_NEON := true
    include $(BUILD_STATIC_LIBRARY)
    endif # TARGET_ARCH_ABI == armeabi-v7a

    #include [..]

    if (android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
        (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0) {
        // use NEON-optimized routines
        [..]
    } else {
        // use non-NEON fallback routines instead
        [..]
    }


Keywords
- Portability
- Standard Libraries
- Wrappers
- ABI
- Toolchain
- Cross-compilation
- Profiling
- Optimization
- Vectorization
- Optimized libraries
- CPU features
- MMX, SSE
- NEON
- Little-endian
- Thumb
- VFP