National Tsing Hua University ® copyright OIA National Tsing Hua University HSAemu - A Full System Emulator for HSA Platform Prof. Yeh-Ching Chung System.

Presentation transcript:

HSAemu - A Full System Emulator for HSA Platform
Prof. Yeh-Ching Chung
System Software Laboratory, Department of Computer Science
National Tsing Hua University

Outline
  – Introduction to HSA
  – Design of HSAemu
  – Performance Evaluation
  – Conclusions and Future Work

Introduction to HSA
HSA Foundation is a non-profit industry standards body that creates software/hardware standards for heterogeneous computing
  – simplify the programming environment
  – make compute at low power pervasive
  – introduce new capabilities in modern computing devices
Core founders include AMD, ARM, Imagination Technologies, MediaTek, Qualcomm, Samsung, and Texas Instruments
Open membership to deliver royalty-free specifications and APIs
Founded June 12,

Members of HSA Foundation – 2014/6
Membership tiers: Founders, Promoters, Supporters, Contributors, Academic
Membership consists of 43 companies and 16 universities
Adding 1-2 new members each month

HSA Foundation’s Initial Focus (1)
Heterogeneous SOCs have arrived and are a tremendous advance over previous platforms
SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory
How do we make them even better?
  – Easier to program
  – Easier to optimize
  – Higher performance
  – Lower power

HSA Foundation’s Initial Focus (2)
HSA unites accelerators architecturally
  – Bring the GPU forward as a first class processor
      Unified coherent address space (hUMA)
      User mode dispatch/scheduling
      Can utilize pageable system memory
      Fully coherent memory between the CPU and GPU
      Pre-emption and context switching
      Relaxed consistency memory model
      Quality of Service
Attract mainstream programmers
  – Support broader set of languages beyond traditional GPGPU languages
  – Support for task parallel runtimes & nested data parallel programs
  – Rich debugging and performance analysis support

HSA Foundation’s Initial Focus (3)
[Figure: CPU, GPU, Audio Processor, Video Hardware, DSP, Security Processor, Image Signal Processing, and Fixed Function Accelerator, all connected through Shared Memory and Coherency (SM&C)]
Early focus on the GPU compute accelerator, but HSA will go well beyond the GPU

Pillars of HSA*
  – Unified addressing across all processors
  – Operation into pageable system memory
  – Full memory coherency
  – User mode dispatch
  – Architected queuing language
  – Scheduling and context switching
  – HSA Intermediate Language (HSAIL)
  – High level language support for GPU compute processors

HSA Specifications
HSA System Architecture Specification
  – Version 1.01, released March 16, 2015
  – Defines discovery, memory model, queue management, atomics, etc.
HSA Programmers Reference Specification
  – Version 1.02, released March 16, 2015
  – Defines the HSAIL language and object format
HSA Runtime Software Specification
  – Version 1.0, released March 16, 2015
  – Defines the APIs through which an HSA application uses the platform
All released specifications can be found at the HSA Foundation web site

hQ and hUMA

HSA Intermediate Layer: HSAIL
HSAIL is a virtual ISA for parallel programs
  – Finalized to ISA by a JIT compiler or “Finalizer”
  – ISA independent by design for CPU & GPU
Explicitly parallel
  – Designed for data parallel programming
Support for exceptions, virtual functions, and other high level language features
Lower level than OpenCL SPIR
  – Fits naturally in the OpenCL compilation stack
Suitable to support additional high level languages and programming models:
  – Java, C++, OpenMP, Python, etc.

HSA Memory Model
Defines visibility ordering between all threads in the HSA System
Designed to be compatible with the C++11, Java, OpenCL and .NET memory models
Relaxed consistency memory model for parallel compute performance
Visibility controlled by:
  – Load.Acquire
  – Store.Release
  – Fences

HSA Queuing Model
User mode queuing for low latency dispatch
  – Application dispatches directly
  – No OS or driver required in the dispatch path
Architected Queuing Layer
  – Single compute dispatch path for all hardware
  – No driver translation, direct to hardware
Allows for dispatch to queue from any agent
  – CPU or GPU
GPU self enqueue enables lots of solutions
  – Recursion
  – Tree traversal
  – Wavefront reforming
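The user mode queue described above can be pictured as a ring buffer that a producer writes into directly and then announces with a doorbell. The following is a minimal sketch under that assumption only; the class `UserModeQueue` and its fields are hypothetical names for illustration and are not taken from the HSA specification or from HSAemu.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class UserModeQueue:
    """Hypothetical fixed-size ring buffer standing in for a user mode queue."""
    size: int
    slots: List[Optional[Any]] = field(default_factory=list)
    write_index: int = 0   # advanced by producers (any agent)
    read_index: int = 0    # advanced by the consumer (packet processor)
    doorbell: int = -1     # last write index announced to the consumer

    def __post_init__(self) -> None:
        self.slots = [None] * self.size

    def enqueue(self, packet: Any) -> bool:
        """Write a packet into the next free slot, then ring the doorbell."""
        if self.write_index - self.read_index >= self.size:
            return False  # queue full; the caller has to retry later
        self.slots[self.write_index % self.size] = packet
        self.doorbell = self.write_index
        self.write_index += 1
        return True

    def drain(self) -> List[Any]:
        """Consume every packet announced by the doorbell, in order."""
        out = []
        while self.read_index <= self.doorbell:
            out.append(self.slots[self.read_index % self.size])
            self.read_index += 1
        return out
```

No system call appears anywhere in the enqueue path, which is the point of user mode dispatch: the producer only touches memory it shares with the consumer.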

HSA Runtime
The HSA core runtime is a thin, user-mode API that provides the interface necessary for the host to launch compute kernels to the available HSA components.
The overall goal of the HSA core runtime design is to provide a high-performance dispatch mechanism that is portable across multiple HSA vendor architectures.
  – The dispatch mechanism differentiates the HSA runtime from other language runtimes by architected argument setting and kernel launching at the hardware and specification level.
  – The HSA core runtime API is standard across all HSA vendors, so that languages which use the HSA runtime can run on any vendor’s platform that supports the API.

HSA Platform

Simplified HSA Software Stack

First HSA APU

What Is HSAemu
HSAemu is a full system emulator that supports the following HSA features
  – Shared virtual memory between CPU and GPU
  – Memory based signaling and synchronization
  – Multiple user level command queues
  – Preemptive GPU context switching
  – Concurrent execution of CPU threads and GPU threads
  – HSA runtime
  – Finalizer
A project sponsored by MediaTek (MTK)
Currently, it supports simple HSA platform simulation
  – Functional-accurate simulation
  – Cycle-accurate simulation

Goals of HSAemu
Verify software stack implementation
  – Tool chain/SDK
  – HSA runtime
  – Finalizers
Assist application software development in parallel with hardware development
  – HSA feature support
  – Functional correctness guaranteed
Easy to plug in different simulators/emulators
  – Provide a command buffer interface

Architecture of HSAemu
HSAemu consists of 9 components
  – HSAIL Off-line Compiler
  – HSA Runtime
  – HSA Driver
  – HSA Finalizer
  – CPU Simulation Module
  – GPU Task Dispatcher
  – Functional-Accurate GPU Simulator (Fast-Time GPU Simulator)
  – Cycle-Accurate GPU Simulator (Multi2Sim)
  – GPU Helper Functions

OpenCL 1.2 Benchmarks
AMD-APPSDK OpenCL benchmarks
  – 20+ benchmarks can be run on HSAemu
  – For example: NBODY, Mandelbrot set, Histogram, etc.
Rodinia OpenCL benchmarks
  – K-means, Gaussian, etc.

Compilation Framework (1)
[Figure: OpenCL Kernel -> HSAIL Compiler -> HSAIL -> HSAIL Decoder -> BRIG -> HSAIL Finalizer -> Device Native]
HSAIL Compiler
  – Converts the OpenCL kernel to HSAIL
HSAIL Decoder
  – Converts HSAIL to the binary format (BRIG)
HSAIL Finalizer
  – Finalizes the BRIG to the real ISA, which is selected by the HSA Runtime

Compilation Framework (2)
Components and compilation flow
[Figure: OpenCL Kernel -> CL2HSAIL -> HSAIL Text -> HSAIL2BRIG -> HSAIL Binary (BRIG) -> BRIG2OBJ -> Object File. Other labels in the figure: HSAIL Finalization, HSA Runtime, Kernel Descriptor, OpenCL 2.0 Runtime]

Compilation Framework (3)
CL2HSAIL
[Figure: OpenCL Kernel, which includes the Library Header (Built-In Function Library, OpenCL Type Header Library) -> Clang -> LLVM IR -> llc (HSAIL Target) -> HSAIL Text]
  – CL2HSAIL is based on LLVM
  – Compiling OpenCL to LLVM IR requires including a self-defined OpenCL library header
  – The LLVM backend and the HSAIL Target module translate LLVM IR to HSAIL

Compilation Framework (4)
HSAIL2BRIG
[Figure: HSAIL Text -> HSAIL2BRIG -> HSAIL Binary (BRIG)]
  – Based on Lex and Yacc
BRIG is an ELF format binary file following the HSAIL specification

Compilation Framework (5)
BRIG2OBJ
[Figure: HSAIL Binary (BRIG) -> Flow Constructor -> HDecoder -> LLVM BitCode -> HAssembler -> Object File]
BRIG2OBJ is based on LLVM
  – Flow Constructor: Convert BRIG to a control flow tree
  – HDecoder: Convert the control flow tree to LLVM bitcode
  – HAssembler: Convert LLVM bitcode to host native code

HSAIL Finalization (1)
[Figure: BRIG -> OpenCL Runtime -> HSA Runtime (kernel descriptor) -> BRIG2OBJ (Flow Constructor -> Control Flow Tree -> HDecoder -> LLVM BitCode -> HAssembler) -> Target Executable Object File -> Loader -> Linker -> Code Cache]
  – OpenCL Runtime: Call the corresponding HSA runtime
  – HSA Runtime: Read the BRIG file, generate the kernel descriptor and launch BRIG2OBJ
  – Flow Constructor: Construct the control flow graph of the HSAIL program
  – HDecoder: Translate HSAIL to LLVM IR
  – HAssembler: Translate LLVM IR to an LLVM target object file
  – Loader: Load the target object file
  – Linker: Link to helper functions
  – Code Cache: Store the target binary code in the code cache

HSAIL Finalization (2)
Host SSE instruction optimization
  – Reconstruct the control flow graph of the kernel function
  – Use bitmap masking and packing/unpacking algorithms to generate host SSE instructions
Example: The control flow graph for kernel function $foo

HSAIL Finalization (3)
  – Reconstruct the control flow graph by depth-first traversal
  – Perform bitmap masking and packing & unpacking algorithms
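The slides do not spell out the masking and packing algorithms, so the following is only a generic sketch of the idea, not the HSAemu finalizer: work items are packed into fixed-width vectors (a 128-bit SSE register holds four 32-bit values), both sides of a branch are evaluated for every lane, and a per-lane mask selects the result. The function names and the example branch (`y = 2*x if x > 0 else -x`) are made up for illustration.

```python
LANES = 4  # a 128-bit SSE register holds four 32-bit values


def pack(values, lanes=LANES):
    """Group per-work-item values into fixed-width vectors, zero-padding the tail."""
    padded = list(values) + [0] * (-len(values) % lanes)
    return [padded[i:i + lanes] for i in range(0, len(padded), lanes)]


def unpack(vectors, count):
    """Flatten vectors back to per-work-item values and drop the padding."""
    return [v for vec in vectors for v in vec][:count]


def masked_kernel(vec):
    """Evaluate both branch sides for all lanes, then blend with a lane mask."""
    mask = [x > 0 for x in vec]        # one mask bit per lane
    then_side = [x * 2 for x in vec]   # result if the branch is taken
    else_side = [-x for x in vec]      # result if the branch is not taken
    return [t if m else e for m, t, e in zip(mask, then_side, else_side)]


def run(values):
    """Pack work items, run the masked kernel per vector, unpack the results."""
    return unpack([masked_kernel(v) for v in pack(values)], len(values))
```

For example, `run([1, -2, 3, 0, 5])` yields `[2, 2, 6, 0, 10]`: divergent work items share one vector and differ only in their mask bits.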

OpenCL Runtime
Most of the OpenCL 1.2 APIs were implemented
  – Based on the Multi2Sim runtime architecture
The OpenCL APIs call HSA runtime APIs to do their work
  – OpenCL device init -> hsa_init API
  – OpenCL command queue -> hsa_queue and AQL packet

HSA Runtime
Follows the HSA runtime specification v1.0
The following features were implemented
  – HSA init and shutdown
  – HSA notification mechanism
  – HSA system and agent information
  – HSA queue
  – HSA AQL packet
  – HSA signal
  – HSA memory

HSA Driver
  – Provide hardware information for the HSA runtime
  – Provide memory operations for the HSA runtime
  – Pack AQL packets into a command
  – Dispatch the command to the Command Buffer
[Figure: Command buffer packet]

CPU Simulation Module (1)
Acts as an HSA host
  – PQEMU
Agent code, the HSA runtime, and the operating system run on PQEMU

CPU Simulation Module (2)
PQEMU
  – A parallel system emulator based on QEMU
  – Can simulate up to 256 cores
  – Dynamic binary translation (DBT) technique
  – A project sponsored by MTK
[Figure: Code Cache, CPU, DBT]

CPU Simulation Module (3)
HSA Signal Handler
  – Receive the doorbell signal from the HSA runtime and decode the signal handle (start kernel program)
  – Encode the completion signal and send it to the user program (finish kernel program)
  – Inform the command packet processor to process commands
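The completion half of this handshake can be sketched as a shared counter that the host waits on and the device side decrements when a kernel finishes. This is a hedged illustration only: the `Signal` class and its methods are hypothetical, and the "start at 1, wait for 0" convention is an assumption made for the example, not a description of the HSAemu signal handler.

```python
import threading


class Signal:
    """Hypothetical memory-based signal: an integer plus a way to wait on it."""

    def __init__(self, initial: int) -> None:
        self._value = initial
        self._cond = threading.Condition()

    def subtract(self, amount: int = 1) -> None:
        """Device side: mark work as finished and wake any waiter."""
        with self._cond:
            self._value -= amount
            self._cond.notify_all()

    def wait_until_zero(self, timeout=None) -> bool:
        """Host side: block until the value reaches zero or the timeout expires."""
        with self._cond:
            return self._cond.wait_for(lambda: self._value == 0, timeout)

    def load(self) -> int:
        """Read the current value."""
        with self._cond:
            return self._value
```

A user program would create the signal with value 1, hand it to the dispatch, and call `wait_until_zero()`; the simulated compute unit calls `subtract()` once the kernel is done.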

GPU Task Dispatcher (1)
Command Buffer
  – Defines a command buffer interface so that emulators/simulators are easy to plug in
      MMIO, syscall, interrupt, etc.
  – Receives the command packets from applications
      A command packet contains a device id, an opcode, and the AQL packets enqueued by the HSA runtime
[Figure: Command packet]

GPU Task Dispatcher (2)
Command packet processor
  – Fetch command packets from the Command Buffer (FIFO)
  – Decode the command packets to extract AQL packets or custom data
  – Copy the kernel object (executable code) to shared virtual memory
  – Link the kernel object to the emulator
  – Put the kernel object into the code cache
  – Dispatch jobs to HSA kernel agents or other emulation engines
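The fetch, decode, and dispatch steps above can be sketched as a FIFO of command packets routed by device id. This is a simplified illustration, not HSAemu code: the class names, the `"DISPATCH"` opcode, and the use of plain dictionaries for AQL packets are all assumptions, and the kernel-object copy, link, and code-cache steps are omitted.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List


@dataclass
class CommandPacket:
    """Fields named on the slide: device id, opcode, and AQL packets."""
    device_id: int
    opcode: str
    aql_packets: List[dict]


class CommandPacketProcessor:
    """Hypothetical processor that drains a FIFO command buffer."""

    def __init__(self) -> None:
        self.buffer: Deque[CommandPacket] = deque()           # FIFO command buffer
        self.engines: Dict[int, Callable[[dict], None]] = {}  # device id -> engine

    def register_engine(self, device_id: int, engine: Callable[[dict], None]) -> None:
        """Plug a simulator or emulator in behind a device id."""
        self.engines[device_id] = engine

    def submit(self, packet: CommandPacket) -> None:
        """Append a command packet to the buffer."""
        self.buffer.append(packet)

    def process_all(self) -> int:
        """Fetch packets in FIFO order, decode them, and dispatch AQL packets."""
        dispatched = 0
        while self.buffer:
            cmd = self.buffer.popleft()
            engine = self.engines[cmd.device_id]
            if cmd.opcode == "DISPATCH":  # hypothetical opcode name
                for aql in cmd.aql_packets:
                    engine(aql)
                    dispatched += 1
        return dispatched
```

Because engines are looked up by device id, a different simulator can be plugged in without touching the producer side of the buffer.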

Fast-Time GPU Simulator (1)
Simulates a generic GPU model
  – The schedule unit assigns work groups to free CU threads in the GPU Thread Pool
  – Each CU thread executes all work items in a work group
  – The maximum number of CU threads is limited by the host operating system

Fast-Time GPU Simulator (2)
Schedule Unit
  – Master of the compute units
  – Manages a centralized work pool
  – Treats a workgroup as an atomic task (the workgroup is the basic unit)
  – Uses a spinlock to synchronize the compute unit threads
  – Distributes tasks in increasing workgroup number order
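A centralized work pool of this kind can be sketched with a shared counter that compute unit threads advance under a lock. The sketch below is an illustration under stated assumptions, not the HSAemu scheduler: `run_kernel` and its parameters are hypothetical, and a `threading.Lock` stands in for the spinlock mentioned on the slide.

```python
import threading
from typing import Callable, List


def run_kernel(kernel: Callable[[int, int], None], num_groups: int,
               group_size: int, num_cu_threads: int) -> List[int]:
    """Run a kernel over all work groups; return the order groups were handed out."""
    next_group = 0
    lock = threading.Lock()  # stand-in for the spinlock on the slide
    order: List[int] = []

    def cu_thread() -> None:
        nonlocal next_group
        while True:
            with lock:
                if next_group >= num_groups:
                    return                 # work pool is empty
                group_id = next_group      # take the next work group atomically
                next_group += 1
                order.append(group_id)
            # One CU thread executes every work item of its work group.
            for local_id in range(group_size):
                kernel(group_id, local_id)

    threads = [threading.Thread(target=cu_thread) for _ in range(num_cu_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

Work groups are handed out in increasing order regardless of which thread takes them; only the completion order depends on thread timing.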

Fast-Time GPU Simulator (3)
Compute Unit
  – Standalone thread
  – Has its own MMU (IOMMU) for shared virtual memory access
  – Sends the completion signal (CompletionSignal) to the HSA Signal Handler when a job is done
  – Profiles job information (TLB hits/misses)

M2S-GPU Simulator (1)
A cycle-accurate simulator for AMD Southern Islands GPU model simulation
  – M2S Bridge
      Bridges the Multi2Sim GPU model to HSAemu
  – M2S GPU Module
      Simulates a cycle-accurate GPU model

M2S-GPU Simulator (2)
M2S Bridge: An interface to launch the M2S GPU Module
  – Initialize the data structures used by the AMD Southern Islands GPU, including a memory register for the AMD Southern Islands GPU to access the shared system memory in HSAemu
  – Invoke the M2S GPU Module (the AMD Southern Islands GPU module in Multi2Sim)

M2S-GPU Simulator (3)
M2S GPU Module
  – A cycle-accurate AMD Southern Islands GPU simulator in Multi2Sim
Memory access is performed by the HSAemu memory helper function to comply with the hUMA model

GPU Helper Functions (1)
Memory Helper Function
  – A soft-MMU for the GPU, with a page table walker and a TLB, to enable the hUMA model
  – Supports redirecting accesses to local segment memory to non-shared private memory in the GPU
Kernel Information Helper Function
  – Collects and returns information about the GPU simulation and the current execution state
  – Retrieves kernel information such as work item ID, work group size, etc., from the AQL packet
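A software MMU with a TLB in front of a page table can be sketched as follows. This is a generic illustration, not the HSAemu memory helper: the `SoftMMU` class, the flat single-level page table, the 4 KiB page size, and the LRU replacement policy are all assumptions chosen to keep the example short. The hit and miss counters correspond to the kind of TLB profiling the compute unit slide mentions.

```python
from collections import OrderedDict
from typing import Dict

PAGE_SIZE = 4096  # assumed page size for the sketch


class SoftMMU:
    """Hypothetical soft-MMU: a small LRU TLB backed by a flat page table."""

    def __init__(self, page_table: Dict[int, int], tlb_entries: int = 4) -> None:
        self.page_table = page_table  # virtual page number -> physical frame number
        self.tlb: "OrderedDict[int, int]" = OrderedDict()
        self.tlb_entries = tlb_entries
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr: int) -> int:
        """Translate a virtual address, filling the TLB on a miss."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            self.hits += 1
            self.tlb.move_to_end(vpn)             # mark as most recently used
        else:
            self.misses += 1
            if vpn not in self.page_table:
                raise KeyError(f"page fault at {vaddr:#x}")
            if len(self.tlb) >= self.tlb_entries:
                self.tlb.popitem(last=False)      # evict the least recently used entry
            self.tlb[vpn] = self.page_table[vpn]  # page table walk result
        return self.tlb[vpn] * PAGE_SIZE + offset
```

If the simulated GPU walks the same page table that the CPU side uses, both sides resolve a given virtual address to the same physical location, which is the property a unified address space needs.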

GPU Helper Functions (2)
Mathematical Helper Function
  – Simulates special mathematical instructions, such as trigonometric instructions, by calling the corresponding mathematical functions in the standard library
Synchronization Helper Function
  – Barrier synchronization implementation for generic GPU model simulation
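The slides do not say how the barrier is implemented. One possible approach when a single host thread executes every work item of a work group is to run each work item up to the barrier before any of them continues. The sketch below shows that idea with Python generators, where a `yield` marks the barrier; it is an assumed technique for illustration, and `run_work_group` and `neighbor_kernel` are hypothetical names.

```python
def run_work_group(kernel, group_size, shared):
    """Run all work items of one group in lockstep between barriers."""
    live = [kernel(i, shared) for i in range(group_size)]
    while live:
        still_running = []
        for item in live:
            try:
                next(item)                 # run this work item to its next barrier
                still_running.append(item)
            except StopIteration:
                pass                       # work item has finished
        live = still_running


def neighbor_kernel(i, shared):
    """Example kernel: write a value, hit a barrier, then read a neighbor's value."""
    n = len(shared["a"])
    shared["a"][i] = i + 1
    yield                                  # barrier: every write above is now visible
    shared["out"][i] = shared["a"][(i + 1) % n]
```

With four work items, `shared["out"]` ends up as `[2, 3, 4, 1]`; without the barrier, the first three work items would read a neighbor's slot before it was written.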

Performance Evaluation
Experimental Environment
Benchmarks:
  – Nearest Neighbor (NN), K-Means, FFT, FWT, N-Body
  – Binary Search, Bitonic Sort, Reduction, FWT

Scalability of Fast-Time GPU Simulator
Comparison of the NN, K-means and FWT benchmarks on 32 physical cores
The speedup scales as long as the number of CU threads is smaller than the number of host physical cores

SSE Optimization of Fast-Time GPU Simulator
Performance comparison of FFT with SSE optimization turned on and off

N-Body Simulation by Fast-Time GPU Simulator
N-Body Simulation
All of the host's physical CPUs are running

Comparison of HSAemu and Multi2Sim (1)

Comparison of HSAemu and Multi2Sim (2)

Conclusions
An HSA-compliant full system emulator has been implemented
  – A functional-accurate simulator for a generic GPU model
  – A cycle-accurate simulator for the AMD Southern Islands GPU model (from Multi2Sim)
An HSA tool chain/SDK for OpenCL 1.2
Easy to plug in different simulators/emulators
  – Provide a command buffer interface

Future Work
  – OpenCL 2.0 support
  – Enhance HSAemu by implementing more HSA features
  – Integrate HSAemu with some existing cycle-accurate GPU simulators
  – Design a cycle-accurate simulator based on PQEMU for a generic CPU model
  – Design a cycle-accurate simulator based on PQEMU for the big.LITTLE CPU model

Q & A