Enabling Multi-threaded Applications on Hybrid Shared Memory Manycore Architectures Tushar Rawat and Aviral Shrivastava Arizona State University, USA CML.

Slides:

Advertisements

Similar presentations

A Process Splitting Transformation for Kahn Process Networks Sjoerd Meijer.

Advertisements

Yaron Doweck Yael Einziger Supervisor: Mike Sumszyk Spring 2011 Semester Project.

Introductions to Parallel Programming Using OpenMP

CML Efficient & Effective Code Management for Software Managed Multicores CODES+ISSS 2013, Montreal, Canada Ke Bai, Jing Lu, Aviral Shrivastava, and Bryce.

Jared Casper, Ronny Krashinsky, Christopher Batten, Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA A Parameterizable.

Threads. Readings r Silberschatz et al : Chapter 4.

Taxanomy of parallel machines. Taxonomy of parallel machines Memory – Shared mem. – Distributed mem. Control – SIMD – MIMD.

Types of Parallel Computers

Concurrent Data Structures in Architectures with Limited Shared Memory Support Ivan Walulya Yiannis Nikolakopoulos Marina Papatriantafilou Philippas Tsigas.

CML Vector Class on Limited Local Memory (LLM) Multi-core Processors Ke Bai Di Lu and Aviral Shrivastava Compiler Microarchitecture Lab Arizona State University,

Lecture 4: Concurrency and Threads CS 170 T Yang, 2015 Chapter 4 of AD textbook.

Introduction CS 524 – High-Performance Computing.

Chapter Hardwired vs Microprogrammed Control Multithreading

Threads 1 CS502 Spring 2006 Threads CS-502 Spring 2006.

Shangri-La: Achieving High Performance from Compiled Network Applications while Enabling Ease of Programming Michael K. Chen, Xiao Feng Li, Ruiqi Lian,

Threads CSCI 444/544 Operating Systems Fall 2008.

Murali Vijayaraghavan MIT Computer Science and Artificial Intelligence Laboratory RAMP Retreat, UC Berkeley, January 11, 2007 A Shared.

A Flexible Architecture for Simulation and Testing (FAST) Multiprocessor Systems John D. Davis, Lance Hammond, Kunle Olukotun Computer Systems Lab Stanford.

A Scalable, Cache-Based Queue Management Subsystem for Network Processors Sailesh Kumar, Patrick Crowley Dept. of Computer Science and Engineering.

Kamalapurkar Shounak Rajarshi Salil Joshi Rohan Bhavsar Sagar Pai Sandesh Low Latency Publisher-Subscriber Network for Stock Market Application Team WhiteWalkers.

Paper Review Building a Robust Software-based Router Using Network Processors.

Blue Gene / C Cellular architecture 64-bit Cyclops64 chip: –500 Mhz –80 processors ( each has 2 thread units and a FP unit) Software –Cyclops64 exposes.

Xiaocheng Zhou Intel Labs China “Single-chip Cloud Computer” An experimental many-core processor from Intel Labs.

Computer Architecture ECE 4801 Berk Sunar Erkay Savas.

McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures Runjie Zhang Dec.3 S. Li et al. in MICRO’09.

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang.

1 Hardware Support for Collective Memory Transfers in Stencil Computations George Michelogiannakis, John Shalf Computer Architecture Laboratory Lawrence.

IT253: Computer Organization Lecture 4: Instruction Set Architecture Tonga Institute of Higher Education.

University of Michigan Electrical Engineering and Computer Science 1 Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications.

1b.1 Types of Parallel Computers Two principal approaches: Shared memory multiprocessor Distributed memory multicomputer ITCS 4/5145 Parallel Programming,

Lecture 1: What is a Computer? Lecture for CPSC 2105 Computer Organization by Edward Bosworth, Ph.D.

CML Enabling Multi-threaded Applications on Hybrid Shared Memory Manycore Architectures Dr. Aviral Shrivastava Dr. Partha Dasgupta Dr. Georgios Fainekos.

1 Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy.

Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.

Storage Allocation for Embedded Processors By Jan Sjodin & Carl von Platen Present by Xie Lei ( PLS Lab)

CASH: REVISITING HARDWARE SHARING IN SINGLE-CHIP PARALLEL PROCESSOR

Lecture 4: MIPS Instruction Set Reminders: –Homework #1 posted: due next Wed. –Midterm #1 scheduled Friday September 26 th, 2014 Location: TODD 430 –Midterm.

IXP Lab 2012: Part 1 Network Processor Brief. NCKU CSIE CIAL Lab2 Outline Network Processor Intel IXP2400 Processing Element Register Memory Interface.

Lixia Liu, Zhiyuan Li Purdue University, USA PPOPP 2010, January 2009.

Introduction to Computer Systems Topics: Theme Five great realities of computer systems (continued) “The class that bytes”

LLMGuard: Compiler and Runtime Support for Memory Management on Limited Local Memory (LLM) Multi-Core Architectures Ke Bai and Aviral Shrivastava Compiler.

CML SSDM: Smart Stack Data Management for Software Managed Multicores Jing Lu Ke Bai, and Aviral Shrivastava Compiler Microarchitecture Lab Arizona State.

Alexey Pakhunov /XCG, Microsoft Research/ March 30 th, 2011.

Core Java Introduction Byju Veedu Ness Technologies httpdownload.oracle.com/javase/tutorial/getStarted/intro/definition.html.

Yang Yu, Tianyang Lei, Haibo Chen, Binyu Zang Fudan University, China Shanghai Jiao Tong University, China Institute of Parallel and Distributed Systems.

OpenMP for Networks of SMPs Y. Charlie Hu, Honghui Lu, Alan L. Cox, Willy Zwaenepoel ECE1747 – Parallel Programming Vicky Tsang.

THE UNIVERSITY OF TEXAS AT AUSTIN Programming Dense Matrix Computations Using Distributed and Off-Chip Shared-Memory on Many-Core Architectures Ernie Chan.

Content Project Goals. Workflow Background. System configuration. Working environment. System simulation. System synthesis. Benchmark. Multicore.

Silberschatz, Galvin and Gagne ©2013 Operating System Concepts – 9 th Edition Chapter 4: Threads.

NCHU System & Network Lab Lab #6 Thread Management Operating System Lab.

CML CML A Software Solution for Dynamic Stack Management on Scratch Pad Memory Arun Kannan, Aviral Shrivastava, Amit Pabalkar, Jong-eun Lee Compiler Microarchitecture.

COMPUTER SYSTEMS ARCHITECTURE A NETWORKING APPROACH CHAPTER 12 INTRODUCTION THE MEMORY HIERARCHY CS 147 Nathaniel Gilbert 1.

Analyzing Memory Access Intensity in Parallel Programs on Multicore Lixia Liu, Zhiyuan Li, Ahmed Sameh Department of Computer Science, Purdue University,

1 Scaling Soft Processor Systems Martin Labrecque Peter Yiannacouras and Gregory Steffan University of Toronto FCCM 4/14/2008.

TI Information – Selective Disclosure

Gwangsun Kim, Jiyun Jeong, John Kim

Thread & Processor Scheduling

Software Coherence Management on Non-Coherent-Cache Multicores

Chapter 1: A Tour of Computer Systems

Multi-core processors

The HP OpenVMS Itanium® Calling Standard

Ke Bai and Aviral Shrivastava Presented by Bryce Holton

Chapter 4: Threads.

Constructing a system with multiple computers or processors

Yiannis Nikolakopoulos

Constructing a system with multiple computers or processors

Programming with Shared Memory

EE 4xx: Computer Architecture and Performance Programming

Types of Parallel Computers

Presentation transcript:

Enabling Multi-threaded Applications on Hybrid Shared Memory Manycore Architectures Tushar Rawat and Aviral Shrivastava Arizona State University, USA CML

Port Multi-threaded Applications 11-Mar-15Tushar Rawat / Arizona State University1 Multicore System Multicore System HSM Manycore System HSM Manycore System ?

Threading and Shared Memory 11-Mar-15Tushar Rawat / Arizona State University2 Multicore System Multicore System Process Global Space Thread Threads have Shared and Implicit Access to Program Data

Processes and Shared Memory 11-Mar-15Tushar Rawat / Arizona State University3 HSM Manycore System HSM Manycore System Process Global Data is not implicitly shared among processes Process Global

Convenience Hardware 11-Mar-15Tushar Rawat / Arizona State University4 Multicore System Multicore System HSM Manycore System HSM Manycore System Hardware-based cache coherence

Convenience Hardware 11-Mar-15Tushar Rawat / Arizona State University5 Multicore System Multicore System HSM Manycore System HSM Manycore System Hardware-based cache coherence Small scratchpad-like on-chip memory Lacks hardware cache coherence

Identify shared data in multi-threaded program Map identified shared data to shared memory Contribution 11-Mar-15Tushar Rawat / Arizona State University6 On-chipOff-chip

Five Stage Approach 11-Mar-15Tushar Rawat / Arizona State University7 Variable Scope Within Threads Pointers Partition Data Thread To Process Analysis Translation

Input: Multi-threaded program source code Output: Name, size, type, read count and write count for each variable int array[10]; array[0] = 6; int foo = array[0]; Stage 1 – Variable Scope Analysis Tushar Rawat / Arizona State University811-Mar-15 Type: int array Size: 10 Read Count: 1 Write Count: 1 array Type: int Size: 1 Read Count: 0 Write Count: 1 foo

Input: A given variable (name) e.g. “total_threads” Output: Result stating whether the given variable exists within 1 thread, 2+ threads or none Stage 2 – Inter-thread Analysis Tushar Rawat / Arizona State University911-Mar-15 void thread … printf(“%d”, total_threads); … pthread_exit(NULL); Loop (or multiple threads launching same function) Thread

Variable w has a Definite relationship with ptr1 Variables x and y have Possibly a relationship with ptr2 Stage 3 – Alias and Pointer Analysis Tushar Rawat / Arizona State University1011-Mar-15 ptr1 = &w ptr2 = &xptr2 = &y if-path else-path

Stage 4 – Data Partitioning Tushar Rawat / Arizona State University1111-Mar-15 Off-chip shared memory (DRAM) On-chip shared memory (SRAM) Processing Core(s) RCCE_get RCCE_put array = (double *)RCCE_shmalloc(size * sizeof(double)) array = (double *)RCCE_malloc(size)

Convert Pthread source to RCCE application code Add RCCE-specific instructions and libraries Stage 5 – Program Translation Tushar Rawat / Arizona State University1211-Mar-15 PthreadRCCE pthread_self()RCCE_ue() pthread_mutex_lock()RCCE_acquire_lock() pthread_mutex_unlock()RCCE_release_lock() pthread_create(&thread, NULL, funcName, (void *)arg) funcName((void *) arg) RCCE_init(argc, argv)RCCE_finalize() RCCE.hRCCE_lib.h SCC_API.h Any remaining pthread code is removed from the source

Translator Development Tushar Rawat / Arizona State University1311-Mar-15 Linux Mint 12 Java OpenJDK 1.6 CETUS 1.3 ANTLR 2.7.5

HSM Architecture – 48-core Intel SCC Tushar Rawat / Arizona State University1411-Mar-15 Tile R R R RR R R R RRR R R R RRR R R R RRR R MC Memory Controller RRouter

On-chip shared memory - MPB Tushar Rawat / Arizona State University1511-Mar-15 MPB CC MIU Message Passing Buffer Cache Controller Mesh Interface Unit P54C Pentium® processor core P54C Core L1 cache P54C Core L1 cache 256 KB L2 cache 256 KB L2 cache MIU [to router] 16 KB MPB CC Tile

Using 32 of 48 available cores 384 KB on-chip SRAM, up to 64 GB off-chip DRAM One Linux operating system per core 800 MHz core frequency 1600 MHz network mesh frequency 1066 MHz off-chip DDR3 frequency Intel SCC Experiment Configuration 11-Mar-15Tushar Rawat / Arizona State University16

Both Core and Memory intensive programs Originals developed for Pthread Multicore systems Compiled using Intel C++ Compiler 8.1 (gcc 3.4.5), RCCE API version 2.0 Originals run on SCC for baseline Translated into SCC RCCE applications Run with only off-chip shared memory Run with mix of on-chip and off-chip shared memory Benchmarks 11-Mar-15Tushar Rawat / Arizona State University17

RCCE vs Pthread Performance 11-Mar-15Tushar Rawat / Arizona State University18

Off-chip vs On-chip Mem. Performance 11-Mar-15Tushar Rawat / Arizona State University19

Enabling Manycores – Performance 11-Mar-15Tushar Rawat / Arizona State University20

Analyzer identifies all shared data within Pthread program Translator maps data to both on-chip and off-chip memory Enables execution of multi-threaded programs for HSM architecture after conversion to many-core applications Important to use fast on-chip memory when possible: 8x improvement on average for benchmarks when using on- chip SRAM (MPB) vs only off-chip DRAM Takeaways 11-Mar-15Tushar Rawat / Arizona State University21

Thank you Questions 11-Mar-15Tushar Rawat / Arizona State University22

Intel SCC and MCPC 11-Mar-15Tushar Rawat / Arizona State University23