A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block.

A Heterogeneous Lightweight Multithreaded Architecture
Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block
University of Notre Dame
MTAAP 2007, CA

MTAAP 2007, Long Beach, CA
Outline
Heterogeneous Lightweight Multithreaded Architecture
Simulation environments, benchmarks and results
Conclusions and future work

Architecture Highlights
Processing-in-memory (PIM) based
- Effectively attacks the memory wall problem
Highly multithreaded
- Hides large latencies and contention
Heterogeneous, with support for Extended Memory Semantics (EMS)
- Extremely low overhead for context switching and synchronization

Multithreaded Processors
Multithreading reduces processor idle time
Thread context is part of the processor
Multithreaded machines
- 1960s: CDC
- I/O processor for the Space Shuttle
- 1980s: Denelcor HEP
- 1990s: Cray/Tera MTA
- Cray Eldorado, Intel Xeon, Sun Niagara
(Figure: single-threaded versus multithreaded execution)

Lightweight Threads
Thread context (frame) is 32 double words (256 bytes)
- Two double words are reserved for the thread status; the remaining 30 are general-purpose registers.
- There is no other per-thread state, which makes multithreading easy.
Frames are stored in memory (no register file)
- Registers are aliases for memory locations
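The frame layout above can be sketched in Python with ctypes. This is only an illustration under stated assumptions: the slide gives the sizes (2 status double words plus 30 registers, 256 bytes in total), while the field names `status` and `regs`, the ordering of status before registers, and the helper `register_address` are hypothetical.

```python
import ctypes
import sys

DWORD_BYTES = 8        # one double word = 64 bits
STATUS_DWORDS = 2      # reserved for thread status (from the slide)
REGISTER_DWORDS = 30   # general-purpose registers (from the slide)


class Frame(ctypes.Structure):
    """One thread context. Putting status before the registers is an assumption."""
    _fields_ = [
        ("status", ctypes.c_uint64 * STATUS_DWORDS),
        ("regs", ctypes.c_uint64 * REGISTER_DWORDS),
    ]


def register_address(frame_base, reg_index):
    """Byte address that register reg_index aliases (hypothetical helper)."""
    if not 0 <= reg_index < REGISTER_DWORDS:
        raise IndexError("register index out of range")
    return frame_base + (STATUS_DWORDS + reg_index) * DWORD_BYTES


# Frames live in ordinary memory: map a frame onto a plain byte buffer.
memory = bytearray(4 * ctypes.sizeof(Frame))
frame_base = ctypes.sizeof(Frame)              # use the second frame slot
frame = Frame.from_buffer(memory, frame_base)

# Writing "register" 3 is really a write to a memory location.
frame.regs[3] = 0x1122334455667788
addr = register_address(frame_base, 3)
stored = int.from_bytes(memory[addr:addr + DWORD_BYTES], sys.byteorder)
```

Because the frame is just a view onto the buffer, `stored` reads back the value written through `frame.regs[3]`, which is the sense in which registers are aliases for memory locations.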

Lightweight Multithreading
Thread creation is fast and inexpensive: a single instruction
- Contrast with pthread creation, which requires kernel intervention and as many as tens of thousands of instructions
Unbounded multithreading
- Threads are part of the memory system rather than the processor state.
- "Unlimited number" of threads per processor.
- Many opportunities for issuing an instruction.
Ultra-lightweight processing
- Unbounded multithreading requires low-overhead thread management and synchronization
- Performed at the memory bank: greater data bandwidth, low overhead

Heterogeneous Architecture
Lightweight Processor Chip (LPC)
Issues an instruction from a ready thread on each clock cycle
Architectural support for low-overhead thread management
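To illustrate why issuing from any ready thread on each cycle hides latency, here is a toy cycle-level model in Python. It is not the SALT simulator and not the LPC's actual issue logic; the function `simulate`, its parameters, and the fixed memory latency are all assumptions made for the sketch.

```python
from collections import deque


def simulate(num_threads, instrs_per_thread, mem_every, mem_latency):
    """Toy model: issue one instruction per cycle from any ready thread.

    Every mem_every-th instruction of a thread is treated as a memory access
    that blocks that thread for mem_latency cycles. Returns (cycles, utilization).
    """
    if num_threads < 1 or instrs_per_thread < 1 or mem_every < 1:
        raise ValueError("counts must be positive")
    ready = deque((tid, instrs_per_thread) for tid in range(num_threads))
    waiting = []   # entries are (wake_cycle, tid, remaining_instructions)
    cycle = 0
    issued = 0
    while ready or waiting:
        # Wake threads whose memory access has completed.
        still_waiting = []
        for wake_cycle, tid, remaining in waiting:
            if wake_cycle <= cycle:
                ready.append((tid, remaining))
            else:
                still_waiting.append((wake_cycle, tid, remaining))
        waiting = still_waiting

        if ready:
            tid, remaining = ready.popleft()
            issued += 1
            remaining -= 1
            executed = instrs_per_thread - remaining
            if remaining > 0:
                if executed % mem_every == 0:
                    waiting.append((cycle + mem_latency, tid, remaining))
                else:
                    ready.append((tid, remaining))
        cycle += 1   # a cycle with no ready thread is an idle cycle
    return cycle, issued / cycle


one_cycles, one_util = simulate(1, 4, mem_every=2, mem_latency=10)
many_cycles, many_util = simulate(16, 4, mem_every=2, mem_latency=10)
```

With a single thread the processor idles while the thread waits on memory; with many threads there is usually another ready thread to issue from, so `many_util` is higher than `one_util`.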

Extended Memory Semantics (EMS)
The memory subsystem is constructed of 65-bit dwords
- 64 bits of data
- 1 extension bit; 1: dword is full, 0: dword is empty
Extends the Cray MTA E/F bits
- Full/Empty: contains data or not
- Extra states: metadata can contain a frame pointer
The same semantics apply to thread registers
(Figure: a dword as 64 bits of data/metadata plus the extension bit)
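A small Python sketch of the 65-bit dword described above. The slide states the widths and the meaning of the extension bit (1 is full, 0 is empty); placing the extension bit above the 64 data bits and the helper names `pack_dword` and `unpack_dword` are assumptions for illustration.

```python
DATA_BITS = 64
DATA_MASK = (1 << DATA_BITS) - 1


def pack_dword(data, full):
    """Combine 64 bits of data with the extension bit (assumed to be bit 64)."""
    if not 0 <= data <= DATA_MASK:
        raise ValueError("data must fit in 64 bits")
    return (int(bool(full)) << DATA_BITS) | data


def unpack_dword(dword):
    """Split a 65-bit dword into (data, full)."""
    if not 0 <= dword < (1 << (DATA_BITS + 1)):
        raise ValueError("dword must fit in 65 bits")
    return dword & DATA_MASK, bool(dword >> DATA_BITS)


full_word = pack_dword(0xDEADBEEF, full=True)
empty_word = pack_dword(0, full=False)
```

Here `unpack_dword(full_word)` gives back the data together with the full flag, and the same representation would apply to thread registers since they share the memory semantics.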

Single Producer/Consumer on EMS
LWP behavior for load_fe with A empty:
- Location A changes state to "FVE: forward value, leave empty".
- The content of A becomes the target address of the forward operation (all registers also have a memory address).

Completing the Load
How does the LWP complete the load_fe?
- A store_ef arrives at A.
- The data associated with the store is returned to T2:R2, which completes the load_fe.
- Location A changes to the empty state.
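The single producer/consumer exchange can be sketched as a small state machine in Python. The operation names load_fe and store_ef and the FVE state come from the slides; the classes `State`, `Cell`, and `Memory`, the use of strings as addresses, and the reading of load_fe as "load when full, leave empty" and store_ef as "store when empty, leave full" are assumptions. Cases beyond one producer and one consumer are deliberately left unimplemented, since the slides assign those to an EMS handler.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    EMPTY = "empty"
    FULL = "full"
    FVE = "forward value, leave empty"


@dataclass
class Cell:
    state: State = State.EMPTY
    content: object = None   # data when FULL, forward target address when FVE


class Memory:
    """Toy memory in which registers are ordinary addresses (e.g. 'T2:R2')."""

    def __init__(self):
        self.cells = {}

    def cell(self, addr):
        return self.cells.setdefault(addr, Cell())

    def _deliver(self, target, value):
        t = self.cell(target)
        t.state, t.content = State.FULL, value

    def load_fe(self, addr, target):
        c = self.cell(addr)
        if c.state is State.FULL:
            # Data is already there: return it and leave the location empty.
            value = c.content
            c.state, c.content = State.EMPTY, None
            self._deliver(target, value)
        elif c.state is State.EMPTY:
            # No data yet: remember where to forward the value.
            c.state, c.content = State.FVE, target
        else:
            raise NotImplementedError("second consumer: needs an EMS handler")

    def store_ef(self, addr, value):
        c = self.cell(addr)
        if c.state is State.FVE:
            # A load is pending: forward the value and leave the location empty.
            target = c.content
            c.state, c.content = State.EMPTY, None
            self._deliver(target, value)
        elif c.state is State.EMPTY:
            c.state, c.content = State.FULL, value
        else:
            raise NotImplementedError("second producer: needs an EMS handler")


mem = Memory()
mem.load_fe("A", target="T2:R2")      # A is empty, so it becomes FVE
state_after_load = mem.cell("A").state
mem.store_ef("A", 42)                 # the store completes the pending load
```

After the store, T2:R2 holds the value and A is empty again, matching the two slides above.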

A More Complex Situation
Consider a multiple producer/consumer problem such as locks.
- Multiple threads (more than 3) all attempt to acquire the lock.
- Memory requests will be queued up at the target location.
- An EMS handler thread is needed to handle the bookkeeping.
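The bookkeeping for a contended lock can be sketched as a FIFO queue of requests kept at the lock location. This Python model is an assumption for illustration only: the class `QueuedLock` and its methods are hypothetical, and FIFO ordering is a choice made here, not something the slides specify about the EMS handler.

```python
from collections import deque


class QueuedLock:
    """Toy model of requests queued at a lock location."""

    def __init__(self):
        self.owner = None
        self.waiters = deque()   # thread ids waiting for the lock, oldest first

    def acquire(self, tid):
        """Return True if tid gets the lock now, False if its request is queued."""
        if self.owner is None:
            self.owner = tid
            return True
        self.waiters.append(tid)
        return False

    def release(self, tid):
        """Release the lock and hand it to the oldest waiter, if any."""
        if self.owner != tid:
            raise RuntimeError("only the owner can release the lock")
        self.owner = self.waiters.popleft() if self.waiters else None
        return self.owner


lock = QueuedLock()
results = [lock.acquire(tid) for tid in ("T1", "T2", "T3", "T4")]
next_owner = lock.release("T1")
```

Only the first request succeeds immediately; the others wait in the queue and are granted the lock in arrival order as it is released.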

EMS Handler Overhead
Invoking an EMS handler
- Needed for synchronized memory operations beyond the hardware-supported single producer/consumer scenario
Overhead
- Creating the handler threads
- To queue up memory requests, handlers need to spin on the target memory address to get exclusive access
- Significant overhead in LWP CPU time, NoC traffic, and memory bandwidth
How can the overhead be alleviated?

Ultra-Lightweight Processor (ULWP)
Alleviates the burden on the LWP
- Handles thread synchronization and management, and complex atomic memory operations
Simple design, minimal circuitry
Located at the memory bank
- Greatest data bandwidth (wide-word); no NoC traffic when accessing memory
Multithreaded

Large-scale system

Outline
Heterogeneous Lightweight Multithreaded Architecture
Simulation environments, benchmarks and results
Conclusions and future work

Simulation Environment
DimC (Diminished C)
- An extension of ANSI C
- Exposes low-level architectural features
- Supports lightweight multithreading
SALT
- Simulator for the Analysis of LWP Timings
- Contains LWPs, ULWPs, NoC, and memory subsystems

Benchmark Suite
Two categories of irregular problems.
Complicated control structures such as recursion
- Such programs can achieve decent performance on conventional architectures, but only with great effort.
- They do not necessarily invoke an EMS handler or the ULWP.
- Benchmarks: N-Queens, Fibonacci
Complicated control structures and dynamic data structures
- Very hard to parallelize effectively on conventional SMPs.
- EMS handler or ULWP support is necessary.
- Benchmarks: competing agents, SAT solver kernel

N-Queens
Find all solutions to the problem of placing N queens on an N×N chessboard such that no queen can attack another.
An irregular problem with dynamic parallel recursion; thread behavior is hard to predict.
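For reference, a sequential recursive N-Queens counter in Python. This is a generic textbook formulation, not the DimC benchmark from the talk; the function name `count_queens` and the bitmask encoding are choices made for this sketch. Each recursive call explores an independent subtree, which is the kind of dynamic recursion a lightweight-threaded version would spawn as separate threads.

```python
def count_queens(n, row=0, cols=0, diag1=0, diag2=0):
    """Count placements of n non-attacking queens, one row at a time.

    cols, diag1, diag2 are bitmasks of the columns and the two diagonal
    directions that are already attacked by queens in earlier rows.
    """
    if row == n:
        return 1   # every row has a queen: one complete solution
    total = 0
    for col in range(n):
        c = 1 << col
        d1 = 1 << (row + col)           # anti-diagonal index
        d2 = 1 << (row - col + n - 1)   # main-diagonal index, shifted to be >= 0
        if cols & c or diag1 & d1 or diag2 & d2:
            continue                    # square is attacked, skip it
        # Independent subproblem: a candidate for a new lightweight thread.
        total += count_queens(n, row + 1, cols | c, diag1 | d1, diag2 | d2)
    return total


solutions_8 = count_queens(8)   # 92 solutions on the standard chessboard
```

How many subtrees survive at each level depends on the earlier placements, which is why the amount of work per thread is hard to predict in advance.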

Competing Agents
Multiple agents attempt to update a shared memory location simultaneously.
Each agent is implemented by a single thread.
All threads are evenly distributed over four LWPs inside a single LPC.
Complicated control structures and dynamic data structures.
Uses separate synchronized loads and stores.
Goal: characterize the effectiveness of the ULWP in reducing the cost of synchronization.
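A conventional-threads analogue of the competing-agents pattern, sketched in Python. It uses `threading.Lock` in place of the synchronized loads and stores of the talk, so it only illustrates the access pattern, not the EMS or ULWP mechanism; `run_agents` and its parameters are hypothetical names.

```python
import threading


def run_agents(num_agents, updates_per_agent):
    """Each agent thread repeatedly updates one shared location under a lock."""
    shared = {"value": 0}
    lock = threading.Lock()

    def agent():
        for _ in range(updates_per_agent):
            with lock:                     # acquire exclusive access
                current = shared["value"]  # synchronized load
                shared["value"] = current + 1  # synchronized store

    threads = [threading.Thread(target=agent) for _ in range(num_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["value"]


final_value = run_agents(num_agents=8, updates_per_agent=1000)
```

Every update is serialized through the single lock, so contention grows with the number of agents; that serialization is the cost the benchmark is designed to expose.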

SAT Solver/zChaff
SAT: the Boolean satisfiability problem (from propositional logic)
- Fundamental to many problems in automated reasoning, CAD, CAM, machine vision, databases, robotics, IC design, computer architecture, and network design.
- Given a Boolean formula (usually in CNF), check whether an assignment of truth values to the variables exists such that the formula evaluates to true.
- For example, for the CNF formula shown on the slide, if x1 is true and x3 is false, then all three clauses are satisfied regardless of the value of x2.
zChaff, a modern variant of the DPLL algorithm, is used to implement the SAT solver.
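A minimal Python sketch of checking a CNF formula against an assignment. Two things here are assumptions and should be read as such: `EXAMPLE_CNF` is a hypothetical three-clause formula chosen only because it has the property described above (it is not the formula from the slide), and the solver is plain brute-force enumeration, not DPLL or zChaff. Literals use the common signed-integer convention, where k means x_k and -k means NOT x_k.

```python
from itertools import product


def clause_satisfied(clause, assignment):
    """True if some assigned literal in the clause evaluates to true."""
    return any(
        assignment[abs(lit)] == (lit > 0)
        for lit in clause
        if abs(lit) in assignment
    )


def formula_satisfied(cnf, assignment):
    """True if every clause is satisfied by the (possibly partial) assignment."""
    return all(clause_satisfied(clause, assignment) for clause in cnf)


def brute_force_sat(cnf):
    """Try every assignment; return a satisfying one or None."""
    variables = sorted({abs(lit) for clause in cnf for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula_satisfied(cnf, assignment):
            return assignment
    return None


# Hypothetical formula: (x1 OR x2) AND (NOT x2 OR NOT x3) AND (x1 OR NOT x3)
EXAMPLE_CNF = [[1, 2], [-2, -3], [1, -3]]

# x2 is left unassigned on purpose: the clauses are satisfied without it.
partial = {1: True, 3: False}
satisfied_without_x2 = formula_satisfied(EXAMPLE_CNF, partial)
```

Brute force takes time exponential in the number of variables, which is why practical solvers use DPLL-style search with pruning instead.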

N-Queens
Successfully exploits all the parallelism
- Completely dynamic, ideal speedup
- Saturation is due only to the small data set
Good performance can be achieved on conventional SMPs, but only with great extra effort

Competing Agents
The EMS handler is the bottleneck in high-contention situations
The heterogeneous architecture can achieve unbounded scalability
High contention is no longer a problem in the heterogeneous architecture

SAT Solver/zChaff on Conventional SMPs
Parallel implementation leads to performance degradation
The more processors, the worse the performance
Very hard to achieve good performance on conventional SMPs
Data from "Parallel Multithreaded Satisfiability Solver: Design and Implementation" by Yulik Feldman, Intel

SAT Solver/zChaff on Heterogeneous Architecture
Ideal speedup; saturation is due only to the small data set
Successfully exploits all the parallelism
(Figure: speedup over the serial version)

Outline
Heterogeneous Lightweight Multithreaded Architecture
Simulation environments, benchmarks and results
Conclusions and future work

Conclusions
The Heterogeneous Lightweight Multithreaded Architecture
- Is a good solution for irregular problems that are hard or impossible to parallelize on conventional SMPs
- Has very low overhead for context switching and synchronization
- Can hide latencies and contention
- Can provide unbounded multithreading and scalability
- Can exploit all possible parallelism inside an irregular problem

Future Work
Provide standard language support
Benchmark suites
Large-scale system performance
Comparison with conventional large-scale systems

Acknowledgments
DARPA
- This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under its Contract No. NBCH
University of Notre Dame
Caltech/JPL
Cray

Thank you!