
(1) Scheduling for Multithreaded Chip Multiprocessors (Multithreaded CMPs)

(2) CMTs (Chip Multithreaded Processors)
CMP plus hardware multithreading:
–Supports a large number of thread contexts
–Can hide memory latency by running multiple threads
–High contention for shared resources
Commercial processors:
–Intel Core 2 Duo: 2 cores, … x2 L1 cache, 4MB L2 cache, 1.86–2.93 GHz
–AMD Athlon 64 X2: 2 cores, 128KB x2 L1 cache, 2MB L2 cache, 2.00–2.60 GHz
–Sun UltraSPARC T1 (Niagara): up to 8 cores, 24KB x8 L1 cache, 3MB L2 cache, 1.00–1.20 GHz

(3) CMT - Structure

(4) Multithreading Approaches
Coarse-grained:
–Context switch on memory access (switch on cache miss)
–High switch cost (the decision is made late in the pipeline)
Fine-grained:
–Switches threads every cycle
–Single-thread performance is very poor
–Preferred by CMT processors
A toy model contrasting the two is sketched below.
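
A minimal sketch of the two policies as a toy cycle-level model in Python. MEM_LAT, SWITCH_COST, and the 10% miss mix are illustrative assumptions, not parameters of any real CMT:

```python
MEM_LAT, SWITCH_COST = 20, 3   # assumed miss latency and switch cost (cycles)

def run(threads, cycles, fine_grained):
    """Toy model: each thread is a list of ops; 'M' misses in the cache.
    Coarse-grained stays on one thread until it misses (paying SWITCH_COST);
    fine-grained rotates to the next ready context every cycle."""
    streams = [list(t) for t in threads]
    ready = [0] * len(streams)          # cycle at which each context can issue
    cur = done = c = 0
    while c < cycles:
        for i in range(len(streams)):   # find the next issuable context
            t = (cur + i) % len(streams)
            if streams[t] and ready[t] <= c:
                if not fine_grained and t != cur:
                    c += SWITCH_COST    # late-pipeline switch is expensive
                done += 1
                if streams[t].pop(0) == 'M':
                    ready[t] = c + MEM_LAT   # stall; other contexts hide it
                cur = (t + 1) % len(streams) if fine_grained else t
                break
        c += 1                           # no ready thread -> bubble cycle
    return done

mix = ['C'] * 9 + ['M']                  # assume 10% of instructions miss
threads = [mix * 100 for _ in range(4)]  # 4 contexts on one core
print("coarse-grained instructions retired:", run(threads, 4000, False))
print("fine-grained   instructions retired:", run(threads, 4000, True))
```

With enough thread contexts, the fine-grained version keeps issuing while misses are outstanding, which is why CMT processors prefer it even though any single thread runs slowly.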

(5) Pipeline Usage and Scheduling
Thread classification (can be based on CPI):
–Compute-intensive: functional unit utilization is high
–Memory-intensive: threads frequently stall on memory accesses
The OS scheduler has to balance demand for pipeline resources across cores by co-scheduling memory-intensive and compute-intensive applications, as in the sketch below.
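
A sketch of that balancing idea, assuming per-thread CPI is sampled from hardware counters; the benchmark names, CPI values, and deal-from-both-ends heuristic are all illustrative:

```python
def co_schedule(threads, num_cores):
    """Pair memory-intensive (high-CPI) threads with compute-intensive
    (low-CPI) ones so that no core receives only pipeline-hungry threads.
    `threads` is a list of (name, cpi) pairs."""
    ranked = sorted(threads, key=lambda t: t[1])      # lowest CPI first
    cores = [[] for _ in range(num_cores)]
    lo, hi, i = 0, len(ranked) - 1, 0
    while lo <= hi:                                   # deal from both ends
        cores[i % num_cores].append(ranked[lo]); lo += 1
        if lo <= hi:
            cores[i % num_cores].append(ranked[hi]); hi -= 1
        i += 1
    return cores

# Hypothetical CPI samples: low CPI ~ compute-bound, high CPI ~ memory-bound.
sample = [("gzip", 0.8), ("mcf", 4.1), ("art", 3.5), ("crafty", 0.9),
          ("swim", 2.8), ("eon", 0.7), ("vpr", 1.9), ("gcc", 1.2)]
for core, group in enumerate(co_schedule(sample, 2)):
    print(f"core {core}: {group}")
```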

(6) Pipeline Contention Study (1)
Experiment performed using a modified Simics (the SAM simulator):
–4 cores, with 4 thread contexts per core
–Tried several ways to schedule 16 threads
–Schedules (a) and (b) pair compute-intensive threads with memory-intensive threads; (c) and (d) place compute-intensive threads on the same core

(7) Pipeline Contention Study (2)
Results as expected:
–Schedules (a) and (b) outperform (c) and (d)
However:
–The benefit requires large CPI variation among threads → not always available (applications are rarely purely compute- or memory-intensive; most are a mixture of both)
–For real benchmarks, performance gains are modest (e.g., 5% improvement for SPEC)

(8) L1 Data Cache Contention
Each core has four threads executing the same benchmark
–32KB L1 caches seem sufficient for the benchmarks studied
–IPC is not sensitive to the L1 miss ratio

(9) L2 Cache Contention
Setup: 2 cores with 4 thread contexts per core; 9 benchmarks, run as two copies (18 threads); 8KB L1
L2 is expected to have greater impact, since an L2 miss results in a high-latency memory access:
–Results corroborate the L2 impact
–IPC is very sensitive to the L2 miss ratio
Summary: equip the OS to handle L2 cache shortage

(10) Balance-Set Scheduling
Originally proposed by Denning as a virtual memory technique
Concept: the working set
–Each program has a footprint which, if cached, can decrease execution time
–Solution: schedule threads such that their combined working sets fit in the cache (see the sketch below)
Problem: working sets are not very good predictors of cache behavior
–Programs do not access their working sets uniformly
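
A minimal sketch of the balance-set idea, assuming per-thread working-set sizes are known; the first-fit-decreasing packing and the sizes are illustrative, not the policy from the original proposal:

```python
def balance_sets(threads, cache_size):
    """Greedily pack threads into groups whose combined working sets fit
    in the cache; each group would then share a time slice.
    `threads` is a list of (name, working_set_bytes) pairs."""
    groups = []
    for name, ws in sorted(threads, key=lambda t: -t[1]):   # biggest first
        for g in groups:                                    # first fit
            if sum(w for _, w in g) + ws <= cache_size:
                g.append((name, ws))
                break
        else:
            groups.append([(name, ws)])    # open a new time-slice group
    return groups

MB = 1 << 20
work = [("mcf", 3 * MB), ("art", 2 * MB), ("gzip", 1 * MB), ("eon", MB // 4)]
print(balance_sets(work, cache_size=4 * MB))
```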

(11) The Reuse-Distance Model
Proposed by Berg and Hagersten
Reuse distance: the time between successive references to the same memory location (measured in number of memory references)
–Captures temporal locality → the lower the reuse distance, the greater the chance of reuse
–A reuse-distance histogram can be built at runtime (sketched below)
Parallels the LRU stack used in LRU replacement
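
A sketch of how such a histogram can be built from a memory reference trace (in a real system the histogram would be sampled at runtime; the trace here is illustrative):

```python
from collections import defaultdict

def reuse_distance_histogram(trace):
    """Reuse distance of a reference = number of memory references since
    the previous reference to the same location."""
    last_seen = {}                  # location -> index of last reference
    hist = defaultdict(int)         # reuse distance -> reference count
    for i, addr in enumerate(trace):
        if addr in last_seen:
            hist[i - last_seen[addr]] += 1
        last_seen[addr] = i
    return dict(hist)

# Tiny illustrative trace: A is reused at distances 2 and 3, B at distance 3.
print(reuse_distance_histogram(["A", "B", "A", "C", "B", "A"]))   # {2: 1, 3: 2}
```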

(12) Two Methods
COMB: (i) sum the number of references for each reuse distance across the threads' histograms, (ii) multiply each reuse distance by the number of threads in the group, (iii) apply the reuse-distance estimation to the resulting histogram
AVG: (i) assume that each thread runs with its own dedicated partition of the cache, (ii) estimate the miss ratios of the individual threads, (iii) compute their average
Both are sketched below.
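
A sketch of both methods over per-thread histograms like the one above. The miss_ratio() helper is a deliberately crude stand-in for the Berg-Hagersten estimator (it simply counts references whose reuse distance exceeds the cache capacity as misses); the real model is probabilistic:

```python
from collections import defaultdict

def miss_ratio(hist, cache_lines):
    """Crude stand-in for the reuse-distance estimator: treat a reference
    as a miss if its reuse distance exceeds the cache capacity in lines."""
    total = sum(hist.values())
    misses = sum(n for d, n in hist.items() if d > cache_lines)
    return misses / total if total else 0.0

def comb_estimate(histograms, cache_lines):
    """COMB: merge the histograms, scaling each reuse distance by the
    group size (interleaved threads dilute each other's locality), then
    estimate once on the combined histogram."""
    k = len(histograms)
    merged = defaultdict(int)
    for h in histograms:
        for d, n in h.items():
            merged[d * k] += n
    return miss_ratio(merged, cache_lines)

def avg_estimate(histograms, cache_lines):
    """AVG: give each thread a dedicated 1/k partition of the cache,
    estimate each thread's miss ratio alone, and average the results."""
    k = len(histograms)
    return sum(miss_ratio(h, cache_lines // k) for h in histograms) / k
```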

(13) Comparison of COMB vs. AVG
Both COMB and AVG come within 17% of the actual miss ratios
COMB is computationally expensive:
–In a machine with 32 thread contexts and 100 runnable threads, the scheduler has to combine histograms for C(100, 32) candidate groups
AVG wins!
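
The blow-up is easy to quantify (a back-of-the-envelope check, assuming 100 runnable threads and groups of 32):

```python
import math

# Candidate groups COMB would have to build a combined histogram for:
print(math.comb(100, 32))   # about 1.4e26 -- infeasible per scheduling interval
```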

(14) The Scheduling Algorithm (1)
Step 1: compute miss ratio estimates (periodically)
–With N runnable threads and M hardware contexts, compute miss ratio estimates for the C(N, M) groups of M threads using the reuse-distance model and AVG
Step 2: choose the L2 miss ratio threshold (periodically)
–Pick the smallest miss ratio among the groups containing the greediest (most cache-intensive) thread
Step 3: identify the groups that will produce low cache miss ratios (periodically)
–Groups below the threshold are candidate groups (every runnable thread has to be in a candidate group)
Step 4: make the scheduling decision (every time a time slice expires)
–Choose a group from the set of candidate groups and schedule its threads to run during the current time slice
Steps 1-3 are sketched below.
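
A sketch of the periodic part (Steps 1-3), reusing the hypothetical avg_estimate() and miss_ratio() helpers from the sketch above; treating a thread's standalone miss ratio as its "greed" is an assumption:

```python
from itertools import combinations

def candidate_groups(threads, hists, M, cache_lines):
    """Steps 1-3: estimate a miss ratio for every M-thread group with AVG,
    derive a threshold from the greediest thread, keep groups below it.
    `hists` maps thread id -> reuse-distance histogram."""
    # Step 1: miss ratio estimate for each of the C(N, M) groups.
    est = {g: avg_estimate([hists[t] for t in g], cache_lines)
           for g in combinations(threads, M)}
    # Step 2: take the greediest (most cache-intensive) thread -- here
    # proxied by the worst standalone estimate -- and use the smallest
    # miss ratio among the groups containing it as the threshold.
    greediest = max(threads, key=lambda t: miss_ratio(hists[t], cache_lines))
    threshold = min(r for g, r in est.items() if greediest in g)
    # Step 3: groups at or below the threshold are the candidate groups;
    # by construction the greediest thread appears in at least one.
    return {g: r for g, r in est.items() if r <= threshold}
```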

(15) The Scheduling Algorithm (2)
To choose among candidate groups there are two policies: performance-oriented (PERF) and fairness-oriented (FAIR)
–PERF: select the group with the lowest miss ratio that contains threads not yet selected, until every thread is represented in the schedule
–FAIR: select the group with the greatest number of the least frequently selected threads
Both are sketched below.
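
A sketch of both policies over the candidate set produced above; the exact tie-breaking is an assumption:

```python
def perf_schedule(candidates):
    """PERF: repeatedly take the lowest-miss-ratio group that still adds
    a not-yet-selected thread, until every thread is represented."""
    schedule, covered = [], set()
    all_threads = {t for g in candidates for t in g}
    for g, _ in sorted(candidates.items(), key=lambda kv: kv[1]):
        if set(g) - covered:                 # adds an unrepresented thread
            schedule.append(g)
            covered |= set(g)
        if covered == all_threads:
            break
    return schedule

def fair_pick(candidates, times_selected):
    """FAIR: pick the group containing the most of the least frequently
    selected threads. `times_selected` maps thread -> times run so far."""
    floor = min(times_selected.values())
    return max(candidates,
               key=lambda g: sum(1 for t in g if times_selected[t] == floor))
```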

(16) The Scheduling Algorithm: An Example (3)

(17) Performance Evaluation
The 18-thread SPEC workload setup is reused:
–Reuse-distance histograms computed offline
–All combinations examined when computing the candidate set
–"Default" refers to the default Solaris scheduler
Results: 19-37% improvement using PERF (9-18% using FAIR)
–Doubling the L2 cache gives the same benefit as using PERF

(18) References
A. Fedorova, C. Small, D. Nussbaum, and M. Seltzer, "Chip Multithreading Systems Need a New Operating System Scheduler."
A. Fedorova, M. Seltzer, C. Small, and D. Nussbaum, "Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design."