Scalability-Based Manycore Partitioning Hiroshi Sasaki Kyushu University Koji Inoue Kyushu University Teruo Tanimoto The University of Tokyo Hiroshi Nakamura.

Presentation transcript:

Scalability-Based Manycore Partitioning. Hiroshi Sasaki (Kyushu University), Koji Inoue (Kyushu University), Teruo Tanimoto (The University of Tokyo), Hiroshi Nakamura (The University of Tokyo). PACT 2012. Presented by Kim, Jong-yul.

Contents: Motivation; SBMP Scheduler (Scalability Prediction, Core Partitioning, Core Donation, Phase Change Detection); Evaluation Results; Conclusions.

Prospects: Increasing frequency (F) is limited by ILP, the power wall, and transistor scaling, which leads to multi-core and many-core systems. One system runs several applications (APP1, APP2, APP3, …): multi-threaded multiprogramming.

Problem: A traditional OS assigns equal CPU to all running apps, but programs have different scalability. Metric: Normalized Turnaround Time (NTT) = (clock cycles when multiprogrammed with others) / (clock cycles when solo-run). [Figure: performance per workload, with average.] Average NTT: Linux: 2.04; Best Partitioning: 1.38.
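The Normalized Turnaround Time defined on this slide can be sketched in a few lines of Python. The function names and the cycle counts below are hypothetical, and the use of an arithmetic mean for the "Average" bar is an assumption, since the slide does not say which mean is used.

```python
def ntt(cycles_multiprogrammed, cycles_solo):
    """Normalized Turnaround Time: slowdown relative to running alone."""
    if cycles_solo <= 0:
        raise ValueError("solo-run cycle count must be positive")
    return cycles_multiprogrammed / cycles_solo


def average_ntt(pairs):
    """Arithmetic mean of NTT over (multiprogrammed, solo) cycle pairs.

    Assumption: the slide's "Average" is an arithmetic mean.
    """
    values = [ntt(multi, solo) for multi, solo in pairs]
    return sum(values) / len(values)


# Hypothetical cycle counts for a four-program workload (not measured data).
example = [(2.0e9, 1.0e9), (3.0e9, 2.0e9), (1.2e9, 1.0e9), (4.0e9, 2.5e9)]
print(average_ntt(example))
```

An NTT of 1.0 means the program ran as fast as it does alone, so lower values are better.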

Experimental System [Figure: system diagram with the allocation unit marked.]

SBMP Scheduler: Scalability Prediction, Core Partitioning, Core Donation, Phase Change Detection.

Overview: Assign cores considering the scalability of applications. SBMP: Scalability-Based Manycore Partitioning scheduler. [Diagram labels: Partitioning, Steady, Scalability Prediction, Core Partitioning, Core Donation, Detect.]

Scalability Prediction (1/2): Performance is measured as cumulative retired instructions per second (IPS). Total # of instructions: little effect from the # of cores (8%). [Figure: results per workload.]

Scalability Prediction (2/2): If obtained directly: need to warm up branch prediction and the cache system; need 8 allocations (6, 12, 18, …, 48). Simple model: 3 coefficients (α, β, γ); 3 samplings: 1 single core + 2 different configurations. Equation labels: Performance, Amdahl's law, Overhead caused by additional core. Over 3 seconds.
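The model equation itself did not survive in the transcript, so the following is only a sketch under an assumed form: time per instruction T(n) = α + β/n + γ·(n − 1), where α + β/n is the Amdahl's-law part and γ·(n − 1) stands in for the overhead of each additional core. With three coefficients, three samples (one single-core run plus two other configurations) give a 3×3 linear system. All function names here are hypothetical.

```python
def solve3(a, b):
    """Solve a 3x3 linear system a*x = b by Gauss-Jordan elimination."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("sample configurations must be distinct")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def fit_model(samples):
    """samples: three (cores, ips) pairs. Returns (alpha, beta, gamma).

    Assumed model (not from the slide): 1/ips = alpha + beta/n + gamma*(n-1).
    """
    rows = [[1.0, 1.0 / n, float(n - 1)] for n, _ in samples]
    times = [1.0 / ips for _, ips in samples]
    return tuple(solve3(rows, times))


def predict_ips(coeffs, n):
    """Predicted IPS on n cores under the assumed model."""
    alpha, beta, gamma = coeffs
    return 1.0 / (alpha + beta / n + gamma * (n - 1))


# Synthetic samples generated from known coefficients (not measured data).
truth = (0.1, 0.9, 0.001)
samples = [(n, predict_ips(truth, n)) for n in (1, 6, 24)]
coeffs = fit_model(samples)

# Fill in all 8 allocations (6, 12, ..., 48) from the 3 samples.
table = {n: predict_ips(coeffs, n) for n in range(6, 49, 6)}
```

The point of the sketch is the sampling cost: three measurements replace the eight that direct measurement of every allocation would need.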

Core Partitioning (1/2) [Figure: relative performance vs. # of cores for High, Medium, and Low scalability.]

Core Partitioning (2/2): A scalability table (key-value) for each program. Key: # of cores. Value: performance with [key] cores. Goal [formula labels: Single-run, Multiprogrammed]. Hill climbing algorithm: near-optimal assignment.
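A minimal sketch of hill climbing over per-program scalability tables follows. Several choices are assumptions rather than the authors' method: the objective is the mean estimated NTT, with solo performance taken as the table value at all 48 cores; cores move in units of 6 (the allocation sizes 6, 12, …, 48 from the prediction slide); and the arbitrary starting point is an even round-robin split. The names `estimated_avg_ntt` and `hill_climb` are hypothetical.

```python
def estimated_avg_ntt(tables, alloc, total):
    """Mean of (solo performance / performance with the allocated cores).

    Assumption: solo performance is the table value at `total` cores.
    """
    return sum(t[total] / t[alloc[p]] for p, t in tables.items()) / len(tables)


def hill_climb(tables, total=48, unit=6):
    """tables: program -> {cores: performance}. Returns (allocation, score)."""
    programs = list(tables)
    units = total // unit
    if len(programs) > units:
        raise ValueError("more programs than allocation units")
    # Arbitrary starting solution: deal units out round-robin.
    alloc = {p: 0 for p in programs}
    for i in range(units):
        alloc[programs[i % len(programs)]] += unit
    best = estimated_avg_ntt(tables, alloc, total)
    improved = True
    while improved:
        improved = False
        for src in programs:
            for dst in programs:
                # Change a single element: move one unit from src to dst.
                if src == dst or alloc[src] <= unit:
                    continue
                trial = dict(alloc)
                trial[src] -= unit
                trial[dst] += unit
                score = estimated_avg_ntt(tables, trial, total)
                if score < best - 1e-12:
                    alloc, best, improved = trial, score, True
    return alloc, best


# Synthetic tables (not measured data): one program scales linearly, one not at all.
sizes = range(6, 49, 6)
tables = {
    "high": {n: float(n) for n in sizes},
    "flat": {n: 1.0 for n in sizes},
}
alloc, score = hill_climb(tables)
print(alloc, score)
```

Hill climbing only guarantees a local optimum, which matches the slide's wording of a near-optimal assignment.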

Core Donation: 1 program for each processor die (6 cores). Donor: CPU utilization ratio < threshold (70%). Donee: the most beneficial one (utilization, scalability). Priority: Donee < Donor. Finer granularity. [Figure: CPU utilization over time for Core1 and Core2 on one die; Program1 is the donor, Program2 the donee.]
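The donor and donee selection on this slide could look roughly like the sketch below. Only the 70% threshold comes from the slide. The `gain_per_core` score, which stands in for "utilization, scalability", and the rule of choosing the least-utilized donor are hypothetical simplifications.

```python
THRESHOLD = 0.70  # CPU utilization threshold given on the slide


def pick_donation(utilization, gain_per_core):
    """Return a (donor, donee) pair, or None if no donation applies.

    utilization: program -> CPU utilization ratio (0..1) on its own die.
    gain_per_core: program -> estimated benefit of one extra core
                   (hypothetical score combining utilization and scalability).
    """
    donors = [p for p, u in utilization.items() if u < THRESHOLD]
    if not donors:
        return None
    # Assumption: the least-utilized program donates first.
    donor = min(donors, key=lambda p: utilization[p])
    donees = [p for p, u in utilization.items()
              if p != donor and u >= THRESHOLD]
    if not donees:
        return None
    # Donee: the program expected to benefit most from the extra core.
    donee = max(donees, key=lambda p: gain_per_core[p])
    # Per the slide, the donee would run at lower priority than the donor
    # on the donated core; priorities are not modeled in this sketch.
    return donor, donee


print(pick_donation({"A": 0.40, "B": 0.95, "C": 0.90},
                    {"A": 0.0, "B": 0.2, "C": 0.5}))
```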

Detection (1/2): 1. Creation or termination of a program. 2. Phase transition detected in any of the programs. [Figure: Performance.]
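The two triggers can be sketched as a single check. How a phase transition is actually detected is not in the transcript, so the rule below (latest IPS deviating from the recent mean by more than a tolerance) is an assumption, as are the 20% tolerance and all names.

```python
def phase_changed(ips_window, latest_ips, tolerance=0.2):
    """True if the latest IPS deviates from the recent mean by > tolerance.

    Assumption: a relative IPS change stands in for a phase transition.
    """
    if not ips_window:
        return False
    mean = sum(ips_window) / len(ips_window)
    return abs(latest_ips - mean) > tolerance * mean


def needs_repartition(prev_programs, cur_programs, ips_windows, latest):
    """Apply the slide's two triggers to decide whether to repartition."""
    # Trigger 1: a program was created or terminated.
    if set(prev_programs) != set(cur_programs):
        return True
    # Trigger 2: a phase transition in any of the running programs.
    return any(
        phase_changed(ips_windows.get(p, []), latest[p])
        for p in cur_programs
        if p in latest
    )


# Hypothetical IPS samples (not measured data).
windows = {"a": [100.0, 102.0, 98.0]}
print(needs_repartition(["a"], ["a"], windows, {"a": 150.0}))
```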

Detection (2/2) – Phase Prediction [Diagram: scheduler states.]

Evaluation: Core Partitioning, Phase Prediction, Core Donation, Overall Performance.

Experimental System (PARSEC benchmark suite 2.1)
Processor: 4 × AMD Opteron 6172
# of dies / processor: 2
# of cores / die: 6
Total # of cores: 48
L3 cache size: 12 MB / socket
Main memory: 96 GB DDR3 PC
Linux kernel

Core Partitioning: SBMP-base = Scalability Prediction + Core Partitioning. Single-phase applications (2 Medium + 2 Low). [Figure: performance per workload, with average.] Linux: 1.88; SBMP-base: [value not captured].

Phase Prediction: SBMP-PP (Phase Prediction) = SBMP-base + Phase Prediction. Multiple-phase applications. [Figure: results per workload.] Linux: 1.89; SBMP-base: 2.09; SBMP-PP: [value not captured].

Core Donation: SBMP-CD (Core Donation) = SBMP-PP + Core Donation. Workloads of 2 low CPU utilization + 2 normal programs. [Figure: results per workload.] Linux: 2.06; SBMP-PP: 1.68; SBMP-CD: [value not captured].

Overall Results: All programs, 72 workloads. Linux: 1.83; SBMP-base: 1.99; SBMP-PP: 1.70 (8%); SBMP-CD: 1.65 (11%).
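The slide does not state how the percentages are derived. They appear consistent with taking the ratio of the Linux NTT to the SBMP NTT and subtracting one, which is an inference rather than something the slide confirms. A short check under that assumption, using only the numbers on the slide:

```python
def improvement_percent(ntt_baseline, ntt_new):
    """Improvement as (baseline NTT / new NTT - 1), in percent.

    Assumption: this is the formula behind the slide's percentages.
    """
    return (ntt_baseline / ntt_new - 1.0) * 100.0


LINUX_NTT = 1.83
results = {"SBMP-base": 1.99, "SBMP-PP": 1.70, "SBMP-CD": 1.65}

for name, value in results.items():
    print(f"{name}: {improvement_percent(LINUX_NTT, value):+.1f}%")
```

Rounded to whole percentages, SBMP-PP and SBMP-CD come out at 8% and 11%, matching the slide. SBMP-base comes out negative because its NTT (1.99) is higher than Linux's (1.83).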

Conclusions: OS scheduling on a many-core system for multiple multi-threaded applications. SBMP scheduler: dynamic scalability prediction + core partitioning, phase recognition, core donation. 11% improvement over Linux.

Q&A

Hill Climbing Algorithm: Finds a near-optimal solution. Start with an arbitrary solution, then incrementally change a single element.

Core Donation: 1 program for each processor die (6 cores). Donor: CPU utilization ratio < threshold (70%). Donee: the most beneficial one (utilization, scalability). Priority: Donee < Donor. Finer granularity. [Figure: CPU utilization of P1 and P2 on one die; Program1 is the donor, Program2 the donee.]

Evaluation: PARSEC benchmark suite; benchmark programs for 1 workload. Gang-scheduling for Green, co-scheduling for others. Exception: freqmine, multiple phase changes. BLCR tool: evaluate only the parallel region.
