
Multi-core scheduling optimizations for soft real-time applications: a cooperation-aware approach
Lucas De Marchi
Co-authors: Liria Matsumoto Sato, Patrick Bellasi, William Fornaciari, Wolfgang Betz

Agenda
- Introduction
- Motivation
- Objectives
- Analysis
- Optimization description
- Experimental results
- Conclusions & future work

Introduction – motivation

Introduction – motivation
SMP + RT:
- Multiple processing units inside a processor
- Determinism
Parallel programming paradigms:
- Data-Level Parallelism (DLP): competitive tasks
- Task-Level Parallelism (TLP): cooperative tasks

Introduction – motivation
DLP vs. TLP (task-graph diagram: independent tasks T0–T3 side by side for DLP; tasks T0–T2 chained for TLP). Characterization: synchronization and communication.
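To make the distinction concrete, here is a minimal pthread sketch (ours, not from the talk) of the DLP case: identical workers operating on disjoint data, with no synchronization or communication between them. The TLP counterpart (cooperating tasks coupled through a shared buffer) is sketched later in the benchmark section.

/* DLP in miniature (illustrative, not from the talk): four identical
 * workers each transform a disjoint slice of the data and never
 * interact, so the scheduler may freely spread them over all CPUs.
 * Build with: gcc -pthread dlp.c */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4

static int data[NWORKERS];

static void *dlp_worker(void *arg)
{
    int id = (int)(long)arg;
    data[id] *= 2;              /* independent work on its own slice */
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKERS];

    for (long i = 0; i < NWORKERS; i++) {
        data[i] = (int)i;
        pthread_create(&t[i], NULL, dlp_worker, (void *)i);
    }
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);

    for (int i = 0; i < NWORKERS; i++)
        printf("%d ", data[i]);  /* prints: 0 2 4 6 */
    printf("\n");
    return 0;
}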

Introduction – motivation
Linux RT scheduler (mainline): run each task as soon as possible (based on priority) ⇔ use as many CPUs as possible.
OK for DLP! But what about TLP? And why do we care about TLP at all?

Objectives
- Study the behavior of the Linux RT scheduler for cooperative tasks
- Optimize the RT scheduler
- Smooth integration into the mainline kernel: don't throw everything away

Analysis – benchmark
Simulation of a scenario where software replaces hardware: multimedia-like, mixed workload (DLP + TLP).
Challenge: map N tasks onto M cores optimally.

Analysis – metrics
- Throughput → sample mean of the production time
- Determinism → sample variance
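In standard notation (ours, not the slides'), for n measured sample-production times t_1, ..., t_n these two metrics are the sample mean and the unbiased sample variance:

\bar{t} = \frac{1}{n} \sum_{i=1}^{n} t_i,
\qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( t_i - \bar{t} \right)^2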

Analysis – metrics
Influencing factors:
- Preemption-disabled sections
- IRQ-disabled sections
- Locality of wake-up
Test machine: Intel Core i7 920 (4 cores, 2.8 GHz).

Analysis – locality of wake-up: migrations
a) Migration patterns (wave0, mixer1 and mixer2). Each digit is the CPU (0–3) on which the task ran at successive activations:
wave0:   112010101010101010101030111301010101010131010321010033
wave1:   033333333333333333333313202133333333333303333202332121
wave2:   022022222222222222222222233302222222222222222213322330
wave3:   111101010101010101010101011110101010101010101010101002
mixer0:  023033333333333333333333333202333333333333333333322233
mixer1:  103302201010101010101010101010330101010101010101010101
mixer2:  230201010101010101010101012020101010101010101010101021
monitor: 111122222222222222222222231112222222222222222233322303

Analysis – locality of wake-up: migrations
b) Occasional migrations:
wave0:   302121331102303323303311032320011021111111120101010101
wave1:   223303013333212010121032121012333210320230333333333333
wave2:   033332232302003111201211330021322233323330220222222222
wave3:   110211120211101030031121003030120101101111111010101010
mixer0:  322210301110320231310303021213120013220320230333333333
mixer1:  001003321202220131103203212130320331201031033022010101
mixer2:  230101020201202020121020021010130101201202302010101010
monitor: 113022133333131312032133303222223233123111111222222222

Analysis – locality of wake-up
Cache-miss rate measurements:

         1 CPU    2 CPUs   4 CPUs    Increase (1 → 4)
Load     7.58%    8.99%     9.44%    +24.5%
Store    9.29%    9.78%    11.62%    +25.1%

Analysis – conclusion
Why do we care about TLP? It is a common parallelization technique.
And how does the scheduler handle TLP? The current Linux scheduler is not as good at it as we would like.

Solution – benchmark
Abstraction: one application level sends data to the next. Wave tasks (w0–w3) run in parallel (task competition); their output is combined by mixer tasks (m0–m2) in series (task cooperation).

Solution – benchmark
Abstraction: one application level sends data to the next.
Reality: shared buffers + synchronization. The wave tasks write into shared buffers that the mixer tasks consume (e.g. w0 and w1 feed m0).

Solution – benchmark
Abstraction: one application level sends data to the next.
Reality: shared buffers + synchronization.
Dependencies: defined among tasks in the opposite direction of the data flow, so a consumer depends on its producers. A sketch of the buffer coupling follows.
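A minimal sketch of that coupling, assuming a classic bounded buffer with a mutex and condition variables (the benchmark's actual buffer layout is not shown in the slides):

/* Hedged sketch (assumptions, not the benchmark's real code): a wave
 * task produces samples into a bounded buffer that a mixer consumes. */
#include <pthread.h>

#define BUF_SLOTS 8

struct sample_buf {
    int data[BUF_SLOTS];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

static struct sample_buf buf = {
    .lock      = PTHREAD_MUTEX_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
    .not_full  = PTHREAD_COND_INITIALIZER,
};

/* Producer side (wave task): blocks while the buffer is full. */
void buf_put(struct sample_buf *b, int sample)
{
    pthread_mutex_lock(&b->lock);
    while (b->count == BUF_SLOTS)
        pthread_cond_wait(&b->not_full, &b->lock);
    b->data[b->head] = sample;
    b->head = (b->head + 1) % BUF_SLOTS;
    b->count++;
    pthread_cond_signal(&b->not_empty);  /* wakes a sleeping mixer */
    pthread_mutex_unlock(&b->lock);
}

/* Consumer side (mixer task): blocks while the buffer is empty. */
int buf_get(struct sample_buf *b)
{
    pthread_mutex_lock(&b->lock);
    while (b->count == 0)
        pthread_cond_wait(&b->not_empty, &b->lock);
    int sample = b->data[b->tail];
    b->tail = (b->tail + 1) % BUF_SLOTS;
    b->count--;
    pthread_cond_signal(&b->not_full);
    pthread_mutex_unlock(&b->lock);
    return sample;
}

The pthread_cond_signal() in buf_put() is exactly the wake-up whose locality the analysis measured: the CPU on which the sleeping mixer is woken determines whether the freshly produced data is still hot in that CPU's cache.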

Solution – idea
Dependency followers: if the data do not come to the tasks, the tasks go to where the data were produced. Make a task run on the same CPU as its dependencies.

Measurement tools
- ftrace
- sched_switch tool (Carsten Emde)*
- gtkwave
- perf
- ad hoc scripts
* http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html

Solution – task-affinity
Dependency followers: the selection of the CPU on which a task (e.g. m0) executes takes into consideration the CPUs on which its dependencies (e.g. w0 and w1) ran last time.

Solution – task-affinity implementation
Two lists inside each task_struct:
- taskaffinity_list
- followme_list
Two system calls to add/delete affinities:
- sched_add_taskaffinity
- sched_del_taskaffinity
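A hedged sketch of how a mixer task might declare its dependencies from userspace. Only the syscall names come from the slides; the syscall numbers, the single-PID argument, and the "caller follows pid" semantics are our assumptions about the patch:

/* Assumed userspace wrappers for the proposed syscalls. The numbers
 * below are placeholders: real values depend on the patched kernel. */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#define __NR_sched_add_taskaffinity 337   /* hypothetical */
#define __NR_sched_del_taskaffinity 338   /* hypothetical */

/* Assumed semantics: the calling task starts "following" task <pid>,
 * i.e. <pid> goes on our taskaffinity_list and we go on its
 * followme_list. */
static long sched_add_taskaffinity(pid_t pid)
{
    return syscall(__NR_sched_add_taskaffinity, pid);
}

int main(int argc, char *argv[])
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <pid-of-w0> <pid-of-w1>\n", argv[0]);
        return 1;
    }
    /* m0 declares that it depends on w0 and w1. */
    for (int i = 1; i <= 2; i++) {
        if (sched_add_taskaffinity((pid_t)atoi(argv[i])) == -1) {
            perror("sched_add_taskaffinity");
            return 1;
        }
    }
    /* ... mixer main loop would run here ... */
    return 0;
}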

Experimental results
Cache-miss rates: measurements without and with task-affinity. With task-affinity, more data and instructions hit the L1 cache and fewer cache lines are invalidated.

Experimental results
What exactly should be evaluated? The cache-miss rate is not what we ultimately want to optimize. Optimization objectives:
- Lower the time to produce a single sample
- Increase determinism across the production of many samples

Experimental results
Average execution time of each task (chart over w0–w3, m0–m2).

Experimental results
Variance of the execution time of each task (chart over w0–w3, m0–m2).

Experimental results
Production time of each single sample: results obtained for 150,000 samples, mainline vs. task-affinity.

Experimental results
Production time of each single sample: empirical distribution function.
Real-time metric (assuming a normal distribution): average + 2 × standard deviation.
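In symbols, with the sample mean and standard deviation defined in the metrics section, the reported figure is

M = \bar{t} + 2s

For a normal distribution, roughly 97.7% of the per-sample production times fall below this bound, which is what makes it a useful determinism-oriented metric.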

Production-time statistics, mainline: max 114, min 29, avg 37.8, variance 3.22 (the task-affinity distribution was shown in the corresponding chart).

Experimental results summary

Conclusion & future work
- Average execution time is almost the same
- Determinism for real-time applications is improved
Future work:
- Better focus on temporal locality
- Improve task-affinity configuration
- Test on other architectures
- Clean up the repository

Conclusion & future work
Still a work in progress.
Git repository: git://git.politreco.com/linux-lcs.git
Contact: lucas.demarchi@profusion.mobi, lucas.de.marchi@gmail.com

Q & A

Solution – Linux scheduler
Dependency followers in the Linux scheduler: change the decision process that picks the CPU on which a task executes when it is woken up.
Scheduler architecture (diagram): on a context switch, the core scheduler (main scheduler + periodic scheduler) selects the next task for CPUi through the scheduler classes (RT, Fair, Idle), which implement the scheduling policies: RR and FIFO in the RT class; Normal, Batch and Idle in the Fair class.
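For reference, this is how an application selects one of the listed RT policies through the standard POSIX interface (ordinary userspace API, not part of the proposed patch):

/* Put the calling task under the RT class with the FIFO policy at
 * priority 50. Requires root or CAP_SYS_NICE. */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 50 };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {  /* 0 = self */
        perror("sched_setscheduler");
        return 1;
    }
    printf("now scheduled by the RT class (SCHED_FIFO)\n");
    return 0;
}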