pTask: A Smart Prefetching Scheme for OS Intensive Applications


pTask: A Smart Prefetching Scheme for OS Intensive Applications
Prathmesh Kallurkar, Smruti R. Sarangi
Indian Institute of Technology Delhi

Outline of the talk:
1) Introduction
2) Characterization of OS intensive applications
3) Design of our prefetching technique 'pTask'
4) Results

Why instruction prefetching for OS intensive applications? Execution alternates over time between non-privileged tasks (the application, e.g., parse web request, create page) and privileged OS tasks (interrupts, read and write system calls). Each task has a different instruction footprint, and the combined size of the instruction footprints of all tasks is larger than the instruction cache.

Why another instruction prefetcher?

                          Proactive Instruction Fetch   Return Address Directed
                          (MICRO '11)                   Prefetching (MICRO '13)
Prefetch operation        All the time                  All the time
Hardware overhead         200 KB per core               64 KB per core
Software overhead         None                          None
Performance improvement   20.51%                        19.78%

The storage overhead of these additional structures makes it difficult to deploy such sophisticated prefetchers in real systems.

What are we proposing?

                          Proactive Instruction   Return Address Directed   pTask (MICRO '16)
                          Fetch (MICRO '11)       Prefetching (MICRO '13)
Prefetch operation        All the time            All the time              Certain points in the code
Hardware overhead         200 KB per core         64 KB per core            72 B per core
Software overhead         None                    None                      OS routines
Performance improvement   20.51%                  19.78%                    27.38%

pTask uses in-memory data structures to store the prefetch information.

Apache: variation in i-cache hit rate. Tasks: system call handlers, interrupt (top half and bottom half) handlers, the scheduler, and applications. The i-cache hit rate drops mostly when we switch from one task to another. Crux of our technique: do not run prefetching all the time; prefetch only when transitioning from one task to another.

Identify prefetchable execution blocks. Non-privileged: the application, plus the libC functions that perform system calls. Privileged: the core functionality of OS tasks (e.g., an execution block of 125 functions). The OS events that start a HyperTask: 1) system call, 2) interrupt (top half), 3) interrupt (bottom half), 4) schedule routine.

HyperTasks in system execution. Application HyperTasks (execution between two system calls): e.g., parse query, create reply. OS HyperTasks: system call (e.g., read file, page fault), interrupt (e.g., SCSI), bottom half, and scheduler (schedule()).

Overview of the scheme: leverage the repetitive execution pattern for prefetching.
Profile mode: record the execution pattern of the HyperTask and find the list of frequently accessed functions (slower than baseline).
Normal mode: prefetch the HyperTask's frequently accessed functions into the i-cache before executing it (faster than baseline).

Outline of the talk:
1) Introduction
2) Characterization of OS intensive applications
3) Design of our prefetching technique 'pTask'
4) Results

Benchmarks
Network:
1) Apache: web server (Apache)
2) ISCP: secure copy of a file from a remote machine to the local machine (SCP)
3) OSCP: secure copy of a file from the local machine to a remote machine (SCP)
Database:
4) DSS: Decision Support System (TPCH: Query 2)
5) OLTP: Online Transaction Processing (Sysbench benchmark suite)
File system:
6) Find: file system browsing (Linux utility Find)
7) FileSrv: random reads and writes to a file (Sysbench benchmark suite)
8) MailSrvIO: imitates the file operations of a mail server (Filebench benchmark suite)

Instruction breakup. The OS's share of executed instructions ranges from 20% for OSCP to 90% for FileSrv. Interrupt handlers (top half and bottom half) should also be considered for instruction prefetching.

Breakup of the i-cache's hit rate and evictions. i-cache hit rate: OS intensive applications 80-90%, SPEC/PARSEC >99%. Most i-cache evictions occur because of inter-HyperTask competition.

Challenge 1: Which functions should be prefetched? Across 10 profile runs, some functions appear in every run and some in only a few. Classes of functions based on their frequency in profile runs: 100% (10/10), 90% (9/10), 10% (1/10).

Challenge 1: Which functions should be prefetched? Two extreme policies:
Prefetch all functions (>0%). Advantage: able to prefetch most functions. Disadvantage: many prefetch operations may remain unused.
Prefetch only functions that were called in all profile runs (100%). Advantage: most of the prefetches will be used. Disadvantage: able to prefetch only a limited number of functions.

Challenge 1: Which functions should be prefetched? Px = prefetch the functions that were accessed in x% of the invocations.

Coverage = (#accessed lines found in the prefetch list) / (#lines accessed by the HyperTask)
Utility  = (#lines in the prefetch list that were accessed) / (#total lines in the prefetch list)

What we want: high coverage and high utility. We choose P50.
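As a rough illustration, coverage and utility over sets of cache-line addresses can be computed like this (a minimal Python sketch; the set-based model and all names are mine, not from the talk):

```python
def coverage_and_utility(prefetch_list, accessed):
    """Coverage: fraction of accessed lines that were in the prefetch list.
    Utility: fraction of prefetched lines that were actually accessed."""
    prefetched = set(prefetch_list)
    accessed = set(accessed)
    hit = prefetched & accessed
    return len(hit) / len(accessed), len(hit) / len(prefetched)

# Hypothetical example: 8-line prefetch list, 10 lines accessed, 6 in common.
cov, util = coverage_and_utility(range(8), range(2, 12))
# cov = 6/10 = 0.6, util = 6/8 = 0.75
```

A permissive policy like P>0 pushes coverage up at the cost of utility; a strict policy like P100 does the opposite, which is why a middle point such as P50 is attractive.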

Challenge 2: How many times should we profile? Create a prefetch list at the end of each profile run, and plot the Jaccard distance between the prefetch lists created at consecutive profile runs:

JaccardDistance(A, B) = (|A ∪ B| − |A ∩ B|) / |A ∪ B|

After 10 profile runs, JaccardDistance < 0.005.
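The convergence check can be sketched in a few lines of Python (the 0.005 threshold is the one quoted above; the function names are mine):

```python
def jaccard_distance(a, b):
    """JaccardDistance(A, B) = (|A u B| - |A n B|) / |A u B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty lists are identical
    return (len(union) - len(a & b)) / len(union)

def profiles_converged(prev_list, new_list, threshold=0.005):
    """Stop profiling once consecutive prefetch lists barely differ."""
    return jaccard_distance(prev_list, new_list) < threshold

# e.g. jaccard_distance([1, 2, 3], [2, 3, 4]) -> 2/4 = 0.5
```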

Outline of the talk:
1) Introduction
2) Characterization of OS intensive applications
3) Design of our prefetching technique 'pTask'
4) Results

Execution model with pTask routines: pTask routines are called before starting each HyperTask.

Outline of the design section:
1) Profiling
   - Function annotation in OS and application executables
   - Recording functions in a single execution
2) HyperTask annotation
3) Prefetch list
4) Normal mode

Function annotation in OS and application executables. Add a marker instruction (recordFunc) to each function, and add a header (the Function Map) to each executable.

Before:
  Function foo() { ... }
  Function bar() { ... }

After:
  Function foo() { recordFunc 0; ... }
  Function bar() { recordFunc 1; ... }

Function Map (per-executable header):
Function ID   Start address   #Lines
0             0x345           4
1             0x349           5

Profiling: recording function execution. A Function Vector (stored in memory) holds one bit per function in the executable. All bits are set to 0 at the start of a profile run. Each recordFunc marker sets the bit of its function: for the call sequence foo(): recordFunc 0 sets bit 0, and func(): recordFunc 5 sets bit 5. At the end of the profile run, the bit of every executed function is set to 1.
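A software model of the Function Vector might look like this (a Python sketch of the mechanism; in pTask the bits are set in hardware by the recordFunc marker, and the class and method names here are illustrative):

```python
class FunctionVector:
    """One bit per function in the executable; recordFunc n sets bit n."""

    def __init__(self, num_functions):
        self.num_functions = num_functions
        self.bits = 0  # all bits cleared at the start of a profile run

    def record_func(self, n):
        """Models the recordFunc marker instruction."""
        self.bits |= 1 << n

    def executed_functions(self):
        """Read out the set bits at the end of the profile run."""
        return [n for n in range(self.num_functions) if (self.bits >> n) & 1]

fv = FunctionVector(8)
fv.record_func(0)  # foo():  recordFunc 0
fv.record_func(5)  # func(): recordFunc 5
# fv.executed_functions() -> [0, 5]
```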

Outline of the design section:
1) Profiling
2) HyperTask annotation
   - Identifying application HyperTasks
   - Identifying OS HyperTasks
3) Prefetch list
4) Normal mode

Identifying application HyperTasks. An application HyperTask is the user process's execution between two system calls. We encode the sequence of functions called by the user process between two system calls in two 256-bit registers: pathSum and pathVec. On recordFunc n: compute i = n % 256; if pathVec[i] == 0, then add i to pathSum, left-shift pathSum, and set pathVec[i] = 1.
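The flowchart can be modeled in Python as below. Note that the slide does not spell out the shift amount, so the one-bit left shift per newly seen function is an assumption of this sketch, as is the example call sequence:

```python
# Model of the two 256-bit path registers; MASK keeps pathSum at 256 bits.
MASK = (1 << 256) - 1

def record_func(n, path_sum, path_vec):
    """Update (pathSum, pathVec) for a recordFunc n marker."""
    i = n % 256
    if not (path_vec >> i) & 1:          # first time this function is seen
        path_sum = ((path_sum + i) << 1) & MASK  # accumulate and shift
        path_vec |= 1 << i               # mark function as seen
    return path_sum, path_vec

path_sum = path_vec = 0
for func_id in (3, 7, 3, 12):            # hypothetical call sequence
    path_sum, path_vec = record_func(func_id, path_sum, path_vec)
# (path_sum, path_vec) now identifies this execution path
```

Two executions that call the same functions in the same order produce the same (pathSum, pathVec) pair, which is what lets the pair serve as the application HyperTask's ID.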

Identifying OS HyperTasks. System calls, naïve approach: use the system call ID (as defined by the OS developer). But the read (and write) system call on a network socket and on a disk file are two versions with different instruction sequences. How to differentiate at run time: use the ID of the application HyperTask that executed just before the system call.

Identifying OS HyperTasks:
System call handler:   ID of the previous application HyperTask
Interrupt handler:     Interrupt ID
Bottom half handler:   Program counter of the bottom half handler's routine
Scheduler:             Add a special instruction at the beginning of schedule()

Outline of the design section:
1) Profiling
2) HyperTask annotation
3) Prefetch list
   - Profile store
   - Prefetch store
4) Normal mode

Profile Store. At the end of a profile run, bits 3, 6, and 9 of the Function Vector are set, so the frequencies of functions 3, 6, and 9 are incremented in the in-memory Profile Store:

HyperTask ID   #Profile runs   Function frequency list
H1             1               (3→1) (6→1) (9→1)

Prefetch Store. Choose the functions that satisfy the P50 criterion. With the Profile Store entry

HyperTask ID   #Profile runs   Function frequency list
H1             10              (3→10) (6→2) (8→7) (9→9)

the functions deemed frequent for HyperTask H1 are 3, 8, and 9. Their cache lines are stored as a run-length-encoded prefetch list:

HyperTask ID   Prefetch list (run-length encoding)
H1             (0x695, 3) (0x734, 2) (0x789, 4)
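Building the run-length-encoded prefetch list from a Profile Store entry and the Function Map can be sketched as follows (a Python illustration; the merging of adjacent runs and the Function Map entry for function 6 are assumptions of this sketch, not values from the talk):

```python
def build_prefetch_list(freq, num_runs, function_map, threshold=0.5):
    """freq: {function_id: #profile runs it appeared in}.
    function_map: {function_id: (start_address, #lines)}, mirroring the
    per-executable Function Map header.
    Returns [(start_address, run_length), ...] for functions meeting P50."""
    frequent = sorted(f for f, c in freq.items() if c / num_runs >= threshold)
    runs = []
    for f in frequent:
        start, nlines = function_map[f]
        if runs and runs[-1][0] + runs[-1][1] == start:
            runs[-1] = (runs[-1][0], runs[-1][1] + nlines)  # merge adjacent runs
        else:
            runs.append((start, nlines))
    return runs

freq = {3: 10, 6: 2, 8: 7, 9: 9}   # function -> #runs it appeared in (of 10)
fmap = {3: (0x695, 3), 6: (0x700, 2), 8: (0x734, 2), 9: (0x789, 4)}
plist = build_prefetch_list(freq, 10, fmap)
# -> [(0x695, 3), (0x734, 2), (0x789, 4)]; function 6 (2/10 < 50%) is dropped
```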

Outline of the design section:
1) Profiling
2) HyperTask annotation
3) Prefetch list
4) Normal mode
   - Issue prefetch operations
   - Re-profiling

Normal mode. At each HyperTask boundary:
Step 1: Locate the prefetch list and issue prefetch operations.
Step 2: Execute the HyperTask.
The recordFunc instruction is treated as a NOP for OS HyperTasks; for application HyperTasks it updates the application taskID.

Re-profiling. Re-profile a HyperTask if: (Avg. miss rate_old − Avg. miss rate_new) > 8%.

Outline of the talk:
1) Introduction
2) Characterization of OS intensive applications
3) Design of our prefetching technique 'pTask'
4) Results

Evaluation setup. Benchmarks execute on a guest OS (Linux 2.6.32) running under QemuTrace on the native OS; the collected traces are fed to the Tejas simulator, which also runs on the native OS.

Hardware details:

Parameter       Value
Cores           16
Pipeline        Out-of-order pipeline
Private cache   i-cache (4-way, 32 KB) and d-cache (4-way, 32 KB)
Shared cache    L2 cache (8-way, 4 MB)
OS              Linux 2.6.32 (Debian 6.0)

Evaluated techniques:

Technique                                         Granularity              Hardware overhead
Markov [MICRO '05]                                i-cache line             4 KB per core
Call Graph Prefetching [TOCS '03]                 Functions                32 KB per core
Proactive Instruction Fetch [MICRO '11]           Stream of instructions   200 KB per core
Return Address Directed Prefetching [MICRO '13]   Stream of instructions   64 KB per core
pTask [proposed]                                  Next HyperTask           72 B per core

Performance comparison: pTask outperforms state-of-the-art prefetchers by 6-7%.

Breakup of i-cache accesses. Low iWaitPrefetch: prefetch operations complete before the line is accessed. High iHit: pTask is able to predict the execution patterns better.

Conclusion. We proposed pTask, an instruction prefetcher optimized for OS intensive workloads. Basic insights: prefetch only when transitioning from one task to another, and turn off prefetching during the steady phase of execution, when i-cache hit rates are high. We demonstrated an average increase in instruction throughput of 7% over highly optimized instruction prefetching proposals on a suite of 8 OS intensive applications.

Questions