FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue
University of Colorado at Boulder, Core Research Lab

Presentation transcript:

FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue. John Giacomoni, Tipp Moseley, and Manish Vachharajani, University of Colorado at Boulder, Core Research Lab.

Why Pipelines? Multicore systems are the future, and many applications can be pipelined if the granularity is fine enough: per-stage times below 1 µs, roughly 3.5x the cost of an interrupt handler.

Fine-Grain Pipelining Examples
Network processing:
– Intrusion detection (NID)
– Traffic filtering (e.g., P2P filtering)
– Traffic shaping (e.g., packet prioritization)

Network Processing Scenarios (fps = frames per second; ns/frame = per-frame processing budget)

Link      Mbps       fps          ns/frame
T-1       1.5        2,941        340,000
T-3       44.7       90,909       11,000
OC-3      155.5      333,333      3,000
OC-12     622.1      1,219,512    820
GigE      1,000.0    1,488,095    672
OC-48     2,500.0    5,000,000    200
10 GigE   10,000.0   14,925,373   67
OC-192    9,953.3    19,697,843   51
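For intuition, each ns/frame entry is just the frame transmission time at line rate. A minimal sketch that reproduces the GigE row (the 672-bit minimum frame size, i.e., a 64-byte frame plus preamble and inter-frame gap, is an assumption used for illustration):

#include <stdio.h>

/* Per-frame time budget = frame_bits / line_rate.
 * The 672-bit frame (64-byte minimum frame + 8-byte preamble
 * + 12-byte inter-frame gap) is an assumed example value.     */
int main(void) {
    double rate_bps   = 1000.0e6;   /* GigE line rate                     */
    double frame_bits = 672.0;      /* assumed minimum frame on the wire  */

    double fps      = rate_bps / frame_bits;  /* ~1,488,095 frames/s */
    double ns_frame = 1e9 / fps;              /* ~672 ns per frame   */

    printf("fps = %.0f, ns/frame = %.0f\n", fps, ns_frame);
    return 0;
}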

Core Placements: 4x4 NUMA organization (e.g., AMD Opteron Barcelona). (Figure: pipeline stages such as input (IP), output (OP), application (App), decode (Dec), and encode (Enc) mapped onto cores.)

Example 3-Stage Pipeline (figure, shown as a two-step build).

Communication Overhead (chart, shown as a progressive build): per-operation cost of Locks ~320 ns, Lamport ~160 ns, FastForward ~28 ns, and hardware queues ~10 ns, measured against the GigE per-frame budget.

More Fine-Grain Pipelining Examples
Network processing:
– Intrusion detection (NID)
– Traffic filtering (e.g., P2P filtering)
– Traffic shaping (e.g., packet prioritization)
Signal processing:
– Media transcoding/encoding/decoding
– Software-defined radios
Encryption:
– Counter-mode AES
Other domains:
– Fine-grain kernels extracted from sequential applications

FastForward: a cache-optimized point-to-point CLF queue.
1. Fast
2. Robust against unbalanced stages
3. Hides die-to-die communication latency
4. Works with strong through weak memory consistency models

Lamport's CLF Queue (1)

lamp_enqueue(data) {
    NH = NEXT(head);           /* next head index (wraps around)  */
    while (NH == tail) {}      /* spin while the queue is full    */
    buf[head] = data;
    head = NH;
}

lamp_dequeue(*data) {
    while (head == tail) {}    /* spin while the queue is empty   */
    *data = buf[tail];
    tail = NEXT(tail);
}

Lamport's CLF Queue (2)

lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {}
    buf[head] = data;
    head = NH;
}

(Figure: ring buffer buf[0]..buf[n] with head and tail indices.)

AMD Opteron Cache Example (figure of the cache/memory hierarchy; the 'M' marks a modified cache line).

Lamport's CLF Queue (2), continued (same enqueue code and ring-buffer figure as above).
Observe the mandatory cache-line ping-ponging for each enqueue and dequeue operation.
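To make the sharing pattern concrete, here is a minimal layout sketch (the struct name, capacity, and 64-byte line size are assumptions for illustration) annotating which core touches each piece of Lamport's queue:

#include <stddef.h>

#define CACHE_LINE 64      /* assumed line size  */
#define QUEUE_SIZE 1024    /* assumed capacity   */

struct lamport_queue {
    /* Written by the producer on every enqueue and read by the consumer's
     * empty-check (head == tail), so this line migrates between cores.    */
    volatile size_t head __attribute__((aligned(CACHE_LINE)));

    /* Written by the consumer on every dequeue and read by the producer's
     * full-check (NH == tail), so this line ping-pongs as well.           */
    volatile size_t tail __attribute__((aligned(CACHE_LINE)));

    /* Data slots: each cache line holds several slots, written by the
     * producer and read by the consumer, so these lines bounce too.       */
    void *buf[QUEUE_SIZE] __attribute__((aligned(CACHE_LINE)));
};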

Lamport's CLF Queue (3) (same enqueue code and ring-buffer figure as above).
Observe how the cache lines will still ping-pong. What if the head/tail comparison were eliminated?

FastForward CLF Queue (1)

/* Lamport */
lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {}
    buf[head] = data;
    head = NH;
}

/* FastForward: the full-test uses the slot itself, not the tail index */
ff_enqueue(data) {
    while (0 != buf[head]) {}   /* spin until the slot is empty (0) */
    buf[head] = data;
    head = NEXT(head);
}
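The slides show only the enqueue side; the matching dequeue would look roughly like this (a hedged reconstruction, not taken from the slides, assuming an empty slot is marked with 0 as ff_enqueue's full-test implies):

ff_dequeue(*data) {
    while (0 == buf[tail]) {}   /* spin until the slot holds data            */
    *data = buf[tail];
    buf[tail] = 0;              /* hand the empty slot back to the producer  */
    tail = NEXT(tail);
}

Because each side tests only the slot it is about to use, the producer never reads tail and the consumer never reads head.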

FastForward CLF Queue (2)

ff_enqueue(data) {
    while (0 != buf[head]) {}
    buf[head] = data;
    head = NEXT(head);
}

(Figure: ring buffer buf[0]..buf[n] with head and tail indices.)
Observe how the head/tail cache lines will NOT ping-pong. BUT buf will still cause cache lines to ping-pong.

FastForward CLF Queue (3)

ff_enqueue(data) {
    while (0 != buf[head]) {}
    buf[head] = data;
    head = NEXT(head);
}

(Figure: ring buffer with head and tail separated by at least one cache line.)
Solution: temporally slip the stages by a cache line, giving an N:1 reduction in coherence misses per stage (N = slots per cache line).

Slip Timing (figure).

Slip Timing: slip lost (figure).

Maintaining Slip (Concepts)
Use distance as the quality metric:
– Explicitly compare head/tail
– This causes cache ping-ponging
– So perform the check rarely

Maintaining Slip (Method)

adjust_slip() {
    dist = distance(producer, consumer);
    if (dist < *Danger*) {
        dist_old = 0;
        do {
            dist_old = dist;
            spin_wait(avg_stage_time * (*OK* - dist));
            dist = distance(producer, consumer);
        } while (dist < *OK* && dist > dist_old);
    }
}
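The distance and spin_wait helpers are not shown in the slides; a minimal sketch of what they might look like (the capacity, the threshold constants, and the sleep-based wait are assumptions for illustration):

#include <stddef.h>
#include <time.h>

#define QUEUE_SIZE     1024                    /* assumed capacity (power of two) */
#define SLOTS_PER_LINE (64 / sizeof(void *))   /* slots per assumed 64-byte line  */
#define DANGER         (2 * SLOTS_PER_LINE)    /* below this, slip is in danger   */
#define OK_SLIP        (6 * SLOTS_PER_LINE)    /* target slip to restore          */

/* Slip distance: how many slots the producer's head is ahead of the
 * consumer's tail.  It reads both shared indices and so ping-pongs their
 * cache lines, which is exactly why adjust_slip() runs only rarely.      */
static size_t distance(size_t producer_head, size_t consumer_tail) {
    return (producer_head - consumer_tail) & (QUEUE_SIZE - 1);
}

/* Wait roughly `ns` nanoseconds without touching the queue; a real
 * implementation would use a calibrated spin loop instead of a sleep. */
static void spin_wait(long ns) {
    struct timespec ts = { 0, ns };
    nanosleep(&ts, NULL);
}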

Comparative Performance (figures: Lamport vs. FastForward).

Thrashing and Auto-Balancing (figures: FastForward thrashing vs. FastForward balanced).

Cache Verification (figures: FastForward thrashing vs. FastForward balanced).

On/Off-Die Communications (figure: on-die vs. off-die communication paths; 'M' marks a modified cache line).

On/Off-Die Performance (figures: FastForward on-die vs. off-die).

Proven Property: in the program order of the consumer, the consumer dequeues values in the same order that they were enqueued in the producer's program order.
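A self-contained sketch of this property in use (assumed names and sizes; it follows the ff_enqueue/ff_dequeue code above but omits the temporal-slip logic and the memory barriers a weakly ordered machine would need, so treat it as an illustration, not the authors' test harness):

#include <pthread.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Compact single-producer/single-consumer queue following the sketches above. */
#define QSIZE 1024                        /* assumed capacity (power of two)    */
#define NEXT(i) (((i) + 1) & (QSIZE - 1))

static void *volatile buf[QSIZE];         /* 0 marks an empty slot              */
static size_t head, tail;                 /* producer- and consumer-owned index */

static void ff_enqueue(void *data) {
    while (buf[head] != 0) {}             /* spin until the slot is empty       */
    buf[head] = data;
    head = NEXT(head);
}

static void ff_dequeue(void **data) {
    while (buf[tail] == 0) {}             /* spin until the slot holds data     */
    *data = buf[tail];
    buf[tail] = 0;                        /* return the slot to the producer    */
    tail = NEXT(tail);
}

#define N 1000000L

static void *producer(void *arg) {
    (void)arg;
    for (long i = 1; i <= N; i++)
        ff_enqueue((void *)i);            /* enqueue 1..N in program order      */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (long i = 1; i <= N; i++) {
        void *v;
        ff_dequeue(&v);
        if ((long)v != i) {               /* the proven property: FIFO order    */
            fprintf(stderr, "order violated at %ld\n", i);
            exit(1);
        }
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    puts("FIFO order verified");
    return 0;
}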

Work in Progress
Operating systems:
– 27.5 ns/op, about a 3.5% cost reduction vs. the reported 28.5 ns
– Reduced jitter
Applications:
– 128-bit AES encrypting filter
  – Ethernet-layer encryption at 1.45 Mfps
  – IP-layer encryption at 1.51 Mfps
  – ~10 lines of code for each

Gazing into the Crystal Ball (the overhead chart again: Locks ~320 ns, Lamport ~160 ns, FastForward ~28 ns, hardware queues ~10 ns, against the GigE per-frame budget).

Shared-memory accelerated queues: now available! Questions?