Programming Multi-Core Processor-Based Embedded Systems: A Hands-On Experience on Cavium Octeon-Based Platforms
Dr. Abdul Waheed

Course Outline
- Introduction
- Multi-threading on multi-core processors
- Applications for multi-core processors
- Application layer computing on multi-core
- Performance measurement and tuning
  - Architecture and application level performance evaluation methodology
  - Case study: sniffer performance tuning

Agenda for Today
- Multi-core system performance and scalability
  - Methodology
  - Profiling tools
  - Benchmarking
- Benchmarking and profiling: a case study
- Multi-threaded sniffer performance tuning
  - Single queue based design
  - Lock-free design

Multi-Core Performance Measurement Methodology
(Slide credits: Dr. Rodric Rabbah, IBM)

Keys to Parallel Performance
- Coverage: the extent of parallelism in the algorithm; bounded by Amdahl's law
- Granularity: the size of the workload partition for each core; affects communication cost and load balance
- Locality: where computation happens vs. where communication happens; inter-processor communication is non-local, processor-memory communication is local
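For reference, Amdahl's law bounds the speedup obtainable from the covered (parallelizable) fraction P of the work on N cores: Speedup(N) = 1 / ((1 - P) + P/N). For example, with P = 0.9 on 8 cores the speedup is at most 1 / (0.1 + 0.9/8) ≈ 4.7x, no matter how well the parallel part scales.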

Communication Cost Model

Overlapping Communication with Computation

Limits in Pipelining Communication
- The computation-to-communication ratio limits the performance gains from pipelining
- Where else can we look for performance?

Lessons from Uniprocessors
- In uniprocessors, the CPU communicates with memory
- Loads and stores are to uniprocessors what GET and PUT are to distributed-memory multiprocessors
- How is communication overlap enhanced in uniprocessors? Spatial locality and temporal locality

Spatial Locality
- The CPU asks for data at address 1000; memory sends data at addresses 1000 ... 1064
- The amount of data sent depends on architecture parameters such as the cache block size
- Works well if the CPU actually ends up using data from 1001, 1002, ..., 1064
- Otherwise bandwidth and cache capacity are wasted

Temporal Locality
- Main memory access is expensive, so the memory hierarchy adds small but fast memories (caches) near the CPU; memories get bigger as distance from the CPU increases
- The CPU asks for data at address 1000; the hierarchy anticipates more accesses to the same address and stores a local copy
- Works well if the CPU actually uses the data at 1000 over and over and over ...
- Otherwise cache capacity is wasted (a short C illustration of both kinds of locality follows)
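A minimal C sketch (not from the slides) contrasting the two: the row-major traversal walks consecutive addresses and reuses the accumulator, while the column-major version pulls in a whole cache line for every element it touches.

    #include <stddef.h>

    #define N 1024
    static double a[N][N];

    /* Row-major traversal: consecutive addresses, so every fetched cache
       line is fully used (spatial locality); `sum` is reused on every
       iteration (temporal locality). */
    double sum_row_major(void)
    {
        double sum = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Column-major traversal: stride of N doubles, so each access fetches
       a cache line of which only one element is used -- wasted bandwidth
       and cache capacity. */
    double sum_col_major(void)
    {
        double sum = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }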

Single Thread Performance
- Tasks are mapped to execution units (threads); threads run on individual processors (cores)
- Two keys to faster execution: load balance the work among the processors, and make execution on each processor faster

Understanding Performance
- We need some way of measuring performance
- Coarse-grained measurements with the shell's time command:
    % gcc sample.c
    % time a.out
    2.312u 0.062s ...
    % gcc sample.c -O3
    % time a.out
    1.921u 0.093s ...
- ... but did we learn much about what's going on? (an in-program alternative is sketched below)
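For slightly finer-grained timing from inside the program, a minimal sketch (illustrative, not from the slides) using the POSIX clock_gettime interface:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);   /* monotonic: unaffected by clock resets */
        /* ... region under measurement ... */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (double)(t1.tv_sec - t0.tv_sec)
                    + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("elapsed: %.6f s\n", secs);
        return 0;
    }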

Measurements Using Counters
- It is increasingly possible to get accurate measurements using performance counters: special registers in the hardware that count events
- Insert code to start, read, and stop a counter: measure exactly what you want, anywhere you want; communication and computation durations can both be measured
- But this requires manual changes, and monitoring nested scopes is an issue
- Heisenberg effect: the counters themselves can perturb execution time (see the PAPI sketch below)
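A minimal sketch of counter-based measurement using PAPI (one of the tools listed later in this session); the preset events and the region of interest are illustrative assumptions, not lab code. Link with -lpapi.

    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int es = PAPI_NULL;
        long long v[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_create_eventset(&es);
        PAPI_add_event(es, PAPI_TOT_INS);   /* instructions retired */
        PAPI_add_event(es, PAPI_TOT_CYC);   /* total cycles */

        PAPI_start(es);                     /* start counting here ...   */
        volatile double x = 0.0;
        for (int i = 0; i < 1000000; i++)   /* placeholder region of interest */
            x += i * 0.5;
        PAPI_stop(es, v);                   /* ... stop and read exactly here */

        /* A derived metric (see the slide below): IPC = instructions / cycles */
        printf("instructions=%lld cycles=%lld IPC=%.2f\n",
               v[0], v[1], (double)v[0] / (double)v[1]);
        return 0;
    }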

Dynamic Profiling
- Event-based profiling: interrupt execution when an event counter reaches a threshold
- Time-based profiling: interrupt execution every t seconds
- Works without modifying your code and does not require knowing where the problem might be
- Supports multiple languages and programming models
- Quite efficient at appropriate sampling frequencies

Performance Counter Examples
- Cycles (clock ticks)
- Instructions retired
- Pipeline stalls
- Cache hits
- Cache misses
- Number of instructions
- Number of loads
- Number of stores
- Number of floating point operations
- ...

Useful Derived Metrics
- Processor utilization = cycles / wall-clock time
- Instructions per cycle = instructions / cycles
- Instructions per memory operation = instructions / (loads + stores)
- Average number of instructions per load miss = instructions / L1 load misses
- Memory traffic = (loads + stores) × Lk cache line size
- Bandwidth consumed = (loads + stores) × Lk cache line size / wall-clock time
- Many others: cache miss rate, branch misprediction rate, ...

Common Profiling Workflow

Popular Runtime Profiling Tools
- GNU gprof: widely available with UNIX/Linux distributions
    gcc -O2 -pg foo.c -o foo
    ./foo
    gprof foo
- OProfile
- PAPI
- Intel VTune
- Many others

GNU gprof

Uniprocessor Performance
time = compute + wait
- Instruction-level parallelism: multiple functional units, deep pipelines, speculation, ...
- Data-level parallelism: SIMD short vector instructions (multimedia extensions); the hardware is simpler, with no heavily ported register files, and the instructions are more compact, which reduces instruction fetch bandwidth
- Complex memory hierarchies: multi-level caches, many outstanding misses, prefetching, ... (a vectorizable loop is sketched below)
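As an illustration of the data-level parallelism point, a generic sketch (not lab code) of a loop written so that a compiler can map it to short vector instructions, e.g. with gcc -O3:

    /* Independent iterations with restrict-qualified pointers let the
       compiler's auto-vectorizer emit SIMD (multimedia-extension) code. */
    void saxpy(int n, float a, const float * restrict x, float * restrict y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }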

Instruction Locality

Instruction Locality (2)

Example: Cache/Memory Optimization

Example: Cache/Memory Optimization (2)

Example: Cache/Memory Optimization (3)

Results from Cache Optimizations

Summary: Programming for Performance
- Tune the parallelism first, then tune performance on the individual processors
- Modern processors are complex: you need instruction-level parallelism for performance, and understanding performance requires a lot of probing
- Optimize for the memory hierarchy: memory is much slower than processors, and multi-level memory hierarchies try to hide the speed gap
- Data locality is essential for performance

Architecture and Application Level Performance Evaluation using Benchmarking and Profiling
"Dual Processor Performance Characterization for XML Application Oriented Networking," Jason Ding and Abdul Waheed, in Proceedings of ICPP 2007

Problem Statement
- Application under consideration: XML message processing; functions include XPath pattern matching and schema validation
- Two processor types in three configurations:
  - Single processor with two cores: Intel Pentium-M
  - Single processor with hyper-threading enabled: Intel Xeon
  - Two processors without hyper-threading: the same Intel Xeon
- Need to understand the performance characteristics

Processors and System

Notations

Workload Characterization

Measurement Setup

Performance Counter Events

Performance Metrics
- High level: end-to-end throughput
- Architecture level: CPI, L2 cache misses, bus transactions per instruction retired, branch misprediction ratio

Baseline Measurements

Baseline Analysis

End-to-End Throughput

Comparison of CPIs

L2 Cache Misses

Bus Transactions Per Instruction

Branch Misprediction Ratios

Conclusions
- Hyper-threading in the Xeon processor scales poorly compared to dual Xeon or dual-core Pentium-M processors: higher CPI, lower bus traffic related to memory accesses, and a higher branch misprediction ratio
- The dual-core Pentium-M exhibits better scalability for XML workloads: wide dynamic execution, efficient memory access, and superior branch prediction
- The shared L2 cache in dual-core processors may become a bottleneck for I/O-intensive workloads, but yields a simpler micro-architecture with respect to cache coherence

Case Study: Multi-Threaded Sniffer Performance Tuning
Single- and Multiple-Queue Based Designs, with and without Thread Locking (Labs 3-5)

Packet Sniffer
- Passive-mode packet monitoring on a link: each packet is captured from the link without impacting the actual packet flow
- Multiple layers of sniffer functionality:
  - Layer 2: Ethernet frame capture
  - Layer 3: matching of the 5-tuple
  - Layer 7: payload string search → deep packet inspection (DPI)
- Compute intensity increases with the layer: layer 2 is the least and layer 7 the most compute-intensive
- Throughput is limited by compute intensity; multiple cores are expected to help scale throughput (a capture sketch follows)
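The slides do not show the capture code itself; a minimal libpcap sketch of the three layers, where "eth0" and the search string "ATTACK" are placeholder assumptions. Compile with -lpcap.

    #define _GNU_SOURCE              /* for memmem() */
    #include <pcap/pcap.h>
    #include <stdio.h>
    #include <string.h>

    /* Called once per captured frame (layer 2). A real filter would parse
       the IP/TCP headers here to match the 5-tuple (layer 3); the memmem()
       call stands in for the layer-7 payload string search (DPI). */
    static void on_packet(u_char *user, const struct pcap_pkthdr *hdr,
                          const u_char *bytes)
    {
        (void)user;
        if (memmem(bytes, hdr->caplen, "ATTACK", 6) != NULL)
            printf("match in a %u-byte packet\n", hdr->caplen);
    }

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        /* Promiscuous capture: observes packets without altering the flow. */
        pcap_t *h = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
        if (h == NULL) {
            fprintf(stderr, "pcap_open_live: %s\n", errbuf);
            return 1;
        }
        pcap_loop(h, -1, on_packet, NULL);
        pcap_close(h);
        return 0;
    }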

Sniffer Setup
[Figure: sender (System 1) and receiver (System 2) connected by a 1 GigE link; the packet sniffer captures the data packets and maps packets to cores]

Single Queue Architecture
[Figure: a single shared queue; the dispatcher puts packets into the queue in one direction while worker threads T0 ... TN get packets from their designated locations]

Single Queue Architecture (2)
- A single dispatcher thread captures each packet from the network and copies it to the tail of the dispatcher queue
- Multiple worker threads; each worker can access only specified locations
- Each location access is mutually exclusive, controlled by a per-location flag: no locking overhead, and no copying from dispatcher space to worker space
- Location access function: get packet if flag = 1 (workers); put packet if flag = 0 (dispatcher)
- The worker processes the packet at the three layer levels (a sketch of this scheme follows)
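A minimal sketch of the flag-per-location scheme described above; the names, sizes, and plain volatile flags are illustrative, not the lab implementation (production code would need memory barriers or C11 atomics).

    #include <stdint.h>
    #include <string.h>

    #define QSIZE   1024
    #define PKT_MAX 1600

    /* One slot of the shared queue. flag == 0: slot owned by the dispatcher
       (empty); flag == 1: slot owned by a worker (holds a packet). */
    struct slot {
        volatile int flag;
        uint16_t     len;
        uint8_t      pkt[PKT_MAX];
    };
    static struct slot queue[QSIZE];

    /* Dispatcher side: put packet, if flag == 0 */
    int put_packet(int i, const uint8_t *data, uint16_t len)
    {
        if (queue[i].flag != 0)
            return -1;                 /* slot still being processed */
        memcpy(queue[i].pkt, data, len);
        queue[i].len  = len;
        queue[i].flag = 1;             /* hand the slot to a worker */
        return 0;
    }

    /* Worker side: get packet, if flag == 1; the packet is processed in
       place, so nothing is copied from dispatcher to worker space. */
    int get_packet(int i, const uint8_t **data, uint16_t *len)
    {
        if (queue[i].flag != 1)
            return -1;
        *data = queue[i].pkt;
        *len  = queue[i].len;
        return 0;
    }

    /* Worker side: release the slot back to the dispatcher when done. */
    void release_slot(int i) { queue[i].flag = 0; }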

Performance Results
- Throughput does not scale with the number of threads: layer 2 is limited by the 1 Gbps link speed, while layers 3 and 7 are CPU bound
- Bottlenecks: scanning the whole queue according to the status of the flags, a large stride size, and increased cache-coherence traffic

Multiple Queues Architecture
[Figure: the dispatcher puts packets into per-worker sub-queues; each worker thread T0 ... TN gets packets only from its own sub-queue]

Multiple Queues Architecture (2)
- The queue space is distributed between the worker threads: the dispatcher can access the whole queue, while each worker thread can access only its dedicated sub-queue
- In-situ sniffing: no copying from dispatcher space to worker space
- Mutual exclusion is assured by get and set indices (get chases set): no locking overhead
- Location access function: get packet if get < set (workers); put packet if get ≤ set (dispatcher); wait otherwise (a ring-buffer sketch follows)
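A sketch of one per-worker sub-queue with get chasing set; since only the dispatcher writes set and only the owning worker writes get, each index has a single writer and no lock is needed. The plain volatile indices are illustrative; real code would use C11 atomics or barriers.

    #include <stdint.h>

    #define RING 256                 /* power of two keeps wrap-around cheap */

    struct ring {
        volatile uint32_t set;       /* next slot the dispatcher fills   */
        volatile uint32_t get;       /* next slot the worker consumes    */
        void *pkt[RING];
    };

    /* Dispatcher: put packet unless the ring is full -- O(1) */
    int ring_put(struct ring *r, void *p)
    {
        if (r->set - r->get == RING)
            return -1;               /* full: wait */
        r->pkt[r->set % RING] = p;
        r->set++;                    /* publish the slot to the worker */
        return 0;
    }

    /* Worker: get packet if get < set, otherwise wait -- O(1) */
    void *ring_get(struct ring *r)
    {
        if (r->get == r->set)
            return 0;                /* empty */
        void *p = r->pkt[r->get % RING];
        r->get++;                    /* return the slot to the dispatcher */
        return p;
    }

Both operations touch one slot and bump one index, so enqueue and dequeue are O(1); the slides note that the initial implementation used O(n) versions, which is what the optimized architecture below replaces.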

Performance Results
- Throughput is better for 1-2 threads but drops afterwards
- Cause: inefficient (O(n)) enqueue and dequeue functions

Optimized Architecture
- Improved queuing: O(1) instead of O(n), one queue per worker thread, and a design with no locking
- Other improvements: removed duplication in packet capturing, removed some redundant operations (e.g. layer 3 protocol type checks for IP, ARP, etc.), and improved 5-tuple matching

Optimized Performance: Packet Capture
- Throughput scales from 1 to 2 threads
- Saturates at the 1 Gbps link speed limit

Optimized Performance: Matching 5-Tuple
- Performance is similar to packet capturing: the 5-tuple comparison is not very expensive

Optimized Performance: String Matching in Payload
- String not in payload: throughput scales linearly from 1 to 4 threads/cores and saturates at the 1 Gbps link bandwidth
- String in payload: saturates with 2 threads/cores; significantly less compute-intensive because the string is found close to the start of the payload

Key Takeaways for Today's Session
- Multi-core application development requires identifying concurrent tasks, identifying the interactions among tasks, and mapping them onto multiple cores using threads (fork-and-join based threading)
- Initial development is fairly simple, but the real challenge is obtaining scalability
- Scalability requires tuning at the application level as well as thread-level performance tuning
- Locating bottlenecks using instrumentation, profiling, and measurements is itself a challenging task