Programming Multi-Core Processors based Embedded Systems A Hands-On Experience on Cavium Octeon based Platforms Lab Exercises.

1-2 Lab 1: Parallel Programming and Performance Measurement Using MPAC

1-3 Lab 1 – Goals
Objective
- Use MPAC benchmarks to measure the performance of different subsystems of multi-core based systems
- Understand accurate measurement methods for multi-core based systems
- Use MPAC to learn to develop parallel programs
Mechanism
- MPAC CPU and memory benchmarks exercise the processor and memory units by generating compute- and memory-intensive workloads

1-4 Observations
- Observe the throughput with an increasing number of threads for compute- and memory-intensive workloads
- Identify performance bottlenecks

1-5 MPAC fork and join infrastructure
- In MPAC-based applications, initialization and argument handling are performed by the main thread.
- The tasks to be run in parallel are forked to worker threads.
- The worker threads join after completing their tasks.
- Final processing is done by the main thread.

1-6 MPAC code structure

1-7 Compile and Run on Host System (Memory Benchmark)
host$ cd <path>/mpac_1.2
host$ ./configure
host$ make clean
host$ make
host$ cd benchmarks/mem
host$ ./mpac_mem_bm -n <num_threads> -s <data_size> -r <repetitions> -t <type>
For help:
host$ ./mpac_mem_bm -h

1-8 Cross Compile for Target System (Memory Benchmark)
Cross compile on the host system. Go to the Cavium SDK directory and run:
host$ source env-setup <OCTEON_MODEL> (where <OCTEON_MODEL> is the model of your target board, e.g. OCTEON_CN56XX)
host$ cd <path>/mpac_1.2
host$ ./configure --host=i386-redhat-linux-gnu --target=mips64-octeon-linux-gnu
host$ export CC=mips64-octeon-linux-gnu-gcc
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc

1-9 Run on Target System (Memory Benchmark)
Copy the executable "mpac_mem_bm" to the target system.
target$ ./mpac_mem_bm -n <num_threads> -s <data_size> -r <repetitions> -t <type>
For help:
target$ ./mpac_mem_bm -h

1-10 Performance Measurements (Memory)
(a) 4 KB:  $ ./mpac_mem_bm -n <num_threads> -r <repetitions> -s 512 -t i
(b) 16 KB: $ ./mpac_mem_bm -n <num_threads> -r <repetitions> -s 2048 -t i
(c) 1 MB:  $ ./mpac_mem_bm -n <num_threads> -r 100 -s <data_size> -t i
(d) 16 MB: $ ./mpac_mem_bm -n <num_threads> -r 10 -s <data_size> -t i
Results taken on a Cavium Networks EVB 5610 board.

1-11 Performance Measurements (Memory)
- Data sizes of 4 KB, 16 KB, 1 MB, and 16 MB exercise the L1 cache, L2 cache, and main memory of the target system.
- A Cavium Octeon (MIPS64) CN5610 evaluation board is used as the System Under Test (SUT).
- Throughput scales linearly with the number of threads in all cases.
- For data larger than the L2 cache (2 MB), throughput would normally not be expected to scale linearly; the linearity seen here is due to the low-latency interconnect used in place of a conventional system bus.

1-12 Compile and Run on Host System (CPU Benchmark)
host$ cd <path>/mpac_1.2
host$ ./configure
host$ make clean
host$ make
host$ cd benchmarks/cpu
host$ ./mpac_cpu_bm -n <num_threads> -r <repetitions>
For help:
host$ ./mpac_cpu_bm -h

1-13 Cross Compile for Target System (CPU Benchmark)
Cross compile on the host system. Go to the Cavium SDK directory and run:
host$ source env-setup <OCTEON_MODEL> (where <OCTEON_MODEL> is the model of your target board, e.g. OCTEON_CN56XX)
host$ cd <path>/mpac_1.2
host$ ./configure --host=i386-redhat-linux-gnu --target=mips64-octeon-linux-gnu
host$ export CC=mips64-octeon-linux-gnu-gcc
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc

1-14 Run on Target System (CPU Benchmark)
Copy the executable "mpac_cpu_bm" to the target system.
target$ ./mpac_cpu_bm -n <num_threads> -r <repetitions>
For help:
target$ ./mpac_cpu_bm -h

1-15 Performance Measurements (CPU)
(a) Integer unit operation: $ ./mpac_cpu_bm -n <num_threads> -r <repetitions> -a 1 -u i
(b) Logical unit operation: $ ./mpac_cpu_bm -n <num_threads> -r <repetitions> -a 1 -u l
Results taken on a Cavium Networks EVB 5610 board.

1-16 Performance Measurements (CPU)
- The integer unit (summation) and logical unit (string operations) of the processor are exercised.
- A Cavium Octeon (MIPS64) CN5610 evaluation board is used as the System Under Test (SUT).
- Throughput scales linearly with the number of threads in both cases.

1-17 Measurement of Execution Time
- Measuring the elapsed time from the start of a task to its completion is straightforward for a sequential task.
- The procedure becomes complex when the same task is executed concurrently by n threads on n distinct processors or cores.
- It is not guaranteed that all tasks start at the same time or complete at the same time.
- The measurement is therefore imprecise due to the concurrent nature of the tasks.

1-18 Cont…
- Execution time can be measured either globally or locally.
- Global measurement: execution time is the difference of the time stamps taken at the global fork and join instants.
- Local measurement: each of the n threads measures and records its own execution time; after the threads join, the maximum of these individual times provides an estimate of the overall execution time.

1-19 Definitions LETE: Local Execution Time Estimation GETE: Global Execution Time Estimation

1-20 Cont…
(Diagrams illustrating the LETE and GETE measurement points.)

1-21 The Problem
- Lack of precision: some tasks finish before others
- Synchronization issues with a large number of cores
- Results are not repeatable

1-22 Performance Measurement Methodologies
Multithreaded case (threads 1, 2, 3, …, K):
- Get start time at the barrier
- Repeat for N iterations
- Get end time at the barrier
Sequential case:
- Get start time
- Repeat for N iterations
- Get end time

1-23 Accurate LETE Measurement Methodology
For threads 1, 2, 3, …, K:
- Synchronize the threads before each round using a barrier
- Repeat for N rounds
- Take the maximum elapsed time over the threads for each round

1-24 Measurement Observations

1-25 Accurate MINMAX Approach
- Repeat for N iterations.
- Store the local execution time of each thread for each iteration.
- For each iteration, keep the largest execution time among the threads.
- This yields N largest-execution-time values.
- Choose the minimum of these N values as the execution time: the MINMAX value.

1-26 MPAC Hello World
Objective
- Write a simple "Hello World" program using MPAC
Mechanism
- The user specifies the number of worker threads on the command line
- Each worker thread prints "Hello World" and exits

1-27 Compile and Run
$ cd <path>/mpac_1.2/apps/hello
$ make clean
$ make
$ ./mpac_hello_app -n <num_threads>