Programming Multi-Core Processors based Embedded Systems A Hands-On Experience on Cavium Octeon based Platforms Lab Exercises: Lab 1 (Performance measurement)

Lab 1: Parallel Programming and Performance Measurement Using MPAC

Lab 1 Goals

Objective
- Use MPAC benchmarks to measure the performance of different subsystems of multi-core systems
- Use MPAC to learn to develop parallel programs

Mechanism
- The MPAC CPU and memory benchmarks exercise the processor and the memory unit by generating compute-intensive and memory-intensive workloads

What to Look For

Observations
- Observe the throughput with an increasing number of threads for compute-intensive and memory-intensive workloads
- Identify performance bottlenecks

Measurement of Execution Time

Measuring the elapsed time from the start of a task to its completion is straightforward for a sequential task. The procedure becomes complex when the same task is executed concurrently by n threads on n distinct processors or cores: the threads are not guaranteed to all start or complete at the same time, so the measurement is imprecise due to the concurrent nature of the tasks.
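For the sequential case, a single pair of timestamps around the task is enough. Below is a minimal sketch using the POSIX clock_gettime() interface; the workload function is a hypothetical stand-in, not MPAC code.

#include <stdio.h>
#include <time.h>

static void workload(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 10000000L; i++)
        x += (double)i * 0.5;   /* compute-intensive busy loop */
}

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    workload();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %.6f s\n", elapsed);
    return 0;
}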

Global vs. Local Measurement

Execution time can be measured either globally or locally. With global measurement, the execution time is the difference between timestamps taken at the global fork and join instants. With local measurement, each of the n threads measures and records its own execution time; after the threads join, the maximum of these individual times provides an estimate of the overall execution time.
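After the threads join, the local approach reduces the per-thread times to their maximum. A sketch, assuming the per-thread elapsed times have already been stored in an array (the names are illustrative, not MPAC's API):

/* Local execution time estimation: after joining, take the maximum
 * of the per-thread elapsed times as the overall execution time. */
double local_estimate(const double *thread_time, int num_threads)
{
    double max_time = thread_time[0];
    for (int i = 1; i < num_threads; i++)
        if (thread_time[i] > max_time)
            max_time = thread_time[i];
    return max_time;
}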

Definitions
- LETE: Local Execution Time Estimation
- GETE: Global Execution Time Estimation

LETE vs. GETE (timeline diagrams on the original slide)

The Problem: Lack of Precision
- Some tasks finish before others
- Synchronization becomes an issue with a large number of cores
- Results are not repeatable

Performance Measurement Methodologies

For the multithreaded case (threads 1 through K), as sketched below:
- Get the start time at a barrier
- Repeat the task for N iterations
- Get the end time at a barrier

For the sequential case:
- Get the start time
- Repeat the task for N iterations
- Get the end time
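A minimal pthreads sketch of the multithreaded methodology, assuming K worker threads and a POSIX barrier; this shows the pattern, not MPAC's actual infrastructure. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define K 4     /* number of worker threads (assumed) */
#define N 100   /* iterations of the measured task    */

static pthread_barrier_t barrier;

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void *worker(void *arg)
{
    (void)arg;
    pthread_barrier_wait(&barrier);   /* all threads start together */
    double start = now();
    for (int i = 0; i < N; i++) {
        /* ... measured task ... */
    }
    double elapsed = now() - start;
    pthread_barrier_wait(&barrier);   /* all threads end together */
    printf("thread time: %.6f s\n", elapsed);
    return NULL;
}

int main(void)
{
    pthread_t tid[K];
    pthread_barrier_init(&barrier, NULL, K);
    for (int i = 0; i < K; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < K; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}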

Accurate LETE Measurement Methodology

For threads 1 through K:
- Synchronize the threads before each round using a barrier
- Repeat for N rounds
- Record the maximum elapsed time among the threads for each round

Measurement Observations (results chart on the original slide)

Accurate MINMAX Approach
- Repeat for N iterations, storing the local execution time of each thread for each iteration
- For each iteration, keep the largest execution time among the threads
- This yields N largest-execution-time values
- Choose the minimum of these N values as the execution time: the MINMAX value

A sketch of this reduction follows.
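A sketch of the MINMAX reduction, assuming the per-thread, per-iteration times have already been collected (the array layout and names are illustrative):

/* times[i][t] holds the local execution time of thread t in iteration i.
 * For each iteration take the maximum over threads, then report the
 * minimum of those maxima: the MINMAX value. */
double minmax(int n_iters, int n_threads, double times[n_iters][n_threads])
{
    double result = 0.0;
    for (int i = 0; i < n_iters; i++) {
        double round_max = times[i][0];
        for (int t = 1; t < n_threads; t++)
            if (times[i][t] > round_max)
                round_max = times[i][t];
        if (i == 0 || round_max < result)
            result = round_max;   /* keep the smallest round maximum */
    }
    return result;
}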

Compile and Run (Memory Benchmark)

$ cd /<path-to>/mpac_1.2
$ ./configure
$ make clean
$ make
$ cd benchmarks/mem
$ ./mpac_mem_bm -n <num_threads> -s <data_size> -r <repetitions> -t <type>

For help:
$ ./mpac_mem_bm -h

Compile and Run (CPU Benchmark)

$ cd /<path-to>/mpac_1.2
$ ./configure
$ make clean
$ make
$ cd benchmarks/cpu
$ ./mpac_cpu_bm -n <num_threads> -r <repetitions>

For help:
$ ./mpac_cpu_bm -h

Performance Measurements (CPU)
- The integer unit (summation), floating-point unit (sine), and logical unit (string operations) of the processor are exercised.
- Intel Xeon, AMD Opteron (x86), and Cavium Octeon (MIPS64) machines are used as the systems under test (SUTs).
- Throughput scales linearly with the number of threads in all cases.

Performance Measurements (Memory)
- With concurrent symmetric threads, one expects memory-to-memory copy throughput to scale with the number of threads.
- With data sizes of 4 KB, 16 KB, and 1 MB, most memory accesses hit the L2 caches rather than main memory; for these cases the throughput scales linearly.

Performance Measurements (Memory, continued)
- Copying 16 MB requires extensive main-memory accesses.
- On the Intel SUT, a shared bus is used, so throughput is lower than in the cases where accesses hit the L2 caches, and it saturates as the bus becomes a bottleneck.
- Memory-copy throughput saturates at around 40 Gbps, about half of the available bus bandwidth (64 bits x 1333 MHz = 85.3 Gbps).
- On the AMD and Cavium SUTs, throughput scales linearly even for the 16 MB case because they use low-latency integrated memory controllers instead of a shared system bus.

MPAC Fork and Join Infrastructure
- In MPAC-based applications, initialization and argument handling are performed by the main thread.
- The tasks to be run in parallel are forked to worker threads.
- The worker threads join after completing their tasks.
- Final processing is done by the main thread.

A pthreads sketch of this pattern follows.
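A sketch of the fork/join pattern MPAC applications follow; this is plain pthreads with illustrative names, not MPAC's actual API. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int id;   /* worker thread index */
} worker_arg_t;

static void *worker(void *p)
{
    worker_arg_t *arg = p;
    /* ... parallel task runs here ... */
    printf("worker %d done\n", arg->id);
    return NULL;
}

int main(int argc, char **argv)
{
    /* Main thread: initialization and argument handling. */
    int n = (argc > 1) ? atoi(argv[1]) : 4;
    pthread_t *tid = malloc(n * sizeof *tid);
    worker_arg_t *args = malloc(n * sizeof *args);

    /* Fork: hand the parallel task to n worker threads. */
    for (int i = 0; i < n; i++) {
        args[i].id = i;
        pthread_create(&tid[i], NULL, worker, &args[i]);
    }

    /* Join: wait for all workers, then do final processing. */
    for (int i = 0; i < n; i++)
        pthread_join(tid[i], NULL);

    free(args);
    free(tid);
    return 0;
}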

MPAC Code Structure (diagram on the original slide)

MPAC Hello World

Objective
- Write a simple "Hello World" program using MPAC

Mechanism
- The user specifies the number of worker threads on the command line
- Each worker thread prints "Hello World" and exits

A rough pthreads equivalent is sketched below.
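A rough pthreads equivalent of the hello example; MPAC's own mpac_hello_app uses the MPAC infrastructure instead. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *worker(void *arg)
{
    printf("Hello World from thread %ld\n", (long)arg);
    return NULL;
}

int main(int argc, char **argv)
{
    long n = (argc > 1) ? atol(argv[1]) : 2;   /* number of worker threads */
    pthread_t *tid = malloc(n * sizeof *tid);

    for (long i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < n; i++)
        pthread_join(tid[i], NULL);

    free(tid);
    return 0;
}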

Compile and Run (Hello World)

$ cd /<path-to>/mpac_1.2/apps/hello
$ make clean
$ make
$ ./mpac_hello_app -n <num_threads>