ENERGY AND POWER CHARACTERIZATION OF PARALLEL PROGRAMS RUNNING ON THE INTEL XEON PHI
JOAL WOOD, ZILIANG ZONG, QIJUN GU, RONG GE

THE XEON PHI COPROCESSOR
Equipped with 60 x86-based cores, each capable of running 4 hardware threads simultaneously. Designed for high computation density. Used in both the Tianhe-2 and Stampede supercomputers.

OVERVIEW OF OUR WORK
We profile the power and energy of multiple algorithms with contrasting workloads. We concentrate on the performance and energy impact of increasing the number of threads, of running code in native versus offloaded mode, and of co-running selected algorithms on the Xeon Phi. We also describe how to correctly profile the instantaneous power of the Xeon Phi using its built-in power sensors.

XEON PHI POWER DATA
Power data is collected using the MICAccessAPI, a C/C++ library that allows users to monitor and configure several metrics (including power) of the coprocessor. The power results we present are measured and recorded by issuing MicGetPowerUsage() calls to the MICAccessAPI during the execution of each experiment.
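As an illustration, the sampling loop can be as simple as the sketch below. Only MicGetPowerUsage() is named above, so the adapter setup and struct fields here are assumptions to be checked against the MicAccessAPI headers for your MPSS version. Logging a timestamp with every sample matters for plotting the trace correctly (see the plotting slide below).

```c
/* Minimal power-sampling sketch. MicGetPowerUsage() is the call used in
 * this work; the adapter setup is elided and the struct field names are
 * assumptions, so verify them against MicAccessApi.h for your MPSS version. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    /* ... initialize the MICAccessAPI and open the adapter here ... */

    FILE *log = fopen("power_trace.csv", "w");
    fprintf(log, "timestamp_s,power_w\n");

    struct timespec delay = { 0, 50 * 1000 * 1000 };   /* ~50 ms interval */
    for (int i = 0; i < 12000; ++i) {                  /* ~10 minutes */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        double now = ts.tv_sec + ts.tv_nsec / 1e9;

        double watts = 0.0;
        /* MicGetPowerUsage(&adapter, &pwr);   -- exact signature assumed */
        /* watts = pwr.total_power / 1e6;      -- field name assumed      */

        fprintf(log, "%.3f,%.3f\n", now, watts);
        nanosleep(&delay, NULL);
    }
    fclose(log);
    return 0;
}
```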

SELECTED ALGORITHMS
Barnes-Hut simulation – O(n log n) n-body approximation.
Shellsort – comparison-based exchange/insertion sort.
SSSP – single-source shortest path (Dijkstra's algorithm) graph search.
Fibonacci – calculates 45 Fibonacci numbers.

POWER TRACING
Graphing the instantaneous power of these algorithms lets us confirm much of what can be inferred about performance and energy from the implementation. It also reveals application behavior that is not otherwise obvious and can lead to new findings.

BARNES-HUT
Designed to solve the n-body simulation problem by approximating the forces acting on each body. Uses an octree data structure to achieve a time complexity of O(n log n). Memory access and control flow patterns are irregular, since different parts of the octree must be traversed to compute the forces on each body. Balanced workload: each thread performs roughly the same amount of force calculation per iteration.
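To make the irregular traversal concrete, the sketch below shows the standard opening-angle force walk. The node layout, THETA value, and names are our assumptions for illustration, not the implementation measured in this work.

```c
/* Sketch of the core Barnes-Hut force traversal (node layout, THETA, and
 * names are illustrative assumptions, not the measured implementation). */
#include <math.h>

typedef struct Node {
    double mass;               /* total mass of the bodies in this cell */
    double cx, cy, cz;         /* center of mass */
    double size;               /* side length of this octree cell */
    struct Node *child[8];     /* NULL for absent children */
} Node;

#define THETA 0.5              /* opening-angle threshold */

/* Accumulate into f[] the force that cell n exerts on a unit-mass body at
 * (px, py, pz). Self-interaction is ignored for brevity; the softening
 * term keeps the math finite. */
void add_force(const Node *n, double px, double py, double pz, double f[3])
{
    if (!n || n->mass == 0.0)
        return;

    double dx = n->cx - px, dy = n->cy - py, dz = n->cz - pz;
    double d2 = dx * dx + dy * dy + dz * dz + 1e-12;  /* softening */
    double d  = sqrt(d2);

    int is_leaf = 1;
    for (int i = 0; i < 8; ++i)
        if (n->child[i]) is_leaf = 0;

    /* Far-away or leaf cells are treated as a single point mass
     * (s/d < THETA); otherwise descend into the children. */
    if (is_leaf || n->size / d < THETA) {
        double s = n->mass / (d2 * d);   /* G = 1 for simplicity */
        f[0] += s * dx; f[1] += s * dy; f[2] += s * dz;
    } else {
        for (int i = 0; i < 8; ++i)
            add_force(n->child[i], px, py, pz, f);
    }
}
```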

SHELLSORT
Comparison-based in-place sorting algorithm. Starts by comparing elements far apart and progressively reduces the gap between them. The workload gradually shrinks, because fewer swaps occur as the data set becomes nearly sorted.
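For reference, a minimal serial sketch of the algorithm is shown below. The parallel decomposition used in the measured version is not shown; one common choice is to split each gap pass across threads.

```c
/* Serial shellsort sketch: sort with a shrinking gap so the array is
 * nearly sorted by the time the final gap-1 insertion pass runs. */
void shellsort(int *a, int n)
{
    for (int gap = n / 2; gap > 0; gap /= 2) {
        for (int i = gap; i < n; ++i) {
            int v = a[i], j = i;
            /* Insertion sort within the gap-strided subsequence. */
            while (j >= gap && a[j - gap] > v) {
                a[j] = a[j - gap];
                j -= gap;
            }
            a[j] = v;
        }
    }
}
```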

SSSP
Returns the distance between two chosen nodes of the input graph. The amount of parallelism changes throughout execution. Unbalanced workload: each thread is given a different number of neighbor nodes for which to compute distances.
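A sketch of one parallel relaxation sweep illustrates the imbalance. The CSR graph layout and the OpenMP scheme here are assumptions for illustration, not the measured implementation.

```c
/* One parallel relaxation sweep over the current frontier (CSR layout and
 * OpenMP scheme are illustrative assumptions, not the measured code). */
void relax_frontier(const int *row_ptr, const int *col, const int *w,
                    const int *frontier, int frontier_size, int *dist)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < frontier_size; ++i) {
        int u = frontier[i];
        /* Each thread walks u's neighbor list; vertices have different
         * degrees, which is what makes the workload unbalanced. */
        for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
            int v  = col[e];
            int nd = dist[u] + w[e];
            if (nd < dist[v]) {
                #pragma omp critical
                if (nd < dist[v])     /* re-check inside the lock */
                    dist[v] = nd;
            }
        }
    }
}
```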

FIBONACCI
Calculates 45 Fibonacci sequence numbers. Each sequence position is assigned to a thread, which calculates the corresponding number. Highly unbalanced workload: threads assigned to the larger Fibonacci positions (45 and 46) require much more work. Changing the OMP_WAIT_POLICY environment variable had no observable influence on the power trace of Fibonacci.
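A sketch of why this workload is so unbalanced: if each position is computed with the naive exponential recursion, the largest positions dominate the total work. The exact range and the scheduling clause are assumptions.

```c
/* Sketch of the Fibonacci workload: one sequence position per loop
 * iteration, computed with the deliberately naive exponential recursion.
 * The exact range and scheduling are assumptions. */
#include <stdio.h>

static long fib(int n)
{
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

int main(void)
{
    long result[47] = { 0, 1 };
    /* Even dynamic scheduling cannot balance this loop: the threads that
     * draw the largest positions do almost all of the work. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 2; i <= 46; ++i)
        result[i] = fib(i);
    printf("fib(46) = %ld\n", result[46]);
    return 0;
}
```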

CORRECTLY PLOTTING THE INSTANTANEOUS POWER DATA
Because the power samples are not necessarily evenly spaced in time, plotting power against the sample number stretches and compresses phases of the run; the timestamp recorded with each sample should be used as the x-axis instead.
[Figure: two power traces of the same run. Correct power trace – x-axis as timestamp. Incorrect power trace – x-axis incrementing by sample number.]

NATIVE VS. OFFLOADED EXECUTION
The Xeon Phi offers native and offloaded execution modes. In native execution, the program runs entirely on the coprocessor; building a native application is a fast way to get existing software running with minimal code changes. Offloaded mode is a heterogeneous programming model in which developers designate specific code sections to run on the Xeon Phi. For our experiments, we offload the entire execution onto the Xeon Phi.
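For illustration, a minimal offload-mode fragment using the Intel compiler's offload pragma is sketched below; kernel() and the buffer names are hypothetical placeholders. In native mode, the same code is instead compiled with icc -mmic and launched directly on the card.

```c
/* Offloaded mode: the marked call runs on the coprocessor while the host
 * orchestrates; kernel() is a hypothetical placeholder for the real work. */
__attribute__((target(mic))) void kernel(double *x, int n);

void run(double *x, int n)
{
    /* Copy x to the card, run kernel() there, copy x back when done. */
    #pragma offload target(mic) inout(x : length(n))
    kernel(x, n);
}
```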

OFFLOADED SSSP
Energy consumption is slightly higher in offloaded mode than in native mode at every thread count, because offloaded performance is consistently slightly worse than native execution. However, the performance and energy deficit shrinks as more threads are used. Intuitively, offloading to the Xeon Phi with a high thread count (120, 240) implies energy savings, assuming the host CPU is utilized.

OFFLOADED SHELLSORT
Offloaded shellsort shows a much larger performance and energy deficit relative to its native version: native shellsort consistently runs 3-4X faster. These results show a clear performance and energy benefit to running code in native mode. Generally speaking, codes that do not perform extensive I/O and require only a modest memory footprint should be executed in native mode.

CO-RUNNING PROGRAMS
The Xeon Phi contains 60 physical cores and is capable of high computation density, so we explore the viability of co-running complementary workloads on it. We chose the Fibonacci calculation code as the ideal co-runner, as it performs best at lower thread counts. We are mostly interested in whether co-running these codes incurs significant performance and energy losses.

BARNES-HUT & FIBONACCI CO-RUN
These codes co-run well because Barnes-Hut is a very balanced workload that benefits from using more threads, while Fibonacci actually declines in performance when a large number of threads is used. This allows us to give as many threads as possible to Barnes-Hut while leaving a small thread pool to execute Fibonacci, as sketched below.
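A sketch of the thread split; the counts are illustrative, not the measured configuration, and in practice the same split can be set per process with the OMP_NUM_THREADS environment variable.

```c
#include <omp.h>

/* Called at startup in each co-running binary so the two OpenMP pools
 * share the card without oversubscribing it. The counts are illustrative:
 * most of the 60 cores x 4 threads go to Barnes-Hut, a small leftover
 * pool to Fibonacci. */
void configure_barnes_hut(void) { omp_set_num_threads(228); }
void configure_fibonacci(void)  { omp_set_num_threads(12); }
```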

SSSP & FIBONACCI CO-RUN
Fibonacci is an example of a workload that co-runs well when paired with programs that have a high degree of parallelism. SSSP is also a good candidate to co-run with Fibonacci; it yields results similar to those of co-running Barnes-Hut. Assuming memory contention is low, each of these co-running programs finishes with little performance cost.

CONCLUSIONS
The power trace generated from the Xeon Phi's built-in power sensors accurately captures run-time program behavior. Running code in native mode yields better performance and consumes less energy than offloaded mode. Co-running programs with complementary workloads has the potential to conserve energy with negligible performance degradation.

FUTURE WORK
Investigate the heterogeneous power and energy implications of offloading work to the Xeon Phi from the host CPU (currently, we look exclusively at data from the Xeon Phi). Compare the performance and energy of these algorithms with corresponding CPU and GPU implementations.

ACKNOWLEDGEMENT
The work reported in this paper is supported by the U.S. National Science Foundation under Grants CNS , CNS , CNS , and a grant from the Texas State University Research Enhancement Program.