Applications and Runtime for multicore/manycore
March 21, 2007
Geoffrey Fox, Community Grids Laboratory, Indiana University, 505 N Morton Suite 224, Bloomington, IN

Pradeep K. Dubey, Intel. RMS: Recognition Mining Synthesis.
- Recognition ("What is …?"): Model. Today: model-less. Tomorrow: model-based multimodal recognition.
- Mining ("Is it …?"): find a model instance. Today: real-time streaming and transactions on static, structured datasets. Tomorrow: real-time analytics on dynamic, unstructured, multimodal datasets.
- Synthesis ("What if …?"): create a model instance. Today: very limited realism. Tomorrow: photo-realism and physics-based animation.

[Figure] Intel's Application Stack: some applications discussed in seminars; the rest is mainly classic parallel computing.

Some Bioinformatics Datamining
1. Multiple Sequence Alignment (MSA). Kernel algorithms: HMM (Hidden Markov Model); pairwise alignments (dynamic programming, sketched below) with heuristics (e.g. progressive, iterative methods).
2. Motif Discovery. Kernel algorithms: MEME (Multiple Expectation Maximization for Motif Elicitation); Gibbs sampler.
3. Gene Finding (Prediction). Kernel algorithms: Hidden Markov Methods.
4. Sequence Database Search. Kernel algorithms: BLAST (Basic Local Alignment Search Tool); PatternHunter; FASTA.
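The pairwise alignment kernel in item 1 is, at its core, a dynamic-programming recurrence. Below is a minimal C sketch of a Needleman-Wunsch style global alignment score; this is not the authors' code, and the scoring constants and test strings are illustrative assumptions.

```c
/* Minimal sketch of a dynamic-programming pairwise alignment kernel
 * (Needleman-Wunsch style global alignment score).
 * MATCH, MISMATCH and GAP values are illustrative assumptions. */
#include <stdio.h>
#include <string.h>

#define MATCH     2
#define MISMATCH -1
#define GAP      -2

static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Fill the (n+1) x (m+1) DP table row by row, keeping only two rows. */
int align_score(const char *s, const char *t) {
    int n = (int)strlen(s), m = (int)strlen(t);
    int prev[m + 1], curr[m + 1];          /* C99 variable-length arrays */
    for (int j = 0; j <= m; j++) prev[j] = j * GAP;
    for (int i = 1; i <= n; i++) {
        curr[0] = i * GAP;
        for (int j = 1; j <= m; j++) {
            int sub = prev[j - 1] + (s[i - 1] == t[j - 1] ? MATCH : MISMATCH);
            curr[j] = max3(sub, prev[j] + GAP, curr[j - 1] + GAP);
        }
        memcpy(prev, curr, sizeof(curr));
    }
    return prev[m];
}

int main(void) {
    /* Hypothetical test sequences, just to exercise the kernel. */
    printf("score = %d\n", align_score("GATTACA", "GCATGCU"));
    return 0;
}
```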

Berkeley Dwarfs
Dense Linear Algebra, Sparse Linear Algebra, Spectral Methods, N-Body Methods, Structured Grids, Unstructured Grids, Pleasingly Parallel, Combinatorial Logic, Graph Traversal, Dynamic Programming, Branch & Bound, Graphical Models (HMM), Finite State Machine.
Consistent in spirit with the Intel analysis. I prefer to develop a few key applications rather than debate their classification!

Client-side Multicore Applications
- "Lots of not very parallel applications": gaming, graphics, codec conversion for multiple-user conferencing, etc.
- Complex data querying and data manipulation/optimization/regression; database and datamining (including computer vision) (Recognition and Mining in the Intel analysis); statistical packages as in Excel and R.
- Scenario and model simulations (Synthesis for Intel).
Multiple users give rise to several server-side multicore applications.
There are important architecture issues, including memory bandwidth, not discussed here!

Approach I
Integrate Intel, Berkeley and other sources, including databases (successful on current parallel machines, like scientific applications), and define parallel approaches in a "white paper".
Develop some key examples testing 3 parallel programming paradigms (illustrated below):
- Coarse-grain functional parallelism (as in workflow), including pleasingly parallel instances with different data
- Fine-grain functional parallelism (as in integer programming)
- Data parallel (loosely synchronous, as in science)
Construct these so that they can use different runtimes, including perhaps CCR/DSS, MPI and Data Parallel .NET.
Maybe these will become libraries, used as in MapReduce, workflow coordination languages, etc.
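As a concrete illustration of the first and third paradigms (the slides do not prescribe any implementation), here is a short C/OpenMP sketch: two independent tasks run as coarse-grain sections, followed by a loosely synchronous data-parallel loop. Function names, the array size and the constants are hypothetical.

```c
/* Sketch of coarse-grain functional parallelism (independent tasks as
 * OpenMP sections) and data parallelism (a loop split across threads).
 * Purely illustrative; not the CCR/DSS or .NET runtimes discussed above. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double data[N];

static void prepare(void)    { for (int i = 0; i < N; i++) data[i] = i * 0.001; }
static void report(double s) { printf("sum = %f\n", s); }

int main(void) {
    double sum = 0.0;

    /* Coarse-grain functional parallelism: independent stages as sections. */
    #pragma omp parallel sections
    {
        #pragma omp section
        prepare();
        #pragma omp section
        printf("another independent task could run here\n");
    }

    /* Data parallelism: the same operation applied to disjoint parts of the data. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i] * data[i];

    report(sum);
    return 0;
}
```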

Approach II
Have looked at CCR in MPI-style applications; it seems to work quite well and supports more general messaging models.
NAS benchmark using CCR to confirm its utility.
Developing 4 exemplar multicore parallel applications:
- Support Vector Machines (linear algebra): data parallel
- Deterministic Annealing (statistical physics): data parallel
- Computer Chess or Mixed Integer Programming: fine-grain parallelism
- Hidden Markov Method (genetic algorithms): loosely coupled functional parallelism
Test high-level coordination of such parallel applications in libraries.

CCR for Data Parallel (Loosely Synchronous) Applications
- CCR supports general coordination of messages queued in ports, in Handler or Rendezvous mode.
- DSS builds a service model on CCR and supports coarse-grain functional parallelism.
- Basic CCR supports fine-grain parallelism as in computer chess (and could use STM-enabled primitives?).
- MPI has well-known collective communication operations that supply scalable global synchronization etc.; look at the performance of MPI_Sendrecv (a minimal sketch follows below).
- What is the model that best encompasses shared- and distributed-memory approaches for "data parallel" problems? This could be put on top of CCR, given much faster internal versions of CCR.
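For readers unfamiliar with MPI_Sendrecv, here is a minimal C/MPI sketch of the shift pattern on a ring of ranks. This is a generic MPI example standing in for the CCR port-based tests measured here, not the code used for the measurements.

```c
/* Shift pattern on a ring using MPI_Sendrecv: each rank sends one integer
 * to its right neighbour and receives one from its left neighbour in a
 * single call. Run with e.g. "mpirun -np 4 ./shift". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* destination of the shift */
    int left  = (rank - 1 + size) % size;   /* source of the shift */
    int sendbuf = rank, recvbuf = -1;

    MPI_Sendrecv(&sendbuf, 1, MPI_INT, right, 0,
                 &recvbuf, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %d from rank %d\n", rank, recvbuf, left);
    MPI_Finalize();
    return 0;
}
```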

[Figure] Four communication patterns used in the CCR tests: (a) Pipeline, (b) Shift, (c) Two Shifts, (d) Exchange. In each pattern four threads (Thread0-Thread3) communicate through four ports (Port 0-Port 3). Patterns (a) and (b) use CCR Receive, while (c) and (d) use CCR Multiple Item Receive. Used on the 4-core AMD, 4-core Xeon and 8-core Xeon systems; the latter run up to 8-way parallelism.

[Figure] Exchange topology for loosely synchronous execution in CCR: threads exchange messages with a 1D torus topology, writing exchanged messages to ports in one stage and reading them in the next. A single computation is broken into a varying number of stages, with per-stage computation ranging from 1.4 microseconds to 14 seconds on the AMD machine (1.6 microseconds to 16 seconds on the quad-core Xeon).

[Figure] A fixed amount of computation ( units) divided among 4 cores and from 1 to 10^7 stages on the HP Opteron multicore, with each stage separated by reading and writing CCR ports in Pipeline mode (4-way Pipeline pattern, 4 dispatcher threads). Time in seconds is plotted against the number of stages (millions) for 1.4 microseconds and 14 microseconds of computation per stage, alongside the computation component expected if there were no overhead. The overhead averages 8.04 microseconds per stage from 1 to 10 million stages.
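A hedged sketch of how such a per-stage overhead can be measured: the original tests used CCR ports from C#, but the same idea can be expressed in C with an OpenMP barrier by timing many stages of a small compute kernel with and without per-stage synchronization. The thread count, stage count and kernel size below are illustrative, not the exact values behind the figure.

```c
/* Estimate synchronization overhead per stage: run many stages of a tiny
 * compute kernel on 4 threads with a barrier per stage, subtract the time
 * for the same work without barriers, and divide by the stage count. */
#include <omp.h>
#include <stdio.h>

/* Small fixed computation per stage; the seed keeps the compiler from
 * hoisting the call out of the stage loop. */
static double kernel(int units, double seed) {
    double x = 1.0 + seed * 1e-9;
    for (int i = 0; i < units; i++) x = x * 1.0000001 + 0.0000001;
    return x;
}

int main(void) {
    const int  threads = 4;
    const long stages  = 1000000;   /* 10^6 stages */
    const int  units   = 1000;      /* roughly a microsecond of work per stage */
    double sink = 0.0;

    /* Stages separated by a barrier (standing in for the CCR port exchange). */
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(threads) reduction(+:sink)
    for (long s = 0; s < stages; s++) {
        sink += kernel(units, (double)s);
        #pragma omp barrier
    }
    double synced = omp_get_wtime() - t0;

    /* Same work without per-stage synchronization. */
    t0 = omp_get_wtime();
    #pragma omp parallel num_threads(threads) reduction(+:sink)
    for (long s = 0; s < stages; s++)
        sink += kernel(units, (double)s);
    double unsynced = omp_get_wtime() - t0;

    printf("overhead per stage ~ %.3f microseconds (checksum %.3f)\n",
           (synced - unsynced) / (double)stages * 1e6, sink);
    return 0;
}
```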

[Figure] Stage overhead versus thread computation time: the overhead per stage is roughly constant up to about a million stages and then increases. Stage computation ranges from 14 microseconds to 14 seconds.

[Figure] A fixed amount of computation ( units) divided among 4 cores and from 1 to 10^7 stages on the Dell multicore with two 2-core Xeon processors, with each stage separated by reading and writing CCR ports in Pipeline mode (4-way Pipeline pattern, 4 dispatcher threads). Time in seconds is plotted against the number of stages (millions), alongside the computation component expected if there were no overhead, giving the overhead in microseconds per stage averaged from 1 to 10 million stages.

Summary of Stage Overheads for the AMD 2-core, 2-processor Machine
These are stage-switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 28 microseconds (500,000 stages).

Summary of Stage Overheads for the Intel 2-core, 2-processor Machine
These are stage-switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 30 microseconds. AMD overheads are shown in parentheses. These measurements are equivalent to MPI latencies.

Summary of Stage Overheads for the Intel 4-core, 2-processor Machine
These are stage-switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 30 microseconds. The 2-core, 2-processor Xeon overheads are shown in parentheses. These measurements are equivalent to MPI latencies; a minimal latency sketch follows below.
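For comparison, MPI latency is conventionally measured with a ping-pong test. The following generic C/MPI sketch (not the benchmark behind the numbers above) reports the one-way latency for a one-integer message between ranks 0 and 1.

```c
/* Ping-pong latency test: ranks 0 and 1 bounce a single integer back and
 * forth; the one-way latency is half the round-trip time averaged over
 * many repetitions. Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 100000;
    int msg = 0;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0)
        printf("one-way latency ~ %.2f microseconds\n", dt / (2.0 * reps) * 1e6);
    MPI_Finalize();
    return 0;
}
```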

[Figure] 8-way Parallel Pipeline on two 4-core Xeons (XP Pro). Histogram of 100 runs; each run has 500,000 synchronizations following a thread execution stage, so the overhead of 6.1 microseconds is modest. Message size is just one integer. Choose a computation unit that is appropriate for a few microseconds of stage overhead. (Companion measurement: AMD 4-way, XP Pro.)

[Figure] 8-way Parallel Shift on two 4-core Xeons (XP Pro and Vista). Histogram of 100 runs; each run has 500,000 synchronizations following a thread execution stage, so the overhead of 8.2 microseconds is modest. Shift versus Pipeline adds a microsecond to the cost. It is unclear what causes the second peak. (Companion measurement: AMD 4-way, XP Pro.)

[Figure] 8-way Parallel Double Shift on two 4-core Xeons (XP Pro). Histogram of 100 runs; each run has 500,000 synchronizations following a thread execution stage, so the overhead of 22.3 microseconds is significant. It is unclear why the double shift is slow compared to the shift. Exchange performance partly reflects the number of messages. Opteron overheads are significantly lower than Intel's. (Companion measurement: AMD 4-way, XP Pro.)

AMD 2-core, 2-processor Bandwidth Measurements
Previously we measured latency, since the measurements used small messages. We made a further set of bandwidth measurements by exchanging larger messages of different sizes between threads. We used three types of data structure for receiving data:
- an array inside the thread, equal to the message size
- an array outside the thread, equal to the message size
- data stored sequentially in a large array (a "stepped" array)
For both AMD and Intel, total bandwidth is 1 to 2 gigabytes/second.
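A rough C sketch of the "stepped array" variant of such a measurement: a fixed message is copied into successive offsets of one large array and the resulting bandwidth is reported. It omits the CCR threading harness, and the message size and stage count are illustrative rather than the values used in the slides.

```c
/* Stepped-array bandwidth sketch: copy a message of msg_words doubles into
 * successive locations of one large destination array and report GB/s. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    const size_t msg_words = 100000;   /* 10^5 double words per stage (illustrative) */
    const size_t stages    = 100;      /* number of stepped copies (illustrative) */
    double *src  = malloc(msg_words * sizeof(double));
    double *dest = malloc(msg_words * stages * sizeof(double));
    if (!src || !dest) return 1;
    for (size_t i = 0; i < msg_words; i++) src[i] = (double)i;

    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (size_t s = 0; s < stages; s++)          /* stepped destination per stage */
        memcpy(dest + s * msg_words, src, msg_words * sizeof(double));
    clock_gettime(CLOCK_MONOTONIC, &b);

    double secs  = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
    double bytes = (double)stages * msg_words * sizeof(double);
    printf("bandwidth ~ %.2f GB/s (checksum %f)\n",
           bytes / secs / 1e9, dest[msg_words * (stages - 1) + 7]);
    free(src);
    free(dest);
    return 0;
}
```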

Intel 2-core, 2-processor Bandwidth Measurements
For bandwidth, the Intel did better than the AMD, especially when one exploited on-chip cache with small transfers. For both AMD and Intel, each stage executed a computational task after copying data arrays of size 10^5 (labeled small), 10^6 (labeled large) or 10^7 double words. The last column is an approximate value in microseconds of the compute time for each stage. Note that copying 100,000 double precision words per core at a gigabyte/second of bandwidth takes 3200 µs. The data to be copied (the message payload in CCR) is fixed, and its creation time is outside the timed process.

[Figure] Typical bandwidth measurements showing the effect of cache as a change of slope: run time in seconds for 5,000 stages plotted against the size of the double array (millions of double words) copied in each stage from a thread to stepped locations in a large array on the Dell Xeon multicore (4-way Pipeline pattern, 4 dispatcher threads). Total bandwidth is 1.0 gigabytes/sec up to one million double words and 1.75 gigabytes/sec up to 100,000 double words; the slope change marks the cache effect.

DSS Service Measurements
[Figure] Timing of the HP Opteron multicore as a function of the number of simultaneous two-way service messages processed (November 2006 DSS release). CGL measurements of Axis 2 show about 500 microseconds; DSS is 10 times better.