Performance of a Multi-Paradigm Messaging Runtime on Multicore Systems. Poster at Grid 2007, Omni Austin Downtown Hotel, Austin, Texas, September 19, 2007.

Presentation transcript:

Performance of a Multi-Paradigm Messaging Runtime on Multicore Systems
Poster at Grid 2007, Omni Austin Downtown Hotel, Austin, Texas, September 19, 2007
Xiaohong Qiu, Research Computing UITS, Indiana University, Bloomington IN
Geoffrey Fox, H. Yuan, Seung-Hee Bae, Community Grids Laboratory, Indiana University, Bloomington IN
George Chrysanthakopoulos, Henrik Frystyk Nielsen, Microsoft Research, Redmond WA
Presented by Geoffrey Fox

Motivation
Exploring possible applications for tomorrow’s multicore chips (especially clients) with 64 or more cores (in about 5 years)
One plausible set of applications is data mining of Internet and local sensors
Developing a library of efficient data-mining algorithms
– Clustering (GIS, Cheminformatics) and Hidden Markov Methods (Speech Recognition)
Choose algorithms that can be parallelized well

Approach
Need 3 forms of parallelism
– MPI style
– Dynamic threads, as in pruned search
– Coarse-grain functional parallelism
Do not use an integrated language approach as in DARPA HPCS
Rather, use “mash-ups” or “workflow” to link together modules in optimized parallel libraries
Use Microsoft CCR/DSS, where DSS is a mash-up model built from CCR, and CCR supports MPI or dynamic threads

Microsoft CCR
Supports exchange of messages between threads using named ports
FromHandler: Spawn threads without reading ports
Receive: Each handler reads one item from a single port
MultipleItemReceive: Each handler reads a prescribed number of items of a given type from a given port. Note that items in a port can be general structures, but all must have the same type.
MultiplePortReceive: Each handler reads one item of a given type from multiple ports.
JoinedReceive: Each handler reads one item from each of two ports. The items can be of different types.
Choice: Execute a choice of two or more port-handler pairings
Interleave: Consists of a set of arbiters (port-handler pairs) of 3 types: Concurrent, Exclusive or Teardown (called at the end for clean-up). Concurrent arbiters are run concurrently but exclusive handlers are
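The port-and-handler primitives listed on this slide can be illustrated with a small Python analogue. This is only a sketch: CCR is a .NET library, and the `Port` class and the `receive`, `multiple_item_receive` and `joined_receive` helpers below are hypothetical names built on Python's standard `queue` and `threading` modules. They are not the CCR API.

```python
import queue
import threading


class Port:
    """Hypothetical stand-in for a named CCR port: a thread-safe FIFO of items."""

    def __init__(self, name):
        self.name = name
        self._items = queue.Queue()

    def post(self, item):
        self._items.put(item)

    def take(self):
        # Blocks until an item has been posted.
        return self._items.get()


def _spawn(action):
    worker = threading.Thread(target=action)
    worker.start()
    return worker


def receive(port, handler):
    """Receive analogue: the handler reads one item from a single port."""
    return _spawn(lambda: handler(port.take()))


def multiple_item_receive(port, count, handler):
    """MultipleItemReceive analogue: the handler gets `count` items from one port."""
    return _spawn(lambda: handler([port.take() for _ in range(count)]))


def joined_receive(port_a, port_b, handler):
    """JoinedReceive analogue: the handler gets one item from each of two ports."""
    return _spawn(lambda: handler(port_a.take(), port_b.take()))


def demo():
    greeting, numbers = Port("greeting"), Port("numbers")
    left, right = Port("left"), Port("right")
    results = {}

    def on_greeting(item):
        results["single"] = item

    def on_numbers(items):
        results["total"] = sum(items)

    def on_pair(a, b):
        results["pair"] = (a, b)

    # Register handlers first, then post; each handler runs on its own thread.
    workers = [
        receive(greeting, on_greeting),
        multiple_item_receive(numbers, 3, on_numbers),
        joined_receive(left, right, on_pair),
    ]
    greeting.post("hello")
    for n in (1, 2, 3):
        numbers.post(n)
    left.post(7)
    right.post("seven")
    for worker in workers:
        worker.join()
    return results
```

Calling `demo()` posts a few items and returns what each handler saw. Each handler here gets its own ports so the outcome does not depend on thread scheduling.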

Preliminary Results
Parallel Deterministic Annealing Clustering in C# with a speed-up of 7 on Intel systems with 2 quad-core CPUs
Analysis of performance of Java, C, C# in MPI and dynamic threading with XP, Vista, Windows Server, Fedora, Redhat on Intel/AMD systems
Study of cache effects coming with MPI thread-based parallelism
Study of execution time fluctuations in Windows (limiting speed-up to 7, not 8!)

Machines Used
AMD4: HP xw9300 workstation, 2 AMD Opteron CPUs Processor 275 at 2.19GHz, 4 cores; L2 Cache 4x1MB (summing both chips); Memory 4GB; XP Pro 64bit, Windows Server, Red Hat; C# Benchmark Computational unit: µs
Intel4: Dell Precision PWS670, 2 Intel Xeon Paxville CPUs at 2.80GHz, 4 cores; L2 Cache 4x2MB; Memory 4GB; XP Pro 64bit; C# Benchmark Computational unit: µs
Intel8a: Dell Precision PWS690, 2 Intel Xeon CPUs E5320 at 1.86GHz, 8 cores; L2 Cache 4x4MB; Memory 8GB; XP Pro 64bit; C# Benchmark Computational unit: µs
Intel8b: Dell Precision PWS690, 2 Intel Xeon CPUs E5355 at 2.66GHz, 8 cores; L2 Cache 4x4MB; Memory 4GB; Vista Ultimate 64bit, Fedora 7; C# Benchmark Computational unit: µs
Intel8c: Dell Precision PWS690, 2 Intel Xeon CPUs E5345 at 2.33GHz, 8 cores; L2 Cache 4x4MB; Memory 8GB; Red Hat 5.0, Fedora 7

Table: CCR overhead for a computation of µs between messaging, AMD4 (4 cores). Columns: number of parallel computations (overheads in µs). Rows: Spawned (Pipeline, Shift, Two Shifts); Rendezvous MPI (Pipeline, Shift, Exchange As Two Shifts, Exchange).

Table: CCR overhead for a computation of 29.5 µs between messaging, Intel4 (4 cores). Columns: number of parallel computations (overheads in µs). Rows: Spawned (Pipeline, Shift, Two Shifts); Rendezvous MPI (Pipeline, Shift, Exchange As Two Shifts, Exchange).

Table: CCR overhead for a computation of µs between messaging, Intel8b (8 cores). Columns: number of parallel computations (overheads in µs). Rows: Spawned (Pipeline, Shift, Two Shifts); Rendezvous MPI (Pipeline, Shift, Exchange As Two Shifts, Exchange).

MPI Exchange Latency in µs with 500,000 stages (20-30 µs computation between messaging)
Machine        OS       Runtime        Grains    Parallelism   MPI Exchange Latency
Intel8c:gf12   Redhat   MPJE           Process   8             181
                        MPICH2         Process   8             40.0
                        MPICH2: Fast   Process   8             39.3
                        Nemesis        Process   8             4.21
Intel8c:gf20   Fedora   MPJE           Process   8             157
                        mpiJava        Process   8             111
                        MPICH2         Process   8             64.2
Intel8b        Vista    MPJE           Process   8             170
               Fedora   MPJE           Process   8             142
               Fedora   mpiJava        Process   8             100
               Vista    CCR            Thread    8             20.2
AMD4           XP       MPJE           Process   4             185
               Redhat   MPJE           Process   4             152
               Redhat   mpiJava        Process   4             99.4
               Redhat   MPICH2         Process   4             39.3
               XP       CCR            Thread    4             16.3
Intel4         XP       CCR            Thread    4             25.8

Figure: Overhead (latency) of AMD4 PC with 4 execution threads on MPI-style rendezvous messaging for Shift and Exchange, implemented either as two shifts or as a custom CCR pattern. Axes: Stages (millions) against Time (microseconds).

Figure: Overhead (latency) of Intel8b PC with 8 execution threads on MPI-style rendezvous messaging for Shift and Exchange, implemented either as two shifts or as a custom CCR pattern. Axes: Stages (millions) against Time (microseconds).
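The Shift and Exchange patterns measured in these figures can be sketched with ordinary threads. The example below is a hypothetical Python analogue, not the CCR or MPI implementation used in the measurements: `run_ring`, the per-thread inbox queues and the barrier are all assumptions made for illustration. Each thread passes its value around a ring (a shift), and an exchange with both neighbours is built from two shifts in opposite directions.

```python
import queue
import threading


def run_ring(values):
    """Exchange implemented as two shifts on a ring of len(values) threads.

    Returns, for each rank, the pair (value from left neighbour,
    value from right neighbour).
    """
    n = len(values)
    inboxes = [queue.Queue() for _ in range(n)]
    barrier = threading.Barrier(n)
    received = [None] * n

    def shift(rank, value, step):
        # Send to the neighbour `step` places away, then read our own inbox.
        inboxes[(rank + step) % n].put(value)
        got = inboxes[rank].get()
        # All threads meet here, so messages from the next shift cannot
        # overtake messages from this one.
        barrier.wait()
        return got

    def worker(rank):
        from_left = shift(rank, values[rank], +1)   # shift to the right
        from_right = shift(rank, values[rank], -1)  # shift to the left
        received[rank] = (from_left, from_right)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

For example, `run_ring([10, 20, 30, 40])` gives rank 0 the values of ranks 3 and 1. A custom exchange pattern would avoid the second synchronization, which is the difference the figures compare.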

Figure: MPI Exchange Latency on AMD for MPICH, mpiJava and MPJE. Horizontal axis: Stages (millions).

Cache Line Interference
One thread on each core; A is an array of doubles (8 bytes each)
Thread i stores its sum in A(i), which is separation 1: no variable access interference, but cache line interference
Thread i stores its sum in A(X*i), which is separation X
Serious degradation if the separation is less than 64 bytes (8 words) under Vista or XP
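The arithmetic behind this slide can be checked with a few lines of Python. This sketch assumes a 64-byte cache line (the figure quoted on the slide) and an array that starts on a line boundary; the function names are hypothetical. It only computes which line each thread's slot falls in and does not measure any slowdown.

```python
CACHE_LINE_BYTES = 64  # assumed line size, matching the 64 bytes on the slide
DOUBLE_BYTES = 8       # A is an array of doubles


def cache_line_of(thread, separation):
    """Index of the cache line holding A(separation * thread).

    Assumes A(0) sits at the start of a cache line.
    """
    return (separation * thread * DOUBLE_BYTES) // CACHE_LINE_BYTES


def threads_share_a_line(num_threads, separation):
    """True if at least two threads write into the same cache line."""
    lines = [cache_line_of(t, separation) for t in range(num_threads)]
    return len(set(lines)) < num_threads
```

With 8 threads, separations 1 and 4 put several sums in one line, while separation 8 (64 bytes) gives every thread its own line.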

Deterministic Annealing
See K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 86, November 1998
Parallelization is similar to ordinary K-Means: we calculate global sums, which are decomposed into local averages and then summed over the components calculated on each processor
There are many similar data mining algorithms (such as annealing for E-M, expectation maximization) that have high parallel efficiency and avoid local minima
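The decomposition of global sums into per-processor pieces can be sketched for an ordinary K-Means step. This is a minimal illustration under stated assumptions: points are one-dimensional, the "processors" are just list chunks handled in a loop, and the function names are hypothetical rather than taken from the authors' code.

```python
def nearest(point, centers):
    """Index of the center closest to a 1-D point."""
    return min(range(len(centers)), key=lambda j: (point - centers[j]) ** 2)


def local_sums(points, centers):
    """Partial sums and counts per cluster for one processor's points."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in points:
        j = nearest(x, centers)
        sums[j] += x
        counts[j] += 1
    return sums, counts


def global_update(chunks, centers):
    """Combine the per-processor partial sums into new cluster centers."""
    # In a real parallel run each chunk would be handled by its own thread
    # or process; only the small (sums, counts) pairs need to be exchanged.
    partials = [local_sums(chunk, centers) for chunk in chunks]
    new_centers = []
    for j in range(len(centers)):
        total = sum(p[0][j] for p in partials)
        count = sum(p[1][j] for p in partials)
        new_centers.append(total / count if count else centers[j])
    return new_centers
```

The result does not depend on how the points are split into chunks, which is why the communication cost stays small relative to the per-point work.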

Clustering by Deterministic Annealing Use Physics Analogy for Clustering

Deterministically find cluster centers y_j using “mean field approximation” (could use slower Monte Carlo)

Annealing avoids local minima
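The equations on these slides did not survive in the transcript, so the sketch below follows the standard formulation in the Rose paper cited earlier rather than the authors' code. Each point x is softly assigned to center y_j with probability proportional to exp(-(x - y_j)^2 / T), each center becomes the probability-weighted mean of the points, and the temperature T is lowered step by step. The one-dimensional data, the function names and the `cooling` and `sweeps` parameters are all assumptions for illustration.

```python
import math


def soft_assign(x, centers, temperature):
    """Probabilities p(j|x) proportional to exp(-(x - y_j)^2 / T)."""
    d = [(x - y) ** 2 for y in centers]
    d_min = min(d)  # subtract the minimum for numerical stability
    w = [math.exp(-(di - d_min) / temperature) for di in d]
    total = sum(w)
    return [wi / total for wi in w]


def update_centers(points, centers, temperature):
    """One mean-field update: y_j = sum_x p(j|x) x / sum_x p(j|x)."""
    k = len(centers)
    num = [0.0] * k
    den = [0.0] * k
    for x in points:
        p = soft_assign(x, centers, temperature)
        for j in range(k):
            num[j] += p[j] * x
            den[j] += p[j]
    return [num[j] / den[j] if den[j] > 0 else centers[j] for j in range(k)]


def anneal(points, centers, t_start, t_end, cooling=0.9, sweeps=20):
    """Lower the temperature gradually, relaxing the centers at each step.

    The starting centers must be distinct for them to separate.
    """
    t = t_start
    while t > t_end:
        for _ in range(sweeps):
            centers = update_centers(points, centers, t)
        t *= cooling
    return centers
```

At high temperature every point contributes to every center; as T falls the assignments harden and the update approaches ordinary K-Means.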

Parallel Multicore Deterministic Annealing Clustering
Figure: Parallel overhead on 8 threads, Intel 8b, for 10 clusters and 20 clusters. Horizontal axis: 10000/(Grain Size n = points per core).
Speedup = 8/(1+Overhead)
Overhead = Constant1 + Constant2/n
Constant1 = 0.05 to 0.1 (Client Windows)
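The two formulas on this slide are easy to evaluate directly. In the sketch below the function names are hypothetical, the thread count is generalized from the slide's fixed value of 8, and any value passed for Constant2 is an assumption, since the slide does not quote one.

```python
def overhead(points_per_core, constant1, constant2):
    """Overhead = Constant1 + Constant2 / n, with n the grain size per core."""
    return constant1 + constant2 / points_per_core


def speedup(num_threads, overhead_value):
    """Speedup = threads / (1 + Overhead); the slide uses 8 threads."""
    return num_threads / (1.0 + overhead_value)
```

With the quoted range Constant1 = 0.05 to 0.1 and a grain size large enough that Constant2/n is negligible, the speedup on 8 threads lies between about 7.3 and 7.6.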

Parallel Multicore Deterministic Annealing Clustering
Figure: Parallel overhead (“Constant1”) for large (2M points) Indiana Census clustering on 8 threads, Intel 8b.
Increasing the number of clusters decreases communication/memory bandwidth overheads

Figure: Intel 8b, C# with 1 cluster, Vista: Scaled Run Time for Clustering Kernel. Axes: Number of Threads against Run Time (secs).
Run time for the same workload per thread, normalized by the number of data points
Expect run time to be independent of the number of threads if not for parallel and memory bandwidth overheads
Work per data point is proportional to the number of clusters

Figure: Intel 8b, C# with 80 clusters, Vista: Scaled Run Time for Clustering Kernel. Axes: Number of Threads against Run Time (secs).
Work per data point is proportional to the number of clusters, so memory bandwidth and parallel overheads decrease as the number of clusters increases

Figure: Intel 8c, C with 80 clusters, Redhat: Run Time Fluctuations for Clustering Kernel. Axes: Number of Threads against Standard Deviation/Run Time.
This is the average of the standard deviation of the run time of the 8 threads between messaging synchronization points

Figure: Intel 8c, C with 80 clusters, Redhat: Scaled Run Time for Clustering Kernel. Axes: Number of Threads against Run Time (secs).
Work per data point is proportional to the number of clusters, so memory bandwidth and parallel overheads decrease as the number of clusters increases

Figure: Intel 8b, C# with 1 cluster, Vista: Run Time Fluctuations for Clustering Kernel. Axes: Number of Threads against Standard Deviation/Run Time.
This is the average of the standard deviation of the run time of the 8 threads between messaging synchronization points

Figure: Intel 8b, C# with 80 clusters, Vista: Run Time Fluctuations for Clustering Kernel. Axes: Number of Threads against Standard Deviation/Run Time.
This is the average of the standard deviation of the run time of the 8 threads between messaging synchronization points
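The fluctuation measure plotted in these figures can be read as follows: for each interval between synchronization points, take the standard deviation of the threads' run times, average that over all intervals, and divide by the mean run time. The sketch below implements that reading. The slides do not say whether a population or sample standard deviation was used, so the population form here is an assumption, as are the function name and the input layout.

```python
import statistics


def relative_fluctuation(stage_times):
    """Average per-stage standard deviation divided by the mean run time.

    stage_times[s][t] is the run time of thread t in the compute phase
    that ends at synchronization point s.
    """
    stds = [statistics.pstdev(times) for times in stage_times]
    means = [statistics.mean(times) for times in stage_times]
    return statistics.mean(stds) / statistics.mean(means)
```

A value of 0.05, for instance, means the threads' run times typically differ by about 5% of a stage, which directly limits the achievable speed-up because every stage waits for the slowest thread.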

DSS Section
We view the system as a collection of services, in this case:
– One to supply data
– One to run parallel clustering
– One to visualize results, in this case by spawning a Google Maps browser
– Note we are clustering Indiana census data
DSS is convenient as it is built on CCR

DSS Service Measurements
Timing of HP Opteron multicore as a function of the number of simultaneous two-way service messages processed (November 2006 DSS Release)
CGL measurements of Axis 2 show about 500 microseconds; DSS is 10 times better

The clustering algorithm anneals by decreasing the distance scale and gradually finds more clusters as the resolution improves. Here we see the number of clusters increasing to 30 as the algorithm progresses.