Multicore for Science: Multicore Panel at eScience 2008, December 11, 2008. Geoffrey Fox, Community Grids Laboratory, School of Informatics, Indiana University.


Lessons
- Not surprisingly, scientific programs will run very well on multicore systems
- We need to exploit commodity software environments, so it is not clear that MPI is the best choice
- Multicore best practice and large-scale distributed processing, not scientific computing, will drive the software environment, although MPI will get good performance
- On a node we can replace MPI by threading, which has several advantages:
  - Avoids explicit communication (MPI SEND/RECV) within the node
  - Allows a very dynamic implementation, with the number of threads changing over time
  - Asynchronous algorithms
- Between nodes, we need to combine the best of MPI and Hadoop/Dryad

Threading (CCR) Performance: 8-24 core servers
- Clustering of medical informatics data (4000 records); scaling with core count for fixed problem size
- Dell 24-core server: PowerEdge R900, 4 sockets x E7450 six-core Xeon, 2.4 GHz, 12 MB cache, 1066 MHz FSB, 48 GB memory
- An Intel core is about 25% faster than a Barcelona AMD core
- Parallel Overhead f = PT(P)/T(1) - 1 = (1/efficiency) - 1 on P processors (see the worked example below)
[Figure: parallel overhead versus number of cores]
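As a worked illustration of this overhead definition, with hypothetical timings rather than measurements from these runs: if T(1) = 100 s and T(24) = 5 s on P = 24 cores, then P*T(P)/T(1) = (24 x 5)/100 = 1.2, so the parallel overhead is f = 0.2; equivalently, efficiency = T(1)/(P*T(P)) = 100/120 = 5/6, and f = (1/efficiency) - 1 = 6/5 - 1 = 0.2.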

Parallel Overhead f = PT(P)/T(1) - 1 = (1/efficiency) - 1 on P processors
- Curiously, performance for a fixed number of cores (2 cores, Patient2000 problem) ranks as:
  - Dell 4-core laptop: 21 minutes
  - then Dell 24-core server: 27 minutes
  - then my current 2-core laptop: 28 minutes
  - finally Dell 8/16-core AMD: 34 minutes
- 4-core laptop: Precision M6400, Intel Core 2 Extreme Edition QX, 1067 MHz, 12 MB L2, run on battery
[Figure: fixed problem size speed-up versus number of cores on laptops (speed-up reaching 4.08) and scaled speed-up for MPI.Net on a cluster of eight 16-core AMD systems]

Data Driven Architecture
- Typically one uses "data parallelism" to break data into parts and process the parts in parallel, so that each of the Compute/Map phases runs in (data) parallel mode (see the sketch after this slide)
- Different stages in the pipeline correspond to different functions: "filter1", "filter2", ..., "visualize"
- Mix of functional and parallel components linked by messages
[Figure: pipeline of Disk/Database -> Compute (Map #1) -> Disk/Database or Memory/Streams -> Compute (Reduce #1) -> Disk/Database or Memory/Streams -> Compute (Map #2) -> Compute (Reduce #2) -> etc.; filters (Filter 1, Filter 2) are typically linked by workflow, which may be distributed or centralized, while MPI or shared memory is used within a filter]
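A minimal C# sketch of the data-parallel Map plus Reduce pattern in this pipeline, under stated assumptions: the data, sizes, and stage contents are hypothetical, in-memory arrays stand in for the Disk/Database and Memory/Streams links, and the Task Parallel Library's Parallel.For stands in for the parallel runtime rather than CCR or MPI.

    using System;
    using System.Linq;
    using System.Threading.Tasks;

    class PipelineSketch
    {
        static void Main()
        {
            // Hypothetical input: each inner array stands for one data partition
            // read from Disk/Database.
            double[][] partitions = Enumerable.Range(0, 8)
                .Select(i => Enumerable.Range(0, 1000).Select(j => (double)(i + j)).ToArray())
                .ToArray();

            // Compute (Map #1): data parallel over partitions ("filter1").
            double[] mapped = new double[partitions.Length];
            Parallel.For(0, partitions.Length, i =>
            {
                mapped[i] = partitions[i].Sum();   // each partition processed independently
            });

            // Compute (Reduce #1): combine the per-partition results ("filter2").
            double total = mapped.Sum();

            // A later stage ("visualize", Map #2, ...) would read 'total' from Memory/Streams.
            Console.WriteLine("Reduced value: " + total);
        }
    }

Each Map works only on its own partition, which is the data parallelism referred to above; the Reduce then combines the per-partition results before the next functional stage.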

Programming Model Implications I
- The distributed world is being revolutionized by new environments (Hadoop, Dryad) supporting explicitly decomposed data-parallel applications
- There can be high-level languages, but they "just" pick parallel modules from a library; this is the most realistic near-term approach to parallel computing environments
- Party-line parallel programming model: workflow (parallel and distributed) controlling optimized library calls
- Mashups, Hadoop and Dryad, and their relatives are likely to replace current workflow systems (BPEL, ...)
- Note there is no mention of automatic compilation; recent progress has all been in explicit parallelism

Programming Model Implications II
- Generalize the owner-computes rule (if data is stored in the memory of CPU-i, then CPU-i processes it) to the disk-memory-maps rule: CPU-i "moves" to Disk-i and uses CPU-i's memory to load the disk's data and filter/map/compute it (see the sketch after this slide)
- This embodies data-driven computation and moving the computing to the data
- MPI has wonderful features, but it will be ignored in the real world unless it is simplified
- CCR from Microsoft, with only about 7 primitives, is one possible commodity multicore messaging environment; it is roughly active messages
- Both threading (CCR) and process-based MPI can give good (and similar) performance on multicore systems
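A minimal sketch of the disk-memory-maps rule, assuming hypothetical partition files named part-0.dat, part-1.dat, ...: each logical worker i loads only its own partition into its own memory and maps/filters it locally, so the computing moves to the data rather than the data to the computing.

    using System;
    using System.IO;
    using System.Linq;
    using System.Threading.Tasks;

    class DiskMemoryMaps
    {
        // "CPU-i moves to Disk-i": worker i loads only its own partition file into
        // its own memory and filters/maps/computes on it locally.
        static double MapPartition(int i)
        {
            string[] lines = File.ReadAllLines("part-" + i + ".dat");  // hypothetical file name
            return lines.Sum(line => double.Parse(line));              // dummy local computation
        }

        static void Main()
        {
            const int partitions = 8;                  // hypothetical number of disks/partitions
            double[] partial = new double[partitions];

            // One logical worker per partition; no data moves between workers in the map phase.
            Parallel.For(0, partitions, i => partial[i] = MapPartition(i));

            Console.WriteLine("Global result: " + partial.Sum());
        }
    }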

Programming Model Implications III
- MapReduce-style primitives are really easy in MPI
- Map is the trivial owner-computes rule
- Reduce is "just":
      globalsum = MPI_communicator.Allreduce(partialsum, Operation.Add);
  with partialsum a sum calculated in parallel in a CCR thread or MPI process
- Threading doesn't have obvious reduction primitives? Here is a sequential version:
      globalsum = 0.0;   // globalsum is often an array
      for (int ThreadNo = 0; ThreadNo < Count; ThreadNo++)
      {
          globalsum += partialsum[ThreadNo];
      }
- One could exploit parallelism over the indices of globalsum (see the sketch after this slide)
- There is a huge amount of work on MPI reduction algorithms; can this be retargeted to MapReduce and threading?
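A minimal sketch of the suggested parallel reduction over the indices of an array-valued globalsum, with hypothetical thread count, array length, and dummy partial sums: each worker reduces a disjoint slice of indices across all threads' partial sums, so no two workers ever write the same element and no locking is needed.

    using System;
    using System.Threading;

    class ThreadedReduction
    {
        static void Main()
        {
            const int nThreads = 4;          // hypothetical number of worker threads
            const int length = 1000;         // length of the array-valued globalsum

            // partialsum[t][k] is thread t's contribution to component k (filled elsewhere;
            // dummy values here).
            double[][] partialsum = new double[nThreads][];
            for (int t = 0; t < nThreads; t++)
            {
                partialsum[t] = new double[length];
                for (int k = 0; k < length; k++) partialsum[t][k] = t + k;
            }

            double[] globalsum = new double[length];
            Thread[] workers = new Thread[nThreads];

            // Parallelize over index ranges of globalsum: worker w reduces the
            // disjoint slice [lo, hi) across all partial sums.
            for (int w = 0; w < nThreads; w++)
            {
                int lo = w * length / nThreads;
                int hi = (w + 1) * length / nThreads;
                workers[w] = new Thread(() =>
                {
                    for (int k = lo; k < hi; k++)
                        for (int t = 0; t < nThreads; t++)
                            globalsum[k] += partialsum[t][k];
                });
                workers[w].Start();
            }
            foreach (Thread worker in workers) worker.Join();

            Console.WriteLine("globalsum[0] = " + globalsum[0]);
        }
    }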

Programming Model Implications IV
- MPI's complications come from Send or Recv, not Reduce
- Here the thread model is much easier, since a "Send" within a node is just a memory access under shared memory
- The PGAS model could address this, but it is not likely to be practical in the near future; one could link PGAS nicely with systems like Dryad/Hadoop
- Threads do not force parallelism, so one can get accidental Amdahl bottlenecks
- Threads can be inefficient due to cache-line interference: different threads must not write to the same cache line
- Avoid this with artificial constructs like (see the padding sketch after this slide):
      partialsumC[ThreadNo] = new double[maxNcent + cachelinesize];
- Windows produces runtime fluctuations that give up to 5-10% synchronization overheads
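A minimal sketch of that padding construct, with hypothetical names and sizes: cachelinesize is measured in doubles (8 doubles = 64 bytes on typical hardware), so consecutive per-thread accumulator arrays do not end up sharing a cache line even if they are allocated back to back, and each thread only ever writes lines it owns.

    using System;
    using System.Threading;

    class CacheLinePadding
    {
        static void Main()
        {
            const int nThreads = 4;        // hypothetical thread count
            const int maxNcent = 100;      // hypothetical number of accumulated components
            const int cachelinesize = 8;   // padding in doubles (8 x 8 bytes = 64-byte line)

            // Separately allocated, padded per-thread accumulators to avoid false sharing.
            double[][] partialsumC = new double[nThreads][];
            Thread[] workers = new Thread[nThreads];

            for (int t = 0; t < nThreads; t++)
            {
                int me = t;  // capture the loop variable for the closure
                partialsumC[me] = new double[maxNcent + cachelinesize];
                workers[me] = new Thread(() =>
                {
                    // Each thread repeatedly updates only its own padded accumulator.
                    for (int iter = 0; iter < 100000; iter++)
                        for (int c = 0; c < maxNcent; c++)
                            partialsumC[me][c] += 1.0;
                });
                workers[me].Start();
            }
            foreach (Thread worker in workers) worker.Join();

            Console.WriteLine("partialsumC[0][0] = " + partialsumC[0][0]);
        }
    }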

Components of a Scientific Computing Environment
- My laptop, using a dynamic number of cores for runs
  - The threading (CCR) parallel model allows such dynamic switches if the OS tells the application how many cores it may use; we use short-lived, NOT long-running, threads (see the sketch after this slide)
  - This is very hard with MPI, as one would have to redistribute the data
- The cloud, for dynamic service instantiation, including the ability to launch:
  - MPI engines for large closely coupled computations
  - Petaflops for million-particle clustering/dimension reduction?
- Many parallel applications will run OK as large jobs with "millisecond" latencies (as in Granules) rather than "microsecond" latencies (as in MPI, CCR)
- Workflow/Hadoop/Dryad will link these components together "seamlessly"
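A minimal sketch of the short-lived-threads idea using plain .NET threads rather than the CCR API, with hypothetical data and a pretend source of the allowed core count: each parallel step spawns only as many workers as are currently available and joins them before returning, so the degree of parallelism can change between steps without redistributing any data.

    using System;
    using System.Threading;

    class DynamicCores
    {
        // One parallel step: spawn short-lived threads sized to the cores available right now.
        static void ParallelStep(double[] data, int coresAvailable)
        {
            Thread[] workers = new Thread[coresAvailable];
            for (int w = 0; w < coresAvailable; w++)
            {
                int lo = w * data.Length / coresAvailable;
                int hi = (w + 1) * data.Length / coresAvailable;
                workers[w] = new Thread(() =>
                {
                    for (int i = lo; i < hi; i++) data[i] = Math.Sqrt(data[i]);  // dummy work
                });
                workers[w].Start();
            }
            foreach (Thread worker in workers) worker.Join();  // all threads end with the step
        }

        static void Main()
        {
            double[] data = new double[1 << 20];
            for (int i = 0; i < data.Length; i++) data[i] = i;

            // In a real system the OS or runtime would tell the application how many cores
            // it may use; here we just vary the processor count per step as a stand-in.
            for (int step = 0; step < 3; step++)
            {
                int cores = Math.Max(1, Environment.ProcessorCount - step);
                ParallelStep(data, cores);
            }
            Console.WriteLine("done");
        }
    }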