Workshop on Parallelization of Coupled-Cluster Methods
Panel 1: Parallel efficiency
An incomplete list of thoughts
Bert de Jong, High Performance Software Development, Molecular Science Computing Facility

2 Overall hardware issues
Computer power per node has increased
- Growth in single-CPU speed has flattened out (but you never know!)
- Multiple cores together tax the other hardware resources in a node
Bandwidth and latency of the other major hardware resources lag far behind, limiting the flops we actually achieve
- Memory: very difficult to feed the CPU; multiple cores further reduce the bandwidth available to each core
- Network: data access is considerably slower than memory; the speed of light is our enemy
- Disk input/output: slowest of them all; disks only spin so fast
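To make the "feeding the CPU" point concrete, here is a minimal back-of-the-envelope sketch of the balance between compute and memory bandwidth; the peak flop rate, bandwidth, and core count below are illustrative placeholders, not measurements of any particular machine.

```python
# Rough machine-balance estimate: how many flops a kernel must perform per
# byte moved from memory before it stops being memory-bound.
# All numbers are made-up placeholders, not vendor specifications.

peak_gflops = 6.0          # assumed per-core peak, GFLOP/s
mem_bandwidth_gbs = 6.4    # assumed node memory bandwidth, GB/s
cores_sharing_bus = 4      # cores competing for the same memory bus

per_core_bw = mem_bandwidth_gbs / cores_sharing_bus
balance = peak_gflops / per_core_bw
print(f"Need ~{balance:.1f} flops per byte moved to stay compute-bound")

# Large matrix multiplies reuse each loaded element many times and can exceed
# this ratio; streaming kernels (e.g. dot products) do only a fraction of a
# flop per byte and remain bandwidth-bound no matter how fast the cores are.
```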

3 Dealing with memory
The amounts of data needed in coupled cluster can be huge
Amplitudes
- Too large to store on a single node (except for T1)
- Shared memory would help, but will shared memory of 100s of terabytes be feasible and accessible?
Integrals
- Recompute vs. store (on disk or in memory)
- Can we avoid going to memory at all when recomputing?
Coupled cluster has one advantage: it can easily be formulated as matrix multiplication
- This can be very efficient: DGEMM on EMSL's 1.5 GHz Itanium-2 system reached over 95% of peak
- As long as we can get all the needed data into memory!
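As an illustration of the matrix-multiplication formulation, the sketch below contracts a small ladder-like term by reshaping four-index tensors into matrices and calling the BLAS-backed matrix multiply in NumPy; the dimensions and tensor names are made up for the example.

```python
import numpy as np

# Toy dimensions: o = occupied orbitals, v = virtual orbitals (made up).
o, v = 8, 20
rng = np.random.default_rng(0)

t2 = rng.standard_normal((o, o, v, v))   # T2 amplitudes t[i,j,a,b]
w  = rng.standard_normal((v, v, v, v))   # integral block  w[a,b,c,d]

# Ladder-like contraction:
#   r[i,j,c,d] = sum_{a,b} t[i,j,a,b] * w[a,b,c,d]
# Reshaping both tensors into matrices turns the contraction into one DGEMM,
# which is where the >90%-of-peak efficiency comes from on well-fed hardware.
t_mat = t2.reshape(o * o, v * v)          # (ij) x (ab)
w_mat = w.reshape(v * v, v * v)           # (ab) x (cd)
r = (t_mat @ w_mat).reshape(o, o, v, v)   # (ij) x (cd) -> r[i,j,c,d]

# Cross-check against a direct einsum evaluation of the same term.
assert np.allclose(r, np.einsum('ijab,abcd->ijcd', t2, w))
```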

4 Dealing with networks
With 10s of terabytes of data on distributed-memory systems, getting data from remote nodes is inevitable
This need not be a problem, as long as you can hide the communication behind computation
- Fetch data while computing = one-sided communication
- NWChem uses Global Arrays to accomplish this
Issues
- Low bandwidth and high latency relative to increasing node speed
- Non-uniform networks: cabling a full fat tree can be cost prohibitive; network topology matters; fault resiliency of the network
- Multiple cores compete for a limited number of busses
- Data contention increases with increasing node count
Data locality, data locality, data locality
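Global Arrays itself is not shown here; as a stand-in, the sketch below uses MPI one-sided communication (RMA) via mpi4py to pull a remote block without the owning rank participating, which is the same "fetch data while the owner keeps computing" pattern. The buffer size and neighbour choice are arbitrary.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nproc = comm.Get_size()

# Each rank exposes a local block of "amplitudes" through an RMA window.
local_block = np.full(4096, float(rank))
win = MPI.Win.Create(local_block, comm=comm)

# One-sided get: pull a block from a neighbour without that rank posting a
# matching send, so it is free to keep computing while we fetch.
target = (rank + 1) % nproc
remote_block = np.empty_like(local_block)

win.Lock(target, MPI.LOCK_SHARED)
win.Get(remote_block, target)   # transfer is guaranteed complete after Unlock
win.Unlock(target)

# ... compute with remote_block here, ideally overlapping the next Get ...
win.Free()
```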

5 Dealing with spinning disks
Using local disk
- It will only hold data needed by its own node
- Can be fast enough if you put a large number of spindles behind it
- And, again, if you can hide the I/O behind computation (prefetch)
- With 100,000s of disks, the chance of failure becomes significant; fault tolerance of the computation becomes an issue
Using a globally shared disk
- Crucial when going to very large systems
- Allows large files to be shared by large numbers of nodes
- Lustre file systems of petabytes are possible
- Speed is limited by the number of access points (hosts): a large number of reads and writes must be handled by a small number of hosts, creating lock and access contention
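A minimal sketch of the prefetch idea, hiding disk reads behind computation with a background I/O thread; the file layout, block size, and process_block function are hypothetical placeholders.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

BLOCK = 1_000_000  # doubles per block; placeholder size

def read_block(path, index):
    """Read one block of doubles from a (hypothetical) integral file."""
    with open(path, 'rb') as f:
        f.seek(index * BLOCK * 8)
        return np.fromfile(f, dtype='f8', count=BLOCK)

def process_block(data):
    """Stand-in for the real computation on one block."""
    return data.sum()

def process_file(path, nblocks):
    total = 0.0
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_block, path, 0)   # prefetch the first block
        for i in range(nblocks):
            data = pending.result()                # waits only if I/O is slower
            if i + 1 < nblocks:
                pending = io.submit(read_block, path, i + 1)  # fetch next block
            total += process_block(data)           # compute overlaps the read
    return total
```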

6 What about beyond 1 petaflop?
Possibly 100,000s of multicore nodes
- How does one build a fat enough network between that many nodes?
Possibly 32, 64, 128 or more cores per node
- All cores simply cannot do the same thing anymore: not enough memory bandwidth, not enough network bandwidth
- Heterogeneous computing within a node (CPU+GPU)
- Designate nodes for certain tasks: communication; memory access (put and get); recomputing integrals, hopefully out of cache only; DGEMM operations
Task scheduling will become an issue
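One way to "designate nodes for certain tasks" is to split the global communicator into role-specific groups. The sketch below is a hypothetical layout (one data-server rank per group of four, the rest doing integrals and DGEMMs), not how any particular code actually partitions its ranks.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Hypothetical role assignment: within every group of 4 ranks, rank 0 acts as
# a data server (handles puts/gets and staging), the other 3 compute.
GROUP = 4
role = 'server' if rank % GROUP == 0 else 'worker'

# Split into per-role communicators so each role can coordinate internally.
role_comm = comm.Split(color=0 if role == 'server' else 1, key=rank)

if role == 'server':
    pass  # serve one-sided memory requests, stage data to/from disk, etc.
else:
    pass  # recompute integrals (ideally from cache) and run DGEMM tasks

print(f"global rank {rank}: {role}, local rank {role_comm.Get_rank()}")
```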

7 W. R. Wiley Environmental Molecular Sciences Laboratory
A national scientific user facility integrating experimental and computational resources for discovery and technological innovation