Computational models of the physical world. [Images: cortical bone, trabecular bone]

“The unreasonable effectiveness of mathematics”
As the “middleware” of scientific computing, linear algebra has supplied or enabled:
- Mathematical tools
- “Impedance match” to computer operations
- High-level primitives
- High-quality software libraries
- Ways to extract performance from computer architecture
- Interactive environments
Continuous physical modeling → linear algebra → computers

Top 500 List (November 2014). Top500 Benchmark: solve a large system of linear equations by Gaussian elimination (PA = LU).
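The core of that benchmark, LU factorization with partial pivoting followed by triangular solves, can be sketched in a few lines. This is a toy dense solver for illustration only, not the tuned HPL code the list actually runs:

```python
def solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting (PA = LU)."""
    n = len(A)
    A = [row[:] for row in A]   # work on copies
    b = b[:]
    for k in range(n):
        # partial pivoting: swap in the row with the largest |pivot|
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        # eliminate column k below the diagonal
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    # back substitution on the upper-triangular system
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x
```

Real Top500 runs do exactly this on matrices with millions of rows, which is why the list rewards dense floating-point throughput.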

Social network analysis (1993): co-author graph from the 1993 Householder symposium.

Social network analysis (2015): the Facebook graph has more than 1,000,000,000 vertices.

Social network analysis: Betweenness Centrality (BC). C_B(v): among all the shortest paths, what fraction of them pass through the node of interest? Computed with Brandes' algorithm. [Diagram: a typical software stack for an application enabled with the Combinatorial BLAS.]
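Brandes' algorithm computes exact BC with one BFS per source plus a reverse-order dependency accumulation. A minimal sequential sketch for unweighted, undirected graphs (the Combinatorial BLAS versions express the same computation on sparse matrices):

```python
from collections import deque

def brandes_bc(adj):
    """Betweenness centrality via Brandes' algorithm (undirected, unweighted)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths and recording predecessors
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        preds = {v: [] for v in adj}
        order = []
        q = deque([s])
        while q:
            v = q.popleft(); order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # accumulate dependencies in reverse BFS order
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # each undirected pair of endpoints was counted twice
    return {v: c / 2 for v, c in bc.items()}
```

On the Facebook-scale graphs above, the per-source loop is what gets parallelized.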

An analogy? Continuous physical modeling → linear algebra → computers. Discrete structure analysis → graph theory → computers.

Graph 500 List (November 2014). Graph500 Benchmark: breadth-first search in a large power-law graph.
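Graph500 reports traversed edges per second (TEPS) on breadth-first search. The level-synchronous formulation below is the structure parallel implementations distribute; this toy version is sequential:

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS: expand one whole frontier per step,
    returning the distance (level) of every reached vertex."""
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        for v in frontier:        # parallel codes split the frontier across processors
            for w in adj[v]:
                if w not in dist: # first visit: record level, extend next frontier
                    dist[w] = level
                    next_frontier.append(w)
        frontier = next_frontier
    return dist
```

On a power-law graph the frontier balloons after a few levels, so the benchmark stresses memory and network latency rather than floating-point units.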

Floating-Point vs. Graphs, November 2014: 34 Petaflops (Top500: PA = LU) vs. 24 Terateps (Graph500: BFS). 34 Peta / 24 Tera is about 1,400.

Floating-Point vs. Graphs, over time (34 Petaflops vs. 24 Terateps):
Nov 2014: 34 Peta / 24 Tera ≈ 1,400
Nov 2010: 2.5 Peta / 6.6 Giga ≈ 380,000

Parallel Computers Today. Oak Ridge / Cray Titan: 17.6 PFLOPS. Nvidia GK110 GPU: 1.7 TFLOPS. 61-core Intel Xeon Phi: 1.0 TFLOPS. (TFLOPS = 10^12 floating point ops/sec; PFLOPS = 10^15 floating point ops/sec.)

Supercomputers 1976: Cray-1, 133 MFLOPS (10^6).

Technology Trends: Microprocessor Capacity. Moore's Law: the number of transistors per chip doubles roughly every 18 months, so microprocessors keep getting smaller, denser, and more powerful. Gordon Moore (Intel co-founder) made this prediction about semiconductor transistor density in 1965.
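The compounding is easy to underestimate; a one-line calculation shows what a doubling every 18 months implies (illustrative arithmetic only; `moores_law_factor` is a name made up for this sketch):

```python
def moores_law_factor(years, doubling_months=18):
    """Projected transistor-count growth factor under Moore's Law."""
    return 2 ** (years * 12 / doubling_months)

# e.g. one decade at one doubling every 18 months is roughly a 100x increase
decade_factor = moores_law_factor(10)
```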

Trends in processor clock speed. Triton's clock speed is still only 2600 MHz in 2015!

4-core Intel Sandy Bridge, 2600 MHz clock speed (Triton uses an 8-core version).

Generic Parallel Machine Architecture. Key architecture question: where and how fast are the interconnects? Key algorithm question: where is the data? [Diagram: several processors, each with its own storage hierarchy (Proc → Cache → L2 Cache → L3 Cache → Memory), with potential interconnects between them.]

Triton memory hierarchy I (chip level). [Diagram: eight processor cores, each with private L1 and L2 caches, sharing one 8 MB L3 cache.] (AMD Opteron 8-core Magny-Cours, similar to Triton's Intel Sandy Bridge.) The chip sits in a socket, connected to the rest of the node...

Triton memory hierarchy II (node level). [Diagram: four chips per node, each with its own 8 MB L3 cache and cores with private L1/L2 caches, all sharing 64 GB of node memory.]

Triton memory hierarchy III (system level). 324 nodes (64 GB each); message-passing communication, no shared memory between nodes.

Some models of parallel computation, with example languages:
- Shared memory: Cilk, OpenMP, Pthreads, ...
- SPMD / message passing: MPI
- SIMD / data parallel: CUDA, Matlab, OpenCL, ...
- PGAS / partitioned global address space: UPC, CAF, Titanium
- Loosely coupled: Map/Reduce, Hadoop, ...
- Hybrids: ???
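To make the SPMD / message-passing model concrete, here is a toy sketch with Python threads standing in for MPI ranks and queues standing in for MPI_Send/MPI_Recv. The names `spmd_sum` and `rank` are made up for this illustration; real message-passing codes here would use C with MPI:

```python
import threading
from queue import Queue

def spmd_sum(data, nprocs=4):
    """Toy SPMD reduction: every 'rank' runs the same program on its own
    slice of the data, then sends its partial sum to rank 0."""
    inbox = [Queue() for _ in range(nprocs)]   # one mailbox per rank

    def rank(r):
        local = sum(data[r::nprocs])           # cyclic share of the data
        if r == 0:
            # rank 0 receives the other partial sums and combines them
            total = local + sum(inbox[0].get() for _ in range(nprocs - 1))
            inbox[0].put(total)                # publish the final answer
        else:
            inbox[0].put(local)                # "send" partial sum to rank 0

    threads = [threading.Thread(target=rank, args=(r,)) for r in range(nprocs)]
    for t in threads: t.start()
    for t in threads: t.join()
    return inbox[0].get()
```

Note the defining SPMD property: one program, with behavior branching on the rank, and no shared mutable state other than explicit messages.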

Parallel programming languages. Many have been invented; there is *much* less consensus on the best languages than in the sequential world. We could have a whole course on them; we'll look at just a few. Languages you'll use in homework: C with MPI (very widely used, very old-fashioned) and Cilk Plus (a newer upstart). You will choose a language for the final project.
