Trends in the Infrastructure of Computing: Processing, Storage, Bandwidth CSCE 190: Computing in the Modern World Dr. Jason D. Bakos.

Slides:



Advertisements
Similar presentations
Lecture 6: Multicore Systems
Advertisements

Lecture 0: Introduction
Field Effect Transistors and their applications. There are Junction FETs (JFET) and Insulated gate FETs (IGFET) There are many types of IGFET. Most common.
Heterogeneous Computing at USC Dept. of Computer Science and Engineering University of South Carolina Dr. Jason D. Bakos Assistant Professor Heterogeneous.
Heterogeneous Computing: New Directions for Efficient and Scalable High-Performance Computing Dr. Jason D. Bakos.
Heterogeneous Computing: New Directions for Efficient and Scalable High-Performance Computing CSCE 791 Dr. Jason D. Bakos.
Heterogeneous Computing at USC Dept. of Computer Science and Engineering University of South Carolina Dr. Jason D. Bakos Assistant Professor Heterogeneous.
GPUs. An enlarging peak performance advantage: –Calculation: 1 TFLOPS vs. 100 GFLOPS –Memory Bandwidth: GB/s vs GB/s –GPU in every PC and.
S. RossEECS 40 Spring 2003 Lecture 13 SEMICONDUCTORS: CHEMICAL STRUCTURE Start with a silicon substrate. Silicon has 4 valence electrons, and therefore.
Some Thoughts on Technology and Strategies for Petaflops.
CSCE 612: VLSI System Design Instructor: Jason D. Bakos.
Introduction to CMOS VLSI Design Lecture 0: Introduction
CSCE 190: Computing in the Modern World Dr. Jason D. Bakos
Trends in the Infrastructure of Computing: Processing, Storage, Bandwidth CSCE 190: Computing in the Modern World Dr. Jason D. Bakos.
CSCE 212 Introduction to Computer Architecture
CSCE 613: Fundamentals of VLSI Chip Design Instructor: Jason D. Bakos.
Seven Minute Madness: Reconfigurable Computing Dr. Jason D. Bakos.
Heterogeneous Computing Dr. Jason D. Bakos. Heterogeneous Computing 2 “Traditional” Parallel/Multi-Processing Large-scale parallel platforms: –Individual.
Integrated Circuit Design and Fabrication Dr. Jason D. Bakos.
Lecture 0: Introduction. CMOS VLSI Design 4th Ed. 0: Introduction2 Introduction  Integrated circuits: many transistors on one chip.  Very Large Scale.
3.1Introduction to CPU Central processing unit etched on silicon chip called microprocessor Contain tens of millions of tiny transistors Key components:
Introduction Integrated circuits: many transistors on one chip.
Lecture 2. Logic Gates Prof. Taeweon Suh Computer Science Education Korea University ECM585 Special Topics in Computer Design.
CH12 CPU Structure and Function
The Devices: Diode.
Z. Feng VLSI Design 1.1 VLSI Design MOSFET Zhuo Feng.
GPU Programming with CUDA – Accelerated Architectures Mike Griffiths
Dept. of Communications and Tokyo Institute of Technology
Cell Architecture. Introduction The Cell concept was originally thought up by Sony Computer Entertainment inc. of Japan, for the PlayStation 3 The architecture.
Semiconductor Memory 1970 Fairchild Size of a single core –i.e. 1 bit of magnetic core storage Holds 256 bits Non-destructive read Much faster than core.
Introduction To Semiconductors
Heterogeneous Computing: New Directions for Efficient and Scalable High-Performance Computing CSCE 791 Dr. Jason D. Bakos.
Trends in the Infrastructure of Computing: Processing, Storage, Bandwidth CSCE 190: Computing in the Modern World Dr. Jason D. Bakos.
1 Integrated Circuits Basics Titov Alexander 25 October 2014.
Figure 9.1. Use of silicon oxide as a masking layer during diffusion of dopants.
Lecture 2. Logic Gates Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System Education & Research.
Electronics 1 Lecture 2 Ahsan Khawaja Lecturer Room 102 Department of Electrical Engineering.
Lecture 0: Introduction. CMOS VLSI Design 4th Ed. 0: Introduction2 Introduction  Integrated circuits: many transistors on one chip.  Very Large Scale.
Seven Minute Madness: Heterogeneous Computing Dr. Jason D. Bakos.
Abstraction And Technology 1 Comp 411 – Fall /28/06 Computer Abstractions and Technology 1. Layer Cakes 2. Computers are translators 3. Switches.
ELECTRONIC PROPERTIES OF MATTER - Semi-conductors and the p-n junction -
Conductors – many electrons free to move
Trends in the Infrastructure of Computing
CMOS VLSI Design Introduction
Penn ESE370 Fall DeHon 1 ESE370: Circuit-Level Modeling, Design, and Optimization for Digital Systems Day 9: September 26, 2011 MOS Model.
Introduction to CMOS Transistor and Transistor Fundamental
My Coordinates Office EM G.27 contact time:
2007/11/20 Paul C.-P. Chao Optoelectronic System and Control Lab., EE, NCTU P1 Copyright 2015 by Paul Chao, NCTU VLSI Lecture 0: Introduction Paul C.–P.
Seven Minute Madness: Heterogeneous Computing Dr. Jason D. Bakos.
course Name: Semiconductors
COURSE NAME: SEMICONDUCTORS Course Code: PHYS 473.
Introduction to CMOS VLSI Design Lecture 0: Introduction.
1. Introduction. Diseño de Circuitos Digitales para Comunicaciones Introduction Integrated circuits: many transistors on one chip. Very Large Scale Integration.
SPRING 2012 Assembly Language. Definition 2 A microprocessor is a silicon chip which forms the core of a microcomputer the concept of what goes into a.
William Stallings Computer Organization and Architecture 6th Edition
LTFY – Physics and Engineering
Chapter 4 Processor Technology and Architecture
CIT 668: System Architecture
Computer Architecture and Organization
CSCE 212 Introduction to Computer Architecture
CSCE 190: Computing in the Modern World Dr. Jason D. Bakos
BIC 10503: COMPUTER ARCHITECTURE
3.1 Introduction to CPU Central processing unit etched on silicon chip called microprocessor Contain tens of millions of tiny transistors Key components:
Trends in the Infrastructure of Computing
Multicore and GPU Programming
Multicore and GPU Programming
Presentation transcript:

Trends in the Infrastructure of Computing: Processing, Storage, Bandwidth CSCE 190: Computing in the Modern World Dr. Jason D. Bakos

CSCE 190: Computing in the Modern World 2 Elements

CSCE 190: Computing in the Modern World 3 Semiconductors Silicon is a group IV element (4 valence electrons, shells: 2, 8, 18, 32…) –Forms covalent bonds with four neighbor atoms (3D cubic crystal lattice) –Si is a poor conductor, but conduction characteristics may be altered –Add impurities/dopants (replaces silicon atom in lattice): Makes a better conductor Group V element (phosphorus/arsenic) => 5 valence electrons –Leaves an electron free => n-type semiconductor (electrons, negative carriers) Group III element (boron) => 3 valence electrons –Borrows an electron from neighbor => p-type semiconductor (holes, positive carriers) forward bias reverse bias P-N junction Spacing=543 pm

CSCE 190: Computing in the Modern World 4 MOSFETs body/bulk GROUND NMOS/NFETPMOS/PFET channel shorter length, faster transistor (dist. for electrons) body/bulk HIGH positive voltage (Vdd) negative voltage (rel. to body) (GND) (S/D to body is reverse-biased) current Metal-poly-Oxide-Semiconductor structures built onto substrate –Diffusion: Inject dopants into substrate –Oxidation: Form layer of SiO2 (glass) –Deposition and etching: Add aluminum/copper wires

CSCE 190: Computing in the Modern World 5 Layout 3-input NAND

CSCE 190: Computing in the Modern World 6 Logic Gates invNAND2 NAND3 NOR2

CSCE 190: Computing in the Modern World 7 Logic Synthesis Behavior: –S = A + B –Assume A is 2 bits, B is 2 bits, C is 3 bits ABC 00 (0) 000 (0) 00 (0)01 (1)001 (1) 00 (0)10 (2)010 (2) 00 (0)11 (3)011 (3) 01 (1)00 (0)001 (1) 01 (1) 010 (2) 01 (1)10 (2)011 (3) 01 (1)11 (3)100 (4) 10 (2)00 (0)010 (2) 10 (2)01 (1)011 (3) 10 (2) 100 (4) 10 (2)11 (3)101 (5) 11 (3)00 (0)011 (3) 11 (3)01 (1)100 (4) 11 (3)10 (2)101 (5) 11 (3) 110 (6)

CSCE 190: Computing in the Modern World 8 Microarchitecture

CSCE 190: Computing in the Modern World 9 Synthesized and P&R’ed MIPS Architecture

CSCE 190: Computing in the Modern World 10 Feature Size Shrink minimum feature size… –Smaller L decreases carrier time and increases current –Therefore, W may also be reduced for fixed current –C g, C s, and C d are reduced –Transistor switches faster (~linear relationship)

Minimum Feature Size YearProcessorSpeedTransistor SizeTransistors 1982i MHz 1.5 m ~134, i38616 – 40 MHz 1 m ~270, i MHz.8 m ~1 million 1993Pentium MHz.6 m ~3 million 1995Pentium Pro MHz.5 m ~4 million 1997Pentium II MHz.35 m ~5 million 1999Pentium III450 – 1400 MHz.25 m ~10 million 2000Pentium 41.3 – 3.8 GHz.18 m ~50 million 2005Pentium D2 threads/package.09 m ~200 million 2006Core 22 threads/die.065 m ~300 million 2008Core i7 “Nehalem”8 threads/die.045 m ~800 million 2011“Sandy Bridge”16 threads/die.032 m ~1.2 billion YearProcessorSpeedTransistor SizeTransistors 2008NVIDIA Tesla240 threads/die.065 m 1.4 billion 2010NVIDIA Fermi512 threads/die.040 m 3 billion CSCE 190: Computing in the Modern World 11

CPU Design ALU Control L1 cache L2 cache Control L1 cache L2 cache L3 cache CPU Core 1Core 2 FOR i = 1 to 1000 C[i] = A[i] + B[i] Copy part of A onto CPU Copy part of B onto CPU ADD Copy part of C into Mem time Copy part of A onto CPU Copy part of B onto CPU ADD Copy part of C into Mem … CSCE 190: Computing in the Modern World 12

Co-Processor Design ALU cache GPU Core 1 ALU cache ALU cache ALU cache ALU cache ALU cache ALU cache ALU cache ALU Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Core 8 FOR i = 1 to 1000 C[i] = A[i] + B[i] CSCE 190: Computing in the Modern World 13

Heterogeneous Computing CPU X58 Host Memory Co- processor QPIPCIe On board Memory add-in cardhost General purpose CPU + special-purpose processors in one system ~25 GB/s ~8 GB/s (x16) ????? ~150 GB/s for GeForce 260 System I/O controller CSCE 190: Computing in the Modern World 14

Heterogeneous Computing initialization 0.5% of run time “hot” loop 99% of run time clean up 0.5% of run time 49% of code 2% of code co-processor Kernel speedup Application speedup Execution time hours hours hours hours hours Combine CPUs and coprocs Example: –Application requires a week of CPU time –Offload computation consumes 99% of execution time CSCE 190: Computing in the Modern World 15

NVIDIA GPU Architecture Hundreds of simple processor cores Core design: –Only one instruction from each thread active in the pipeline at a time –In-order execution –No branch prediction –No speculative execution –No support for context switches –No system support (syscalls, etc.) –Small, simple caches –Programmer-managed on-chip scratch memory CSCE 190: Computing in the Modern World 16

Programming GPUs HOST (CPU) code: dim3 grid,block; grid.x=((VECTOR_SIZE/512) + (VECTOR_SIZE%512?1:0)); block.x=512; cudaMemcpy(a_device,a,VECTOR_SIZE * sizeof(double),cudaMemcpyHostToDevice); cudaMemcpy(b_device,b,VECTOR_SIZE * sizeof(double),cudaMemcpyHostToDevice); vector_add_device >>(a_device,b_device,c_device); cudaMemcpy(c_gpu,c_device,VECTOR_SIZE * sizeof(double),cudaMemcpyDeviceToHost); GPU code: __global__ void vector_add_device (double *a,double *b,double *c) { __shared__ double a_s,b_s,c_s; a_s=a[blockIdx.x*blockDim.x+threadIdx.x]; b_s=b[blockIdx.x*blockDim.x+threadIdx.x]; c_s=a_s+b_s; c[blockIdx.x*blockDim.x+threadIdx.x]=c_s; } CSCE 190: Computing in the Modern World 17

IBM Cell/B.E. Architecture 1 PPE, 8 SPEs Programmer must manually manage 256K memory and threads invocation on each SPE Each SPE includes a vector unit like the one on current Intel processors –128 bits wide CSCE 190: Computing in the Modern World 18

High-Performance Reconfigurable Computing Heterogeneous computing with reconfigurable logic, i.e. FPGAs CSCE 190: Computing in the Modern World 19

Field Programmable Gate Arrays CSCE 190: Computing in the Modern World 20

Heterogeneous Computing with FPGAs CSCE 190: Computing in the Modern World 21

Programming FPGAs CSCE 190: Computing in the Modern World 22

Heterogeneous Computing with FPGAs Annapolis Micro Systems WILDSTAR 2 PRO GiDEL PROCSTAR III CSCE 190: Computing in the Modern World 23

Heterogeneous Computing with FPGAs Convey HC-1 CSCE 190: Computing in the Modern World 24

Heterogeneous Computing with GPUs NVIDIA Tesla S1070 CSCE 190: Computing in the Modern World 25

Heterogeneous Computing now Mainstream: IBM Roadrunner Los Alamos, fastest computer in the world in 2008 (still in Top 10) 6,480 AMD Opteron (dual core) CPUs 12,960 PowerXCell 8i GPUs Each blade contains 2 Operons and 4 Cells 296 racks First ever petaflop machine (2008) 1.71 petaflops peak (1.7 billion million fp operations per second) 2.35 MW (not including cooling) –Lake Murray hydroelectric plant produces ~150 MW (peak) –Lake Murray coal plant (McMeekin Station) produces ~300 MW (peak) –Catawba Nuclear Station near Rock Hill produces 2258 MW CSCE 190: Computing in the Modern World 26

Conclusion Even though integration levels continue to increase exponentially, processor makers still need to budget their real estate to target specific application targets Use of heterogeneous computing is on the rise for both high performance computing and embedded computing CSCE 190: Computing in the Modern World 27