CSCE 190: Computing in the Modern World
Trends in the Infrastructure of Computing: Processing, Storage, Bandwidth
Dr. Jason D. Bakos
Elements
Semiconductors
[Figure: silicon crystal lattice, atomic spacing = 543 pm; P-N junction shown under forward and reverse bias]
Silicon is a group IV element (4 valence electrons; shell configuration 2, 8, 4)
Forms covalent bonds with four neighboring atoms (3D cubic crystal lattice)
Si is a poor conductor, but its conduction characteristics can be altered
Add impurities/dopants (each replaces a silicon atom in the lattice) to make a better conductor:
Group V element (phosphorus/arsenic) => 5 valence electrons
Leaves an electron free => n-type semiconductor (electrons, negative carriers)
Group III element (boron) => 3 valence electrons
Borrows an electron from a neighbor => p-type semiconductor (holes, positive carriers)
MOSFETs
[Figure: NMOS and PMOS cross-sections showing source, gate, drain, channel, and body/bulk terminals]
Metal-poly-Oxide-Semiconductor structures built onto a substrate
NMOS/NFET: positive gate voltage (Vdd, relative to body) forms a channel of electrons; body/bulk tied to GROUND
PMOS/PFET: negative gate voltage (relative to body, i.e. gate at GND) forms a channel of holes; body/bulk tied HIGH
Source/drain-to-body junctions are reverse-biased
Shorter channel length => shorter distance for carriers => faster transistor
Fabrication steps:
Diffusion: inject dopants into the substrate
Oxidation: form a layer of SiO2 (glass)
Deposition and etching: add aluminum/copper wires
Layout
[Figure: layout of a 3-input NAND gate]
Logic Gates
[Figure: schematics of an inverter (inv), NAND2, NAND3, and NOR2 gates]
Logic Synthesis
Behavior: C = A + B
Assume A is 2 bits, B is 2 bits, C is 3 bits (a gate-level sketch follows below)

A (2 bits)   B (2 bits)   C (3 bits)
00 (0)       00 (0)       000 (0)
01 (1)       01 (1)       001 (1)
10 (2)       10 (2)       010 (2)
11 (3)       11 (3)       011 (3)
                          100 (4)
                          101 (5)
                          110 (6)
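Not from the slides: a minimal C sketch of the synthesized behavior, modeling C = A + B with the individual gates a synthesis tool might produce (a half adder plus a full adder) and checking every input pair against ordinary addition.

    #include <stdio.h>
    #include <assert.h>

    /* Gate-level model of the 2-bit adder C = A + B (half adder + full adder). */
    static unsigned add2(unsigned a, unsigned b)
    {
        unsigned a0 = a & 1, a1 = (a >> 1) & 1;         /* split A into bits */
        unsigned b0 = b & 1, b1 = (b >> 1) & 1;         /* split B into bits */
        unsigned c0 = a0 ^ b0;                          /* half-adder sum    */
        unsigned k1 = a0 & b0;                          /* half-adder carry  */
        unsigned c1 = a1 ^ b1 ^ k1;                     /* full-adder sum    */
        unsigned c2 = (a1 & b1) | (k1 & (a1 ^ b1));     /* full-adder carry  */
        return (c2 << 2) | (c1 << 1) | c0;              /* assemble 3-bit C  */
    }

    int main(void)
    {
        for (unsigned a = 0; a < 4; a++)
            for (unsigned b = 0; b < 4; b++)
                assert(add2(a, b) == a + b);            /* matches the table */
        printf("all 16 input combinations verified\n");
        return 0;
    }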
Microarchitecture
Synthesized and Place-and-Routed (P&R'ed) MIPS Architecture
Feature Size
Shrink minimum feature size…
Smaller L decreases carrier transit time and increases current
Therefore, W may also be reduced for a fixed current
Cg, Cs, and Cd are reduced
Transistor switches faster (~linear relationship; see the first-order model below)
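A standard first-order model (not from the slides) of why switching delay tracks feature size roughly linearly: the delay is the time the drive current needs to charge the next stage's gate capacitance,

\tau \approx \frac{C_g V_{dd}}{I_D}, \qquad C_g = C_{ox} W L, \qquad I_D \propto C_{ox}\,\frac{W}{L}\,(V_{dd} - V_t)^2, \qquad C_{ox} = \frac{\varepsilon_{ox}}{t_{ox}}

Under constant-field (Dennard) scaling, L, W, t_ox, V_dd, and V_t all shrink by the same factor k; then C_ox grows by k, C_g and I_D both fall by k, and tau falls by k: roughly linear in feature size.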
Minimum Feature Size

Year  Processor            Speed / Threads     Feature Size  Transistors
1982  i286                 6 - 25 MHz          1.5 µm        ~134,000
1986  i386                 16 - 40 MHz         1 µm          ~270,000
1989  i486                 16 - 133 MHz        0.8 µm        ~1 million
1993  Pentium              60 - 300 MHz        0.6 µm        ~3 million
1995  Pentium Pro          150 - 200 MHz       0.5 µm        ~4 million
1997  Pentium II           233 - 450 MHz       0.35 µm       ~5 million
1999  Pentium III          450 - 1400 MHz      0.25 µm       ~10 million
2000  Pentium 4            1.3 - 3.8 GHz       0.18 µm       ~50 million
2005  Pentium D            2 threads/package   0.09 µm       ~200 million
2006  Core 2               2 threads/die       0.065 µm      ~300 million
2008  Core i7 "Nehalem"    8 threads/die       0.045 µm      ~800 million
2011  "Sandy Bridge"       16 threads/die      0.032 µm      ~1.2 billion

Year  Processor      Threads           Feature Size  Transistors
2008  NVIDIA Tesla   240 threads/die   0.065 µm      1.4 billion
2010  NVIDIA Fermi   512 threads/die   0.040 µm      3 billion
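A quick check of Moore's law against the table (my arithmetic, not on the slide): from ~134,000 transistors in 1982 to ~1.2 billion in 2011 is a factor of about 9,000 ≈ 2^13 over 29 years, i.e. a doubling roughly every 2.2 years, in line with the classic "doubling every 18-24 months" formulation.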
CPU Design
FOR i = 1 to 1000: C[i] = A[i] + B[i]  (a plain C version follows below)
[Figure: a CPU with two cores, each pairing a large control unit with a few ALUs and private L1/L2 caches, sharing an L3 cache. The timeline repeats: copy part of A onto the CPU, copy part of B onto the CPU, perform the ADDs, copy part of C back to memory, …]
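For reference, the loop above as plain single-core C code (a sketch; the array names follow the slide):

    #include <stddef.h>

    /* Element-wise vector add on the CPU: C[i] = A[i] + B[i].
       The hardware streams A and B through the cache hierarchy a block
       at a time, as the timeline in the figure shows. */
    void vector_add(const double *a, const double *b, double *c, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }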
Co-Processor Design
FOR i = 1 to 1000: C[i] = A[i] + B[i]
[Figure: a GPU with 8 cores, each pairing a small cache with 8 ALUs; far more of the die is devoted to ALUs than on the CPU]
Heterogeneous Computing
General-purpose CPU + special-purpose processors in one system
[Figure: CPU connected to host memory at ~25 GB/s and to an X58 I/O controller over QPI at ~25 GB/s; the co-processor is a host add-in card on PCIe x16 at ~8 GB/s, with its own on-board memory; the co-processor-to-on-board-memory link is labeled "?????" (~150 GB/s for a GeForce 260)]
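A back-of-the-envelope observation from the figure's numbers (my arithmetic): the co-processor's on-board memory (~150 GB/s) is roughly 150 / 8 ≈ 19x faster than the PCIe x16 link to the host (~8 GB/s). Data therefore has to be moved to the card once and reused heavily; an application that shuttles operands across PCIe for every operation is limited by the slowest link.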
Heterogeneous Computing
Combine CPUs and co-processors
Example: an application requires a week of CPU time; the computation to be offloaded consumes 99% of execution time

Program structure:
initialization: 0.5% of run time, 49% of code
"hot" loop (offloaded to co-processor): 99% of run time, 2% of code
clean up: 0.5% of run time, 49% of code

Kernel speedup   Application speedup   Execution time
50               34                    5.0 hours
100              50                    3.3 hours
200              67                    2.5 hours
500              83                    2.0 hours
1000             91                    1.8 hours
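These numbers follow from Amdahl's law (the formula itself is not on the slide, but the table is consistent with it): with a fraction f = 0.99 of the run time accelerated by a kernel speedup s,

S_{app} = \frac{1}{(1 - f) + f/s}

For s = 100: S_app = 1 / (0.01 + 0.99/100) ≈ 50, so one week (168 hours) drops to 168 / 50 ≈ 3.3 hours. Note the ceiling: even as s goes to infinity, S_app approaches 1 / (1 - f) = 100.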
NVIDIA GPU Architecture
Hundreds of simple processor cores
Core design:
Only one instruction from each thread active in the pipeline at a time
In-order execution
No branch prediction
No speculative execution
No support for context switches
No system support (syscalls, etc.)
Small, simple caches
Programmer-managed on-chip scratch memory
Programming GPUs

HOST (CPU) code:

    dim3 grid, block;
    grid.x = (VECTOR_SIZE / 512) + (VECTOR_SIZE % 512 ? 1 : 0);   /* round up */
    block.x = 512;
    cudaMemcpy(a_device, a, VECTOR_SIZE * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(b_device, b, VECTOR_SIZE * sizeof(double), cudaMemcpyHostToDevice);
    vector_add_device<<<grid, block>>>(a_device, b_device, c_device);
    cudaMemcpy(c_gpu, c_device, VECTOR_SIZE * sizeof(double), cudaMemcpyDeviceToHost);

GPU code:

    __global__ void vector_add_device(double *a, double *b, double *c)
    {
        /* one slot per thread in on-chip scratch (shared) memory */
        __shared__ double a_s[512], b_s[512], c_s[512];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < VECTOR_SIZE) {   /* guard: grid.x is rounded up */
            a_s[threadIdx.x] = a[i];
            b_s[threadIdx.x] = b[i];
            c_s[threadIdx.x] = a_s[threadIdx.x] + b_s[threadIdx.x];
            c[i] = c_s[threadIdx.x];
        }
    }
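The listing assumes the device buffers already exist; a minimal setup sketch (not on the slide) for the allocations the host code relies on:

    double *a_device, *b_device, *c_device;
    cudaMalloc((void **)&a_device, VECTOR_SIZE * sizeof(double));
    cudaMalloc((void **)&b_device, VECTOR_SIZE * sizeof(double));
    cudaMalloc((void **)&c_device, VECTOR_SIZE * sizeof(double));
    /* ... copy, launch, and copy back as above, then release: */
    cudaFree(a_device);
    cudaFree(b_device);
    cudaFree(c_device);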
IBM Cell/B.E. Architecture
1 PPE, 8 SPEs
Programmer must manually manage the 256 KB local store and thread invocation on each SPE
Each SPE includes a 128-bit-wide vector unit, like the one on current Intel processors (see the sketch below)
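Not from the slides: a minimal illustration of what such a 128-bit vector unit does, written with the analogous Intel SSE intrinsics the slide alludes to rather than the Cell's own SPU intrinsics. Four 32-bit floats are added by a single instruction:

    #include <xmmintrin.h>

    /* Add four packed floats (128 bits) with one SIMD instruction. */
    void vec4_add(const float *a, const float *b, float *c)
    {
        __m128 va = _mm_loadu_ps(a);             /* load 4 floats  */
        __m128 vb = _mm_loadu_ps(b);             /* load 4 floats  */
        _mm_storeu_ps(c, _mm_add_ps(va, vb));    /* one vector add */
    }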
High-Performance Reconfigurable Computing
Heterogeneous computing with reconfigurable logic, i.e., FPGAs
Field Programmable Gate Arrays
Heterogeneous Computing with FPGAs
Programming FPGAs
Heterogeneous Computing with FPGAs
[Photos: Annapolis Micro Systems WILDSTAR 2 PRO; GiDEL PROCSTAR III]
Heterogeneous Computing with FPGAs
[Photo: Convey HC-1]
Heterogeneous Computing with GPUs
[Photo: NVIDIA Tesla S1070]
Heterogeneous Computing now Mainstream: IBM Roadrunner
Los Alamos; fastest computer in the world in 2008 (still in the Top 10)
6,480 AMD Opteron (dual-core) CPUs
12,960 PowerXCell 8i processors
Each blade contains 2 Opterons and 4 Cells
296 racks
First-ever petaflop machine (2008)
1.71 petaflops peak (1.71 x 10^15 floating-point operations per second)
2.35 MW (not including cooling)
Lake Murray hydroelectric plant produces ~150 MW (peak)
Lake Murray coal plant (McMeekin Station) produces ~300 MW (peak)
Catawba Nuclear Station near Rock Hill produces 2,258 MW
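One derived figure (my arithmetic, not on the slide): 1.71 PFLOPS / 2.35 MW ≈ 0.73 GFLOPS per watt. Energy efficiency, not raw transistor count, is what ultimately limits how large such machines can grow.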
Conclusion
Even though integration levels continue to increase exponentially, processor makers must still budget their transistor real estate for specific target applications
Use of heterogeneous computing is on the rise in both high-performance and embedded computing