Trends in the Infrastructure of Computing: Processing, Storage, Bandwidth
CSCE 190: Computing in the Modern World
Dr. Jason D. Bakos

Elements

Semiconductors
Silicon is a group IV element (4 valence electrons; shells: 2, 8, 18, 32, ...)
Forms covalent bonds with four neighboring atoms in a 3-D cubic crystal lattice (atomic spacing = 543 pm)
Si is a poor conductor, but its conduction characteristics may be altered
Add impurities/dopants (each replaces a silicon atom in the lattice) to make a better conductor:
Group V element (phosphorus/arsenic) => 5 valence electrons => leaves an electron free => n-type semiconductor (electrons, negative carriers)
Group III element (boron) => 3 valence electrons => borrows an electron from a neighbor => p-type semiconductor (holes, positive carriers)
P-N junction: conducts under forward bias, blocks under reverse bias

MOSFETs
Metal-(poly)Oxide-Semiconductor structures built onto the substrate:
Diffusion: inject dopants into the substrate
Oxidation: form a layer of SiO2 (glass)
Deposition and etching: add aluminum/copper wires
NMOS/NFET: channel conducts at a positive gate voltage (Vdd); body/bulk tied to GROUND
PMOS/PFET: channel conducts at a negative gate voltage relative to the body (GND); body/bulk tied to HIGH
The body ties keep the source/drain-to-body junctions reverse-biased
Shorter channel length (the distance electrons travel) => faster transistor

Layout: 3-input NAND

Logic Gates: inverter (inv), NAND2, NAND3, NOR2

Logic Synthesis
Behavior: C = A + B, where A is 2 bits, B is 2 bits, and C is 3 bits (the sum of two 2-bit values can reach 6, which needs 3 bits)

A (2 bits)    B (2 bits)    C (3 bits)
00 (0)        00 (0)        000 (0)
01 (1)        01 (1)        001 (1)
10 (2)        10 (2)        010 (2)
11 (3)        11 (3)        011 (3)
                            100 (4)
                            101 (5)
                            110 (6)
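
As a software analogue of this specification, a short C loop (a sketch of my own, not from the slides) enumerates the behavior a synthesis tool must realize in gates:

    #include <stdio.h>

    /* Enumerate C = A + B for every 2-bit A and B.
       The sum can reach 3 + 3 = 6, so C needs 3 bits. */
    int main(void)
    {
        for (int a = 0; a < 4; a++)
            for (int b = 0; b < 4; b++)
                printf("A=%d%d B=%d%d C=%d%d%d\n",
                       (a >> 1) & 1, a & 1,
                       (b >> 1) & 1, b & 1,
                       ((a + b) >> 2) & 1, ((a + b) >> 1) & 1, (a + b) & 1);
        return 0;
    }

The synthesizer starts from exactly this kind of truth table and produces a network of the gates shown on the previous slide.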

Microarchitecture

Synthesized and Placed-and-Routed (P&R'ed) MIPS Architecture

Feature Size
Shrink the minimum feature size...
Smaller L decreases carrier transit time and increases current
Therefore, W may also be reduced for a fixed current
Cg, Cs, and Cd are reduced
The transistor switches faster (~linear relationship)
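
To first order this is the classical constant-field (Dennard) scaling argument; as a rough sketch (my notation, not from the slide), scaling all dimensions and the supply voltage by a factor 1/k gives

\[
L' = \frac{L}{k}, \quad W' = \frac{W}{k}, \quad V' = \frac{V}{k}
\;\Rightarrow\;
C' \propto \frac{C}{k}, \quad
t'_{\mathrm{delay}} \propto \frac{C'V'}{I'} \approx \frac{t_{\mathrm{delay}}}{k},
\]

so gate delay shrinks roughly linearly with feature size, matching the "~linear relationship" above.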

Minimum Feature Size

Year   Processor            Speed/Threads      Feature Size   Transistors
1982   i286                 6-25 MHz           1.5 µm         ~134,000
1986   i386                 16-40 MHz          1 µm           ~270,000
1989   i486                 16-133 MHz         0.8 µm         ~1 million
1993   Pentium              60-300 MHz         0.6 µm         ~3 million
1995   Pentium Pro          150-200 MHz        0.5 µm         ~4 million
1997   Pentium II           233-450 MHz        0.35 µm        ~5 million
1999   Pentium III          450-1400 MHz       0.25 µm        ~10 million
2000   Pentium 4            1.3-3.8 GHz        0.18 µm        ~50 million
2005   Pentium D            2 threads/package  0.09 µm        ~200 million
2006   Core 2               2 threads/die      0.065 µm       ~300 million
2008   Core i7 "Nehalem"    8 threads/die      0.045 µm       ~800 million
2011   "Sandy Bridge"       16 threads/die     0.032 µm       ~1.2 billion

Year   Processor            Speed/Threads      Feature Size   Transistors
2008   NVIDIA Tesla         240 threads/die    0.065 µm       1.4 billion
2010   NVIDIA Fermi         512 threads/die    0.040 µm       3 billion
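
The transistor counts in this table track Moore's law closely; as a quick order-of-magnitude check (my arithmetic, not on the slide), doubling every two years from the i286 predicts

\[
134{,}000 \times 2^{(2011-1982)/2} = 134{,}000 \times 2^{14.5} \approx 3 \times 10^{9},
\]

within a small factor of Sandy Bridge's ~1.2 billion transistors.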

CPU Design
FOR i = 1 to 1000: C[i] = A[i] + B[i]
[Diagram: a two-core CPU; each core has control logic, a few ALUs, and private L1/L2 caches behind a shared L3 cache]
Execution over time, one chunk at a time:
Copy part of A onto the CPU
Copy part of B onto the CPU
ADD, ADD, ADD, ADD
Copy part of C into memory
(repeat for the next chunk ...)
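
In plain C, the loop the CPU executes is just the following sketch (my rendering of the slide's pseudocode); the explicit copies in the diagram happen implicitly as cache lines of A, B, and C stream through the hierarchy:

    /* Sequential vector add: one element per iteration.
       The hardware moves chunks of a, b, and c through the caches
       around the ADDs. */
    void vector_add(const double *a, const double *b, double *c, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }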

Co-Processor Design
FOR i = 1 to 1000: C[i] = A[i] + B[i]
[Diagram: a GPU with 8 cores; each core pairs a small cache with 8 ALUs, so the 1000 additions spread across 64 ALUs working in parallel]

Heterogeneous Computing
General-purpose CPU + special-purpose processors in one system
[System diagram: CPU <-> host memory, ~25 GB/s; CPU <-> X58 I/O controller over QPI, ~25 GB/s; X58 <-> co-processor over PCIe x16, ~8 GB/s; co-processor <-> on-board memory, ????? (~150 GB/s for a GeForce 260 host add-in card)]

Heterogeneous Computing
Combine CPUs and co-processors
Example: an application requires a week of CPU time, and the offloaded computation consumes 99% of the execution time:
initialization: 0.5% of run time, 49% of code
"hot" loop: 99% of run time, 2% of code => offloaded to the co-processor
clean-up: 0.5% of run time, 49% of code

Kernel speedup   Application speedup   Execution time
50               34                    5.0 hours
100              50                    3.3 hours
200              67                    2.5 hours
500              83                    2.0 hours
1000             91                    1.8 hours
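
These application-level figures follow from Amdahl's law; as a worked check (my arithmetic, assuming a 1% serial fraction and a 168-hour week, neither stated explicitly on the slide):

\[
\mathrm{speedup}(k) = \frac{1}{s + (1-s)/k}, \qquad s = 0.01
\]
\[
\mathrm{speedup}(100) = \frac{1}{0.01 + 0.99/100} \approx 50,
\qquad \frac{168\ \mathrm{h}}{50} \approx 3.3\ \mathrm{h}.
\]

No matter how fast the kernel gets, the 1% serial fraction caps the application speedup at 100x (about 1.7 hours).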

NVIDIA GPU Architecture
Hundreds of simple processor cores
Core design:
Only one instruction from each thread active in the pipeline at a time
In-order execution
No branch prediction
No speculative execution
No support for context switches
No system support (syscalls, etc.)
Small, simple caches
Programmer-managed on-chip scratch memory

Programming GPUs

HOST (CPU) code:

    dim3 grid, block;
    grid.x = (VECTOR_SIZE/512) + (VECTOR_SIZE%512 ? 1 : 0);  /* ceiling division: enough blocks to cover the vector */
    block.x = 512;                                           /* 512 threads per block */
    cudaMemcpy(a_device, a, VECTOR_SIZE*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(b_device, b, VECTOR_SIZE*sizeof(double), cudaMemcpyHostToDevice);
    vector_add_device<<<grid,block>>>(a_device, b_device, c_device);
    cudaMemcpy(c_gpu, c_device, VECTOR_SIZE*sizeof(double), cudaMemcpyDeviceToHost);

GPU code:

    __global__ void vector_add_device(double *a, double *b, double *c)
    {
        /* Programmer-managed on-chip scratch memory: one slot per thread
           in the block (a single shared scalar would be raced on by all
           512 threads). VECTOR_SIZE is assumed to be a compile-time
           constant visible to device code. */
        __shared__ double a_s[512], b_s[512], c_s[512];
        int i = blockIdx.x*blockDim.x + threadIdx.x;  /* global element index */
        if (i < VECTOR_SIZE) {                        /* guard the partial last block */
            a_s[threadIdx.x] = a[i];
            b_s[threadIdx.x] = b[i];
            c_s[threadIdx.x] = a_s[threadIdx.x] + b_s[threadIdx.x];
            c[i] = c_s[threadIdx.x];
        }
    }

IBM Cell/B.E. Architecture
1 PPE, 8 SPEs
Programmer must manually manage each SPE's 256 KB local store and thread invocation
Each SPE includes a 128-bit-wide vector unit, like the one on current Intel processors
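
To give a concrete feel for a 128-bit vector unit, here is a minimal sketch using Intel's SSE2 intrinsics (my illustration, not from the slides; the function name and the alignment/evenness assumptions are mine):

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Add two double arrays two elements (128 bits) at a time.
       Assumes n is even and a, b, c are 16-byte aligned. */
    void vector_add_sse2(const double *a, const double *b, double *c, int n)
    {
        for (int i = 0; i < n; i += 2) {
            __m128d va = _mm_load_pd(&a[i]);          /* load 2 doubles */
            __m128d vb = _mm_load_pd(&b[i]);
            _mm_store_pd(&c[i], _mm_add_pd(va, vb));  /* one add covers both lanes */
        }
    }

The SPE's vector unit works the same way conceptually, except that the data must first be DMA'd into its 256 KB local store.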

High-Performance Reconfigurable Computing
Heterogeneous computing with reconfigurable logic, i.e., FPGAs

Field Programmable Gate Arrays

Heterogeneous Computing with FPGAs

Programming FPGAs

Heterogeneous Computing with FPGAs
Annapolis Micro Systems WILDSTAR 2 PRO; GiDEL PROCSTAR III

Heterogeneous Computing with FPGAs
Convey HC-1

Heterogeneous Computing with GPUs
NVIDIA Tesla S1070

Heterogeneous Computing Now Mainstream: IBM Roadrunner
Los Alamos; fastest computer in the world in 2008 (still in the Top 10)
6,480 dual-core AMD Opteron CPUs
12,960 IBM PowerXCell 8i processors
Each blade contains 2 Opterons and 4 Cells; 296 racks
First-ever petaflop machine (2008): 1.71 petaflops peak (1.71 x 10^15 floating-point operations per second)
Power draw: 2.35 MW (not including cooling)
For scale: the Lake Murray hydroelectric plant produces ~150 MW (peak), the McMeekin Station coal plant at Lake Murray ~300 MW (peak), and the Catawba Nuclear Station near Rock Hill 2,258 MW

Conclusion
Even though integration levels continue to increase exponentially, processor makers must still budget their chip real estate for specific application domains
The use of heterogeneous computing is on the rise in both high-performance computing and embedded computing