Introduction to CUDA Programming: Programming Massively Parallel Graphics Processors. Andreas Moshovos, ECE, Univ. of Toronto.

Presentation transcript:

Introduction to CUDA Programming Introduction to Programming Massively Parallel Graphics processors Andreas Moshovos ECE, Univ. of Toronto Summer 2010 Some slides/material from: UIUC course by Wen-Mei Hwu and David Kirk UCSB course by Andrea Di Blas Universitat Jena by Waqar Saleem NVIDIA by Simon Green and others as noted on slides

How to Get High Performance Computation –Calculations –Data communication/Storage Tons of Compute Engines Tons of Storage Unlimited Bandwidth Zero/Low Latency

Calculation capabilities How many calculation units can be built? Today’s silicon chips –About 1B transistors –30K transistors for a 52b multiplier ~30K multipliers –260mm^2 area (mid-range) –112 microns^2 for an FP unit (overestimated) ~2K FP units Frequency ~3GHz common today –TFLOPs possible Disclaimer: back-of-the-envelope calculations – take with a grain of salt Can build lots of calculation units (ALUs) Tons of Compute Engines ?

How about Communication/Storage Need data feed and storage The larger the slower Takes time to get there and back –Multiple cycles even on the same die Tons of Compute Engines Tons of Slow Storage Unlimited Bandwidth Zero/Low Latency

Is there enough parallelism? Keep this busy? –Needs lots of independent calculations Parallelism/Concurrency Much of what we do is sequential –First do 1, then do 2, then if X do 3 else do 4 Tons of Compute Engines Tons of Storage Unlimited Bandwidth Zero/Low Latency

Today’s High-End General Purpose Processors Localize Communication and Computation Try to automatically extract parallelism –Automatically extract instruction-level parallelism –Large on-die caches to tolerate off-chip memory latency [Figure: over time, a faster cache and a slower cache sit between the processor and Tons of Slow Storage]

Some things are naturally parallel

Sequential Execution Model int a[N]; // N is large for (i = 0; i < N; i++) a[i] = a[i] * fade; time Flow of control / Thread One instruction at a time Optimizations possible at the machine level

Data Parallel Execution Model / SIMD int a[N]; // N is large for all elements do in parallel a[index] = a[index] * fade; time This has been tried before: ILLIAC III, UIUC, 1966

Single Program Multiple Data / SPMD int a[N]; // N is large for all elements do in parallel if (a[i] > threshold) a[i]*= fade; time The model used in today’s Graphics Processors

CPU vs. GPU overview CPU: –Handles sequential code well –Can’t take advantage of massively parallel code –Off-chip bandwidth lower –Peak Computation capability lower GPU: –Requires massively parallel computation –Handles some control flow –Higher off-chip bandwidth –Higher peak computation capability

Programmer’s view GPU as a co-processor (2008) [Diagram: CPU connected to CPU Memory at 6.4GB/sec – 31.92GB/sec (8B per transfer); CPU connected to the GPU at 3GB/s – 8GB/s; GPU connected to GPU Memory (1GB on our systems) at 141GB/sec]

Target Applications int a[N]; // N is large for all elements of a compute a[i] = a[i] * fade Lots of independent computations –CUDA threads need not be independent

Programmer’s View of the GPU GPU: a compute device that: –Is a coprocessor to the CPU or host –Has its own DRAM (device memory) –Runs many threads in parallel Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads

Why are threads useful? Parallelism/Concurrency: –Do multiple things in parallel –Uses more hardware → Gets higher performance Needs more functional units

Why are threads useful #2 – Tolerating stalls Often a thread stalls, e.g., memory access Multiplex the same functional unit Get more performance at a fraction of the cost

GPU vs. CPU Threads GPU threads are extremely lightweight Very little creation overhead In the order of microseconds All done in hardware GPU needs 1000s of threads for full efficiency Multi-core CPU needs only a few

Execution Timeline time 1. Copy to GPU mem 2. Launch GPU Kernel GPU / Device 2’. Synchronize with GPU 3. Copy from GPU mem CPU / Host
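The same timeline, as a minimal host-side sketch (a rough outline only: the kernel name my_kernel and the buffer names are placeholders, not from the slides):

__global__ void my_kernel (float *d, int n) { /* device work goes here */ }

void run (float *h_data, int n)
{
    float *d_data;
    size_t nbytes = n * sizeof (float);
    cudaMalloc ((void **) &d_data, nbytes);
    cudaMemcpy (d_data, h_data, nbytes, cudaMemcpyHostToDevice);   // 1. copy to GPU mem
    my_kernel <<< (n + 255) / 256, 256 >>> (d_data, n);            // 2. launch GPU kernel (returns immediately)
    cudaThreadSynchronize ();                                      // 2'. synchronize with GPU
    cudaMemcpy (h_data, d_data, nbytes, cudaMemcpyDeviceToHost);   // 3. copy from GPU mem
    cudaFree (d_data);
}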

Programmer’s view First create data on CPU memory CPU Memory GPU GPU Memory

Programmer’s view Then Copy to GPU CPU Memory GPU GPU Memory

Programmer’s view GPU starts computation  runs a kernel CPU can also continue CPU Memory GPU GPU Memory

Programmer’s view CPU and GPU Synchronize CPU Memory GPU GPU Memory

Programmer’s view Copy results back to CPU CPU Memory GPU GPU Memory

Computation partitioning: At the highest level: –Think of computation as a series of loops: for (i = 0; i < big_number; i++) –a[i] = some function for (i = 0; i < big_number; i++) –a[i] = some other function for (i = 0; i < big_number; i++) –a[i] = some other function Kernels

Computation Partitioning -- Kernel CUDA exposes the hardware to the programmer Programmer must manually partition work appropriately Programmer’s view is hierarchical: –Think of data as an array

Per Kernel Computation Partitioning Computation Grid: 2D Case Threads within a block can communicate/synchronize –Run on the same multiprocessor Threads across blocks can’t communicate –Shouldn’t touch each other’s data –Behavior undefined Block thread

Thread Coordination Overview Race-free access to data

GBT: Grids of Blocks of Threads Why? Realities of integrated circuits: need to cluster computation and storage to achieve high speeds Programmers view of data and computation partitioning

Block and Thread IDs Threads and blocks have IDs –So each thread can decide what data to work on –Block ID: 1D or 2D –Thread ID: 1D, 2D, or 3D Simplifies memory addressing when processing multidimensional data –Convenience not necessity [Figure: a device running Grid 1, a 3x2 arrangement of blocks; Block (1, 1) is expanded to show its 5x3 arrangement of threads] IDs and dimensions are accessible through predefined “variables”, e.g., blockDim.x and threadIdx.x
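For example, a kernel over a 2D array can recover each thread’s global row and column from these built-in variables (a sketch with made-up names, assuming one thread per element):

__global__ void scale2d (float *a, int width, int height, float fade)
{
    // global (column, row) position of this thread within the whole grid
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        a[row * width + col] *= fade;
}

// launched, e.g., with 16x16-thread blocks covering the whole array:
//   dim3 block (16, 16);
//   dim3 grid ((width + 15) / 16, (height + 15) / 16);
//   scale2d <<< grid, block >>> (d_a, width, height, 0.5f);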

Execution Model: Ordering Execution order is undefined Do not assume or rely on: block 0 executes before block 1, thread 10 executes before thread 20, or any other ordering, even if you can observe it –Future implementations may break this ordering –It’s not part of the CUDA definition –Why? More flexible hardware options

Programmer’s view: Memory Model Different memories with different uses and performance –Some managed by the compiler –Some must be managed by the programmer Arrows show whether read and/or write is possible

Execution Model Summary (for your reference) Grid of blocks of threads –1D/2D grid of blocks –1D/2D/3D blocks of threads All blocks are identical: –same structure and # of threads Block execution order is undefined Same block threads: –can synchronize and share data fast (shared memory) Threads from different blocks: –Cannot cooperate –Communication through global memory Threads and Blocks have IDs –Simplifies data indexing –Can be 1D, 2D, or 3D (threads) Blocks do not migrate: execute on the same processor Several blocks may run over the same processor

CUDA Software Architecture [Diagram of the software stack: CUDA libraries (e.g., an FFT library) on top of the runtime API (cuda…() calls), on top of the driver API (cu…() calls), on top of the GPU]

Reasoning about CUDA call ordering GPU communication via cuda…() calls and kernel invocations –cudaMalloc, cudaMemcpy Asynchronous from the CPU’s perspective –CPU places a request in a “CUDA” queue –requests are handled in-order Streams allow for multiple queues –Order within each queue honored –No order across queues –More on this much later on
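A hedged sketch of the queue/stream idea (the kernels and buffers here are placeholders, not from the slides): requests issued to the same stream complete in issue order, while requests in different streams have no ordering guarantee relative to each other.

__global__ void kernelA (float *d) { /* ... */ }
__global__ void kernelB (float *d) { /* ... */ }
__global__ void kernelC (float *d) { /* ... */ }

void two_queues (float *d_a, float *d_b, int blocks, int threads)
{
    cudaStream_t s0, s1;
    cudaStreamCreate (&s0);
    cudaStreamCreate (&s1);

    kernelA <<< blocks, threads, 0, s0 >>> (d_a);
    kernelB <<< blocks, threads, 0, s0 >>> (d_a);   // same stream: starts only after kernelA
    kernelC <<< blocks, threads, 0, s1 >>> (d_b);   // different stream: unordered w.r.t. s0

    cudaStreamSynchronize (s0);                     // wait for everything queued in s0
    cudaStreamSynchronize (s1);
    cudaStreamDestroy (s0);
    cudaStreamDestroy (s1);
}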

My first CUDA Program
const int N = 1024;                  // example size; the slide does not fix it
const int SIZE = N * sizeof (float);

// GPU side
__global__ void arradd (float *a, float f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] + f;
}

// CPU side
int main ()
{
    float h_a[N];
    float *d_a;

    cudaMalloc ((void **) &d_a, SIZE);
    cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);

    arradd <<< N / 64, 64 >>> (d_a, 10.0f, N);   // 64 threads per block

    cudaThreadSynchronize ();
    cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
    cudaFree (d_a);
    return 0;
}

CUDA API: Example int a[N]; for (i = 0; i < N; i++) a[i] = a[i] + x; 1. Allocate CPU Data Structure 2. Initialize Data on CPU 3. Allocate GPU Data Structure 4. Copy Data from CPU to GPU 5. Define Execution Configuration 6. Run Kernel 7. CPU synchronizes with GPU 8. Copy Data from GPU to CPU 9. De-allocate GPU and CPU memory

1. Allocate CPU Data float *ha; int main (int argc, char *argv[]) { int N = atoi (argv[1]); ha = (float *) malloc (sizeof (float) * N);... } No memory is allocated on the GPU side Pinned memory allocation (cudaMallocHost (…)) results in faster CPU to/from GPU copies But pinned memory cannot be paged out More on this later
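For reference, a small sketch contrasting the two host-side allocations (ordinary pageable malloc vs. pinned cudaMallocHost); the names and sizes are illustrative:

void allocate_host_buffers (int N)
{
    float *ha_pageable = (float *) malloc (N * sizeof (float));   // ordinary, pageable memory

    float *ha_pinned;
    cudaMallocHost ((void **) &ha_pinned, N * sizeof (float));    // pinned: faster CPU<->GPU copies,
                                                                  // but cannot be paged out
    // ... use either buffer as the source/destination of cudaMemcpy ...

    free (ha_pageable);
    cudaFreeHost (ha_pinned);                                     // pinned memory has its own free call
}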

2. Initialize CPU Data (dummy) float *ha; int i; for (i = 0; i < N; i++) ha[i] = i;

3. Allocate GPU Data float *da; cudaMalloc ((void **) &da, sizeof (float) * N); Notice: the pointer is not returned by an assignment –NOT: da = cudaMalloc (…) Assignment is done internally: –That’s why we pass &da Space is allocated in Global Memory on the GPU

GPU Memory Allocation The host manages GPU memory allocation: –cudaMalloc (void **ptr, size_t nbytes) –Must explicitly cast to ( void **) cudaMalloc ((void **) &da, sizeof (float) * N); –cudaFree (void *ptr); cudaFree (da); –cudaMemset (void *ptr, int value, size_t nbytes); cudaMemset (da, 0, N * sizeof (int)); Check the CUDA Reference Manual
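The three calls together, with the allocation’s return code checked (a minimal sketch; the error-handling style is up to you):

void device_buffer_demo (int N)
{
    float *da;
    size_t nbytes = N * sizeof (float);

    if (cudaMalloc ((void **) &da, nbytes) != cudaSuccess) {
        printf ("cudaMalloc failed\n");
        return;
    }
    cudaMemset (da, 0, nbytes);      // zero-fill the device array
    // ... launch kernels that use da ...
    cudaFree (da);
}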

4. Copy Initialized CPU data to GPU float *da; float *ha; cudaMemcpy ((void *) da, // DESTINATION (void *) ha, // SOURCE sizeof (float) * N, // #bytes cudaMemcpyHostToDevice); // DIRECTION

Host/Device Data Transfers The host initiates all transfers: cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction) Blocking from the CPU’s perspective –the CPU thread continues only after the copy completes (cudaMemcpyAsync is the non-blocking variant) In-order processing with other CUDA requests enum cudaMemcpyKind –cudaMemcpyHostToDevice –cudaMemcpyDeviceToHost –cudaMemcpyDeviceToDevice

5. Define Execution Configuration How many blocks and threads/block int threads_block = 64; int blocks = N / threads_block; if (N % threads_block != 0) blocks += 1; Alternatively: blocks = (N + threads_block - 1) / threads_block;
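The same rounding-up calculation is often wrapped in dim3 values so it generalizes to 2D/3D launches (a sketch continuing the deck’s 1D example):

int threads_block = 64;
dim3 block (threads_block);
dim3 grid ((N + threads_block - 1) / threads_block);   // round up so all N elements are covered
// darradd <<< grid, block >>> (da, 10.0f, N);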

6. Launch Kernel & 7. CPU/GPU Synchronization Instructs the GPU to launch blocks x threads_block threads: darradd <<< blocks, threads_block >>> (da, 10.0f, N); cudaThreadSynchronize (); // forces CPU to wait darradd: kernel name <<< blocks, threads_block >>>: execution configuration –More on this soon (da, x, N): arguments –256-byte limit / No variable arguments
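Because the launch itself returns no error code, a common follow-up (a sketch, not from the slides) is to check cudaGetLastError right after launching and again after synchronizing; this continues the darradd example above:

darradd <<< blocks, threads_block >>> (da, 10.0f, N);
cudaError_t err = cudaGetLastError ();                    // catches launch/configuration errors
if (err != cudaSuccess) printf ("launch failed: %s\n", cudaGetErrorString (err));

cudaThreadSynchronize ();                                 // wait for the kernel to finish
err = cudaGetLastError ();                                // catches errors raised while the kernel ran
if (err != cudaSuccess) printf ("kernel failed: %s\n", cudaGetErrorString (err));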

CPU/GPU Synchronization CPU does not block on kernel launches –Kernel launches/requests are queued and processed in-order –Control returns to CPU immediately (plain cudaMemcpy is an exception: it returns only after the copy completes) Good if there is other work to be done –e.g., preparing for the next kernel invocation Eventually, CPU must know when GPU is done Then it can safely copy the GPU results cudaThreadSynchronize () –Block CPU until all preceding cuda…() and kernel requests have completed
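A small sketch of the overlap this enables (do_cpu_work is a hypothetical placeholder): the CPU keeps working while the queued kernel runs, and only synchronizes when it needs the results.

darradd <<< blocks, threads_block >>> (da, 10.0f, N);     // queued; control returns immediately

do_cpu_work ();                                           // overlap: CPU work proceeds while the GPU computes

cudaThreadSynchronize ();                                 // now block until the GPU is done
cudaMemcpy (ha, da, sizeof (float) * N, cudaMemcpyDeviceToHost);   // safe to read the results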

8. Copy data from GPU to CPU & 9. De-allocate Memory float *da; float *ha; cudaMemcpy ((void *) ha, // DESTINATION (void *) da, // SOURCE sizeof (float) * N, // #bytes cudaMemcpyDeviceToHost); // DIRECTION cudaFree (da); // display or process results here free (ha);

The GPU Kernel __global__ void darradd (float *da, float x, int N) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < N) da[i] = da[i] + x; } blockIdx: Unique block ID –Numerically ascending: 0, 1, … blockDim: Dimensions of the block = how many threads it has –blockDim.x, blockDim.y, blockDim.z –Unused dimensions default to 1 threadIdx: Unique per-block thread index –0, 1, … –Per block

Array Index Calculation Example int i = blockIdx.x * blockDim.x + threadIdx.x; Assuming blockDim.x = 64:
blockIdx.x = 0, threadIdx.x = 0 … 63 → i = 0 … 63 (a[0] … a[63])
blockIdx.x = 1, threadIdx.x = 0 … 63 → i = 64 … 127 (a[64] … a[127])
blockIdx.x = 2, threadIdx.x = 0 … 63 → i = 128 … 191 (a[128] … a[191])
blockIdx.x = 3, threadIdx.x = 0 → i = 192 (a[192]), and so on

CUDA Function Declarations __global__ defines a kernel function –Must return void –Can only call __device__ functions __device__ and __host__ can be used together –Two different versions generated
__device__ float DeviceFunc(): executed on the device, only callable from the device
__global__ void KernelFunc(): executed on the device, only callable from the host
__host__ float HostFunc(): executed on the host, only callable from the host
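A sketch of the __device__ and __host__ combination mentioned above: one source-level function, two generated versions (the function names are illustrative).

__host__ __device__ float clampf (float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);      // compiled once for the CPU and once for the GPU
}

__global__ void clamp_kernel (float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = clampf (a[i], 0.0f, 1.0f); // device version used here
}

// On the host, the very same function can be called directly:
//   float y = clampf (x, 0.0f, 1.0f);           // host version used here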

__device__ Example Add x to a[i] multiple times __device__ float addmany (float a, float b, int count) { while (count--) a += b; return a; } __global__ void darradd (float *da, float x, int N) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < N) da[i] = addmany (da[i], x, 10); }

Kernel and Device Function Restrictions __device__ functions cannot have their address taken –e.g., f = &addmany; *f(…); For functions executed on the device: –No recursion darradd (…) { darradd (…) } –No static variable declarations inside the function darradd (…) { static int canthavethis; } –No variable number of arguments e.g., something like printf (…)

My first CUDA Program
const int N = 1024;                  // example size; the slide does not fix it
const int SIZE = N * sizeof (float);

// GPU side
__global__ void arradd (float *a, float f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] + f;
}

// CPU side
int main ()
{
    float h_a[N];
    float *d_a;

    cudaMalloc ((void **) &d_a, SIZE);
    cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);

    arradd <<< N / 64, 64 >>> (d_a, 10.0f, N);   // 64 threads per block

    cudaThreadSynchronize ();
    cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
    cudaFree (d_a);
    return 0;
}

How to get high-performance #1 Programmer managed Scratchpad memory –Bring data in from global memory –Reuse –16KB/banked –Accessed in parallel by 16 threads Programmer needs to: –Decide what to bring and when –Decide which thread accesses what and when –Coordination paramount
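A minimal sketch of the scratchpad idea (a hypothetical kernel, not a tuned implementation, and it assumes the array length is a multiple of the tile size): each block stages a tile of the input in __shared__ memory, synchronizes, then reuses it.

#define TILE 64

// Reverses each 64-element segment in place; assumes n is a multiple of TILE
__global__ void reverse_within_blocks (float *a, int n)
{
    __shared__ float tile[TILE];                 // per-block on-chip scratchpad

    int i = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = a[i];                    // 1. bring data in from global memory
    __syncthreads ();                            // 2. wait until the whole tile is loaded
    a[i] = tile[TILE - 1 - threadIdx.x];         // 3. reuse: read another thread's element from shared memory
}

// launched as: reverse_within_blocks <<< n / TILE, TILE >>> (d_a, n);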

How to get high-performance #2 Global memory accesses –32 threads access memory together –Can coalesce into a single reference –E.g., a[threadID] works well Control flow –32 threads run together –If they diverge there is a performance penalty Texture cache –When you think there is locality
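Illustrating the access-pattern point with two hypothetical kernels (a sketch, not from the slides): consecutive threads touching consecutive elements coalesce well; a large stride between consecutive threads does not.

__global__ void coalesced (float *a, float fade, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread k touches a[k]:
    if (i < n) a[i] *= fade;                         // neighbouring threads hit neighbouring words
}

__global__ void strided (float *a, float fade, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;   // thread k touches a[k*stride]:
    if (i < n) a[i] *= fade;                                    // accesses spread out, poor coalescing
}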

Are GPUs really that much faster than CPUs? 50x – 200x speedups typically reported Recent work found: –Not enough effort goes into optimizing code for CPUs But: –The learning curve and expertise needed to optimize for CPUs is much larger

ECE Overview -ECE research Profile -Personnel and budget -Partnerships with industry Our areas of expertise -Biomedical Engineering -Communications -Computer Engineering -Electromagnetics -Electronics -Energy Systems -Photonics -Systems Control -Slides from F. Najm (Chair) and T. Sargent (Research Vice Chair)

About our group Computer Architecture –How to build the best possible system –Best: performance, power, cost, etc. Expertise in high-end systems –Micro-architecture –Multi-processor and Multi-core systems Current Research Support: –AMD, IBM, NSERC, Qualcomm (planned) Claims to fame –Memory Dependence Prediction Commercially implemented and licensed –Snoop Filtering: IBM Blue Gene

UofT-DRDC Partnership

Examples of industry research contracts with ECE in the past 8 years AMD Agile Systems Inc Altera ARISE Technologies Asahi Kasei Microsystems Bell Canada Bell Mobility Cellular Bioscrypt Inc Broadcom Corporation Ciclon Semiconductor Cybermation Inc Digital Predictive Systems Inc. DPL Science Eastman Kodak Electro Scientific Industries EMS Technologies Exar Corp FOX-TEK Firan Technology Group Fuji Electric Fujitsu Gennum H2Green Energy Corporation Honeywell ASCa, Inc. Hydro One Networks Inc. IBM Canada Ltd. IBM IMAX Corporation Intel Corporation Jazz Semiconductor KT Micro LG Electronics Maxim MPB Technologies Microsoft Motorola Northrop Grumman NXP Semiconductors ON Semiconductor Ontario Lottery and Gaming Corp Ontario Power Generation Inc. Panasonic Semiconductor Singapore Peraso Technologies Inc. Philips Electronics North America Redline Communications Inc. Research in Motion Ltd. Right Track CAD Robert Bosch Corporation Samsung Thales Co., Ltd Semiconductor Research Corporation Siemens Aktiengesellschaft Sipex Corporation STMicroelectronics Inc. Sun Microsystems of Canada Inc. Telus Mobility Texas Instruments Toronto Hydro-Electric System Toshiba Corporation Xilinx Inc.

Eight Research Groups 1. Biomedical Engineering 2. Communications 3. Computer Engineering 4. Electromagnetics 5. Electronics 6. Energy Systems 7. Photonics 8. Systems Control

Computer Engineering Group Human-Computer Interaction –Willy Wong, Steve Mann Multi-sensor information systems –Parham Aarabi Computer Hardware –Jonathan Rose, Steve Brown, Paul Chow, Jason Anderson Computer Architecture –Greg Steffan, Andreas Moshovos, Tarek Abdelrahman, Natalie Enright Jerger Computer Security –David Lie, Ashvin Goel

Biomedical Engineering Neurosystems –Berj L. Bardakjian, Roman Genov –Willy Wong, Hans Kunov –Moshe Eizenman Rehabilitation –Milos Popovic, Tom Chau Medical Imaging –Michael Joy, Adrian Nachman –Richard Cobbold –Ofer Levi Proteomics –Brendan Frey –Kevin Truong

Communications Group Study of the principles, mathematics and algorithms that underpin how information is encoded, exchanged and processed Three Sub-Groups: 1. Networks 2. Signal Processing 3. Information Theory

Sequence Analysis

Image Analysis and Computer Vision Computer vision and graphics Embedded computer vision Pattern recognition and detection

Networks

Quantum Cryptography and Computing

Computer Engineering System Software –Michael Stumm, H-A. Jacobsen, Cristiana Amza, Baochun Li Computer-Aided Design of Circuits –Farid Najm, Andreas Veneris, Jianwen Zhu, Jonathan Rose

Electronics Group UofT-IBM Partnership –14 active professors; largest electronics group in Canada –Breadth of research topics: Electronic device modelling, Semiconductor technology, VLSI CAD and Systems, FPGAs, DSP and Mixed-mode ICs, Biomedical microsystems, High-speed and mm-wave ICs and SoCs –Lab for (on-wafer) SoC and IC testing through 220 GHz

Intelligent Sensory Microsystems –Mixed-signal VLSI circuits: Low-power, low-noise signal processing, computing and ADCs –On-chip micro-sensors: Electrical, chemical, optical –Project examples: Brain-chip interfaces, On-chip biochemical sensors, CMOS imagers

mm-Wave and 100+GHz systems on chip –Modelling mm-wave and noise performance of active and passive devices past 300 GHz –GHz multi-gigabit data rate phased-array radios –Single-chip GHz automotive radar –170 GHz transceiver with on-die antennas

Electromagnetics Group Metamaterials: From microwaves to optics –Super-resolving lenses for imaging and sensing –Small antennas –Multiband RF components –CMOS phase shifters Electromagnetics of High-Speed Circuits –Signal integrity in high-speed digital systems Microwave integrated circuit design, modeling and characterization Computational Electromagnetics –Interaction of Electromagnetic Fields with Living Tissue Antennas –Telecom and Wireless Systems –Reflectarrays –Wave electronics –Integrated antennas –Controlled-beam antennas –Adaptive and diversity antennas

Super-lens capable of resolving details down to  Small and broadband antennas Scanning antennas with CMOS MTM chips METAMATERIALS (MTMs)

Computational Electromagnetics Fast CAD for RF/ optical structures Modeling of Metamaterials Plasmonic Left-Handed Media Leaky-Wave Antennas Microstrip spiral inductor Optical power splitter

Energy Systems Group Power Electronics –High power (> 1.2 MW) converters modeling, control, and digital control realization –Micro-Power Grids converters for distributed resources, dc distribution systems, and HVdc systems –Low-Power Electronics Integrated power supplies and power management systems-on-chip for low-power electronics –computers, cell phones, PDAs, MP3 players, body implants –Harvesting Energy from humans

Energy Systems Research [Figures: IC for cell phone power supplies (U of T); Matrix Converter for Micro-Turbine Generator; Voltage Control System for Wind Power Generators]

Photonics Group

Photonics Group: Bio-Photonics

Systems Control Group Basic & applied research in control engineering World-leading group in Control theory Optical Signal-to-Noise Ratio optimization with game theory Erbium-doped fibre amplifier design Analysis and design of digital watermarks for authentication Nonlinear control theory –application to magnetic levitation, micro positioning systems Distributed control of mobile autonomous robots –Formations, collision avoidance