An Energy Efficient Time-sharing Pyramid Pipeline for Multi-resolution Computer Vision
Qiuling Zhu, Navjot Garg, Yun-Ta Tsai, Kari Pulli (NVIDIA)


Applications of Multi-resolution Processing
Panorama stitching, HDR detail enhancement, optical flow.

Background: Linear Pipeline and Segment Pipeline
 Linear pipeline: replicates processing elements (PEs), one per pyramid level; all PEs work on all pyramid levels in parallel.
  Pro: low demand on off-chip memory bandwidth.
  Con: inefficient use of the PE resources; area and power overhead.
 Segment pipeline: a recirculating design that uses one PE to generate all pyramid levels, one level after another.
  Pro: saves computational resources.
  Con: requires very high memory bandwidth.

Our Approach: Time-Sharing Pipeline Architecture
 The same PE works on all pyramid levels in parallel, in a time-sharing pipeline manner.
 Each work-cycle computes 1 pixel for G2 (the coarsest level), 4 pixels for G1, and 16 pixels for G0 (the finest level); the next cycle returns to G2, and so forth.
 A single PE runs at full speed, as in a segment pipeline.
 Memory traffic is as low as in a linear pipeline.

A Combined Approach: Time-Sharing Pipeline for the Gaussian and Laplacian Pyramids
 Single PE, linebuffer pyramid, timing MUX.
 The convolution engine can be replaced with other processing elements for a more complicated multi-resolution pyramid system.

Time-Sharing Pipeline in Optical Flow Estimation (Lucas-Kanade)
 Three time-sharing pipelines work simultaneously: two for Gaussian pyramid construction (fine to coarse scale) and one for motion estimation (coarse to fine scale).
 Only needs to read the two source images from main memory and write the resulting motion vectors back to memory.

Line Buffer, Sliding-Window Registers, and Block-Linear Layout
 Sliding-window operations: pixels stream into the on-chip line buffer for temporary storage.
 Line-buffer size is proportional to the image width, making the line-buffer cost huge for high-resolution images.
 Inspired by the GPU block-linear texture memory layout, the block-linear scheme significantly reduces the linebuffer size: the linebuffer width equals the block width, at the cost of refetching data at block boundaries.

Hardware Synthesis in 32 nm CMOS
 A Genesis-based chip generator encapsulates all the parameters (e.g., window size, pyramid levels) and allows automated generation of synthesizable HDL for design-space exploration.
 Block diagram of a convolution-based time-sharing pyramid engine (e.g., a 3-level Gaussian pyramid engine with a 3x3 convolution window); hardware chip generator GUI.

Area Evaluation (design points run at 500 MHz in 32 nm CMOS)
 Time-Sharing Pipeline (TP) vs. Linear Pipeline (LP):
  TP consumes much less PE area because the same PE is time-shared among the pyramid levels.
  The cost of the extra shift registers and control logic for time-sharing is negligible compared with the reduction in PE cost.
 Time-Sharing Pipeline (TP) vs. Segment Pipeline (SP):
  TP consumes increasingly more area than SP as the number of pyramid levels grows.
  The overhead of TP over SP is fairly small for designs with small windows.

Memory Bandwidth Evaluation
 DRAM traffic is an order of magnitude lower than with SP, which saves energy.
 TP only reads the source images from DRAM and writes the resulting motion vectors back to DRAM; all other intermediate memory traffic is completely eliminated.

Overall Performance and Energy Evaluation
 Energy consumption is dominated by DRAM accesses.
 vs. SP: 10x saving on DRAM accesses (log scale); similar on-chip memory access and logic processing cost.
 vs. LP: similar DRAM access cost, but lower energy cost for on-chip logic processing.
 TP is almost 2x faster than SP.
 TP is only slightly slower than LP, while eliminating LP's duplicated-PE logic cost.

Block-Linear Design Evaluation
 P(N) = parallelism degree; B(N) = number of blocks.
 Increasing the number of blocks reduces linebuffer area while keeping the same throughput.
 The chart demonstrates the various design trade-offs.

Simulation Result
 Optical flow (velocity) on a benchmark image with a left-to-right movement.
 The proposed TP-based implementation produces the same motion vectors as the SP-based implementation, validating the approach.
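The interleaved work-cycle described under "Our Approach" can be sketched in software. This is an illustrative model, not the authors' RTL: the `work_cycle_schedule` helper is hypothetical and simply enumerates the per-cycle PE task order for an assumed 3-level pyramid.

```python
# Illustrative model of the time-sharing schedule (not the authors' RTL).
# In each work-cycle the single PE computes 1 pixel of G2 (coarsest),
# 4 pixels of G1, and 16 pixels of G0 (finest): the 4:1 pixel-rate ratio
# between adjacent pyramid levels keeps the one PE fully utilized.

def work_cycle_schedule(levels=3):
    """Return the ordered list of (level_name, pixel_slot) PE tasks
    for one work-cycle of a `levels`-deep pyramid."""
    tasks = []
    for level in range(levels - 1, -1, -1):      # start at the coarsest level
        pixels = 4 ** (levels - 1 - level)       # 1, 4, 16, ... pixels per cycle
        tasks += [("G%d" % level, slot) for slot in range(pixels)]
    return tasks

schedule = work_cycle_schedule(3)   # 21 PE slots: 1x G2, 4x G1, 16x G0
```

Because every level is touched in every work-cycle, the schedule realizes the poster's claim: one PE at full utilization (as in a segment pipeline) without spilling intermediate levels to DRAM (as in a linear pipeline).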
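For reference, the pyramids the engine computes can be sketched in a few lines of pure Python. This is a minimal stand-in, not the hardware datapath: a 2x2 box filter replaces the poster's 3x3 Gaussian window, and all helper names are my own.

```python
# Minimal Gaussian/Laplacian pyramid sketch. A 2x2 averaging filter stands
# in for the 3x3 Gaussian convolution used by the actual engine.

def downsample(img):
    """Halve the resolution by averaging each 2x2 block."""
    return [[(img[2*y][2*x] + img[2*y][2*x+1] +
              img[2*y+1][2*x] + img[2*y+1][2*x+1]) / 4.0
             for x in range(len(img[0]) // 2)]
            for y in range(len(img) // 2)]

def upsample(img):
    """Double the resolution by pixel replication."""
    return [[img[y // 2][x // 2] for x in range(2 * len(img[0]))]
            for y in range(2 * len(img))]

def gaussian_pyramid(img, levels=3):
    """G0 (finest) .. G_{levels-1} (coarsest)."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr

def laplacian_pyramid(img, levels=3):
    """L_i = G_i - upsample(G_{i+1}); the final entry is the coarsest Gaussian."""
    g = gaussian_pyramid(img, levels)
    lap = []
    for i in range(levels - 1):
        up = upsample(g[i + 1])
        lap.append([[g[i][y][x] - up[y][x] for x in range(len(g[i][0]))]
                    for y in range(len(g[i]))])
    return lap + [g[-1]]
```

Note the 4:1 pixel-count ratio between adjacent levels, which is exactly what the time-sharing schedule exploits.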
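The coarse-to-fine motion estimation used in the Lucas-Kanade demo follows the standard hierarchical pattern: estimate at the coarsest level, then double the estimate and refine it at each finer level. Below is a toy, self-contained sketch in which a brute-force global-translation search stands in for the real per-pixel Lucas-Kanade solver; every function here is illustrative, not from the paper.

```python
# Toy coarse-to-fine motion estimation. A brute-force translation search
# replaces the per-pixel Lucas-Kanade solver of the real design.

def downsample(img):
    """Halve the resolution by averaging each 2x2 block."""
    return [[(img[2*y][2*x] + img[2*y][2*x+1] +
              img[2*y+1][2*x] + img[2*y+1][2*x+1]) / 4.0
             for x in range(len(img[0]) // 2)]
            for y in range(len(img) // 2)]

def msd(a, b, dx, dy):
    """Mean squared difference of a vs. b shifted by (dx, dy), over the overlap."""
    h, w = len(a), len(a[0])
    total, n = 0.0, 0
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            total += (a[y][x] - b[y + dy][x + dx]) ** 2
            n += 1
    return total / n

def refine(a, b, dx, dy, radius=1):
    """Search the (2*radius+1)^2 shifts around (dx, dy) for the best match."""
    cands = [(dx + i, dy + j) for i in range(-radius, radius + 1)
                              for j in range(-radius, radius + 1)]
    return min(cands, key=lambda v: msd(a, b, v[0], v[1]))

def hierarchical_flow(a, b, levels=3):
    """Estimate (dx, dy) coarse-to-fine: solve at the coarsest level,
    then double and refine the estimate at each finer level."""
    if levels == 1:
        return refine(a, b, 0, 0, radius=2)
    dx, dy = hierarchical_flow(downsample(a), downsample(b), levels - 1)
    return refine(a, b, 2 * dx, 2 * dy)
```

In the poster's design, the two pyramid-construction pipelines feed this coarse-to-fine loop directly on chip, which is why only the two source images and the final motion vectors ever cross the DRAM interface.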
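The "pixels stream into the on-chip line buffer" behavior can be modeled as a software analogue of one level of the linebuffer pyramid with a 3x3 window. The function below is a hypothetical sketch (names are mine); it shows why on-chip storage scales with image width rather than image height.

```python
from collections import deque

# Software model of a sliding-window convolution fed from a line buffer:
# only the last 3 image rows are resident "on chip" at any time, so
# storage is proportional to the image width, not the image height.

def stream_convolve_3x3(rows, kernel):
    """Stream rows in one at a time; once 3 rows are buffered, emit the
    3x3 convolution for every interior pixel of the middle row."""
    width = len(rows[0])
    linebuf = deque(maxlen=3)          # the line buffer: the 3 newest rows
    out = []
    for row in rows:
        linebuf.append(row)            # oldest row is evicted automatically
        if len(linebuf) == 3:
            out.append([
                sum(kernel[j][i] * linebuf[j][x - 1 + i]
                    for j in range(3) for i in range(3))
                for x in range(1, width - 1)
            ])
    return out
```

The hardware version additionally keeps a 3x3 register window that slides along the buffered rows, so each pixel is read from the line buffer only once; the model above captures the storage behavior, not that register-level reuse.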
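The block-linear trade-off in the evaluation can be captured with a back-of-the-envelope model: splitting the image into vertical blocks shrinks the line buffer to one block width, while halo columns at block boundaries must be fetched twice. The cost model and helper below are my own assumptions for illustration, not figures from the paper.

```python
# Assumed first-order cost model for block-linear processing: the image is
# split into vertical blocks, the line buffer spans one block (plus a halo
# on each side), and halo columns at interior boundaries are fetched twice.

def blocklinear_costs(width, height, block_width, halo=1, rows_buffered=2):
    """Return (number of blocks, line-buffer size in pixels,
    refetched pixels due to block-boundary halos)."""
    blocks = -(-width // block_width)                    # ceil(width / block_width)
    linebuffer_pixels = rows_buffered * (block_width + 2 * halo)
    refetched_pixels = 2 * halo * height * (blocks - 1)  # both sides of each interior boundary
    return blocks, linebuffer_pixels, refetched_pixels
```

Under this model, doubling the block count roughly halves the line-buffer area at the cost of a small amount of extra DRAM traffic, with throughput unchanged, which matches the trade-off the evaluation chart demonstrates.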
