An Energy Efficient Time-sharing Pyramid Pipeline for Multi-resolution Computer Vision
Qiuling Zhu, Navjot Garg, Yun-Ta Tsai, Kari Pulli (NVIDIA)
http://www.c2s2.org

Applications of Multi-resolution Processing

Panorama stitching, HDR detail enhancement, and optical flow all build on multi-resolution image pyramids.

Background: Linear Pipeline and Segment Pipeline

Linear pipeline: replicates processing elements (PEs), one per pyramid level; all PEs work on all pyramid levels in parallel.
- Pro: low demand on off-chip memory bandwidth
- Con: inefficient use of PE resources; area and power overhead

Segment pipeline: a recirculating design that uses a single PE to generate all pyramid levels, one level after another.
- Pro: saves computational resources
- Con: requires very high memory bandwidth

Our Approach: Time-Sharing Pipeline Architecture

The same PE serves all pyramid levels in parallel, in a time-sharing manner. In each work-cycle it computes 1 pixel for G2 (the coarsest level), then 4 pixels for G1, then 16 pixels for G0 (the finest level); the next cycle returns to G2, and so forth.
- The single PE runs at full speed, as in the segment pipeline
- Memory traffic is as low as in the linear pipeline

A Combined Approach: Time-Sharing Pipeline for Gaussian and Laplacian Pyramids

A single PE, a line-buffer pyramid, and a timing MUX build the Gaussian pyramid (G0, G1, G2) and the Laplacian pyramid together. The convolution engine can be replaced with other processing elements for a more complicated multi-resolution pyramid system.

Time-Sharing Pipeline in Optical Flow Estimation (Lucas-Kanade)

Three time-sharing pipelines work simultaneously:
- two construct the Gaussian pyramids (fine to coarse scale)
- one performs motion estimation (coarse to fine scale)
The engine only needs to read the two source images from main memory and write the resulting motion vectors back.

Line Buffers, Sliding-Window Registers, and Block-Linear Layout

For sliding-window operations, pixels stream into an on-chip line buffer for temporary storage. The line buffer size is proportional to the image width, so line buffers for high-resolution images are costly. Inspired by the GPU's block-linear texture memory layout, the image is instead processed in blocks:
- the line buffer width equals the block width, significantly reducing line buffer size
- data must be refetched at block boundaries

Hardware Synthesis in 32 nm CMOS

A Genesis-based chip generator encapsulates all design parameters (e.g., window size, pyramid levels) and automatically generates synthesizable HDL for design-space exploration. The block diagram shows a convolution-based time-sharing pyramid engine (e.g., a 3-level Gaussian pyramid engine with a 3x3 convolution window), configured through the chip generator's GUI.

Area Evaluation

All design points run at 500 MHz in 32 nm CMOS.

Time-sharing pipeline (TP) vs. linear pipeline (LP):
- TP consumes much less PE area because the same PE is time-shared across pyramid levels
- The extra shift registers and control logic for time sharing are negligible compared with the reduction in PE cost

Time-sharing pipeline (TP) vs. segment pipeline (SP):
- TP consumes increasingly more area than SP as the number of pyramid levels grows
- The overhead of TP over SP remains fairly small for designs with small windows

Memory Bandwidth Evaluation

- TP's DRAM traffic is an order of magnitude less than SP's, which saves energy
- TP only reads the source images from DRAM and writes the resulting motion vectors back; all other intermediate memory traffic is eliminated

Overall Performance and Energy Evaluation

Energy consumption is dominated by DRAM accesses.
- vs. SP: 10x saving on DRAM accesses (log scale); similar on-chip memory access and logic processing costs
- vs. LP: similar DRAM access cost, but lower on-chip logic processing energy
- TP is almost 2x faster than SP, and only slightly slower than LP while eliminating LP's replicated-PE logic cost

BlockLinear Design Evaluation

P(N) denotes the degree of parallelism and B(N) the number of blocks. Increasing the number of blocks reduces the line buffer area while keeping the same throughput; the chart illustrates the resulting design trade-offs.

Simulation Result

Optical flow (velocity) on a benchmark image with left-to-right movement: the proposed TP-based implementation produces the same motion vectors as the SP-based implementation, validating the approach.
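The time-sharing work-cycle described above (1 pixel for G2, 4 for G1, 16 for G0, then back to G2) can be sketched as a small software model; the 4:1 pixel ratio between adjacent levels follows from the 2x downsampling in each dimension. This is a sketch of the schedule only, not of the hardware.

```python
# Minimal sketch of the time-sharing schedule: one shared PE computes
# 1 G2 pixel, 4 G1 pixels, and 16 G0 pixels per work-cycle, because each
# coarser level has 4x fewer pixels (2x per dimension).

def work_cycle_schedule(levels=3):
    """Yield (level, pixel_index) pairs for one work-cycle, coarsest first."""
    coarsest = levels - 1
    for level in range(coarsest, -1, -1):        # G2 -> G1 -> G0
        pixels = 4 ** (coarsest - level)         # 1, 4, 16, ...
        for p in range(pixels):
            yield (level, p)

schedule = list(work_cycle_schedule(3))
assert len(schedule) == 21                       # 1 + 4 + 16 PE issue slots
assert schedule[0] == (2, 0)                     # coarsest level first
assert sum(1 for lvl, _ in schedule if lvl == 0) == 16
```

The 21 slots per 16 finest-level pixels (21/16 = 1.3125, approaching the geometric series sum 4/3) show why a single PE running slightly faster than the G0 pixel rate can serve every pyramid level.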
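As a reference model for the Gaussian pyramid the engine produces, the construction is blur-then-downsample at each level; the 3x3 binomial kernel below is an assumption for illustration (the paper's engine is parameterized), and this is software, not the hardware datapath.

```python
# Software reference model of Gaussian pyramid construction: blur with a
# 3x3 binomial kernel (an illustrative choice), then downsample by 2.

def blur3x3(img):
    """3x3 binomial blur with clamped borders; img is a list of rows."""
    h, w = len(img), len(img[0])
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]   # weights sum to 16
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)   # clamp at borders
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = acc / 16.0
    return out

def gaussian_pyramid(img, levels=3):
    """Return [G0, G1, ..., G_{levels-1}], halving resolution per level."""
    pyr = [img]
    for _ in range(levels - 1):
        blurred = blur3x3(pyr[-1])
        pyr.append([row[::2] for row in blurred[::2]])  # 2x downsample
    return pyr

flat = [[5.0] * 8 for _ in range(8)]
pyr = gaussian_pyramid(flat, 3)
assert len(pyr) == 3 and len(pyr[2]) == 2
assert abs(pyr[2][0][0] - 5.0) < 1e-9   # blur preserves a constant image
```

The Laplacian pyramid in the combined design is then the per-level difference L_i = G_i - upsample(G_{i+1}), computed by the same time-shared datapath.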
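The coarse-to-fine traversal of the hierarchical Lucas-Kanade engine can be illustrated in one dimension: estimate the shift at the coarsest level, then double the estimate and refine it at each finer level. Real Lucas-Kanade is 2-D and windowed; this global 1-D version is only a sketch of the pyramid traversal, not the hardware.

```python
# 1-D sketch of coarse-to-fine (hierarchical) Lucas-Kanade: pyramids are
# built fine-to-coarse, then the shift estimate is propagated coarse-to-fine,
# doubling and refining at each level.

def downsample(sig):
    """2x downsample by averaging adjacent pairs."""
    return [(sig[i] + sig[i + 1]) / 2.0 for i in range(0, len(sig) - 1, 2)]

def lk_step(a, b, margin=3):
    """Least-squares estimate of d such that b(x) ~ a(x - d)."""
    num = den = 0.0
    for x in range(margin, len(a) - margin):
        ax = (a[x + 1] - a[x - 1]) / 2.0       # spatial gradient
        at = b[x] - a[x]                       # temporal difference
        num += ax * at
        den += ax * ax
    return -num / den if den else 0.0

def warp(sig, shift):
    """Shift a signal left by `shift` samples, clamping at the border."""
    n = len(sig)
    return [sig[min(max(x + shift, 0), n - 1)] for x in range(n)]

def hierarchical_lk(a, b, levels=3):
    pyr_a, pyr_b = [a], [b]
    for _ in range(levels - 1):                # fine -> coarse construction
        pyr_a.append(downsample(pyr_a[-1]))
        pyr_b.append(downsample(pyr_b[-1]))
    d = 0.0
    for lvl in range(levels - 1, -1, -1):      # coarse -> fine estimation
        d *= 2.0                               # flow doubles at a finer level
        shift = round(d)
        residual = lk_step(pyr_a[lvl], warp(pyr_b[lvl], shift))
        d = shift + residual
    return d

a = [float(x) for x in range(32)]
b = [float(x) - 3.0 for x in range(32)]        # a shifted right by 3
assert abs(hierarchical_lk(a, b) - 3.0) < 1e-9
```

This mirrors the poster's three concurrent pipelines: two run the fine-to-coarse construction loops, one runs the coarse-to-fine estimation loop.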
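The block-linear line-buffer trade-off follows from simple arithmetic: a k x k sliding window needs (k - 1) buffered lines of the processing width, so buffering scales with the image width W in raster order but only with the block width B in block-linear order, at the price of refetching a (k - 1)-column halo at each vertical block boundary. The concrete numbers below (1080p, 3x3 window, block width 240) are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope model of the block-linear trade-off: smaller line buffers
# in exchange for boundary refetch traffic. Numbers are illustrative.

def linebuffer_pixels(width, k):
    """On-chip line-buffer size, in pixels, for a k x k sliding window."""
    return (k - 1) * width

def blocklinear_refetch(W, H, B, k):
    """Extra pixels refetched as halo columns at vertical block boundaries."""
    n_blocks = W // B                       # assume B divides W
    return (n_blocks - 1) * (k - 1) * H

W, H, k, B = 1920, 1080, 3, 240             # 1080p, 3x3 window, 8 blocks
full = linebuffer_pixels(W, k)              # raster order: 2 * 1920 = 3840
block = linebuffer_pixels(B, k)             # block-linear: 2 * 240 = 480
refetch = blocklinear_refetch(W, H, B, k)   # 7 boundaries * 2 cols * 1080

assert full // block == 8                   # 8x smaller line buffer
assert refetch / (W * H) < 0.01             # < 1% extra fetch traffic
```

This is also why increasing the number of blocks B(N) shrinks line-buffer area without changing throughput: only the halo refetch grows, and it stays a small fraction of the image.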
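The bandwidth gap against the segment pipeline can be seen with a rough traffic model for a single Gaussian-pyramid pass: SP reads each level back from DRAM and writes the next, with per-level sizes following the 1/4 geometric series, while TP reads the source once and keeps intermediates in on-chip line buffers. Even this single pass shows a ~1.7x gap; the paper's reported order-of-magnitude saving comes from the full optical-flow application, where two pyramids and per-level intermediate images all round-trip through DRAM under SP. The model below is a simplification, not the paper's measurements.

```python
# Rough per-pass DRAM traffic model, in pixels. Segment pipeline (SP)
# round-trips every pyramid level; time-sharing pipeline (TP) reads the
# source once and keeps intermediate levels on chip.

def sp_pass_traffic(n, levels):
    """SP: each pass reads level i and writes level i+1 (size n / 4**i)."""
    traffic, size = 0, n
    for _ in range(levels - 1):
        traffic += size + size // 4         # read level i, write level i+1
        size //= 4
    return traffic

def tp_pass_traffic(n, levels):
    """TP: the source is read once; intermediates never leave the chip."""
    return n

n = 1 << 20                                 # a 1-Mpixel image
sp = sp_pass_traffic(n, 5)
tp = tp_pass_traffic(n, 5)
assert sp / tp > 1.6                        # ~1.7x for one pyramid pass
```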
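The chip-generator idea, capturing design parameters in one place and expanding them into synthesizable HDL for design-space exploration, can be illustrated with a toy template expander. The Verilog skeleton and all names below are purely hypothetical; the paper uses the Genesis generator framework, not this sketch.

```python
# Toy illustration of parameterized HDL generation: window size, pyramid
# levels, and pixel width are captured once and expanded into HDL text.
# The module skeleton is hypothetical, not the paper's Genesis output.

def generate_engine(levels=3, window=3, pix_bits=8):
    """Emit a (hypothetical) parameterized Verilog module header."""
    taps = window * window
    return "\n".join([
        f"module pyramid_engine #(",
        f"    parameter LEVELS   = {levels},",
        f"    parameter WINDOW   = {window},   // {window}x{window} convolution",
        f"    parameter PIX_BITS = {pix_bits}",
        f") (",
        f"    input  wire                 clk,",
        f"    input  wire [PIX_BITS-1:0]  pixel_in,",
        f"    output wire [PIX_BITS-1:0]  pixel_out",
        f");  // {taps}-tap window, {levels} time-shared levels",
    ])

hdl = generate_engine(levels=3, window=3)
assert "LEVELS   = 3" in hdl and "9-tap" in hdl
```

Sweeping `levels` and `window` over a grid of such calls is what enables the area/bandwidth design-space exploration reported in the evaluation.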