Hardware Image Signal Processing and Integration into Architectural Simulator for SoC Platform Hao Wang University of Wisconsin, Madison.

2 Outline
- Introduction on SoC
- Motivation
- Verilog implementation of JPEG encoder
- Integrated SoC simulator
- Future work

3 System-on-Chip Platform
Mobile computing: new driving force
- Smartphones, tablets
SoC: popular solution
- Qualcomm's Snapdragon, Samsung's Exynos
- General-purpose CPU, graphics processing, application-specific accelerators, modem, etc.

4 Resource Management on SoC
[Figure: schematic of Snapdragon SoC]

5 Resource Management on SoC
Memory bandwidth is the most critical shared resource on the SoC.
[Figure: shared memory channel]

6 Motivation
Heterogeneous system
- CPU: sensitive to memory latency
- GPU: high bandwidth demand, real-time deadlines
- DSP, multimedia processor: low response latency requirement
Key problem
- No architectural simulator available for SoC platforms
- Integrated CPU-GPU simulator:
Goal of this project
- Design a hardware JPEG encoder in Verilog
- Write an architectural model for the hardware encoder
- Integrate it into a CPU simulator (gem5) as one step toward building an architectural simulator for SoC platforms

7 JPEG Encoder (Verilog) Implementation
MATLAB generates the input matrix, which is read by the testbench.
8x8 blocks of 24-bit data are fed into the encoder, one pixel per clock cycle.
- An operand collector ensures the full block is ready
- This tolerates variable memory access latency
RGB to YCbCr conversion
DCT on 8x8 blocks
Quantization: multiply by (2^13/Qij), then right shift
DPCM and Huffman encoding for DC components
RLE and Huffman encoding for AC components
Bit streams from Y, Cb and Cr are combined into one output stream (temporal multiplexing)
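The quantization step replaces a division by Qij with a multiplication by a precomputed constant and a right shift. A minimal Python sketch of that idea follows. The function names, the rounding of the constant, and the sign handling are assumptions for illustration, not taken from the Verilog design.

```python
SHIFT = 13  # fractional bits, matching the 2^13 scaling on the slide


def reciprocal(q):
    """Precomputed constant round(2^13 / q) for one quantization table entry."""
    return ((1 << SHIFT) + q // 2) // q


def quantize(coeff, q):
    """Approximate round(coeff / q) with one multiply and one right shift."""
    sign = -1 if coeff < 0 else 1
    product = abs(coeff) * reciprocal(q)      # multiply by 2^13 / q
    rounded = product + (1 << (SHIFT - 1))    # add one half before shifting
    return sign * (rounded >> SHIFT)          # right shift removes the scaling


if __name__ == "__main__":
    for coeff, q in [(160, 16), (-160, 16), (100, 11)]:
        print(coeff, q, quantize(coeff, q))
```

Because the constant is itself rounded, the result can differ from exact division by one step for some inputs; a hardware design would pick the number of fractional bits with that error in mind.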

8 JPEG Encoder Result
Input: tif format, 768KB
Output: jpg format, 68KB

9 Synthesis Result & Throughput
Synopsys Design Compiler
TSMC 45nm general-purpose library, 800MHz
~1.0e7 blocks per sec

10 Simulator Integration
Difficult to find a standard
- Which hardware components to include?
- Low-level implementation details: pipelining, circuit design, etc.
Use Trimaran instead
- A widely used compilation/architecture infrastructure
- General VLIW/application-specific processor
- Configured to model a DSP processor
JPEG encoder on Trimaran
- Software implementation
- 9.16e7 cycles at 1GHz: 91.6ms (Verilog design: ~0.4ms)
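The timing figures on slides 9 and 10 can be cross-checked with a little arithmetic. The sketch below assumes the 768KB test image is 512x512 pixels at 24 bits per pixel (the transcript gives only the file size) and reads 9.16e7 as a cycle count. Both are assumptions.

```python
HW_CLOCK_HZ = 800e6        # synthesis clock from slide 9
HW_BLOCKS_PER_SEC = 1.0e7  # reported encoder throughput
SW_CYCLES = 9.16e7         # assumed to be the Trimaran cycle count
SW_CLOCK_HZ = 1e9

# Assumption: 512 x 512 RGB image, which is 768KB uncompressed.
WIDTH = HEIGHT = 512


def cycles_per_block():
    """Clock cycles the hardware spends on one 8x8 block."""
    return HW_CLOCK_HZ / HW_BLOCKS_PER_SEC


def hw_time_ms(width, height):
    """Encoding time for the hardware design, in milliseconds."""
    blocks = (width // 8) * (height // 8)
    return blocks / HW_BLOCKS_PER_SEC * 1e3


def sw_time_ms():
    """Encoding time for the software implementation, in milliseconds."""
    return SW_CYCLES / SW_CLOCK_HZ * 1e3


if __name__ == "__main__":
    print("cycles per block:", cycles_per_block())
    print("hardware ms:", hw_time_ms(WIDTH, HEIGHT))
    print("software ms:", sw_time_ms())
    print("speedup:", sw_time_ms() / hw_time_ms(WIDTH, HEIGHT))
```

Under these assumptions the hardware needs 80 cycles per block and about 0.41 ms for 4096 blocks, which agrees with the ~0.4ms quoted for the Verilog design.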

11 Simulator Integration
The two simulators still run as separate processes and communicate through a shared memory structure in the Linux OS.
Memory requests on the Trimaran side are fed to the CPU simulator (gem5) side, which simulates the DRAM timing and responds.
[Figure: gem5 (CPU) and Trimaran (DSP) coupled through shared memory. Labels: request queue, response queue, memory subsystem (M5), L2 cache, tick scheduler, clock tick set/reset.]
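The request/response flow can be illustrated with a single-process Python stand-in. This is only a sketch: the real setup uses two separate simulator processes and Linux shared memory, and every name here, the fixed DRAM latency, and the one-request-per-tick policy are hypothetical.

```python
from collections import deque

DRAM_LATENCY = 20  # hypothetical fixed latency, in ticks


class SharedChannel:
    """Stand-in for the shared memory structure holding both queues."""

    def __init__(self):
        self.requests = deque()   # (issue_tick, address) from the DSP side
        self.responses = deque()  # (ready_tick, address) from the memory side


def dsp_tick(channel, tick, pending, completed):
    """DSP side: issue at most one request, then collect ready responses."""
    if pending:
        channel.requests.append((tick, pending.pop(0)))
    while channel.responses and channel.responses[0][0] <= tick:
        ready_tick, address = channel.responses.popleft()
        completed.append((address, ready_tick))


def memory_tick(channel, tick):
    """Memory side: turn each request into a response with a ready time."""
    while channel.requests:
        issue_tick, address = channel.requests.popleft()
        channel.responses.append((issue_tick + DRAM_LATENCY, address))


def run(addresses, max_ticks=1000):
    """Advance both sides one tick at a time until all requests complete."""
    channel = SharedChannel()
    pending, completed = list(addresses), []
    total = len(pending)
    for tick in range(max_ticks):
        dsp_tick(channel, tick, pending, completed)
        memory_tick(channel, tick)
        if len(completed) == total:
            break
    return completed


if __name__ == "__main__":
    print(run([0x100, 0x140]))
```

Stepping both sides from one loop is a simplification that keeps their clocks aligned by construction; with two independent processes that alignment has to be enforced explicitly.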

12 Future Work
- Figure out how Trimaran simulates timing info
- Get lock-step execution done
- Figure out real-world usage scenario
- Real research – writing papers – graduate

13 THANK YOU!

14 BACKUP SLIDES

15 Some Details
RGB to YCbCr
- 24-bit in; 24-bit out
- Pipelined; 3 cycles: (1) mult, (2) sum, (3) rounding
DCT
- 8-bit in, pipelined; bit output
- Internal 32-bit
- Output_enable is set when input enable is unset, so an idle cycle is required between 8x8 blocks
Quantization
- 4 cycles: (1) latch in, (2) quantize, (3) buffer, (4) rounding
Huffman encoding
- DC calculated first, AC calculated in zigzag order
- 13 cycles in total inserted between 8x8 blocks
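The three-stage RGB to YCbCr pipeline (mult, sum, rounding) can be mirrored in integer arithmetic. The sketch below uses the standard JFIF conversion coefficients and 13 fractional bits. The transcript does not state the coefficients or the internal precision of the Verilog design, so both are assumptions.

```python
FRAC_BITS = 13           # assumed fixed-point precision
SCALE = 1 << FRAC_BITS
HALF = 1 << (FRAC_BITS - 1)

# Standard JFIF RGB -> YCbCr coefficients, scaled to integers.
ROWS = [
    (0.299, 0.587, 0.114),          # Y
    (-0.168736, -0.331264, 0.5),    # Cb
    (0.5, -0.418688, -0.081312),    # Cr
]
COEFFS = [[round(c * SCALE) for c in row] for row in ROWS]
OFFSETS = [0, 128, 128]


def rgb_to_ycbcr(r, g, b):
    """Convert one 24-bit RGB pixel to 24-bit YCbCr with integer math only."""
    result = []
    for row, offset in zip(COEFFS, OFFSETS):
        products = [c * v for c, v in zip(row, (r, g, b))]   # stage 1: mult
        total = sum(products) + (offset << FRAC_BITS)        # stage 2: sum
        value = (total + HALF) >> FRAC_BITS                  # stage 3: rounding
        result.append(min(255, max(0, value)))               # clamp to 8 bits
    return tuple(result)


if __name__ == "__main__":
    for pixel in [(255, 255, 255), (0, 0, 0), (255, 0, 0)]:
        print(pixel, rgb_to_ycbcr(*pixel))
```

The final clamp matters for saturated colors such as pure red, where the rounded Cr value would otherwise overflow 8 bits.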

16 Some Details
FIFO buffer
- Check for 0xFF in the bitstream and add a dummy 0x00 after it
- Append 0xFFD9 at the end
Post-processing
- MATLAB generates the JPEG header and standard Huffman table
- Then get the actual JPEG file
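The FIFO buffer step corresponds to JPEG byte stuffing followed by the end-of-image marker. A minimal Python sketch of the same operation on a byte string follows. The function name is hypothetical, and the hardware works on a stream rather than a complete buffer.

```python
def finalize_scan(data):
    """Stuff a 0x00 after every 0xFF byte, then append the EOI marker 0xFFD9."""
    out = bytearray()
    for byte in data:
        out.append(byte)
        if byte == 0xFF:
            out.append(0x00)  # dummy byte so 0xFF is not read as a marker
    out += b"\xFF\xD9"        # end-of-image marker
    return bytes(out)


if __name__ == "__main__":
    print(finalize_scan(bytes([0x12, 0xFF, 0x34])).hex())
```

Without the stuffed zero, a decoder would interpret an 0xFF inside the entropy-coded data as the start of a marker.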