Roman Kofman & Sergey Kleyman Neta Peled & Hillel Mendelson Supervisor: Mike Sumszyk Final Presentation of part A (Annual project)



- Project recap
- Data flow
- Blocks implementation
- Conclusions
- Project B: time table

The algorithm: Nonlinear Diffusion
- Uses an iterative numerical solution to solve the diffusion equation
Why use it for image processing?
- Image noise is smoothed
- Edges remain sharp

Original image

dt = 30, one iteration:
- Look at the edges (sharp!)
- Look at the hat (smoothed)

Difficulties with the semi-implicit model:
- Very complex design (Thomas), which makes real time almost impossible
- Transpose of the entire image
- Reverse-order loop
- Multiple memory accesses
So why use this model?
- Strong effect: good results after very few iterations
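To show where the reverse-order loop comes from, here is a minimal Python sketch of the standard Thomas algorithm for a tridiagonal system. It is a software illustration only, not the project's hardware pipeline; the function name and argument layout are assumptions made for this example.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system A x = rhs.

    lower[i] is A[i][i-1] (lower[0] is unused), diag[i] is A[i][i],
    upper[i] is A[i][i+1] (upper[-1] is unused).
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n

    # Forward sweep: eliminate the lower diagonal, first element to last.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom if i < n - 1 else 0.0
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Backward sweep: the reverse-order loop, last element to first.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x
```

The backward sweep consumes the intermediate values in the opposite order from the one in which the forward sweep produced them, so a streaming implementation has to buffer and reverse each line.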

[Block diagram: DVI IN, a lines pipe and a columns pipe (each with a "Thomas 3" block and M4K line-reverse write/read blocks), transpose stages T', DVI OUT]
How to implement T' in real time?

[Block diagram: full data path. DVI IN, Transpose (M-RAM write/read, DDRII T' write/read), rows pipe and columns pipe (each with a "Thomas 3" block and M4K line-reverse write/read blocks), frequency controllers 4F to F and F to 4F (DDRII write/read), DVI OUT]
Key points:
- Double buffers
- External memory
- Balanced channels
- Reduced frequency

AGENDA
- Internal memory blocks:
  - Addressing controller
  - Transpose
  - Line reverse
- External memory:
  - Double buffer on DDR
  - Up/down rate controller
- DVI synchronization

Addressing controller
Addressing method, first attempt: use a cache organization approach
- Fast: direct access to data in memory
- Easy to implement: no logic is needed for "translation"
- However, expensive: 10 bits is more than we need for column representation
[Figure: 15-bit address made of row, area and column fields; the field widths shown are 4 bits, 10 bits and 1 bit]

Addressing controller
- The first-attempt implementation requires 98 KB
- One M-RAM block is 64 KB
Solution:
- Use consecutive addressing: Address = block + row + phase
- Requires "translation", but the size is 61 KB, so it fits
[Figure: Quartus report]
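The size difference between the two addressing schemes can be reproduced with a small Python sketch. The frame parameters below (640 used columns, 16 rows, 2 areas, 3 bytes per pixel) are not stated in the slides; they are assumptions chosen because they are consistent with the quoted 98 KB and 61 KB figures. The field order and the packed formula are also illustrative and need not match the "block + row + phase" translation used in the project.

```python
COLS = 640            # assumed number of columns actually used per line
ROWS = 16             # assumed rows per area (fits a 4-bit row field)
AREAS = 2             # assumed number of areas (fits a 1-bit area field)
BYTES_PER_PIXEL = 3   # assumed 24-bit pixels
COL_BITS, ROW_BITS, AREA_BITS = 10, 4, 1   # field widths, 15 bits in total


def padded_address(area, row, col):
    """Cache-style address: the fields are concatenated, no arithmetic needed."""
    return (area << (ROW_BITS + COL_BITS)) | (row << COL_BITS) | col


def consecutive_address(area, row, col):
    """Packed address: needs a multiply-add "translation" step."""
    return (area * ROWS + row) * COLS + col


# Memory needed when every 15-bit address must exist.
padded_bytes = (1 << (AREA_BITS + ROW_BITS + COL_BITS)) * BYTES_PER_PIXEL
# Memory needed when only the used locations exist.
packed_bytes = AREAS * ROWS * COLS * BYTES_PER_PIXEL
MRAM_BYTES = 64 * 1024
```

Under these assumptions the padded layout wastes the 384 unused column codes in every row, which is what pushes it past a single M-RAM block.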

Addressing controller Address translation units

Transpose
Goal:
- Write the transposed data so that it can later be read sequentially, in rows
Problem:
- Random access in DDR is too expensive: 32 clk penalty!
Solution:
- Use internal memory to reorder the data: "pay" most of the penalty in random accesses to FPGA memory
- Write to DDR in "windows": this enables sequential row writes, with a penalty only at every row skip

Transpose: how it works
[Diagram: M-RAM write, M-RAM read, DDRII T' write, DDRII T' read. Annotations: penalty all the time; penalty every row skip; sequential read from DDR]
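The windowed transpose can be sketched in Python as follows. This is a behavioural model under simplifying assumptions (the image height is a multiple of the window size, and plain lists stand in for the M-RAM and DDR buffers); the function name and the jump counter are hypothetical and only serve to make the cost visible.

```python
def transpose_via_windows(image, window_rows):
    """Transpose `image` (a list of rows) using a small internal buffer.

    Returns the flat external memory holding the transposed image and the
    number of address jumps (non-sequential writes) made to it.
    """
    height, width = len(image), len(image[0])
    assert height % window_rows == 0
    external = [None] * (height * width)   # stands in for the DDR buffer
    jumps = 0
    for r0 in range(0, height, window_rows):
        # The internal memory holds `window_rows` input rows at a time.
        internal = image[r0:r0 + window_rows]
        for c in range(width):
            # Random (column-wise) reads hit the internal memory only.
            window = [internal[r][c] for r in range(window_rows)]
            # One sequential burst into row c of the transposed image.
            start = c * height + r0
            external[start:start + window_rows] = window
            jumps += 1
    return external, jumps
```

Writing pixel by pixel would cost one jump per pixel; with a window of `window_rows` pixels the external memory sees one jump per window, and the later read of the transposed image is fully sequential.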

Reverse Line Order
- Used for the Thomas algorithm
Implementation:
- On M4K blocks
- Double-sized buffer with alternating pointers for read/write
[Figure: read and write halves of the buffer, with a "swap addresses" step]
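A minimal Python model of the double-sized line buffer is shown below. It assumes one pixel is written and one is read per step, and the class and method names are made up for this sketch; the real block is built on M4K memories.

```python
class LineReverser:
    """Double-sized buffer: one half is written while the other is read."""

    def __init__(self, line_length):
        self.n = line_length
        self.mem = [None] * (2 * line_length)
        self.write_base = 0             # half currently being written
        self.read_base = line_length    # half currently being read

    def process_line(self, line):
        """Write `line` forwards and return the previous line reversed."""
        out = []
        for i, pixel in enumerate(line):
            # The read pointer walks backwards through the other half.
            out.append(self.mem[self.read_base + self.n - 1 - i])
            # The write pointer walks forwards.
            self.mem[self.write_base + i] = pixel
        # Swap the base addresses at the end of the line.
        self.write_base, self.read_base = self.read_base, self.write_base
        return out
```

The output lags the input by one line, and the first call returns placeholder values because nothing has been written yet.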

External memory: double buffer on DDR

- We need very large double buffers that can be integrated easily with FPGA designs
- The FPGA is resource limited
- Solution: use external memory for this purpose

- Enables efficient use of the memory on the GiDEL PROC board
- Up to 16 ports per bank, 2 banks per FPGA
- Each port may be forced to access a different memory area and limited to a certain address space
- Straightforward random memory access with random ports: slow and inefficient
- Segmented working mode option for sequential ports, which enables fast read/write bursts

- Two ports: sequential read and sequential write. Each accesses a different memory area.
- The double buffer is implemented by switching the starting address at the end of every burst.
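The address switching can be modelled with a short Python sketch. It assumes equal-length write and read bursts and uses a plain list in place of the DDR bank; the class name and method are hypothetical and do not reflect the GiDEL multi-port interface.

```python
class DdrDoubleBuffer:
    """Two memory areas; the write and read ports swap areas after each burst."""

    def __init__(self, burst_length):
        self.burst_length = burst_length
        self.memory = [None] * (2 * burst_length)
        self.write_start = 0               # start address of the write port
        self.read_start = burst_length     # start address of the read port

    def transfer_burst(self, data):
        """Write one burst sequentially and read the previous one back."""
        assert len(data) == self.burst_length
        end = self.read_start + self.burst_length
        read_back = self.memory[self.read_start:end]
        end = self.write_start + self.burst_length
        self.memory[self.write_start:end] = data
        # Switch the starting addresses at the end of the burst.
        self.write_start, self.read_start = self.read_start, self.write_start
        return read_back
```

Because the two ports never touch the same area during a burst, both can stay strictly sequential.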

[Block diagram: pipeline design connected to the multi-port core through our entity with controller, using control signals, a write sequential port and a read sequential port. Clocks shown: fixed CLK and external DVI CLK, marked as the PROBLEM]

- Add FIFOs to implement data rate matching.
- Altera provides a dual-clock FIFO (DCFIFO) megafunction. Using it before and after each write/read port would solve the problem.
- Control logic is integrated into the control entity.
- Extra FIFOs = extra FPGA resources

[Block diagram: solution. The pipeline design connects to the multi-port core through our entity with controller, using control signals, a write sequential port and a read sequential port. Clock domains shown: DVI clk and Multi clk]

Buffer controller schema
[State diagram: Reset, Prepare for read/write, Read/write, Flush]
- Follows the DDR protocol, including wait states
- Symmetric read/write bursts according to the FIFO states
- Burst length can be adjusted

- Problem: data is written to DDR only when the internal DDR FIFO is full.
- Solution: Flush forces the FIFO to pass data. Not using the accurate flush length results in image noise!
- Problem: the flush delay length is not constant and depends on the burst length.
- Solution: stretch write bursts until the FIFO is almost full. This lowers the influence of the flush.

Fixed controller schema
[State diagram: Reset, Prepare for read/write, Read/write, Flush. Condition shown: internal FIFO is almost full]

- Up to 8 buffers per memory bank
- Must comply with bandwidth restrictions (MultiPort utilization)
- Integration effort

External memory: up/down rate controller

- In the original design, the down rate used internal memory. However, the needed FIFO will not fit on the FPGA.
- The implementation is based on the DDR buffer with asymmetric read/write
- Extra DDR access
- The input and output DCFIFOs are asymmetric in size
- Full data path
- The down-rate buffer saves only 1 frame out of 4 to DDR
- The up-rate buffer reads the same frame from DDR 4 times
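The effect of the two rate buffers on a frame sequence can be illustrated in Python. This sketch models only which frames are kept or repeated, not the DDR or DCFIFO behaviour; the function names are assumptions made for the example.

```python
RATE = 4  # frame-rate ratio taken from the slides (4F to F, F to 4F)


def down_rate(frames, ratio=RATE):
    """Keep 1 frame out of every `ratio`; only these frames are stored."""
    return [frame for index, frame in enumerate(frames) if index % ratio == 0]


def up_rate(frames, ratio=RATE):
    """Emit every stored frame `ratio` times."""
    return [frame for frame in frames for _ in range(ratio)]
```

Chaining the two keeps the output frame count equal to the input frame count, while the processing in between runs at a quarter of the frame rate.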

[State diagrams: Re/Wr sync controller. States shown: reset, prepare for write, prepare for read, read, write, read/write, flush]

DVI synchronization

[Block diagram: DVI rx (24 data bits, hsync, vsync, data enable, clk) feeds the FPGA, which contains the DVI in controller, mux, flag frame, data path with memory access, flag detector, signal generation (gen hsync, gen vsync, gen de), PLL and a 24-bit to 12-bit double-rate stage; the FPGA drives DVI tx (12 bits, clk)]
- The sync signals must pass through the same long delays as the data
- Extra bits written to memory

[Block diagram: DVI rx, DVI in controller, mux, flag frame, data path with memory access, flag detector, signal generation, DVI tx]
- Send a known flag through the data path
- Start generating the sync signals according to the flag arrival
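The flag idea can be illustrated with a small Python sketch: the detector does not need to know the latency of the data path, because it aligns itself to the arrival of the flag. The flag value, the helper names and the list-based delay model are all hypothetical, and the sketch assumes the flag value never appears in the fill words or in the image data.

```python
FLAG = 0xABCDEF  # hypothetical 24-bit marker value


def delayed(stream, latency, fill=0):
    """Model the data path: the output is the input shifted by `latency` words."""
    return [fill] * latency + list(stream)


def regenerate_frame_start(received, flag=FLAG):
    """Return the index of the first data word, found from the flag arrival."""
    for index, word in enumerate(received):
        if word == flag:
            return index + 1   # the data starts right after the flag
    return None                # the flag has not been seen yet
```

Whatever latency the memory path introduces, the regenerated start index lands on the first data word, so the sync signals do not have to be stored alongside the pixels.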

[Block diagram: data path starting from the 4F to F frequency controller, showing a 48-bit path and a Delay block (M-RAM write/read) alongside the Transpose and pipe blocks]

[Block diagram: full data path with both frequency controllers (4F to F, F to 4F), the 48-bit path and the Delay block (M-RAM write/read)]

Summary
- Internal memory blocks:
  - Addressing controller
  - Transpose
  - Line reverse
- External memory:
  - Double buffer on DDR
  - Up/down rate controller
- DVI synchronization

- Problem with the board's RESET
- Problem with loading the design

- Plan and implement logic blocks:
  - SQRT and DIV are the main problem
  - Verify the required precision (based on our conclusions from part A)
- Integration of frequency controllers and transpose blocks
- Implement one full iteration

Divide the work between 2 problems:
- Design of logic blocks
- Full DDR blocks integration
How?
- Implement the processing algorithm for a smaller frame, which avoids using external memory

[Block diagram: DVI IN, sample smaller frame, M-RAM write/read buffers, logic blocks, DVI OUT]

Project B goal: create an end-to-end data path with image processing