COE 405 Programmable Logic and Storage Devices


COE 405 Programmable Logic and Storage Devices Dr. Aiman H. El-Maleh Computer Engineering Department King Fahd University of Petroleum & Minerals

Outline
  History of Computational Fabrics
  ASIC vs. FPGA
  Reconfigurable Logic
  Anti-Fuse-Based Approach (Actel)
  RAM Based Field Programmable Logic (Xilinx)
  CLBs
  Carry & Control Logic
  FPGA Memory Implementation

History of Computational Fabrics
  Discrete devices: relays, transistors (1940s-50s)
  Discrete logic gates (1950s-60s)
  Integrated circuits (1960s-70s)
    e.g. TTL packages: data book for 100's of different parts
  Gate arrays (IBM, 1970s)
    Transistors are pre-placed on the chip and place-and-route software puts the chip together automatically – only the interconnect is programmed (mask programming)
  Software-based schemes (1970s to present)
    Run instructions on a general-purpose core

History of Computational Fabrics
  ASIC design (1980s to present)
    Turn Verilog directly into layout using a library of standard cells
    Effective for high-volume production and efficient use of silicon area
  Programmable logic (1980s to present)
    A chip that can be reprogrammed after it has been fabricated
    Examples: PALs, PLAs, EPROM, EEPROM, PLDs, FPGAs
    Excellent support for mapping from Verilog

What is an FPGA?
  A field-programmable gate array (FPGA) is a reprogrammable silicon chip.
  Using prebuilt logic blocks and programmable routing resources, you can configure these chips to implement custom hardware functionality without ever having to pick up a breadboard or soldering iron.
  You develop digital computing tasks in software and compile them down to a configuration file, or bitstream, that contains information on how the components should be wired together.

ASIC vs. FPGA
  ASIC (Application Specific Integrated Circuit)
    Designs must be sent for expensive and time-consuming fabrication in a semiconductor foundry
    Designed all the way from behavioral description to physical layout
  FPGA (Field Programmable Gate Array)
    Bought off the shelf and reconfigured by designers themselves
    No physical layout design; design ends with a bitstream used to configure a device

ASIC vs. FPGA
  ASICs
    High performance
    Low power
    Low cost in high volumes
  FPGAs
    Off-the-shelf
    Low development cost
    Short time to market
    Reconfigurability

Other FPGA Advantages
  The manufacturing cycle for an ASIC is very costly and lengthy, and engages lots of manpower
  Mistakes not detected at design time have a large impact on development time and cost
  FPGAs are perfect for rapid prototyping of digital circuits
  Easy upgrades, as in the case of software
  FPGAs provide a flexible platform for implementing digital computing
  A rich set of macros and I/Os is supported (multipliers, block RAMs, ROMs, high-speed I/O)
  A wide range of applications, from prototyping (to validate a design before ASIC mapping) to high-performance spatial computing

How are FPGAs Used?
  Prototyping
    Ensemble of gate arrays used to emulate a circuit to be manufactured
    Get more/better/faster debugging done than with simulation
  Reconfigurable hardware
    One hardware block used to implement more than one function
  Special-purpose computation engines
    Hardware dedicated to solving one problem (or class of problems)
    Accelerators attached to general-purpose computers (e.g., in a cell phone!)

Major FPGA Vendors
  SRAM-based FPGAs
    Xilinx, Inc. and Altera Corp. (together share over 60% of the market)
    Atmel
    Lattice Semiconductor
  Flash & antifuse FPGAs
    Actel Corp.
    QuickLogic Corp.

Reconfigurable Logic

Anti-Fuse-Based Approach (Actel)

Actel Logic Module Example Gate Mapping Combinational Block S-R Latch

Actel Routing & Programming

RAM Based Field Programmable Logic - Xilinx

Xilinx FPGA Families
  Old families
    XC3000, XC4000, XC5200
    Old 0.5µm, 0.35µm and 0.25µm technology. Not recommended for modern designs.
  High-performance families
    Virtex (0.22µm)
    Virtex-E, Virtex-EM (0.18µm)
    Virtex-II, Virtex-II PRO (0.13µm)
    Virtex-4 (0.09µm)
  Low-cost family
    Spartan/XL – derived from XC4000
    Spartan-II – derived from Virtex
    Spartan-IIE – derived from Virtex-E
    Spartan-3

FPGA Nomenclature

Device Part Marking

The Xilinx 4000 CLB

Two 4-input Functions, Registered Output and a Two Input Function

5-input Function, Combinational Output

5-Input Functions implemented using two LUTs

LUT Mapping
  An N-LUT is a direct implementation of a truth table: it can realize any function of N inputs
  An N-LUT requires 2^N storage elements (latches)
  The N inputs select one latch location (like a memory address)
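
The truth-table view of a LUT can be sketched in a few lines of Verilog. This is only a behavioral illustration, not a vendor primitive: the module names `lut4` and `parity4` and the parameter `INIT` are assumptions made for this example.

```verilog
// Hypothetical behavioral model of a 4-input LUT (not a Xilinx primitive).
// INIT holds the truth table: bit i is the output value when in == i.
module lut4 #(parameter [15:0] INIT = 16'h0000) (
    input  wire [3:0] in,    // the 4 inputs act as a memory address
    output wire       out
);
    wire [15:0] table_bits = INIT;   // 2^4 = 16 stored bits
    assign out = table_bits[in];     // inputs select one stored bit
endmodule

// Example use: 4-input XOR (odd parity).
// Bit i of 16'h6996 is 1 exactly when i has an odd number of 1s.
module parity4 (
    input  wire [3:0] x,
    output wire       p
);
    lut4 #(.INIT(16'h6996)) u_lut (.in(x), .out(p));
endmodule
```

Changing only the `INIT` constant turns the same structure into any other 4-input function, which is why the LUT is the basic logic element.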

Configuring the CLB as a RAM

Xilinx 4000 Interconnect

Xilinx 4000 Interconnect Details

Xilinx 4000 Flexible IOB

Basic I/O Block Structure

IOB Functionality
  The IOB provides the interface between the package pins and the CLBs
  Each IOB can work as a uni- or bi-directional I/O
  Outputs can be forced into high impedance
  Inputs and outputs can be registered (advised for high-performance I/O)
  Inputs can be delayed

Additional Features in Modern FPGAs

Spartan-3 Xilinx FPGA Block Diagram

CLB Structure

CLB Slice Structure
  Each slice contains two sets of the following:
    Four-input LUT
      Any 4-input logic function,
      or 16-bit x 1 sync RAM,
      or 16-bit shift register
    Carry & Control
      Fast arithmetic logic
      Multiplier logic
      Multiplexer logic
    Storage element
      Latch or flip-flop
      Set and reset
      True or inverted inputs
      Sync. or async. control

Xilinx Multipurpose LUT (MLUT) 16 x 1 ROM (logic)

5-Input Functions implemented using two LUTs
  One CLB slice can implement any function of 5 inputs
  The logic function is partitioned between two LUTs
  The F5 multiplexer selects which LUT drives the output
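
The partitioning is a Shannon expansion on the fifth input: one LUT holds the truth table for that input equal to 0, the other for it equal to 1, and a multiplexer picks between them. A minimal behavioral sketch follows; the module name `func5` and the parameters `F_INIT` and `G_INIT` are illustrative assumptions, not Xilinx names.

```verilog
// Sketch: any 5-input function split by Shannon expansion on a[4].
// F_INIT is the truth table for a[4] = 0, G_INIT for a[4] = 1.
module func5 #(
    parameter [15:0] F_INIT = 16'h0000,
    parameter [15:0] G_INIT = 16'h0000
) (
    input  wire [4:0] a,
    output wire       y
);
    wire [15:0] f_table = F_INIT;
    wire [15:0] g_table = G_INIT;

    wire f_out = f_table[a[3:0]];    // first 4-input LUT
    wire g_out = g_table[a[3:0]];    // second 4-input LUT

    assign y = a[4] ? g_out : f_out; // models the F5 multiplexer
endmodule
```

For example, a 5-input AND would use `F_INIT = 16'h0000` and `G_INIT = 16'h8000`: the output is 1 only when `a[4]` is 1 and the lower four inputs are all 1.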

Distributed RAM
  CLB LUT configurable as distributed RAM
    A LUT equals a 16x1 RAM
    Cascade LUTs to increase RAM size
  Synchronous write
  Synchronous/asynchronous read
    Accompanying flip-flops are used for synchronous read
  Two LUTs can make
    32 x 1 single-port RAM
    16 x 2 single-port RAM
    16 x 1 dual-port RAM
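
A 16 x 1 dual-port memory with synchronous write and asynchronous read can be described behaviorally as below. This is a sketch under assumptions: the module name `dp_dist_ram16x1` is made up for the example, the port names loosely mirror the RAM16X1D primitive, and whether a given synthesis tool maps this description onto two LUTs depends on the tool and its settings.

```verilog
// Behavioral sketch of a 16x1 dual-port distributed RAM.
// One write port (address a) and two asynchronous read ports (a and dpra).
module dp_dist_ram16x1 (
    input  wire       wclk,   // write clock
    input  wire       we,     // write enable
    input  wire       d,      // write data
    input  wire [3:0] a,      // write address, also read address for spo
    input  wire [3:0] dpra,   // independent read address for dpo
    output wire       spo,    // single-port output
    output wire       dpo     // dual-port output
);
    reg mem [0:15];           // 16 one-bit locations (uninitialized in simulation)

    always @(posedge wclk)
        if (we) mem[a] <= d;  // synchronous write

    assign spo = mem[a];      // asynchronous reads
    assign dpo = mem[dpra];
endmodule
```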

Shift Register
  Each LUT can be configured as a shift register
    Serial in, serial out
  Dynamically addressable delay of up to 16 cycles
  For programmable pipelines
  Cascade for greater cycle delays
  Use CLB flip-flops to add depth
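
A behavioral sketch of such an addressable shift register is shown here. The module name `srl16_sketch` and its port names are assumptions for illustration; synthesis tools typically map a reset-free shift register of this shape onto a LUT, but that is tool-dependent.

```verilog
// Sketch of a 16-stage shift register with a dynamically selectable tap.
// With tap address a, the output is the input delayed by a + 1 clock cycles.
module srl16_sketch (
    input  wire       clk,
    input  wire       ce,    // shift enable
    input  wire       d,     // serial input
    input  wire [3:0] a,     // tap select (0 to 15)
    output wire       q      // serial output at the selected tap
);
    reg [15:0] sr = 16'h0000;

    always @(posedge clk)
        if (ce) sr <= {sr[14:0], d};   // shift by one position; no reset

    assign q = sr[a];                   // addressable tap
endmodule
```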

Shift Register
  Register-rich FPGA
    Allows for addition of pipeline stages to increase throughput
  Data paths must be balanced to keep desired functionality
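
To illustrate what balancing means, the sketch below pipelines y = a*b + c in two stages. Because the product is registered before the addition, c must be delayed by the same one cycle, otherwise the adder would combine operands from different input samples. The module and signal names are hypothetical.

```verilog
// Sketch: two-stage pipeline computing y = a*b + c with balanced data paths.
module balanced_pipe (
    input  wire        clk,
    input  wire [7:0]  a,
    input  wire [7:0]  b,
    input  wire [7:0]  c,
    output reg  [16:0] y
);
    reg [15:0] prod_q;   // stage 1: registered product
    reg [7:0]  c_q;      // stage 1: c delayed to stay aligned with prod_q

    always @(posedge clk) begin
        prod_q <= a * b;          // stage 1
        c_q    <= c;              // matching delay on the other path
        y      <= prod_q + c_q;   // stage 2: operands from the same sample
    end
endmodule
```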

Carry & Control Logic

Fast Carry Logic
  Each CLB contains separate logic and routing for the fast generation of sum & carry signals
    Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
  Carry logic is independent of normal logic and routing resources
  All major synthesis tools can infer carry logic for arithmetic functions
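
In practice this means arithmetic is written with ordinary operators and the tool chooses the carry chain. The following adder/subtractor is a minimal sketch; the module name `add_sub` and its ports are assumptions for illustration, and the actual mapping depends on the synthesis tool.

```verilog
// Sketch: parameterized adder/subtractor written with the + operator,
// a form from which synthesis tools commonly infer dedicated carry logic.
module add_sub #(parameter WIDTH = 8) (
    input  wire [WIDTH-1:0] a,
    input  wire [WIDTH-1:0] b,
    input  wire             sub,     // 0: a + b, 1: a - b
    output wire [WIDTH-1:0] result,
    output wire             cout     // carry out (for subtraction, 1 means no borrow)
);
    // Two's complement subtraction: a - b = a + ~b + 1
    assign {cout, result} = a + (b ^ {WIDTH{sub}}) + sub;
endmodule
```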

The Virtex II CLB (Half Slice Shown)

Adder Implementation

Carry Chain

New 18 x 18 Embedded Multiplier
  Embedded 18-bit x 18-bit multiplier
    2's complement signed operation
  Multipliers are organized in columns
  Fast arithmetic functions
    Optimized to implement multiply/accumulate modules
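
A multiply/accumulate unit built around one signed 18 x 18 product can be sketched as follows. The module name `mac18`, the synchronous clear, and the 48-bit accumulator width are assumptions chosen for the example, not properties of the device; whether the `*` operator lands in an embedded multiplier is decided by the synthesis tool.

```verilog
// Sketch: signed 18x18 multiply-accumulate.
module mac18 (
    input  wire               clk,
    input  wire               clr,   // synchronous clear of the accumulator
    input  wire signed [17:0] a,
    input  wire signed [17:0] b,
    output reg  signed [47:0] acc    // accumulator width is an arbitrary choice
);
    wire signed [35:0] prod = a * b; // 18 x 18 signed product is 36 bits wide

    always @(posedge clk)
        if (clr) acc <= 48'sd0;
        else     acc <= acc + prod;  // prod is sign-extended to 48 bits
endmodule
```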

Design Flow – Mapping
  Technology mapping: schematic/HDL to physical logic units
  Compile functions into basic LUT-based groups (function of target architecture)

Design Flow – Placement & Route
  Placement – assign each logic unit a location on a particular device
  Routing – iterative process to connect CLB inputs/outputs and IOBs
    Optimizes critical path delay – can take hours or days for large, dense designs
  Challenge! Cannot use the full chip at reasonable speeds (wires are not ideal)
    Typically no more than 50% utilization

Example: Verilog to FPGA

Memory Types

Single-Port and Dual-Port Distributed RAM

LUT-Based RAMS

LUT-Based RAM Modules

Instantiating LUT-Based RAM Modules

module IMemory(output O, input A3, A2, A1, A0, D, WE, WCLK);
  // RAM initialization ("0" by default) for functional simulation:
  defparam U_RAM16X1S.INIT = 16'hC2F5; // Add0=1; Add1=0; Add2=1; Add3=0

  // Distributed RAM instantiation
  RAM16X1S U_RAM16X1S (
    .D(D),       // insert input signal
    .WE(WE),     // insert Write Enable signal
    .WCLK(WCLK), // insert Write Clock signal
    .A0(A0),     // insert Address 0 signal
    .A1(A1),     // insert Address 1 signal
    .A2(A2),     // insert Address 2 signal
    .A3(A3),     // insert Address 3 signal
    .O(O)        // insert output signal
  );
endmodule

Instantiating LUT-Based RAM Modules

defparam U_RAM16X1D.INIT = 16'h0000;

// Distributed SelectRAM instantiation
RAM16X1D U_RAM16X1D (
  .D(),     // insert input signal
  .WE(),    // insert Write Enable signal
  .WCLK(),  // insert Write Clock signal
  .A0(),    // insert Address 0 signal port SPO
  .A1(),    // insert Address 1 signal port SPO
  .A2(),    // insert Address 2 signal port SPO
  .A3(),    // insert Address 3 signal port SPO
  .DPRA0(), // insert Address 0 signal port DPO
  .DPRA1(), // insert Address 1 signal port DPO
  .DPRA2(), // insert Address 2 signal port DPO
  .DPRA3(), // insert Address 3 signal port DPO
  .SPO(),   // insert output signal
  .DPO()    // insert output signal
);

Example of Inferred Memory

module Memory_Unit #(parameter word_size=8, address_size=4, memory_size=16)(
  output [word_size-1:0] data_out,
  input [word_size-1:0] data_in,
  input [address_size-1:0] address,
  input clk, write);

  reg [word_size-1:0] memory[memory_size-1:0];

  // Asynchronous read
  assign data_out = memory[address];

  // Synchronous write
  always @ (posedge clk)
    if (write) memory[address] <= data_in;
endmodule

Synthesizing Unit <Memory_Unit>.
  Related source file is "Memory.v".
  Found 16x8-bit single-port RAM <Mram_memory> for signal <memory>.
  Summary: inferred 1 RAM(s).

HDL Synthesis Report
Macro Statistics
# RAMs : 1
  16x8-bit single-port distributed RAM : 1

Example of Inferred Memory

module Memory_Unit #(parameter word_size=8, address_size=2, memory_size=4)(
  output [word_size-1:0] data_out,
  input [word_size-1:0] data_in,
  input [address_size-1:0] address,
  input clk, write);

  reg [word_size-1:0] memory[memory_size-1:0];

  // Initial memory contents
  initial begin
    memory[0]=1;
    memory[1]=2;
    memory[2]=3;
    memory[3]=4;
  end

  // Asynchronous read
  assign data_out = memory[address];

  // Synchronous write
  always @ (posedge clk)
    if (write) memory[address] <= data_in;
endmodule

Block RAM
  Most efficient memory implementation
    Dedicated blocks of memory
  Ideal for most memory requirements
    4 to 104 memory blocks
    18 kbits = 18,432 bits per block (16 k without parity bits)
    Use multiple blocks for larger memories
  Builds both single and true dual-port RAMs
  Synchronous write and read (different from distributed RAM)

Block RAM Support of two independent 9 Kb blocks, or a single 18 Kb block RAM. Simple or true dual-port mode can be used. Simple dual-port mode is defined as having one read-only port and one write-only port with independent clocks. 18 or 36-bit wide ports can have an individual write enable per byte. This feature is popular for interfacing to an on-chip microprocessor. All inputs are registered with the port clock and have a setup-to-clock timing specification.

Block RAM
  A write operation requires one clock edge.
  A read operation requires one clock edge.
  All output ports are latched. The state of the output port does not change until the port executes another read or write operation.
  The default block RAM output is latch mode.
  The output data path has an optional internal pipeline register. Using the register mode is strongly recommended: it allows a higher clock rate, at the cost of one additional clock cycle of latency.
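
The difference between the latch-mode output and the optional pipeline register can be sketched behaviorally as below. This is an assumption-laden illustration, not the vendor's model: the module name `bram_sp`, the parameter `OUT_REG`, and the port names are made up, and the read shown returns the old contents on a write, which is only one of the possible write behaviors.

```verilog
// Sketch: single-port synchronous RAM with an optional output register.
// OUT_REG = 0: data appears one clock edge after the address (latch mode).
// OUT_REG = 1: data appears two clock edges after the address (register mode).
module bram_sp #(
    parameter DATA_W  = 8,
    parameter ADDR_W  = 10,
    parameter OUT_REG = 1
) (
    input  wire              clk,
    input  wire              en,     // port enable
    input  wire              we,     // write enable
    input  wire [ADDR_W-1:0] addr,
    input  wire [DATA_W-1:0] din,
    output wire [DATA_W-1:0] dout
);
    reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];
    reg [DATA_W-1:0] rd_latch;   // output of the synchronous read
    reg [DATA_W-1:0] rd_reg;     // optional pipeline register

    always @(posedge clk)
        if (en) begin
            if (we) mem[addr] <= din;   // write takes one clock edge
            rd_latch <= mem[addr];      // read takes one clock edge (old data on a write)
        end

    always @(posedge clk)
        rd_reg <= rd_latch;             // adds one cycle of latency

    assign dout = OUT_REG ? rd_reg : rd_latch;
endmodule
```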

Block RAM

Block RAM Logic Diagram

Block RAM Data Combinations and ADDR Locations

Block RAM Port Aspect Ratios

Dual-Port Bus Flexibility Each port can be configured with a different data bus width Provides easy data width conversion without any additional logic

Simple Dual-Port Mode Allowed Combinations for 9 Kb Block RAM

True Dual-Port Mode Allowed Combinations for 9 Kb Block RAM

18 Kb Block RAM—True Dual-Port Operation

Read & Write Operations
  Read Operation
    In latch mode, the read operation uses one clock edge.
    The read address is registered on the read port, and the stored data is loaded into the output latches after the RAM access time.
    When using the output register, the read operation takes one extra latency cycle to arrive at the output.
  Write Operation
    A write operation is a single clock-edge operation.
    The write address is registered on the write port, and the data input is stored in memory.

Write Modes Three settings of the write mode determine the behavior of the data available on the output latches after a write clock edge: WRITE_FIRST, READ_FIRST, and NO_CHANGE. The Write mode attribute can be individually selected for each port. The default mode is WRITE_FIRST. WRITE_FIRST outputs the newly written data onto the output bus. READ_FIRST outputs the previously stored data while new data is being written. NO_CHANGE maintains the output previously generated by a read operation.

WRITE_FIRST or Transparent Mode (Default) In WRITE_FIRST mode, the input data is simultaneously written into memory and stored in the data output (transparent write).

READ_FIRST or Read-Before-Write Mode In READ_FIRST mode, data previously stored at the write address appears on the output latches, while the input data is being stored in memory (read before write).

NO_CHANGE Mode In NO_CHANGE mode, the output latches remain unchanged during a write operation.
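
The three write modes differ only in what reaches the output on a write cycle, which a behavioral sketch makes explicit. The module name `bram_write_modes` and the numeric `MODE` encoding are assumptions for this example; in a real design the mode is set through the block RAM's write-mode attribute or inferred by the tool from the coding style.

```verilog
// Sketch: output behavior of WRITE_FIRST, READ_FIRST and NO_CHANGE.
// MODE: 0 = WRITE_FIRST, 1 = READ_FIRST, 2 = NO_CHANGE (illustrative encoding).
module bram_write_modes #(
    parameter DATA_W = 8,
    parameter ADDR_W = 4,
    parameter MODE   = 0
) (
    input  wire              clk,
    input  wire              en,
    input  wire              we,
    input  wire [ADDR_W-1:0] addr,
    input  wire [DATA_W-1:0] din,
    output reg  [DATA_W-1:0] dout
);
    reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

    always @(posedge clk)
        if (en) begin
            if (we) begin
                mem[addr] <= din;
                case (MODE)
                    0: dout <= din;         // WRITE_FIRST: new data on the output
                    1: dout <= mem[addr];   // READ_FIRST: previously stored data
                    default: ;              // NO_CHANGE: output keeps its value
                endcase
            end else begin
                dout <= mem[addr];          // ordinary synchronous read
            end
        end
endmodule
```

The READ_FIRST branch relies on nonblocking assignment semantics: the right-hand side `mem[addr]` is sampled before the write to `mem` takes effect.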

Conflict Avoidance
  Block RAM is a true dual-port RAM where both ports can access any memory location at any time.
  When accessing the same memory location from both ports, the user must, however, observe certain restrictions.
  There are no timing restrictions when both ports perform a read operation.
  When one port performs a write operation, the other port must not read or write the exact same memory location.

Spartan-3 Block RAM Amounts

Spartan-3 FPGA Family Members

Virtex-II 1.5V Architecture

Virtex-II 1.5V

Device     CLB Array  Slices   Maximum I/O  BlockRAM (18 kb) and Multiplier Blocks (each)  Distributed RAM bits
XC2V40     8x8        256      88           4                                              8,192
XC2V80     16x8       512      120          8                                              16,384
XC2V250    24x16      1,536    200          24                                             49,152
XC2V500    32x24      3,072    264          32                                             98,304
XC2V1000   40x32      5,120    432          40                                             163,840
XC2V1500   48x40      7,680    528          48                                             245,760
XC2V2000   56x48      10,752   624          56                                             344,064
XC2V3000   64x56      14,336   720          96                                             458,752
XC2V4000   80x72      23,040   912          -                                              737,280
XC2V6000   96x88      33,792   1,104        144                                            1,081,344
XC2V8000   112x104    46,592   1,108        168                                            1,490,944

Using Core Generator

Single Port BRAM

Dual Port BRAM

Distributed RAM

Memory Initialization

; ******************************************************************
; ********* Example of Dual Port Block Memory .COE file **********
; ******************************************************************
; Sample memory initialization file for Dual Port Block
; Memory. This .COE file specifies the contents for a block
; memory of depth=16, and width=4. In this case, values
; are specified in binary format.
memory_initialization_radix=2;
memory_initialization_vector=
1111,
1111,
1111,
1111,
1111,
0000,
0101,
0011,
0000,
1111,
1111,
1111,
1111,
1111,
1111,
1111;

Memory Initialization

; ******************************************************************
; ******** Example of Single Port Block Memory .COE file *********
; ******************************************************************
; Sample memory initialization file for Single Port Block
; Memory, v3.0 or later. This .COE file specifies
; initialization values for a block memory of depth=16, and
; width=8. In this case, values are specified in hexadecimal
; format.
memory_initialization_radix=16;
memory_initialization_vector=
ff,
ab,
f0,
11,
11,
00,
01,
aa,
bb,
cc,
dd,
ef,
ee,
ff,
00,
ff;

Simulating Memory Generated by CoreGen

`timescale 1ns / 1ps
module BramTest;
  reg clka;
  reg wea;
  reg [3:0] addra;
  reg [7:0] dina;
  wire [7:0] douta;

  // Unit under test: the block RAM generated by CoreGen
  Bram uut (
    .clka(clka),
    .wea(wea),
    .addra(addra),
    .dina(dina),
    .douta(douta)
  );

  // Clock with a 100 ns period
  initial begin
    clka = 0;
    forever #50 clka = ~clka;
  end

Simulating Memory Generated by CoreGen

  initial begin
    // Initialize inputs
    wea = 0;
    addra = 0;
    dina = 0;
    // Read a new address every clock period
    #100 addra=1;
    #100 addra=2;
    #100 addra=3;
    #100 addra=5;
    #100 addra=6;
  end
endmodule