Class Presentation of Advanced VLSI Course Presented by: Ali Shahabi Presentation date: 2004/12/30 Major reference: “Architecture and Circuit Techniques for a Reconfigurable Memory Block” by Ken Mai, Ron Ho, Elad Alon, Dean Liu, Younggon Kim, Dinesh Patil, and Mark Horowitz
Custom ASICs are expensive – High design complexity – High non-recurring engineering costs – Need high-volume, high-profit market – Hard to modify or fix
Reconfigurable computing Growing interest in reconfigurable solutions – FPGAs – Structured ASICs – Coarse-grain architectures Reconfigurable computing characteristics – Low non-recurring engineering costs – Good performance and efficiency – Reconfigurability overheads
FPGA with hardwired blocks CLBs CLB: Configurable Logic Block [1]
Coarse-grain architecture Chip multi-processor Compute, memory, interconnect, control Reconfigure tile and global network [1]
Current memory systems Traditional emphasis on compute side – Memory system important FPGAs – Fine grain with sizable overheads – Use CLBs for extra functionality – Slow compared to cutting-edge SRAMs Coarse-grain architectures – Large grain – Low flexibility
Design goal Low overhead, fast, reconfigurable memory system Reconfigure along natural SRAM partition boundaries – Add hardwired blocks for extra functionality Modern SRAM circuit techniques – Pulse-mode self-resetting logic – Replica timing paths Design targets – Cache – FIFO
Reconfigurable memory system Array of homogeneous memory mats Each memory has a port into the interconnect Mat size chosen based on natural partition boundary Small inter-mat control network [1]
Smart Memories chip [2]
Tile floor plan [2]
Sample configuration: caches Mats configured as tag or data Direct mapped or set-associative caches Use inter-mat control network to pass hit/miss [1]
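The cache configuration above can be sketched in software as follows. This is a behavioral illustration only, not the chip's logic: the `Mat` and `TwoWayCache` classes, the position of the valid bit, and the address split (word addressed, one word per line, low 9 bits as index) are all assumptions made for the example. The `hit` flag stands in for the hit/miss signal passed over the inter-mat control network.

```python
MAT_ENTRIES = 512  # entries per mat, per the mat details in the source
VALID = 0b0001     # assumed position of the valid bit in the 4b meta-data


class Mat:
    """One memory mat: 512 entries of 32b main data plus 4b meta-data."""

    def __init__(self):
        self.data = [0] * MAT_ENTRIES
        self.meta = [0] * MAT_ENTRIES


def split(addr):
    """Assumed address layout: index = low 9 bits, tag = remaining bits."""
    return addr >> 9, addr % MAT_ENTRIES


class TwoWayCache:
    """Two mats configured as tag storage, two as data storage."""

    def __init__(self):
        self.tag_mats = [Mat(), Mat()]
        self.data_mats = [Mat(), Mat()]

    def fill(self, way, addr, word):
        tag, index = split(addr)
        self.tag_mats[way].data[index] = tag      # tag lives in main data
        self.tag_mats[way].meta[index] |= VALID   # valid lives in meta-data
        self.data_mats[way].data[index] = word

    def lookup(self, addr):
        tag, index = split(addr)
        for way in range(2):
            tag_mat = self.tag_mats[way]
            # Hit signal a tag mat would send to its data mat.
            hit = bool(tag_mat.meta[index] & VALID) and tag_mat.data[index] == tag
            if hit:
                return True, self.data_mats[way].data[index]
        return False, None
```

Two addresses that map to the same index can then live in different ways, and a third address with the same index misses.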
Mats configured as 2-way set-associative cache [2]
Sample configuration: FIFOs Data FIFOs, instruction store, and scratchpad Completely self-contained FIFOs A single FIFO can be smaller or larger than 1 mat [1]
Multi-porting Some configurations need >1 access per cycle – Cache tag with snooping – FIFOs with independent read and write ports Multi-porting each cell is expensive – Multiple ports not always needed Run memory system faster than processor – Time multiplex single-port – Memory cycle = 10 fan-out-of-4 (FO4) inverter delays
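Time multiplexing a single port can be illustrated with a small behavioral model. The ratio of two memory cycles per processor cycle is an assumption for the example (the slide only says the memory runs faster than the processor), and `SinglePortMemory` and `processor_cycle` are hypothetical names.

```python
class SinglePortMemory:
    """A memory that can service exactly one access per memory cycle."""

    def __init__(self, size):
        self.cells = [0] * size
        self.accesses = 0  # counts memory cycles used

    def access(self, op, addr, value=None):
        self.accesses += 1
        if op == "write":
            self.cells[addr] = value
            return None
        return self.cells[addr]


def processor_cycle(mem, requests, slots_per_cycle=2):
    """Serve up to `slots_per_cycle` requests in one processor cycle.

    slots_per_cycle=2 is an assumed clock ratio, not a figure from the source.
    Requests are served back to back on the single physical port, so the
    processor sees what looks like a multi-ported memory.
    """
    if len(requests) > slots_per_cycle:
        raise ValueError("more requests than memory cycles in this processor cycle")
    return [mem.access(*request) for request in requests]
```

For instance, a FIFO write and a FIFO read issued in the same processor cycle are served in consecutive memory cycles.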
Virtual multi-porting [1]
Mat latency Total mat latency = 2 cycles – 20 FO4 – SRAM access = 1 cycle – Peripheral logic = 1 cycle Fully pipelined – Accepts one access every cycle
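The latency and throughput figures above combine in the usual way for a pipeline. The arithmetic can be checked with a short sketch; the function names are illustrative only.

```python
FO4_PER_CYCLE = 10   # memory cycle time, in FO4 delays
PIPELINE_STAGES = 2  # SRAM access + peripheral logic, one cycle each


def latency_fo4():
    """Latency of a single access, in FO4 delays."""
    return PIPELINE_STAGES * FO4_PER_CYCLE


def total_cycles(num_accesses):
    """Cycles to finish a burst when one access is issued every cycle."""
    if num_accesses < 1:
        raise ValueError("need at least one access")
    # The first access takes the full latency; each later one adds one cycle.
    return PIPELINE_STAGES + (num_accesses - 1)
```

A burst of 8 back-to-back accesses therefore completes in 9 cycles rather than 16, because the mat accepts a new access every cycle.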
Added features [1]
Meta-data [1]
Mat details 2KB SRAM array – 512 x 36b logical, 128 x 144b physical – 32b main data, 4b meta-data – Scan tunable replica bitline
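The logical and physical organizations hold the same number of bits, with four logical words per physical row. A minimal sketch of the capacity check and one possible address mapping is below; the specific mapping (row = address // 4, column slot = address % 4) is an assumption, since the slide does not say how words are interleaved within a row.

```python
LOGICAL_WORDS, LOGICAL_WIDTH = 512, 36    # 32b main data + 4b meta-data
PHYSICAL_ROWS, PHYSICAL_WIDTH = 128, 144
MAIN_DATA_BITS = 32

WORDS_PER_ROW = PHYSICAL_WIDTH // LOGICAL_WIDTH  # 4 logical words per row


def main_data_bytes():
    """Capacity counting only the 32b main data of each word."""
    return LOGICAL_WORDS * MAIN_DATA_BITS // 8


def physical_location(addr):
    """Map a logical word address to (row, column slot). Assumed mapping."""
    if not 0 <= addr < LOGICAL_WORDS:
        raise ValueError("logical address out of range")
    return addr // WORDS_PER_ROW, addr % WORDS_PER_ROW
```

The 2KB figure corresponds to the main data alone: 512 words of 32 bits is 2048 bytes, with the meta-data bits stored in addition.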
Meta-data bits Cache: valid, dirty, LRU, cache coherence state FIFO: valid Special operations – Gang – Read modify write (Figure: each entry holds 4b meta-data and 32b data)
Gang operation Can gang set or clear columns of meta-data bits Single cycle operation (Figure: mask with clear and set controls applied to the meta-data columns) [1]
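In software terms, a gang operation applies one mask to the meta-data of every entry at once. The sketch below models only the effect; the hardware does this in a single cycle, whereas the loop here is just a simulation. The function name and argument order are hypothetical.

```python
META_BITS = 4
META_FULL = (1 << META_BITS) - 1  # 0b1111


def gang_op(meta, mask, set_bits):
    """Return the meta-data column after a gang set or gang clear.

    meta     : list of 4b meta-data values, one per entry
    mask     : which meta-data bit columns the operation touches
    set_bits : True for gang set, False for gang clear
    """
    mask &= META_FULL
    if set_bits:
        return [m | mask for m in meta]
    return [m & ~mask & META_FULL for m in meta]
```

One plausible use, assuming the valid bit sits in bit 0, is invalidating every cache line at once with `gang_op(meta, 0b0001, set_bits=False)`.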
Meta-data bit cell [1]
Read modify write (figure labels: mdata, data) [1]
Read modify write: read [1]
Read modify write: modify [1]
Read modify write: write [1]
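The three phases shown on these slides (read, modify, write) can be sketched as one function on the meta-data of a single entry. The `modify` callback stands in for whatever logic computes the new bits; passing it as a Python function, and the dirty-bit position used in the usage note, are assumptions for illustration.

```python
META_FULL = 0b1111  # 4b of meta-data per entry


def read_modify_write(meta, index, modify):
    """Read an entry's meta-data, update it, and write it back.

    Returns the value that was read, which is what the rest of the
    access (for example a tag check) would see.
    """
    old = meta[index] & META_FULL        # read phase
    new = modify(old) & META_FULL        # modify phase
    meta[index] = new                    # write phase
    return old
```

For example, with an assumed dirty bit at `0b0010`, a store to a cache line could mark it dirty with `read_modify_write(meta, index, lambda m: m | 0b0010)` while still observing the previous state.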
RMW decoder circuits [1]
RMW decoder circuits: read [1]
RMW decoder circuits: modify [1]
RMW decoder circuits: write [1]
Timing [1]
PLA Reconfigurable NOR-NOR PLA 1st NOR plane = ternary-CAM 2nd NOR plane = SRAM [1]
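Logically, a PLA whose first plane is a ternary CAM and whose second plane is an SRAM behaves like a list of product terms with don't-care bits, each selecting a stored output row. The sketch below models that logical behavior only, not the NOR-level circuit or its timing; the tuple encoding of a term as `(care, value, outputs)` is an assumption made for the example.

```python
def pla_eval(inputs, terms):
    """Evaluate a two-plane PLA on an integer input vector.

    terms is a list of (care, value, outputs) tuples:
      care    : 1 bits are compared, 0 bits are don't-care (ternary match)
      value   : required input bits where care is 1
      outputs : output bits driven when the term matches (second plane row)
    """
    result = 0
    for care, value, outputs in terms:
        if (inputs & care) == (value & care):  # first plane: ternary match
            result |= outputs                  # second plane: OR matching rows
    return result
```

A term with `care = 0b000011` and `value = 0b000001` matches any input whose two low bits are `01`, regardless of the other bits.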
PLA: 1st NOR plane [1]
PLA: normal delay chain [1]
PLA: early reset-off delay chain [1]
PLA: 2nd NOR plane [1]
Pointer logic [1]
Pointer logic Pointer cells are 2-ported For FIFO configurations we add pointer logic 4 pointer/stride pairs – 11b pointer – 4b stride [1]
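A pointer/stride pair can be modeled as a register that advances by its stride after each use. The sketch below builds a small FIFO from two such pairs. The wrap-around at 2^11, the use of stride 1 for both pointers, the explicit occupancy counter, and the class names are all assumptions for illustration; the slides give only the field widths.

```python
POINTER_BITS, STRIDE_BITS = 11, 4


class PointerStride:
    """One pointer/stride pair: 11b pointer, 4b stride."""

    def __init__(self, pointer=0, stride=1):
        if not 0 <= pointer < (1 << POINTER_BITS):
            raise ValueError("pointer does not fit in 11 bits")
        if not 0 <= stride < (1 << STRIDE_BITS):
            raise ValueError("stride does not fit in 4 bits")
        self.pointer = pointer
        self.stride = stride

    def advance(self):
        """Return the current pointer, then step it by the stride."""
        current = self.pointer
        # Assumed behavior: wrap modulo the 11b pointer range.
        self.pointer = (self.pointer + self.stride) % (1 << POINTER_BITS)
        return current


class MatFifo:
    """FIFO using one pair as write pointer and one as read pointer."""

    def __init__(self, depth=512):
        # depth is assumed to be a power of two no larger than 2**11.
        self.depth = depth
        self.cells = [None] * depth
        self.write_ptr = PointerStride()
        self.read_ptr = PointerStride()
        self.count = 0

    def push(self, value):
        if self.count == self.depth:
            raise OverflowError("FIFO full")
        self.cells[self.write_ptr.advance() % self.depth] = value
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("FIFO empty")
        self.count -= 1
        return self.cells[self.read_ptr.advance() % self.depth]
```

Because the read and write pointers are separate pairs, pushes and pops proceed independently, which matches the independent read and write ports a FIFO configuration needs.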
Write buffer [1]
Write buffer Pipeline writes for single-cycle cache writes On write, data mat stores incoming data in the write buffer (WB) Tag check – Cache miss: WB entry is invalidated – Cache hit: WB entry is allowed to write WB entry is written into the data mat on the next write On every write, the WB and mat are both active [3]
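The write-buffer behavior described above can be sketched as a one-entry buffer in front of the data mat. Two simplifications are assumptions of this model: the tag-check result is passed in as a `hit` argument even though in hardware it arrives after the data is buffered, and reads forward from a pending buffer entry, which the slide does not state.

```python
class DataMatWithWriteBuffer:
    """Data mat with a single-entry write buffer (WB)."""

    def __init__(self, entries=512):
        self.cells = [0] * entries
        self.wb = None  # pending (index, word), or None if invalid

    def write(self, index, word, hit):
        # On every write the WB and the mat are both active:
        # the previous valid WB entry is written into the array...
        if self.wb is not None:
            pending_index, pending_word = self.wb
            self.cells[pending_index] = pending_word
        # ...while the incoming data is captured in the WB.
        # A miss invalidates the entry; a hit leaves it valid.
        self.wb = (index, word) if hit else None

    def read(self, index):
        # Assumed forwarding so a read sees a write that has not retired yet.
        if self.wb is not None and self.wb[0] == index:
            return self.wb[1]
        return self.cells[index]
```

The array itself only changes when a later write arrives, so each write occupies the mat for a single cycle.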
Comparator [1]
Comparator Maskable comparator – Can mask out any combination of meta-data bits – Can mask out the main data as a chunk Example use: cache tag compare – Want to check valid state of line (in meta-data) – Want to check tag itself (in main data)
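The maskable compare can be written as a bitwise mask on the meta-data plus an all-or-nothing enable on the main data. This is a minimal sketch; the function signature and the valid-bit position in the usage note are assumptions.

```python
def masked_compare(stored_data, stored_meta, ref_data, ref_meta,
                   meta_mask=0b1111, compare_data=True):
    """Compare a stored entry against a reference value.

    meta_mask    : 1 bits select which meta-data bits take part
    compare_data : False masks out the 32b main data as a single chunk
    """
    meta_equal = (stored_meta & meta_mask) == (ref_meta & meta_mask)
    data_equal = (not compare_data) or (stored_data == ref_data)
    return meta_equal and data_equal
```

For the cache tag compare in the example above, assuming the valid bit is bit 0, the call would be `masked_compare(stored_tag, stored_meta, tag, 0b0001, meta_mask=0b0001)`: the tag must match and the line must be valid, while dirty, LRU, and coherence bits are ignored.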
Putting it all together [1]
Testchip 0.18µm 6M TSMC 3mm x 3.3mm die 4 memory blocks Low swing crossbar Test vector storage 1.1GHz (10 FO4) 1.8V, room temp. [1]
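The clock frequency and the cycle length in FO4 together imply an FO4 delay for this process and operating point. The arithmetic, as a small helper with an illustrative name:

```python
def fo4_delay_ps(freq_ghz, fo4_per_cycle):
    """Implied FO4 inverter delay in picoseconds."""
    cycle_ps = 1000.0 / freq_ghz  # 1 GHz corresponds to a 1000 ps cycle
    return cycle_ps / fo4_per_cycle
```

With the testchip numbers, `fo4_delay_ps(1.1, 10)` gives a cycle of about 909 ps and an FO4 delay of about 91 ps.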
Testchip mat details 2KB SRAM array – 512 x 36b logical, 128 x 144b physical – 32b main data, 4b meta-data 16 AND-term PLA – 6 inputs, 4 outputs 4 pointer/stride pairs – 11b pointer – 4b stride
Mat area breakdown (mm²) 32% mat area in peripheral logic [1]
Mat power breakdown (mW) 26% power in peripheral logic [1]
Conclusions Reconfigurable memory block – Multiple memory configurations – Performance on par with modern SRAMs – Modest overheads Future uses – Reconfigurable computing – General purpose computing – Designs with shifting memory requirements
References [1] K. Mai et al., “Architecture and Circuit Techniques for a Reconfigurable Memory Block,” ISSCC 2004. [2] K. Mai et al., “Smart Memories: a Modular, Reconfigurable Architecture,” Intl. Symp. on Comp. Arch., pp , [3] J. Hennessy and D. Patterson, “Computer Architecture: A Quantitative Approach,” 2nd Ed., 1996.