Published by Lindsey Cain. Modified over 8 years ago.
1
AM chip schedule Alberto
2
Design activities (17/11/2010)
Adapt JTAG and boundary scan to MPW chip
Design new CAM cells, buffer logic
New majority logic
Opcode interface to new majority
New logic for kill?
PADs layout
3
Schedule/milestones
End of March:
–a first version of the boundary scan is available
–preliminary layout of the majority logic cell available
–first version of the CAM-block available
Starting beginning of April: work on a first integration of the chip
End of April, after simulation and debugging:
–a final version of the boundary scan is available
–a complete layout of the majority logic cell
–final design for the CAM-block available
Starting beginning of May: work on the final integration of the chip
End of May: successful integration of all pieces
–during May test and simulation of the integrated design
During June: based on simulation decide what adjustments are needed
End of June: completed all simulations of the chip
–the final and integrated chip should be completely simulated and approved
Early July: submission
4
Majority interface
component majority
  port (
    CLK                : in  std_logic;
    INIT               : in  std_logic;
    REQ_LAY0           : in  std_logic;
    FORCE_READ         : in  std_logic;  -- force read regardless of n-miss and layer0
    DISABLE_READ       : in  std_logic;
    MISS0              : in  std_logic;
    MISS1              : in  std_logic;
    MISS2              : in  std_logic;
    -- signals to write the DISABLE_THIS SRAM cell
    WL                 : in  std_logic;
    DISABLE_THIS_SET   : in  std_logic;
    DISABLE_THIS_RESET : in  std_logic;
    LAYER_MATCH        : in  std_logic_vector (7 downto 0);
    READ_FLAG          : in  std_logic;
    PATTERN_MATCH      : out std_logic  -- last port: no trailing semicolon
  );
end component;
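As a rough behavioural sketch of what this interface could compute, the Python model below reuses the port names from the component declaration. The semantics are assumptions, not taken from the slides: MISS2, MISS1 and MISS0 are read as enables for "up to 2, 1 or 0 missing layers", the disable signals take priority over FORCE_READ, and layer 0 is bit 0 of LAYER_MATCH.

```python
def allowed_misses(miss0: bool, miss1: bool, miss2: bool):
    """Largest enabled miss threshold, or None if no threshold is enabled (assumed semantics)."""
    if miss2:
        return 2
    if miss1:
        return 1
    if miss0:
        return 0
    return None


def pattern_match(layer_match: int, *, miss0=False, miss1=False, miss2=False,
                  req_lay0=False, force_read=False, disable_read=False,
                  disable_this=False, n_layers=8) -> bool:
    """Behavioural model of PATTERN_MATCH for one pattern (hypothetical priorities)."""
    if disable_this or disable_read:      # assumption: disables win over everything
        return False
    if force_read:                        # force read regardless of n-miss and layer0
        return True
    limit = allowed_misses(miss0, miss1, miss2)
    if limit is None:
        return False
    if req_lay0 and not (layer_match & 1):  # assumption: layer 0 is bit 0
        return False
    fired = bin(layer_match & ((1 << n_layers) - 1)).count("1")
    return (n_layers - fired) <= limit
```

For example, `pattern_match(0b11111110, miss1=True)` accepts a pattern with one missing layer, while adding `req_lay0=True` rejects it because the missing layer is layer 0.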
6
Majority interface
[Block diagram: the majority block and its signals]
Inputs: CLK, INIT, layer_match, Read_flag
Global cfg signals, change just after clk rising edge: MISS0, MISS1, MISS2, Require_layer0, Force_read, Disable_read
Signals to write disable_this pattern: WL (write line), Disable_this_SET, Disable_this_RESET
Outputs: Pattern_match; layer_match to the Fisher tree
Layer_match is needed to output the map of fired layers.
7
Where to put the Flip Flops?
[Diagram labels: bitline, CAM cells (1 layer), match line (ML), sense amplifier (MLSA), layer-match SR latch, majority logic & flag logic, pattern match; clocks clk and clk_N; control signals match enable and ML reset]
[Timing sketch labels: bit line propagation, ML match?, majority propagation]
8
Where to put the Flip Flops?
[Diagram labels: bitline, CAM cells (1 layer), match line (ML) with current source, latch, layer-match SR latch (R, S); clock clk_N; control signals match enable and reset]
[Timing sketch labels: bit line propagation, ML match?]
9
Use 16 blocks of 512 patterns?
If die is 3.2 x 3.75 mm
Usable area excluding PADs: 3040um x 3590um
Can fit 3 blocks in height: 3*979um = 2937um
Can almost fit 6 blocks in width: 6*600um = 3600um
Depends on block size!
Can fit 3*5 blocks. Can fit 3*6 if block smaller than 600 um
11
Block size
April 7th, 201
Full custom block 64x4 (half of 64 patterns)
–Width 225.4um * Height 122.4um (68 rows)
Pattern block of 64 patterns + buffer
–Buffer 1 row ?
–8 layers width 450.8um no majority
–Height with buffer 124.2um
Height of 512 patterns = 993.6um
Height of 3x512 patterns = 2980.8um
3.2mm – 160um PADS – 2980.8um = 59.2um or 32 rows
Max pattern block width
–(Chip 3.75mm – 160um PADS)/6 = 3590um / 6 = 598um
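The floorplan arithmetic on this slide can be re-checked with a few lines of Python. This is only a sketch: the variable names are illustrative, the dimensions are the ones quoted above, and the row height is assumed to be the 122.4um block height divided by its 68 rows.

```python
import math


def blocks_that_fit(usable_um: float, block_um: float) -> int:
    """Number of whole blocks of size block_um that fit in usable_um."""
    return math.floor(usable_um / block_um)


DIE_H_UM, DIE_W_UM, PADS_UM = 3200.0, 3750.0, 160.0
usable_h = DIE_H_UM - PADS_UM          # 3040 um
usable_w = DIE_W_UM - PADS_UM          # 3590 um

block_h = 8 * 124.2                    # 512 patterns = 8 blocks of 64 patterns + buffer
blocks_in_height = blocks_that_fit(usable_h, block_h)
spare_h = usable_h - blocks_in_height * block_h

row_h = 122.4 / 68                     # assumed row height, from the full custom block
spare_rows = math.floor(spare_h / row_h)

max_block_w = usable_w / 6             # widest block that still allows 6 columns

print(blocks_in_height, round(spare_h, 1), spare_rows, round(max_block_w, 1))
```

With these inputs, six columns fit only if the block is narrower than about 598um; a 600um block leaves room for five.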
12
AMchip03 document
http://agenda.infn.it/materialDisplay.py?contribId=0&materialId=0&confId=3021
15
What we have now: Standard Cell 180 nm
5000 pattern/chip for 6-layer patterns, 2500 pattern/chip for 12-layer patterns
“A VLSI Processor for Fast Track Finding Based on Content Addressable Memories”, IEEE Transactions on Nuclear Science, Volume 53, Issue 4, Part 2, Aug. 2006, Page(s): 2428 - 2433
NEXT: NEW VERSION for both L1 & L2
65 nm technology provides a factor 8 → 20000 patterns/chip
Full custom cell provides at least a factor 2 → 40000 patterns/chip
8 layers instead of 12 provides a factor 1,5 → 60000 patterns/chip
1,2 x 1,2 cm^2 2D chip → 80000 patterns/chip
With a 2D chip we gain a factor 30!
1 AMboard: 128 chips → ~10 Mpatterns per board
1 Crate: 16 AMboard → ~160 Mpatterns per crate
Current prototype under design: 65nm TSMC, 12mm^2 MPW run, 100 MHz running clock
8000 patterns/chip, 8 layers each
Layer words of 12 bits + 3 ternary bits
Variable resolution patterns
A. Annovi - ACES 2011 @ CERN
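The chain of density factors can be reproduced as a short calculation. The sketch below only multiplies the factors quoted on the slide, starting from the 2500 patterns/chip of the 12-layer configuration; the 80000 patterns/chip for the 1,2 x 1,2 cm^2 chip is taken as quoted rather than derived, and all names are illustrative.

```python
BASE_PATTERNS = 2500  # current chip, 12-layer patterns

FACTORS = [
    ("65 nm technology", 8),
    ("full custom cell", 2),
    ("8 layers instead of 12", 1.5),
]


def density_steps(base, factors):
    """Apply each factor in turn and record the running patterns/chip."""
    steps, current = [], base
    for name, factor in factors:
        current *= factor
        steps.append((name, int(current)))
    return steps


steps = density_steps(BASE_PATTERNS, FACTORS)

PATTERNS_PER_CHIP = 80_000              # quoted for the 1,2 x 1,2 cm^2 2D chip
per_board = 128 * PATTERNS_PER_CHIP     # 128 chips per AMboard
per_crate = 16 * per_board              # 16 AMboards per crate
gain = PATTERNS_PER_CHIP / BASE_PATTERNS

for name, n in steps:
    print(f"{name}: {n} patterns/chip")
print(per_board, per_crate, gain)
```

The computed gain is 32, which the slide rounds to "a factor 30", and the board and crate totals come out at about 10 and 160 Mpatterns.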
16
Pattern efficiency
[Plot: efficiency vs # of patterns in AMchips (barrel only, 45 degrees); 90% level marked; axis labels 65M and 500M]
Pattern size r-φ: 24 pixels, 20 SCT strips; z: 36 pixels
Pattern size (half size) r-φ: 12 pixels, 10 SCT strips; z: 36 pixels
Plot annotations: = 342k, = 40k, "Want this"
17
Variable resolution AM
[Figure: finer patterns vs coarser pattern, with don’t care (DC) bits]
We can use don’t care on the least significant bit when we want to match the pattern layer @ coarser resolution, or use all the bits to match it @ finer resolution
Patterns with 1 kid are stored at finer precision
Layers without “don’t care (DC)” can ignore the hits in the “wrong” side of the layer
With 2 “don’t care” bits per layer gain an effective factor of 5 in patterns
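A minimal sketch of the don't-care idea, assuming a layer word is stored as a value plus a mask of don't-care bits (the function name, the mask representation and the 15-bit default width are illustrative choices, not the chip's actual encoding):

```python
def layer_matches(hit: int, stored: int, dc_mask: int, width: int = 15) -> bool:
    """True if hit equals stored on every bit that is not marked don't care."""
    care = ((1 << width) - 1) & ~dc_mask
    return ((hit ^ stored) & care) == 0


# Finer resolution: all bits are compared.
print(layer_matches(0b1011, 0b1010, dc_mask=0b0))   # False
# Coarser resolution: the least significant bit is a don't care.
print(layer_matches(0b1011, 0b1010, dc_mask=0b1))   # True
```

Setting the don't-care bit on the least significant bit makes one stored pattern cover both halves of the coarser bin, which is the effect described above.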
18
Goal: x30 pattern density but lower power consumption
32 patterns of 8 layers ~ 60 µm x 500 µm ~ 1 or 2 pixels
19
Tasks in Italy
New PADs layout (Stabile, Milano)
New CAM cells (Matteo, Frascati)
–Complete NAND cell
–Evaluate advantages of SRAM transistors
–Other cells
Clean up project scripts (Francesco, Pisa)
Transition to 65nm and new tools
Place and route (Francesco, Stabile)
20
JTAG work (Germany)
JTAG logic to be looked at after Laura left
New features:
–Add all MPW pins to boundary scan
–Extend JPATT_DATA register to include as bit[0] a disable_pattern bit
–Extend register for new busses (8 instead of 6)
21
Majority logic (Fermilab)
Current draft from Jim is good and almost complete
Features to be added:
–Account for individual pattern disable
–4 thresholds: disable, 0miss, 1miss, 2miss
–Include option require_layer0
Layer0 should be the closest to final AND
Design the basic majority logic cell
22
OPCODE work (Germany)
Translate OPCODE output for new majority logic. 5 lines out:
–Disable_match, 0miss, 1miss, 2miss
–require_layer0
Optional if time allows (coding and testing)
–change input OPCODE protocol to single word
23
Kill tree (Fermilab)
Optional if time allows (coding and testing)
–Try a new scheme for kill tree
Short description:
–Current scheme encodes the highest priority pattern (encoder)
–Then decodes it to set one kill FF
–1024 (or N patt) kill lines are then distributed to the patterns
Alternative: calculate kill along with priority encoder in a tree-like fashion
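One way to picture the tree-like alternative is a recursive grant scheme: each node forwards an enable to its left half, and to its right half only if nothing matched on the left, so exactly one kill line ends up set. The sketch below is a software illustration under assumptions that are not in the slides: the lowest index is taken as the highest priority, and the function and variable names are hypothetical.

```python
def kill_lines(match_flags):
    """Return a one-hot list marking the highest-priority matched pattern.

    Assumption: the lowest index has the highest priority.
    """
    n = len(match_flags)
    kill = [False] * n

    def subtree_any(lo, hi):
        # In hardware this OR would be computed bottom-up, once per node.
        if hi - lo == 1:
            return bool(match_flags[lo])
        mid = (lo + hi) // 2
        return subtree_any(lo, mid) or subtree_any(mid, hi)

    def grant(lo, hi, enable):
        # Push the enable down the tree; the right half is blocked
        # whenever the left half contains a match.
        if hi - lo == 1:
            kill[lo] = enable and bool(match_flags[lo])
            return
        mid = (lo + hi) // 2
        left_any = subtree_any(lo, mid)
        grant(lo, mid, enable)
        grant(mid, hi, enable and not left_any)

    if n:
        grant(0, n, True)
    return kill


print(kill_lines([0, 1, 0, 1]))   # only index 1 is selected
```

The point of the sketch is that the kill decision falls out of the same tree that resolves priority, without a separate encode-then-decode stage.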
24
Draft schedule
Aim for March submission (tight)
–All dates below are my guess for discussion today
NAND cell: November
NOR cell: end of December
NOR don’t care cell: mid January (?)
Match line amplifier: beginning of February
–This item is late in the schedule
New PADs layout: end of November
–Important to check that all pads fit in a single MPW block
Project clean up and first place and route: end of December
Preliminary version of majority: mid December
–Final end of January
New JTAG logic: end of December (needed for 1st place and route)
New OPCODE and kill TREE: end of January (mid Feb at latest)
25
Draft schedule
During February:
–Put everything together
–Final place and route
Prepare a detailed model of each CAM cell for mixed simulation of a few patterns.
–Doing this in February is late; we should start it earlier, but it is currently uncovered
From now till the end: comparison of the implemented model with the C++ model for debugging.
–Ilaria Sacco (Pisa)
–This item is understaffed, we will need help here
Are we missing any important item?
First goal: get Jim and Hans up and running
–Please ask for the information needed to start up
26
Outline
The Pattern Matching and the Associative Memory (AM)
Why the denser the AM, the better
Associative memory architecture
How chips are put together: Lamb → AMboard → crate
The Tree Search Processor & its location
27
The Event... The Pattern Bank TRACKING WITH PATTERN MATCHING
28
The Associative Memory – AM = Bingo
[Figure: Bingo scorecard]
Dedicated device - maximum parallelism:
Each pattern with private comparator
Track search during detector readout
Technology: kpat/chip, layers (L)
Full custom 700 nm: 0,128 kpat/chip, 6L
FPGA 350 nm: 0,128 kpat/chip, 6L
Standard cell 180 nm: 5,0 kpat/chip, 6L
New for FTK 90 nm: ~60 kpat/chip, 8L
New for FTK 65 nm: ~120 kpat/chip, 8L
2 Tiers 65 nm 2,5 D: 240 kpat/chip, 8L
29
[Diagram: ONE PATTERN. Cells 0 to 3 each hold the word for Layer 1 to Layer 4 with a HIT flip-flop (FF); hits arrive per layer, matches go to the Output Bus]
30
Tracking in 2 steps: find Roads first (Pattern Matching with Associative Memory, AM), then find Tracks inside Road (Fit by TF)
Track fitting using full resolution of the detector
[Data flow: Hits → Data Organizer (DO) → Hits → Associative Memory (AM) → Roads → DO → Roads + hits → Track Fitter (TF) → Track parameters (d, pT, , z)]
Super Strip (SS); Full Resolution Hits; Road
Large SS: a lot of fakes + combinatorics inside roads
Hot point @ high occupancy
31
What we have now: Standard Cell 180 nm
5000 pattern/chip for 6-layer patterns, 2500 pattern/chip for 12-layer patterns
“A VLSI Processor for Fast Track Finding Based on Content Addressable Memories”, IEEE Transactions on Nuclear Science, Volume 53, Issue 4, Part 2, Aug. 2006, Page(s): 2428 - 2433
NEXT: NEW VERSION for both L1 & L2
90 nm technology provides a factor 4 → 10000 patterns/chip
Full custom cell provides at least a factor 2 → 20000 patterns/chip
8 layers instead of 12 provides a factor 1,5 → 30000 patterns/chip
1,5 x 1,5 cm^2 2D chip → 60000 patterns/chip
Going to 65 nm → 120000 patterns/chip
With a 2D chip we gain a factor 50!
1 AMboard: 128 chips → ~15 Mpatterns per board
1 Crate: 16 AMboard → ~245 Mpatterns per crate
100 MHz running clock
32
Pattern bank Add encoder kill Bus0[17:0] Bus1[17:0] Bus2[17:0] Bus3[17:0] Bus4[17:0] Bus5[17:0]
33
Power consumption
Item: correction factor, resulting power
Old chip, 180 nm, 1,8 V core: 1,8 Watt
New chip, 90 nm, 1 V core: factor 1/(1,8*1,8), 0,56 Watt
Frequency 40 MHz, new chip 100 MHz: factor 100/40, 1,39 Watt
Area 1x1 cm^2, new chip 4 cm^2: factor 4/1, 5,56 Watt
New: pre-match feature: factor 1/3 (1/2), 1,85 (2,78) Watt
Per crate, 16 x 128 = 2048 chips: 3,8 (5,7) kW
IF the pre-match feature saves at least 1/3, new 2D chip (1,85 W) ~ old chip (1,8 W)
ANY OTHER IDEA TO GAIN IN POWER INCREASES THE POTENTIAL TO GROW IN THE THIRD DIRECTION
we would like to be 4 funding agencies involved:
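The power estimate is a product of correction factors, which the following sketch reproduces. The names are illustrative, and the pre-match feature is modelled simply as multiplying the power by 1/3 (or 1/2), which is how the quoted numbers behave.

```python
OLD_CHIP_WATT = 1.8

CORRECTIONS = [
    ("core voltage 1,8 V -> 1 V", 1 / (1.8 * 1.8)),
    ("clock 40 MHz -> 100 MHz", 100 / 40),
    ("area 1 cm^2 -> 4 cm^2", 4 / 1),
]


def cumulative_power(start_watt, corrections):
    """Apply each correction factor in turn and record the running power."""
    power, table = start_watt, []
    for label, factor in corrections:
        power *= factor
        table.append((label, power))
    return table


table = cumulative_power(OLD_CHIP_WATT, CORRECTIONS)
full_chip_watt = table[-1][1]

CHIPS_PER_CRATE = 16 * 128


def crate_kw(prematch_factor):
    """Crate power in kW when the pre-match feature scales chip power by prematch_factor."""
    return CHIPS_PER_CRATE * full_chip_watt * prematch_factor / 1000


for label, watt in table:
    print(f"{label}: {watt:.2f} W")
print(f"{full_chip_watt / 3:.2f} W/chip, {crate_kw(1 / 3):.1f} kW/crate")
```

Running it gives the same 0,56, 1,39 and 5,56 W steps as the table, and about 3,8 kW per crate for the 1/3 case.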
34
LHC Schedule
Concentrate now on 2013-2015 (17-19 pile-up events)
Consider evolution up to 2019 (41,5 pile-up events << simulated 75 ev) → Intermediate chip!
2020 comes much later and will profit from a very advanced technology. Sim with 75 pile-up events after 2020!
17,6 pile-up ev. @ 2.6×10^33
19,0 pile-up ev. @ 10^34
Annovi, 27-09-2010
35
Our Schedule
1. TSMC 65 nm, low power, available as mini@sic (Vcc_core=1,2 V).
2. 65 nm mini@sic 22,5 k€/block; 90 nm mini@sic 18,6 k€/block.
3. "Variable resolution" gives good results → early production of AM04
4. We missed the 90nm 2010 September run
5. We propose to move directly to a 65 nm prototype.
6. This is a preliminary schedule to produce new LAMBs for 2013:
(1) Mini@sic submission: spring or October 2011
(2) Delivery: ~February 2012
(3) Tested: ~June 2012
(4) MPW submission: from June 2012
(5) Delivery: from November 2012
(6) Tested: from February 2013
(7) MPW production: from February 2013
(8) Delivery: from July 2013
(9) Mounted on new LAMBs: from autumn 2013
36
Costs
2 blocks Mini@sic: paid by Italy
MPW run:
TSMC 2010: 12 mm^2, 80 kUSD → 6,7 kUSD/mm^2
UMC 2010: 4 mm x 4 mm, 70 k€ → 4,37 k€/mm^2
12 mm^2 ~ 1/8 AMchip03 area in CDF → 7500 patterns/chip → 960 kpatterns/AMBoard
With 2 blocks, 160 kUSD → ~2 Mpatterns/AMBoard
In 2012 could cost less – Academia Sinica can help on price.
Italy – Germany – USA – Academia Sinica (reduction).
For 2013: small production = 8+2 AMBoards = 1280 chips. How many wafers? How much for a wafer?
We would like to be 4 funding agencies, especially for the final step:
Whole wafer mask @ time when a large area chip is needed:
UMC 2010 90 nm: 555 kUSD
TSMC 2010 65 nm: 1300-900 kUSD
TSMC 2010 65 nm MLM: 650-950 kUSD
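The per-area costs and board totals quoted above follow from simple ratios, sketched here with illustrative variable names and the slide's own inputs (currencies are kept as quoted, with no conversion):

```python
tsmc_kusd_per_mm2 = 80 / 12            # TSMC 2010: 12 mm^2 for 80 kUSD
umc_keur_per_mm2 = 70 / (4 * 4)        # UMC 2010: 4 mm x 4 mm for 70 kEUR

CHIPS_PER_BOARD = 128
patterns_per_board = 7500 * CHIPS_PER_BOARD   # one 12 mm^2 block per chip
patterns_two_blocks = 2 * patterns_per_board  # two blocks per chip
chips_2013 = (8 + 2) * CHIPS_PER_BOARD        # small production: 8+2 AMBoards

print(round(tsmc_kusd_per_mm2, 1), umc_keur_per_mm2)
print(patterns_per_board, patterns_two_blocks, chips_2013)
```

The UMC figure comes out at 4,375 k€/mm^2, which the slide truncates to 4,37, and two blocks give 1,92 Mpatterns per board, quoted as ~2 M.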
37
add_in add_out Pipelines of AM chips AMchip Control = GLUE
38
AM INDI AMTOP Bus0 Bus1 Bus3 Bus2 AMBOTTOM Bus0 Bus5 Bus1Bus3 Bus2 Bus4 Bus5 PAT_ADD_IN [17:0] PAT_ADD_OUT [17:0] REV_EN add_in add_out LAMB
39
[AMBoard block diagram labels: AM; GLUE; FIFOs; RECEIVERs & DRIVERs (ROAD bus + 6 HIT buses); LAMB CONNECTORs; VME INTERFACE; ROAD CONNECTOR; HIT CONNECTOR; FPGA I/O control; PIPELINE REGISTERs; INDI]
Signals: HIT [17:0], ADD OUT [30:0], TRACKs
6 buses (108 bits!)
Four 8-chip (top-bottom) pipelines
40
LAMB Standard cell chip 40 MHz clock FPGA for Roads FTK AMBoard P3 serial LVDS Control FPGA for SS Input CDF AMBoard with 4 LAMBs Complementary Functions in the AUX board 16 AMBoards per “core” crate → 8 core crates in the system
41
[Crate diagram: VME CPU plus slots AM0 to AM15, each AM+TSP+DO+TF+HW, with 11LayFit+HW and 11LayFit+HW final slots]
[AMBoard labels: LAMB; standard cell chip; 40 MHz clock; FPGA for Roads; P3 serial LVDS; Control; FPGA for SS input]
[AUX card labels: connectors for hits (LVDS cables); DO+TF+HW; input FIFOs; four DO / TF / HW units; connectors for tracks output; interface; SSMAP]
Processing Unit
42
The whole system: Data Formatter + 8 core crates
43
6 18-bit buses, hit rate: 40 MHz/bus, input bandwidth of 4 Gbit/s
Divide into sectors with overlaps
6-12 Logical Layers: full coverage (Pixel barrel, SCT barrel, Pixel disks)
IEEE Trans. Nucl. Sci. 51, 391 (2004)
Overlaps require hits in a small region to be sent to two neighboring AMs
Goal: High Lum, 8 sectors, 8 9U VME crates for the FTK core (1/2 AM each)
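The quoted input bandwidth is the product of bus count, bus width and hit rate, as this short check shows (variable names are illustrative):

```python
N_BUSES = 6
BITS_PER_BUS = 18
HIT_RATE_HZ = 40_000_000   # 40 MHz per bus

input_bandwidth_bps = N_BUSES * BITS_PER_BUS * HIT_RATE_HZ
print(input_bandwidth_bps / 1e9, "Gbit/s")
```

The exact product is 4,32 Gbit/s, which the slide rounds to 4 Gbit/s.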
44
Whatever is the power of the AM we can build, we can do better with the TSP
45
Algorithm: NIM A287 (1990) 436-438, http://www.pi.infn.it/~paola/Tree_search_algorithm.pdf
Tree Search Processor: NIM A 287, 431 (1990), http://www.pi.infn.it/~orso/ftk/NIMA287_431.pdf
IEEE Toronto, Canada, November 8-14 1998, http://www.pi.infn.it/~paola/TSP_v14.pdf
[Figure: FAT ROAD found by AM (default SS for example) and THIN ROAD; pattern tree with Depth 0, Depth 1, Depth 2; PATTERN BLOCK; PARENT PATTERN]
46
The AM chip for each found road could provide:
1) The Road IDentifier (address)
2) The Bitmap: one bit per layer, saying which SSs are empty & which are full (11 bits: 11101111111 e.g.)
3) 4 more bits for each layer, Sub-SS, saying which of the 4 SS subdivisions are empty and which are full (4 bits x 8 layers).
Higher resolution SS (sub-SS) to be stored in AM or in a Mini-DO & LSB bits should be provided to TSP
Example: 2-Level TSP → divide by 4 each SS
47
Conclusions
The application at future instantaneous luminosities will require an extremely performant AM
Even if extremely performant, the AM work could be refined by the TSP, which could fit in the same package with the AM chip in a 2.5 D technology. This actually is NOT true any more, probably, before 2020
The AM could be used for both L1 and L2 applications
Any AM pattern capacity increase would be an important advantage for both L1 and L2 tracking systems
48
BACKUP
49
New AMchip features Alberto Annovi INFN Frascati
50
Outline
Use of patterns
Variable size patterns
New input busses
Disabling patterns
–Increase effective production yield
51
Pattern matching
[Figure: The Event... The Pattern Bank]
52
Tracking in 2 steps
1. Find low resolution track candidates called “roads”. Solve most of the pattern recognition
2. Then fit tracks inside roads. Thanks to the 1st step it is much easier
Tracking with ~offline quality
Super Bin (SB)
Critical parameter: SS size
Affects:
- Number of patterns for given efficiency: cost
- Number of found roads: workload for next step
53
Pattern efficiency
[Plot: efficiency vs # of patterns in AMchips (barrel only, 45 degrees); 90% level marked; axis labels 65M and 500M]
Pattern size r-φ: 24 pixel, 20 SCT; 36 pix z
Pattern size r-φ: 12 pixel, 10 SCT; 36 pix z
Plot annotations: = 342k, = 40k, "Want this"
54
Efficiency curve
[Plot: efficiency vs # of patterns in AMchips (barrel only, 45 degrees)]
Need many patterns for little efficiency ??
Super Bins are discrete
Edge effects give lots of patterns with little coverage
55
TSP simulation & varying-resolution pattern banks
Guido Volpi & Roberto Vitillo - Pisa
[Figure labels: Depth 0, Depth 1, Depth 2; PARENT PATTERN; FAT ROAD (AM resolution); Thin ROAD (TSP resolution); KID PATTERN @Depth 0; PARENT @Depth 1]
We now have a structured “pattern bank”, where each thin road is connected to its parent pattern in FTKsim.
Ongoing tests for TSP algo after the RoadFinder (AMsim) in FTKsim; we have studied the bank composition and AM FAKE roads.
An AM fake road is an AM matched pattern whose kids do not match the event
LOW coverage patterns: low probability to fire AM patterns, few kids (1 or 2): big advantage to match them at TSP resolution! All blank Half-SS can fire @ AM level as fakes, while @ TSP level the fake has good probability to be deleted
HIGH coverage patterns: high probability to fire AM patterns (symmetric), many kids (up to 20 or more): no advantage to match them at TSP resolution! More than one kid can fire @ TSP level. Low probability to be a fake AM road
56
How to implement “variable resolution” in the AMchip
Guido Volpi & Roberto Vitillo - Pisa
We can use don’t care on the least significant bit when we want to match the pattern layer @ AM resolution, or use all the bits to match it @ TSP resolution
Test of AM patterns:
1. All single kid patterns @ TSP resolution
2. For all few kid patterns use don’t care only for layers where both Half-SS are used by kids
[Figure labels: AM resolution (don’t care); TSP resolution (care) to exclude the right half in these layers]
[Plot: AM pattern distribution vs number of kids, WH @ 10^34, AM & TSP pattern bank for 23 ev. pileup; curves: all AM roads, AM roads with at least 1 matched kid, fake AM roads; x-axis: # of kids]
Majority of patterns with a single kid
57
AM with care/don’t care
[Plot vs # of kids; legend values: TSP 38000, AM@TSP 28000, AM@DC 44000, AM 342000]
Care/don’t care very effective to reduce the number of roads.
Area cost on the chip approx. 1 extra cell for each DC bit. Now 15 cells/layer. With 1 DC bit area increases by 1/15 ~ 7%.
For comparison, going to TSP resolution would require 3x patterns.
58
Number of busses
Currently we have 6 input busses
New AMchip should handle 8 layers
IBL will require 2 busses for higher b/w
External SCT layers need half b/w
Current package constraint: max 7 input busses
3 options: implement 2 of them to be selected online
59
8 Layers vs 7 buses (option 1)
[Diagram: input register for 7 busses → demultiplex based on MSB → internal register that feeds 8 busses → pattern bank with 8 matching layers (8 internal buses)]
Bus labels: Extra, Pix, Pix, Pix, SCT, SCT, SCT 2 & 3
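A possible reading of "demultiplex based on MSB" is sketched below: six input buses map one-to-one onto internal buses, and the seventh carries two layers, with the most significant bit of each word selecting which of the two internal buses receives it. Everything specific here is an assumption made for illustration: the shared bus being the last one, the 18-bit word width, the MSB polarity, and the function name.

```python
N_INPUT_BUSES = 7
SHARED_BUS = 6      # assumption: the last input bus carries two layers ("SCT 2 & 3")
WORD_WIDTH = 18     # assumption: word width including the selector bit


def demux(bus: int, word: int):
    """Map (input bus, word) to (internal bus, payload)."""
    if not 0 <= bus < N_INPUT_BUSES:
        raise ValueError("bus index out of range")
    if bus != SHARED_BUS:
        return bus, word                       # one-to-one mapping
    msb = (word >> (WORD_WIDTH - 1)) & 1       # selector bit
    payload = word & ((1 << (WORD_WIDTH - 1)) - 1)
    return SHARED_BUS + msb, payload           # internal bus 6 or 7


print(demux(2, 0x1234))             # unchanged bus and word
print(demux(6, (1 << 17) | 5))      # routed to internal bus 7 with payload 5
```

The cost of this scheme is one address bit on the shared bus and half the bandwidth per layer, which fits the remark that the external SCT layers need half the bandwidth.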
60
IBL: 7 Layers vs 7 buses
Special IBL layer: OR of 2 layers
[Diagram: input register for 7 busses → demultiplex based on MSB → internal register that feeds 8 busses]
Bus labels: IBL, IBL, Pix, Pix, SCT, SCT, SCT 2 & 3
IBL @ double bandwidth. Either double internal clock, or special logic.
Take the logical OR of 2 layers. Both layers store the IBL super bin. Distribute 50% of the data to each layer. Layer matches if any of the 2 IBL layers match
61
IBL: 8 Layers vs 7 buses
IBL with double clock
[Diagram: input register for 7 busses → demultiplex based on MSB → internal register that feeds 8 busses]
Bus labels: IBL, IBL, Pix, Pix, ???, SCT, SCT 2 & 3
IBL @ double bandwidth. Either double internal clock, or special logic.
Take the logical OR of 2 layers. Both layers store the IBL super bin. Distribute 50% of the data to each layer. Layer matches if any of the 2 IBL layers match
62
AMchip03 yields
AMchip03 prototype 2004
–1cm^2 MPW yield 35%
AMchip03 production 2005
–1cm^2 pilot run yield 70%
Large fraction of failures due to single pattern defect.
Add one register to disable bad patterns
–Will allow the use of all chips with a single (or few) pattern defects.
Area cost small: 1 flip-flop/pattern (not /layer)
63
Changes to AMChip specifications
AMchip03 specs:
–http://www-cdf.fnal.gov/publications/cdf7339_amchip03_specs.ps
New features
–Add 1 or 2 don’t care bits/layer
–Increase input busses to 7 with multiplexing & special handling of IBL
–Add disable FF for each pattern
64
BACKUP
65
Two possible approaches to expand into the third direction
VIPRAM - Vertically Integrated Pattern Recognition Associative Memory
Ted/Jim/Aida/Ray/Gregory/Simon/Silvia/Marcel/Gary/Mel/Bob… FNAL/ANL/UC/Tezzaron/…
1. “Identical Tier” 3D architecture (actually 2.5 D?)
2. “True 3D” implementation
66
Trying to define a collaboration Italy-USA for DOE application to Generic R&D funds (ATLAS FTK - Fermilab CMS, both interested)
67
All equal tiers: put them in pipeline as done on the board
68
The 3D IO Wrapper must be designed and fabricated around the 2D AMchip to ensure that all tiers act as a single chip as shown in Figure 5. Even for prototyping purposes, it is not possible to simply take an existing, fabricated AMchip and place it inside a rectangular doughnut-shaped 3D IO Wrapper. There are several ways to address this.
First, the 2D AMchip could be redesigned in a 3D process like Tezzaron/Chartered, and then the 3D IO Wrapper could be designed around it. This method has no obstacles to its 3D fabrication. However, it does require the redesign of the AMchip.
Second, the CMOS UMC process could be used for 3D development even though UMC does not have a 3D process. This method requires no redesign of the AMchip, but it does require UMC to be willing to participate in a “Via Middle” process in which, after a certain number of fabrication steps, the wafers are shipped to a “Via Middle company” (e.g. Tezzaron) where the first steps of the Through Silicon Via process are started. Then the wafers are shipped back to UMC where the 2D processing is completed. Finally, UMC ships the completed wafers to the Via Middle company where 3D processing is completed. Not all companies are willing to participate in a Via Middle process.
69
The True 3D: 1 tier/ Layer + 1 control tier Control Tier Tier 4 Tier 3 Tier 2 Tier 1
70
CAM in 2D
71
Very high density of patterns
72
Advantages
2D chip: ready soon with ~best technology (65 nm today, 40 or better in 2020), 1 single mask, probably enough for LVL2, could allow 2,5D
True 3D: less consuming tiers, much larger banks (useful for LVL1?), less latency compared to pipelined tiers
True 3D: important if we need much larger banks than provided by 2D. COSTS?
Fermilab proposes “True 3D” as a phase I R&D
73
EVEN MORE – Phase II
Integration of VLSI chips with FPGA and RAMs
Adding more planes? Could we include DO – TF and HW?
All planes that fit well in a 2,5 D scheme
All of them well known and testable on FPGA before!
[Stack labels: AMchip; Flexible TSP logic (FPGA-like?); Memories for TSP; MINIDO?; DO + TF + HW?]
74
Conclusions
They present 2 phases: “true 3D” first, integration with FPGA and memories second.
We think that on a short time scale it is important to understand the power of the 2D design: density of patterns available/needed, and consumption. For LVL2, 2D pushed to the best technology seems ok.
We could try the 2D chip to be used as 2.5 D as Phase I
On a longer time scale, try the “True 3D” as Phase II
75
AMchip04 with umc90 std cells
UMC90 FSD0A_A standard cells library
Our custom standard cells: single_layer, search_line
Tools used:
Synopsys DC D-2010.03-SP1-1 (synthesis)
Cadence SoC Encounter v07.10-s219_1 (placement, routing)
Synopsys PT D-2010.03-SP1-1 (timing analysis)
Custom scripts (manual place)
76
Basic bank structure
[Diagram: 8x input bus; buffer; 32x patterns (row); 32x majority; output: matched patterns]
- Manual placement
- Majority row has its own clock tree
77
Basic bank structure
[Diagram: 8x input bus; buffer; 32x patterns (row); 32x majority; output: matched patterns]
- Manual placement
- Majority row has its own clock tree
A pattern is a row: 8x single_layer cells
Each cell matches a 15-bit bus
78
Basic bank structure
[Diagram: 8x input bus; buffer; 32x patterns (row); 32x majority; output: matched patterns]
- Manual placement
- Majority row has its own clock tree
Majority logic: if X out of 8 buses match, the pattern is matched. X is programmable via JTAG
80
[Timing diagram signals: CLK, Match Line, Match_reg, BL [15:4], BL_N [15:4], Mlpre_n, slpre, slpre_t, BL [3:0], BL_N [3:0], MLSA_res, SEN]
All these signals are inputs to the single_layer pattern cell to activate the match. Relative timing is critical!
Generated in each Buff module by global “read” signals
81
512 patterns bank 16 x 32 pattern blocks are manually placed to build a 512 patterns bank. Horizontal and vertical gaps are left for power grid.
82
All logic placed
The pattern bank occupies most of the area. All the other control logic scales very weakly with the number of patterns. We could try to fill the chip with a bigger column of patterns (~800), but it is not critical for this mini@sic prototype to have a bigger bank.
83
Logic scheme
84
Power grid
Power distribution is done by two big horizontal stripes and two thinner vertical stripes. We are waiting for feedback from IMEC about this power grid design.
85
512 patt AMCHIP04 routed
First results of routing (wroute, clock tree routed first, no post-routing optimization) are reasonable:
- routing is simple and consistent with our plans in the bank area (vertical buses, horizontal output)
- no critical congestion in other areas
86
Timing Analysis
We have working skeleton scripts for static timing analysis
A first look at the timing with PrimeTime showed various setup and hold violations
No post-route optimization was done; buffer optimization in this step might remove most of the violations
Global signals running through the whole pattern column have setup violations
–Force a better routing of the column area
–Manually optimize buffer usage
–Split the column into two shorter columns
Some optimization and re-routing is needed, but no critical flaws are detected
87
Full Custom Associative Memory Core With respect to standard cell design of the memory chip we want to: Increase memory density Reduce power consumption
88
CAM model
[Figure: simple schematic of a CAM with 4 words having 3 bits each. The schematic shows individual core cells, differential searchlines, and matchline sense amplifiers (MLSAs)]
[Figure: CAM core cells for (a) 10-T NOR-type CAM and (b) 9-T NAND-type CAM. The cells are shown using SRAM-based data-storage cells. For simplicity, the figure omits the usual SRAM access transistors and associated bitlines.]
89
NAND Type SRAM Cell
90
NAND Type SRAM Cell Layout NAND Cell dimensions: 2.8 micron height 3.8 micron width
91
NOR Type SRAM Cell
92
NOR Type SRAM Cell Layout NOR Cell dimensions: 2.8 micron height 3.62 micron width
93
MatchLine Sense Amplifier (MLSA) Positive feedback differential sense amplifier Matchline discharge transistor Output inverter Amplifier resetting transistors Amplifier resetting transistor
94
MatchLine Sense Amplifier Layout MLSA dimensions: 2.8 micron height 7.3 micron width
95
NOR Type Matchline Model The main feature of the NOR matchline is its high speed of operation. In the slowest case of a one-bit miss in a word, the critical evaluation path is through the two series transistors in the cell that form the pulldown path.
96
NAND Type Matchline Models
A feature of the NAND matchline is that a miss stops signal propagation, so that no power is consumed past the final matching transistor in the serial nMOS chain.
Two drawbacks of the NAND matchline are:
–a quadratic delay dependence on the number of cells
–a low noise margin
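The quadratic delay dependence can be illustrated with the Elmore delay of a uniform RC ladder, a common first-order model for a chain of series pass transistors. This is a sketch under that modelling assumption, not the author's simulation: each cell is reduced to one on-resistance and one node capacitance, and the values passed to the function are placeholders.

```python
def nand_chain_delay(n_cells: int, r_ohm: float, c_farad: float) -> float:
    """Elmore delay to the end of a uniform RC ladder with n_cells stages.

    The capacitance at stage i is charged through i series resistances,
    giving R*C*(1 + 2 + ... + n) = R*C*n*(n + 1)/2.
    """
    return sum(i * r_ohm * c_farad for i in range(1, n_cells + 1))


# Doubling the chain length roughly quadruples the delay.
for n in (4, 8, 16):
    print(n, nand_chain_delay(n, 1.0, 1.0))
```

In units of R*C the delay is 10, 36 and 136 for 4, 8 and 16 cells, which is why long NAND chains are usually kept short or segmented.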
97
Selective Precharge Model
98
Selective Precharge
99
Estimated Power Consumption
The Associative Memory core estimated power consumption (at 100 MHz clock frequency) with the NOR cell match line scheme is about 3 A. The core power supply is 1 V.
[Plot: associative memory core (60000 patterns) running at 100 MHz clock frequency with the Selective Precharge matchline scheme]
We have obtained an 80% reduction in power consumption
100
Selective Precharge Timing (all bits match)
[Timing diagram signals: matchline precharge, MLSA output, search line (bit line) precharge, MLSA enable, matchline discharge, matchline NOR cell]
[Phases: searchline and matchline precharge phase, matchline evaluation phase, matchline discharge phase]
101
Selective Precharge Timing (NOR bit mismatch)
[Timing diagram signals: matchline precharge, MLSA output, search line (bit line) precharge, MLSA enable, matchline discharge, matchline NOR cell]
[Phases: searchline and matchline precharge phase, matchline evaluation phase, matchline discharge phase]
102
Selective Precharge Timing (NAND bit mismatch)
[Timing diagram signals: matchline precharge, MLSA output, search line (bit line) precharge, MLSA enable, matchline discharge, matchline NOR cell]
[Phases: searchline and matchline precharge phase, matchline evaluation phase, matchline discharge phase]
103
Layer Layout
Width: 67.2 micron, Height: 2.8 micron
[Layout labels: matchline precharge transistor; NAND cells; NOR cells; MLSA and matchline discharge transistor]
104
Timing
105
Conclusions
I have completed the layout of the full layer
The obtained layout is quite compact
The estimated memory core power consumption is reduced by about 80% with respect to a NOR type matchline model
To do:
–Complete the remaining full custom part (search line precharge of the NOR cell and the MLSA Vref)
–Complete the layer simulation with Montecarlo analysis
–Simulation of the full associative memory chip
106
Milestone #9: Specify system size: 1×10^34 and 3×10^33
Concentrate now on 2013-2015 (17-19 pile-up events)
2020 comes much later and will profit from a very advanced technology. Sim with 75 pile-up events after 2020!
17,6 pile-up ev. @ 2.6×10^33
19,0 pile-up ev. @ 10^34
107
Using the variable resolution in a new AM chip for 10^34
Guido Volpi & Roberto Vitillo - Pisa
WH events @ 10^34 (# of pile-up events = 23), banks coverage ~ 95%
8.0 MPat @ TSP → 2,80 MPat @ AM level (35%) per region (barrel only)
20 MPat @ TSP → 7 MPat @ AM level (35%) per region (all detector)
DATA FLOW (Option A) assuming 16 AMboards in a core crate (numbers are for barrel only; a factor ~2,5 has to be applied for “all detector”):
3600 roads/AMboard of which 733 have a kid match at TSP level → 80% fakes
Using TSP resolution in the AM bank for AM patterns with 1,2,3 kids: 3600 goes down to 1325 roads/AMboard → gaining a factor ~ 3!
For a full detector FTK:
–less than 4000 roads/AMboard @ AM out with a limit of 8000
–less than 2000 roads/AMboard @ TSP out with a limit of 4000
FTK Demonstrator with old chip, barrel only: running now on 17,6 pile-up events to understand DATA FLOW → however we consider it a test
It is not necessary to have large margins for 2013. Even a small AMchip (12 mm^2) @ 65 nm (MPW 80 k€) with variable resolution implemented could do it, even without the TSP. Very low consumption
108
THE AMCHIP04 PROTOTYPE
[Chip images: 180 nm; 90 nm miniasic]
NEXT YEAR – MAY BE MARCH: Mini-asic COULD be 90 or 65 nm
Design: L. Sartori (Ferrara), M. Beretta (LNF), E. Bossini, F. Crescioli, I. Sacco (Pisa)
Test: A. Lanza (Pavia)
109
The FTK CHALLENGING PART: the NEW AMCHIP & the TSP
Where can we stack the TSP?
–In the AUX board just after the AMBoard?
–In the AMBoard itself?
–In the Lamb, to reduce the # of roads early?
–Even better, in the AMchip 2.5 D!
[Board photo labels: LAMB; standard cell chip; 40 MHz clock; FPGA + TSP?]