Outline
- Pattern matching and the Associative Memory (AM)
- Why the denser the AM, the better
- Associative memory architecture
- How chips are put together: LAMB → AMBoard → crate
- The Tree Search Processor (TSP) and its location
The Event TRACKING WITH PATTERN MATCHING The Pattern Bank ...
The Associative Memory: AM = Bingo
- Dedicated device with maximum parallelism: each pattern has its own private comparator
- Track search proceeds during detector readout
- [Figure: bingo scorecard]
AM chip generations (kpat/chip = thousands of patterns per chip; 6L / 8L = 6-layer / 8-layer patterns):
- Full custom, 700 nm: 0.128 kpat/chip (6L)
- FPGA, 350 nm: 0.128 kpat/chip (6L)
- Standard cell, 180 nm: 5.0 kpat/chip (6L)
- New for FTK, 90 nm: ~60 kpat/chip (8L)
- New for FTK, 65 nm: ~120 kpat/chip (8L)
- 2 tiers, 65 nm, 2.5D: 240 kpat/chip (8L)
A schematic drawing of the AM
[Figure labels: ONE PATTERN; Layer 1 to Layer 4, each with its own HIT input; Cell 0 to Cell 3; a word and a flip-flop (FF) per cell; Output Bus.]
The more powerful the AM, the better. Why?
Tracking in 2 steps:
1. Find roads first: pattern matching with the Associative Memory (AM), on coarse Super Strips (SS).
2. Then find tracks inside each road: fit by the Track Fitter (TF), using the full resolution of the detector.
[Figure: blocks AM, Data Organizer (DO) and TF. Hits enter the AM and the DO; Roads leave the AM; Roads + full-resolution hits enter the TF; track parameters (d, pT, φ, η, z) leave the TF.]
- Hot point at high occupancy.
- Large SS: a lot of fake roads, plus combinatorics inside roads.
- Road size: the parameter that balances the AM size against the DO-TF workload.
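The two steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: hits are integer coordinates per layer, a super strip is obtained by integer division by a hypothetical `ss_width`, and the names `to_superstrip`, `find_roads` and `hits_in_road` are invented here, not taken from FTK. In the real AM every pattern is compared in parallel; the loop below is sequential.

```python
def to_superstrip(coord, ss_width):
    """Coarsen a full-resolution hit coordinate into a super strip (SS) index."""
    return coord // ss_width


def find_roads(pattern_bank, hits_by_layer, ss_width):
    """Step 1: return the indices of patterns whose SS is hit in every layer.

    pattern_bank: list of tuples, one SS index per layer (hypothetical layout).
    hits_by_layer: one list of full-resolution coordinates per layer.
    """
    fired = [{to_superstrip(c, ss_width) for c in layer} for layer in hits_by_layer]
    roads = []
    for idx, pattern in enumerate(pattern_bank):
        if all(ss in fired[layer] for layer, ss in enumerate(pattern)):
            roads.append(idx)
    return roads


def hits_in_road(pattern, hits_by_layer, ss_width):
    """Step 2 input: the full-resolution hits inside one road, layer by layer."""
    return [[c for c in layer if to_superstrip(c, ss_width) == ss]
            for layer, ss in zip(hits_by_layer, pattern)]


bank = [(0, 0, 1, 1), (2, 2, 2, 3), (5, 5, 6, 6)]
hits = [[3, 21], [7, 22], [12, 25], [15, 31]]
roads = find_roads(bank, hits, ss_width=10)
candidates = hits_in_road(bank[roads[0]], hits, ss_width=10)
print(roads, candidates)
```

A larger `ss_width` needs fewer patterns but leaves more hits per road for the fitter, which is the trade-off the slide describes.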
Which banks we would like to have
What we have now: standard cell, 180 nm: 5000 patterns/chip for 6-layer patterns, 2500 patterns/chip for 12-layer patterns
("A VLSI Processor for Fast Track Finding Based on Content Addressable Memories", IEEE Transactions on Nuclear Science, Volume 53, Issue 4, Part 2, Aug. 2006, pages 2428-2433)
- 90 nm technology provides a factor 4 → 10000 patterns/chip
- Full custom cell provides at least a factor 2 → 20000 patterns/chip
- 8 layers instead of 12 provides a factor 1.5 → 30000 patterns/chip
- 1.5 x 1.5 cm^2 2D chip → 60000 patterns/chip
- Going to 65 nm → 120000 patterns/chip
With a 2D chip we gain a factor ~50!
- 1 AMBoard: 128 chips → ~15 Mpatterns per board
- 1 crate: 16 AMBoards → ~245 Mpatterns per crate
- 100 MHz running clock
NEXT: NEW VERSION, for both L1 & L2
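The scaling chain on this slide can be checked with a short calculation. The first three factors are quoted on the slide; the last two (a factor 2 for the larger chip and a factor 2 for 65 nm) are not stated explicitly and are inferred here from the 30000 → 60000 → 120000 steps.

```python
# Starting point: 180 nm standard-cell chip, 12-layer patterns.
base_patterns = 2500

steps = [
    ("90 nm technology", 4),
    ("full custom cell", 2),
    ("8 layers instead of 12", 1.5),
    ("1.5 x 1.5 cm^2 chip", 2),   # inferred from 30000 -> 60000
    ("65 nm technology", 2),      # inferred from 60000 -> 120000
]

patterns = base_patterns
for name, factor in steps:
    patterns *= factor
    print(f"{name:>24}: x{factor} -> {patterns:,.0f} patterns/chip")

gain = patterns / base_patterns   # overall gain, quoted as ~50 on the slide
per_board = patterns * 128        # 128 chips per AMBoard
per_crate = per_board * 16        # 16 AMBoards per crate

print(f"gain x{gain:.0f}, {per_board / 1e6:.2f} Mpatterns/board, "
      f"{per_crate / 1e6:.2f} Mpatterns/crate")
```

The exact product is a factor 48, which the slide rounds to 50; the board and crate totals (15.36 and 245.76 Mpatterns) match the rounded values quoted above.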
The CDF final AMchip architecture
[Figure: pattern bank fed by six 18-bit hit buses, Bus0[17:0] to Bus5[17:0], with blocks labelled "Add encoder" and "kill".]
Power consumption
Old chip (180 nm, 1.8 V core, 40 MHz, 1 x 1 cm^2): 1.8 W
Change for the new chip              Correction factor   Resulting power
Core 1.8 V → 1 V (90 nm)             1/(1.8*1.8)         0.56 W
Frequency 40 MHz → 100 MHz           100/40              1.39 W
Area 1 x 1 cm^2 → 4 cm^2             4/1                 5.56 W
New: pre-match feature               1/3 (1/2)           1.85 (2.78) W
Per crate: 16 x 128 = 2048 chips                         3.8 (5.7) kW
If the pre-match feature brings the power down to 1/3, the new 2D chip (1.85 W) ~ the old chip (1.8 W).
Any other idea that saves power increases the potential to grow in the third dimension.
We would like 4 funding agencies to be involved:
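The power estimate is a chain of multiplicative corrections, reproduced below as a sketch. The factors are the ones quoted on the slide; the variable names are illustrative only.

```python
old_power_w = 1.8                              # old chip: 180 nm, 1.8 V, 40 MHz, 1 cm^2

after_voltage = old_power_w / (1.8 * 1.8)      # core 1.8 V -> 1 V, factor 1/(1.8*1.8)
after_frequency = after_voltage * 100 / 40     # 40 MHz -> 100 MHz
after_area = after_frequency * 4 / 1           # 1 cm^2 -> 4 cm^2

prematch_third = after_area / 3                # pre-match leaves 1/3 of the power
prematch_half = after_area / 2                 # pessimistic case: 1/2

chips_per_crate = 16 * 128
crate_kw_third = chips_per_crate * prematch_third / 1000
crate_kw_half = chips_per_crate * prematch_half / 1000

print(f"{after_voltage:.2f} W, {after_frequency:.2f} W, {after_area:.2f} W")
print(f"pre-match: {prematch_third:.2f} ({prematch_half:.2f}) W per chip")
print(f"per crate: {crate_kw_third:.1f} ({crate_kw_half:.1f}) kW")
```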
LHC Schedule → Intermediate chip!
- 17.6 pile-up events @ 2.6 x 10^33
- 19.0 pile-up events @ 10^34
- Simulation with 75 pile-up events: after 2020!
- Concentrate now on 2013-2015 (17-19 pile-up events)
- Consider the evolution up to 2019 (41.5 pile-up events << the simulated 75 events) → Intermediate chip!
- 2020 comes much later and will profit from a much more advanced technology.
Annovi, 27-09-2010
Our Schedule
- TSMC 65 nm, low power, is available as mini@sic (Vcc_core = 1.2 V).
- 65 nm mini@sic: 22.5 k€/block; 90 nm mini@sic: 18.6 k€/block.
- "Variable resolution" gives good results → early production of AM04.
- We missed the 90 nm run of September 2010.
- We propose to move directly to a 65 nm prototype.
Preliminary schedule to produce new LAMBs for 2013:
(1) Mini@sic submission: spring or October 2011
(2) Delivery: ~February 2012
(3) Tested: ~June 2012
(4) MPW submission: from June 2012
(5) Delivery: from November 2012
(6) Tested: from February 2013
(7) MPW production: from February 2013
(8) Delivery: from July 2013
(9) Mounted on new LAMBs: from autumn 2013
Costs
- 2 mini@sic blocks: paid by Italy
- MPW run:
  TSMC 2010: 12 mm^2, 80 kUSD → 6.7 kUSD/mm^2
  UMC 2010: 4 mm x 4 mm, 70 k€ → 4.37 k€/mm^2
- 12 mm^2 ~ 1/8 of the AMchip03 area in CDF → 7500 patterns/chip → 960 kpatterns/AMBoard
- With 2 blocks: 160 kUSD → ~2 Mpatterns/AMBoard
- In 2012 it could cost less; Academia Sinica can help on the price.
- Italy, Germany, USA, Academia Sinica (reduction).
- For 2013: small production = 8+2 AMBoards = 1280 chips. How many wafers? How much per wafer?
- We would like to have 4 funding agencies, especially for the final step, a whole-wafer mask at the time when a large-area chip is needed:
  UMC 2010, 90 nm: 555 kUSD
  TSMC 2010, 65 nm: 1300-900 kUSD
  TSMC 2010, 65 nm MLM: 650-950 kUSD
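The unit costs and board capacities on this slide follow from simple arithmetic, sketched below. The prices and pattern counts are the slide's 2010 figures; the assumption that 2 blocks double the patterns per chip (capacity scaling linearly with area) is made here for illustration only.

```python
# MPW prices per unit area.
tsmc_kusd_per_mm2 = 80 / 12           # TSMC: 12 mm^2 block for 80 kUSD
umc_keur_per_mm2 = 70 / (4 * 4)       # UMC: 4 mm x 4 mm block for 70 kEUR

# Capacity of a board built from 12 mm^2 prototype chips.
patterns_per_chip = 7500              # from the slide
chips_per_board = 128
patterns_per_board = patterns_per_chip * chips_per_board

# Two blocks: twice the cost and (assumed) twice the patterns per chip.
blocks = 2
cost_kusd = blocks * 80
patterns_per_board_2_blocks = blocks * patterns_per_board

# Small production for 2013.
chips_small_production = (8 + 2) * chips_per_board

print(f"TSMC {tsmc_kusd_per_mm2:.2f} kUSD/mm^2, UMC {umc_keur_per_mm2:.3f} kEUR/mm^2")
print(f"{patterns_per_board / 1e3:.0f} kpatterns/board with 1 block, "
      f"{patterns_per_board_2_blocks / 1e6:.2f} Mpatterns/board with 2 blocks "
      f"({cost_kusd} kUSD)")
print(f"{chips_small_production} chips for 8+2 AMBoards")
```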
Packaging chips together in the LAMB
[Figure: pipelines of AM chips chained through add_in / add_out; AMchip control = GLUE.]
[Figure labels, LAMB pipeline: AMTOP and AMBOTTOM chips; hit buses Bus0 to Bus5; add_in and add_out; PAT_ADD_IN[17:0]; PAT_ADD_OUT; REV_EN; INDI; LAMB.]
[Figure labels, LAMB and AMBoard block diagram: 6 hit buses (108 bits!); GLUE; INDI; four 8-chip (top-bottom) AM pipelines; FPGA with VME interface, I/O control and FIFOs; ROAD connector; receivers and pipeline registers; LAMB drivers; connectors (ROAD bus + 6 HIT buses); signals HIT[17:0], ADD OUT[30:0], TRACKs.]
Packaging LAMBs together in the AMBoard
- Complementary functions are in the AUX board.
- [Figure: CDF AMBoard with 4 LAMBs (standard cell chips), control FPGA, FPGA for Roads, FPGA for SS input, 40 MHz clock, P3 serial LVDS; shown next to the FTK AMBoard.]
- 16 AMBoards per "core" crate → 8 core crates in the system
Packaging AMBoards inside a crate
- Processing Unit = AMBoard (LAMBs with standard cell chips, 40 MHz clock, control FPGA, FPGA for Roads, FPGA for SS input, P3 serial LVDS) + AUX card (hits on LVDS cables, input FIFOs, SSMAP, DO, TF, HW, tracks output interface, connectors for DO+TF+HW).
- [Figure, crate layout: CPU (VME); AM0+TSP+DO+TF+HW to AM7+TSP+DO+TF+HW; AM8+… to AM15+…; 11LayFit+HW; 11LayFit+HW final.]
- We need 8 such crates. Why?
The whole system: Data Formatter + 8 core crates
[Figure: the Pixels & SCT RODs send raw data to the ROBs and, over S-links at a 50~100 kHz event rate, hits to the Data Formatter (DF): cluster finding, split by layer, overlap regions. The DF feeds 8 η-φ towers (core crates with DO, TF, AM board, HW), followed by a second stage. Track data with ~offline-quality track parameters go to a ROB.]
FTK: 8 processors working in parallel because of the input bandwidth
IEEE Trans. Nucl. Sci. 51, 391 (2004)
- 6-12 logical layers (pixel barrel, SCT barrel, pixel disks): full η coverage
- Divide into φ sectors with overlaps [Figure: two AMs, each labelled 1/2 φ]
- 6 18-bit buses, hit rate 40 MHz/bus: input bandwidth of ~4 Gbit/s
- Goal at high luminosity: 8 φ sectors → 8 9U VME crates for the FTK core
- Overlaps require hits in a small region to be sent to two neighboring AMs
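The bandwidth figure and the overlap rule can be illustrated with a small sketch. The bus count, width and rate come from the slide; the function `phi_sectors` and the overlap half-width of 0.05 rad are hypothetical, chosen only to show how a hit near a sector boundary ends up in two neighboring AMs.

```python
import math

# Input bandwidth of one AM: 6 buses x 18 bits x 40 MHz.
N_BUSES, BUS_WIDTH_BITS, HIT_RATE_HZ = 6, 18, 40e6
bandwidth_gbit_s = N_BUSES * BUS_WIDTH_BITS * HIT_RATE_HZ / 1e9


def phi_sectors(phi, n_sectors=8, overlap=0.05):
    """Return the phi sector(s) that must receive a hit at azimuth phi (radians).

    overlap is an assumed half-width of the shared region around each
    sector boundary; the actual FTK value is not given in the slides.
    """
    width = 2 * math.pi / n_sectors
    phi = phi % (2 * math.pi)
    k = int(phi // width)
    home = k % n_sectors
    offset = phi - k * width          # distance from the lower sector edge
    sectors = [home]
    if offset < overlap:
        sectors.append((home - 1) % n_sectors)   # close to the lower edge
    elif width - offset < overlap:
        sectors.append((home + 1) % n_sectors)   # close to the upper edge
    return sectors


print(f"{bandwidth_gbit_s:.2f} Gbit/s")
print(phi_sectors(0.40), phi_sectors(0.01), phi_sectors(0.78))
```

The exact product is 4.32 Gbit/s, which the slide rounds to 4 Gbit/s.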
Whatever the power of the AM we can build, we can do better with the TSP
The Tree Search Processor (TSP): binary search to go down to better SS resolutions
[Figure labels: FAT ROAD found by the AM (default SS, for example); Depth 0, Depth 1, Depth 2; PARENT PATTERN BLOCK; sub-patterns numbered 1 to 8; THIN ROAD with sub-patterns numbered 1 to 4.]
References:
- Algorithm: NIM A287 (1990) 436-438, http://www.pi.infn.it/~paola/Tree_search_algorithm.pdf
- Tree Search Processor: NIM A 287, 431 (1990), http://www.pi.infn.it/~orso/ftk/NIMA287_431.pdf
- IEEE Toronto, Canada, November 8-14 1998, http://www.pi.infn.it/~paola/TSP_v14.pdf
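The idea of descending from a fat road to thin roads can be sketched in software. This is only an illustration, not the hardware algorithm of the cited NIM papers. It assumes that each depth halves the SS width, that the pattern tree is a plain dictionary mapping `(depth, parent_pattern)` to its child patterns, and that hits are integer coordinates; the names `refine_road` and `children` are hypothetical.

```python
def refine_road(road, depth, max_depth, children, hits_by_layer, ss_width):
    """Descend the pattern tree from a fat road to the matching thin roads.

    road: tuple of SS indices (one per layer) at the given depth.
    children: dict {(depth, parent_pattern): [child patterns at depth + 1]}.
    ss_width: SS width at depth 0; each depth halves it.
    """
    if depth == max_depth:
        return [road]
    child_width = ss_width // 2 ** (depth + 1)
    thin_roads = []
    for child in children.get((depth, road), []):
        # A child survives only if every layer has a hit in its finer SS.
        if all(any(c // child_width == ss for c in layer)
               for layer, ss in zip(hits_by_layer, child)):
            thin_roads.extend(refine_road(child, depth + 1, max_depth,
                                          children, hits_by_layer, ss_width))
    return thin_roads


# Toy 3-layer tree: depth 0 uses width 8, depth 1 width 4, depth 2 width 2.
children = {
    (0, (0, 0, 0)): [(0, 0, 0), (0, 0, 1), (1, 1, 1)],
    (1, (0, 0, 1)): [(0, 1, 2), (1, 1, 3)],
    (1, (1, 1, 1)): [(2, 2, 3), (3, 3, 3)],
}
hits = [[5], [4], [6]]
thin = refine_road((0, 0, 0), 0, 2, children, hits, ss_width=8)
print(thin)
```

Only the branches that still match are followed, so the number of comparisons grows with the tree depth rather than with the full size of the thin-pattern bank.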
Example: 2-level TSP → divide each SS by 4
- The higher-resolution SS (sub-SS) should be stored in the AM or in a Mini-DO, and the LSB bits should be provided to the TSP.
For each found road, the AM chip could provide:
- the road identifier (address);
- the bitmap: one bit per layer, saying which SSs are empty and which are full (11 bits, e.g. 11101111111);
- 4 more bits per layer (sub-SS), saying which of the 4 SS subdivisions are empty and which are full (4 bits x 8 layers).
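A possible software picture of this per-road output is sketched below. The layout is hypothetical: the slide fixes only the content (road address, one bit per layer, 4 sub-SS bits per layer), not an encoding. Here the layer bitmap is derived from the sub-SS masks, and the names `sub_ss_mask` and `road_word` are invented for the example.

```python
def sub_ss_mask(hits, ss_index, ss_width, n_sub=4):
    """Bit k is set if subdivision k of super strip ss_index contains a hit."""
    sub_width = ss_width // n_sub
    mask = 0
    for c in hits:
        if c // ss_width == ss_index:
            mask |= 1 << ((c % ss_width) // sub_width)
    return mask


def road_word(road_id, pattern, hits_by_layer, ss_width):
    """Collect what the AM chip could send out for one found road."""
    masks = [sub_ss_mask(layer_hits, ss, ss_width)
             for layer_hits, ss in zip(hits_by_layer, pattern)]
    return {
        "road_id": road_id,
        # One bit per layer: 1 if the SS of that layer has at least one hit.
        "bitmap": "".join("1" if m else "0" for m in masks),
        # Four bits per layer, most significant subdivision first.
        "sub_ss": [format(m, "04b") for m in masks],
    }


word = road_word(road_id=42, pattern=(1, 1, 2),
                 hits_by_layer=[[9], [], [22, 23]], ss_width=8)
print(word)
```

In this toy event the middle layer is empty, so its bitmap bit and its four sub-SS bits are all zero.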
Conclusions
- Applications at future instantaneous luminosities will require an extremely high-performance AM.
- Even with such an AM, its output could be refined by the TSP, which could fit in the same package as the AM chip in a 2.5D technology. This is actually NOT true any more, probably, before 2020.
- The AM could be used for both L1 and L2 applications.
- Any increase in AM pattern capacity would be an important advantage for both L1 and L2 tracking systems.
BACKUP
The FTK CHALLENGING PART: the NEW AMCHIP & the TSP
[Figure: LAMB with standard cell chips, 40 MHz clock and FPGA; candidate TSP positions marked "+TSP?".]
Where can we stack the TSP?
- In the AUX board, just after the AMBoard?
- In the AMBoard itself?
- In the LAMB, to reduce the number of roads early?
- Even better: in the AMchip, in 2.5D!