LHCb upgrade Workshop, Oxford, 07.12.2010 Xavier Gremaud (EPFL, Switzerland)


Outline:
- Data flow
- Input data format
- Time reordering
- Clusterization
- Output format
- Conclusion

Data flow per GBT link (input: data from two column processors):
- Split the 80b GBT data into 2 x 40b streams
- Time reordering
- Reconstruct the Super Pixel Packet (SPP), 80b wide
- Linker 0: assemble the data from the 2 SPP data streams
- Clusterization + ToT correction (subtraction), maybe lookup-table-based calibration (one instance per SPP data stream)

- Linker 1: assemble data from 3 GBT links, 64b -> 128b
- Linker 2: assemble data from 2x3 GBT links, 128b -> 256b
- Linker 3: assemble data from 2x2x3 GBT links, 256b -> 512b
- Linker 4: assemble data from 2x2x2x3 = 24 GBT links, 512b
- MEP assembly (note: an average event is only … bit words long)
- External memory, 2x256b
- Ethernet framer, 512b
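
A rough software illustration of this fan-in tree (a sketch, not the VHDL design; the grouping of streams per stage is inferred from the link counts above and is an assumption):

```python
# Sketch of the linker fan-in: 24 GBT streams are merged in stages of growing bus width.
# Groupings (3 at the first stage, 2 afterwards) are inferred from the slide, not confirmed.

stages = [
    # (name, streams merged per linker, input width, output width in bits)
    ("Linker 1", 3, 64, 128),
    ("Linker 2", 2, 128, 256),
    ("Linker 3", 2, 256, 512),
    ("Linker 4", 2, 512, 512),   # bus width stays 512b at the last stage per the slide
]

streams = 24                      # one stream per GBT link
for name, group, w_in, w_out in stages:
    streams //= group             # number of output streams after this stage
    print(f"{name}: {streams} stream(s) of {w_out}b (merging {group} x {w_in}b)")
# Expected: 8, 4, 2 and finally 1 single 512b stream feeding the MEP assembly.
```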

Input data format:
- For 1 link: 80b / 25 ns = 3.2 Gb/s
- For 24 links: ~77 Gb/s
- The 80b wide GBT word is divided into two 40b data streams, which are filled by the column processors (each has a fixed position in the 80b data word)
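
A quick cross-check of these bandwidth figures (plain arithmetic, not part of the design):

```python
# Cross-check of the quoted input bandwidth (one 80b GBT word every 25 ns, 24 links).
GBT_WORD_BITS = 80      # bits per GBT word
BX_PERIOD_NS = 25       # bunch-crossing period in ns
N_LINKS = 24            # GBT links per processing board

per_link_gbps = GBT_WORD_BITS / BX_PERIOD_NS   # 3.2 Gb/s
total_gbps = N_LINKS * per_link_gbps           # 76.8 Gb/s, i.e. ~77 Gb/s
print(f"per link: {per_link_gbps:.1f} Gb/s, total: {total_gbps:.1f} Gb/s")
```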

Time reordering:
- The RAM space is divided into 512 equally sized memory blocks (space reserved for data arriving in random order).
- The RAM location is defined by the LSBs of the BxID (BCNT), as sketched below.
- Note: the total memory space required is (max. time delay allowed) x (max. event size allowed), because space has to be reserved for every event!
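
A minimal software model of this addressing scheme (an illustration only, not the VHDL implementation; the 512-event depth and the 8-SPP event slot are taken from the next slide, everything else is an assumption):

```python
# Software model of the time-reorder buffer: the LSBs of the BxID select the slot
# reserved for that event; SPPs arriving out of order are simply written into their slot.

REORDER_DEPTH = 512      # events kept in flight (2^9 -> addressed by 9 LSBs of the BxID)
SLOT_WORDS = 8           # reserved SPPs per event and per link (from the next slide)

ram = [[None] * SLOT_WORDS for _ in range(REORDER_DEPTH)]
fill = [0] * REORDER_DEPTH          # write pointer inside each event slot

def write_spp(bxid: int, spp) -> None:
    """Store one SPP in the slot selected by the BxID LSBs (drop it on overflow)."""
    slot = bxid & (REORDER_DEPTH - 1)
    if fill[slot] < SLOT_WORDS:
        ram[slot][fill[slot]] = spp
        fill[slot] += 1

def read_event(bxid: int):
    """Read out and free the slot once its bunch crossing is due, restoring time order."""
    slot = bxid & (REORDER_DEPTH - 1)
    data = ram[slot][:fill[slot]]
    fill[slot] = 0
    return data
```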

- In the current FPGA, the EP4SGX530 (largest Altera Stratix IV device), «only» 64 x 144 kbit (M144K) memory blocks are available.
- Choosing a time-reorder buffer 512 events deep with an 8-word event size occupies 48 memory blocks (maximum size reached!).
- Note: no other large memories are required for the other processing steps.
Conclusion:
- Each GBT link is restricted to 8 SPPs (Super Pixel Packets) of less than 64 bits each.
- For the total pixel chip, the maximum number of SPPs is 5x8 = 40 per event.
- Time reordering is possible for up to 498 events.
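
One way the 48-block figure can be reproduced (a sketch under assumptions: one reorder buffer per GBT link and M144K blocks used in a 2K-word-deep configuration, one SPP per memory word; neither is stated on the slide):

```python
# Cross-check of the memory budget for the time-reorder buffers.
# Assumptions: one buffer per GBT link, M144K blocks configured 2048 words deep,
# one (<64-bit) SPP stored per memory word.

N_LINKS = 24
EVENT_DEPTH = 512        # events buffered for time reordering
SPP_PER_EVENT = 8        # reserved SPPs per event and per link
M144K_DEPTH = 2048       # words per M144K block in a 2K-deep configuration

words_per_link = EVENT_DEPTH * SPP_PER_EVENT             # 4096 words
blocks_per_link = -(-words_per_link // M144K_DEPTH)      # ceiling division -> 2 blocks
total_blocks = N_LINKS * blocks_per_link                 # 48 of the 64 available blocks
print(words_per_link, blocks_per_link, total_blocks)
```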

Clusterization:
- Clusterization requires splitting up the SPP format (for example, two isolated pixels can be in the same SPP)!
- The most obvious approach to clusterization is to take one seeding pixel and search for possible neighbours.
- It is very difficult to form "perfect" clusters: the average time per cluster is limited to 25 ns if done in a pipeline, otherwise 25 ns are available for the complete event!
- The 16b seeding-hit address is reconstructed from the 12b address, the 4b row header and the 4b hitmap.
- An additional link source ID is required to identify data from the 24 different GBT links (+5 bit).
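
A hedged sketch of the seed-address reconstruction: only the field widths (12b address, 4b row header, 4b hitmap, 5b source ID) come from the slide; the field ordering and the role of the hitmap bits are assumptions.

```python
# Assumption-based sketch of rebuilding the 16b seeding-hit address from an SPP.
# Field order and the hitmap convention are guesses; only the widths are from the slide.

def seed_address(spp_addr_12b: int, row_header_4b: int, hitmap_4b: int) -> int:
    """Return a 16b address for the first (seeding) hit of the SPP."""
    if hitmap_4b == 0:
        raise ValueError("SPP without hits")
    first_hit = (hitmap_4b & -hitmap_4b).bit_length() - 1   # lowest set hitmap bit
    base = (spp_addr_12b << 4) | (row_header_4b & 0xF)      # 12b + 4b -> 16b base address
    return (base + first_hit) & 0xFFFF                      # offset by the seed position

def tag_with_source(addr_16b: int, gbt_link_id: int) -> int:
    """Prepend the 5b source ID that distinguishes the 24 GBT links (+5 bit per cluster)."""
    return ((gbt_link_id & 0x1F) << 16) | addr_16b
```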

- The principal goal of the clusterization is data reduction; "perfect" clustering as in the TELL1 is not possible anymore. Additional processing in a CPU is required to finish:
  - forming clusters across the boundaries of GBT links,
  - combining separated clusters,
  - forming clusters for events with too high a pixel count (see the illustration on the next slide).
- The cluster shape depends on the seeding hit, which is the first hit; one "normal" cluster can be split into two clusters.

- To pipeline the cluster search, only one cluster is formed per pipeline step.
- One pipeline step takes 25 ns (200-300 MHz processing frequency).
- On average, the hottest region has "only" 2..4 pixels per event and per GBT link (… pixels per chip)!
- The cluster search is performed by searching for neighbours of the first hit in the data; each consecutive pipeline stage has the identical function (see the sketch below).
- The total number of clusters that can be formed is limited by the number of pipeline stages.
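
A small software model of this pipelined search (a sketch, not the VHDL design; the neighbour definition "adjacent to the seed within one pixel" is an assumption). It also illustrates why a "normal" cluster can end up split in two:

```python
# Software model of the pipelined cluster search: every stage peels off exactly one
# cluster, seeded by the first remaining hit, and hands the leftover hits onwards.

from typing import List, Tuple

Hit = Tuple[int, int]                 # (column, row) pixel coordinates

def one_stage(hits: List[Hit]) -> Tuple[List[Hit], List[Hit]]:
    """One pipeline stage: form a single cluster around the first hit."""
    if not hits:
        return [], []
    seed = hits[0]
    cluster = [h for h in hits if abs(h[0] - seed[0]) <= 1 and abs(h[1] - seed[1]) <= 1]
    remaining = [h for h in hits if h not in cluster]
    return cluster, remaining

def cluster_event(hits: List[Hit], n_stages: int) -> List[List[Hit]]:
    """Run one event through a fixed number of stages; leftover hits stay unclustered."""
    clusters = []
    for _ in range(n_stages):
        cluster, hits = one_stage(hits)
        if cluster:
            clusters.append(cluster)
    return clusters                   # anything left in `hits` exceeded the stage budget

# Example: a 2x2 cluster plus one isolated pixel, processed by 4 pipeline stages.
print(cluster_event([(10, 5), (10, 6), (11, 5), (11, 6), (40, 7)], n_stages=4))
```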

- The cluster size is restricted to multiples of bytes! (Data processing on the FPGA, but also on the CPU, becomes very difficult otherwise.)
- The expected data reduction from clustering, assuming 50% 1-hit and 50% 2-hit clusters, is of the order of 14%.
Q: Is it worthwhile doing "not perfect" clustering for a 14% data reduction?
Q: Does the CPU take advantage of such clusters?
Q: Does anybody know another feasible clustering approach?

            With clusterization    Without clusterization    Data reduction
  1 hit     29b  => 32b            25b  => 32b                0%
  2 hits    36b  => 40b            50b  => 56b               28.5%
  3 hits    43b  => 48b            75b  => 80b               40.0%
  4 hits    50b  => 56b            100b => 104b              46.1%
  5 hits    57b  => 64b            125b => 128b              50.0%
  6 hits    68b  => 72b            150b => 156b              53.8%
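
The reduction column can be reproduced directly from the padded sizes in the table (values copied from the slide; small differences are rounding):

```python
# Reproduce the "data reduction" column from the padded sizes quoted in the table.
padded = {  # hits: (with clusterization, without clusterization), in bits, from the slide
    1: (32, 32), 2: (40, 56), 3: (48, 80), 4: (56, 104), 5: (64, 128), 6: (72, 156),
}

for hits, (clustered, plain) in padded.items():
    print(f"{hits} hit(s): {1 - clustered / plain:.1%}")

# For 50% 1-hit and 50% 2-hit clusters: 0.5 * 0% + 0.5 * 28.6% ~ 14%, as quoted above.
```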

Output format:
- After the 24 links are linked together, the data are put into an MEP format to reduce the data volume before the DDR3 SDRAM.
- The BCnt appears only once per event, so a small data reduction can be expected (-12 bit).

Conclusion:
- The real challenge of the data processing is not to spend more than 25 ns per event; pipelining is required everywhere!
- Time reordering for 512 events reaches the limit of the FPGA internal memory.
- ToT calculation from BCnt and timestamp is no problem; calibration per pixel is impossible!
- There is no more real data reduction (zero suppression) as in the TELL1:
  - small reduction from removing the BCNT (-12 bit / SPP),
  - small increase from the source ID (+5 bit / cluster),
  - small decrease from clustering (-14%),
  - the largest reduction comes from the not fully loaded GBT links of the pixel chips furthest from the beam,
  - a long-term average reduction comes from empty bunch crossings.

- Very wide buses require large multiplexers for padding (e.g. byte padding on a 512-bit bus requires a 512x64 multiplexer, i.e. 32K connections). Maybe at some stage in the processing the padding has to go to a 32-bit minimal size.
- Is the clusterization useful and fast enough? Some tests with real data and a distribution of the cluster sizes are needed.

- Implementation of the processing, including clustering, in VHDL
- Simulation of the processing with MC data
- Place and route of the design to get a better idea of the possible processing frequency and resource management