The Associative Memory Chip


1 The Associative Memory Chip
Saverio Citraro – University of Pisa & INFN. Hi, I'm going to talk about the Associative Memory chip, which is the key tool for real-time image analysis in our European FP7 project. This project has received funding from the European Union's FP7 for research, technological development and demonstration under grant agreement n FP7-PEOPLE-2012-IAPP. 3rd Workshop on the Architecture of Smart Cameras, June 30 - July 1, 2014, Pisa.

2 Outline
- The Associative Memory (AM) chip overview
- AM chip working principles
- Internal architecture
- Applications: High Energy Physics, image pre-filter, image understanding
- Conclusion
I will first show an overview of the chip, then I will explain its working principles and internal architecture, and finally we will look at some applications.

3 AM chip overview
- ASIC designed by INFN
- Made of a CAM matrix
- The AM detects coincidences inside large independent data streams
- Was used for the High Energy Physics experiment CDF at Fermilab
- Will be used for the High Energy Physics experiment ATLAS at CERN
- We are working to improve its features
- We would like to use this hardware for embedded systems and real-time applications
The Associative Memory chip is an ASIC designed by INFN. It is made of a big matrix of CAM cells and detects coincidences in a huge amount of data flowing in independent data streams. The chip was used in the CDF experiment at Fermilab and will be used in the ATLAS experiment at CERN, both in the high-energy physics field. We are continuously working to improve the device features, and we would also like to use this hardware for embedded systems and real-time applications.

4 How could it be used? An example:
- Hypothetical situation (just to explain the idea)
- Hundreds of measurements from different sensors such as thermometers, anemometers, barometers, etc.
- The AM searches for coincidences between the different input data streams
- Training phase: store the coincidences that indicate a growing tornado
- Analysis phase: acquire data and look at the response
- To improve the efficiency of the system, we can update the significant data in the memory, learning from "past experience"
The aim is to find coincidences in a crowd of data. Let me give an example of a hypothetical situation, just to convey the idea. We have a huge amount of data coming from different sensors, for example thermometers, anemometers and so on. The AM searches for coincidences between specific data flowing in these data streams; to be specific, we search for the coincidences that indicate a growing tornado. In the training phase we evaluate all the possible tornado conditions and fill the memory with all these possible coincidences before taking data. In the analysis phase we acquire the data and look at the response. To improve the efficiency of the system, we can update the memory, learning from "past experience".
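As an illustration only (not part of the talk), the idea can be sketched in a few lines of Python; the sensor bands, thresholds and stored patterns below are invented, and the real AM chip performs this matching in parallel hardware over independent data streams.

```python
# Minimal software sketch of the tornado example (illustrative values only).

# Training phase: fill the memory with the sensor-value combinations
# ("patterns") that indicate a growing tornado.
trained_patterns = {
    # (temperature band, pressure band, wind band)
    ("hot", "low_pressure", "high_wind"),
    ("hot", "falling_pressure", "high_wind"),
    ("warm", "low_pressure", "very_high_wind"),
}

def quantize(temperature_c, pressure_mbar, wind_ms):
    """Reduce raw readings to coarse bands, much like a CAM word would."""
    t = "hot" if temperature_c > 30 else "warm" if temperature_c > 20 else "cold"
    p = ("low_pressure" if pressure_mbar < 990
         else "falling_pressure" if pressure_mbar < 1005
         else "normal_pressure")
    w = "very_high_wind" if wind_ms > 25 else "high_wind" if wind_ms > 15 else "calm"
    return (t, p, w)

# Analysis phase: acquire data and look at the response.
for temperature, pressure, wind in [(32.1, 985.0, 18.0), (21.0, 1012.0, 3.0)]:
    pattern = quantize(temperature, pressure, wind)
    print(pattern, "-> tornado warning" if pattern in trained_patterns else "-> ok")

# Learning from "past experience": add a newly observed coincidence.
trained_patterns.add(("warm", "falling_pressure", "very_high_wind"))
```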

5 Internal Architecture
- CAM cell matrix, with the CAM cells grouped: 1 pattern = 8 CAMs
- Each word = 16 bits
- 128K patterns
- Each CAM cell has a dedicated flip-flop
- All CAMs compare their data with the data bus in one clock cycle
- Checks 100 Tera words/s
Let's see the architecture. The AM is similar to, but more complex than, a CAM. A content addressable memory accepts a data word as input and returns all the addresses where that word is stored. Our AM instead is made of multiple CAM cells: 8 CAM cells are grouped into one pattern, which performs the coincidence of the CAM outputs. Each CAM receives data from a 16-bit input bus. In one clock cycle all CAM cells compare their stored content with the data on the bus, so one chip performs 100 tera comparisons per second! A powerful feature is that each CAM has a flip-flop that stores the result of the match. In this way the matches that contribute to the coincidence can happen at different times: the input buses can be asynchronous, and the matches are independent.
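To make the structure concrete, here is a minimal Python model of a single pattern, written as an assumption from the description above (8 CAM words of 16 bits, each with its own flip-flop); the parallelism of the real silicon is of course only simulated.

```python
class Pattern:
    """Software model of one AM pattern: 8 CAM words of 16 bits, each with a
    flip-flop that remembers whether its word has already matched on its bus."""

    WORDS_PER_PATTERN = 8
    WORD_MASK = 0xFFFF  # 16-bit words

    def __init__(self, words):
        assert len(words) == self.WORDS_PER_PATTERN
        self.words = [w & self.WORD_MASK for w in words]
        self.flipflops = [False] * self.WORDS_PER_PATTERN

    def clock(self, bus_data):
        """One clock cycle: bus_data[i] is the word currently on bus i, or
        None if that bus carries no new data (the buses are independent)."""
        for i, data in enumerate(bus_data):
            if data is not None and (data & self.WORD_MASK) == self.words[i]:
                self.flipflops[i] = True  # the match is remembered

    def matched_all(self):
        """Coincidence: every CAM cell of the pattern has matched."""
        return all(self.flipflops)

    def reset(self):
        """Clear the flip-flops, e.g. at the start of a new event."""
        self.flipflops = [False] * self.WORDS_PER_PATTERN
```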

6 Working Example (animation: match → coincidence → output)
Now I will show you an animation to clarify what I said. The inputs arrive independently of each other on different data streams. In each clock cycle the input data is compared with 128K CAM cells. If the match is positive, the corresponding flip-flop is set. When all the flip-flops in a row are set, the pattern address is read out.
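Using the sketch above, the animation can be mimicked in software: the words of a stored pattern arrive on different buses in different clock cycles, and the flip-flops hold the partial matches until the coincidence is complete (the values are arbitrary).

```python
p = Pattern([0x0101, 0x0202, 0x0303, 0x0404, 0x0505, 0x0606, 0x0707, 0x0808])

# The inputs arrive independently of each other, spread over three cycles.
p.clock([0x0101, None, None, None, None, None, None, None])
p.clock([None, 0x0202, 0x0303, 0x0404, 0x0505, None, None, None])
p.clock([None, None, None, None, None, 0x0606, 0x0707, 0x0808])

print(p.matched_all())  # True: the pattern address would be read out
```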

7 Majority and Readout
- The patterns have a majority logic
- The majority threshold is programmable
- If the number of matches exceeds the threshold, the pattern address is read out
Another interesting feature: each pattern has a majority logic with a programmable threshold, so we can tolerate a single or even multiple failures. If the number of matches inside the pattern exceeds the threshold, the address of the pattern is read out to the output.
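Continuing the same sketch (my illustration, not the actual chip logic), the majority readout is simply a programmable threshold on the number of flip-flops that are set.

```python
def matched_majority(pattern, threshold):
    """Pattern fires if at least `threshold` of its 8 CAM cells have matched."""
    return sum(pattern.flipflops) >= threshold

p = Pattern([0x0101, 0x0202, 0x0303, 0x0404, 0x0505, 0x0606, 0x0707, 0x0808])
p.clock([0x0101, 0x0202, 0x0303, 0x0404, 0x0505, 0x0606, 0x0707, 0xFFFF])

print(p.matched_all())         # False: the word on bus 7 never matched
print(matched_majority(p, 7))  # True: 7 of 8 matches reaches the threshold
```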

8 Variable Pattern resolution
- Set a different resolution for each CAM in the pattern
- Each CAM cell can use from 2 up to 6 don't-care bits (LSBs)
- Tune the cell width to the input source
- Save memory space when lower-resolution data is enough to identify the coincidence
Another useful feature is the variable pattern resolution: we can set from 2 up to 6 don't-care bits in each CAM cell. In this way it is possible to tune the cell width to the data source, and also to make a pattern more precise in one case and less precise in another. With this strategy we can save memory space when lower-resolution data is enough to identify the coincidence. Here, for example, the red pattern uses 6 don't-care bits in the CAM on the first channel, so it is less precise than the others.
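A don't-care bit simply excludes that bit position from the comparison; in software it can be sketched as a mask over the least-significant bits (the words below are made up for illustration).

```python
def cam_match(stored, incoming, dont_care_bits):
    """Ternary comparison of two 16-bit words, ignoring the n lowest bits
    (the AM chip allows from 2 up to 6 don't-care bits per CAM cell)."""
    mask = 0xFFFF & ~((1 << dont_care_bits) - 1)
    return (stored & mask) == (incoming & mask)

print(cam_match(0x1234, 0x1237, dont_care_bits=3))  # True: only low bits differ
print(cam_match(0x1234, 0x1334, dont_care_bits=3))  # False: a high bit differs
```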

9 Some other features
- IP from Silicon Creations: SerDes at 2.4 Gbit/s
- Input bandwidth = 8 buses x 2.4 Gbit/s = 19.2 Gbit/s
- Output bandwidth = 2.4 Gbit/s
- Internal logic speed: 100 MHz
- Power supply: 2.5 V I/O, 1.2 V standard-cell logic, 1 V full-custom cells
- Total power consumption: 2.5 W
- Final package: HSBGA 529, 23 mm²
A big difference between our chip and a typical memory is the input and output bandwidth. We use an IP core made by Silicon Creations, an LVDS SerDes at 2.4 Gbit/s. The total input bandwidth is 8 buses times 2.4 Gbit/s, that is 19.2 Gbit/s; on output we have 2.4 Gbit/s. The internal clock is 100 MHz, and the power consumption at full operation is roughly 2.5 W. The picture summarizes the evolution of the chip, which started in 1990 with a chip of only 128 patterns made in a 0.7 µm technology; now we are developing a chip at 65 nm with 128K patterns and serial I/O.
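As a quick cross-check of the quoted figures (my arithmetic, using the numbers from this slide and the architecture slide):

```python
patterns = 128 * 1024      # 128K patterns
cams_per_pattern = 8       # 8 CAM words per pattern
clock_hz = 100e6           # 100 MHz internal logic

comparisons_per_s = patterns * cams_per_pattern * clock_hz
print(f"{comparisons_per_s:.2e} word comparisons/s")  # ~1.05e14, i.e. ~100 T words/s

input_bw_gbps = 8 * 2.4    # 8 buses x 2.4 Gbit/s
print(f"{input_bw_gbps} Gbit/s input bandwidth")      # 19.2 Gbit/s
```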

10 High Energy Physics Applications
This is the product of a collision between two proton bunches; the Associative Memory:
- Analyzes thousands of tracks in each event to decide in a few microseconds which ones are good enough to be stored
- Solves the huge combinatorial problem by exploiting its parallelism
- Provides a full list of high-resolution tracks to the High Level Trigger algorithms for each event
- Event size: 1.5 Mbyte; event rate: 100 kHz; total system bandwidth: 150 Gbyte/s
Now let's see how this hardware is used in high-energy physics. Each collision between two proton bunches produces thousands of particle tracks, and the system has to decide within tens of microseconds whether an interesting event is hidden inside this crowd of tracks. The Associative Memory chip solves the huge combinatorial problem, which is the most time-consuming task, and provides a full list of high-resolution tracks to the High Level Trigger algorithms for each event. To give an idea, we manage 1.5 Mbyte of data for each event at a rate of 100 kHz, for a total of 150 Gbyte per second.
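The total bandwidth follows directly from the event size and rate; a trivial check, but it pins down the units:

```python
event_size_mbyte = 1.5   # Mbyte per event
event_rate_hz = 100e3    # 100 kHz event rate

total_gbyte_per_s = event_size_mbyte * event_rate_hz / 1000
print(f"{total_gbyte_per_s:.0f} Gbyte/s total system bandwidth")  # 150 Gbyte/s
```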

11 Applications outside of HEP:
- AM chip at the low level
- AM chip at the high level
Now we want to understand how the AM chip could work outside the high-energy physics field. I will propose a pre-filter application, and then I will summarize the interesting features to understand whether it is possible to use this idea at the high level.

12 Image pre-filter: Michela Del Viva algorithm (previous talk)
- Store in the memory the significant patterns (e.g. 3x3) needed to recognize the image. B/W: 512 patterns; B/W + time: 128 M patterns
- Reconstruct the image accepting only patterns that match the stored ones
- The result is a compressed image in which the relevant features are still recognizable
- 4 grey levels: 2000 stored patterns; 1/64 of one chip = 50 mW
Here I want to show that it is possible to use the AM chip to implement the filter algorithm explained in the previous talk. The image can be reconstructed using only the significant patterns stored in the memory during a training phase. We build small arrays of pixels (3x3 for static images or 3x3x3 for movies) that become the relevant patterns stored in the memory, and the images are reconstructed accepting only the patterns that match the memory content. I report here the result of two simulations of the algorithm applied to the same picture on the left: in the picture in the centre only 16 stored patterns (3% of the total) are used to filter the image, while in the second image patterns with 4 grey levels are used and 2000 patterns are stored (1% of the total). In this case we use only a small fraction of the memory available in a single chip and the power consumption is quite low, so this function can easily be added to a low-power application like a smart camera. Similarly, with 1 million patterns (1% of the total), corresponding to 8 AM chips, we can analyze movies. I cannot show a simulation study for movie analysis because it is too time-consuming; we will do it using the hardware. Accepting only these 16 stored patterns:
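To give an idea of how such a pre-filter could look in software, here is a sketch under strong simplifying assumptions: a binary image, and a small hand-picked set of "significant" 3x3 patterns in place of the real training described in the previous talk.

```python
import numpy as np

def patch_code(patch):
    """Encode a 3x3 binary patch as a 9-bit integer (one of the 512 B/W patterns)."""
    return int("".join(str(int(v)) for v in patch.flatten()), 2)

def prefilter(image, stored_codes):
    """Keep a pixel only if its 3x3 neighbourhood matches a stored pattern,
    mimicking the AM-style pattern match."""
    height, width = image.shape
    out = np.zeros_like(image)
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            patch = image[y - 1:y + 2, x - 1:x + 2]
            if patch_code(patch) in stored_codes:
                out[y, x] = image[y, x]
    return out

# Hypothetical training result: a handful of edge-like patterns.
stored_codes = {0b000111111, 0b111111000, 0b001011011, 0b110110100}

image = (np.random.rand(16, 16) > 0.5).astype(np.uint8)
filtered = prefilter(image, stored_codes)
print(int(filtered.sum()), "pixels survive the filter")
```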

13 AM chip at the high level
- CAM-based, high density and high match rate
- Independence between inputs
- Majority logic
- Don't-care bits
- High bandwidth
- We are studying the possibility of using this hardware in the "image understanding" phase: binary classification, machine learning
Let's summarize our idea. We have a memory based on CAM cells, with high density and a high match rate; the input buses are independent of each other; we can use the majority logic and the don't-care bits to be flexible in the comparison; and we have a high bandwidth. We are trying to understand whether, with these features, it is possible to implement high-level algorithms directly in hardware, like binary classification or, more generally, machine learning.

14 Conclusion
- AM chip hardware architecture and useful features
- HEP application
- Applications outside HEP
In conclusion, we have seen the AM chip internal architecture with all its useful features and a quick look at the high-energy physics application. Then we saw the image pre-filter application, and I described a general approach to a decision problem.

15 Are there any questions?
Thank you! Are there any questions?

16 Backup

17 Variable Pattern resolution, Memory space
The coincidence to be found: 28.0° < T < 28.9° AND P = 1010 mbar etc. AND V = 5 m/s.
Without don't-care bits I have to store: 28.0° 1010 mbar 5 m/s …, 28.1° 1010 mbar 5 m/s …, 28.2° 1010 mbar 5 m/s …, and so on up to 28.9° 1010 mbar 5 m/s …
With don't-care bits I can store a single entry: 28.X° 1010 mbar 5 m/s …

18 ATLAS silicon detector

19 FTK Architecture
- ~380 ROLs @ 2 Gb/s: ~100 GB/s
- 8 x 128 S-links @ 6 Gb/s: ~800 GB/s
- Only here: 750 S-links, ~200 GB/s
- Pattern matching: ~25 TB/s
- 5 ATCA crates (input and output data formatting)
- 8 VME crates: 128 Processing Units (track reconstruction)
- Total of ~1500 FPGAs, ~8200 AM chips

20 Pattern matching

21 Processing Unit

