Content Addressable Memories

Content Addressable Memories Vahid Tabatabaee Fall 2007

References
Panos C. Lekkas, Network Processors: Architectures, Protocols, and Platforms, McGraw-Hill.
Kostas Pagiamtzis, Ali Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-State Circuits, vol. 41, no. 3, March 2006.
NetLogic MicroSystems Application Note, “Intradevice Configuration of Network Search Engines”.
NetLogic MicroSystems Application Note, “High Performance Layer 3 Forwarding”.
IDT White Paper, “Taking Packet Processing to the Next Level”.

Classification and Search Engines A classification engine receives streams of packets as its input. It continuously applies a set of application-specific sorting rules and policies to the packets and ends up compiling a series of new parallel packet streams in queues of packets. For classification, the NP must consult a memory bank, a lookup table, or even a database where the rules are stored. Search engines are used to consult a lookup table or a database, based on rules and policies, for the correct classification. Search engines are mostly based on associative memory, which is also known as CAM.

What is CAM? Content Addressable Memory is a special kind of memory. Read operation in traditional memory: The input is the address of the content we are interested in. The output is the content stored at that address. In CAM it is the reverse: The input is the content, that is, a key associated with something stored in the memory. The output is the location where the matching content is stored.
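The contrast between the two read operations can be sketched in a few lines of Python. This is only a software model: the table contents and the function names `ram_read` and `cam_search` are made up for illustration, and a real CAM compares all locations in parallel rather than in a loop.

```python
# Illustrative table: three stored words (MAC addresses chosen arbitrarily).
memory = ["00:1a:2b:3c:4d:5e", "00:1a:2b:aa:bb:cc", "00:1a:2b:11:22:33"]

def ram_read(address):
    """Traditional memory: address in, content out."""
    return memory[address]

def cam_search(word):
    """CAM: content in, address out (None when nothing matches).

    The loop stands in for the parallel comparison done in hardware.
    """
    for address, stored in enumerate(memory):
        if stored == word:
            return address
    return None

print(ram_read(1))                      # content stored at address 1
print(cam_search("00:1a:2b:aa:bb:cc"))  # address holding that content
```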

CAM for Routing Table Implementation CAM can be used as a search engine. We want to find matching contents in a database or Table. Example Routing Table Source: http://pagiamtzis.com/cam/camintro.html

Simplified CAM Block Diagram The input to the system is the search word. The search word is broadcast on the search lines. Each match line indicates whether there was a match between the search word and the stored word. The encoder specifies the match location. If there are multiple matches, a priority encoder selects the first match. The hit signal specifies whether there was any match at all. Search words are long, ranging from 36 to 144 bits. Table size ranges from a few hundred to 32K entries, so the address space is 7 to 15 bits. Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-State Circuits, March 2006

CAM Memory Size The largest available CAM is around 18 Mbit (single chip). Rule of thumb: The largest CAM chip is about half the size of the largest available SRAM chip, because a typical CAM cell consists of two SRAM cells. CAM size has been growing at an exponential rate. Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-State Circuits, March 2006

CAM Basics The search-data word is loaded into the search-data register. All match lines are precharged high (a temporary match state). Search-line drivers broadcast the search word onto the differential search lines. Each CAM core cell compares its stored bit against the bit on the corresponding search lines. The match lines of words that have at least one mismatching bit discharge to ground; the match lines of fully matching words stay high. Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-State Circuits, March 2006
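The search sequence can be modeled in software as a rough sketch. The function name `cam_search_cycle` and the 4-bit toy table are assumptions made for illustration; the loops stand in for comparisons that the hardware performs on all rows at once, and the priority encoder is assumed to pick the lowest matching address.

```python
def cam_search_cycle(stored_words, search_word):
    """Model one search cycle of a binary CAM (words are bit strings)."""
    # Step 1: precharge every match line high (temporary match state).
    match_lines = [True] * len(stored_words)

    # Step 2: broadcast the search word; any mismatching bit in a row
    # discharges that row's match line.
    for row, word in enumerate(stored_words):
        for stored_bit, search_bit in zip(word, search_word):
            if stored_bit != search_bit:
                match_lines[row] = False
                break

    # Step 3: the hit signal reports whether any line stayed high, and the
    # priority encoder (assumed here) reports the first matching address.
    hit = any(match_lines)
    address = match_lines.index(True) if hit else None
    return hit, address, match_lines

table = ["1010", "0110", "1010", "1111"]
print(cam_search_cycle(table, "1010"))
print(cam_search_cycle(table, "0000"))
```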

Types of CAMs Binary CAM (BCAM) stores only 0s and 1s. Applications: MAC table consultation. Layer 2 security-related VPN segregation. Ternary CAM (TCAM) stores 0s, 1s and don’t cares. Applications: cases where we need wildcards, such as layer 3 and 4 classification for QoS and CoS purposes. IP routing (longest prefix matching). Available sizes: 1Mb, 2Mb, 4.7Mb, 9.4Mb, and 18.8Mb. CAM entries are structured as multiples of 36 bits rather than 32 bits.
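A small software sketch of ternary matching may help. It assumes 8-bit toy keys, uses "X" for a don't-care bit, and relies on a common convention (not stated in the slides) that entries are stored with the longest prefix first so that the first match is also the longest prefix match. The names `ternary_match` and `tcam_lookup` and the port labels are hypothetical.

```python
def ternary_match(stored, key):
    """A stored bit matches if it is a don't-care ('X') or equals the key bit."""
    return all(s == "X" or s == k for s, k in zip(stored, key))

def tcam_lookup(entries, key):
    """Return (address, result) of the first matching entry, or None."""
    for address, (pattern, result) in enumerate(entries):
        if ternary_match(pattern, key):
            return address, result
    return None

# Assumed ordering: longest prefix first, so first match = longest match.
entries = [
    ("10110011", "port A"),   # 8-bit prefix (exact match)
    ("1011XXXX", "port B"),   # 4-bit prefix
    ("10XXXXXX", "port C"),   # 2-bit prefix
    ("XXXXXXXX", "default"),  # matches everything
]

print(tcam_lookup(entries, "10111100"))
print(tcam_lookup(entries, "10000000"))
```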

CAM Advantages They associate the input (comparand) with their memory contents in one clock cycle. They are configurable in multiple formats of search-data width and depth, which allows searches to be conducted in parallel. CAMs can be cascaded to increase the size of the lookup tables they can store. We can add new entries to their tables, so they can learn what they did not know before. They are one of the appropriate solutions for higher speeds.

CAM Disadvantages They cost several hundred dollars per CAM, even in large quantities. They occupy a relatively large footprint on a card. They consume excessive power. Generic system engineering problems: Interface with the network processor. Simultaneous table update and lookup requests.

CAM structure The comparand bus is 72 bytes wide and bidirectional. The result bus is output only. The command bus enables instructions to be loaded into the CAM. The CAM has 8 configurable banks of memory. The NPU issues a command to the CAM. The CAM then performs an exact match or uses wildcard characters to extract relevant information. There are two sets of mask registers inside the CAM.

CAM structure There are global mask registers, which can remove specific bits from the comparison, and a mask register that is present in each location of memory. The search result can be either a single output (the highest-priority match) or a burst of successive results. The output port is 24 bytes wide. Flag and control signals specify the status of the banks of the memory. They also enable us to cascade multiple chips.

CAM Features CAM cascading: We can cascade up to 8 devices without incurring a performance penalty in search time (72 bits x 512K). We can cascade up to 32 devices with performance degradation (72 bits x 2M). Terminology: Initializing the CAM: writing the table into the memory. Learning: updating specific table entries. Writing a search key to the CAM: search operation. Handling wider keys: Most CAMs support 72-bit keys. They can support wider keys in native hardware. Shorter keys: can be handled more efficiently at the system level.
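The cascading figures can be checked with a little arithmetic. The sketch below assumes that all cascaded devices have the same depth, which the slide does not state explicitly; under that assumption the per-device capacity follows from the 8-device total.

```python
WIDTH_BITS = 72
K = 1024

# 8 cascaded devices give 72 bits x 512K, so each device is assumed to
# hold 512K / 8 entries.
entries_per_device = 512 * K // 8
bits_per_device = entries_per_device * WIDTH_BITS

# 32 cascaded devices of the same depth.
entries_with_32 = 32 * entries_per_device

print(entries_per_device // K, "K entries per device")
print(bits_per_device, "bits per device")
print(entries_with_32 // K, "K entries with 32 devices")
```

Under this assumption each device holds 64K entries, or about 4.7 million bits, and 32 devices give 2048K = 2M entries, consistent with the figures on the slide.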

CAM Latency The clock rate is between 66 and 133 MHz. The clock speed determines the maximum search capacity. Factors affecting the search performance: Key size. Table size. For the system designer, the important figure is the total latency to retrieve data from the SRAM connected to the CAM. By using pipelining and multithreading techniques for resource allocation we can ease the CAM speed requirements. Source: IDT

Packet Search Speed Requirements Source: IDT article in CommsDesign: http://www.commsdesign.com/showArticle.jhtml?articleID=16501972

Management of Tables Inside a CAM It is important to squeeze as much information as we can into a CAM. Example from the NetLogic application notes: We want to store 4 tables of 32-bit-wide IP destination addresses. The CAM is 128 bits wide. If we store one address directly in every slot, 96 bits per slot are wasted. Instead, we can arrange the 32-bit-wide tables next to each other. Every 128-bit slot is partitioned into four 32-bit slots. These hold the 3rd, 2nd, 1st, and 0th tables, going from left to right. We use the global mask registers (MASK 3, MASK 2, MASK 1, MASK 0) to access only one of the tables at a time.
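The packing scheme can be sketched as follows. The mask polarity is an assumption (here a 1 bit means "compare this bit", which may be the opposite of the actual device convention), and the helper names `pack`, `global_mask`, and `search` as well as the addresses are hypothetical.

```python
FIELD_BITS = 32  # width of one table entry; four fields fill a 128-bit slot

def pack(t3, t2, t1, t0):
    """Build one 128-bit CAM word; table 3 is leftmost (most significant)."""
    return (t3 << 96) | (t2 << 64) | (t1 << 32) | t0

def global_mask(table):
    """Mask with 1s over the field of the selected table (assumed polarity)."""
    return 0xFFFFFFFF << (FIELD_BITS * table)

def search(cam, table, key):
    """Search only the selected table; other fields are masked out."""
    mask = global_mask(table)
    comparand = key << (FIELD_BITS * table)
    for address, word in enumerate(cam):
        if (word ^ comparand) & mask == 0:
            return address
    return None

cam = [
    pack(0x0A000001, 0xC0A80001, 0xAC100001, 0x08080808),
    pack(0x0A000002, 0xC0A80002, 0xAC100002, 0x01010101),
]

print(search(cam, 2, 0xC0A80002))  # found in table 2
print(search(cam, 3, 0xC0A80002))  # same key is not present in table 3
```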

Example Continued We can still use the per-entry mask register (not the global mask register) to do longest prefix matching.

Table Aggregation We can use tag bits to aggregate multiple tables in a single CAM. Example: We want to use a single CAM (NL85721) for IPv4 packet classification and forwarding. We want to filter packets based on other parameters such as VPN. Without tag bits we can get an undesired match when we do a classification: CAM word 0 does not match, but the destination address matches CAM word 1, which belongs to the other table. Source: http://www.netlogicmicro.com/pdf/ncs12_rev_0_8.pdf

Tag bits to avoid undesired matches Tag bits can be used to differentiate between tables. Tag bits should not be masked. For packet classification the tag bit is 0 and for packet forwarding it is 1. Source: http://www.netlogicmicro.com/pdf/ncs12_rev_0_8.pdf
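The effect of the tag bit can be shown with a toy model. The 8-bit keys, the "X" don't-care notation, and the function names below are assumptions for illustration only; the tag values follow the slide (0 for classification, 1 for forwarding).

```python
def ternary_match(stored, key):
    """A stored bit matches if it is a don't-care ('X') or equals the key bit."""
    return all(s == "X" or s == k for s, k in zip(stored, key))

def search(cam, key):
    """Return the address of the first matching CAM word, or None."""
    for address, pattern in enumerate(cam):
        if ternary_match(pattern, key):
            return address
    return None

CLASSIFY, FORWARD = "0", "1"  # tag values from the slide

# Without tags: word 0 is a classification rule, word 1 a forwarding entry.
untagged = ["1100XXXX", "1010XXXX"]
# With tags: the tag bit is prepended to each word and is never masked.
tagged = [CLASSIFY + "1100XXXX", FORWARD + "1010XXXX"]

key = "10101111"
print(search(untagged, key))           # classification search hits word 1
print(search(tagged, CLASSIFY + key))  # tag prevents the undesired match
print(search(tagged, FORWARD + key))   # forwarding search still finds word 1
```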

Vertically Oriented Table Aggregation We can use validity bits to support multiple tables with different number of entries. We need one validity bit for each table. When the validity bit in a slot is 1 the corresponding table has a valid entry. In the comparand register, only the validity bit of the table that is under search should be 1. Source: http://www.netlogicmicro.com/pdf/ncs12_rev_0_8.pdf

System Design Issues (multiple searches) For deep packet inspection, several searches must occur simultaneously. For example: MAC table, IP table, rules table, flow-management table. Question: Do we use 4 CAMs or just 1 CAM with 4 partitions? If we use only 1 CAM: Some tables are very large and some are small, so this approach wastes expensive partitions. If we use 4 CAMs: This approach suffers when the smaller tables do not justify separate CAMs. The overall cost also increases, since we have to use separate SRAMs too.

System Design Issues (shorter and longer search keys) We showed how we can implement 36-bit search tables in a 72-bit-wide CAM. This approach halves the speed, since we need to search twice for each key. Some CAMs are hardwired to support both 36- and 72-bit-wide search keys, but they are more expensive. For longer search keys there are two choices: We can use a double data rate (DDR) bus and load meaningful bits on both the rising and falling edges of the clock. We can double the clock frequency of the bus that loads the comparands.

System Design Issues (simultaneous update and search) A CAM location cannot be updated while a search is in progress. While we update, packets cannot be forwarded and they are backlogged. We can have a backup CAM that is updated while the search is done on the other CAM. Some designs offer a third port for table maintenance without inhibiting search operations (SiberCore is an example). This increases the pin count, the board real estate, and the number of signals to be routed on the board.

System Design Issues (CIDR table update) Recall that CIDR works based on the longest prefix match (LPM). CAM segments are created based on the prefix length. Some empty slots are left in each segment to accommodate new entries. If a segment is suddenly filled up, the table must be taken offline to reshuffle the entries. A read and write operation is needed for each entry that must be relocated. We may need a read and write for the mask word too. Source: http://www.netlogicmicro.com/pdf/cidr_white_paper.pdf
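To get a feel for how many entries must be relocated, here is a rough sketch. It assumes one particular reshuffling scheme that the slides do not spell out: a free slot is borrowed from the nearest segment that has one, and one entry is moved for every segment boundary crossed. The function name `moves_needed` is hypothetical.

```python
def moves_needed(free_slots, target):
    """Entries to relocate before inserting into segment `target`.

    free_slots[i] is the number of empty slots in segment i; segments are
    ordered by prefix length. Assumed scheme: borrow a slot from the nearest
    segment with free space, moving one entry per segment crossed.
    """
    if free_slots[target] > 0:
        return 0
    donors = [i for i, free in enumerate(free_slots) if free > 0]
    if not donors:
        raise ValueError("CAM is full")
    return min(abs(i - target) for i in donors)

# 32 segments (one per IPv4 prefix length); only the last one has space.
free_slots = [0] * 31 + [1]
print(moves_needed(free_slots, 0))   # insert at the far end of the table
print(moves_needed(free_slots, 31))  # insert into the segment with space
```

With 32 segments and a single free slot at the opposite end of the table, this model needs 31 moves for one insertion.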

CIDR table update: worst case analysis What is the worst case scenario? All segments but one are full. A new entry may need up to 31 move operations. Each move requires 4 clock cycles, for a total of 4 x 31 = 124 clock cycles. With 3000 routing updates per second, this gives 3000 x 124 = 372,000 clock cycles per second. If the NP clock rate is 100 MHz, the cycle time is 10 nsec. How much time do the updates consume? 372,000 cycles x 10 nsec per cycle = 3.72 msec in every second. At OC-192 rate, we have around 20 to 30 MPPS. Therefore, 74,400 to 111,600 packets per second will not be classified and must be discarded.
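The arithmetic above can be reproduced with a short script. All input figures come from the slide; the variable names and the helper `packets_lost` are just illustrative.

```python
MOVES_PER_INSERT = 31        # worst case: all segments but one are full
CYCLES_PER_MOVE = 4
UPDATES_PER_SECOND = 3000
CLOCK_HZ = 100_000_000       # 100 MHz, i.e. 10 nsec per cycle

cycles_per_update = MOVES_PER_INSERT * CYCLES_PER_MOVE
cycles_per_second = cycles_per_update * UPDATES_PER_SECOND
busy_seconds = cycles_per_second / CLOCK_HZ  # update time in each second

def packets_lost(packets_per_second):
    """Packets arriving while the table is busy being updated."""
    return cycles_per_second * packets_per_second // CLOCK_HZ

print(cycles_per_update, "cycles per update")
print(cycles_per_second, "cycles per second")
print(busy_seconds * 1000, "msec of update time per second")
print(packets_lost(20_000_000), "to", packets_lost(30_000_000), "packets lost")
```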

Reproaches against CAM based search engines (POWER) There is a misconception that the power consumption of CAMs keeps increasing. It does not make sense to compare the power consumption of a 2Mb CAM clocked at 66 MHz and capable of 66 Msps with that of a 9Mb CAM clocked at 150 MHz and capable of 125 Msps. Power consumption is the result of multiple factors, such as: Semiconductor manufacturing process. Number of searches per second. Storage density. The smaller the process, the larger the capacity; a smaller process also allows a lower supply voltage and a higher clock rate. A 0.18μ process consumes 50% less power than 0.25μ, and 0.15μ gives a further 30% improvement. The absolute power consumption is nevertheless increasing, because of: Larger tables. Wider search keys for deep packet classification. Increased wire speed. Make sure to consider worst case scenarios, not the data sheet values.

Reproaches against CAM based search engines Table maintenance and management is a software-related problem. The third port (Synchronous Maintenance Interface [SMI]) on SiberCore CAMs is an interesting way of doing table maintenance without affecting the ongoing search processes. Sort-free CAMs do not need partitioning. Density and footprint (not a real issue), example: The three members of the family, the CYNSE10512, 10256, and 10128, provide address tables of 512k, 256k, and 128k entries (18 Mbits, 9 Mbits, and 4.5 Mbits), respectively. All three devices are housed in 388-contact BGA packages. Price: $75, $135, $275. A 1,000,000-entry IPv4 table can be handled by two 18 Mbit CAMs.

Reproaches against CAM based search engines Inflexibility with table configurations: This is a real issue. Some applications need flexible table sizes and widths. More research and development is needed. Price: In absolute terms they are expensive. They are sophisticated, complex products that are indispensable in most designs, so it is not surprising that they are expensive.