EMOMA- Exact Match in One Memory Access

Slides:



Advertisements
Similar presentations
A Search Memory Substrate for High Throughput and Low Power Packet Processing Sangyeun Cho, Michel Hanna and Rami Melhem Dept. of Computer Science University.
Advertisements

A Scalable and Reconfigurable Search Memory Substrate for High Throughput Packet Processing Sangyeun Cho and Rami Melhem Dept. of Computer Science University.
Bio Michel Hanna M.S. in E.E., Cairo University, Egypt B.S. in E.E., Cairo University at Fayoum, Egypt Currently is a Ph.D. Student in Computer Engineering.
A Memory-Efficient Reconfigurable Aho-Corasick FSM Implementation for Intrusion Detection Systems Authors: Seongwook Youn and Dennis McLeod Presenter:
Hit or Miss ? !!!.  Cache RAM is high-speed memory (usually SRAM).  The Cache stores frequently requested data.  If the CPU needs data, it will check.
Fast Filter Updates for Packet Classification using TCAM Authors: Haoyu Song, Jonathan Turner. Publisher: GLOBECOM 2006, IEEE Present: Chen-Yu Lin Date:
Beyond Bloom Filters: From Approximate Membership Checks to Approximate State Machines By F. Bonomi et al. Presented by Kenny Cheng, Tonny Mak Yui Kuen.
An Efficient Hardware-based Multi-hash Scheme for High Speed IP Lookup Department of Computer Science and Information Engineering National Cheng Kung University,
METU Department of Computer Eng Ceng 302 Introduction to DBMS Disk Storage, Basic File Structures, and Hashing by Pinar Senkul resources: mostly froom.
Memory-Efficient Regular Expression Search Using State Merging Department of Computer Science and Information Engineering National Cheng Kung University,
1 Route Table Partitioning and Load Balancing for Parallel Searching with TCAMs Department of Computer Science and Information Engineering National Cheng.
Packet Classification using Rule Caching Author: Nitesh B. Guinde, Roberto Rojas-Cessa, Sotirios G. Ziavras Publisher: IISA, 2013 Fourth International.
Packet Classification Using Multi-Iteration RFC Author: Chun-Hui Tsai, Hung-Mao Chu, Pi-Chung Wang Publisher: COMPSACW, 2013 IEEE 37th Annual (Computer.
Author: Haoyu Song, Fang Hao, Murali Kodialam, T.V. Lakshman Publisher: IEEE INFOCOM 2009 Presenter: Chin-Chung Pan Date: 2009/12/09.
Leveraging Traffic Repetitions for High- Speed Deep Packet Inspection Author: Anat Bremler-Barr, Shimrit Tzur David, Yotam Harchol, David Hay Publisher:
A Hybrid IP Lookup Architecture with Fast Updates Author : Layong Luo, Gaogang Xie, Yingke Xie, Laurent Mathy, Kavé Salamatian Conference: IEEE INFOCOM,
File Structures. 2 Chapter - Objectives Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and.
1 Power-Efficient TCAM Partitioning for IP Lookups with Incremental Updates Author: Yeim-Kuan Chang Publisher: ICOIN 2005 Presenter: Po Ting Huang Date:
SwinTop: Optimizing Memory Efficiency of Packet Classification in Network Author: Chen, Chang; Cai, Liangwei; Xiang, Yang; Li, Jun Conference: Communication.
Research on TCAM-based OpenFlow Switch Author: Fei Long, Zhigang Sun, Ziwen Zhang, Hui Chen, Longgen Liao Conference: 2012 International Conference on.
Author: Haoyu Song, Murali Kodialam, Fang Hao and T.V. Lakshman Publisher/Conf. : IEEE International Conference on Network Protocols (ICNP), 2009 Speaker:
OpenFlow MPLS and the Open Source Label Switched Router Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan,
Updating Designed for Fast IP Lookup Author : Natasa Maksic, Zoran Chicha and Aleksandra Smiljani´c Conference: IEEE High Performance Switching and Routing.
A Fast Regular Expression Matching Engine for NIDS Applying Prediction Scheme Author: Lei Jiang, Qiong Dai, Qiu Tang, Jianlong Tan and Binxing Fang Publisher:
Fast Lookup for Dynamic Packet Filtering in FPGA REPORTER: HSUAN-JU LI 2014/09/18 Design and Diagnostics of Electronic Circuits & Systems, 17th International.
Cuckoo Filter: Practically Better Than Bloom Author: Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher Publisher: ACM CoNEXT 2014 Presenter:
Packet Classification Using Dynamically Generated Decision Trees
Hierarchical packet classification using a Bloom filter and rule-priority tries Source : Computer Communications Authors : A. G. Alagu Priya 、 Hyesook.
LOP_RE: Range Encoding for Low Power Packet Classification Author: Xin He, Jorgen Peddersen and Sri Parameswaran Conference : IEEE 34th Conference on Local.
Practical Multituple Packet Classification Using Dynamic Discrete Bit Selection Author: Baohua Yang, Fong J., Weirong Jiang, Yibo Xue, Jun Li Publisher:
IP Address Lookup Masoud Sabaei Assistant professor Computer Engineering and Information Technology Department, Amirkabir University of Technology.
LightFlow : Speeding Up GPU-based Flow Switching and Facilitating Maintenance of Flow Table Author : Nobutaka Matsumoto and Michiaki Hayashi Conference:
Scalable Multi-match Packet Classification Using TCAM and SRAM Author: Yu-Chieh Cheng, Pi-Chung Wang Publisher: IEEE Transactions on Computers (2015) Presenter:
Fundamental Structures of Computer Science II
Reorganized and Compact DFA for Efficient Regular Expression Matching
Author: Heeyeol Yu; Mahapatra, R.; Publisher: IEEE INFOCOM 2008
Chapter 2 Memory and process management
Updating SF-Tree Speaker: Ho Wai Shing.
A DFA with Extended Character-Set for Fast Deep Packet Inspection
Dynamic Hashing (Chapter 12)
2018/6/26 An Energy-efficient TCAM-based Packet Classification with Decision-tree Mapping Author: Zhao Ruan, Xianfeng Li , Wenjun Li Publisher: 2013.
The Variable-Increment Counting Bloom Filter
Hashing CENG 351.
Database Applications (15-415) DBMS Internals- Part III Lecture 15, March 11, 2018 Mohammad Hammoud.
Statistical Optimal Hash-based Longest Prefix Match
SigMatch Fast and Scalable Multi-Pattern Matching
Parallel Processing Priority Trie-based IP Lookup Approach
Scalable Memory-Less Architecture for String Matching With FPGAs
Binary Prefix Search Author: Yeim-Kuan Chang
2019/1/3 Exscind: Fast Pattern Matching for Intrusion Detection Using Exclusion and Inclusion Filters Next Generation Web Services Practices (NWeSP) 2011.
Memory-Efficient Regular Expression Search Using State Merging
Virtual TCAM for Data Center Switches
Packet Classification Using Coarse-Grained Tuple Spaces
A Small and Fast IP Forwarding Table Using Hashing
Scalable Multi-Match Packet Classification Using TCAM and SRAM
A New String Matching Algorithm Based on Logical Indexing
Module 12a: Dynamic Hashing
2019/5/2 Using Path Label Routing in Wide Area Software-Defined Networks with OpenFlow ICNP = International Conference on Network Protocols Presenter:Hung-Yen.
Power-efficient range-match-based packet classification on FPGA
Authors: A. Rasmussen, A. Kragelund, M. Berger, H. Wessing, S. Ruepp
A Hybrid IP Lookup Architecture with Fast Updates
High Performance Pattern Matching using Bloom–Bloomier Filter
A SRAM-based Architecture for Trie-based IP Lookup Using FPGA
Authors: Ding-Yuan Lee, Ching-Che Wang, An-Yeu Wu Publisher: 2019 VLSI
2019/10/19 Efficient Software Packet Processing on Heterogeneous and Asymmetric Hardware Architectures Author: Eva Papadogiannaki, Lazaros Koromilas, Giorgos.
MEET-IP Memory and Energy Efficient TCAM-based IP Lookup
Towards TCAM-based Scalable Virtual Routers
Packet Classification Using Binary Content Addressable Memory
2019/11/12 Efficient Measurement on Programmable Switches Using Probabilistic Recirculation Presenter:Hung-Yen Wang Authors:Ran Ben Basat, Xiaoqi Chen,
Presentation transcript:

EMOMA- Exact Match in One Memory Access 2019/4/20 EMOMA- Exact Match in One Memory Access Author: Salvatore Pontarelli, Pedro Reviriego, Michael Mitzenmacher Publisher/Conf.: DOI 10.1109/TKDE.2018.2818716, IEEE Transactions on Knowledge and Data Engineering Presenter: 林鈺航 Date: 2018/10/03 Department of Computer Science and Information Engineering National Cheng Kung University, Taiwan R.O.C. 1 CSIE CIAL Lab

Introduction Both CAM and TCAM are costly in terms of circuit area and power and therefore alternative solutions based on hashing techniques using standard memories are widely used. In addition to reducing the circuit complexity and power consumption, the use of hash based techniques provides additional flexibility that is beneficial to support programmability in software defined networks. Some tables used for packet processing are necessarily large and need to be stored in the external memory, limiting the speed of packet processing. National Cheng Kung University CSIE Computer & Internet Architecture Lab

Introduction When a single external memory is used, the time needed to complete a lookup depends on the number of external memory accesses. If we use a Bloom filter for each possible choice of hash function to track which elements have been placed by each hash function, a location in the external memory need only to be accessed when the corresponding Bloom filter returns that the key can be found in that location. EMOMA uses the internal memory to store an approximate compressed information about the entries stored in the external memory to reduce the number of external memory accesses . National Cheng Kung University CSIE Computer & Internet Architecture Lab

Cuckoo Hashing A cuckoo hash table uses a set of 𝑑 hash functions to access a table composed of buckets, each of which can store one or more entries. A given element 𝑥 is placed in one of the buckets ℎ 1 (𝑥), ℎ 2 (𝑥), . . . , ℎ 𝑑 (𝑥)in the table. Search: The buckets ℎ 𝑖 (𝑥), are accessed and the entries stored there are compared with 𝑥; if 𝑥 is found a match is returned. Insertion: The element 𝑥 is inserted in one of the 𝑑 buckets. If all the buckets are initially full, an element 𝑦 in one of the buckets is displaced to make room for 𝑥 and recursively inserted. Removal: The element is searched for, and if found it is removed. National Cheng Kung University CSIE Computer & Internet Architecture Lab

Counting Block Bloom Filters Counting Bloom filters use a counter in each position, the counters associated with the positions given by the k hash functions are incremented during insertion and decremented during removal. Block denotes dividing the table in blocks and using first a block selection hash function ℎ 1 (𝑥) to select a block and then a set of k hash functions 𝑔 1 (𝑥), 𝑔 2 (𝑥), . . . 𝑔 𝑘 (𝑥), to select k positions within that block . block 0 ℎ 1 𝐴 =0 Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 ℎ 1 𝐵 =0 National Cheng Kung University CSIE Computer & Internet Architecture Lab

DESCRIPTION OF EMOMA EMOMA is a dictionary data structure that keeps key-value pairs (𝑥, 𝑣 𝑥 ); the structure can be queried to determine the value 𝑣 𝑥 for a resident key 𝑥 (or it returns a null value if 𝑥 is not a stored key), and allows for the insertion and deletion of key-value pairs. A counting block Bloom filter (CBBF) is used to determine the hash function that needs to be used to search for an element. As long as the response of CBBF is always correct (no false positive case), all searches require exactly one access to the external memory. National Cheng Kung University CSIE Computer & Internet Architecture Lab

Structures A CBBF that tracks all elements in the cuckoo table, the associated Bloom filter is stored on-chip and the counters are stored off-chip. A cuckoo hash table to store the elements and associated values, the table is stored off-chip and accessed using two hash functions ℎ 1 (𝑥) and ℎ 2 (𝑥). The same ℎ 1 (𝑥) used in CBBF block selection means that when inserting an element 𝑦 on the CBBF, the only other entries stored in the table that can produce a false positive in the CBBF are also in bucket ℎ 1 𝑦 . single table National Cheng Kung University CSIE Computer & Internet Architecture Lab

Structures A small stash used to store elements and their values that are pending insertion or that have failed insertion. In double table, to have the same number of buckets, the size of each of the tables should be half that of the single table, and the CBBF has also half the number of blocks, but to maintain the same amount of on-chip memory, the size of the block in the CBBF is double that of the single-table case. double table National Cheng Kung University CSIE Computer & Internet Architecture Lab

Search The element is compared with the elements in the stash. On a match, the value 𝑣 𝑥 associated with that entry is returned. Otherwise, the CBBF is checked by accessing position ℎ 1 (𝑥) and checking if the bits given by 𝑔 1 (𝑥) , 𝑔 𝑘 (𝑥) , . . . 𝑔 𝑘 (𝑥) are all set to one (a positive response) or not (a negative response). On a negative response, we read bucket ℎ 1 (𝑥) in the hash table and 𝑥 is compared with the elements stored there. On a match, the value 𝑣 𝑥 associated with that entry is returned, and otherwise we return a null value. National Cheng Kung University CSIE Computer & Internet Architecture Lab

Insertion New element x The insertion algorithm will perform up to t iterations, were in each iteration an element from the stash is attempted to be placed. Step 1 Place 𝑥 in the stash Step 2 Select bucket ℎ 1 or ℎ 2 Step 3 Randomly select new element Select cell in the bucket yes Insert 𝑥 and update data structures Step 4 Iterations≤t no Element in the Stash? Step 5 yes no End National Cheng Kung University CSIE Computer & Internet Architecture Lab

Step 2 (1/2) Are there empty cells in ℎ 1 (𝑥) and ℎ 2 (𝑥) ? Select bucket ℎ 1 or ℎ 2 Are there empty cells in ℎ 1 (𝑥) and ℎ 2 (𝑥) ? Is the element 𝑥 being inserted a false positive on the CBBF? Does inserting 𝑥 in the CBBF create false positives for elements stored in bucket ℎ 1 (𝑥) ? National Cheng Kung University CSIE Computer & Internet Architecture Lab

Step 2 (2/2) Are there empty cells in ℎ 1 (𝑥) and ℎ 2 (𝑥) ? Select bucket ℎ 1 or ℎ 2 Are there empty cells in ℎ 1 (𝑥) and ℎ 2 (𝑥) ? Is the element 𝑥 being inserted a false positive on the CBBF? Does inserting 𝑥 in the CBBF create false positives for elements stored in bucket ℎ 1 (𝑥) ? National Cheng Kung University CSIE Computer & Internet Architecture Lab

Select cell in the bucket Step 3 Select cell in the bucket If there are empty cells in the bucket, select one of them randomly. If all cells are occupied the selection is done among elements that are not locked as follows: with probability P select randomly among elements that create the fewest locked elements when moved (elements inserted with ℎ 2 will never create false positives). National Cheng Kung University CSIE Computer & Internet Architecture Lab

Step 4 Insert 𝑥 and update data structures Put element in the stash and update CBBF if it was inserted using ℎ 2 , and may unlock elements that are no longer false positives on the CBBF. The number of iterations is increased by one before proceeding to the next step. Element in cell? no yes Put element in the stash and update CBBF 𝑥 creates FPs no yes Place FPs on the stash Insert x in CBBF if bucket is ℎ 2 Put x in cell and remove x from the stash Increase iterations by one National Cheng Kung University CSIE Computer & Internet Architecture Lab

About Insertion The number of iterations affects the time for an insertion process as well as the size of the stash that is needed. Generally, the more iterations, the longer an insertion can take, but the smaller a stash required. Elements may be left in the stash at the end of the insertion process. If the stash ever fails to have enough room for elements that have not been placed, the data structure fails. National Cheng Kung University CSIE Computer & Internet Architecture Lab

Deletion If the element is found it is removed from the table, and otherwise a response indicating the element is not in the table can be returned. If any counter reaches zero, the corresponding bit in the bit (Bloom filter) representation of the CBBF is cleared. National Cheng Kung University CSIE Computer & Internet Architecture Lab

Differences of EMOMA versus a standard cuckoo hash Elements that are false positives in the CBBF are “locked” and can only be inserted in the cuckoo hash table using the second hash function. This reduces the number of options to perform movements in the table. Insertions in the cuckoo hash table using the second hash function can create new false positives for the elements in bucket ℎ 1 (𝑥) that require additional movements. Those elements have to be placed in the stash and re-inserted into the second table. This means that, in contrast to standard cuckoo hashing, the stash occupancy can grow during an insertion. Therefore, the stash needs to be dimensioned to accommodate those elements in addition to the elements that have been unable to terminate insertion. 除了無法終止插入的元素之外,還需要確定存儲的大小以容納這些元素。 National Cheng Kung University CSIE Computer & Internet Architecture Lab

Evaluation of EMOMA Since all search operations are completed in one memory access, the main performance metrics for EMOMA are the memory occupancy that can be achieved before the data structure fails (by overflowing the stash on an insertion) and the average insertion time of an element. We present simulations showing the behavior of the stash with respect to the k and P parameters, then present simulations to evaluate the stash occupancy when the EMOMA structure works at a high load (95%) under dynamic conditions (repeated insertion and removal of elements). National Cheng Kung University CSIE Computer & Internet Architecture Lab

Parameter selection(1/3) Table that can hold 32K elements was used, the number of bit selection hash functions k used in the CBBF was set to four, and four bits per element were used for the CBBF while P varied from 0 to 1. The maximum number of iterations for each insertion t is set to 100. The result shows that in most cases it is beneficial to move elements that create the least number of false positives but a purely greedy strategy can lead to unfortunate behaviors. Recall P:select randomly among elements that create the fewest locked elements when moved (elements inserted with ℎ 2 will never create false positives). National Cheng Kung University CSIE Computer & Internet Architecture Lab

Parameter selection(2/3) We set P = 0.99 and we varied k from 1 to 8, the best values were k = 3, 4 when the double-table implementation is used and k = 3 when a single table is used. National Cheng Kung University CSIE Computer & Internet Architecture Lab

Parameter selection(3/3) In all cases, the maximum stash size observed is fairly small, the results suggest that the single-table option is better, especially for large table sizes. Maximum stash size Single:9, 14, 16 Double:9, 18, 33 National Cheng Kung University CSIE Computer & Internet Architecture Lab

Dynamic behavior at maximum load We load the hash table to 95% memory occupancy, and then perform 16M replacement operations. The replacement first randomly selects an element in the EMOMA structure and removes it, then randomly creates a new entry and inserts it. Each data point gives the maximum stash occupancy observed over the 10 trials over the last 1M trials. The experiments show that both implementations reliably maintain a stable stash size under repeated insertions and removals, and the single-table implementation provides better performance than the double-table. National Cheng Kung University CSIE Computer & Internet Architecture Lab

Insertion Time The average insertion time depends both on t, the maximum number of iterations allowed for a single insertion, and on the load of the EMOMA structure. The average insertion time increases when the load increases to a point where the table is almost full. insert yes Element in the Stash? Iterations≤t yes no no End Larger t allows for smaller stash sizes, as fewer elements are placed in the stash because they have run out of time when being inserted, but the corresponding drawback is an increase in the average insertion time.

Insertion Time We report the maximum stash occupancy observed over 100 trials at maximum load, for t = 50, 100, 500, and 1000, and for a table size of 8M elements. Higher values of t allow a smaller stash, and with the same value of t, the single-table configuration requires fewer elements in the stash than the double-table configuration. National Cheng Kung University CSIE Computer & Internet Architecture Lab

HARDWARE FEASIBILITY We leverage the reference design available for the SUME NetFPGA to implement our scheme. The reference design contains a MicroBlaze (the Xilinx 32-bit soft-core RISC microprocessor) that is used to control the blocks implemented in the FPGA the using the AXI-Lite bus. The microprocessor can be used to perform the insertion procedures of the EMOMA scheme, writing the necessary values in the CBBF, in the stash, and in the external memories. National Cheng Kung University CSIE Computer & Internet Architecture Lab

HARDWARE FEASIBILITY Our goal here is to determine the hardware resources that would be used by an EMOMA scheme for query operations. We select a key of 64 bits and an associated value of 64 bits. Therefore, each bucket of four cells has 512 bits. A bucket can be read in one memory access as a DRAM burst access provides precisely 512 bits. The main table has 524288 (512K) buckets of 512 bits requiring in total 256Mb of memory. The stash is realized implementing on the FPGA a 64x64 bits Content Addressable Memory with the write port connected to the AXI bus and the read port used to perform the query operations. 128*4 National Cheng Kung University CSIE Computer & Internet Architecture Lab

HARDWARE FEASIBILITY For the CBBF we used k = 4 hash functions and a memory size of 524288 (512K) words of 16 bits. The memory of the CBBF uses two ports: the write port is connected to the AXI bus and the read port is used for the query operations. National Cheng Kung University CSIE Computer & Internet Architecture Lab