Measurement Algorithms: Bloom Filters and Beyond George Varghese University of California, San Diego.

Presentation transcript:

Measurement Algorithms: Bloom Filters and Beyond George Varghese University of California, San Diego

Network Evolution?
1. Basic: stateless, transparent. Tools: protocol design (e.g., soft state).
2. Active: customizable, reconfigurable. Tools: code safety (e.g., sandboxing).
3. Introspective: pattern detection and response. Tools: streaming algorithms and statistical inference (e.g., Bloom filters, sampling).
Hawkeye enables introspection for measurement and security.

What is Introspection?
Detecting patterns in data traffic, either in real time or based on packet logs. Examples:
- Measurement introspection: identify resource usage patterns for better resource management.
- Security introspection: identify attack patterns to mitigate or prevent attacks.
- Fault introspection: identify fault or anomaly patterns to allow automated fault repair.
Motivated by market pull and technology push.

Market Pull
Better ROI: Optimize network resources (BGP policy, OSPF weights, lighting up fibers, adding bandwidth) based on resource usage patterns.
Better security: Allowing an organization to stay open for business during mass or targeted attacks is a major differentiator.
Better fault detection: Many performance anomalies can be detected by better measurement primitives (e.g., Goldman-Sachs).
[Figure: Customer Sites 1, 2, and 3, annotated "reroute or add B/W"]

Technology Push: Streaming Algorithms and Hardware Gates
Algorithms: Recent major thrust in streaming algorithms in databases, web analysis, theory, and networks.
Hardware: Memory accesses remain expensive (< 100) and SRAM is not scaling as fast as the number of connections (< 32 Mbits), but gates are plentiful.
Mapping: Many randomized streaming algorithms (e.g., Bloom filters, min-wise hashing) developed to find patterns in disk logs map well to network ASICs.
Opportunity: Invent or adapt streaming algorithms for networking patterns.

Concerns about Network Introspection
Speed: Can hardware run fast enough?
- Recall IP lookups in the 1990s; surprisingly complex things (branch predictors, TCP offload) are done routinely today.
- Most of the algorithms described below are being implemented at 24 Gbps in Hawkeye.
Inflexible: Hardware is not easy to change.
- Design hardware to identify useful "primitive" patterns that can be combined (exactly what Hawkeye does).
- Network processors can offer flexibility and speed.
End-to-end argument: Not a simple, stateless core.
- Not required for correctness of basic forwarding, but only as an optimization or value-add.

Introspection as Pattern Detection
Within-packet patterns: prefix matches, classification, signature detection (e.g., Code Red payload).
Across-packet patterns: scheduling, timing, membership checks, heavy-hitters, large flows, partial completion, counting flows.
[Figure: packets from sources S1, S2, S5, S2, S1 passing through a router]

Pattern Detection Algorithm Requirements
Low memory: On-chip SRAM is limited to tens of Mbits (< 32 Mbits per the Technology Push slide). The limit is not constant, but it is not scaling with the number of concurrent conversations. May need to replicate.
Small processing: For wire speed at 40 Gbps with 40-byte packets, there are 8 nsec per packet. With 1 nsec SRAM, that is 8 memory accesses. A factor of 30 in parallelism buys 240 accesses.
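
The per-packet budget on this slide is simple arithmetic. The sketch below reproduces it; the function name `per_packet_budget_ns` is only illustrative, and the inputs are the figures quoted on the slide.

```python
def per_packet_budget_ns(link_gbps: float, packet_bytes: int) -> float:
    """Time between back-to-back packets at wire speed, in nanoseconds."""
    # 1 Gbit/s is 1 bit per nanosecond, so bits / Gbps gives nanoseconds.
    return packet_bytes * 8 / link_gbps


budget_ns = per_packet_budget_ns(link_gbps=40, packet_bytes=40)
sram_access_ns = 1.0  # SRAM access time quoted on the slide
parallelism = 30      # parallelism factor quoted on the slide

serial_accesses = budget_ns / sram_access_ns
parallel_accesses = serial_accesses * parallelism
print(budget_ns, serial_accesses, parallel_accesses)
```

Changing `link_gbps` or `packet_bytes` shows how quickly the access budget shrinks on faster links or with smaller packets.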

Talk Outline
Part 1: Motivation
Part 2: Basic Patterns and Algorithms (membership checks, heavy-hitters, many flows, partial completion)
Part 3: Combining patterns to solve useful application problems
Part 4: Conclusions

Pattern 1: Membership Check
Membership check: In a measurement interval (e.g., 10 minutes), detect the flows (e.g., sources) on a link that belong to a pre-specified set (e.g., a blacklist).
[Figure: packet sequence S1, S6, S2, S5, S2, S8; the set contains only S2 and S5]
B. Bloom, Comm. ACM, July 1970

Membership Check via Bloom Filter
[Figure: Field Extraction feeds Hash 1, Hash 2, and Hash 3, which index Stage 1, Stage 2, and Stage 3 of a bitmap (operation label: Set). Each stage checks "Equal to 1?"; ALERT if all bits are set.]
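
The staged membership check in this diagram can be sketched in software as follows. This is a minimal illustration, not the Hawkeye hardware design: the class name `StagedBloomFilter`, the use of SHA-256 with a stage prefix as the per-stage hash, and the sizes are all assumptions made for the example.

```python
import hashlib


class StagedBloomFilter:
    """Bloom filter with one bitmap per stage, as in the diagram."""

    def __init__(self, stages: int, bits_per_stage: int) -> None:
        self.stages = stages
        self.bits_per_stage = bits_per_stage
        # One byte per bit keeps the sketch readable; hardware would pack bits.
        self.bitmaps = [bytearray(bits_per_stage) for _ in range(stages)]

    def _index(self, stage: int, key: bytes) -> int:
        # Prefixing the stage number gives each stage its own hash function.
        digest = hashlib.sha256(bytes([stage]) + key).digest()
        return int.from_bytes(digest[:8], "big") % self.bits_per_stage

    def add(self, key: bytes) -> None:
        for stage in range(self.stages):
            self.bitmaps[stage][self._index(stage, key)] = 1

    def __contains__(self, key: bytes) -> bool:
        # ALERT only if the indexed bit is set in every stage.
        return all(
            self.bitmaps[stage][self._index(stage, key)]
            for stage in range(self.stages)
        )


blacklist = StagedBloomFilter(stages=3, bits_per_stage=10_000)
for member in (b"S2", b"S5"):
    blacklist.add(member)

stream = [b"S1", b"S6", b"S2", b"S5", b"S2", b"S8"]
alerts = [src for src in stream if src in blacklist]
print(alerts)
```

Members of the set always raise an alert; a non-member raises one only if it collides with set bits in every stage, which is the false-positive case the analysis slides bound.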

Trivial Bloom Filter Analysis
Assume a set of size 1000. Bound the probability that a flow F not in the set gets through 6 stages of 10,000 bits each.
Why trouble?: F can pass a stage if it hashes to a bit set by some real member of the set.
Single-stage probability: At most 1000 of the 10,000 buckets can have set bits. Thus the probability of F passing a stage is at most 1000/10,000 = 0.1.
Multistage probability: To be branded, F must be unlucky in all 6 stages, with a probability of no more than 0.1^6 = 10^-6, which is very small.
Can play with numbers.

Accurate Bloom Filter Analysis
Assume a set of size 1000. Bound the probability that a flow F not in the set gets through 6 stages of 10,000 bits each.
The previous analysis ignores bit collisions.
Single-stage probability: The probability of F passing a stage is s = 1 - (1 - 1/10,000)^1000 ≈ 1 - e^{-0.1}.
Multistage probability: To be branded, F must be unlucky in all 6 stages, with a probability of no more than s^6, which is very small.
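
The two analyses can be compared numerically. The sketch below assumes the figures that appear in the slides (1000 set members, 10,000 bits per stage, 6 stages); the variable names are illustrative.

```python
import math

set_size = 1000          # members inserted into the filter
bits_per_stage = 10_000  # bits in each stage
stages = 6               # number of stages a non-member must pass

# Trivial bound: at most set_size bits can be set in a stage.
trivial_stage = set_size / bits_per_stage

# Accurate value: accounts for members colliding on the same bit.
exact_stage = 1 - (1 - 1 / bits_per_stage) ** set_size
approx_stage = 1 - math.exp(-set_size / bits_per_stage)

trivial_all = trivial_stage ** stages
exact_all = exact_stage ** stages

print(f"per stage: trivial={trivial_stage:.4f} exact={exact_stage:.4f}")
print(f"all stages: trivial={trivial_all:.2e} exact={exact_all:.2e}")
```

The accurate per-stage probability is slightly below the trivial bound because some of the 1000 insertions land on bits that are already set.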

Applications
Replacement for a hash table: useful when storage is important, identifiers are long, false positives are acceptable, and a membership check suffices.
Example 1: String matching: exact matching against up to 4000 strings of 40 bytes each using only on-chip SRAM.
Example 3: Reporting

Example 1: String Matching
[Figure: a string database to block (ST0, ST1, ST2, ..., STn) with anchor strings (A0, A1, A2, ..., An) passed through a hash function into a multistage filter]
Sushil Singh, G. Varghese, J. Huber, Sumeet Singh, Patent Application

String Matching Continued: String Grouping
[Figure: anchors A0, A1, A2, ..., An and strings ST0, ST1, ST2, ..., STn mapped by a hash function into Hash Bucket-0, Hash Bucket-1, ..., Hash Bucket-m]

String Matching Continued: Bit Trees
[Figure: strings in a single hash bucket (anchors A2, A8, A11, A17 with strings ST2, ST8, ST11, ST17) arranged in a tree whose nodes are labeled LOC L1, LOC L2, and LOC L3]

Example 3: Scalable Reporting (Carousel, NSDI 2010)
Problem: When a worm breaks out, how do we report all infected machines? Logging packets with the pattern can result in millions of sources and many duplicates.
Solution: Use a sampled Bloom filter and a "more" bit. Start with no sampling. Any source IP in a worm packet is reported and placed in a Bloom filter of size B to suppress duplicates. Stop when B sources are reported and set the "more" bit.
Recursive solution: If the "more" bit is set, repeat the algorithm twice, once for LSB of Hash(SourceIP) = 0 and once for 1. If there are still more, repeat it four times. Nearly optimal solution.

Timed Bloom Filters
Question: How can we add a notion of time to Bloom filters without lots of memory?
Solution: Use 2 Bloom filters, Old and New.
- Insert: insert into New.
- Search: search in both New and Old.
- Age every T seconds: Old := New; New := Empty.
Property: Any entry not refreshed for 2T is deleted. An entry refreshed within T is in.
U.S. Patent Application: Paul Owen & Andy Fingerhut et al.
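
The Old/New scheme can be sketched as below. This is an illustrative software model only: `TinyBloom` and `TimedBloomFilter` are hypothetical names, SHA-256 stands in for the hardware hash functions, and the caller is assumed to invoke `age()` every T seconds.

```python
import hashlib


class TinyBloom:
    """Small single-bitmap Bloom filter used as a building block."""

    def __init__(self, bits: int = 1024, hashes: int = 3) -> None:
        self.bits, self.hashes = bits, hashes
        self.bitmap = bytearray(bits)  # one byte per bit, for readability

    def _indexes(self, key: bytes):
        for i in range(self.hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key: bytes) -> None:
        for i in self._indexes(key):
            self.bitmap[i] = 1

    def __contains__(self, key: bytes) -> bool:
        return all(self.bitmap[i] for i in self._indexes(key))


class TimedBloomFilter:
    """Two Bloom filters, Old and New, rotated once per aging period T."""

    def __init__(self, bits: int = 1024, hashes: int = 3) -> None:
        self._bits, self._hashes = bits, hashes
        self.new = TinyBloom(bits, hashes)
        self.old = TinyBloom(bits, hashes)

    def insert(self, key: bytes) -> None:
        self.new.add(key)  # Insert: into New only

    def search(self, key: bytes) -> bool:
        return key in self.new or key in self.old  # Search: New and Old

    def age(self) -> None:
        # Old := New; New := Empty
        self.old = self.new
        self.new = TinyBloom(self._bits, self._hashes)


tbf = TimedBloomFilter()
tbf.insert(b"10.0.0.1")
tbf.age()
still_present = tbf.search(b"10.0.0.1")  # survived one aging step
tbf.age()
expired = not tbf.search(b"10.0.0.1")    # gone after two aging steps
print(still_present, expired)
```

An entry survives at least one aging step and is dropped after two without a refresh, which matches the 2T property on the slide.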

Pattern 2a: Heavy-hitters with Threshold
Heavy-hitters: In a measurement interval (e.g., 10 minutes), detect the flows (e.g., sources) that send more than a threshold T (say 1% of the traffic) on a link.
[Figure: packet sequence S1, S6, S2, S5, S2; source S2 is 30 percent of the traffic sequence]
Estan, Varghese, ACM TOCS 2003

Heavy Hitters with Multistage Filters
[Figure: Field Extraction feeds Hash 1, Hash 2, and Hash 3, which index counters in Stage 1, Stage 2, and Stage 3 (operation label: Increment). Each stage checks "Equal to T?"; ALERT if all counters are above threshold T.]
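
A software model of the multistage filter in this diagram might look like the following. It is a sketch under assumptions: the class name `MultistageFilter`, the SHA-256 based per-stage hash, and the example sizes and threshold are illustrative, and a counter that reaches the threshold is treated as passing.

```python
import hashlib


class MultistageFilter:
    """Parallel stages of counters; a flow is flagged if all its counters reach T."""

    def __init__(self, stages: int, buckets: int, threshold: int) -> None:
        self.buckets = buckets
        self.threshold = threshold
        self.counters = [[0] * buckets for _ in range(stages)]

    def _index(self, stage: int, flow: bytes) -> int:
        # A stage prefix gives each stage an independent-looking hash.
        digest = hashlib.sha256(bytes([stage]) + flow).digest()
        return int.from_bytes(digest[:8], "big") % self.buckets

    def update(self, flow: bytes, size: int) -> bool:
        """Add `size` bytes for `flow`; return True if every stage is at threshold."""
        passed = True
        for stage, row in enumerate(self.counters):
            i = self._index(stage, flow)
            row[i] += size
            if row[i] < self.threshold:
                passed = False
        return passed


msf = MultistageFilter(stages=3, buckets=1000, threshold=300)
packets = [(b"S1", 100), (b"S6", 100), (b"S2", 150), (b"S5", 100), (b"S2", 150)]

heavy = set()
for flow, size in packets:
    if msf.update(flow, size):
        heavy.add(flow)
print(heavy)
```

A flow that really sends at least the threshold is always flagged, since each of its counters holds at least its own traffic. Small flows are flagged only if they share hot buckets in every stage.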

Multistage Filters in Action
[Figure: counters in Stage 1, Stage 2, and Stage 3 against a threshold line. Legend: grey = other flows, yellow = small flow, green = large flow.]

Multistage Filter Analysis
Assume a 1 percent threshold. Bound the probability that a flow F of 0.1% or less gets through 6 stages of 1000 buckets each.
Why trouble?: F can fall into a "hot" bucket if and only if the sum of the traffic of all other flows in that bucket is more than 0.9%.
Single-stage probability: At most 100/0.9 = 111 buckets can be over 0.9% before we bring on F. Thus the probability that F falls in a "hot" bucket is less than 111/1000 = 0.111.
Multistage probability: To be branded, F must be unlucky in all 6 stages, with a probability of no more than 0.111^6 ≈ 1.9 × 10^-6, which is very small. Thus at most 1000 false positives with very high probability.
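
The bound on this slide can be reproduced step by step. The sketch uses only the numbers stated on the slide (1% threshold, a 0.1% flow, 6 stages of 1000 buckets); the variable names are illustrative.

```python
threshold_pct = 1.0  # threshold T as a percentage of link traffic
flow_pct = 0.1       # size of the small flow F, as a percentage
buckets = 1000       # buckets per stage
stages = 6           # number of stages

# Traffic the other flows must contribute for F's bucket to be "hot".
other_traffic_pct = threshold_pct - flow_pct

# At most this many buckets can hold that much traffic (total is 100%).
max_hot_buckets = 100 / other_traffic_pct

# Chance that F lands in a hot bucket in one stage, then in all stages.
p_stage = max_hot_buckets / buckets
p_all = p_stage ** stages

print(int(max_hot_buckets), round(p_stage, 3), f"{p_all:.2e}")
```

Raising the number of stages shrinks the false-positive bound geometrically, while adding buckets shrinks the per-stage term.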

Pattern 2b: Top K Heavy-hitters
Heavy-hitters: In a measurement interval (e.g., 10 minutes), detect the flows (e.g., sources) that are the top K talkers on a link.
[Figure: packet sequence S1, S2, S5, S2; sources S2 and S1 are the top K talkers]
Bonomi, Prabhakar, Zhang, Wu, Cisco Internal

Two Simpler Proposals
SIFT (Prabhakar) and Sample-and-Hold (Estan-Varghese) both suggest sampling a packet with small probability p. Once a flow is sampled, place it in a CAM and watch all of its packets.
Idea: Large flows are more likely to be sampled, and then we get exact counts.
Problem: The CAM quickly gets muddied with mice, and then elephants can be lost.
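
The sampling idea described on this slide can be sketched with a dictionary standing in for the CAM. This is a simplified model that samples per packet with probability p, as the slide states; the function name `sample_and_hold` and the unbounded table are assumptions, so the sketch does not model the CAM filling up.

```python
import random


def sample_and_hold(packets, p, rng=None):
    """Return byte counts for flows that were sampled and then tracked."""
    rng = rng or random.Random()
    held = {}  # stands in for the CAM: flow -> bytes counted since sampling
    for flow, size in packets:
        if flow in held:
            held[flow] += size  # already held: count every later packet
        elif rng.random() < p:
            held[flow] = size   # sampled: start tracking this flow
    return held


packets = [("S1", 100), ("S2", 1500), ("S2", 1500), ("S5", 40), ("S2", 1500)]
counts = sample_and_hold(packets, p=0.5, rng=random.Random(7))
print(counts)
```

Flows with many packets get many chances to be sampled, so elephants are likely to enter the table early. Mice that happen to be sampled also occupy entries, which is the problem the slide notes.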

Elephant Traps: Sample and Recycle

Pattern 3: Partial Completion
Partial completion: In a measurement interval, detect the flows (e.g., destinations) that have several start packets (e.g., SYN) without the corresponding end (e.g., FIN).
[Figure: packet sequence SYN X, SYN Y, SYN Z, FIN Y, SYN X, FIN Z; destination X has 3 partial completions in the sequence]
Kompella, Singh, Varghese, IMC 2003

Partial Completion Filters
[Figure: Field Extraction feeds Hash 1, Hash 2, and Hash 3, which index counters in Stage 1, Stage 2, and Stage 3 (operation label: increment for SYN, decrement for FIN). Each stage checks "Equal to T?"; ALERT if all counters are above threshold.]
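
The increment-on-SYN, decrement-on-FIN scheme can be modeled as below. It is a sketch under assumptions: the class name `PartialCompletionFilter`, the SHA-256 based hashing, the `estimate` helper, and the example threshold are illustrative, and the event list follows the packet sequence visible in the Pattern 3 transcript.

```python
import hashlib


class PartialCompletionFilter:
    """Multistage counters: +1 per SYN, -1 per FIN, keyed by destination."""

    def __init__(self, stages: int, buckets: int, threshold: int) -> None:
        self.buckets = buckets
        self.threshold = threshold
        self.counters = [[0] * buckets for _ in range(stages)]

    def _index(self, stage: int, dest: bytes) -> int:
        digest = hashlib.sha256(bytes([stage]) + dest).digest()
        return int.from_bytes(digest[:8], "big") % self.buckets

    def observe(self, dest: bytes, kind: str) -> bool:
        """Apply one packet; return True if all counters exceed the threshold."""
        delta = {"SYN": 1, "FIN": -1}.get(kind, 0)
        above = True
        for stage, row in enumerate(self.counters):
            i = self._index(stage, dest)
            row[i] += delta
            if row[i] <= self.threshold:
                above = False
        return above

    def estimate(self, dest: bytes) -> int:
        """Smallest counter for `dest`: an upper estimate of its open connections."""
        return min(row[self._index(s, dest)] for s, row in enumerate(self.counters))


pcf = PartialCompletionFilter(stages=3, buckets=1000, threshold=1)
events = [(b"X", "SYN"), (b"Y", "SYN"), (b"Z", "SYN"),
          (b"Y", "FIN"), (b"X", "SYN"), (b"Z", "FIN")]
alerts = [dest for dest, kind in events if pcf.observe(dest, kind)]
print(alerts, pcf.estimate(b"X"))
```

Completed connections cancel out, so benign destinations drift back toward zero while a destination with many unanswered SYNs keeps all of its counters high.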

Analysis 1: Benign but Malformed Connections
[Figure: timeline across Intervals 1 to 4 showing a long-lived connection, SYN y retransmissions, FIN z retransmissions, and SYN x / FIN x]
Model benign but malformed connections as adding an extra SYN or FIN to an interval with probability 0.5.

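Analysis 2: Using Gaussian Approximation
[Figure: probability versus counter values, with the tail region marked "Greater than 6" (trailing symbol not recoverable). The values for the probability of false positives and the probability of false negatives are not recoverable.]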

Pattern 4: Many Flows
Many flows: In a measurement interval, find if the number of tuples exceeds a threshold.
[Figure: packet sequence S1, S6, S2, S5, S2; 6 packets but only 4 distinct sources]
Estan, Fisk, Varghese, IMC 2003, ACM TONS to appear

Simple Bitmap Counting
Hash based on flow identifier F.
Estimate: based on the number of bits set.
Problem: The bitmap takes too much memory to count a large number of flows.
[Figure: bitmap with four bits set]
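
A simple bitmap counter can be sketched as follows. The slide says only that the estimate is based on the number of bits set; the sketch assumes the standard linear-counting estimator b · ln(b / z), where b is the bitmap size and z the number of zero bits. The function name `estimate_distinct` and the SHA-256 hash are illustrative.

```python
import hashlib
import math


def estimate_distinct(flow_ids, bits: int) -> float:
    """Estimate the number of distinct flows from a hashed bitmap."""
    bitmap = bytearray(bits)  # one byte per bit, for readability
    for flow in flow_ids:
        digest = hashlib.sha256(flow).digest()
        # Duplicates hash to the same bit, so they are counted once.
        bitmap[int.from_bytes(digest[:8], "big") % bits] = 1
    zeros = bits - sum(bitmap)
    if zeros == 0:
        return math.inf  # bitmap saturated: too many flows for this size
    # Assumed estimator: corrects for distinct flows colliding on a bit.
    return bits * math.log(bits / zeros)


stream = [b"S1", b"S6", b"S2", b"S5", b"S2"]
print(estimate_distinct(stream, bits=1024))
```

Repeated packets from the same source set the same bit, so the estimate depends only on distinct flows. Saturation at high flow counts is what motivates the sampled and multi-resolution variants.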

Sampled Bitmap Counting
Solution: Keep only a sample of the bitmap.
Estimate: Scale up the sampled count.
Problem: Inaccurate if there are too few or too many flows.
[Figure: sampled bitmap with two bits set]

Multi-resolution Bitmap Counting
Solution: Multiple bitmaps, each covering a different range.
Estimate: Use the first bitmap that has less than 93.1% of its bits set, count, and scale.
[Figure: bitmaps by range, the first labeled "1-10 flows"]

Scalable Bitmap Counting
Solution: One bitmap with an additional scale factor that is increased when all bits are set.
Estimate: Count bits, correct, and multiply by the scale factor.
Can count to 1 million using a 15-bit scale factor and a 32-bit vector.
[Figure: bitmap labeled "1-10 flows"; at time 0, start with scale = 1; later, use with a larger scale (value not recoverable)]

Scaled Multi-resolution Bitmap Counting
Solution: Multiple bitmaps, each covering a different range but each with a scale factor.
Estimate: Use the first bitmap that has less than 93.1% of its bits set, count, and scale.
[Figure: bitmaps labeled "1-10 flows" with Scale = 5, Scale = 2, and Scale = 8]
F. Shahid et al., U.S. Patent Application

Pattern 4: Concurrent Approximate State Machines
State machine: In a measurement interval, detect the flows that hit a specified state machine. (A Bloom filter is a special case where the state machine is a membership check.)
[Figure: interleaved video packets of flows X and Y (B, I, and P frames); flow X has two packets in a B frame]
Bonomi, Mitzenmacher, Panigrahy, Singh, Varghese 2006

Concurrent State Machines
Implementation: We know 3 good implementations. The best of these uses a good hash table implementation (d-left) and simply substitutes a smaller signature of the flow for the flow identifier.
Results: For 64K flows, we need roughly 1 Mbit of memory.
Applications: First, video congestion control. We found good results by dropping B-frames during congestion and then tail-dropping until the next I-frame. Can handle twice the loss rate with the same quality. Second, P2P identification.
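
The idea of storing a short signature in place of the full flow identifier can be illustrated as below. This is a deliberately simplified sketch: a Python dictionary replaces the d-left hash table the slide refers to, and the state names, the `B_CONGESTED` event, and the transition table are hypothetical, loosely modeled on the video-dropping application.

```python
import hashlib

START = "PASS"
# Hypothetical transitions: begin dropping at a B-frame under congestion,
# keep dropping until the next I-frame arrives.
TRANSITIONS = {
    ("PASS", "B_CONGESTED"): "DROP",
    ("DROP", "I"): "PASS",
}


class ApproximateStateTable:
    """Per-flow state keyed by a short fingerprint instead of the flow ID."""

    def __init__(self, fingerprint_bits: int = 16) -> None:
        self.mask = (1 << fingerprint_bits) - 1
        self.states = {}  # fingerprint -> current state (non-start states only)

    def _fingerprint(self, flow: bytes) -> int:
        digest = hashlib.sha256(flow).digest()
        return int.from_bytes(digest[:8], "big") & self.mask

    def step(self, flow: bytes, event: str) -> str:
        fp = self._fingerprint(flow)
        state = self.states.get(fp, START)
        state = TRANSITIONS.get((state, event), state)  # unknown events keep state
        if state == START:
            self.states.pop(fp, None)  # start state needs no storage
        else:
            self.states[fp] = state
        return state


table = ApproximateStateTable()
trace = [table.step(b"flow-X", ev) for ev in ["P", "B_CONGESTED", "P", "I"]]
print(trace)
```

Two flows with the same fingerprint share a state entry, which is the source of the approximation. Shorter fingerprints save memory at the cost of more such mix-ups.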

Outline of Talk
Part 1: Motivation
Part 2: Basic Patterns and Algorithms
Part 3: Combining base patterns to solve useful application problems (traffic matrix, DoS, worms)
Part 4: Conclusions

Application 1: Traffic Matrix
Each entry router uses a multistage filter on traffic to destination prefixes to isolate subnets to which there is large traffic. Aggregating across all entry routers gives the "dominant" part of the traffic matrix. ATT reports rule for prefixes.
[Figure: ISP with Customer Sites 1, 2, and 3, annotated "reroute or add B/W"]

Application 2: DoS Attacks
Bandwidth attacks (e.g., Smurf): Pound the victim with large traffic of a certain type.
- Use the heavy-hitter pattern relative to traffic type (e.g., ICMP) to find attacked destinations.
Partial completion attacks (e.g., TCP SYN flood): May not have unusual bandwidth but are characterized by partial connections.
- Use the partial completion pattern as a front end for the Riverhead Guard module in Jaffa.

Application 4: Worm Detection
Manual signature extraction: slow, with enormous effort for each new worm.
Automatic signature extraction of a specific worm by automatically detecting an abstract worm.
[Figure: ISP with Infected 1 through Infected N, a New Victim, and an Inactive Address]
Sumeet Singh, G. Varghese, C. Estan, S. Savage, OSDI 2004, more in next class

Abstract Worm Definition and Detection
F1, Content repetition: The payload of the worm is seen frequently at the router.
- Use the heavy-hitter pattern with a hash H of the content as index.
- NetSift used large multistage filters. A variant of elephant traps invented by John Huber and Sumeet Singh seems to be the best solution for Hawkeye.
F2, Increasing infection levels: The same content is disbursed to an increasing number of distinct source-destination pairs.
- Use the many-flows pattern with the content hash H as index.

Hashing Implementation
Need a hash function, especially for content, that is easy to compute and random.
NetSift used a Rabin hash function, but that requires multiplies.
For Bloom filters, can make 1 large hash and take a portion per stage.
Much nicer hash function using Galois multiplication (XOR and shift).
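
Taking one large hash and using a different portion for each stage can be sketched as follows. SHA-256 is used here only as a convenient wide hash for illustration; it is not the Rabin or Galois-multiplication hash the slide mentions, and the function name `stage_indexes` is hypothetical.

```python
import hashlib


def stage_indexes(key: bytes, stages: int, bits_per_index: int) -> list:
    """Split one 256-bit hash into `stages` indexes of `bits_per_index` bits."""
    if stages * bits_per_index > 256:
        raise ValueError("not enough hash bits for the requested stages")
    wide = int.from_bytes(hashlib.sha256(key).digest(), "big")
    mask = (1 << bits_per_index) - 1
    # Stage s uses bits [s * bits_per_index, (s + 1) * bits_per_index).
    return [(wide >> (s * bits_per_index)) & mask for s in range(stages)]


# Four stages of 2**14 = 16,384 buckets each, from a single hash computation.
print(stage_indexes(b"payload-bytes", stages=4, bits_per_index=14))
```

Because the portions do not overlap, the per-stage indexes behave like independent hashes as long as the underlying wide hash is well mixed.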

Conclusions
Introspection/pattern detection can be useful for the next generation of networks, beyond faster and cheaper.
Base patterns can be implemented at high speeds.
Base patterns can be combined to solve useful application problems (traffic matrix, DoS, worms, etc.).
Only scratching the surface: need to build a library of patterns.

Introspection at UCSD
Ramana Kompella, Cristian Estan, Sumeet Singh