© 2009 IBM Corporation Simply Top Talkers Jeroen Massar, Andreas Kind and Marc Ph. Stoecklin IBM Research - Zurich.

Slides:

Advertisements

Similar presentations

Network Aware Forward Caching Presenter: Alexandre Gerber Jeffrey Erman, Mohammad T. Hajiaghayi, Dan Pei, Oliver Spatscheck AT&T Labs Research April 24.

Advertisements

Sketch-based Change Detection Balachander Krishnamurthy (AT&T) Subhabrata Sen (AT&T) Yin Zhang (AT&T) Yan Chen (UCB/AT&T) ACM Internet Measurement Conference.

Credit hours: 4 Contact hours: 50 (30 Theory, 20 Lab) Prerequisite: TB143 Introduction to Personal Computers.

A Survey of Web Cache Replacement Strategies Stefan Podlipnig, Laszlo Boszormenyl University Klagenfurt ACM Computing Surveys, December 2003 Presenter:

New Directions in Traffic Measurement and Accounting Cristian Estan – UCSD George Varghese - UCSD Reviewed by Michela Becchi Discussion Leaders Andrew.

OpenSketch Slides courtesy of Minlan Yu 1. Management = Measurement + Control Trafﬁc engineering – Identify large traffic aggregates, traffic changes.

A Fast and Compact Method for Unveiling Significant Patterns in High-Speed Networks Tian Bu 1, Jin Cao 1, Aiyou Chen 1, Patrick P. C. Lee 2 Bell Labs,

Cloud Control with Distributed Rate Limiting Raghaven et all Presented by: Brian Card CS Fall Kinicki 1.

Composite Subset Measures Lei Chen, Paul Barford, Bee-Chung Chen, Vinod Yegneswaran University of Wisconsin - Madison Raghu Ramakrishnan Yahoo! Research.

Fast, Memory-Efficient Traffic Estimation by Coincidence Counting Fang Hao 1, Murali Kodialam 1, T. V. Lakshman 1, Hui Zhang 2, 1 Bell Labs, Lucent Technologies.

CMPT 225 Sorting Algorithms Algorithm Analysis: Big O Notation.

Efficiency concerns in Privacy Preserving methods Optimization of MASK Shipra Agrawal.

SIGMOD 2006University of Alberta1 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Presented by Fan Deng Joint work with.

Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:

Chapter 8 File organization and Indices.

Counting NATted Hosts. The idea : The ID field in the IP Packet is implemented in many cases as a simple counter. I.E. The host behind a NAT will send.

Improving Proxy Cache Performance: Analysis of Three Replacement Policies Dilley, J.; Arlitt, M. A journal paper of IEEE Internet Computing, Volume: 3.

Complexity Analysis (Part I)

What ’ s Hot and What ’ s Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishman Rutgers University ACM Principles of Database Systems.

Using Secondary Storage Effectively In most studies of algorithms, one assumes the "RAM model“: –The data is in main memory, –Access to any item of data.

Skip Lists1 Skip Lists William Pugh: ” Skip Lists: A Probabilistic Alternative to Balanced Trees ”, 1990  S0S0 S1S1 S2S2 S3S3 

1 ES 314 Advanced Programming Lec 2 Sept 3 Goals: Complete the discussion of problem Review of C++ Object-oriented design Arrays and pointers.

1 The Mystery of Cooperative Web Caching 2 b b Web caching : is a process implemented by a caching proxy to improve the efficiency of the web. It reduces.

A Characterization of Processor Performance in the VAX-11/780 From the ISCA Proceedings 1984 Emer & Clark.

Understanding Network Failures in Data Centers: Measurement, Analysis and Implications Phillipa Gill University of Toronto Navendu Jain & Nachiappan Nagappan.

Hash, Don’t Cache: Fast Packet Forwarding for Enterprise Edge Routers Minlan Yu Princeton University Joint work with Jennifer.

Memory Management Last Update: July 31, 2014 Memory Management1.

SIGCOMM 2002 New Directions in Traffic Measurement and Accounting Focusing on the Elephants, Ignoring the Mice Cristian Estan and George Varghese University.

Oracle Index study for Event TAG DB M. Boschini S. Della Torre

Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.

Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.

An Autonomic Framework in Cloud Environment Jiedan Zhu Advisor: Prof. Gagan Agrawal.

Chapter 16 Methodology – Physical Database Design for Relational Databases.

« Performance of Compressed Inverted List Caching in Search Engines » Proceedings of the International World Wide Web Conference Commitee, Beijing 2008)

CEDAR Counter-Estimation Decoupling for Approximate Rates Erez Tsidon Joint work with Iddo Hanniel and Isaac Keslassy Technion, Israel 1.

Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.

Major objective of this course is: Design and analysis of modern algorithms Different variants Accuracy Efficiency Comparing efficiencies Motivation thinking.

A Formal Analysis of Conservative Update Based Approximate Counting Gil Einziger and Roy Freidman Technion, Haifa.

1 LD-Sketch: A Distributed Sketching Design for Accurate and Scalable Anomaly Detection in Network Data Streams Qun Huang and Patrick P. C. Lee The Chinese.

Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.

CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.

Methodology – Physical Database Design for Relational Databases.

Addressing Modes Chapter 6 S. Dandamudi To be used with S. Dandamudi, “Introduction to Assembly Language Programming,” Second Edition, Springer,

A Bandwidth Estimation Method for IP Version 6 Networks Marshall Crocker Department of Electrical and Computer Engineering Mississippi State University.

TCAM –BASED REGULAR EXPRESSION MATCHING SOLUTION IN NETWORK Phase-I Review Supervised By, Presented By, MRS. SHARMILA,M.E., M.ARULMOZHI, AP/CSE.

Calculating frequency moments of Data Stream

@ Carnegie Mellon Databases 1 Finding Frequent Items in Distributed Data Streams Amit Manjhi V. Shkapenyuk, K. Dhamdhere, C. Olston Carnegie Mellon University.

Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From

REU 2009-Traffic Analysis of IP Networks Daniel S. Allen, Mentor: Dr. Rahul Tripathi Department of Computer Science & Engineering Data Streams Data streams.

Flow sampling in IPFIX: Status and suggestion for its support Maurizio Molina,

Gorilla: A Fast, Scalable, In-Memory Time Series Database

Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window MADALGO – Center for Massive Data Algorithmics, a Center of the Danish.

Introduction toData structures and Algorithms

Complexity Analysis (Part I)

Frequency Counts over Data Streams

The Variable-Increment Counting Bloom Filter

Chapter 12: Query Processing

Pyramid Sketch: a Sketch Framework

Optimal Elephant Flow Detection Presented by: Gil Einziger,

Advanced Topics in Data Management

Objective of This Course

Approximate Frequency Counts over Data Streams

Slides adapted from Donghui Zhang, UC Riverside

Heavy Hitters in Streams and Sliding Windows

By: Ran Ben Basat, Technion, Israel

A flow aware packet sampling mechanism for high speed links

Lu Tang , Qun Huang, Patrick P. C. Lee

Complexity Analysis (Part I)

Complexity Analysis (Part I)

Presentation transcript:

© 2009 IBM Corporation Simply Top Talkers Jeroen Massar, Andreas Kind and Marc Ph. Stoecklin IBM Research - Zurich

© 2010 IBM Corporation IBM Research – Zurich 2 Motivation and Outline Need to understand and correctly handle dominant aspects within the overall flow of traffic –Top-k Problem –Optimize peering relationships (top autonomous systems) –Analyze congestion cases (top hosts) Technical challenge –Compute sorted top-n views from very large numbers of flow information records –No possibility to store individual counters per aspect components This presentation –Propose and analyze simple techniques to compute top-k listings for single and composed traffic aspects –Show that the techniques are suitable for in-memory aggregation databases

© 2010 IBM Corporation IBM Research – Zurich 3 Related Work E. D. Demaine, A. Lopez-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. In Proceedings of the 10th Annual European Symposium on Algorithms, pages 348–360, C. Estan and G. Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst., 21(3):270–313, Metwally A, Agrawal D, El Abbadi A (2005) Efficient computation of frequent and top-k elements in data streams. In: Proceeding of the 2005 international conference on database theory (ICDT05), Edinburgh, UK, pp 398–412 X. Dimitropoulos, P. Hurley, and A. Kind: Probabilistic Lossy Counting: An Efficient Algorithm for Finding Heavy Hitters," ACM SIGCOMM Computer Communication Review, Jan X. Dimitropoulos, M. Stoecklin, P. Hurley, and A. Kind: The Eternal Sunshine of the Sketch Data Structure," Elsevier Computer Networks, 2008.

© 2010 IBM Corporation IBM Research – Zurich 4 Top-k Problem Large number of unique items per observation period –Large alphabet –Counters and usage arrays dont fit into memory How to cut back sorted list? –Correct ordering –Correct volume information for each item –Reduce the cost of inserting new items (that anyway would not make it into the sorted top-k list) …… Primary observation period (e.g. 5 minutes) …… List of counters and usage variation information (ie, arrays) for all unique items (ie, colors) during observation period … … … … … … … … Secondary observation periods (e.g. hours, weeks, months, years)

© 2010 IBM Corporation IBM Research – Zurich 5 One Threshold Scheme (max1) read next Item record update item in list entry exists for item in list? create item in list yes start take next primary list start |IP list| > max1 add or update entries in list cut list back to max1 yes Merging primary list into secondary list Primary list can grows up to unique number of items Secondary lists is limited by max1

© 2010 IBM Corporation IBM Research – Zurich 6 Two Threshold Scheme (max1, max2) read next Item record update item in list entry exists for item in list? |list| > max2 cut list back to max1 add item in list yes start take next primary list start |IP list| > max1 add or update entries in list cut list back to max1 yes Merging primary list into secondary list Secondary lists is limited by max1 Primary list is limited by max2 Cut back is below max2 to reduce costly cut back operations (incl. sorting)

© 2010 IBM Corporation IBM Research – Zurich 7 Two Threshold Scheme (max1, max2) with Heuristic read next Item record update item in list entry exists for item in list? |list| > max2 cut list back to max1 add item in list |list| > max1 accept record? yes start Use heuristic to avoid insertion of record information that is likely to have no impact on top items take next primary list start |IP list| > max1 add or update entries in list cut list back to max1 yes Merging primary list into secondary list add item in list Secondary lists is limited by max1 Primary list is limited by max2 Cut back is below max2 to reduce costly cut back operations (incl. sorting)

© 2010 IBM Corporation IBM Research – Zurich 8 Heuristics 0.Exact max1,max2 style: No Heuristic Short duration (< 1 second) Low Packet count in one flow ( < 4 pkts) Bytes < total_bytes_seen/total_flows_seen (averaging) These heuristics are very simple and more importantly cheap to implement memory and cpu wise.

© 2010 IBM Corporation IBM Research – Zurich 9 Data Set We picked three days of data collected at one of the IBM datacenters Did an expensive query: calculate the exact top K entries Fortunately there are machines with large amounts of memory and number of cores… Then we applied the defined heuristics to the data set, which run in near real-time as we dont need a lot of memory or cpu power to store all the entries One of the biggest advantages of course is that one can almost stay in-cache when one has proper cpus (4 or 8 MiB cache per core) results on next slides…. Note that the dataset represents several Petabytes of network traffic, the error rate in comparison is thus effectively very low…..

© 2010 IBM Corporation IBM Research – Zurich 10 Error rate for Top K without heuristic Amount of bytes that are not seen for top X items

© 2010 IBM Corporation IBM Research – Zurich 11 Error rate for Top K when applying various heuristics

© 2010 IBM Corporation IBM Research – Zurich 12 Conclusion Simple heuristics and thus cheap processing: –Differences are minimal for this datacenter dataset, other datasets, eg for an enterprise network seem to prefer the bytes < avg heuristic a bit more, which makes sense as it is hard to get in the top K unless you can at least beat the average. Two thresholds –max2 = 2 * max1 –Correct order up to k = X * max1

© 2010 IBM Corporation AURORA Questions?