Model-Based Semantic Compression for Network-Data Tables
Shivnath Babu (Stanford University)
with Minos Garofalakis, Rajeev Rastogi, Avi Silberschatz (Bell Laboratories)
NRDM, Santa Barbara, CA, May 25, 2001

Introduction
Networks create massive, fast-growing relational-data tables
– Switch/router-level network performance data
– SNMP and RMON data
– Packet and flow traces (Sprint IP backbone: gigabytes/day)
– Call Detail Records (AT&T: millions of records/day)
– Web-server logs (Akamai: billions of log-lines/day)
This data is important for running big enterprises effectively
– Application and user profiling
– Capacity planning and provisioning, determining pricing plans
The data needs to be stored, analyzed, and (often) shipped across sites

Compressing Massive Tables
Good compression is essential
– Optimizes storage, I/O, and network bandwidth over the lifetime of the data
– Can afford "intelligent" compression
Example table: network flow measurements (simplified)

Protocol  Duration  Bytes  Packets
http      12        20K    3
http      16        24K    5
http      15        20K    8
http      19        40K    11
http      26        58K    18
ftp       ?         100K   24
ftp       ?         300K   35
ftp       18        80K    15

Compressing Massive Tables: A New Direction in Data Compression
Several generic compression techniques and tools exist (e.g., Huffman, Lempel-Ziv, Gzip)
– Syntactic: operate at the byte level, viewing the table as one large byte string
– Lossless: do not support lossy compression, so precision cannot be traded for better compression
Semantic compression
– Exploiting data characteristics and dependencies improves the compression ratio significantly
– Capturing aggregate data characteristics ties in with enterprise data monitoring and analysis
Benefits of lossy compression schemes
– Enable trading precision for performance (compression time and storage)
– The tradeoff can be adjusted by the user (flexible)
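To make "syntactic" concrete: a generic tool like Gzip's underlying DEFLATE sees only bytes. A minimal illustration with Python's zlib, serializing the flow table as CSV text purely for this sketch:

```python
# Syntactic, lossless baseline: the compressor sees an opaque byte string
# and exploits byte-level repetition only, never column semantics.
import zlib

rows = [
    ("http", 12, "20K", 3), ("http", 16, "24K", 5), ("http", 15, "20K", 8),
    ("http", 19, "40K", 11), ("http", 26, "58K", 18), ("ftp", 18, "80K", 15),
]
table_bytes = "\n".join(",".join(map(str, r)) for r in rows).encode()
packed = zlib.compress(table_bytes, 9)
assert zlib.decompress(packed) == table_bytes  # lossless: exact bytes back
```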

SPARTAN: A Model-Based Semantic Compressor
New compression paradigm: Model-Based Semantic Compression (MBSC)
– Extract data-mining models from the table
– Derive a compression plan using the extracted models
– Use models to represent data succinctly
– Use models to drive other model building
– Compress different data partitions using different models
The SPARTAN system implements a specific instantiation of MBSC
– Key idea: Classification and Regression Trees (CaRTs) can capture cross-column dependencies and eliminate entire data columns
– Lossless and lossy compression (within user-specified error bounds)

SPARTAN: Semantic Compression with Classification and Regression Trees (CaRTs)
A compact CaRT can eliminate an entire column by prediction
Classification tree predicting Protocol from Packets and Bytes (error = 0):
– Packets > 10? no: Protocol = http; yes: Bytes > 60K? yes: Protocol = ftp, no: Protocol = http
Regression tree predicting Duration from Packets (error <= 3):
– Packets > 16? no: Duration = 15; yes: Duration = 29
– Outlier stored exactly: Packets = 11, Duration = 19
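As a sketch of how such a tree pays for itself: the Duration column is dropped at compression time and re-derived at decompression, with rows the tree mispredicts beyond the tolerance kept in a small outlier table. This is illustrative Python, not SPARTAN's actual code; the row layout and helper names are assumptions:

```python
# Minimal sketch of CaRT-based column elimination.
ERROR_TOLERANCE = 3  # per-attribute error bound for Duration

def predict_duration(packets):
    """The regression tree from the slide: Packets > 16 ? 29 : 15."""
    return 29 if packets > 16 else 15

def compress(rows):
    """Drop the Duration column; remember exact values only for outliers."""
    outliers = {}  # row index -> exact Duration
    for i, row in enumerate(rows):
        if abs(row["Duration"] - predict_duration(row["Packets"])) > ERROR_TOLERANCE:
            outliers[i] = row["Duration"]
    stripped = [{k: v for k, v in row.items() if k != "Duration"} for row in rows]
    return stripped, outliers

def decompress(stripped, outliers):
    """Re-derive Duration from the tree, overriding with stored outliers."""
    for i, row in enumerate(stripped):
        row["Duration"] = outliers.get(i, predict_duration(row["Packets"]))
    return stripped

rows = [
    {"Protocol": "http", "Duration": 12, "Packets": 3},
    {"Protocol": "http", "Duration": 19, "Packets": 11},  # outlier: |19 - 15| > 3
    {"Protocol": "ftp",  "Duration": 18, "Packets": 15},
]
stripped, outliers = compress(rows)
assert all(abs(r["Duration"] - o["Duration"]) <= ERROR_TOLERANCE
           for r, o in zip(decompress(stripped, outliers), rows))
```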

SPARTAN Compression Problem Formulation
Given: a data table over a set of attributes X and per-attribute error tolerances
Find: a set of attributes P to be predicted using CaRTs such that:
– Overall storage cost (CaRTs + outliers + materialized columns) is minimized
– Each attribute in P is predicted within its specified tolerance
– A predicted attribute is not used to predict another attribute; otherwise errors compound
Non-trivial problem
– The space of possible CaRT predictors is exponential in the number of attributes
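Stated as an optimization problem (the notation below is a reconstruction; the slide gives the problem only in words):

```latex
\min_{P \subseteq X} \;
  \underbrace{\sum_{X_i \in X \setminus P} \mathrm{MaterCost}(X_i)}_{\text{columns stored verbatim}}
  \;+\;
  \underbrace{\sum_{X_i \in P} \bigl(\mathrm{CaRTCost}(X_i) + \mathrm{OutlierCost}(X_i)\bigr)}_{\text{trees plus exceptions}}
```

subject to each attribute in P being predicted within its tolerance e_i, using only attributes in X \ P as predictors.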

Two-Phase Compression
– Planning phase: come up with a compression plan
– Compression phase: scan the data and compress it using the plan
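A minimal sketch of this two-phase split; `build_plan` and the plan's `encode` method are hypothetical stand-ins for the components described on the following slides:

```python
import random

def semantic_compress(table, tolerances, build_plan, sample_size=50_000):
    """Two-phase MBSC driver (illustrative sketch, not SPARTAN's code)."""
    # Planning phase: mine models and choose CaRTs on a small random sample.
    sample = random.sample(table, min(sample_size, len(table)))
    plan = build_plan(sample, tolerances)  # e.g., DependencyFinder + CaRTSelector
    # Compression phase: a single scan of the full table applies the fixed plan.
    return [plan.encode(row) for row in table]
```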

SPARTAN Architecture: Planning Phase
Diagram: a random sample of the input table (attributes X1, X2, X3, X4) and the error-tolerance vector [e1, e2, e3, e4] feed into the DependencyFinder, the first stage of building the semantic-compression plan.

SPARTAN's DependencyFinder
Input: random sample of the input table T
Output: a Bayesian network (BN) over T's attributes
Structure of the BN: an attribute's neighbors are its "strongly" related attributes
Goal: identify strong dependencies among attributes to prune the (huge) search space of possible CaRT models
Example BN nodes: Education, Profession, Employer, Income
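A full BN learner is beyond a slide, but the pruning idea can be sketched: score pairwise dependence on the sample and keep only the strongest attributes as each column's candidate predictors. This uses empirical mutual information as a stand-in for the BN neighborhood; all names here are assumptions, not SPARTAN's interfaces:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information between two discrete (or binned) columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def candidate_predictors(sample, target, k=2):
    """Rank the other attributes by dependence with `target`; keep the top k."""
    others = [a for a in sample[0] if a != target]
    scores = {a: mutual_information([row[a] for row in sample],
                                    [row[target] for row in sample])
              for a in others}
    return sorted(others, key=scores.get, reverse=True)[:k]
```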

SPARTAN Architecture: Planning Phase
Diagram: the DependencyFinder's output, a Bayesian network over X1, X2, X3, X4, is handed to the CaRTSelector, the next stage of building the semantic-compression plan.

SPARTAN's CaRTSelector
The heart of SPARTAN's semantic-compression engine
– Uses the Bayesian network constructed by the DependencyFinder
– Output: the subset of attributes P to be predicted (within tolerance) and the corresponding CaRTs
Hard optimization problem: a strict generalization of Weighted Maximum Independent Set (WMIS), which is NP-hard
Two solutions:
– A greedy heuristic
– A new heuristic based on WMIS approximation algorithms

Maximum Independent Set (MIS) CaRTSelector
Exploits a mapping of the CaRTSelector problem to WMIS
– Hill-climbing search that proceeds in iterations
– Start with the set of predicted attributes (P) empty and all attributes materialized (M)
– Each iteration improves the earlier solution by moving a selected subset of nodes from M to P: map to a WMIS instance and use its solution
– "Weight" of a node (attribute) = materializationCost - predictionCost
– Stop when no improvement is possible
Number of CaRTs built (n = number of attributes):
– Greedy CaRTSelector: O(n)
– MIS CaRTSelector: O(n^2) in the worst case, O(n log n) "on average"
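The loop structure, as a hedged sketch: the cost functions and the WMIS solver are stubbed-out parameters, not SPARTAN's actual interfaces.

```python
def select_carts(attributes, mater_cost, prediction_cost, solve_wmis):
    """Hill-climbing CaRT selection (sketch): grow P, shrink M, until no gain."""
    M, P = set(attributes), set()
    while True:
        # Node weight = storage saved by predicting the attribute from the
        # remaining materialized attributes instead of storing it verbatim.
        weights = {a: mater_cost(a) - prediction_cost(a, M - {a}) for a in M}
        gainful = {a: w for a, w in weights.items() if w > 0}
        moved = solve_wmis(gainful)  # independent set in the BN-derived graph
        if not moved:
            return P, M  # no improving move: current plan is the answer
        P |= moved
        M -= moved
```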

SPARTAN Architecture: Planning Phase
Diagram: the full planning pipeline. The random sample of the input table and the error-tolerance vector [e1, e2, e3, e4] enter the DependencyFinder; the CaRTSelector partitions the attributes X1, X2, X3, X4 into predicted (P) and materialized (M) sets, invoking the CaRTBuilder on candidates, e.g. BuildCaRT([{X1, X2} -> X3], e3) yields the tree "X2 > 16? no: X3 = 15; yes: X3 = 29" with outlier (X2 = 11, X3 = 19); a RowAggregator then produces the final semantic-compression plan.

Experimental Results: Summary
The SPARTAN system has been tested on several real data sets
Full details are in:
– S. Babu, M. Garofalakis, R. Rastogi. "SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables." SIGMOD 2001
Better compression ratios than Gzip and Fascicles
– Factors of up to 3 (for 5-10% error tolerances on numeric attributes)
– 20-30% better on average for 1% error tolerance on numeric attributes
Small sample sizes are effective for model-based compression
– 50KB is often sufficient

Conclusions
MBSC: a novel approach to massive-table compression
SPARTAN: a specific instantiation of MBSC
– Uses CaRTs to eliminate significant fractions of columns by prediction
– Uses a Bayesian network to identify predictive correlations and drive the selection of CaRTs
– The CaRT-selection problem is NP-hard
– Two heuristic-search-based algorithms for CaRT selection
Experimental evidence for the effectiveness of SPARTAN's model-based approach

Future Direction in MBSC: Compressing Continuous Data Streams
Networks generate continuous streams of data
– E.g., packet traces, flow traces, SNMP data
Applying MBSC to continuous data streams
– Data characteristics and dependencies can vary over time
– Goal: the compression plan should adapt to changes in data characteristics
– Models must be maintained online as tuples arrive in the stream
Study data mining models with respect to online maintenance (see the sketch below):
– Incremental maintenance
– Keeping up with data stream speeds
– Parallelism
– Trading precision for performance
– Eager vs. lazy schemes
– The compression plan must be kept consistent with the models
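One concrete shape such adaptation could take, as a lazy scheme (all names, the window size, and the 5% threshold are assumptions for illustration): monitor the outlier rate of the current model over recent tuples and trigger a re-plan when it drifts too high.

```python
from collections import deque

class AdaptiveCompressor:
    """Lazy adaptation sketch: re-plan only when the current model drifts."""
    def __init__(self, model, tolerance, rebuild, window=10_000,
                 max_outlier_rate=0.05):
        self.model, self.tolerance, self.rebuild = model, tolerance, rebuild
        self.recent = deque(maxlen=window)  # 1 = outlier, 0 = within tolerance
        self.max_outlier_rate = max_outlier_rate

    def consume(self, tup):
        """Record one tuple's prediction error; re-plan on sustained drift."""
        err = abs(tup["Duration"] - self.model.predict(tup))
        self.recent.append(1 if err > self.tolerance else 0)
        if (len(self.recent) == self.recent.maxlen and
                sum(self.recent) / len(self.recent) > self.max_outlier_rate):
            self.model = self.rebuild()  # user-supplied re-planning hook
            self.recent.clear()
```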

Future Direction in MBSC: Distributed MBSC
The data collection infrastructure is often distributed
– Multiple monitoring points across an ISP's network
– Web servers are replicated for load balancing and reliability
Data must be compressed before being transferred to warehouses or repositories
MBSC can be done locally at each collection point
– The lack of a "global" view of the data might result in suboptimal compression plans
More sophisticated approaches might be beneficial
– A distributed data mining problem
– The opportunity cost of network bandwidth is high, so communication overhead must be kept minimal

Future Direction in MBSC: Using Extracted Models in Other Contexts
A crucial side-effect of MBSC: capturing data characteristics helps enterprise data monitoring and analysis
– Interaction models (e.g., Bayesian networks) enable event correlation and root-cause analysis for network management
– Anomaly detection: intrusions, (distributed) denial-of-service attacks
Diagram: network data feeds data mining models, which drive compression, root-cause analysis, and anomaly detection.