1
Tackling Network Management Problems with Machine Learning Techniques
Preliminary Oral Exam
Yu Jin
Advisor: Professor Zhi-Li Zhang
2
Motivation
Effective solutions are required for a variety of problems in large networks:
- Traffic classification
- Anomaly detection
- Host/traffic profiling
- Trouble-shooting
Manual rule-based solutions are either costly or unavailable, and scalability is a major issue.
3
Our Solution
- Abstract these problems into machine learning problems
- Select advanced machine learning algorithms to solve them
- Address the scalability issue
- Deliver a system and evaluate it in a real operational environment
4
Outline
Past work – Traffic classification:
- Design and implementation of a modular machine learning architecture for flow-level traffic classification
- Analysis of traffic interaction patterns using traffic activity graphs (TAGs)
- Visualizing spatial traffic class distribution using colored TAGs
- Traffic classification using colored TAGs
On-going work:
- Customer ticket prediction and trouble-shooting in DSL networks
Summary and timetable
5
A Light-Weight Modular Machine Learning Approach to Large-Scale Network Traffic Classification
6
Motivation
Traffic classification is required for:
- Security monitoring
- Traffic policing and prioritization
- Predicting application trends
- Identifying new applications
It is also an interesting research topic spanning multiple areas: machine learning, traffic profiling, social network analysis, and system optimization.
Our goals: design an operational system for a large ISP network, and learn valuable lessons for solving other practical problems.
7
Challenges
- Scalability: training and operating on 10 Gbps links
- Accuracy: similar performance to the rule-based classifier
- Stability: remain accurate without human intervention
- Versatility: reconfigurable
(Figure: new flow arrival rate per minute, TCP vs. UDP, on a 1 Gbps link over one week, Mon–Sun; y-axis up to 300K flows/min.)
8
Current solution – Rule-based classifier
- Matches layer-4/layer-7 packet headers against manually defined rules
- Expensive in operation (flow/packet sampling is required)
- Expensive in deployment (special hardware is required)
- Inapplicable when packets are encrypted, e.g., end-to-end IPSec traffic
However, the rule-based classifier can provide a good set of “labeled” flow data.
9
A machine learning solution
(Diagram: raw traffic data is collected from several networks, labeled by the rule-based classifier, and used as training data.)
10
A modular machine learning architecture
- A modular architecture enables parallelization in both training and operation
- First-level modularization: pre-partition the data by flow features (flow size, IP protocol, etc.)
- Better accuracy, and higher scalability from parallelization
(Diagram: flow data is partitioned, then classified independently on each of the m partitions to produce per-partition predictions.)
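As a rough sketch of the first-level modularization, a flow could be routed to a sub-classifier by coarse features such as IP protocol and flow size. The deck does not give the exact partitioning rules, so the byte threshold and key names below are illustrative assumptions:

```python
def partition_flow(flow):
    """Route a flow record to a partition key by coarse features.
    Protocol numbers: 6 = TCP, 17 = UDP. The 1000-byte size
    threshold is a hypothetical example, not the deck's rule."""
    proto = "tcp" if flow["protocol"] == 6 else "udp"
    size = "small" if flow["bytes"] < 1000 else "large"
    return f"{proto}-{size}"
```

Each partition key then selects an independently trained (and independently parallelizable) classifier.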
11
Second-level modularization
- From a single k-class classifier to k 1-vs-all binary classifiers
- Accelerates training and operation
- Low memory consumption
- Parallelization
- No significant performance loss
(Diagram: within partition j, each of the k per-application binary classifiers scores the flow data; the final prediction is the class with the maximum posterior P(C_j | x).)
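The combination rule above can be sketched in a few lines: run all k binary classifiers and take the class with the highest calibrated posterior. The posterior functions here are placeholders standing in for the calibrated boosted-stump classifiers:

```python
def one_vs_all_predict(x, binary_posteriors):
    """binary_posteriors: one function per application class, each
    mapping a flow feature vector x to a calibrated P(C_j | x).
    Returns the index of the class with the maximum posterior."""
    scores = [p(x) for p in binary_posteriors]
    return max(range(len(scores)), key=lambda j: scores[j])

# Toy stand-ins for three hypothetical classes (Web, Email, P2P):
posteriors = [
    lambda x: 0.2,   # P(Web | x)
    lambda x: 0.7,   # P(Email | x)
    lambda x: 0.1,   # P(P2P | x)
]
```

Because each binary classifier is independent, the k evaluations can run in parallel, which is the point of this modularization.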
12
Training a binary classifier
Sampling is necessary due to the huge amount of training data (90 million TCP flows plus 85 million UDP flows). We use weighted threshold sampling for a more balanced and representative training set:

if count(C_j) <= θ: keep all the flows in C_j
else: sample with rate θ / count(C_j)
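A runnable version of the weighted threshold sampling rule above might look as follows (the dict-of-lists input format is an assumption for illustration):

```python
import random

def weighted_threshold_sample(flows_by_class, theta, seed=0):
    """Keep every flow of a class with <= theta flows; otherwise
    sample the class at rate theta / count, so every class
    contributes roughly theta flows to the training set."""
    rng = random.Random(seed)
    sample = []
    for flows in flows_by_class.values():
        if len(flows) <= theta:
            sample.extend(flows)                      # keep all
        else:
            rate = theta / len(flows)                 # downsample
            sample.extend(f for f in flows if rng.random() < rate)
    return sample
```

This is what makes the resulting training set non-IID relative to live traffic, which is why the calibration step described later is needed.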
13
Selection of binary classifier
Any classifier can be used as a component in our traffic classification architecture. We choose boosting decision stumps (1-level decision trees):
- Fast
- Accurate
- Simple
- Implicit L1 regularization
(Figure: TCP flow error rates for different binary classifiers. The non-linear classifier Boosting Trees (BTree) has the best performance; Boosting Stumps (BStump) is slightly better than L1-Maxent.)
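A decision stump is just a single threshold test on one feature. A minimal weighted stump fit, the weak-learner step inside boosting (an illustrative sketch, not the deck's implementation), could be:

```python
import numpy as np

def fit_stump(X, y, w):
    """Fit a 1-level decision tree on weighted data by exhaustive
    search. X: (n, d) features; y: labels in {-1, +1}; w: sample
    weights. Returns (feature index, threshold, polarity) with the
    smallest weighted error."""
    n, d = X.shape
    best = (0, 0.0, 1, float("inf"))
    for j in range(d):
        for t in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - t) > 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, polarity, err)
    return best[:3]
```

A booster like AdaBoost repeatedly calls such a fit with reweighted samples; the per-leaf scores s+/s- shown in the backup slides come from the boosted combination.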
14
Logistic calibration
- The IID assumption is violated by the weighted threshold sampling method
- Raw AdaBoost score outputs cannot be combined directly across the binary classifiers
- We therefore calibrate the binary classifiers:
- Address the difference in traffic distributions
- Convert scores f_c(x) to posterior probabilities P(C|x)
(Figure: reliability diagram for TCP Web.)
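One standard way to map scores to posteriors is to fit a two-parameter logistic function (Platt-style scaling) on a held-out sample. The deck does not specify its calibrator's exact form, so the gradient-descent fit below is a minimal stand-in under that assumption:

```python
import math

def platt_calibrate(scores, labels, lr=0.5, iters=5000):
    """Fit P(C|x) = sigmoid(a * f(x) + b) to (score, label) pairs by
    gradient descent on the log loss. labels are in {0, 1}."""
    a, b, n = 0.0, 0.0, len(scores)
    for _ in range(iters):
        ga = gb = 0.0
        for f, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * f + b)))
            ga += (p - y) * f        # d(log loss)/da
            gb += (p - y)            # d(log loss)/db
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

def posterior(score, a, b):
    """Calibrated posterior probability for a raw classifier score."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))
```

After calibration, the per-class posteriors live on a common probability scale, so taking the maximum across the k binary classifiers is meaningful.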
15
Training architecture for a binary classifier
- One binary classifier for each application class
- One calibrator is trained from the classification results on a small independent flow sample (simple random sample)
(Diagram: thresholded-sampled training data trains the binary classifier; a reliability diagram on the independent sample trains the calibrator; together they form the application classifier, with a training feedback loop.)
16
Performance evaluation – Accuracy
- Our classifier reproduces the classification results of the rule-based classifier with high accuracy
- Direct training of a multi-class classifier yields little accuracy gain, according to tests on small samples
17
Scalability
Scalability in training:
- Accuracy increases with the training data size and the number of iterations
- Less than 1.5 hours with 700K training samples and 640 iterations
Scalability in operation (with basic optimization):
- Recall that on 2 x 1 Gbps links, the new flow arrival rate is 450K/min
- We achieve 800K/min with a single thread
- Close to 7M/min with 10 threads, which can scale up to 10 Gbps links
18
Evaluation on stability
Temporal stability:
- After two months, the flow error rates are 3±0.5% for TCP traffic and 0.4±0.04% for UDP traffic
- After one year, the TCP flow error rate is 5.48±0.54% and the UDP flow error rate is 1.22±0.2%
Spatial stability:
- Train and test at two geolocations
19
Importance of the port number
- Using our architecture, we obtain an error rate of 4.1% for TCP and 0.72% for UDP with only port features (3.13% for TCP and 0.35% for UDP after adding other flow features)
- We use a port graph to visualize and understand the machine-learned port rules:
- TCP Multimedia uses port 554 and ports 5000–5180
- UDP Games uses port 88 (Xbox)
- UDP Chat uses ports 6661–6670
20
Summary
- We have designed a modular machine learning architecture for large-scale ISP/enterprise network traffic classification
- The system scales up to 10 Gbps links and remains accurate for one year on multiple sites without re-training
- We have evaluated the system on a large operational ISP network
What if the port number and other flow-level statistics are unavailable? For example, when classifying end-to-end IPSec traffic.
21
IPSec traffic
- Only limited traffic statistics are available for IPSec traffic: number of packets, number of bytes, average packet inter-arrival time, and average packet size in both directions; no port number, protocol, payload, etc. (diagram: an IPSec packet consists of the IPSec header followed by the encrypted inner packet)
- Only 80% accuracy using the proposed machine learning architecture
Our solution: classification on traffic activity graphs
22
Visualizing and Inferring Network Applications using Traffic Activity Graphs (TAGs)
23
Traffic Activity Graphs
- Nodes represent the hosts in the network
- Edges represent the interactions between these hosts
- Defined on a set of flows collected over a time period T
- Help us study the communication patterns of different applications
(Diagram: flows between UMN and the Internet over period T yield per-application graphs, e.g., G_HTTP from ports 80/443, G_Email from ports 25/993, G_Gnutella from ports 6346/6348.)
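Constructing a per-application TAG from a flow trace is straightforward; the sketch below uses simplified (src, dst, dst_port) flow tuples, an assumption standing in for real flow records:

```python
def build_tag(flows, app_ports):
    """Build a traffic activity graph (TAG) for one application.
    Nodes are hosts; an edge connects a host pair that exchanged
    at least one flow on the application's ports during the
    observation window. flows: iterable of (src, dst, dst_port)."""
    edges = set()
    for src, dst, dport in flows:
        if dport in app_ports:
            edges.add((src, dst))
    nodes = {h for e in edges for h in e}
    return nodes, edges
```

For example, G_HTTP would be built with app_ports = {80, 443}, G_Email with {25, 993}, per the slide.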
24
Application traffic activity graphs (TAGs) and their evolution
(Figures: evolution of the HTTP, DNS, AOL IM, and Email TAGs, each growing from 1K to 3K.)
25
Properties of TAGs
- We observe differences in basic statistics, such as graph density and average in/out degree
- All TAGs contain a giant connected component (GCC), which accounts for more than 85% of all the edges
26
Understanding the interaction patterns by decomposing TAGs
Block structures in the adjacency matrices indicate dense subgraphs in TAGs.
(Figure: 0/1 adjacency matrices for the HTTP, Email, AOL IM, BitTorrent, and DNS TAGs.)
27
TAG decomposition using tri-nonnegative matrix factorization
Extracting dense subgraphs can be formulated as a co-clustering problem: cluster hosts into inside host groups and outside host groups, then extract pairs of groups with more edges connecting them (higher density). This co-clustering problem can be solved by the tri-nonnegative matrix factorization (tNMF) algorithm, which minimizes the Frobenius-norm error ‖A − RHC‖²_F subject to nonnegativity constraints on R, H, and C.
28
Tri-nonnegative matrix factorization
A ≈ R × H × C, where:
- A is the adjacency matrix associated with the TAG
- R is the row group membership indicator matrix
- C is the column group membership indicator matrix
- H is proportional to the subgraph density matrix
We identify dense subgraphs based on the large entries in H. R is m-by-k and C is r-by-n; hence the product is a low-rank approximation of A, with rank at most min(k, r).
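Once hard group memberships are fixed, the role of H is easy to see: its entries are (proportional to) the edge densities of the blocks of A induced by the row and column groups. The helper below computes that density matrix directly; it illustrates what tNMF recovers rather than the iterative tNMF algorithm itself (which is covered in the backup slides):

```python
import numpy as np

def block_density(A, row_groups, col_groups):
    """Density of each (row group, col group) block of adjacency
    matrix A, given hard co-cluster assignments as integer arrays.
    Large entries correspond to dense subgraphs, the role H plays
    in A ~= R H C."""
    k = row_groups.max() + 1
    r = col_groups.max() + 1
    H = np.zeros((k, r))
    for i in range(k):
        for j in range(r):
            block = A[np.ix_(row_groups == i, col_groups == j)]
            H[i, j] = block.mean() if block.size else 0.0
    return H
```

Thresholding the large entries of H then yields the dense subgraphs (the in-stars, out-stars, and bi-meshes discussed next).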
29
Subgraph prototypes
Recall that inside (UMN) hosts are (likely) service requesters and outside hosts are service providers. Based on the number of inside/outside hosts in each subgraph, we propose three prototypes:
- In-star: one inside client accesses multiple outside servers
- Out-star: multiple inside clients access one outside server
- Bi-mesh: multiple inside clients interact with many outside servers
30
Characterizing TAGs with subgraph prototypes
- Different application TAGs contain different types of subgraphs
- We can distinguish and characterize applications based on their subgraph components
What do these subgraphs mean?
(Figure: subgraph-prototype breakdowns for the HTTP, Email, AOL IM, BitTorrent, and DNS TAGs.)
31
Interpreting HTTP bi-mesh structures
Most star structures are due to popular servers or active clients. We can explain more than 80% of the HTTP bi-meshes identified in one day:
- Server-correlation driven:
  - Server farms: Lycos, Yahoo, Google
  - Correlated service providers: CDNs (LLNW, Akamai, SAVVIS, Level3), advertising providers (DoubleClick, etc.)
- User-interest driven:
  - News: WashingtonPost, New York Times, Cnet
  - Media: ImageShack, casalemedia, tl4s2
  - Online shopping: Ebay, Costco, Walmart
  - Social network: Facebook, MySpace (redirection)
32
SIGMETRICS/Performance 2009
How are the dense subgraphs connected?
(A) Randomly connected stars
(B) Tree: client/server dual role
(C) Pool
(D) Correlated pool
33
Summary
- We introduce the notion of traffic activity graphs (TAGs)
- Different applications show different interaction patterns
- We propose a tNMF-based graph decomposition method to help understand the formation of application TAGs
Can we classify different application classes based on TAGs?
34
Traffic classification using Collective Traffic Statistics
35
Colored TAGs
Different applications are displayed as edges with different colors in TAGs.
(Figures: the original TAG, and the same TAG with Web and FileSharing traffic removed.)
36
Characterizing colored TAGs
- Clustering effect: edges with the same color tend to cluster together
- Attractive (A) / repulsive (R) effects between colors
The collective traffic statistics summarize the spatial distribution of application classes in TAGs.
37
Methodology
A two-step approach:
- Bootstrapping: initial edge classification based only on traffic statistics
- Graph calibration: edge color calibration using only colored neighborhood information in the initially labeled traffic graph
(Diagram: input is an unclassified TAG plus traffic statistics; bootstrapping produces an initially labeled TAG; graph calibration outputs the prediction on the traffic graph.)
38
Training the two-step model
- The two-step model can be integrated easily into the existing traffic classification system
- The bootstrapping step uses the traffic statistics associated with each edge to classify the edges initially; the available traffic statistics depend on the specific application
- The graph calibration step uses collective traffic statistics in the TAG to reinforce/correct the initial labels; the collective traffic statistics are encoded as histograms h_i and h_j at the endpoints of each edge e_ij
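To make the calibration step concrete, here is a deliberately naive sketch: relabel an edge from the histogram of initial labels on its neighboring edges (edges sharing an endpoint), using a simple majority vote. The deck's actual method feeds such histograms into a learned calibrator, so the voting rule here is an illustrative simplification:

```python
from collections import Counter

def calibrate_edge(edge, initial_labels, adjacency):
    """Relabel one edge from the label histogram of its neighborhood.
    initial_labels: dict edge -> class label from bootstrapping.
    adjacency: dict host -> set of incident edges."""
    u, v = edge
    hist = Counter()
    for e in adjacency[u] | adjacency[v]:
        if e != edge and e in initial_labels:
            hist[initial_labels[e]] += 1      # build the histogram
    if not hist:
        return initial_labels.get(edge)       # no neighbors: keep label
    return hist.most_common(1)[0][0]          # majority class
```

Even this crude rule captures the attractive effect (edges of one color clustering together); modeling the repulsive effects requires the learned calibrator rather than a plain majority vote.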
39
Evaluation on network-level traffic classification
- Packet header information (e.g., port number, TCP flags) is unavailable for bootstrapping, similar to the situation when classifying end-to-end IPSec traffic
- How accurately can we classify such traffic when the flow-level classification system achieves only 80% accuracy due to the lack of traffic features?
40
Evaluation on Accuracy Our graph-based calibration reduces the error rate by 50%! The classifier remains stable across time and geolocation.
41
Evaluation on per-class classification
- The two-step method improves the accuracy for all traffic classes
- The repulsive rules enable us to improve on the traffic classes with very poor initial labeling
42
Evaluation on Real-time Performance We can implement the two-step approach as a real-time system with little additional cost
43
Evaluation on Flow-Level Traffic Classification We have access to all packet header information How much will the collective traffic statistics improve the overall accuracy of our system?
44
Evaluation on Accuracy We achieve 15% reduction in errors within a month The F1 scores are improved for most application classes
45
Summary
- We introduced the concept of colored TAGs
- We proposed a two-step model that uses the spatial distribution of application classes in TAGs (collective traffic statistics) to improve classification accuracy
- The collective traffic statistics reduce errors by 50% for classification at the network layer, and graph calibration reduces errors by 15% for flow-level traffic classification
46
Trouble-shooting in Large DSL Networks (work in progress)
47
Motivation
The current solution for trouble-shooting in DSL networks is reactive and inefficient, potentially leading to customer churn.
48
Challenges
- Millions of users
- A large number of devices on each DSL line that cannot be controlled remotely
- Many possible locations where a line problem can occur
49
Methodology
Trouble Locator:
- Measure the line condition between the DSL server and the cable modem for each customer
- Use machine learning techniques to learn the correlation between different line problems and our line measurements
Ticket Predictor:
- Maintain periodic measurements for each customer (every Saturday night)
- Learn the correlation between the measurement history and potential line problems
50
Overview of the Proactive Solution Proactively resolve line problems before the customer complains
51
Planned Trial Evaluate our method in an operational DSL network
52
Time Table
- Design of the DSL network trouble-shooting system (Feb. 2010 – Apr. 2010)
- Implementation of the system and offline evaluation (May 2010 – Jul. 2010)
- Trial in an operational DSL network (Aug. 2010 – Oct. 2010)
- Thesis (Nov. 2010 – Jan. 2011)
53
Publications before 2010
- Aiyou Chen, Yu Jin, Jin Cao, Li (Erran) Li, "Tracking Long Duration Flows in Network Traffic," to appear in the 29th IEEE International Conference on Computer Communications (INFOCOM 2010, mini-conference) (acceptance ratio 24.3%).
- Yu Jin, Esam Sharafuddin, Zhi-Li Zhang, "Unveiling Core Network-Wide Communication Patterns through Application Traffic Activity Graph Decomposition," in Proc. of the 2009 ACM International Conference on Measurement and Modeling of Computer Systems (ACM SIGMETRICS 2009) (acceptance ratio 14.9%).
- Jin Cao, Yu Jin, Aiyou Chen, Tian Bu, Zhi-Li Zhang, "Identifying High Cardinality Internet Hosts," in Proc. of the 28th Conference on Computer Communications (IEEE INFOCOM 2009) (acceptance ratio 19.6%).
- Yu Jin, Esam Sharafuddin, Zhi-Li Zhang, "Identifying Dynamic IP Address Blocks Serendipitously through Background Scanning Traffic," in Proc. of the 3rd International Conference on emerging Networking EXperiments and Technologies (CoNEXT 2007), New York, NY, December 10, 2007 (acceptance ratio 19.5%).
- Yu Jin, Zhi-Li Zhang, Kuai Xu, Feng Cao, Sambit Sahu, "Identifying and Tracking Suspicious Activities through IP Gray Space Analysis," in Proc. of the 3rd Workshop on Mining Network Data (MineNet'07), San Diego, CA, June 12, 2007 (in conjunction with ACM SIGMETRICS'07).
- Yu Jin, Gyorgy Simon, Kuai Xu, Zhi-Li Zhang, Vipin Kumar, "Gray's Anatomy: Dissecting Scanning Activities Using IP Gray Space Analysis," in Proc. of the Second Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML07), Boston, MA, April 10, 2007 (in conjunction with USENIX NSDI'07).
54
Thanks! Questions?
55
Backup slides
56
Characteristics of application TAGs
These statistics show differences between various application TAGs, but they do not explain the formation of the TAGs.
57
TNMF algorithm related Iterative optimization algorithm Group density matrix derivation
58
Backup for traffic classification
59
Default flow features
60
Comparison of different algorithms for multi-class classification
61
Training time for different machine learning algorithms
62
Selection of flow size for partitioning
63
Boosting decision stumps
(Figure: example stumps from the boosted ensemble. t=1: does tcpflag contain S? s− = −1.066, s+ = −2.523. t=2: is dstport_low = 443? s− = −0.226, s+ = 2.139. … t=T: is byte ≥ 64.5? s− = 0.446, s+ = −0.202.)