1
Tackling Network Management Problems with Machine Learning Techniques
Preliminary Oral Exam
Yu Jin
Advisor: Professor Zhi-Li Zhang
2
Motivation
Effective solutions are required for a variety of problems in large networks:
- Traffic classification
- Anomaly detection
- Host/traffic profiling
- Trouble-shooting
Manual rule-based solutions are either costly or unavailable, and scalability is a major issue.
3
Our Solution
- Abstract these problems into machine learning problems
- Select advanced machine learning algorithms to solve them
- Address the scalability issue
- Deliver a system and evaluate it in a real operational environment
4
Outline
Past work – Traffic classification:
- Design and implementation of a modular machine learning architecture for flow-level traffic classification
- Analysis of traffic interaction patterns using traffic activity graphs (TAGs)
- Visualizing spatial traffic class distribution using colored TAGs
- Traffic classification using colored TAGs
On-going work:
- Customer ticket prediction and trouble-shooting in DSL networks
Summary and timetable
5
A Light-Weight Modular Machine Learning Approach to Large-Scale Network Traffic Classification
6
Motivation
Traffic classification is required for:
- Security monitoring
- Traffic policing and prioritization
- Predicting application trends
- Identifying new applications
It is also an interesting research topic spanning multiple areas: machine learning, traffic profiling, social network analysis, and system optimization.
Our goals: design an operational system for a large ISP network, and learn valuable lessons for solving other practical problems.
7
Challenges
- Scalability: training and operating on 10 Gbps links
- Accuracy: similar performance to the rule-based classifier
- Stability: remain accurate without human intervention
- Versatility: reconfigurable
(Figure: new flow arrival rate per minute, TCP vs. UDP, on a 1 Gbps link over one week, Mon–Sun; y-axis up to 300K flows/min.)
8
Current solution – Rule-based classifier
- Matches layer-4/layer-7 packet headers against manually defined rules
- Expensive in operation (flow/packet sampling is required)
- Expensive in deployment (special hardware is required)
- Inapplicable when packets are encrypted, e.g., end-to-end IPSec traffic
However, the rule-based classifier can provide a good set of “labeled” flow data.
9
A machine learning solution
(Diagram: raw traffic data is collected from several networks, labeled by the rule-based classifier, and used as training data.)
10
A modular machine learning architecture
- A modular architecture enables parallelization in both training and operation
- First-level modularization: pre-partition the data by flow features (flow size, IP protocol, etc.)
- Better accuracy, and higher scalability from parallelization
(Diagram: flow data is partitioned, then classified independently on each of the m partitions to produce per-partition predictions.)
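As a rough sketch of the first-level modularization, a flow could be routed to a sub-classifier by coarse features such as IP protocol and flow size. The deck does not give the exact partitioning rules, so the byte threshold and key names below are illustrative assumptions:

```python
def partition_flow(flow):
    """Route a flow record to a partition key by coarse features.
    Protocol numbers: 6 = TCP, 17 = UDP. The 1000-byte size
    threshold is a hypothetical example, not the deck's rule."""
    proto = "tcp" if flow["protocol"] == 6 else "udp"
    size = "small" if flow["bytes"] < 1000 else "large"
    return f"{proto}-{size}"
```

Each partition key then selects an independently trained (and independently parallelizable) classifier.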
11
Second-level modularization
- From a single k-class classifier to k 1-vs-all binary classifiers
- Accelerates training and operation
- Low memory consumption
- Parallelization
- No significant performance loss
(Diagram: within partition j, each of the k per-application binary classifiers scores the flow data; the final prediction is the class with the maximum posterior P(C_j | x).)
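The combination rule above can be sketched in a few lines: run all k binary classifiers and take the class with the highest calibrated posterior. The posterior functions here are placeholders standing in for the calibrated boosted-stump classifiers:

```python
def one_vs_all_predict(x, binary_posteriors):
    """binary_posteriors: one function per application class, each
    mapping a flow feature vector x to a calibrated P(C_j | x).
    Returns the index of the class with the maximum posterior."""
    scores = [p(x) for p in binary_posteriors]
    return max(range(len(scores)), key=lambda j: scores[j])

# Toy stand-ins for three hypothetical classes (Web, Email, P2P):
posteriors = [
    lambda x: 0.2,   # P(Web | x)
    lambda x: 0.7,   # P(Email | x)
    lambda x: 0.1,   # P(P2P | x)
]
```

Because each binary classifier is independent, the k evaluations can run in parallel, which is the point of this modularization.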
12
Training a binary classifier
Sampling is necessary due to the huge amount of training data (90 million TCP flows plus 85 million UDP flows). We use weighted threshold sampling for a more balanced and representative training set:

if count(C_j) <= θ: keep all the flows in C_j
else: sample with rate θ / count(C_j)
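A runnable version of the weighted threshold sampling rule above might look as follows (the dict-of-lists input format is an assumption for illustration):

```python
import random

def weighted_threshold_sample(flows_by_class, theta, seed=0):
    """Keep every flow of a class with <= theta flows; otherwise
    sample the class at rate theta / count, so every class
    contributes roughly theta flows to the training set."""
    rng = random.Random(seed)
    sample = []
    for flows in flows_by_class.values():
        if len(flows) <= theta:
            sample.extend(flows)                      # keep all
        else:
            rate = theta / len(flows)                 # downsample
            sample.extend(f for f in flows if rng.random() < rate)
    return sample
```

This is what makes the resulting training set non-IID relative to live traffic, which is why the calibration step described later is needed.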
13
Selection of binary classifier
Any classifier can be used as a component in our traffic classification architecture. We choose boosting decision stumps (1-level decision trees):
- Fast
- Accurate
- Simple
- Implicit L1 regularization
(Figure: TCP flow error rates for different binary classifiers. The non-linear classifier Boosting Trees (BTree) has the best performance; Boosting Stumps (BStump) is slightly better than L1-Maxent.)
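A decision stump is just a single threshold test on one feature. A minimal weighted stump fit, the weak-learner step inside boosting (an illustrative sketch, not the deck's implementation), could be:

```python
import numpy as np

def fit_stump(X, y, w):
    """Fit a 1-level decision tree on weighted data by exhaustive
    search. X: (n, d) features; y: labels in {-1, +1}; w: sample
    weights. Returns (feature index, threshold, polarity) with the
    smallest weighted error."""
    n, d = X.shape
    best = (0, 0.0, 1, float("inf"))
    for j in range(d):
        for t in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - t) > 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, polarity, err)
    return best[:3]
```

A booster like AdaBoost repeatedly calls such a fit with reweighted samples; the per-leaf scores s+/s- shown in the backup slides come from the boosted combination.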
14
Logistic calibration
- The IID assumption is violated by the weighted threshold sampling method
- Raw AdaBoost score outputs cannot be combined directly across the binary classifiers
- We therefore calibrate the binary classifiers:
- Address the difference in traffic distributions
- Convert scores f_c(x) to posterior probabilities P(C|x)
(Figure: reliability diagram for TCP Web.)
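One standard way to map scores to posteriors is to fit a two-parameter logistic function (Platt-style scaling) on a held-out sample. The deck does not specify its calibrator's exact form, so the gradient-descent fit below is a minimal stand-in under that assumption:

```python
import math

def platt_calibrate(scores, labels, lr=0.5, iters=5000):
    """Fit P(C|x) = sigmoid(a * f(x) + b) to (score, label) pairs by
    gradient descent on the log loss. labels are in {0, 1}."""
    a, b, n = 0.0, 0.0, len(scores)
    for _ in range(iters):
        ga = gb = 0.0
        for f, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * f + b)))
            ga += (p - y) * f        # d(log loss)/da
            gb += (p - y)            # d(log loss)/db
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

def posterior(score, a, b):
    """Calibrated posterior probability for a raw classifier score."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))
```

After calibration, the per-class posteriors live on a common probability scale, so taking the maximum across the k binary classifiers is meaningful.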
15
Training architecture for a binary classifier
- One binary classifier for each application class
- One calibrator is trained from the classification results on a small independent flow sample (simple random sample)
(Diagram: thresholded-sampled training data trains the binary classifier; a reliability diagram on the independent sample trains the calibrator; together they form the application classifier, with a training feedback loop.)
16
Performance evaluation – Accuracy
- Our classifier reproduces the classification results of the rule-based classifier with high accuracy
- Direct training of a multi-class classifier yields little accuracy gain, according to tests on small samples
17
Scalability
Scalability in training:
- Accuracy increases with the training data size and the number of iterations
- Less than 1.5 hours with 700K training samples and 640 iterations
Scalability in operation (with basic optimization):
- Recall that on 2 x 1 Gbps links, the new flow arrival rate is 450K/min
- We achieve 800K/min with a single thread
- Close to 7M/min with 10 threads, which can scale up to 10 Gbps links
18
Evaluation on stability
Temporal stability:
- After two months, the flow error rates are 3±0.5% for TCP traffic and 0.4±0.04% for UDP traffic
- After one year, the TCP flow error rate is 5.48±0.54% and the UDP flow error rate is 1.22±0.2%
Spatial stability:
- Train and test at two geolocations
19
Importance of the port number
- Using our architecture, we obtain an error rate of 4.1% for TCP and 0.72% for UDP with only port features (3.13% for TCP and 0.35% for UDP after adding other flow features)
- We use a port graph to visualize and understand the machine-learned port rules:
- TCP Multimedia uses port 554 and ports 5000–5180
- UDP Games uses port 88 (Xbox)
- UDP Chat uses ports 6661–6670
20
Summary
- We have designed a modular machine learning architecture for large-scale ISP/enterprise network traffic classification
- The system scales up to 10 Gbps links and remains accurate for one year on multiple sites without re-training
- We have evaluated the system on a large operational ISP network
What if the port number and other flow-level statistics are unavailable? For example, when classifying end-to-end IPSec traffic.
21
IPSec traffic
- Only limited traffic statistics are available for IPSec traffic: number of packets, number of bytes, average packet inter-arrival time, and average packet size in both directions; no port number, protocol, payload, etc. (diagram: an IPSec packet consists of the IPSec header followed by the encrypted inner packet)
- Only 80% accuracy using the proposed machine learning architecture
Our solution: classification on traffic activity graphs
22
Visualizing and Inferring Network Applications using Traffic Activity Graphs (TAGs)
23
Traffic Activity Graphs
- Nodes represent the hosts in the network
- Edges represent the interactions between these hosts
- Defined on a set of flows collected over a time period T
- Help us study the communication patterns of different applications
(Diagram: flows between UMN and the Internet over period T yield per-application graphs, e.g., G_HTTP from ports 80/443, G_Email from ports 25/993, G_Gnutella from ports 6346/6348.)
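Constructing a per-application TAG from a flow trace is straightforward; the sketch below uses simplified (src, dst, dst_port) flow tuples, an assumption standing in for real flow records:

```python
def build_tag(flows, app_ports):
    """Build a traffic activity graph (TAG) for one application.
    Nodes are hosts; an edge connects a host pair that exchanged
    at least one flow on the application's ports during the
    observation window. flows: iterable of (src, dst, dst_port)."""
    edges = set()
    for src, dst, dport in flows:
        if dport in app_ports:
            edges.add((src, dst))
    nodes = {h for e in edges for h in e}
    return nodes, edges
```

For example, G_HTTP would be built with app_ports = {80, 443}, G_Email with {25, 993}, per the slide.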
24
Application traffic activity graphs (TAGs) and their evolution
(Figures: evolution of the HTTP, DNS, AOL IM, and Email TAGs, each growing from 1K to 3K.)
25
Properties of TAGs
- We observe differences in basic statistics, such as graph density and average in/out degree
- All TAGs contain a giant connected component (GCC), which accounts for more than 85% of all the edges
26
Understanding the interaction patterns by decomposing TAGs
Block structures in the adjacency matrices indicate dense subgraphs in TAGs.
(Figure: 0/1 adjacency matrices for the HTTP, Email, AOL IM, BitTorrent, and DNS TAGs.)
27
TAG decomposition using tri-nonnegative matrix factorization
Extracting dense subgraphs can be formulated as a co-clustering problem: cluster hosts into inside host groups and outside host groups, then extract pairs of groups with more edges connecting them (higher density). This co-clustering problem can be solved by the tri-nonnegative matrix factorization (tNMF) algorithm, which minimizes the Frobenius-norm error ‖A − RHC‖²_F subject to nonnegativity constraints on R, H, and C.
28
Tri-nonnegative matrix factorization
A ≈ R × H × C, where:
- A is the adjacency matrix associated with the TAG
- R is the row group membership indicator matrix
- C is the column group membership indicator matrix
- H is proportional to the subgraph density matrix
We identify dense subgraphs based on the large entries in H. R is m-by-k and C is r-by-n; hence the product is a low-rank approximation of A, with rank at most min(k, r).
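Once hard group memberships are fixed, the role of H is easy to see: its entries are (proportional to) the edge densities of the blocks of A induced by the row and column groups. The helper below computes that density matrix directly; it illustrates what tNMF recovers rather than the iterative tNMF algorithm itself (which is covered in the backup slides):

```python
import numpy as np

def block_density(A, row_groups, col_groups):
    """Density of each (row group, col group) block of adjacency
    matrix A, given hard co-cluster assignments as integer arrays.
    Large entries correspond to dense subgraphs, the role H plays
    in A ~= R H C."""
    k = row_groups.max() + 1
    r = col_groups.max() + 1
    H = np.zeros((k, r))
    for i in range(k):
        for j in range(r):
            block = A[np.ix_(row_groups == i, col_groups == j)]
            H[i, j] = block.mean() if block.size else 0.0
    return H
```

Thresholding the large entries of H then yields the dense subgraphs (the in-stars, out-stars, and bi-meshes discussed next).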
29
Subgraph prototypes
Recall that inside (UMN) hosts are (likely) service requesters and outside hosts are service providers. Based on the number of inside/outside hosts in each subgraph, we propose three prototypes:
- In-star: one inside client accesses multiple outside servers
- Out-star: multiple inside clients access one outside server
- Bi-mesh: multiple inside clients interact with many outside servers
30
Characterizing TAGs with subgraph prototypes
- Different application TAGs contain different types of subgraphs
- We can distinguish and characterize applications based on their subgraph components
What do these subgraphs mean?
(Figure: subgraph-prototype breakdowns for the HTTP, Email, AOL IM, BitTorrent, and DNS TAGs.)
31
Interpreting HTTP bi-mesh structures
Most star structures are due to popular servers or active clients. We can explain more than 80% of the HTTP bi-meshes identified in one day:
- Server-correlation driven:
  - Server farms: Lycos, Yahoo, Google
  - Correlated service providers: CDNs (LLNW, Akamai, SAVVIS, Level3), advertising providers (DoubleClick, etc.)
- User-interest driven:
  - News: WashingtonPost, New York Times, Cnet
  - Media: ImageShack, casalemedia, tl4s2
  - Online shopping: Ebay, Costco, Walmart
  - Social network: Facebook, MySpace (redirection)
32
SIGMETRICS/Performance 2009
How are the dense subgraphs connected?
(A) Randomly connected stars
(B) Tree: client/server dual role
(C) Pool
(D) Correlated pool
33
Summary
- We introduce the notion of traffic activity graphs (TAGs)
- Different applications show different interaction patterns
- We propose a tNMF-based graph decomposition method to help understand the formation of application TAGs
Can we classify different application classes based on TAGs?
34
Traffic classification using Collective Traffic Statistics
35
Colored TAGs
Different applications are displayed as edges with different colors in TAGs.
(Figures: the original TAG, and the same TAG with Web and FileSharing traffic removed.)
36
Characterizing colored TAGs
- Clustering effect: edges with the same color tend to cluster together
- Attractive (A) / repulsive (R) effects between colors
The collective traffic statistics summarize the spatial distribution of application classes in TAGs.
37
Methodology
A two-step approach:
- Bootstrapping: initial edge classification based only on traffic statistics
- Graph calibration: edge color calibration using only colored neighborhood information in the initially labeled traffic graph
(Diagram: input is an unclassified TAG plus traffic statistics; bootstrapping produces an initially labeled TAG; graph calibration outputs the prediction on the traffic graph.)
38
Training the two-step model
- The two-step model can be integrated easily into the existing traffic classification system
- The bootstrapping step uses the traffic statistics associated with each edge to classify the edges initially; the available traffic statistics depend on the specific application
- The graph calibration step uses collective traffic statistics in the TAG to reinforce/correct the initial labels; the collective traffic statistics are encoded as histograms h_i and h_j at the endpoints of each edge e_ij
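To make the calibration step concrete, here is a deliberately naive sketch: relabel an edge from the histogram of initial labels on its neighboring edges (edges sharing an endpoint), using a simple majority vote. The deck's actual method feeds such histograms into a learned calibrator, so the voting rule here is an illustrative simplification:

```python
from collections import Counter

def calibrate_edge(edge, initial_labels, adjacency):
    """Relabel one edge from the label histogram of its neighborhood.
    initial_labels: dict edge -> class label from bootstrapping.
    adjacency: dict host -> set of incident edges."""
    u, v = edge
    hist = Counter()
    for e in adjacency[u] | adjacency[v]:
        if e != edge and e in initial_labels:
            hist[initial_labels[e]] += 1      # build the histogram
    if not hist:
        return initial_labels.get(edge)       # no neighbors: keep label
    return hist.most_common(1)[0][0]          # majority class
```

Even this crude rule captures the attractive effect (edges of one color clustering together); modeling the repulsive effects requires the learned calibrator rather than a plain majority vote.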
39
Evaluation on network-level traffic classification
- Packet header information (e.g., port number, TCP flags) is unavailable for bootstrapping, similar to the situation when classifying end-to-end IPSec traffic
- How accurately can we classify such traffic when the flow-level classification system achieves only 80% accuracy due to the lack of traffic features?
40
Evaluation on Accuracy Our graph-based calibration reduces the error rate by 50%! The classifier remains stable across time and geolocation.
41
Evaluation on per-class classification
- The two-step method improves the accuracy for all traffic classes
- The repulsive rules enable us to improve on the traffic classes with very poor initial labeling
42
Evaluation on Real-time Performance We can implement the two-step approach as a real-time system with little additional cost
43
Evaluation on Flow-Level Traffic Classification We have access to all packet header information How much will the collective traffic statistics improve the overall accuracy of our system?
44
Evaluation on Accuracy We achieve 15% reduction in errors within a month The F1 scores are improved for most application classes
45
Summary
- We introduced the concept of colored TAGs
- We proposed a two-step model that uses the spatial distribution of application classes in TAGs (collective traffic statistics) to improve classification accuracy
- The collective traffic statistics reduce errors by 50% for classification at the network layer, and graph calibration reduces errors by 15% for flow-level traffic classification
46
Trouble-shooting in Large DSL Networks (work in progress)
47
Motivation
The current solution for trouble-shooting in DSL networks is reactive and inefficient, potentially leading to customer churn.
48
Challenges
- Millions of users
- A large number of devices on each DSL line that cannot be controlled remotely
- Many possible locations where a line problem can occur
49
Methodology
Trouble Locator:
- Measure the line condition between the DSL server and the cable modem for each customer
- Use machine learning techniques to learn the correlation between different line problems and our line measurements
Ticket Predictor:
- Maintain periodic measurements for each customer (every Saturday night)
- Learn the correlation between the measurement history and potential line problems
50
Overview of the Proactive Solution Proactively resolve line problems before the customer complains
51
Planned Trial Evaluate our method in an operational DSL network
52
Time Table
- Design of the DSL network trouble-shooting system (Feb. 2010 – Apr. 2010)
- Implementation of the system and offline evaluation (May 2010 – Jul. 2010)
- Trial in an operational DSL network (Aug. 2010 – Oct. 2010)
- Thesis (Nov. 2010 – Jan. 2011)
53
Publications before 2010
- Aiyou Chen, Yu Jin, Jin Cao, Li (Erran) Li, "Tracking Long Duration Flows in Network Traffic," to appear in the 29th IEEE International Conference on Computer Communications (INFOCOM 2010, mini-conference) (acceptance ratio 24.3%).
- Yu Jin, Esam Sharafuddin, Zhi-Li Zhang, "Unveiling Core Network-Wide Communication Patterns through Application Traffic Activity Graph Decomposition," in Proc. of the 2009 ACM International Conference on Measurement and Modeling of Computer Systems (ACM SIGMETRICS 2009) (acceptance ratio 14.9%).
- Jin Cao, Yu Jin, Aiyou Chen, Tian Bu, Zhi-Li Zhang, "Identifying High Cardinality Internet Hosts," in Proc. of the 28th Conference on Computer Communications (IEEE INFOCOM 2009) (acceptance ratio 19.6%).
- Yu Jin, Esam Sharafuddin, Zhi-Li Zhang, "Identifying Dynamic IP Address Blocks Serendipitously through Background Scanning Traffic," in Proc. of the 3rd International Conference on emerging Networking EXperiments and Technologies (CoNEXT 2007), New York, NY, December 10, 2007 (acceptance ratio 19.5%).
- Yu Jin, Zhi-Li Zhang, Kuai Xu, Feng Cao, Sambit Sahu, "Identifying and Tracking Suspicious Activities through IP Gray Space Analysis," in Proc. of the 3rd Workshop on Mining Network Data (MineNet'07), San Diego, CA, June 12, 2007 (in conjunction with ACM SIGMETRICS'07).
- Yu Jin, Gyorgy Simon, Kuai Xu, Zhi-Li Zhang, Vipin Kumar, "Gray's Anatomy: Dissecting Scanning Activities Using IP Gray Space Analysis," in Proc. of the Second Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML07), Boston, MA, April 10, 2007 (in conjunction with USENIX NSDI'07).
54
Thanks! Questions?
55
Backup slides
56
Characteristics of application TAGs
These statistics show differences between various application TAGs, but they do not explain the formation of the TAGs.
57
TNMF algorithm related Iterative optimization algorithm Group density matrix derivation
58
Backup for traffic classification
59
Default flow features
60
Comparison of different algorithms for multi-class classification
61
Training time for different machine learning algorithms
62
Selection of flow size for partitioning
63
Boosting decision stumps
(Figure: example stumps from the boosted ensemble. t=1: does tcpflag contain S? s− = −1.066, s+ = −2.523. t=2: is dstport_low = 443? s− = −0.226, s+ = 2.139. … t=T: is byte ≥ 64.5? s− = 0.446, s+ = −0.202.)