Published by Gary Woods. Modified over 9 years ago.
1
Centre de Comunicacions Avançades de Banda Ampla (CCABA)
Universitat Politècnica de Catalunya (UPC)
Identification of Network Applications based on Machine Learning Techniques
Terena Networking Conference 2008
Valentín Carela-Español, Pere Barlet-Ros, Josep Solé-Pareta
{vcarela, pbarlet, pareta}@ac.upc.edu
2
Outline
Motivations and objectives
Scenario and requirements
Existing solutions
  Well-known ports
  Payload based (pattern matching)
  Machine Learning
    –Supervised
    –Unsupervised
Proposed method
Summary and conclusions
3
Motivations and objectives
The typical method based on well-known ports is no longer valid for identifying applications
Network administration and management tasks
  Network dimensioning, capacity planning, network performance evaluation, …
QoS monitoring
  Class-of-Service mapping
  Quality-of-Service policies
  Possible basis for QoS pricing
5
Scenario: SMARTxAC
SMARTxAC: Traffic Monitoring and Analysis System for the Anella Científica
  Operational since July 2003
  Developed under a CESCA-UPC collaboration agreement
  Tailor-made traffic monitoring system for the Anella Científica
Main objectives
  Low-cost platform
  Continuous monitoring of high-speed links without packet loss
  Detection of network anomalies and irregular usage
  Multi-user system: network operators and institutions
Measurement of two full-duplex GigE links
  Connection between Anella Científica and RedIRIS
6
Measurement scenario
7
Requirements
Real-time classification
Independent of packet contents
High-speed links
  Without packet loss
High accuracy
Method implemented in SMARTxAC
9
Well-known ports
Characteristics
  Use of well-known ports from IANA
  Packet inspection is not needed
  Computationally lightweight
Limitations (especially due to new P2P applications)
  Dynamic ports
  HTTP requests
  New applications do not register their ports with IANA
Consequence: very low accuracy
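A minimal sketch of port-based identification, assuming a small hand-picked subset of the IANA well-known ports. The `classify_by_port` helper and its table are illustrative only and are not part of SMARTxAC.

```python
# Illustrative subset of IANA well-known ports (not the full registry).
WELL_KNOWN_PORTS = {
    20: "ftp-data", 21: "ftp", 22: "ssh", 25: "smtp",
    53: "dns", 80: "http", 110: "pop3", 143: "imap", 443: "https",
}

def classify_by_port(src_port: int, dst_port: int) -> str:
    """Label a flow by whichever endpoint uses a known port (hypothetical helper)."""
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "unknown"

print(classify_by_port(51234, 80))    # destination port is registered
print(classify_by_port(51234, 6881))  # neither port is in the table
```

The lookup is a single dictionary access per flow, which is why the method is lightweight. It also shows the limitation: an application that picks a dynamic port falls through to "unknown", and one that runs over port 80 is labelled as web traffic regardless of what it carries.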
10
Well-known ports example
11
Payload based
Characteristics
  Try to find characteristic signatures in packet/flow payloads
  Very high accuracy
Limitations
  Packet contents are required
  Computationally expensive
  Difficult to keep up to date
  Connection encryption
  Privacy legislation
Consequence: not a feasible solution in our scenario
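A rough sketch of signature matching on payloads, assuming three simplified, hand-written patterns. The `SIGNATURES` table and `classify_by_payload` are hypothetical; real rule sets such as those shipped with L7-filter are much larger and more careful.

```python
import re

# Hypothetical, simplified signatures matched against the first payload bytes.
SIGNATURES = {
    "http": re.compile(rb"^(GET|POST|HEAD) \S+ HTTP/1\.[01]"),
    "ssh": re.compile(rb"^SSH-\d\.\d"),
    "bittorrent": re.compile(rb"^\x13BitTorrent protocol"),
}

def classify_by_payload(payload: bytes) -> str:
    """Return the first application whose signature matches the payload."""
    for app, pattern in SIGNATURES.items():
        if pattern.search(payload):
            return app
    return "unknown"

print(classify_by_payload(b"GET /index.html HTTP/1.1\r\nHost: example\r\n"))
print(classify_by_payload(b"\x17\x03\x03\x00\x20\x8a\x11"))  # opaque bytes, as with encryption
```

Every flow has to be checked against every pattern, and an encrypted payload matches none of them, which corresponds to the cost and encryption limitations listed on the slide.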
12
Machine Learning
Subfield of Artificial Intelligence
Process that allows computers to extract knowledge (to learn) from examples (training set)
Characteristics
  Packet contents are not required
  High accuracy
  Respects privacy legislation
  Computationally viable
Limitations
  Difficult training phase
  Needs to be retrained
13
Supervised learning
Classification techniques create knowledge structures that classify new instances into pre-defined classes.
The knowledge learnt can be presented as:
  Decision tree
  Flowchart
  Classification rules
Training dataset:
  Object: represented as a vector of features
  Class: value to be predicted (label obtained “manually”)
14
Unsupervised learning
Clustering methods find the best partition based on similarities among the examples
Labels are not available for the training phase
Clustering methods:
  K-Means algorithm
  Incremental algorithm
  Probability-based
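As an illustration of clustering without labels, here is a small K-Means sketch in plain Python. The flow values, the choice of two features, and the seeding from the first `k` points are all assumptions made for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of floats; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids

# Hypothetical flows described by (mean packet size in bytes, duration in seconds).
flows = [(60.0, 0.2), (1400.0, 30.0), (64.0, 0.3), (1350.0, 28.0), (70.0, 0.1)]
labels, centers = kmeans(flows, k=2)
print(labels)
print(centers)
```

The algorithm only groups similar flows; deciding which application each cluster corresponds to is a separate step. The features here are left unscaled, so packet size dominates the distance, which a real setup would probably want to normalise.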
15
Supervised vs Unsupervised learning
Supervised methods:
  Need a complete pre-labeled dataset
  Better accuracy for predefined classes
  No detection of new classes
  Hard to detect when retraining is needed
Unsupervised methods:
  Do not need completely labeled instances
  Automatic detection of new classes
  Better accuracy for new classes
16
Feature selection
Methods to detect irrelevant or redundant features
  Improve the accuracy
  Reduce the computational load
Wrapper methods
  Evaluate the performance of different feature subsets using the ML algorithm chosen for the learning phase
  Depend on the ML algorithm
  e.g. Best-First search over feature subsets scored by the classifier
Filter methods
  Make an independent assessment based on general characteristics of the data
  Independent of the ML algorithm
  e.g. Correlation-based Feature Selection (CFS)
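The idea of a filter method can be sketched with a simple redundancy check: a feature is dropped when it is almost perfectly correlated with one already kept. This is a much cruder criterion than CFS and is shown only to illustrate that no learning algorithm is involved; the column names, values, and the 0.95 threshold are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if it is not strongly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical per-flow feature columns.
columns = {
    "packets": [10, 200, 15, 180],
    "bytes": [1000, 20000, 1500, 18000],  # proportional to packets, so redundant
    "duration": [5.0, 1.0, 0.5, 9.0],
}
print(drop_redundant(columns))
```

A wrapper method would instead train and evaluate the classifier on each candidate subset, which tends to be more expensive but is tuned to the chosen algorithm.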
17
Outline Motivations and objectives Scenario and requirements Existing solutions Well-known ports Payload based (pattern matching) Machine Learning –Supervised –Unsupervised Proposed method Summary and conclusions
18
Proposed method
Supervised identification based on the C4.5 algorithm
  Developed by Ross Quinlan as an extension of ID3
  Based on the construction of a classification tree
  Feature selection based on maximizing the information gain
Training set
  Actual traffic flows
  Pairs (feature vector, application)
  The feature vector contains relevant characteristics of traffic flows
  The application of each flow is identified “manually”
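The information gain criterion mentioned above can be written down in a few lines. The sketch below handles discrete feature values only; the feature names and labels are made up. C4.5 is usually described as normalising the gain by the split information (the gain ratio), so that variant is included as well.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain divided by the entropy of the split itself."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info else 0.0

# Hypothetical training flows: one informative and one useless feature.
labels = ["web", "web", "p2p", "p2p"]
push_flag = ["yes", "yes", "no", "no"]
protocol = ["tcp", "tcp", "tcp", "tcp"]
print(information_gain(push_flag, labels), gain_ratio(push_flag, labels))
print(information_gain(protocol, labels), gain_ratio(protocol, labels))
```

A feature that separates the classes perfectly removes all the entropy, while a constant feature removes none, so the tree builder would pick the first one for the split.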
19
Features
Requirements
  Real-time extraction
  Independence from packet contents
Feature examples (total: 25)
  Packets and bytes per flow
  Flow duration
  min/avg/max packet size
  min/avg/max TCP window size
  min/avg/max packet interarrival time
  Packets with flags PUSH, URG, DF, … set
  Average increase of IPID
  OS estimation (source and destination)
  Also ports and protocols (but not in the traditional way)
  …
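A few of the listed features can be computed from packet headers alone, as the following sketch suggests. The `Packet` record and `extract_features` function are hypothetical and cover only a subset of the 25 features; they are not the SMARTxAC implementation.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Packet:          # hypothetical header-only record
    timestamp: float   # seconds
    size: int          # bytes
    push: bool         # TCP PSH flag set

def extract_features(packets):
    """Compute a few per-flow features from headers only (no payload)."""
    packets = sorted(packets, key=lambda p: p.timestamp)
    sizes = [p.size for p in packets]
    gaps = [b.timestamp - a.timestamp for a, b in zip(packets, packets[1:])]
    return {
        "packets": len(packets),
        "bytes": sum(sizes),
        "duration": packets[-1].timestamp - packets[0].timestamp,
        "min_size": min(sizes),
        "avg_size": mean(sizes),
        "max_size": max(sizes),
        "avg_interarrival": mean(gaps) if gaps else 0.0,
        "push_packets": sum(p.push for p in packets),
    }

flow = [Packet(0.0, 60, False), Packet(0.5, 1500, True), Packet(1.5, 40, False)]
features = extract_features(flow)
print(features)
```

Each value can also be maintained incrementally (running count, sum, minimum, maximum) as packets arrive, which is one way to meet the real-time extraction requirement.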
20
Training Phase (I)
Collection of training traffic
  Representative of the environment to be monitored
Flow aggregation (at transport level)
Feature extraction
Manual classification of training flows
  Offline analysis of packet contents
  Using pattern matching algorithms (e.g. L7-filter)
  Manual inspection of the remaining flows
Alternative
  Generate artificial traffic in a controlled environment
  Manual identification is not required
  Solves encryption and privacy issues
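Flow aggregation at transport level is commonly done by grouping packets on the 5-tuple of addresses, ports, and protocol. The sketch below assumes that definition and unidirectional flows; the dictionary field names are invented for the example.

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packet records into flows keyed by the transport-level 5-tuple."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"], pkt["proto"])
        flows[key].append(pkt)
    return dict(flows)

# Hypothetical packet records (header fields only).
packets = [
    {"src": "192.0.2.1", "sport": 51234, "dst": "192.0.2.9", "dport": 80, "proto": "tcp", "size": 60},
    {"src": "192.0.2.1", "sport": 51234, "dst": "192.0.2.9", "dport": 80, "proto": "tcp", "size": 1500},
    {"src": "192.0.2.3", "sport": 40000, "dst": "192.0.2.9", "dport": 53, "proto": "udp", "size": 70},
]
flows = aggregate_flows(packets)
print(len(flows))
```

Each resulting flow is then turned into one feature vector and, during training, paired with its manually assigned application label.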
21
Training Phase (II)
Construction of the classification tree
  C4.5 algorithm
  Input: classified training flows
  Output: classification tree (contains flow features only)
Software employed: Weka
  University of Waikato (New Zealand)
  GNU GPL license
  Written in Java
  http://www.cs.waikato.ac.nz/ml/weka
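To give a feel for what the tree construction does, here is a deliberately simplified builder that chooses numeric thresholds by information gain. It is not C4.5 and not Weka's implementation: there is no gain ratio, pruning, or missing-value handling, and the training flows and labels are made up.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, feature, threshold) for the best binary split, or None."""
    base, best = entropy(labels), None
    for feature in rows[0]:
        values = sorted({r[feature] for r in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if best is None or base - weighted > best[0]:
                best = (base - weighted, feature, threshold)
    return best

def build_tree(rows, labels):
    """Recursively build a tree of nested dicts; leaves are class names."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels)
    if split is None or split[0] <= 0:
        return majority
    _, feature, threshold = split
    left = [i for i, r in enumerate(rows) if r[feature] <= threshold]
    right = [i for i, r in enumerate(rows) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left]),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right]),
    }

# Hypothetical labelled training flows.
rows = [
    {"avg_size": 80, "duration": 0.5},
    {"avg_size": 90, "duration": 40.0},
    {"avg_size": 1200, "duration": 35.0},
    {"avg_size": 1300, "duration": 50.0},
]
labels = ["dns", "dns", "p2p", "p2p"]
tree = build_tree(rows, labels)
print(tree)
```

The output contains only feature names, thresholds, and class labels, which matches the slide's point that the resulting tree refers to flow features only.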
22
Deployment
Implementation in SMARTxAC
  Flow aggregation
  Real-time feature extraction (requirement)
  Classification of each flow using the classification tree
  Computationally lightweight and applicable in real time
There is no need to:
  Analyze packet contents
  Rely solely on port numbers
  Apply pattern search algorithms
  Manually inspect packets
But it is required to:
  Retrain the system occasionally
    –New applications
    –Changes to existing ones
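Classifying a flow with a trained tree amounts to a short walk from the root to a leaf. The sketch below assumes the tree is stored as nested dictionaries; the `TREE` contents and the `classify` function are hypothetical and do not reflect the tree actually deployed in SMARTxAC.

```python
def classify(tree, features):
    """Walk a nested-dict decision tree until a leaf (a class name) is reached."""
    node = tree
    while isinstance(node, dict):
        branch = "left" if features[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node

# Hypothetical tree, as might come out of an offline training phase.
TREE = {
    "feature": "avg_size", "threshold": 645.0,
    "left": {"feature": "duration", "threshold": 10.0, "left": "dns", "right": "ssh"},
    "right": "p2p",
}

print(classify(TREE, {"avg_size": 80, "duration": 0.5}))
print(classify(TREE, {"avg_size": 1300, "duration": 1.0}))
```

The cost per flow is one comparison per tree level, independent of payload size, which is consistent with the claim that classification is lightweight enough for real-time use.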
23
Accuracy
24
Application breakdown
[Figure panels: Port-based, Machine learning]
25
Application breakdown time series
[Figure panels: Port-based, Machine learning]
27
Summary
1) Collection of the training set
  Flows representative of the environment to be monitored
  Alternatively, artificially generated
2) Feature extraction from the training flows
3) Manual flow classification → application class
  Pattern matching and manual inspection
  Can be simplified if an artificial training set is used in 1)
4) Construction of a C4.5 classification tree
  E.g. using Weka
5) Deployment of the tree obtained in 4) in the monitoring system
6) Retraining of the system
  Starting from phase 1)
28
Conclusions
Traditional method based on well-known ports
  Low accuracy due to dynamic ports
Identification based on pattern matching
  Not feasible on high-speed links due to computational cost
  Depends on packet content
  Does not work with encryption
Identification based on machine learning
  Feasible on high-speed links
  Does not require packet content
  Experimental results show accuracy > 95%
  Requires occasional retraining
Future work
  Retrain the method with the new scenario
  Make the training phase as automatic as possible
29
Thank you for your attention Questions?