Centre de Comunicacions Avançades de Banda Ampla (CCABA), Universitat Politècnica de Catalunya (UPC): Identification of Network Applications based on Machine Learning Techniques

Presentation transcript:

Identification of Network Applications based on Machine Learning Techniques
Valentín Carela-Español, Pere Barlet-Ros, Josep Solé-Pareta
Centre de Comunicacions Avançades de Banda Ampla (CCABA), Universitat Politècnica de Catalunya (UPC)
Terena Networking Conference 2008
{vcarela, pbarlet,

Outline
- Motivations and objectives
- Scenario and requirements
- Existing solutions
  - Well-known ports
  - Payload based (pattern matching)
  - Machine Learning
    - Supervised
    - Unsupervised
- Proposed method
- Summary and conclusions

Motivations and objectives
- The typical method based on well-known ports is no longer valid for identifying applications
- Network administration and management tasks
  - Network dimensioning, capacity planning, network performance evaluation, …
- QoS monitoring
  - Class-of-Service mapping
  - Quality-of-Service policies
  - Possible basis for QoS pricing

Scenario: SMARTxAC
- SMARTxAC: Traffic Monitoring and Analysis System for the Anella Científica
  - Operational since July 2003
  - Developed under a CESCA-UPC collaboration agreement
  - Tailor-made traffic monitoring system for the Anella Científica
- Main objectives
  - Low-cost platform
  - Continuous monitoring of high-speed links without packet loss
  - Detection of network anomalies and irregular usage
  - Multi-user system: network operators and institutions
- Measurement of two full-duplex GigE links
  - Connection between Anella Científica and RedIRIS

Measurement scenario

Requirements
- Real-time classification
- Independent of packet contents
- High-speed links
- Without packet loss
- High accuracy
- Method implemented in SMARTxAC

Well-known ports
- Characteristics
  - Use of well-known ports from IANA
  - Packet inspection is not needed
  - Computationally lightweight
- Limitations (especially due to new P2P applications)
  - Dynamic ports
  - HTTP requests
  - New applications do not register their ports with IANA
- Consequence: very low accuracy
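
The port lookup described on this slide can be sketched in a few lines. This is only an illustration: the table below is a tiny hand-picked excerpt (an assumption, not the full IANA registry), and the function name `classify_by_port` is hypothetical.

```python
# Tiny, hand-picked excerpt of well-known ports, keyed by
# (IP protocol number, port). Not the full IANA registry.
WELL_KNOWN_PORTS = {
    (6, 25): "smtp",
    (6, 80): "http",
    (6, 443): "https",
    (17, 53): "dns",
}


def classify_by_port(protocol, src_port, dst_port):
    """Return the application registered for either port, or 'unknown'."""
    for port in (dst_port, src_port):
        app = WELL_KNOWN_PORTS.get((protocol, port))
        if app is not None:
            return app
    return "unknown"


print(classify_by_port(6, 51234, 80))     # registered port
print(classify_by_port(6, 51234, 46211))  # dynamic ports on both sides
```

The second call shows the limitation listed above: a flow between two dynamic ports carries no usable hint. The opposite error is also possible, since any application that chooses to run over port 80 would be reported as "http".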

Well-known ports example

Payload based
- Characteristics
  - Tries to find characteristic signatures in packet/flow payloads
  - Very high accuracy
- Limitations
  - Packet contents are required
  - Computationally expensive
  - Difficult to keep up to date
  - Connection encryption
  - Privacy legislation
- Consequence: not a feasible solution in our scenario
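
A minimal sketch of signature matching on the first payload bytes of a flow, to make the idea concrete. The three patterns are illustrative and are not the rule set used in the talk; `SIGNATURES` and `classify_by_payload` are hypothetical names.

```python
import re

# Illustrative signatures only; production rule sets are far larger.
SIGNATURES = [
    ("http", re.compile(rb"^(GET|POST|HEAD) \S+ HTTP/1\.[01]")),
    ("bittorrent", re.compile(rb"^\x13BitTorrent protocol")),
    ("ssh", re.compile(rb"^SSH-\d\.\d+-")),
]


def classify_by_payload(payload: bytes) -> str:
    """Return the first application whose signature matches the payload."""
    for app, pattern in SIGNATURES:
        if pattern.search(payload):
            return app
    return "unknown"


print(classify_by_payload(b"GET /index.html HTTP/1.1\r\nHost: example\r\n"))
print(classify_by_payload(b"\x13BitTorrent protocol" + b"\x00" * 8))
# An encrypted record exposes no recognizable pattern.
print(classify_by_payload(b"\x17\x03\x03\x00\x20" + b"\xaa" * 32))
```

Every flow has to be tested against every signature, which is where the computational cost comes from, and the last call shows why encryption defeats the approach.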

Machine Learning
- Subfield of Artificial Intelligence
- Process that allows computers to extract knowledge (to learn) from examples (training set)
- Characteristics
  - Packet contents are not required
  - High accuracy
  - Respects privacy legislation
  - Computationally viable
- Limitations
  - Difficult training phase
  - Needs to be retrained

Supervised learning
- Classification techniques create knowledge structures that classify new instances into pre-defined classes
- The knowledge learnt can be presented as:
  - Decision tree
  - Flowchart
  - Classification rules
- Training dataset:
  - Object: represented as a vector of features
  - Class: value to be predicted (label obtained "manually")

Unsupervised learning
- Clustering methods find the best partition based on similarities among the examples
- Labels are not available for the training phase
- Clustering methods:
  - K-Means algorithm
  - Incremental algorithm
  - Probability-based
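
As an illustration of the K-Means algorithm named above, here is a small pure-Python sketch. The flow data (average packet size in bytes, duration in seconds) is invented, and seeding the centroids with the first k points is an assumption made to keep the run deterministic.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    # Assumption: seed with the first k points so the run is deterministic.
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, labels


# Invented flows: (average packet size in bytes, duration in seconds).
flows = [(60, 0.1), (1400, 5.0), (64, 0.2), (1450, 6.0), (70, 0.15), (1500, 5.5)]
centroids, labels = kmeans(flows, k=2)
print(labels)
print(centroids)
```

No labels are used anywhere: the two groups emerge from similarity alone, and mapping a cluster to an application name would still need some external knowledge. The features are left unscaled here, so packet size dominates the distance; a real system would likely normalize them first.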

Supervised vs Unsupervised learning
- Supervised methods:
  - Need a complete pre-labeled dataset
  - Better accuracy for predefined classes
  - No detection of new classes
  - Difficult to detect when retraining is needed
- Unsupervised methods:
  - Do not need completely labeled instances
  - Automatic detection of new classes
  - Better accuracy for new classes

Feature selection
- Methods to detect irrelevant or redundant features
  - Improve accuracy
  - Reduce the computational load
- Wrapper methods
  - Evaluate the performance of different feature subsets using the ML algorithm itself during the learning phase
  - Depend on the ML algorithm
  - e.g. wrapper subset evaluation guided by a Best-First search
- Filter methods
  - Make an independent assessment based on general characteristics of the data
  - Independent of the ML algorithm
  - e.g. Correlation-based Feature Selection (CFS)

Proposed method
- Supervised identification based on the C4.5 algorithm
  - Developed by Ross Quinlan as an extension of ID3
  - Based on the construction of a classification tree
  - Split features chosen by maximizing the information gain (C4.5 normalizes it as the gain ratio)
- Training set
  - Actual traffic flows
  - Pairs (feature vector, application)
  - Feature vector contains relevant characteristics of traffic flows
  - The application of each flow is identified "manually"
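
The split criterion can be made concrete with a short sketch of entropy, information gain and gain ratio for discrete features. The four labelled flows are invented for illustration, and the handling of continuous features by threshold search, which C4.5 also performs, is left out.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(values)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(values).values()
    )


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(feature_values, labels):
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy(feature_values)
    if split_info <= 0:
        return 0.0  # the feature takes a single value: no usable split
    return information_gain(feature_values, labels) / split_info


# Invented training flows.
labels = ["web", "web", "p2p", "p2p"]
large_packets = [False, False, True, True]  # separates the classes perfectly
uses_tcp = [True, True, True, True]         # carries no information

print(gain_ratio(large_packets, labels))
print(gain_ratio(uses_tcp, labels))
```

At each node the tree builder would pick the feature with the highest score, split the training flows accordingly, and recurse on each subset.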

Features
- Requirements
  - Real-time extraction
  - Independence from packet contents
- Feature examples (total: 25)
  - Packets and bytes per flow
  - Flow duration
  - min/avg/max packet size
  - min/avg/max TCP window size
  - min/avg/max packet interarrival time
  - Packets with flags PUSH, URG, DF, … set
  - Average increase of IPID
  - OS estimation (source and destination)
  - Also ports and protocols (but not in the traditional way)
  - …
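
A sketch of how a handful of the listed features could be computed from packet headers alone. The `Packet` record and the feature names are assumptions for illustration; they are not the SMARTxAC data structures, and only a subset of the 25 features is covered.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    # Hypothetical header-only record: no payload is needed.
    timestamp: float  # seconds
    size: int         # bytes
    push: bool = False


def extract_features(packets):
    """Compute per-flow features from the packets of one non-empty flow."""
    packets = sorted(packets, key=lambda p: p.timestamp)
    sizes = [p.size for p in packets]
    # Interarrival times; a single-packet flow gets one zero gap.
    gaps = [b.timestamp - a.timestamp for a, b in zip(packets, packets[1:])]
    gaps = gaps or [0.0]
    return {
        "packets": len(packets),
        "bytes": sum(sizes),
        "duration": packets[-1].timestamp - packets[0].timestamp,
        "min_size": min(sizes),
        "avg_size": sum(sizes) / len(sizes),
        "max_size": max(sizes),
        "min_iat": min(gaps),
        "avg_iat": sum(gaps) / len(gaps),
        "max_iat": max(gaps),
        "push_packets": sum(1 for p in packets if p.push),
    }


flow = [Packet(0.0, 100), Packet(0.5, 300, push=True), Packet(2.0, 200)]
print(extract_features(flow))
```

Each feature here can also be maintained incrementally (running minimum, maximum, sum and count) as packets arrive, which is what makes real-time extraction plausible.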

Training Phase (I)
- Collection of training traffic
  - Representative of the environment to be monitored
  - Flow aggregation (at transport level)
  - Feature extraction
- Manual classification of training flows
  - Offline analysis of packet contents
  - Using pattern matching algorithms (e.g. L7-filter)
  - Manual inspection of the remaining flows
- Alternative
  - Generate artificial traffic in a controlled environment
  - Manual identification is not required
  - Solves encryption and privacy issues
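
Flow aggregation at transport level can be sketched as grouping packets by their 5-tuple. Treating each direction as a separate flow is an assumption made here for simplicity; the talk does not say whether flows are unidirectional or bidirectional. The dictionary keys are hypothetical.

```python
from collections import defaultdict


def aggregate_flows(packets):
    """Group packets by (src, dst, sport, dport, proto)."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])
        flows[key].append(pkt)
    return dict(flows)


# Invented packets using documentation addresses.
packets = [
    {"src": "192.0.2.1", "dst": "198.51.100.7", "sport": 51234, "dport": 80, "proto": 6, "size": 60},
    {"src": "192.0.2.1", "dst": "198.51.100.7", "sport": 51234, "dport": 80, "proto": 6, "size": 1400},
    {"src": "192.0.2.9", "dst": "198.51.100.7", "sport": 40000, "dport": 53, "proto": 17, "size": 80},
]

for key, members in aggregate_flows(packets).items():
    print(key, len(members))
```

Each resulting group is what the feature extraction step would then summarize into one feature vector.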

Training Phase (II)
- Construction of the classification tree
  - C4.5 algorithm
  - Input: classified training flows
  - Output: classification tree (contains flow features only)
- Software employed: Weka
  - University of Waikato (New Zealand)
  - GNU GPL license
  - Written in Java

Deployment
- Implementation in SMARTxAC
  - Flow aggregation
  - Real-time feature extraction (requirement)
  - Classification of each flow using the classification tree
  - Computationally lightweight and applicable in real time
- There is no need to:
  - Analyze packet contents
  - Rely only on port numbers
  - Apply pattern search algorithms
  - Manually inspect packets
- But it is required to:
  - Retrain the system occasionally
    - New applications
    - Changes in existing ones
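
Classifying a flow with a trained tree amounts to a few comparisons, which is why the deployment step is cheap. The tree below is hypothetical: its features, thresholds and class labels are invented for illustration and are not the tree obtained in the talk.

```python
# Hypothetical tree. Internal nodes test "feature <= threshold";
# leaves carry the predicted application label.
TREE = {
    "feature": "avg_packet_size",
    "threshold": 500,
    "le": {
        "feature": "duration",
        "threshold": 1.0,
        "le": {"label": "dns"},
        "gt": {"label": "chat"},
    },
    "gt": {"label": "bulk-transfer"},
}


def classify(flow, tree=TREE):
    """Walk from the root to a leaf, one comparison per level."""
    node = tree
    while "label" not in node:
        branch = "le" if flow[node["feature"]] <= node["threshold"] else "gt"
        node = node[branch]
    return node["label"]


print(classify({"avg_packet_size": 1200, "duration": 30.0}))
print(classify({"avg_packet_size": 90, "duration": 0.2}))
```

The cost per flow is bounded by the depth of the tree and involves no payload access, in contrast with matching every flow against a signature list.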

Accuracy

Application breakdown (figure panels: Port-based, Machine learning)

Application breakdown timeseries (figure panels: Port-based, Machine learning)

Summary
1) Collection of the training set
   - Representative flows of the environment to be monitored
   - Alternatively, artificially generated
2) Feature extraction from the training flows
3) Manual flow classification → application class
   - Pattern matching and manual inspection
   - Can be simplified if an artificial training set is used in 1)
4) Construction of a C4.5 classification tree
   - E.g. using Weka
5) Deployment of the tree obtained in 4) in the monitoring system
6) Retraining of the system
   - Starting from phase 1)

Conclusions
- Traditional method based on well-known ports
  - Low accuracy due to dynamic ports
- Identification based on pattern matching
  - Not feasible on high-speed links due to its computational cost
  - Depends on packet contents
  - Does not work with encryption
- Identification based on machine learning
  - Feasible on high-speed links
  - Does not require packet contents
  - Experimental results show accuracy > 95%
  - Requires occasional retraining
- Future work
  - Retrain the method with the new scenario
  - Make the training phase as automatic as possible

Thank you for your attention
Questions?