Lightweight Application Classification for Network Management

Hongbo Jiang, Case Western Reserve University
Andrew W. Moore, University of Cambridge
Zihui Ge, Adverplex Inc.
Shudong Jin, Case Western Reserve University
Jia Wang, AT&T Labs - Research

ACM SIGCOMM Workshop on Internet Network Management (INM)
Kyoto, Japan, August 31, 2007

Why Do Network Traffic Classification?
- Network planning
- Traffic engineering
- Accounting and billing
- Security profiling
- …

Our Contribution
- A lightweight application classification scheme based on NetFlow data
- Evaluation & Sensitivity Analysis
  - Trivial features
  - Derivative features
  - Training-set size
  - Packet sampling

Flow-level Traffic Classification
- Previous traffic classification schemes use features derived from streams of packets
  - Can achieve good accuracy (e.g., 95%)
  - Have high complexity and cost
- Flow-level statistics are commonly available (Cisco NetFlow, Juniper cflowd, Huawei NetStream, …)
  - Sampling further reduces the cost

Probabilistic Method Example
[Figure: a "probability box" shown in two modes.
 In Training: inputs are the training set's object characteristics, class of membership, and prior (values shown: Pr = .15, Pr = .33).
 In Use: inputs are the prior and the object characteristics of an unknown object (?); the output is the probability of membership, i.e. an estimate of membership (value shown: Pr = .97).]

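The training/use split in the figure can be sketched with a small probabilistic classifier. This slide does not name the method, so the Gaussian naive Bayes model, the function names `train` and `posterior`, and the toy flow features (mean packet size, duration) are all assumptions made for illustration only.

```python
import math
from collections import defaultdict

def train(samples):
    """In Training: learn a prior and per-feature Gaussian stats for each class.

    samples is a list of (feature_vector, class_label) pairs (hypothetical format).
    """
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small variance floor avoids division by zero for constant features.
        variances = [max(sum((x - m) ** 2 for x in col) / n, 1e-9)
                     for col, m in zip(columns, means)]
        model[label] = (n / len(samples), means, variances)
    return model

def posterior(model, features):
    """In Use: return the estimated probability of membership for each class."""
    log_scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        log_scores[label] = score
    # Normalise in log space for numerical stability.
    top = max(log_scores.values())
    unnormalised = {k: math.exp(s - top) for k, s in log_scores.items()}
    total = sum(unnormalised.values())
    return {k: u / total for k, u in unnormalised.items()}

# Toy, made-up flows: (mean packet size in bytes, duration in seconds).
training_set = [
    ((1200.0, 2.0), "web"), ((1100.0, 3.0), "web"), ((1300.0, 2.5), "web"),
    ((200.0, 30.0), "email"), ((250.0, 40.0), "email"),
]
model = train(training_set)
probabilities = posterior(model, (1150.0, 2.2))
```

Here `probabilities` maps each class to its estimated membership probability for the unknown flow; the class with the largest value would be reported as the classification.
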
Our Approach (cont.)
- Features ranked by importance
  - Use Symmetric Uncertainty (based on entropy)
  - (See paper and references therein for details.)
- Ranking the features allows a sensitivity analysis and the removal of irrelevant and redundant features.

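As a rough illustration of entropy-based ranking, the sketch below uses the standard definition SU(X, Y) = 2 [H(X) + H(Y) - H(X, Y)] / [H(X) + H(Y)], which lies between 0 (independent) and 1 (fully predictive). It assumes features that are already discrete or discretized; the helper names, the feature names, and the toy data are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y))."""
    h_x, h_y = entropy(xs), entropy(ys)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant, so nothing can be learned
    h_xy = entropy(list(zip(xs, ys)))
    return 2.0 * (h_x + h_y - h_xy) / (h_x + h_y)

def rank_features(columns, labels):
    """Return (feature_name, SU) pairs, most relevant first."""
    scores = {name: symmetric_uncertainty(col, labels)
              for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy, made-up data: four flows with three discrete features.
labels = ["web", "web", "email", "email"]
columns = {
    "lowPort": [80, 80, 25, 25],   # perfectly predicts the class
    "ToS": [0, 0, 0, 0],           # constant, carries no information
    "noise": [1, 2, 1, 2],         # independent of the class
}
ranking = rank_features(columns, labels)
```

Features at the bottom of such a ranking are candidates for removal; detecting redundancy between features would additionally need SU computed between feature pairs, which this sketch omits.
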
Evaluation Dataset (not from AT&T!)
- Full-duplex 1 Gbps access link; 1000 researchers
- Data was hand-classified into a number of application classes, e.g., web-browsing, email, FTP, attack, P2P, …
- Focused on TCP/IP flows only
  - 800,000 simplex TCP/IP application-level flows (97% of traffic by byte volume)
- NetFlow generation
  - Software simulation of Cisco NetFlow v5 engine
- Independent training and test sets
  - Flows randomly assigned to each

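Random assignment of flows to independent training and test sets could look like the sketch below. The 50/50 split, the fixed seed, and the function name `split_flows` are assumptions; the slides do not state the proportions used.

```python
import random

def split_flows(flows, train_fraction=0.5, seed=0):
    """Randomly assign each flow to exactly one of two disjoint sets."""
    rng = random.Random(seed)      # seeded so the split is reproducible
    shuffled = list(flows)         # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical flow identifiers standing in for labelled NetFlow records.
flow_ids = list(range(1000))
training_flows, test_flows = split_flows(flow_ids)
```
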
Baseline and Derivative Features
- Category: Baseline
  - Features: srcIP/dstIP, srcPort/dstPort, ToS, sTime/eTime, tcpFlag, bytes, packets
  - Accuracy: 88.3%
- Category: Derivative + Baseline
  - Features: Duration, pktSize, byteRate, pktRate, tcpFxxx (syn/ack/fin/rst/psh/urg)
  - Accuracy: 89.1%
- Category: Application + Baseline + Derivative
  - Features: Low port, High port
  - Accuracy: 91.4%
- Comparison: port-based: 50-70%; packet-based: 95%

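The derivative and application features listed above are cheap to compute from a single flow record. The sketch below shows one plausible derivation; the exact definitions (for example, low/high port as the minimum and maximum of the two port numbers, and rates defined as totals divided by duration) are assumptions, and `FlowRecord` is a hypothetical container rather than the NetFlow v5 wire format. The TCP flag bit positions are the standard TCP header values.

```python
from dataclasses import dataclass

# Standard TCP header flag bits.
TCP_FLAG_BITS = {"fin": 0x01, "syn": 0x02, "rst": 0x04,
                 "psh": 0x08, "ack": 0x10, "urg": 0x20}

@dataclass
class FlowRecord:
    src_port: int
    dst_port: int
    s_time: float    # flow start, seconds
    e_time: float    # flow end, seconds
    tcp_flag: int    # OR of the TCP flags seen in the flow
    num_bytes: int
    packets: int

def derive_features(flow):
    """Compute derivative and application features from baseline fields."""
    duration = flow.e_time - flow.s_time
    features = {
        "duration": duration,
        "pktSize": flow.num_bytes / flow.packets if flow.packets else 0.0,
        # Single-packet flows have zero duration, so guard the rates.
        "byteRate": flow.num_bytes / duration if duration > 0 else 0.0,
        "pktRate": flow.packets / duration if duration > 0 else 0.0,
        "lowPort": min(flow.src_port, flow.dst_port),
        "highPort": max(flow.src_port, flow.dst_port),
    }
    # Expand the flag bitmask into one 0/1 feature per flag (tcpFsyn, ...).
    for name, bit in TCP_FLAG_BITS.items():
        features["tcpF" + name] = int(bool(flow.tcp_flag & bit))
    return features

# Made-up example: a short flow to port 80 with SYN, ACK, PSH and FIN seen.
example = FlowRecord(src_port=51000, dst_port=80, s_time=10.0, e_time=12.0,
                     tcp_flag=0x1B, num_bytes=3000, packets=6)
derived = derive_features(example)
```
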
Highly Relevant Features
- Refer to specific privileged services and protocols
- Differentiate Email and FTP from Web-browsing
- Compact features

Reducing Feature Complexity
- Figure annotations: Runtime: 600x (s); Runtime: 1x (s)
- Accuracy remains high even after removing irrelevant and redundant features.

Reducing Training Set Size
- More features may lead to noise (insufficiently representative)

Impact of Packet Sampling
- NetFlow characteristic: the observed flow count decreases as the sampling rate decreases
- Packet sampling has little impact on accuracy

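The drop in observed flow count can be illustrated with a simple model. Assuming each packet is sampled independently with probability p (real NetFlow deployments may instead use deterministic 1-in-N sampling), a flow of n packets appears in the export with probability 1 - (1 - p)^n, so short flows vanish first. The flow-size mix below is invented for the example.

```python
def detection_probability(packets, sampling_rate):
    """Probability that at least one packet of the flow is sampled."""
    return 1.0 - (1.0 - sampling_rate) ** packets

def expected_observed_flows(flow_sizes, sampling_rate):
    """Expected number of flows that survive packet sampling."""
    return sum(detection_probability(n, sampling_rate) for n in flow_sizes)

# Made-up mix: many 1-packet flows, some 10-packet flows, a few large flows.
flow_sizes = [1] * 700 + [10] * 200 + [1000] * 100

expected_counts = {rate: expected_observed_flows(flow_sizes, rate)
                   for rate in (1.0, 0.1, 0.01)}
```

Under this model the surviving flows are biased toward large ones, which is one reason per-flow features such as packet size can remain informative even when the flow count falls.
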
Conclusion & Future Work
- Application classification can be done with flow-level (NetFlow) information
- Trivially derived features improve accuracy
- Packet sampling has minimal impact
- Future work
  - NetFlow v9?
  - Other machine-learning methods?

Thanks