Network Service Identification through Hypergraph Clustering Li PU, Boi FALTINGS @ EPFL 2011.02.15 @ UESTC
Introduction: In a corporate network, what applications are installed on the client machines? What services are installed on the servers? The administrator needs an in-depth overview.
Introduction: Network traffic is collected from client machines by the Nexthink solution.
Introduction: [Figure: screenshot from NEXThink Finder with the columns Source, User, Application, Port, Destination.] The data covers >1000 computers, >500 applications, and >1 million TCP/UDP sessions. Goal: simplify the data for a network administrator to look at.
Introduction: Group the ports according to their functionality; a group of ports = a service. Examples: file sharing: 139, 445; antivirus: 1281, 2967, …; malware. This reduces the number of groups the administrator needs to skim over. Assumption: one port belongs to exactly one service.
Network Service: A group of ports can be identified as a network service. The ports need not be consecutive. Evidence: application, port, destination.
Hypergraph Model: A hyperedge contains one or more vertices. Each hyperedge has a weight. Network service identification corresponds to a vertex partition. What is a good partition? Break as few (weighted) hyperedges as possible. Isolate small services as well as large services.
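As a minimal sketch of the model, ports can be treated as vertices and each piece of evidence as a weighted hyperedge over the ports it touches. The slides do not state how hyperedges or weights are built, so the rule below (one hyperedge per application and one per destination, weighted by session count) is an assumption. The function name `build_hyperedges` is hypothetical, and the session tuples are made up for illustration.

```python
from collections import defaultdict

def build_hyperedges(sessions):
    """Build weighted hyperedges over ports from (application, port, destination) records.

    Assumption: every application and every destination induces one hyperedge
    containing the ports seen with it; the weight is the number of sessions.
    """
    ports = defaultdict(set)
    counts = defaultdict(int)
    for app, port, dest in sessions:
        for key in (("app", app), ("dest", dest)):
            ports[key].add(port)
            counts[key] += 1
    return {key: (frozenset(ports[key]), counts[key]) for key in ports}

# Illustrative session records (not taken from the dataset).
sessions = [
    ("system", "TCP139", "10.130.10.111"),
    ("system", "TCP445", "10.130.10.111"),
    ("rtvscan.exe", "TCP2967", "10.130.10.98"),
    ("rtvscan.exe", "UDP1281", "10.130.10.98"),
]
hyperedges = build_hyperedges(sessions)
```

Here `hyperedges[("app", "system")]` holds the port set {TCP139, TCP445} with weight 2.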
Cut-based Partitioning: Minimize the cut of the partition. Different cuts for a hypergraph: 1) Number of broken hyperedges [karypis2002multilevel]. Designed for VLSI applications; not suitable for general hypergraph partitioning or network service identification.
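The first cut can be sketched as the total weight of hyperedges whose vertices end up in more than one cluster. The function name `broken_hyperedge_weight`, the hyperedges, and the two assignments are hypothetical illustrations, not data from the talk.

```python
def broken_hyperedge_weight(hyperedges, assignment):
    """Sum the weights of hyperedges whose vertices fall into more than one cluster."""
    total = 0.0
    for vertices, weight in hyperedges:
        clusters = {assignment[v] for v in vertices}
        if len(clusters) > 1:
            total += weight
    return total

# Made-up hyperedges: (set of ports, weight).
hyperedges = [
    ({"TCP139", "TCP445"}, 3.0),
    ({"TCP2967", "UDP1281", "UDP2967"}, 2.0),
    ({"TCP445", "TCP2967"}, 0.5),
]
good = {"TCP139": 0, "TCP445": 0, "TCP2967": 1, "UDP1281": 1, "UDP2967": 1}
bad = {"TCP139": 0, "TCP445": 1, "TCP2967": 1, "UDP1281": 1, "UDP2967": 0}
one_cluster = {port: 0 for port in good}

cut_good = broken_hyperedge_weight(hyperedges, good)
cut_bad = broken_hyperedge_weight(hyperedges, bad)
cut_one = broken_hyperedge_weight(hyperedges, one_cluster)
```

The `good` assignment breaks only the light hyperedge, while `bad` breaks both heavy ones. Putting every port into a single cluster breaks nothing at all, which is why a cut alone cannot choose the number of clusters.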
Cut-based Partitioning: 2) Normalized hypergraph cut [zhou2007learning]. First convert the hypergraph into a simple graph, then compute the normalized cut of that simple graph.
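The two steps can be sketched as follows. The conversion used here is a clique expansion in which each hyperedge e adds w(e)/|e| to every pair of its vertices; this is one common choice, and whether the talk uses exactly this weighting is an assumption. The second step is the standard two-way normalized cut, cut/vol(A) + cut/vol(B). The function names and the toy hypergraph are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def clique_expansion(hyperedges):
    """Convert a hypergraph to pairwise weights: each hyperedge adds w/|e| per vertex pair."""
    adj = defaultdict(float)
    for vertices, weight in hyperedges:
        share = weight / len(vertices)
        for u, v in combinations(sorted(vertices), 2):
            adj[(u, v)] += share
    return dict(adj)

def normalized_cut(adj, part_a):
    """Two-way normalized cut of (part_a, rest) on a weighted simple graph.

    Assumes both sides have at least one incident edge (non-zero volume).
    """
    cut = vol_a = vol_b = 0.0
    for (u, v), w in adj.items():
        # Each edge adds its weight to the degree of both endpoints.
        for x in (u, v):
            if x in part_a:
                vol_a += w
            else:
                vol_b += w
        if (u in part_a) != (v in part_a):
            cut += w
    return cut / vol_a + cut / vol_b

# Toy hypergraph: two tight pairs joined by one weak hyperedge.
hyperedges = [({"a", "b"}, 2.0), ({"c", "d"}, 2.0), ({"b", "c"}, 1.0)]
adj = clique_expansion(hyperedges)
ncut = normalized_cut(adj, {"a", "b"})
```

Dividing by the volume of each side penalizes partitions that split off a tiny piece, which the plain cut does not.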
Cut-based Partitioning: 3) Non-pairwise hypergraph cut. The best partition depends only on the hyperedge weights, not on the hyperedge degrees.
Determining Number of Clusters: Minimizing the hypergraph cut has a trivial solution in which only one cluster exists, so we need to find the best number of clusters. The graph modularity [newman2004finding] is extended to hypergraphs.
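For reference, the simple-graph modularity that serves as the starting point can be sketched as below. This is the standard form for a weighted undirected graph, Q = Σ_c [in_c/m − (deg_c/2m)²]. It is not the hypergraph extension from the talk, whose formula is not given on the slide. The function name and the toy graph are hypothetical.

```python
from collections import defaultdict

def modularity(edges, assignment):
    """Modularity of a clustering on a weighted undirected simple graph.

    edges: dict mapping (u, v) to a weight, each undirected edge listed once.
    assignment: dict mapping each vertex to its cluster label.
    """
    m = sum(edges.values())
    inside = defaultdict(float)   # edge weight fully inside each cluster
    degree = defaultdict(float)   # sum of vertex degrees per cluster
    for (u, v), w in edges.items():
        degree[assignment[u]] += w
        degree[assignment[v]] += w
        if assignment[u] == assignment[v]:
            inside[assignment[u]] += w
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy graph: two triangles joined by a single edge.
edges = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("c", "d"): 1.0,
}
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {v: 0 for v in two_clusters}
q_two = modularity(edges, two_clusters)
q_one = modularity(edges, one_cluster)
```

The single-cluster solution, which is optimal for any cut, scores zero here, while the two-triangle split scores 5/14.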
Partitioning Algorithm: Build a hierarchical clustering tree using the hypergraph cut (bottom-up, agglomerative approach). Determine the threshold by hypergraph modularity.
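The overall shape of such a procedure can be sketched as below: start from singletons, merge the most strongly connected pair of clusters, and keep the level of the tree with the highest modularity. Both ingredients are stand-ins and not the HC0/NHC2 cuts or the hypergraph modularity from the talk: the linkage is the clique-expanded weight between two clusters, and the score is simple-graph modularity on that expansion. All function names and the toy hypergraph are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def pair_weights(hyperedges):
    """Clique expansion: each hyperedge adds w/|e| to every vertex pair it contains."""
    adj = defaultdict(float)
    for vertices, weight in hyperedges:
        for u, v in combinations(sorted(vertices), 2):
            adj[(u, v)] += weight / len(vertices)
    return adj

def modularity(adj, clusters):
    """Simple-graph modularity of a list of vertex sets on the expanded graph."""
    m = sum(adj.values())
    label = {v: i for i, c in enumerate(clusters) for v in c}
    inside = defaultdict(float)
    degree = defaultdict(float)
    for (u, v), w in adj.items():
        degree[label[u]] += w
        degree[label[v]] += w
        if label[u] == label[v]:
            inside[label[u]] += w
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def agglomerate(hyperedges):
    """Merge clusters bottom-up; return the partition with the highest modularity seen."""
    adj = pair_weights(hyperedges)
    vertices = sorted({v for e, _ in hyperedges for v in e})
    clusters = [frozenset([v]) for v in vertices]
    best, best_q = list(clusters), modularity(adj, clusters)

    def link(a, b):
        # Stand-in linkage: total expanded weight between two clusters.
        return sum(w for (u, v), w in adj.items()
                   if (u in a and v in b) or (u in b and v in a))

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: link(*p))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        q = modularity(adj, clusters)
        if q > best_q:
            best, best_q = list(clusters), q
    return best, best_q

# Toy hypergraph: two 3-port services joined by one weak hyperedge.
hyperedges = [({"a", "b", "c"}, 3.0), ({"d", "e", "f"}, 3.0), ({"c", "d"}, 1.0)]
best, best_q = agglomerate(hyperedges)
```

On the toy input the best level of the tree is the two-service split {a, b, c} and {d, e, f}. Merging further only lowers the score.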
Results: We compare the following methods: hierarchical clustering based on HC0 (SetE1), hierarchical clustering based on NHC2 (SetE2), and K-means. A synthetic dataset and a Nexthink dataset collected from a real company are used.
Results: Synthetic data (α = max service size / min service size; a second parameter gives the noisy data rate). All performance figures are averaged over 100 runs with randomly generated datasets. Observations: HC0 is the best with or without unbalanced service sizes. The results are sensitive to noise. Q2 is very similar to PWF even though it is unsupervised.
Results: Synthetic data, performance on individual clusters (one column per cluster)

α = 2, K-means
  size       3.000  3.000  3.000  6.000  6.000  6.000
  precision  0.753  0.800  0.779  0.886  0.919  0.879
  recall     0.992  0.971  0.953  0.944  0.921  0.913
  F-score    0.815  0.838  0.810  0.891  0.898  0.864

α = 2, HC0
  size       3.000  3.000  3.000  6.000  6.000  6.000
  precision  1.000  1.000  1.000  1.000  1.000  1.000
  recall
  F-score

α = 8, K-means
  size       1.000  1.100  1.300  7.500  7.600  7.900
  precision  0.641  0.679  0.690  0.952  0.935  0.935
  recall     1.000  1.000  0.997  0.873  0.860  0.915
  F-score    0.745  0.774  0.779  0.889  0.868  0.905

α = 8, HC0
  size       1.000  1.100  1.300  7.500  7.600  7.900
  precision  1.000  1.000  1.000  1.000  1.000  0.969
  recall     1.000  1.000  1.000  1.000  1.000  1.000
  F-score    1.000  1.000  1.000  1.000  1.000  0.982
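Per-cluster precision, recall, and F-score can be computed along the following lines. The slides do not say how a true service is matched to a predicted cluster, so matching by largest overlap is an assumption. The function name `cluster_scores` and the port labels are hypothetical.

```python
def cluster_scores(true_service, predicted_clusters):
    """Score one true service against the predicted cluster that overlaps it most."""
    best = max(predicted_clusters, key=lambda c: len(c & true_service))
    overlap = len(best & true_service)
    precision = overlap / len(best)
    recall = overlap / len(true_service)
    if overlap == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Made-up example: a 3-port service recovered with one extra port attached.
true_service = {"p1", "p2", "p3"}
predicted = [{"p1", "p2", "p3", "p4"}, {"p5", "p6"}]
precision, recall, f_score = cluster_scores(true_service, predicted)
```

An extra port lowers precision but not recall, which matches the K-means pattern in the table for small services. Since the table reports averages over runs, an averaged F-score need not equal the F-score computed from the averaged precision and recall.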
Results: Real data, collected by NEXThink in the client's corporate network

HC0
  2   Ports:        TCP139, TCP445
      Applications: system
      Destinations: 10.130.10.111, 10.130.10.107, 10.130.10.226, 10.130.10.98, …
  3   Ports:        TCP3464, TCP3466
      Applications: nvdkit.exe, radconct.exe, radstgms.exe, …
      Destinations: 10.130.10.94, 10.144.0.5, 10.136.0.5, 10.60.15.5, 10.140.1.5, 10.20.3.8, ...
  5   Ports:        TCP2967, UDP1281, UDP2967, UDP38293
      Applications: rtvscan.exe, savroam.exe
      Destinations: 10.130.10.98, 10.144.0.5, 10.136.0.5, 10.2.0.5, 10.60.15.5, 10.20.3.8, ...

HC0
  k1  Ports:        TCP2638
      Applications: vlaknagl.exe, novoterm.exe, tpmeritve.exe
      Destinations: 10.21.49.7
  k2  Ports:        UDP2638
      Applications: novoterm.exe, tpmeritve.exe
      Destinations: 10.255.255.255
  k3  Ports:        TCP50000
      Applications: corporateebankmain.exe, commonupdt.exe, corporateebank.exe
      Destinations:

K-means
  k1  Ports:        TCP2638, TCP8290, TCP16384
      Applications: vlaknagl.exe, novoterm.exe, tpmeritve.exe, hpqscnvw.exe, agentservice.exe, ...
      Destinations: 10.21.49.7, 10.136.10.2, 10.0.21.105, 10.130.11.86, 10.100.0.15
  k2  Ports:        UDP2638, UDP138, TCP2869
      Applications: novoterm.exe, tpmeritve.exe, system, svchost.exe, wmpnetwk.exe, ...
      Destinations: 10.255.255.255, 10.136.0.36, 10.200.21.74, 10.200.255.255, 10.136.0.30, ...
  k3  Ports:        TCP50000, TCP40000, TCP1233
      Applications: corporateebankmain.exe, commonupdt.exe, corporateebank.exe, mmc.exe, java.exe, ...
      Destinations: 10.21.49.7, 10.130.10.111, 10.150.31.8
Discussion: HC0 produces better results than NHC2 on the synthetic data, but is it suitable for all hypergraph structures (given that NHC2 is easier to deal with)? The Nexthink dataset is interesting; can we explore it further? Where can we find killer applications of community detection techniques (for both simple graphs and hypergraphs)?
Thank you