Workshop on Data Mining in Networks (DaMNet) @ ICDM 2015
Efficient and Time Scale-Invariant Detection of Correlated Activity in Communication Networks
Brian Thompson and James Abello

Detecting Correlated Activity in Communication Networks
Problem Description
Setup: A communication network: a set of entities that interact with one another at specific moments in time
Goal: Identify times and parts of the network with an unexpectedly high concentration of recent activity
Challenges:
  Scalability: data accumulates, need concise representation
  Efficiency: high data rate, time-sensitive information
  Variability: entities have different temporal dynamics

Network Representation
Given a set of entities: Muthu, Rebecca, Paul, Danfeng, Hanghang
and a stream of pairwise interactions:
Node 1, Node 2, Timestamp
Muthu, Rebecca, 8:30 AM
Paul, 9:00 AM
Danfeng, 9:15 AM
Hanghang, 2:00 PM

Network Representation
For each pair of nodes (could be directed or undirected), we extract a time sequence:
Muthu, Rebecca: $t_1, t_2, t_3, t_4, t_5$

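The extraction step can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function name `build_time_sequences`, the tuple layout, the node labels, and the timestamps (minutes since midnight) are all assumptions.

```python
from collections import defaultdict

def build_time_sequences(interactions, directed=False):
    """Group a stream of (node1, node2, timestamp) records by node pair.

    Returns a dict mapping each pair to its sorted list of timestamps.
    For undirected networks the pair is stored in sorted order, so that
    (a, b) and (b, a) share one time sequence.
    """
    sequences = defaultdict(list)
    for u, v, t in interactions:
        key = (u, v) if directed else tuple(sorted((u, v)))
        sequences[key].append(t)
    for times in sequences.values():
        times.sort()
    return dict(sequences)

# Hypothetical stream; timestamps are minutes since midnight.
stream = [
    ("A", "B", 510),
    ("B", "A", 540),
    ("C", "A", 555),
    ("A", "B", 840),
]
undirected = build_time_sequences(stream)
directed = build_time_sequences(stream, directed=True)
```

In the undirected view the pair (A, B) has a single sequence of three events, while the directed view keeps A-to-B and B-to-A apart.
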
Network Representation
We can visualize the network like this:
[Figure: network graph on the nodes Muthu, Danfeng, Paul, Rebecca, Hanghang]

Related Work
Time series analysis
Sequence of "summary graphs"
[Figure: summary graphs at t = 1, t = 2, t = 3, t = 4]

Time-scale Bias
Q: What is the "right" time scale?
A: In a heterogeneous network, there is none!
The result of any analysis depends on the length of the "time blocks," a phenomenon we call time-scale bias
[Figure: event counts per block for Time Sequence 1 and Time Sequence 2 at block lengths of 10 min. and 30 min.; panels marked √ or ?]

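A small numerical illustration of time-scale bias, under stated assumptions: the timestamps below are hypothetical (minutes within one hour) and `block_counts` is a helper introduced only for this example. The same event sequence is binned at two block lengths.

```python
def block_counts(times, block_length, horizon):
    """Count events in consecutive blocks [k*L, (k+1)*L) covering [0, horizon)."""
    n_blocks = -(-horizon // block_length)  # ceiling division
    counts = [0] * n_blocks
    for t in times:
        if 0 <= t < horizon:
            counts[int(t // block_length)] += 1
    return counts

# Hypothetical sequence with a short burst at minutes 29, 31, 33.
times = [1, 12, 23, 29, 31, 33, 45, 56]
counts_10 = block_counts(times, 10, 60)
counts_30 = block_counts(times, 30, 60)
```

With 10-minute blocks the burst straddles a block boundary and no block holds more than two events; with 30-minute blocks both halves of the hour hold four events and the burst disappears entirely.
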
Our Approach
Use a streaming stochastic model to concisely represent communication between each node pair
Define a notion of "recent" communication that is time-scale invariant
Apply statistical tests to detect sets of nodes with an unexpectedly high concentration of recent activity

Stochastic Model
A renewal process $\Phi$ generates a sequence of events with inter-arrival times sampled independently at random from the same positive distribution
[Figure: inter-arrival time distribution with $x_{\min}$ and $x_{\max}$ marked; time sequence $t_1, \dots, t_5$ with inter-arrival time $= t_3 - t_2$]

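A renewal process can be simulated directly from its definition. In the sketch below, the uniform distribution on $[x_{\min}, x_{\max}]$ and the bounds themselves are stand-in assumptions; the slide only indicates that inter-arrival times come from a single positive distribution. The name `simulate_renewal` is hypothetical.

```python
import random

def simulate_renewal(n_events, sample_iat, start=0.0):
    """Generate event times whose gaps are i.i.d. draws from sample_iat()."""
    times, t = [], start
    for _ in range(n_events):
        gap = sample_iat()
        if gap <= 0:
            raise ValueError("inter-arrival times must be positive")
        t += gap
        times.append(t)
    return times

rng = random.Random(0)        # fixed seed for repeatability
x_min, x_max = 1.0, 60.0      # hypothetical bounds on the inter-arrival time
events = simulate_renewal(5, lambda: rng.uniform(x_min, x_max))
gaps = [b - a for a, b in zip([0.0] + events[:-1], events)]
```
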
Stochastic Model
Given an observed time sequence corresponding to communication between a pair of nodes in the network, we can estimate the parameters of the renewal process that is most likely to have generated it
Streaming parameter estimation means efficient updates
[Figure: inter-arrival time distribution with $x_{\min}$ and $x_{\max}$ marked; time sequence $t_1, \dots, t_5$ with inter-arrival time $= t_3 - t_2$]

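To make "streaming parameter estimation" concrete, here is a minimal sketch that keeps constant-size state per node pair and updates it in constant time per event. It assumes exponentially distributed inter-arrival times, for which the maximum-likelihood rate is the number of gaps divided by their sum; the slides do not name the distribution family, so this choice and the class name `StreamingIATModel` are assumptions.

```python
class StreamingIATModel:
    """O(1)-per-event summary of one node pair's inter-arrival times (IATs).

    Assumes exponential IATs, an illustrative choice only.
    """

    def __init__(self):
        self.last_time = None
        self.n_gaps = 0
        self.total_gap = 0.0
        self.x_min = float("inf")
        self.x_max = 0.0

    def update(self, t):
        """Fold one new event time into the running statistics."""
        if self.last_time is not None:
            gap = t - self.last_time
            if gap <= 0:
                raise ValueError("timestamps must be strictly increasing")
            self.n_gaps += 1
            self.total_gap += gap
            self.x_min = min(self.x_min, gap)
            self.x_max = max(self.x_max, gap)
        self.last_time = t

    def rate(self):
        """Maximum-likelihood rate for exponential IATs: n / sum of gaps."""
        if self.n_gaps == 0:
            return None
        return self.n_gaps / self.total_gap

model = StreamingIATModel()
for t in [0.0, 2.0, 3.0, 7.0, 8.0]:   # hypothetical event times
    model.update(t)
```

The gaps here are 2, 1, 4, 1, so the estimated rate is 4 / 8 = 0.5 events per time unit, and no past timestamps need to be stored.
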
Recency
A natural choice of recency function is the age of a renewal process, the time elapsed since the last event, denoted $\mathrm{Age}_\Phi(t)$
However, the most frequent communicators will always seem "recent," overshadowing others' behavior
[Figure: $\mathrm{Age}_\Phi(t)$ on a timeline with events $t_1, \dots, t_5$ and current time $t$; timelines contrasting router traffic with correlated activity; labels $\mathrm{Age}_\Phi(t)$, $T_\Phi$]

Recency
We define recency using the probability integral transform:
$\mathrm{Rec}_\Phi(t) = 1 - F_\Phi^{\mathrm{Age}^*}\big(\mathrm{Age}_\Phi(t)\big)$
where $F_\Phi^{\mathrm{Age}^*}$ is the limit distribution of the Age function
$\mathrm{Rec}_\Phi$ normalizes the Age function with respect to the node pair's typical frequency of activity, i.e. $\mathrm{Rec}_\Phi \sim \mathrm{Uniform}(0,1)$

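By a standard renewal-theory result, the limit distribution of the age is $\frac{1}{\mu}\int_0^a \big(1 - F(y)\big)\,dy$, where $F$ is the inter-arrival time CDF and $\mu$ its mean. The sketch below plugs in the empirical distribution of observed gaps, for which this expression reduces to $\sum_i \min(g_i, a) / \sum_i g_i$. Using the empirical distribution instead of fitted model parameters is an assumption made for illustration, and the function name `recency` and the gap values are hypothetical.

```python
def recency(gaps, age):
    """Rec(t) = 1 - F_Age*(age), using the empirical IAT distribution.

    For the empirical distribution of `gaps`, the limiting age CDF at `age`
    equals sum(min(g, age) for g in gaps) / sum(gaps).
    """
    if not gaps or any(g <= 0 for g in gaps):
        raise ValueError("need at least one positive inter-arrival time")
    if age < 0:
        raise ValueError("age must be non-negative")
    return 1.0 - sum(min(g, age) for g in gaps) / sum(gaps)

frequent = [1, 2, 3]          # hypothetical gaps, in minutes
infrequent = [60, 120, 180]   # same shape, 60 times slower

same_relative_age = (recency(frequent, 2), recency(infrequent, 120))
same_absolute_age = (recency(frequent, 2), recency(infrequent, 2))
```

When both the gaps and the age are scaled by the same factor, the recency is unchanged (1/6 in both cases), which is the time-scale invariance the definition is after. At the same absolute age of 2 minutes, the slow pair is far more "recent" than the fast one.
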
Recency
We define the recency of a set of processes $\Phi_1, \dots, \Phi_n$ at time $t$ using the Kolmogorov-Smirnov test:
$\mathrm{Rec}_{\Phi_1, \dots, \Phi_n}(t) = 1 - p_{KS}\big(\{\mathrm{Rec}_{\Phi_i}(t)\} \,\|\, \mathrm{Uniform}(0,1)\big)$
$H_0$: i.i.d. samples from $\mathrm{Uniform}(0,1)$
The p-value, $p_{KS}$, is the probability of getting a max distance at least as large as $d_{KS}$ under $H_0$

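The set recency can be sketched with the standard library alone. The version below assumes a one-sided Kolmogorov-Smirnov test (sensitive to values skewed toward 1) and uses the exact Birnbaum-Tingey formula for its p-value; the choice of a one-sided variant is an assumption, as are the names `ks_one_sided_pvalue` and `set_recency`.

```python
from math import comb, floor

def ks_one_sided_pvalue(d, n):
    """Exact P(D_n >= d) for the one-sided KS statistic (Birnbaum-Tingey)."""
    if d <= 0:
        return 1.0
    if d >= 1:
        return 0.0
    total = 0.0
    for j in range(floor(n * (1 - d)) + 1):
        tail = max(0.0, 1 - d - j / n)
        total += comb(n, j) * (d + j / n) ** (j - 1) * tail ** (n - j)
    return min(1.0, d * total)

def set_recency(recencies):
    """1 - p_KS for H0: the values are i.i.d. Uniform(0,1)."""
    xs = sorted(recencies)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one recency value")
    # Largest amount by which the Uniform(0,1) CDF exceeds the empirical CDF.
    d = max(x - i / n for i, x in enumerate(xs))
    return 1.0 - ks_one_sided_pvalue(d, n)

high = set_recency([0.9, 0.75, 0.7])
low = set_recency([0.05, 0.1, 0.2])
```

Under this assumption, three pairwise recencies of 0.9, 0.75, and 0.7 give a set recency of 0.973, and a single value of 0.9 gives 0.900. Both figures also appear in the G-CORE example later in the deck, assuming those weights belong to the node sets listed there.
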
The L-CORE Algorithm
Local algorithm for detecting CORrelated Events
For a given set of node pairs $E$, maintain the inter-arrival time (IAT) distribution of communication between each pair
Every time there is communication activity:
  Update the corresponding IAT distribution
  Output $\mathrm{Rec}_E(t)$ and the most recent node pairs
[Figure: example on nodes $u_1, \dots, u_5$ with edge recencies 0.8, 1.0, 0.3, 0.9; $\mathrm{Rec}_E(t) = 0.943$; output $u_1 \mapsto \{u_2, u_3, u_5\}$]

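The update loop can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes exponential inter-arrival times (so a pair's recency is $e^{-\lambda \cdot \mathrm{age}}$), a one-sided KS test for the set recency, and globally non-decreasing timestamps. The class name `LCoreSketch`, the `top_k` parameter, and the event stream are hypothetical.

```python
from math import comb, exp, floor

def ks_one_sided_pvalue(d, n):
    """Exact P(D_n >= d) for the one-sided KS statistic (Birnbaum-Tingey)."""
    if d <= 0:
        return 1.0
    if d >= 1:
        return 0.0
    total = 0.0
    for j in range(floor(n * (1 - d)) + 1):
        tail = max(0.0, 1 - d - j / n)
        total += comb(n, j) * (d + j / n) ** (j - 1) * tail ** (n - j)
    return min(1.0, d * total)

class LCoreSketch:
    """Tracks a fixed set of node pairs E; exponential IATs are assumed."""

    def __init__(self, pairs):
        # Per pair: last event time, number of gaps seen, sum of gaps.
        self.stats = {p: {"last": None, "n": 0, "total": 0.0} for p in pairs}

    def pair_recency(self, pair, t):
        s = self.stats[pair]
        if s["n"] == 0:
            return None  # fewer than two events: no rate estimate yet
        rate = s["n"] / s["total"]
        return exp(-rate * (t - s["last"]))

    def observe(self, pair, t, top_k=3):
        """Update the pair's model, then return (Rec_E(t), most recent pairs)."""
        s = self.stats[pair]
        if s["last"] is not None:
            if t <= s["last"]:
                raise ValueError("timestamps must be strictly increasing per pair")
            s["n"] += 1
            s["total"] += t - s["last"]
        s["last"] = t
        recs = {}
        for p in self.stats:
            r = self.pair_recency(p, t)
            if r is not None:
                recs[p] = r
        if not recs:
            return None, []
        xs = sorted(recs.values())
        d = max(x - i / len(xs) for i, x in enumerate(xs))
        rec_e = 1.0 - ks_one_sided_pvalue(d, len(xs))
        top = sorted(recs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        return rec_e, top

E = [("u1", "u2"), ("u1", "u3")]
lcore = LCoreSketch(E)
for pair, t in [(E[0], 0.0), (E[1], 0.0), (E[0], 10.0), (E[1], 100.0)]:
    lcore.observe(pair, t)
rec_e, top = lcore.observe(E[0], 101.0)
```

In this toy stream both pairs have just been active relative to their own typical gaps, so the set recency is close to 1. The sketch sorts the recencies on every event, so it is not tuned to match any particular complexity bound.
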
The G-CORE Algorithm
Global algorithm for detecting CORrelated Events
Construct a graph on $\mathcal{U}$, with $w(u, u') = \mathrm{Rec}_{u,u'}(t)$
Initialize a disjoint set data structure on the nodes
Run a variant of the Union-Find algorithm, keeping track of the subgraphs with highest recency
[Figure: example graph on $u_1, \dots, u_5$ with edge weights 0.9, 0.75, 0.7, 0.5, 0.3, 0.1, and the resulting merge tree labeled .90, .97, .50, .42]
Node set, Rec(t)
$\{u_1, u_2\}$, 0.900
$\{u_1, u_2, u_3\}$, 0.973
$\{u_4, u_5\}$, 0.500
$\{u_1, u_2, u_3, u_4, u_5\}$, 0.421

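One way to read these steps is as a Kruskal-style pass over the edges in order of decreasing recency, scoring each component as it grows. The sketch below follows that reading and is not the authors' algorithm: the one-sided KS scoring, the function name `gcore_sketch`, and the edge endpoints in the example are assumptions (the slide gives the weights but not which edge carries which weight).

```python
from math import comb, floor

def ks_one_sided_pvalue(d, n):
    """Exact P(D_n >= d) for the one-sided KS statistic (Birnbaum-Tingey)."""
    if d <= 0:
        return 1.0
    if d >= 1:
        return 0.0
    total = 0.0
    for j in range(floor(n * (1 - d)) + 1):
        tail = max(0.0, 1 - d - j / n)
        total += comb(n, j) * (d + j / n) ** (j - 1) * tail ** (n - j)
    return min(1.0, d * total)

def set_recency(weights):
    xs = sorted(weights)
    n = len(xs)
    d = max(x - i / n for i, x in enumerate(xs))
    return 1.0 - ks_one_sided_pvalue(d, n)

def gcore_sketch(nodes, weighted_edges):
    """Merge components by decreasing edge recency and score each one.

    Returns a dict mapping frozenset(node set) to the set recency of the
    edges inside that component when it was last updated.
    """
    parent = {u: u for u in nodes}
    members = {u: [u] for u in nodes}
    weights = {u: [] for u in nodes}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    scores = {}
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            if len(members[ru]) < len(members[rv]):
                ru, rv = rv, ru  # union by size: attach smaller to larger
            parent[rv] = ru
            members[ru] += members.pop(rv)
            weights[ru] += weights.pop(rv)
        weights[ru].append(w)  # also counts edges inside one component
        scores[frozenset(members[ru])] = set_recency(weights[ru])
    return scores

# Hypothetical endpoints for the six weights shown on the slide.
edges = [("u1", "u2", 0.9), ("u2", "u3", 0.75), ("u1", "u3", 0.7),
         ("u4", "u5", 0.5), ("u3", "u4", 0.3), ("u1", "u5", 0.1)]
scores = gcore_sketch(["u1", "u2", "u3", "u4", "u5"], edges)
```

With these assumed endpoints the sketch scores {u1, u2} at 0.900, {u1, u2, u3} at 0.973, and {u4, u5} at 0.500. The score for the full node set depends on edge details that cannot be recovered from the slide.
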
Complexity
Let $n = |\mathcal{U}|$, and let $m$ be the number of node pairs that have ever communicated.
Streaming model: $O(m)$ space, $O(1)$ update per event
L-CORE: $O(|E|)$ time per event, where $E$ is the set of node pairs of interest
G-CORE: worst-case $O(n \cdot m)$ time
Heuristic G-CORE: $O\big(m \cdot (\tfrac{1}{\delta} + \alpha(m,n))\big)$ time, where $0 < \delta < 1$ is a precision parameter and $\alpha(m,n)$ is the inverse Ackermann function, a small constant in practice

Robustness to Time Scale
Simulation on a network of 200 nodes, 100 of which have a period of increased activity across outgoing edges
Our approach achieves high accuracy and precision in heterogeneous networks with high temporal variability
Methods based on discretizing time only perform well when the activity rate matches the time scale of analysis
Z-Score: number of standard deviations above the mean of the previous values
Daily: 10 edges, normal rate 1/day, 12 hours of correlated activity at 10x normal rate
Varying parameters: random number of edges (10-100), normal activity rate from 1/min to 1/30 days, correlated activity at 5-10x normal rate (using a logarithmic distribution to encourage sampling from the full range of time scales)

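The Z-Score baseline operates on per-block event counts. A minimal sketch follows, assuming the population standard deviation and requiring at least two previous values; the slide gives only the one-line definition, so both choices, the function name `z_scores`, and the counts are assumptions.

```python
from statistics import mean, pstdev

def z_scores(counts):
    """z[i] = (counts[i] - mean(counts[:i])) / std(counts[:i]).

    Entries are None where fewer than two previous values exist or where
    the previous values have zero spread.
    """
    out = []
    for i, c in enumerate(counts):
        prev = counts[:i]
        if len(prev) < 2 or pstdev(prev) == 0:
            out.append(None)
        else:
            out.append((c - mean(prev)) / pstdev(prev))
    return out

counts = [1, 3, 1, 3, 10]   # hypothetical events per time block
z = z_scores(counts)
```

The last block sits 8 standard deviations above the mean of the earlier blocks. Because the input is a sequence of block counts, the score inherits whatever block length was chosen.
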
Anomaly Detection in IP Traffic
LBNL network trace, ~9 million packets sent between ~3000 nodes during a 1-hour time span
Compare to total traffic and labeled "scanning activity"
Scanning at 12:07 due to DNS and NBNS lookups
Peak in $\mathrm{Rec}(t)$ at 12:26 was not flagged by the analysts since the sequence of IP addresses was not monotonic
[Figure: Rec(t) over time]

Change Detection in Email
Enron corpus, ~5000 emails sent between ~1000 Enron employees over a 2-year span
Compare L-CORE to GraphScope [Sun et al., KDD '07]
The algorithms identify similar change points, but our approach has shorter detection latency

Event Detection in Proximity Data
RealityMining Bluetooth proximity, ~100k interactions between ~100 mobile devices at MIT over 9 months
L-CORE has spikes in correlated activity on days 92-94
G-CORE shows an unusually large set of nodes with many recent pairwise interactions
This corresponds to the last 3 days of the Fall semester
Reality Mining dataset, Eagle et al.; ~100 nodes, ~3k edges, ~100k events
[Figure: panels titled "Day 100, 12pm" and "Day 93, 6pm"]

Summary of Contributions
A formal definition of recency that is time-scale invariant
L-CORE: a streaming local algorithm for detecting correlated recent activity among a fixed set of entities
G-CORE: an efficient global algorithm that simultaneously detects sets of entities exhibiting correlated activity in disparate parts of the network
Applications to a variety of real-world domains

Future Work
In contexts where additional meta-data is available, incorporate geospatial, textual, or other attributes
Generalize our model to support multiple-recipient emails, broadcast messages (e.g. tweets), or bipartite graphs (e.g. purchase networks, recommender systems)
Related Projects
Infer pairwise influence between nodes based on the times of their respective activity (MILCOM)
Discover functional communities in social media (DMSN)
Detect cascades when textual content is not available

Acknowledgments
Part of this work was conducted at Lawrence Livermore National Lab, under the guidance of Tina Eliassi-Rad
This project was partially supported by a DHS Career Development Grant, under the auspices of CCICADA, a DHS Center of Excellence
My travel was partially funded by DIMACS, the Center for Discrete Math & Theoretical Computer Science
