1
Workshop on Data Mining in Networks (DaMNet) @ ICDM 2015
Efficient and Time Scale-Invariant Detection of Correlated Activity in Communication Networks
Brian Thompson and James Abello
2
Detecting Correlated Activity in Communication Networks
Problem Description
Setup: A communication network, i.e. a set of entities that interact with one another at specific moments in time
Goal: Identify times and parts of the network with an unexpectedly high concentration of recent activity
Challenges:
  Scalability: data accumulates, so a concise representation is needed
  Efficiency: the data rate is high and the information is time-sensitive
  Variability: entities have different temporal dynamics
3
Network Representation
Given a set of entities: Muthu, Rebecca, Paul, Danfeng, Hanghang
and a stream of pairwise interactions (columns: Node 1, Node 2, Timestamp):
  Muthu, Rebecca, 8:30 AM
  Paul, 9:00 AM
  Danfeng, 9:15 AM
  Hanghang, 2:00 PM
4
Network Representation
For each pair of nodes (directed or undirected), we extract a time sequence:
  (Muthu, Rebecca): t1, t2, t3, t4, t5
5
Network Representation
We can visualize the network like this:
[Figure: network over the nodes Muthu, Danfeng, Paul, Rebecca, and Hanghang]
6
Related Work
Time series analysis
Sequence of “summary graphs”
[Figure: summary graphs at t = 1, t = 2, t = 3, t = 4]
7
Time-scale Bias
Q: What is the “right” time scale?
A: In a heterogeneous network, there is none!
The result of any analysis depends on the length of the “time blocks,” a phenomenon we call time-scale bias.
[Figure: event counts for Time Sequence 1 and Time Sequence 2 at block lengths of 10 min. and 30 min.]
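To make the time-scale bias concrete, the sketch below bins one event sequence at two block lengths. The timestamps and the helper name `block_counts` are illustrative assumptions, not data from the talk.

```python
from collections import Counter
from typing import Iterable, List


def block_counts(timestamps_min: Iterable[float], block_len_min: float,
                 n_blocks: int) -> List[int]:
    """Count events per fixed-length time block (hypothetical helper)."""
    counts = Counter(int(t // block_len_min) for t in timestamps_min)
    return [counts.get(b, 0) for b in range(n_blocks)]


# Hypothetical one-hour window (minutes): a burst of six events between
# minutes 25 and 35, plus two background events.
events = [5, 25, 27, 29, 31, 33, 35, 55]

ten_min = block_counts(events, 10, 6)     # six 10-minute blocks
thirty_min = block_counts(events, 30, 2)  # two 30-minute blocks
print(ten_min, thirty_min)
```

With 10-minute blocks the burst appears as two adjacent blocks of three events each. With 30-minute blocks it straddles the block boundary, both blocks hold four events, and the same data looks uniform.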
8
Our Approach
Use a streaming stochastic model to concisely represent communication between each node pair
Define a notion of “recent” communication that is time-scale invariant
Apply statistical tests to detect sets of nodes with an unexpectedly high concentration of recent activity
9
Stochastic Model
A renewal process $\Phi$ generates a sequence of events whose inter-arrival times are sampled independently at random from the same positive distribution.
[Figure: inter-arrival time distribution on the range xmin to xmax, and a time sequence t1, ..., t5 with one inter-arrival time marked as $t_3 - t_2$]
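As a rough illustration of the renewal process definition, the following sketch generates event times by accumulating independent draws from a single positive distribution. The function name `simulate_renewal` and the uniform distribution on [1, 10] are assumptions chosen for the example; the talk does not specify the distribution family here.

```python
import itertools
import random
from typing import Callable, List


def simulate_renewal(sample_iat: Callable[[], float], n_events: int,
                     start: float = 0.0) -> List[float]:
    """Return n_events event times whose gaps are i.i.d. draws from sample_iat."""
    gaps = [sample_iat() for _ in range(n_events)]
    if any(g <= 0 for g in gaps):
        raise ValueError("inter-arrival times must be positive")
    return [start + s for s in itertools.accumulate(gaps)]


rng = random.Random(0)
# Assumed distribution for illustration: uniform on [x_min, x_max] = [1, 10].
times = simulate_renewal(lambda: rng.uniform(1.0, 10.0), n_events=5)
# Recover the inter-arrival times, e.g. gaps[2] = t3 - t2.
gaps = [b - a for a, b in zip([0.0] + times[:-1], times)]
```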
10
Stochastic Model
Given an observed time sequence of communication between a pair of nodes in the network, we can estimate the parameters of the renewal process most likely to have generated it.
Streaming parameter estimation means efficient updates.
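One way to picture streaming estimation is a constant-size summary per node pair that is updated in O(1) time per event. The sketch below assumes exponential inter-arrival times, only because the maximum-likelihood rate then has a simple closed form (number of gaps divided by their sum); the class name `StreamingIATModel` is hypothetical and the talk's actual distribution family may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamingIATModel:
    """Constant-size summary of one node pair's inter-arrival times.

    Assumption: exponential inter-arrival times (illustrative choice).
    """
    last_time: Optional[float] = None
    n_gaps: int = 0
    total_gap: float = 0.0

    def update(self, t: float) -> None:
        """Fold one new event time into the summary in O(1)."""
        if self.last_time is not None:
            gap = t - self.last_time
            if gap <= 0:
                raise ValueError("event times must be strictly increasing")
            self.n_gaps += 1
            self.total_gap += gap
        self.last_time = t

    @property
    def rate(self) -> Optional[float]:
        """Maximum-likelihood event rate, or None before the first gap."""
        return self.n_gaps / self.total_gap if self.n_gaps else None


model = StreamingIATModel()
for t in [0.0, 2.0, 3.0, 7.0, 8.0]:
    model.update(t)
```

Only the last event time and two running totals are stored, so memory per node pair does not grow with the number of events.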
11
Recency
A natural choice of recency function is the age of a renewal process: the time elapsed since the last event, denoted $\mathrm{Age}_\Phi(t)$.
However, the most frequent communicators will always seem “recent,” overshadowing others’ behavior.
[Figure: $\mathrm{Age}_\Phi(t)$ plotted over a time sequence $T_\Phi$: t1, ..., t5, with example timelines labeled "Router Traffic" and "Correlated Activity"]
12
Recency
We define recency using the probability integral transform:
$$\mathrm{Rec}_\Phi(t) = 1 - F_\Phi^{\mathrm{Age}^*}\left(\mathrm{Age}_\Phi(t)\right)$$
where $F_\Phi^{\mathrm{Age}^*}$ is the limit distribution of the Age function.
$\mathrm{Rec}_\Phi$ normalizes the Age function with respect to the node pair’s typical frequency of activity, i.e. $\mathrm{Rec}_\Phi \sim \mathrm{Uniform}(0,1)$.
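The normalization can be sketched with a plug-in estimate. For a renewal process with inter-arrival CDF $F$ and mean $\mu$, the limiting age distribution has CDF $\frac{1}{\mu}\int_0^a (1 - F(x))\,dx$; substituting the empirical distribution of the observed gaps $x_1, \ldots, x_k$ gives $\sum_i \min(x_i, a) / \sum_i x_i$. Using the empirical distribution is an assumption made for this example, and the function names are hypothetical.

```python
from typing import Sequence


def limiting_age_cdf(age: float, iats: Sequence[float]) -> float:
    """Plug-in estimate of the limiting age CDF from observed gaps.

    Equals sum(min(x_i, age)) / sum(x_i), i.e. (1/mu) * integral of the
    empirical survival function from 0 to age.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    return sum(min(x, age) for x in iats) / sum(iats)


def recency(t: float, last_event: float, iats: Sequence[float]) -> float:
    """Rec(t) = 1 - F_Age*(Age(t)), with Age(t) = t - last_event."""
    return 1.0 - limiting_age_cdf(t - last_event, iats)


# Same age (1 time unit), different typical gaps (hypothetical data):
frequent = recency(t=11.0, last_event=10.0, iats=[1.0, 1.0, 1.0, 1.0])
infrequent = recency(t=11.0, last_event=10.0, iats=[10.0, 10.0, 10.0])
print(frequent, infrequent)
```

A pair that usually communicates every time unit is no longer "recent" after one unit of silence, whereas a pair that usually communicates every ten units still is, which is the intended normalization.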
13
Recency
We define the recency of a set of processes $\Phi_1, \ldots, \Phi_n$ at time $t$ using the Kolmogorov-Smirnov test:
$$\mathrm{Rec}_{\Phi_1,\ldots,\Phi_n}(t) = 1 - p_{KS}\left(\{\mathrm{Rec}_{\Phi_i}(t)\} \,\|\, \mathrm{Uniform}(0,1)\right)$$
$H_0$: the values are i.i.d. samples from $\mathrm{Uniform}(0,1)$
The p-value, $p_{KS}$, is the probability of getting a max distance at least as large as $d_{KS}$ under $H_0$.
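A minimal sketch of the set-level score follows, assuming a two-sided one-sample KS statistic and Kolmogorov's asymptotic series for the p-value. The asymptotic p-value is only a rough approximation for small $n$, and a two-sided statistic also reacts to unusually low recencies; whether the talk uses an exact or one-sided test is not stated, so treat both choices and all function names as assumptions.

```python
import math
from typing import Sequence


def ks_distance_uniform(values: Sequence[float]) -> float:
    """Max distance between the empirical CDF of values and the Uniform(0,1) CDF."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def ks_pvalue_asymptotic(d: float, n: int, terms: int = 100) -> float:
    """Approximate P(D_n >= d) under H0 via 2 * sum (-1)^(k-1) exp(-2 k^2 n d^2)."""
    if d <= 0:
        return 1.0
    lam = n * d * d
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))  # clamp rounding error


def set_recency(values: Sequence[float]) -> float:
    """Rec of a set of processes: 1 minus the KS p-value of their recencies."""
    return 1.0 - ks_pvalue_asymptotic(ks_distance_uniform(values), len(values))


clustered = set_recency([0.9, 0.95, 0.99])           # all pairs very recent
spread = set_recency([0.125, 0.375, 0.625, 0.875])   # looks uniform
```

Here `clustered` comes out close to 1 and `spread` close to 0, matching the intuition that only an unexpected concentration of recent activity should score highly.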
14
The L-CORE Algorithm
Local algorithm for detecting CORrelated Events
For a given set of node pairs $E$, maintain the inter-arrival time (IAT) distribution of communication between each pair
Every time there is communication activity:
  Update the corresponding IAT distribution
  Output $\mathrm{Rec}_E(t)$ and the most recent node pairs
[Figure: example with node $u_1$ and nodes $u_2, \ldots, u_5$, edge recencies 0.8, 1.0, 0.3, 0.9; $\mathrm{Rec}_E(t) = 0.943$ and $u_1 \mapsto \{u_2, u_3, u_5\}$]
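The per-event loop described for L-CORE might look roughly like the sketch below. It is not the authors' implementation: the class name `LCoreSketch` is hypothetical, the per-pair model assumes exponential gaps (so pair recency is `exp(-rate * age)`), and the set-level score is delegated to a caller-supplied `aggregate` function, which in the setting of the talk would be the KS-based set recency.

```python
import math
from typing import Callable, Dict, Hashable, Iterable, List, Sequence, Tuple

Pair = Tuple[Hashable, Hashable]


class LCoreSketch:
    """Per-event bookkeeping for a fixed set of node pairs E (illustrative)."""

    def __init__(self, pairs: Iterable[Pair],
                 aggregate: Callable[[Sequence[float]], float]) -> None:
        # Per pair: [time of last event, number of gaps, sum of gaps].
        self.stats: Dict[Pair, List[float]] = {
            p: [math.nan, 0, 0.0] for p in pairs
        }
        self.aggregate = aggregate  # hypothetical hook for the set-level score

    def pair_recency(self, pair: Pair, t: float) -> float:
        last, n, total = self.stats[pair]
        if n == 0 or total <= 0:
            return 0.0  # assumption: no rate estimate yet, treat as not recent
        # Assumption: exponential gaps, so limiting age CDF is 1 - exp(-rate * age).
        return math.exp(-(n / total) * (t - last))

    def observe(self, pair: Pair, t: float) -> Tuple[float, List[Pair]]:
        """Update one pair's model, then score all pairs: O(|E|) per event."""
        s = self.stats[pair]
        if not math.isnan(s[0]):
            s[1] += 1
            s[2] += t - s[0]
        s[0] = t
        rec = {p: self.pair_recency(p, t) for p in self.stats}
        ranked = sorted(rec, key=rec.get, reverse=True)
        return self.aggregate(list(rec.values())), ranked
```

Each call to `observe` returns the set-level score together with the pairs ordered from most to least recent, mirroring the two outputs listed for the algorithm.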
15
The G-CORE Algorithm
Global algorithm for detecting CORrelated Events
Construct a graph on $\mathcal{U}$, with $w(u, u') = \mathrm{Rec}_{u,u'}(t)$
Initialize a disjoint set data structure on the nodes
Run a variant of the Union-Find algorithm, keeping track of the subgraphs with highest recency
[Figure: example graph on $u_1, \ldots, u_5$ with edge weights 0.9, 0.75, 0.7, 0.1, 0.5, 0.3, and the resulting merge tree]
Node set                    Rec(t)
{u1, u2}                    0.900
{u1, u2, u3}                0.973
{u4, u5}                    0.500
{u1, u2, u3, u4, u5}        0.421
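The slide does not spell out the Union-Find variant, so the following is only one plausible reading: process edges in decreasing order of recency (Kruskal-style), merge components with a disjoint-set structure, and score each component as it forms. The function name `merge_by_recency` and the `score` hook are hypothetical; in the setting of the talk, `score` would be the set-level recency of the edge recencies inside the component.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Iterable, Sequence, Tuple

Edge = Tuple[Hashable, Hashable, float]


def merge_by_recency(
    edges: Iterable[Edge],
    score: Callable[[Sequence[float]], float],
) -> Dict[FrozenSet[Hashable], float]:
    """Return a score for every node set formed while merging (illustrative)."""
    edges = list(edges)
    parent: dict = {}
    members: dict = {}
    weights: dict = {}  # root -> recencies of edges inside its component
    for u, v, _ in edges:
        for x in (u, v):
            if x not in parent:
                parent[x], members[x], weights[x] = x, {x}, []

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    scored: Dict[FrozenSet[Hashable], float] = {}
    # Assumption: highest-recency edges are merged first.
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            if len(members[ru]) < len(members[rv]):
                ru, rv = rv, ru  # union by size
            parent[rv] = ru
            members[ru] |= members.pop(rv)
            weights[ru] += weights.pop(rv)
        weights[ru].append(w)
        scored[frozenset(members[ru])] = score(weights[ru])
    return scored
```

Sorting the returned dictionary by value yields the candidate node sets with the highest scores, which corresponds to "keeping track of the subgraphs with highest recency" under the assumptions above.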
16
Complexity
Let $n = |\mathcal{U}|$, and let $m$ be the number of node pairs that have ever communicated.
Streaming model: $O(m)$ space, $O(1)$ update per event
L-CORE: $O(|E|)$ time per event, where $E$ is the set of node pairs of interest
G-CORE: worst-case $O(n \cdot m)$ time
Heuristic G-CORE: $O\left(m \cdot \left(\frac{1}{\delta} + \alpha(m,n)\right)\right)$ time, where $0 < \delta < 1$ is a precision parameter and $\alpha(m,n)$ is the inverse Ackermann function, a small constant in practice
17
Robustness to Time Scale
Simulation on a network of 200 nodes, 100 of which have a period of increased activity across outgoing edges
Our approach achieves high accuracy and precision in heterogeneous networks with high temporal variability
Methods based on discretizing time only perform well when the activity rate matches the time scale of analysis
Z-Score: the number of standard deviations above the mean of the previous values
Daily: 10 edges, normal rate 1/day, 12 hours of correlated activity at 10x the normal rate
Varying parameters: random number of edges, normal activity rate between 1/min and 1/30 days (sampled from a logarithmic distribution to encourage sampling from the full range of time scales), correlated activity at 5-10x the normal rate
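For reference, the Z-Score baseline as defined above can be sketched as follows. The use of the population standard deviation and the handling of a zero-variance history are assumptions; the slide does not specify either, and the function name `z_score` is hypothetical.

```python
import math
import statistics
from typing import Sequence


def z_score(previous: Sequence[float], current: float) -> float:
    """Standard deviations by which current exceeds the mean of previous values."""
    mean = statistics.fmean(previous)
    std = statistics.pstdev(previous)  # assumption: population std. dev.
    if std == 0:
        # Assumption: a flat history makes any deviation infinitely surprising.
        return 0.0 if current == mean else math.copysign(math.inf, current - mean)
    return (current - mean) / std


# Hypothetical per-block event counts, followed by a new block with 11 events.
history = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_score(history, 11)
```

Because the counts are computed per time block, this baseline inherits the dependence on block length that the recency-based approach is designed to avoid.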
18
Anomaly Detection in IP Traffic
LBNL network trace, ~9 million packets sent between ~3000 nodes during a 1-hour time span
Compare to total traffic and labeled “scanning activity”
Scanning at 12:07 due to DNS and NBNS lookups
The peak in $\mathrm{Rec}(t)$ at 12:26 was not flagged by the analysts, since the sequence of IP addresses was not monotonic
[Figure: Rec(t)]
19
Change Detection in Email
Enron corpus, ~5000 emails sent between ~1000 Enron employees over a 2-year span
Compare L-CORE to GraphScope [Sun et al., KDD ’07]
The algorithms identify similar change points, but our approach has shorter detection latency
20
Event Detection in Proximity Data
RealityMining Bluetooth proximity data, ~100k interactions between ~100 mobile devices at MIT over 9 months
L-CORE has spikes in correlated activity on days 92-94
G-CORE shows an unusually large set of nodes with many recent pairwise interactions
This corresponds to the last 3 days of the Fall semester
[Figure: Reality Mining dataset, Eagle et al.; ~100 nodes, ~3k edges, ~100k events; panels for Day 93, 6pm and Day 100, 12pm]
21
Summary of Contributions
A formal definition of recency that is time-scale invariant
L-CORE: a streaming local algorithm for detecting correlated recent activity among a fixed set of entities
G-CORE: an efficient global algorithm that simultaneously detects sets of entities exhibiting correlated activity in disparate parts of the network
Applications to a variety of real-world domains
22
Future Work
In contexts where additional metadata is available, incorporate geospatial, textual, or other attributes
Generalize our model to support multiple-recipient emails, broadcast messages (e.g. tweets), or bipartite graphs (e.g. purchase networks, recommender systems)
Related Projects
Infer pairwise influence between nodes based on the times of their respective activity (MILCOM)
Discover functional communities in social media (DMSN)
Detect cascades when textual content is not available
23
Acknowledgments
Part of this work was conducted at Lawrence Livermore National Lab, under the guidance of Tina Eliassi-Rad
This project was partially supported by a DHS Career Development Grant, under the auspices of CCICADA, a DHS Center of Excellence
My travel was partially funded by DIMACS, the Center for Discrete Math & Theoretical Computer Science
24
Contact Info: Brian Thompson