Workshop on Data Mining in Networks ICDM 2015


Efficient and Time Scale-Invariant Detection of Correlated Activity in Communication Networks
Brian Thompson and James Abello
Workshop on Data Mining in Networks (DaMNet) @ ICDM 2015

Problem Description
Setup: A communication network is a set of entities that interact with one another at specific moments in time.
Goal: Identify times and parts of the network with an unexpectedly high concentration of recent activity.
Challenges:
Scalability: data accumulates, so a concise representation is needed.
Efficiency: the data rate is high and the information is time-sensitive.
Variability: entities have different temporal dynamics.

Network Representation
Given a set of entities (Muthu, Rebecca, Paul, Danfeng, Hanghang) and a stream of pairwise interactions, each recorded as (Node 1, Node 2, Timestamp).
[Table: example interactions with timestamps 8:30 AM, 9:00 AM, 9:15 AM, and 2:00 PM; the node columns were only partially recovered (Muthu, Rebecca, Paul, Danfeng, Hanghang).]

Network Representation
For each pair of nodes (directed or undirected), we extract a time sequence of their interactions.
[Figure: time sequence t1, t2, t3, t4, t5 for the pair Muthu, Rebecca.]
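The extraction step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `extract_time_sequences`, the `directed` flag, and the sample events (timestamps given as minutes since midnight) are all assumptions made for the example.

```python
from collections import defaultdict

def extract_time_sequences(events, directed=False):
    """Group (node1, node2, timestamp) events into one sorted list per node pair."""
    sequences = defaultdict(list)
    for u, v, t in events:
        # For undirected pairs, (u, v) and (v, u) share a single key.
        key = (u, v) if directed else tuple(sorted((u, v)))
        sequences[key].append(t)
    for times in sequences.values():
        times.sort()
    return dict(sequences)

# Hypothetical events; timestamps are minutes since midnight.
events = [
    ("Muthu", "Rebecca", 510),
    ("Rebecca", "Muthu", 540),
    ("Paul", "Danfeng", 555),
    ("Muthu", "Rebecca", 840),
]
sequences = extract_time_sequences(events)
directed_sequences = extract_time_sequences(events, directed=True)
```

In the undirected case both directions of the Muthu and Rebecca exchange land in one sequence; with `directed=True` they are kept apart.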

Network Representation
We can visualize the network as a graph.
[Figure: network of Muthu, Danfeng, Paul, Rebecca, and Hanghang.]

Related Work
Time series analysis
Sequence of “summary graphs”
[Figure: summary graphs at t = 1, t = 2, t = 3, t = 4.]

Time-scale Bias
Q: What is the “right” time scale?
A: In a heterogeneous network, there is none!
The result of any analysis depends on the length of the “time blocks,” a phenomenon we call time-scale bias.
[Figure: per-block event counts for Time Sequence 1 and Time Sequence 2 at block lengths of 10 min. and 30 min., with check marks and question marks annotating the counts.]
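A small sketch can make the bias concrete. The function `block_counts` and the timestamps below are hypothetical, chosen only to show how the same sequence looks under two block lengths; they are not the values from the slide's figure.

```python
def block_counts(timestamps, block_length, horizon):
    """Count events in consecutive blocks of width block_length over [0, horizon)."""
    n_blocks = -(-horizon // block_length)  # ceiling division
    counts = [0] * n_blocks
    for t in timestamps:
        if 0 <= t < horizon:
            counts[int(t // block_length)] += 1
    return counts

# Hypothetical sequence (minutes): one event every 10 minutes,
# plus a burst of six events between minutes 55 and 65.
timestamps = [5, 15, 25, 35, 45, 55, 57, 59, 61, 63, 65, 75, 85]

counts_10 = block_counts(timestamps, 10, 90)  # burst stands out: 3 vs. 1
counts_30 = block_counts(timestamps, 30, 90)  # burst is split across two blocks
```

With 10-minute blocks the burst triples the usual count. With 30-minute blocks it straddles a block boundary and the two affected blocks differ only mildly from the first one, so a detector tuned to one block length can miss what another would catch.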

Our Approach
Use a streaming stochastic model to concisely represent communication between each node pair.
Define a notion of “recent” communication that is time-scale invariant.
Apply statistical tests to detect sets of nodes with an unexpectedly high concentration of recent activity.

Stochastic Model
A renewal process Φ generates a sequence of events whose inter-arrival times are sampled independently at random from the same positive distribution.
[Figure: inter-arrival time distribution with axis marks xmin and xmax; time sequence t1, ..., t5 with one inter-arrival time marked as $t_3 - t_2$.]

Stochastic Model
Given an observed time sequence of communication between a pair of nodes in the network, we can estimate the parameters of the renewal process that is most likely to have generated it.
Because parameter estimation is done in a streaming fashion, updates are efficient.
[Figure: inter-arrival time distribution with axis marks xmin and xmax; time sequence t1, ..., t5 with one inter-arrival time marked as $t_3 - t_2$.]
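The slides do not name the distribution family or the estimator, so the following is only a sketch of what constant-time streaming updates can look like. It assumes that a running count, mean, variance (Welford's method), minimum, and maximum of the inter-arrival times are the quantities of interest; the class name `StreamingIATStats` and its fields are hypothetical.

```python
class StreamingIATStats:
    """Running summary of inter-arrival times with O(1) work per event."""

    def __init__(self):
        self.last_time = None
        self.count = 0           # number of inter-arrival times seen
        self.mean = 0.0
        self.m2 = 0.0            # sum of squared deviations (Welford)
        self.x_min = float("inf")
        self.x_max = 0.0

    def update(self, t):
        """Record an event at time t (events must arrive in time order)."""
        if self.last_time is not None:
            x = t - self.last_time
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)
            self.x_min = min(self.x_min, x)
            self.x_max = max(self.x_max, x)
        self.last_time = t

    @property
    def variance(self):
        """Sample variance of the inter-arrival times (0.0 with fewer than two)."""
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0

stats = StreamingIATStats()
for t in [0, 2, 6, 12]:  # hypothetical timestamps; inter-arrival times 2, 4, 6
    stats.update(t)
```

Each pair needs only a handful of numbers, which is consistent with the space and update costs quoted on the Complexity slide.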

Recency
A natural choice of recency function is the age of a renewal process: the time elapsed since the last event, denoted $\mathrm{Age}_\Phi(t)$.
However, the most frequent communicators will always seem “recent,” overshadowing others’ behavior.
[Figure: time sequence $T_\Phi$ with events t1, ..., t5 and $\mathrm{Age}_\Phi(t)$ at time t; timelines contrasting router traffic with correlated activity.]

Recency
We define recency using the probability integral transform:
$\mathrm{Rec}_\Phi(t) = 1 - F_\Phi^{\mathrm{Age}^*}\big(\mathrm{Age}_\Phi(t)\big)$
where $F_\Phi^{\mathrm{Age}^*}$ is the limit distribution of the Age function.
$\mathrm{Rec}_\Phi$ normalizes the Age function with respect to the node pair’s typical frequency of activity, i.e. $\mathrm{Rec}_\Phi \sim \mathrm{Uniform}(0,1)$.
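As an illustration, the recency of a single pair can be computed from observed inter-arrival times. The sketch relies on a standard renewal-theory result: the limiting age distribution is $F^{\mathrm{Age}^*}(a) = \frac{1}{\mu}\int_0^a (1 - F(x))\,dx$, which equals $E[\min(a, X)] / E[X]$. Plugging in the empirical inter-arrival sample, rather than a fitted parametric model, is an assumption of this example, and the function name `recency` is hypothetical.

```python
def recency(age, inter_arrival_times):
    """Rec = 1 - F_Age*(age), with F_Age*(a) = E[min(a, X)] / E[X]
    estimated from the observed inter-arrival times."""
    total = sum(inter_arrival_times)
    if total <= 0:
        raise ValueError("need inter-arrival times with a positive sum")
    covered = sum(min(age, x) for x in inter_arrival_times)
    return 1.0 - covered / total

# Hypothetical pairs: one communicates every minute, one every 100 minutes.
fast_pair = [1, 1, 1]
slow_pair = [100, 100, 100]

rec_fast = recency(1, fast_pair)   # one minute of silence is already "old"
rec_slow = recency(1, slow_pair)   # one minute of silence is still very recent
```

The same age of one minute maps to a recency of 0 for the frequent pair and 0.99 for the infrequent pair, which is the normalization the slide describes.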

Recency
We define the recency of a set of processes $\Phi_1, \ldots, \Phi_n$ at time $t$ using the Kolmogorov-Smirnov test:
$\mathrm{Rec}_{\Phi_1,\ldots,\Phi_n}(t) = 1 - p_{KS}\big(\{\mathrm{Rec}_{\Phi_i}(t)\} \,\|\, \mathrm{Uniform}(0,1)\big)$
$H_0$: the values are i.i.d. samples from $\mathrm{Uniform}(0,1)$.
The p-value, $p_{KS}$, is the probability of getting a max distance at least as large as $d_{KS}$ under $H_0$.
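The set-level score can be sketched without external libraries. Two assumptions are made here that the slides do not confirm: the statistic is the two-sided KS distance, and the p-value uses the asymptotic Kolmogorov series, which is only a rough approximation for small sets. The function names are hypothetical.

```python
import math

def ks_statistic_uniform(samples):
    """Two-sided KS distance between the sample's empirical CDF and Uniform(0,1)."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)

def ks_pvalue_asymptotic(d, n, terms=100):
    """Asymptotic p-value 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 n d^2)."""
    lam = math.sqrt(n) * d
    if lam < 0.05:
        return 1.0  # series converges slowly here; the p-value is essentially 1
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))

def set_recency(recencies):
    """Rec of a set of processes: 1 - p_KS of their individual recencies."""
    d = ks_statistic_uniform(recencies)
    return 1.0 - ks_pvalue_asymptotic(d, len(recencies))

clustered = set_recency([0.9, 0.95, 0.99])        # all pairs recently active
spread = set_recency([0.1, 0.3, 0.5, 0.7, 0.9])   # looks like Uniform(0,1)
```

A set whose individual recencies cluster near 1 scores close to 1, while a set that looks uniform scores close to 0. Note that a two-sided test also reacts to recencies clustered near 0; a one-sided variant may be closer to the intended behavior.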

The L-CORE Algorithm
L-CORE: Local algorithm for detecting CORrelated Events.
For a given set of node pairs $E$, maintain the inter-arrival time (IAT) distribution of communication between each pair.
Every time there is communication activity:
Update the corresponding IAT distribution.
Output $\mathrm{Rec}_E(t)$ and the most recent node pairs.
[Figure: node $u_1$ with neighbors $u_2$, $u_3$, $u_4$, $u_5$ and edge recency values 0.8, 1.0, 0.3, and 0.9; output $\mathrm{Rec}_E(t) = 0.943$ and $u_1 \mapsto u_2, u_3, u_5$.]
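The per-event loop can be sketched as follows. This is a simplified illustration rather than the published algorithm: the class `LCoreSketch` is hypothetical, it stores every inter-arrival time instead of a constant-size summary, it scores each pair with the empirical recency $1 - E[\min(\mathrm{age}, X)]/E[X]$, and it returns only the ranked pairs, leaving the set-level score $\mathrm{Rec}_E(t)$ to a separate KS step.

```python
class LCoreSketch:
    """Track a fixed set of node pairs and rank them by recency on each event."""

    def __init__(self, pairs):
        self.iats = {pair: [] for pair in pairs}    # inter-arrival times per pair
        self.last = {pair: None for pair in pairs}  # last event time per pair

    def _recency(self, pair, t):
        xs = self.iats[pair]
        total = sum(xs)
        if self.last[pair] is None or total <= 0:
            return None  # not enough history to score this pair
        age = t - self.last[pair]
        return 1.0 - sum(min(age, x) for x in xs) / total

    def observe(self, pair, t):
        """Update on an event and return [(pair, recency)], most recent first."""
        if pair not in self.iats:
            return []  # pair is outside the set of interest
        if self.last[pair] is not None:
            self.iats[pair].append(t - self.last[pair])
        self.last[pair] = t
        scored = [(p, self._recency(p, t)) for p in self.iats]
        scored = [(p, r) for p, r in scored if r is not None]
        return sorted(scored, key=lambda pr: pr[1], reverse=True)

# Hypothetical stream over two pairs that share node u1.
a, b = ("u1", "u2"), ("u1", "u3")
monitor = LCoreSketch([a, b])
events = [(a, 0), (b, 0), (a, 10), (a, 20), (a, 30), (b, 100), (a, 105)]
for pair, t in events:
    ranked = monitor.observe(pair, t)
```

After the last event, pair `a` has age 0 and recency 1.0, while pair `b` last communicated 5 time units ago against a typical gap of 100, giving recency 0.95.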

The G-CORE Algorithm
G-CORE: Global algorithm for detecting CORrelated Events.
Construct a graph on $\mathcal{U}$, with $w(u, u') = \mathrm{Rec}_{(u,u')}(t)$.
Initialize a disjoint-set data structure on the nodes.
Run a variant of the Union-Find algorithm, keeping track of the subgraphs with highest recency.
[Figure: example graph on $u_1, \ldots, u_5$ with edge weights 0.9, 0.75, 0.7, 0.5, 0.3, and 0.1, and the resulting merge tree.]
Node set                    Rec(t)
{u1, u2}                    0.900
{u1, u2, u3}                0.973 (highlighted)
{u4, u5}                    0.500 (highlighted)
{u1, u2, u3, u4, u5}        0.421
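The slides do not spell out the Union-Find variant, so the following is only a sketch of the general idea under stated assumptions: edges are processed in decreasing order of recency, as in Kruskal's algorithm, and every merge records the newly formed node set as a candidate subgraph. Scoring each candidate with the set recency is left out. The function name `merge_by_recency` and the assignment of weights to specific edges are hypothetical.

```python
def merge_by_recency(nodes, weighted_edges):
    """Merge components along edges in decreasing weight order.

    Returns a list of (node_set, weight) pairs, one per merge, where weight
    is the recency of the edge that joined the two components.
    """
    parent = {u: u for u in nodes}
    members = {u: {u} for u in nodes}  # node set of each current root

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    merges = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # edge lies inside an existing component
        if len(members[ru]) < len(members[rv]):
            ru, rv = rv, ru  # attach the smaller component to the larger
        parent[rv] = ru
        members[ru] |= members.pop(rv)
        merges.append((frozenset(members[ru]), w))
    return merges

# Hypothetical edge assignment for the six weights shown on the slide.
nodes = ["u1", "u2", "u3", "u4", "u5"]
edges = [
    ("u1", "u2", 0.9), ("u2", "u3", 0.75), ("u1", "u3", 0.7),
    ("u4", "u5", 0.5), ("u3", "u4", 0.3), ("u2", "u5", 0.1),
]
merges = merge_by_recency(nodes, edges)
```

With this assignment the candidate node sets come out in the same order as the table on the slide. Copying the member sets at every merge is done for readability and costs more than a plain disjoint-set structure would.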

Complexity
Let $n = |\mathcal{U}|$, and let $m$ be the number of node pairs that have ever communicated.
Streaming model: $O(m)$ space, $O(1)$ update per event.
L-CORE: $O(|E|)$ time per event, where $E$ is the set of node pairs of interest.
G-CORE: worst-case $O(n \cdot m)$ time.
Heuristic G-CORE: $O\big(m \cdot (1/\delta + \alpha(m,n))\big)$ time, where $0 < \delta < 1$ is a precision parameter and $\alpha(m,n)$ is the inverse Ackermann function, a small constant in practice.

Robustness to Time Scale
Simulation on a network of 200 nodes, 100 of which have a period of increased activity across outgoing edges.
Our approach achieves high accuracy and precision in heterogeneous networks with high temporal variability.
Methods based on discretizing time perform well only when the activity rate matches the time scale of analysis.
Z-Score: the number of standard deviations above the mean of the previous values.
Daily: 10 edges, normal rate 1/day, 12 hours of correlated activity at 10x the normal rate.
Varying parameters: a random number of edges between 10 and 100, a normal activity rate between 1/min and 1/30 days, and correlated activity at 5-10x the normal rate (using a logarithmic distribution to encourage sampling from the full range of time scales).

Anomaly Detection in IP Traffic
LBNL network trace: ~9 million packets sent between ~3000 nodes during a 1-hour time span.
Compare to total traffic and labeled “scanning activity.”
Scanning at 12:07 is due to DNS and NBNS lookups.
The peak in $\mathrm{Rec}(t)$ at 12:26 was not flagged by the analysts, since the sequence of IP addresses was not monotonic.
[Figure: Rec(t) over time.]

Change Detection in Email
Enron corpus: ~5000 emails sent between ~1000 Enron employees over a 2-year span.
Compare L-CORE to GraphScope [Sun et al., KDD ’07].
The algorithms identify similar change points, but our approach has shorter detection latency.

Event Detection in Proximity Data
Reality Mining Bluetooth proximity dataset (Eagle et al.): ~100k interactions between ~100 mobile devices at MIT over 9 months (~100 nodes, ~3k edges, ~100k events).
L-CORE has spikes in correlated activity on days 92-94.
G-CORE shows an unusually large set of nodes with many recent pairwise interactions.
This corresponds to the last 3 days of the Fall semester.
[Figure: network snapshots at Day 93, 6pm and Day 100, 12pm.]

Summary of Contributions
A formal definition of recency that is time-scale invariant.
L-CORE: a streaming local algorithm for detecting correlated recent activity among a fixed set of entities.
G-CORE: an efficient global algorithm that simultaneously detects sets of entities exhibiting correlated activity in disparate parts of the network.
Applications to a variety of real-world domains.

Future Work
In contexts where additional metadata is available, incorporate geospatial, textual, or other attributes.
Generalize our model to support multiple-recipient emails, broadcast messages (e.g. tweets), or bipartite graphs (e.g. purchase networks, recommender systems).
Related Projects
Infer pairwise influence between nodes based on the times of their respective activity (MILCOM).
Discover functional communities in social media (DMSN).
Detect cascades when textual content is not available.

Acknowledgments
Part of this work was conducted at Lawrence Livermore National Lab, under the guidance of Tina Eliassi-Rad.
This project was partially supported by a DHS Career Development Grant, under the auspices of CCICADA, a DHS Center of Excellence.
My travel was partially funded by DIMACS, the Center for Discrete Math & Theoretical Computer Science.

Contact Info
Brian Thompson
bthompso8784@gmail.com
http://pidancer.com