Models and Algorithms for Event-Driven Networks: PhD Defense, Brian Thompson

Presentation transcript:

Models and Algorithms for Event-Driven Networks
PhD Defense
Brian Thompson
Committee: Muthu Muthukrishnan (advisor), Danfeng Yao (Virginia Tech), Rebecca Wright, Paul Kantor, Hanghang Tong (CUNY City College)
December 19, 2013
Rutgers University

What is an event-driven network?

Outline
We consider three problems that arise in the study of event-driven networks:
1. Detecting correlated events
2. Discovering functional communities
3. Modeling academic collaboration

Themes
Temporal dynamics
Group behavior
Attribution
Computational feasibility

Detecting Correlated Events in Communication Networks
Joint work with James Abello

Problem Description
Setup: An event-driven network, where events indicate communication between two nodes
Goal: Identify parts of the network with an unexpectedly high concentration of recent activity
Challenges:
  Scalability – data accumulates, need concise representation
  Efficiency – high data rate, time-sensitive information
  Variability – entities have different temporal dynamics

Network Representation
Given an event-driven communication network:

Node 1    Node 2     Timestamp
Muthu     Rebecca    8:30 AM
Rebecca   Paul       9:00 AM
Muthu     Danfeng    9:15 AM
Paul      Hanghang   2:00 PM

[Figure: network with nodes Muthu, Rebecca, Paul, Danfeng, Hanghang]

Network Representation
For each pair of nodes (could be directed or undirected), we extract a time sequence:
t1, t2, t3, t4, t5 (shown for the pair Muthu, Rebecca)
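A minimal sketch of this extraction step, assuming events arrive as (node 1, node 2, timestamp) triples as in the table on the previous slide. The function name `time_sequences` and the choice to encode timestamps as minutes since midnight are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict

def time_sequences(events, directed=False):
    """Group (node1, node2, timestamp) events into one sorted time sequence per node pair.

    Hypothetical helper for illustration. For undirected networks the pair
    is stored in sorted order so that (u, v) and (v, u) share one sequence.
    """
    sequences = defaultdict(list)
    for u, v, t in events:
        key = (u, v) if directed else tuple(sorted((u, v)))
        sequences[key].append(t)
    for timestamps in sequences.values():
        timestamps.sort()
    return dict(sequences)

# The four events from the example table, with times as minutes since midnight.
events = [
    ("Muthu", "Rebecca", 8 * 60 + 30),
    ("Rebecca", "Paul", 9 * 60),
    ("Muthu", "Danfeng", 9 * 60 + 15),
    ("Paul", "Hanghang", 14 * 60),
]
sequences = time_sequences(events)
```

Each node pair keeps only its own list of timestamps, which is the input to the per-pair model on the following slides.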

Network Representation
We can visualize the network like this:
[Figure: network with nodes Paul, Rebecca, Muthu, Danfeng, Hanghang]

Goal: Identify sets of nodes with an unexpectedly high concentration of recent activity
Question: How to define “recent”?
The most frequent communications will always seem “recent”, overshadowing others’ behavior. We call this time-scale bias.
[Figure labels: NOW, Router Traffic, Attack Traffic, Temporal Bias]

Related Work
Time series analysis
Sequence of “summary graphs” (t = 1, t = 2, t = 3, t = 4)

Our Approach
1. Use a streaming stochastic model to concisely represent communication between each node pair
2. Define a notion of “recent” communication that addresses time-scale bias
3. Apply a statistical test to detect correlated recent activity among a set of nodes

The REWARDS Model
REneWal theory Approach for Real-time Data Streams
For each pair of nodes in the network, estimate the parameters of the renewal process that is most likely to have generated the corresponding time sequence
Time sequence: t1, t2, t3, t4, t5
[Figure: inter-arrival time distribution, with x_min and x_max marked]
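The estimation step can be illustrated with a small sketch. The transcript does not preserve which inter-arrival distribution the REWARDS model fits, so the code below uses an exponential distribution purely as a stand-in: its maximum-likelihood rate is the reciprocal of the mean gap. The function names and the returned dictionary are hypothetical, and `x_min` / `x_max` here are just the smallest and largest observed gaps.

```python
def inter_arrival_times(timestamps):
    """Gaps between consecutive events of one node pair."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def fit_renewal_summary(timestamps):
    """Summarize one time sequence as a renewal process.

    Assumption for illustration only: gaps are treated as exponential,
    so the maximum-likelihood rate is 1 / (mean gap). The model in the
    talk may use a different distribution family.
    """
    gaps = inter_arrival_times(timestamps)
    if not gaps:
        raise ValueError("need at least two events to estimate inter-arrival times")
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap <= 0:
        raise ValueError("events must not all share the same timestamp")
    return {"x_min": min(gaps), "x_max": max(gaps), "rate": 1.0 / mean_gap}

summary = fit_renewal_summary([0, 2, 6, 12])
```

Only a few numbers per node pair need to be stored, which is what makes a model of this kind a concise representation of the stream.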

Recency
[Figure: timeline from 0 to the current time t, with events t1, t2, t3, t4, t5]
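The formal definition of recency is not preserved in this transcript. As a rough illustration of what a time-scale-invariant score can look like, the sketch below measures how short the time since the last event is relative to that pair's own history: the fraction of past gaps that were longer than the current waiting time. This empirical score and the name `empirical_recency` are assumptions for illustration, not the definition used in the talk.

```python
def empirical_recency(timestamps, now):
    """Fraction of past inter-arrival gaps longer than the time since the last event.

    Illustrative stand-in, not the talk's definition. The score is near 1
    right after an event and falls toward 0 as the silence grows unusually
    long for this particular node pair.
    """
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two events")
    if now < ordered[-1]:
        raise ValueError("now must not precede the last event")
    age = now - ordered[-1]
    return sum(1 for gap in gaps if gap > age) / len(gaps)

# Rescaling every time by the same factor leaves the score unchanged,
# so a fast pair and a slow pair are judged on their own time scales.
fast = empirical_recency([0, 2, 6, 12], now=15)
slow = empirical_recency([0, 20, 60, 120], now=150)
```

Because the score depends only on ratios of times, frequent communicators do not automatically look more recent than infrequent ones.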

The L-CORE Algorithm
Local algorithm for detecting CORrelated Events

The G-CORE Algorithm
Global algorithm for detecting CORrelated Events
2. Initialize a disjoint set data structure on the nodes
Run a variant of the Union-Find algorithm, keeping track of the subgraphs with highest recency
[Figure label: Node set]
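A minimal sketch of the Union-Find idea, under stated assumptions: each edge already has a recency score, edges are processed from most to least recent, and a component's score is the sum of the recency values of the edges merged into it. The processing order, the threshold, the scoring rule, and all names below are illustrative choices; the slide does not preserve these details of G-CORE.

```python
class DisjointSet:
    """Union-Find with path compression and union by size.

    Also tracks, per component, the total recency of the edges added to it
    (an assumed scoring rule, for illustration only).
    """

    def __init__(self, nodes):
        self.parent = {v: v for v in nodes}
        self.size = {v: 1 for v in nodes}
        self.weight = {v: 0.0 for v in nodes}

    def find(self, v):
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def add_edge(self, u, v, recency):
        root_u, root_v = self.find(u), self.find(v)
        if root_u == root_v:
            self.weight[root_u] += recency
            return root_u
        if self.size[root_u] < self.size[root_v]:
            root_u, root_v = root_v, root_u
        self.parent[root_v] = root_u
        self.size[root_u] += self.size[root_v]
        self.weight[root_u] += self.weight[root_v] + recency
        return root_u


def recent_components(nodes, edge_recency, threshold):
    """Merge nodes along edges whose recency is at least `threshold`.

    Returns (score, node set) pairs for components with more than one node,
    highest score first.
    """
    ds = DisjointSet(nodes)
    for (u, v), recency in sorted(edge_recency.items(), key=lambda kv: -kv[1]):
        if recency < threshold:
            break
        ds.add_edge(u, v, recency)
    groups = {}
    for v in nodes:
        groups.setdefault(ds.find(v), set()).add(v)
    scored = [(ds.weight[root], members) for root, members in groups.items() if len(members) > 1]
    return sorted(scored, key=lambda pair: -pair[0])


nodes = ["Muthu", "Rebecca", "Paul", "Danfeng", "Hanghang"]
# Hypothetical recency values, chosen only to exercise the code.
edge_recency = {
    ("Muthu", "Rebecca"): 0.9,
    ("Rebecca", "Paul"): 0.8,
    ("Muthu", "Danfeng"): 0.1,
    ("Paul", "Hanghang"): 0.7,
}
components = recent_components(nodes, edge_recency, threshold=0.5)
```

With path compression and union by size, each edge is handled in near-constant amortized time, so one pass over the sorted edges covers the whole network.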

Complexity

Robustness to Time Scale
Simulation: star network, 100 trials with normal activity, and 100 trials including a period of correlated activity
Our approach is robust to temporal variability

Detection Latency
Data: Enron corpus, ~1000 nodes and ~5000 events
The algorithms identify similar times of correlated activity, but our approach has shorter response time

Visualization
Output from G-CORE algorithm on the Bluetooth dataset at 12:00pm on Day 100

Summary of Contributions
REWARDS: a stochastic model for event-driven networks
A formal definition of recency that is time-scale invariant
L-CORE: a streaming local algorithm for detecting correlated recent activity among a given set of node pairs
G-CORE: an efficient global algorithm for detecting correlations throughout the network simultaneously

Discovering Functional Communities
Joint work with Linda Ness, David Shallcross, Devasis Bassu

Problem Description
Setup: An event-driven network, where events correspond to actions by a single node, each with an associated label
Goal: Identify functional communities of individuals who use the same labels
Challenges:
  Scalability – there may be many nodes and many labels
  Mixed membership – each node may be part of more than one community

Network Representation
Given a set of nodes and a collection of labeled events:
[Figure: nodes Paul, Rebecca, Muthu, Danfeng, Hanghang]

Network Representation
[Figure: rows Hanghang, Rebecca, Paul, Danfeng, Muthu, with a bicluster marked]
[Figure: rows reordered as Hanghang, Danfeng, Paul, Rebecca, Muthu]

Co-Clustering
Goal: Given a matrix, cluster the rows and columns simultaneously to reveal hidden structure
Challenges:
  Don’t know the number or sizes of clusters a priori
  Number of possible co-clusterings is exponential in the size of the matrix
[Figure: matrix partitioned into row clusters R1, R2 and column clusters C1, C2]

Related Work
Spectral methods use linear algebraic techniques such as SVD to fit a block diagonal structure
Usually require number of clusters to be pre-specified
Likely to perform well on the matrix on the left, but not the one on the right:
[Figure: two example matrices]

Our Approach
1. Define a quality metric for co-clusterings that rewards large, dense biclusters
2. Find a co-clustering that maximizes the metric value
NP-hard in general, so need efficient heuristics
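To make "rewards large, dense biclusters" concrete, here is a small sketch of one possible metric on a 0/1 matrix. The specific formula (bicluster area times density raised to a power `alpha`) is a hypothetical choice for illustration; the class of metrics actually used in the talk is not preserved in this transcript.

```python
def bicluster_stats(matrix, rows, cols):
    """Return (area, density) of the submatrix selected by `rows` and `cols`."""
    area = len(rows) * len(cols)
    if area == 0:
        return 0, 0.0
    ones = sum(matrix[r][c] for r in rows for c in cols)
    return area, ones / area

def toy_quality(matrix, biclusters, alpha=2.0):
    """Hypothetical metric: sum over biclusters of area * density**alpha.

    Larger blocks score higher, but only if they stay dense; alpha > 1
    penalizes diluting a dense block with empty rows or columns.
    """
    total = 0.0
    for rows, cols in biclusters:
        area, density = bicluster_stats(matrix, rows, cols)
        total += area * density ** alpha
    return total

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
two_blocks = toy_quality(matrix, [([0, 1], [0, 1]), ([2, 3], [2, 3])])
one_block = toy_quality(matrix, [([0, 1, 2, 3], [0, 1, 2, 3])])
```

On this example the two diagonal blocks score higher than the whole matrix taken as a single bicluster, which is the behavior such a metric is meant to capture.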

Choosing a Metric
[Figure labels: large, dense]
Property P1
Property P2

The CC-MACS Algorithm
Co-Clustering via Maximal Anti-Chain Search
1. Build randomized k-d trees on the rows and columns
2. Initialize maximal anti-chains as the leaves of each tree
3. Traverse the trees simultaneously from the bottom up, greedily merging the rows or columns that result in the greatest increase in the metric value
4. Output the co-clustering with the best metric value
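The anti-chain search in steps 2 to 4 can be sketched on a single tree. In the sketch below a maximal anti-chain is a set of tree nodes that together cover every leaf exactly once; it starts as the leaves and is coarsened by replacing two sibling nodes with their parent. This is a simplified illustration under several assumptions: it works on one hand-built binary tree rather than randomized k-d trees on both rows and columns, and the `score` function is supplied by the caller. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Tree node covering a set of row (or column) indices."""
    items: frozenset
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf(index: int) -> Node:
    return Node(frozenset([index]))


def join(a: Node, b: Node) -> Node:
    return Node(a.items | b.items, a, b)


def leaves(root: Node) -> List[Node]:
    if root.left is None or root.right is None:
        return [root]
    return leaves(root.left) + leaves(root.right)


def internal_nodes(root: Node) -> List[Node]:
    """All non-leaf nodes, children listed before their parents."""
    if root.left is None or root.right is None:
        return []
    return internal_nodes(root.left) + internal_nodes(root.right) + [root]


def greedy_antichain_search(
    root: Node, score: Callable[[List[Node]], float]
) -> Tuple[float, List[Node]]:
    """Coarsen the anti-chain bottom-up, one sibling merge at a time.

    At each step the merge giving the highest score is taken; the best
    anti-chain seen along the way is returned.
    """
    antichain = leaves(root)
    best_score, best = score(antichain), antichain
    while True:
        mergeable = [p for p in internal_nodes(root)
                     if p.left in antichain and p.right in antichain]
        if not mergeable:
            return best_score, best
        options = [[n for n in antichain if n not in (p.left, p.right)] + [p]
                   for p in mergeable]
        antichain = max(options, key=score)
        if score(antichain) > best_score:
            best_score, best = score(antichain), antichain


# Toy example: a score that prefers clusters of exactly two items.
tree = join(join(leaf(0), leaf(1)), join(leaf(2), leaf(3)))
prefer_pairs = lambda chain: -sum((len(n.items) - 2) ** 2 for n in chain)
best_score, best_chain = greedy_antichain_search(tree, prefer_pairs)
```

Restricting the search to anti-chains of a fixed tree is what keeps the number of candidate clusterings manageable: each step considers only sibling merges rather than arbitrary regroupings.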

Experiments: Synthetic Data

Experiments: Visual Comparison
Matrices with known structure, taken from the NIST Matrix Market repository
[Figure panels: Original Matrix, Randomly Permuted, Cross-Association]

Experiments: Web Memes
Meme-Tracker dataset of Leskovec et al.
Top biclusters returned by the CC-MACS algorithm:
Table columns: # of Domains, # of Memes, Density (%), Topic
Topics: St. Jude Children’s Hospital; Brazilian news; Spanish news; Tech news; Politics

Summary of Contributions
A new class of co-clustering metrics that reward large, dense biclusters
The CC-MACS algorithm, which efficiently searches the space of possible co-clusterings for one which maximizes the value of a given metric
Advantages over existing methods:
  Do not need to specify number of clusters in advance
  Not limited to matrices with a block diagonal structure

Modeling Collaboration in Academia
Joint work with Graham Cormode, Qiang Ma, Muthu Muthukrishnan

Problem Description

Related Work
Model one researcher’s papers and citations over time
Model as a static network: same collaborations and number of papers per year

Our Approach
Model the system as a repeated game, where the researchers choose collaborators each year in an attempt to maximize their long-term academic success
Determine which sets of collaboration strategies form a game equilibrium, such that no pair of researchers would benefit from changing their strategies in order to collaborate with each other

Game-Theoretic Model

Main Results

Future Directions
Do there exist equilibria in the dynamic game?
Extend the model to allow mixed strategies
Analyze the game under other metrics of academic success besides the h-index
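For reference, the h-index named above as the measure of academic success is the largest h such that a researcher has at least h papers with at least h citations each. The short sketch below computes it from a list of citation counts; the function name and the sample counts are illustrative only.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Made-up citation counts for one researcher's five papers.
example = h_index([10, 8, 5, 4, 3])
```

Because the h-index grows only when both the number of papers and their citations grow, a researcher's choice of collaborators each year affects it in a non-trivial way, which is what makes it a natural payoff in the game.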

Models and Algorithms for Event-Driven Networks
1. Detecting correlated events
   New stochastic model to address issue of time-scale bias
   Efficiently find subgraphs with unusually high recent activity
2. Discovering functional communities
   New class of metrics to reward large, dense biclusters
   CC-MACS algorithm efficiently finds a good co-clustering
3. Modeling academic collaboration
   Game-theoretic model allows formal analysis and simulation of collaborative behavior in a dynamic setting

Other Work
Measuring pairwise influence
  Use the REWARDS model to measure influence between nodes based on the times of their respective activity
Innovation and circulation in information networks
  Determine most likely sources of new content, and measure the importance of each node in the diffusion process
Cascade partitioning
  Infer likely threads of related content from temporal and relational information alone

I owe much gratitude to:
My committee: Muthu Muthukrishnan, Danfeng Yao, Rebecca Wright, Paul Kantor, and Hanghang Tong
Fred Roberts, Tami Carpenter, Tina Eliassi-Rad, and James Abello, for mentoring me over the years
My other collaborators, mentors, and friends at Rutgers, DIMACS/CCICADA, ACS, and elsewhere
The DHS Fellowship which funded me for 3 years
Last but not least, my family and friends