Bursty and Hierarchical Structure in Streams


Bursty and Hierarchical Structure in Streams
Jon Kleinberg, ACM SIGKDD '02
Presented by Deng Cai, 10/20/2004

Outline
- The problem and main idea
- A Weighted Automaton Model
  - Two-state
  - Infinite-state
- Experiments
  - Email
  - Paper titles
- Thoughts

Main Idea
Extract meaningful structure from a document stream.
- Burst of activity: certain features rise sharply in frequency as a topic emerges.
- A formal approach for modeling such "bursts": an infinite-state automaton in which bursts appear as state transitions.
- A nested representation of the set of bursts that imposes a hierarchical structure on the overall stream.

Two Cases
- Email: articles arrive one at a time over time; try to find the hierarchical structure.
- Paper titles: documents appear in batches; try to enumerate all the bursts (and rank them).

A Weighted Automaton Model: One-State Model
Generating model: the gap x in time between two consecutive messages is drawn from the exponential density f(x) = a e^(-a x), where a is the rate of message arrivals.
Expectation: the expected gap is 1/a.
Why this model? Exponential gaps correspond to memoryless ("completely random") arrivals, the natural baseline for a stream.
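As a quick sanity check of this generating model, one can sample exponential gaps and confirm that the mean gap is close to 1/a. A minimal sketch in Python; the function name and parameters are ours, not the paper's.

```python
import random

# Sketch of the one-state generating model: inter-arrival gaps x follow
# the exponential density f(x) = a * exp(-a * x), so the expected gap
# between consecutive messages is 1/a.
def sample_gaps(a, n, seed=0):
    rng = random.Random(seed)
    return [rng.expovariate(a) for _ in range(n)]

gaps = sample_gaps(a=2.0, n=100_000)
mean_gap = sum(gaps) / len(gaps)   # should be close to 1/a = 0.5
```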

A Weighted Automaton Model: Two-State Model
A two-state automaton A with states q0 (low rate a0) and q1 (burst rate a1 > a0).
- A begins in state q0.
- Before each message is emitted, A changes state with probability p, remaining in its current state with probability 1 - p, independently of previous emissions and state changes.
- A message is then emitted, and the gap in time until the next message is drawn from the exponential density associated with A's current state.

A Weighted Automaton Model: Two-State Model
From a set of messages, estimate a state sequence by maximum likelihood:
- n inter-arrival gaps x = (x_1, ..., x_n)
- a state sequence q = (q_{i_1}, ..., q_{i_n})
- b denotes the number of state transitions in the sequence q

A Weighted Automaton Model: Two-State Model
Finding a state sequence q that maximizes the likelihood P(q | x) is equivalent to finding one that minimizes -ln P(q | x), which reduces to minimizing the cost function
  c(q | x) = b ln((1 - p) / p) + sum_{t=1..n} ( -ln f_{i_t}(x_t) )
where f_i(x) = a_i e^(-a_i x) is the gap density of state q_i.
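The minimization can be carried out with a Viterbi-style dynamic program over the two states. A sketch under the model's definitions; variable names and the interface are ours.

```python
import math

def two_state_viterbi(gaps, a0, a1, p):
    """Minimum-cost state sequence for the two-state model.

    Cost = b * ln((1-p)/p) + sum_t -ln f_{i_t}(x_t),
    where f_i(x) = a_i * exp(-a_i * x) and b counts state changes.
    Sketch of the paper's formulation, not its code.
    """
    tau = math.log((1 - p) / p)                  # cost of one state change
    rates = (a0, a1)

    def emit(i, x):                              # -ln f_i(x)
        return rates[i] * x - math.log(rates[i])

    # A begins in q0; a change may occur before the first message.
    cost = [emit(0, gaps[0]), tau + emit(1, gaps[0])]
    back = []
    for x in gaps[1:]:
        new_cost, ptr = [], []
        for j in (0, 1):
            stay, switch = cost[j], cost[1 - j] + tau
            ptr.append(j if stay <= switch else 1 - j)
            new_cost.append(min(stay, switch) + emit(j, x))
        cost = new_cost
        back.append(ptr)

    # Backtrack from the cheaper final state.
    j = 0 if cost[0] <= cost[1] else 1
    seq = [j]
    for ptr in reversed(back):
        j = ptr[j]
        seq.append(j)
    seq.reverse()
    return seq
```

On a stream whose gaps shrink sharply halfway through, the optimal sequence flips from q0 to q1 exactly at the change point, since the per-gap savings in the burst outweigh the single transition cost.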

An Infinite-State Model
- Base state q0: exponential density with rate a0 = n/T, so the expected gap is the mean gap g = T/n, consistent with completely uniform message arrivals.
- State q_i: rate a0 s^i for a scaling parameter s > 1; the infinite sequence of states models inter-arrival gaps that decrease geometrically from g.
- For every i and j, there is a cost tau(i, j) associated with a state transition from q_i to q_j: moving up costs (j - i) gamma ln n when j > i, and moving down is free (tau(i, j) = 0 when j <= i).
- In this paper: s = 2 and gamma = 1.

An Infinite-State Model
This automaton, with its associated parameters s and gamma, is denoted A*_{s,gamma}. Given the gap sequence x, find a state sequence q that minimizes the cost function
  c(q | x) = sum_{t=1..n} tau(i_{t-1}, i_t) + sum_{t=1..n} ( -ln f_{i_t}(x_t) ), with i_0 = 0.
As before, minimizing the first term is consistent with having few state transitions, and transitions that span only a few distinct states, while minimizing the second term is consistent with passing through states whose rates agree closely with the inter-arrival gaps. The combined goal is thus to track the sequence of gaps as well as possible without changing state too much. The scaling parameter s controls the "resolution" with which the discrete rate values of the states can track the real-valued gaps; the parameter gamma controls the ease with which the automaton can change states. How to choose these two parameters?

An Infinite-State Model
An optimal state sequence in A*_{s,gamma} can be found by restricting attention to a number of states k that is a very small constant (in practice always at most 25). This can be done by adapting the standard forward dynamic programming (Viterbi-style) algorithm used for hidden Markov models to the model and cost function defined here.
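A sketch of that dynamic program, truncated to k states. The truncation rule for k is ours (enough states that the fastest rate exceeds the reciprocal of the smallest gap); everything else follows the model's definitions.

```python
import math

def infinite_state_viterbi(gaps, s=2.0, gamma=1.0):
    """Optimal state sequence for A*_{s,gamma}, truncated to k states.

    State q_i has exponential rate a0 * s**i with a0 = n/T; climbing
    j - i levels costs (j - i) * gamma * ln(n), and descending is free.
    Sketch implementation; names and the truncation choice are ours.
    """
    n, T = len(gaps), sum(gaps)
    a0 = n / T
    k = max(2, math.ceil(math.log(T / min(gaps), s)) + 1)
    rates = [a0 * s ** i for i in range(k)]

    def tau(i, j):                               # transition cost
        return (j - i) * gamma * math.log(n) if j > i else 0.0

    def emit(i, x):                              # -ln f_i(x)
        return rates[i] * x - math.log(rates[i])

    INF = float("inf")
    cost = [0.0 if i == 0 else INF for i in range(k)]   # start in q0
    back = []
    for x in gaps:
        new_cost, ptr = [], []
        for j in range(k):
            c, i = min((cost[i] + tau(i, j), i) for i in range(k))
            new_cost.append(c + emit(j, x))
            ptr.append(i)
        cost = new_cost
        back.append(ptr)

    j = min(range(k), key=cost.__getitem__)
    seq = [j]
    for ptr in reversed(back[1:]):
        j = ptr[j]
        seq.append(j)
    seq.reverse()
    return seq
```

On a stream with a dense run of short gaps in the middle, the optimal sequence climbs to a positive-index state for the duration of the run and returns to q0 afterwards.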

An Infinite-State Model
We can formally define a burst of intensity j to be a maximal interval over which q is in a state of index j or higher. It follows that bursts exhibit a natural nested structure: every burst of intensity j + 1 is contained within some burst of intensity j.
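Given an optimal state sequence, the nested bursts can be read off directly by scanning for maximal runs at each intensity level. A small sketch; the tuple output format is our choice.

```python
def enumerate_bursts(seq):
    """List bursts as (intensity j, start, end) with end exclusive:
    a burst of intensity j is a maximal interval where the state index
    is >= j. Intervals for intensity j+1 nest inside those for j."""
    out = []
    for j in range(1, max(seq) + 1):
        start = None
        for t, v in enumerate(seq + [0]):        # sentinel closes open runs
            if v >= j and start is None:
                start = t
            elif v < j and start is not None:
                out.append((j, start, t))
                start = None
    return out
```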

An Infinite-state Model

Experiments: Email (Hierarchical Structure)
Data: the author's saved email (June 9, 1997 to Aug. 23, 2001), 34,344 messages in total. Subsets of the collection can be chosen by selecting all messages that contain a particular string or set of strings:
- "ITR": the name of a large National Science Foundation program for which the author and his colleagues wrote two proposals in 1999-2000.
- "prelim": the term used at Cornell for (non-final) exams in undergraduate courses.
Two questions to examine: First, is it in fact the case that the appearance of messages containing particular words exhibits a "spike," in some informal sense, in the (temporal) vicinity of significant times such as deadlines, scheduled events, or unexpected developments? Second, do the algorithms developed here provide a means for identifying this phenomenon?
Figure 2 shows an analysis of the stream of all messages containing the word "ITR." There are many possible ways to organize this stream, but one general backdrop against which to view it is the set of deadlines imposed by the NSF for the first run of the program. Finally, note that there is no burst of positive intensity over the final deadline for large proposals, since the authors did not continue their large submission past the pre-proposal stage.

ITR

ITR

prelim

prelim

Experiments: Paper Titles (Enumerating Bursts)
For every word w that appears in the collection, compute all the bursts in the stream of documents containing w. Combined with a method for computing a weight for each burst, and then ranking bursts by weight, this provides a way to find the terms that exhibit the most prominent rising and falling pattern over a limited period of time.
Task: extract bursts in term usage from the titles of conference papers. Two distinct sources of data are used:
- the titles of all papers from the database conferences SIGMOD and VLDB for the years 1975-2001;
- the titles of all papers from the theory conferences STOC and FOCS for the years 1969-2001.

The Automaton
The gap-based automaton is not suitable in this case, since it is fundamentally based on analyzing the distribution of inter-arrival gaps. Here, documents arrive in discrete batches; in each new batch, some are relevant and some are irrelevant. The idea is thus to find an automaton model that generates batched arrivals with particular fractions of relevant documents. A sequence of batched arrivals can be considered bursty if the fraction of relevant documents alternates between reasonably long periods in which it is large and other periods in which it is small.

The Automaton
- Base state q0: expected fraction of relevant documents p_0 = R / D, where R is the total number of relevant documents and D the total number of documents.
- State q_i: expected fraction p_i = p_0 s^i; q_i is only defined for i such that p_i <= 1.
- State q_i produces a mixture of relevant and irrelevant documents according to a binomial distribution with probability p_i.

The Automaton
Cost function: if the automaton is in state q_i when the t-th batch arrives, containing r_t relevant documents out of d_t total, the cost is
  sigma(i, r_t, d_t) = -ln [ C(d_t, r_t) p_i^{r_t} (1 - p_i)^{d_t - r_t} ]

Experiment (Paper Titles)
The main goal is to enumerate bursts of positive intensity, so the two-state version of the automaton is used. Given an optimal state sequence, bursts of positive intensity correspond to intervals in which the state is q1 rather than q0.
Weight of a burst [t1, t2]: the total cost saved by being in q1 rather than q0,
  weight = sum_{t = t1..t2} ( sigma(0, r_t, d_t) - sigma(1, r_t, d_t) )
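A sketch of the binomial batch cost and the resulting burst weight; function names are ours, and the weight follows the paper's definition of cost saved by the burst state.

```python
import math

def sigma(p, r, d):
    """Cost -ln[ C(d, r) * p**r * (1-p)**(d-r) ] of emitting a batch of
    d documents, r of them relevant, from a state whose expected
    fraction of relevant documents is p."""
    return -(math.log(math.comb(d, r))
             + r * math.log(p)
             + (d - r) * math.log(1 - p))

def burst_weight(p0, p1, batches):
    """Total cost saved by state q1 (fraction p1) over q0 (fraction p0)
    across a burst's batches, given as (r_t, d_t) pairs; this is the
    quantity used to rank bursts."""
    return sum(sigma(p0, r, d) - sigma(p1, r, d) for r, d in batches)
```

A burst gets a large weight when its batches contain far more relevant documents than the base rate p0 would predict.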

Experiment Result

Experiment Result

Thoughts
- Identifying whether a message belongs to a certain topic is the hard problem; this paper avoids it by selecting messages that contain a given word. Text mining should handle this problem.
- The results are not very impressive, and the model might not be so meaningful.
- Why this generating model? Why these parameters? None of these choices can be mathematically proved (verified).