SEEKING STABLE CLUSTERS IN THE BLOGOSPHERE SNU IDB Lab. Chung-soo Jang MAR 21, 2008 VLDB 2007, VIENNA Nilesh Bansal, Fei Chiang, Nick Koudas University of Toronto.


SEEKING STABLE CLUSTERS IN THE BLOGOSPHERE SNU IDB Lab. Chung-soo Jang MAR 21, 2008 VLDB 2007, VIENNA Nilesh Bansal, Fei Chiang, Nick Koudas University of Toronto Frank Wm. Tompa University of Waterloo

Content  Introduction  Related Work  Cluster Generation  Stable Clusters Cluster Graph Breadth First Search Depth First Search Adapting the Threshold Algorithm Normalized Stable Clusters Online Version  Experiments Cluster Generation Stable Clusters Qualitative Results  Conclusions 2

Introduction (1)  The Blogosphere 3 67M KNOWN BLOGS 100K NEW EVERY DAY DOUBLING EVERY 200 DAYS

PERSONAL LIFE PRODUCT REVIEWS POLITICS TECHNOLOGY TOURISM SPORTS ENTERTAINMENT  What are they writing about in the blogosphere? Introduction (2) 4

Introduction (3)  Why should we care? Huge data repository Will continue to grow Extracting public opinions Valuable insights  MARKET RESEARCH  PUBLIC RELATION STRATEGIES  CUSTOMER OPINION TRACKING 5

Introduction (4)  BlogScope University of Toronto Live blog search and analysis engine Tracking over 13 million blogs, 100 million posts Serves thousands of daily visitors Visit: 6

Introduction (5)  BlogScope 7 Hot Keywords

Related Terms Popularity Curve Search Results Geo Search

Hawaii Earthquake Taiwan Undersea Earthquake Sumatra Earthquake


Baseball ON JAN

Introduction (5)  Challenges and opportunities Various stories => topics evolve => keywords align together A specific topic or event => a set of keywords forming a cluster.  Note that such keyword clusters are temporal (associated with specific time periods) and transient.  As topics recede, the associated keyword clusters dissolve, because their keywords no longer appear frequently together.  Identifying such clusters for specific time intervals is a challenging problem. Our goal: finding persistent chatter (keyword clusters) 14

Introduction (6)  Persistent Chatter Apple iPhone – January 2007  Jan first week: Anticipation of iPhone release  Jan 9th: iPhone release at Macworld  Jan 10th: Lawsuit by Cisco  Jan third week: Decrease  in chatter about iPhone 15

Introduction (7)  Stable Clusters - Apple iPhone Persistent for 4 days  Topic drifts Starts with discussion about Apple in general  Moves towards the Cisco lawsuit 16

Introduction (8)  Why stable clusters? Information Discovery  Monitor the buzz in the Blogosphere  “What were bloggers talking about in April last year?” Query refinement and expansion  If the query keyword belongs to one of the clusters, the cluster suggests related keywords Visualization?  Show keyword clusters directly to the user 17

Introduction (9)  Contribution in this paper Efficient algorithm to identify keyword clusters  BlogScope data contains over 13M unique keywords  Applicable to other streaming text sources  Flickr tags, News articles Formalize the notion of stable clusters Efficient algorithms to identify stable clusters  BFS, DFS and TA  Amenable to online computation over streaming data Using real dataset, experimental evaluation 18

Content  Introduction  Related Work  Cluster Generation  Stable Clusters Cluster Graph Breadth First Search Depth First Search Adapting the Threshold Algorithm Normalized Stable Clusters Online Version  Experiments Cluster Generation Stable Clusters Qualitative Results  Conclusions 19

Related Work (1)  Graph partitioning An active research topic k-way graph partitioning  Graph G => k mutually exclusive subsets of vertices of approximately the same size, such that the number of edges of G crossing between subsets is minimized.  NP-HARD  Several heuristic techniques  Especially multilevel graph bisection Kernighan-Lin: based on cut-size reduction when swapping nodes between partitions  Constraint: the number of partitions has to be specified in advance 20

Related Work (2)  Correlation clustering Drops this constraint and produces graph cuts directly  Given a graph in which each edge is marked with a ‘+’ or a ‘-’, correlation clustering produces a partitioning of the graph such that the number of ‘+’ edges within each cluster and the number of ‘-’ edges across clusters is maximized. NP-HARD Several approximation algorithms  Very interesting theoretically, but far from practical.  Moreover, the existing algorithms require the edges to have binary labels, which is not the case in the applications we have in mind. 21

Related Work (3)  Alternative formulation of graph clustering Flake et al. solve the problem using network flows. Drawbacks  A sensitivity parameter ∂ must be chosen before executing the algorithm, and the choice of ∂ affects the solutions produced significantly.  The running time of such an algorithm is prohibitively large  O(V E), for V vertices and E edges, both of which are in the order of millions in our problem.  (Required six hours to conduct a graph cut on a graph with a few thousand edges and vertices.)  Unclear how to set the parameters of this algorithm, and no guidelines exist 22

Related Work (4)  Measures for evaluating clusters Such measures have been utilized in the past to assess associations between keywords in a corpus  We employ some of these techniques to infer the strength of association between keywords during our cluster generation process. 23

Content  Introduction  Related Work  Cluster Generation  Stable Clusters Cluster Graph Breadth First Search Depth First Search Adapting the Threshold Algorithm Normalized Stable Clusters Online Version  Experiments Cluster Generation Stable Clusters Qualitative Results  Conclusions 24

Cluster Generation (1)  Definitions for organizing the keyword graph D: the set of interesting text documents D ∈ D: a document, represented as a bag of words u, v: keywords A D (u, v): 1 if both u and v are present in D, 0 otherwise A(u, v) = ∑ D ∈ D A D (u, v): the count of documents in D that contain both u and v Triplets of the form (u, v, A(u, v)) V: the union of all keywords in these triplets 25

Cluster Generation (2)  Definitions for organizing the keyword graph Triplets of the form (u, v, A(u, v))  Each triplet represents an edge in E with weight A(u, v) in graph G over vertices V A(u): the number of documents in D containing the keyword u A(u, ‾v ): the number of documents containing u but not v. 26

Cluster Generation (3)  The BlogScope crawler fetches all newly created blog posts at regular time intervals. D: the set of all blog posts created in a temporal interval A(u, v): the number of blog posts created in the selected temporal interval containing both u and v. BlogScope indexes around 75 million blog posts, and fetches over 200,000 new posts every day. This calls for efficient computation of the triplets (u, v, A(u, v)) 27

Cluster Generation (4)  The process of computing the triplets over the document set {D} [pass 1]  Stemming and removal of stop words [pass 2]  Emit all keyword pairs per document, recording A(u) as the pair (u, u) [pass 3]  Sort the file of all keyword pairs lexicographically and aggregate into triplets 28
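The pass structure above can be sketched in a few lines. This is an in-memory simplification for illustration (the actual system sorts pair files on disk); the stop-word list and whitespace tokenization are placeholder stand-ins for the real pass-1 processing.

```python
from collections import Counter
from itertools import combinations

# Toy stop-word list standing in for the real stemming/stop-word pass.
STOP_WORDS = {"the", "a", "an", "and", "of"}

def cooccurrence_triplets(documents):
    """For every keyword pair (u, v), count documents containing both.
    Each document is treated as a bag of words; A(u) is stored as the
    degenerate pair (u, u), mirroring the slides."""
    pair_counts = Counter()
    for doc in documents:
        words = sorted({w.lower() for w in doc.split()
                        if w.lower() not in STOP_WORDS})
        for u, v in combinations(words, 2):   # pairs emitted with u < v
            pair_counts[(u, v)] += 1
        for u in words:                        # A(u) as the pair (u, u)
            pair_counts[(u, u)] += 1
    # "Sorted lexicographically => triplets"
    return [(u, v, c) for (u, v), c in sorted(pair_counts.items())]

docs = ["apple iphone release", "apple iphone lawsuit", "baseball game"]
triplets = cooccurrence_triplets(docs)
```

Sorting the pair stream before aggregating is what lets the real pipeline work as an external sort over files too large for memory.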

Cluster Generation (5)  The result of computation of the triplets 29

Cluster Generation (6)  The result of the computation of the triplets Filtering process  Given graph G, we first infer statistically significant associations between pairs of keywords in this graph.  Null Hypothesis  If one keyword appears in a fraction n1 of the posts and another keyword in a fraction n2, we would expect them both to occur together in an n1n2 fraction of posts.  The null hypothesis is tested with a chi-square statistic 30

Cluster Generation (7)  The result of the computation of the triplets Filtering process In the Χ square test, if the statistic exceeds the 95% critical value (3.84 at one degree of freedom), u and v are correlated at the 95% confidence level  => the null hypothesis of independence is rejected  This test can act as a filter, omitting edges from G that are not correlated according to the test at the desired level of significance. 31
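A minimal sketch of this χ² filter over a 2×2 contingency table, using the standard 3.84 critical value (95% confidence, one degree of freedom). The counts and keywords are made-up illustration data, not figures from the paper.

```python
def chi_square(n_uv, n_u, n_v, n):
    """Chi-square statistic for independence of keywords u and v.
    n_uv: docs with both; n_u, n_v: docs with each keyword; n: total docs."""
    # Observed counts of the four contingency-table cells
    observed = [
        n_uv,                  # u and v
        n_u - n_uv,            # u without v
        n_v - n_uv,            # v without u
        n - n_u - n_v + n_uv,  # neither
    ]
    # Expected counts under the null hypothesis of independence
    p_u, p_v = n_u / n, n_v / n
    expected = [
        n * p_u * p_v,
        n * p_u * (1 - p_v),
        n * (1 - p_u) * p_v,
        n * (1 - p_u) * (1 - p_v),
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

CRITICAL_95 = 3.84  # chi-square critical value, 1 degree of freedom, 95%

# Hypothetical counts: 50 of 1000 posts mention both keywords,
# each keyword appears in 100 posts overall.
stat = chi_square(50, 100, 100, 1000)
correlated = stat > CRITICAL_95
```

With independent keywords the expected co-occurrence here would be 10 posts, so observing 50 drives the statistic far past the critical value and the edge survives the filter.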

Cluster Generation (8)  The result of the computation of the triplets What about correlation strength?  The Χ square test doesn’t capture correlation strength  So we need another measure for correlation strength 32

Cluster Generation (9)  The result of the computation of the triplets P(u, v)  Criterion separating strong correlations from weak ones  The graph is reduced by eliminating all edges with values less than a specific threshold (keeping p > 0.2).  Importance of correlations  The strong ones offer good indicators for query refinement (e.g., for a query keyword we may suggest the strongest correlation as a refinement)  They help in tracking the nature of ‘chatter’ around specific keywords. 33
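One common strength measure for a 2×2 co-occurrence table is the phi (Pearson) coefficient; the slides only name a correlation value with a 0.2 threshold, so treat phi here as an illustrative stand-in, with hypothetical counts.

```python
import math

def phi_coefficient(n_uv, n_u, n_v, n):
    """Phi (Pearson) correlation for a 2x2 keyword co-occurrence table."""
    num = n * n_uv - n_u * n_v
    den = math.sqrt(n_u * n_v * (n - n_u) * (n - n_v))
    return num / den if den else 0.0

THRESHOLD = 0.2  # edges with weaker correlation are pruned (G => G')

# (n_uv, n_u, n_v) per candidate edge over n = 1000 documents
edges = {("apple", "iphone"): (50, 100, 100),
         ("apple", "baseball"): (11, 100, 100)}
n = 1000
pruned = {pair: counts for pair, counts in edges.items()
          if phi_coefficient(counts[0], counts[1], counts[2], n) > THRESHOLD}
```

Unlike χ², phi is bounded in [-1, 1], so a single threshold separates strong from weak associations regardless of corpus size.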

Cluster Generation (10)  The result of computation of the triplets Only strong associations remain after pruning 34 G=>G’

Cluster Generation (11)  Our aim => Extracting keyword clusters Segmenting the Keyword Graph  Graph clustering algorithms [KK’98, FRT’05]  We don’t know the number of clusters  High computational complexity  Graph may not fit in main memory  Correlation clustering [BBC’04] – expensive  Our aim => fast, suitable for graphs  Bi-connected components  An articulation point in a graph is a vertex such that its removal makes the graph disconnected. A graph with at least two edges is bi-connected if it contains no articulation points. 35

Cluster Generation (12)  Our aim => Extracting keyword clusters Segmenting the Keyword Graph  Bi-connected components  A biconnected component of a graph is a maximal biconnected graph.  An articulation point in a graph is a vertex such that its removal makes the graph disconnected.  A graph with at least two edges is bi-connected if it contains no articulation points. 36

Cluster Generation (13)  Our aim => Extracting keyword clusters Segmenting the Keyword Graph  Why do we use bi-connected components in segmenting the keyword graph?  The underlying intuition is that nodes in a biconnected component survived pruning, due to very strong pair-wise correlations.  This problem is a well studied one [7]CLRS. 37

Cluster Generation (15)  Our aim => Extracting keyword clusters Bi-connected components 39

Cluster Generation (16)  Our aim => Extracting keyword clusters Finding Bi-connected components  An efficient algorithm exists – single pass  Realizable in secondary storage [CGGTV’05]  Perform a DFS on the graph  Maintain two numbers, num and low, with each node 40
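A recursive sketch of that single-pass DFS (Tarjan's biconnected-components algorithm), with the two per-node numbers written as num and low; the toy graph below is hypothetical.

```python
def biconnected_components(adj):
    """Tarjan's DFS: num[u] is u's discovery order; low[u] is the smallest
    num reachable from u's subtree via at most one back edge. Tree edges
    are stacked, and whenever low[child] >= num[u] (u is an articulation
    point, or the root), the edges above (u, child) form one component."""
    num, low = {}, {}
    stack, components, counter = [], [], [0]

    def dfs(u, parent):
        counter[0] += 1
        num[u] = low[u] = counter[0]
        for v in adj[u]:
            if v not in num:                     # tree edge
                stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= num[u]:             # u separates v's subtree
                    comp = set()
                    while True:
                        e = stack.pop()
                        comp.update(e)
                        if e == (u, v):
                            break
                    components.append(comp)
            elif v != parent and num[v] < num[u]:  # back edge
                stack.append((u, v))
                low[u] = min(low[u], num[v])

    for node in adj:
        if node not in num:
            dfs(node, None)
    return components

# Two triangles sharing the articulation point 'c'
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d", "e"],
       "d": ["c", "e"], "e": ["c", "d"]}
comps = biconnected_components(adj)
```

On the toy graph the algorithm returns the two triangles as separate components, splitting them at the articulation point c, which is exactly the segmentation behavior the slides motivate.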

Cluster Generation (17)  Our aim => Extracting keyword clusters Finding Bi-connected components 41

Content  Introduction  Related Work  Cluster Generation  Stable Clusters Cluster Graph Breadth First Search Depth First Search Adapting the Threshold Algorithm Normalized Stable Clusters Online Version  Experiments Cluster Generation Stable Clusters Qualitative Results  Conclusions 42

Cluster Graph (1)  Graph over clusters from three time steps Max temporal gap size, g=1 Three keyword clusters on each time step Each node is a keyword cluster Add a dummy source and sink, and make edges directed Edge weights represent similarity between clusters 43
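The construction above can be sketched directly. Jaccard overlap is an assumed similarity measure (the slide only says edge weights represent similarity between clusters), g is the maximum temporal gap, and the three time steps of clusters are hypothetical.

```python
def cluster_similarity(c1, c2):
    """Jaccard overlap between two keyword clusters (assumed measure)."""
    c1, c2 = set(c1), set(c2)
    return len(c1 & c2) / len(c1 | c2)

def build_cluster_graph(timesteps, g=1, min_sim=0.0):
    """Directed layered graph: node (t, i) is cluster i at time step t.
    An edge joins clusters at most g intervening steps apart whenever
    their similarity exceeds min_sim; edges always point forward in time."""
    edges = {}
    for t, clusters in enumerate(timesteps):
        for dt in range(1, g + 2):           # jump over up to g gap steps
            if t + dt >= len(timesteps):
                break
            for i, c1 in enumerate(clusters):
                for j, c2 in enumerate(timesteps[t + dt]):
                    w = cluster_similarity(c1, c2)
                    if w > min_sim:
                        edges[((t, i), (t + dt, j))] = w
    return edges

steps = [[{"apple", "iphone"}, {"baseball"}],
         [{"apple", "iphone", "cisco"}],
         [{"cisco", "lawsuit"}]]
graph = build_cluster_graph(steps, g=1)
```

A dummy source before the first layer and sink after the last (as the slide adds) would simply connect to every node with weight 0; they are omitted here for brevity.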

Cluster Graph (2) 44

Cluster Graph (3)  Formal Problem Definition Weight of a path = sum of participating edge weights  Definition: kl-stable clusters Find the top-k paths of length l with the highest weight  Definition: normalized stable clusters Find the top-k paths of length at least lmin with the highest weight normalized by their lengths 45

Cluster Graph (4)  Outline for kl-Stable Clusters 46

Content  Introduction  Related Work  Cluster Generation  Stable Clusters Cluster Graph Breadth First Search Depth First Search Adapting the Threshold Algorithm Normalized Stable Clusters Online Version  Experiments Cluster Generation Stable Clusters Qualitative Results  Conclusions 47

Breadth First Search (1) 48

Breadth First Search (2) 49

Breadth First Search (3) 50

Breadth First Search (4) 51

Breadth First Search (5)  BFS Analysis The algorithm requires a single pass over all G i  I/O linear in the number of clusters (sequential I/O only) Needs enough memory to keep all clusters from the past g+1 time steps in memory If enough memory is not available, multiple passes are required  Similar to a block nested-loop join Amenable to streaming computation  Can easily update as new data arrives 52
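The BFS-style single pass amounts to a dynamic program over the layered graph: for each node, keep the best weight of a path of each length ending there, and sweep the edges once in time order. For brevity this sketch finds only the single best path of a given length (top-1 rather than top-k), over a hypothetical edge set.

```python
def best_path_of_length(edges, l):
    """best[v][k] = (highest weight of any k-edge path ending at v,
    predecessor). One sweep in time order suffices because all edges in
    the cluster graph point forward in time."""
    best = {}
    for (u, v), w in sorted(edges.items()):   # nodes are (time, cluster)
        best.setdefault(u, {0: (0.0, None)})
        best.setdefault(v, {0: (0.0, None)})
        for k, (wt, _) in list(best[u].items()):
            cand = wt + w
            if k + 1 <= l and cand > best[v].get(k + 1, (-1.0, None))[0]:
                best[v][k + 1] = (cand, u)
    # Best weight over all nodes reached by some length-l path
    return max((lens[l][0], v) for v, lens in best.items() if l in lens)

edges = {((0, 0), (1, 0)): 0.6, ((1, 0), (2, 0)): 0.4,
         ((0, 1), (1, 0)): 0.9}
weight, endnode = best_path_of_length(edges, 2)
```

Keeping the top-k weights per (node, length) instead of just the best one turns this into the top-k variant; the per-layer tables are exactly the "clusters from the past g+1 time steps" the analysis says must stay in memory.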

Content  Introduction  Related Work  Cluster Generation  Stable Clusters Cluster Graph Breadth First Search Depth First Search Adapting the Threshold Algorithm Normalized Stable Clusters Online Version  Experiments Cluster Generation Stable Clusters Qualitative Results  Conclusions 53

Depth First Search (1) 54

Depth First Search (2) 55

Depth First Search (3) 56

Depth First Search (4) 57

Depth First Search (5)  DFS Analysis The number of I/O accesses is proportional to the number of edges in the cluster graph Small memory requirement  Keeps only the stack in memory  Size of the stack is bounded by the total number of temporal intervals Can be easily updated as new data arrives 58

Content  Introduction  Related Work  Cluster Generation  Stable Clusters Cluster Graph Breadth First Search Depth First Search Adapting the Threshold Algorithm Normalized Stable Clusters Online Version  Experiments Cluster Generation Stable Clusters Qualitative Results  Conclusions 59

Adapting the Threshold Algorithm (1)  Fagin’s Threshold Algorithm (TA) Long studied and well understood.  ΤΑ Algorithm Read all grades of an object once it is seen from a sorted access  No need to wait until the lists give k common objects Do sorted accesses (and corresponding random accesses) until you have seen the top k answers. How do we know that the grades of seen objects are higher than the grades of unseen objects? Predict the maximum possible grade of unseen objects 60

Adapting the Threshold Algorithm (2)  ΤΑ Algorithm 61 [Figure: two sorted lists L1 and L2 of (object, grade) pairs, partitioned into seen and possibly unseen entries; the threshold value T = min(0.72, 0.7) = 0.7 bounds the grade of any unseen object]

Adapting the Threshold Algorithm (3)  Example of ΤΑ Algorithm, Step 1: parallel sorted access to each list. L1 = (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6); L2 = (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2). For each object seen: get all its grades by random access, determine Min(A1, A2), and keep it in the buffer if it is amongst the 2 highest seen. Seen so far: a, d. 62

Adapting the Threshold Algorithm (4)  Example of ΤΑ Algorithm, Step 2: determine the threshold value from the grades at the current sorted-access depth, T = min(L1, L2). If 2 objects have overall grade ≥ T, stop; otherwise advance to the next entry in each sorted list and repeat step 1. Here the buffer holds a and d, and T = min(0.9, 0.9) = 0.9, so the search continues. 63

Adapting the Threshold Algorithm (5)  Example of ΤΑ Algorithm, Step 1 (again): parallel sorted access to each list. Newly seen: b from L1 (a is re-seen from L2); the seen objects are now a, d, b. 64

Adapting the Threshold Algorithm (6)  Example of ΤΑ Algorithm, Step 2 (again): T = min(0.8, 0.85) = 0.8. Only a (min-grade 0.85) reaches the threshold, so the search continues with a and b in the buffer. 65

Adapting the Threshold Algorithm (7)  Example of ΤΑ Algorithm, situation at the stopping condition: T = min(0.72, 0.7) = 0.7. Both buffered objects, a (0.85) and b (0.7), have grades ≥ T, so the top-2 answers are a and b. 66

Adapting the Threshold Algorithm (8)  Fagin’s Threshold Algorithm (TA) Why is the threshold correct?  Because the threshold essentially gives us the maximum score for the objects not yet seen (≤ T) Advantages:  The number of objects accessed is minimized! 67
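A compact sketch of TA with min() as the aggregation function, replaying the two-list example from the preceding slides; the stopping test is the threshold condition described above.

```python
def threshold_algorithm(lists, k):
    """Fagin's TA with min() aggregation. lists: sorted (object, grade)
    lists, descending by grade. Alternate sorted accesses across the
    lists, resolve each newly seen object by random access, and stop as
    soon as k buffered objects score at least the threshold T (the min
    of the grades at the current sorted-access depth)."""
    grades = [dict(lst) for lst in lists]       # for random access
    seen = {}
    for depth in range(max(len(lst) for lst in lists)):
        for lst in lists:
            if depth < len(lst):
                obj = lst[depth][0]
                if obj not in seen:             # random-access all grades
                    seen[obj] = min(g[obj] for g in grades)
        threshold = min(lst[min(depth, len(lst) - 1)][1] for lst in lists)
        top = sorted(seen.items(), key=lambda kv: -kv[1])[:k]
        if len(top) == k and all(score >= threshold for _, score in top):
            return top
    return sorted(seen.items(), key=lambda kv: -kv[1])[:k]

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
top2 = threshold_algorithm([L1, L2], 2)
```

On the slides' lists the run stops at depth 3 with T = min(0.72, 0.7) = 0.7, returning a (0.85) and b (0.7) without ever random-accessing objects below the threshold.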

Adapting the Threshold Algorithm (9) 68 [Figure: the two-list TA example extended to three sorted lists D1, D2, D3, one per time step, with the aggregate taken as the min over the per-list grades]

Content  Introduction  Related Work  Cluster Generation  Stable Clusters Cluster Graph Breadth First Search Depth First Search Adapting the Threshold Algorithm Normalized Stable Clusters Online Version  Experiments Cluster Generation Stable Clusters Qualitative Results  Conclusions 69

Normalized Stable Clusters (1)  Find the top-k paths of length greater than lmin with the highest weight normalized by their length stability(π) = weight(π)/length(π)  Either the BFS- or the DFS-based technique can be used  weight(π)/length(π) is not monotonic, which makes pruning tricky 70
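The non-monotonicity is easy to see on toy numbers: appending a positive-weight edge can either raise or lower the normalized stability, so a path cannot be discarded on its current ratio alone (the edge weights below are hypothetical).

```python
def stability(weights):
    """Normalized stability of a path, given its edge weights."""
    return sum(weights) / len(weights)

# Extending with a weak edge lowers the ratio...
assert stability([0.9, 0.3]) < stability([0.9])
# ...yet a weak prefix can still grow into a strong path, so prefixes
# with low ratios cannot simply be pruned during the search.
assert stability([0.1, 0.9]) > stability([0.1])
```

This is exactly why the paper needs a dedicated bound (Theorem 1 on the next slides) before pruning becomes safe.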

Normalized Stable Clusters (2)  Theorem 1 71

Normalized Stable Clusters (3)  Proof of Theorem 1 72

Content  Introduction  Related Work  Cluster Generation  Stable Clusters Cluster Graph Breadth First Search Depth First Search Adapting the Threshold Algorithm Normalized Stable Clusters Online Version  Experiments Cluster Generation Stable Clusters Qualitative Results  Conclusions 73

Online Version (1)  New data arrives at every time interval, so the algorithms presented need to be amenable to incremental adjustment From the data-structure point of view:  BFS-based algorithm: admits a good online version  DFS-based algorithm: not naturally an online streaming algorithm  Our DFS: updated in an incremental fashion as new data arrives. 74

Content  Introduction  Related Work  Cluster Generation  Stable Clusters Cluster Graph Breadth First Search Depth First Search Adapting the Threshold Algorithm Normalized Stable Clusters Online Version  Experiments Cluster Generation Stable Clusters Qualitative Results  Conclusions 75

Experiments (1)  Process Outline 76

The battle by Islamist militia against the Somali forces and Ethiopian troops. On Jan 9, Abdullahi Mogadishu US gunships attack Al-Qaeda targets. Experiments (2)  We present results from blog postings in the week of Jan 6th  Around clusters were produced for each day A threshold of 0.2 was used for the correlation coefficient 77

Content  Introduction  Related Work  Cluster Generation  Stable Clusters Cluster Graph Breadth First Search Depth First Search Adapting the Threshold Algorithm Normalized Stable Clusters Online Version  Experiments Cluster Generation Stable Clusters Qualitative Results  Conclusions 78

Cluster Generation (1) 79

Content  Introduction  Related Work  Cluster Generation  Stable Clusters Cluster Graph Breadth First Search Depth First Search Adapting the Threshold Algorithm Normalized Stable Clusters Online Version  Experiments Cluster Generation Stable Clusters Qualitative Results  Conclusions 80

Stable Clusters (1) 81  Time and space for finding top-3 paths of length 6 on a dataset with n = 2000, m = 9 and g = 0: less than 22 MB RAM for DFS and 35 MB for BFS.

Stable Clusters (2) 82  Time and space for finding top-3 paths of length 6 on a dataset with n = 2000, m = 9 and g = 0: less than 22 MB RAM for DFS and 35 MB for BFS.

Stable Clusters (3)  Running times for BFS based algorithm seeking top-5 full paths for different values of g as the number of temporal intervals is increased from 5 to 25. Number of nodes per temporal interval was fixed at n = 1000 and average out degree was set to d = 5. 83

Stable Clusters (4)  Running times for BFS based algorithm seeking top- 5 full paths for different values of d as the number of temporal intervals is increased from 5 to 25. Number of nodes per temporal interval was fixed at n = 1000 and gap size was set to g = 2. 84

Stable Clusters (5)  Running time for BFS seeking top-5 paths. m is the number of time steps. Average out degree set to 5, and max gap size set to 1. 85

Stable Clusters (6)  Running time for DFS as we increase the number of nodes in each time step and the path length l Seeking top-5 paths in a graph over 6 time steps 86

Stable Clusters (7) 87

Stable Clusters (8) 88

Stable Clusters (9) 89

Content  Introduction  Related Work  Cluster Generation  Stable Clusters Cluster Graph Breadth First Search Depth First Search Adapting the Threshold Algorithm Normalized Stable Clusters Online Version  Experiments Cluster Generation Stable Clusters Qualitative Results  Conclusions 90

Qualitative Results  Capturing clusters of keywords with strong pairwise correlations  Capturing the dynamic nature of stories in the blogosphere, and their evolution with time.  Handling topic drifts 91

Content  Introduction  Related Work  Cluster Generation  Stable Clusters Cluster Graph Breadth First Search Depth First Search Adapting the Threshold Algorithm Normalized Stable Clusters Online Version  Experiments Cluster Generation Stable Clusters Qualitative Results  Conclusions 92

Conclusions  Formalize the problem of discovering persistent chatter in the blogosphere Applicable to other temporal text sources  Identifying topics as keyword clusters  Discovering stable clusters Aggregate stability or normalized stability 3 algorithms, based on BFS, DFS, and TA  Experimental Evaluation 93