1 T-Scroll: Visualizing Trends in a Time-Series of Documents for Interactive User Exploration Yoshiharu Ishikawa and Mikine Hasegawa Nagoya University, Japan
2 Outline Background and objective Related work Novelty-based document clustering Overview of T-Scroll system Evaluation Conclusions and future work
3 Background Time-series of documents Example: news articles delivered on the Internet, online academic journals Continually delivered every day Problems A large number of documents: appropriate summarization is required Topics change over time: topic detection/tracking and trend extraction are useful
4 Objectives Development and evaluation of T-Scroll (Trend/Topic-Scroll) User interface for visualizing the transition of topics extracted from a time-series of documents System Features Built on a document clustering system that outputs new clustering results periodically Clusters are displayed along the time axis like a scroll Links are shown between related clusters to represent topic transitions Several features for interactive exploratory analysis
7 Visualization of a time-series of documents A few systems exist for visualizing trends in a time-series of documents ThemeRiver (Havre et al., IEEE Trans. VCG, 2002) [4] Visualizes topic streams like a river Focuses on visual impact No features for analysis and browsing TimeMine (Swan and Allan, SIGIR ’00) [5] Extracts topics from a time-series of documents Displays timelines to represent topics on the screen
8 ThemeRiver Analysis of the articles related to Cuba (1960 – 1961)
9 TimeMine Swan & Allan (U. of Massachusetts)
10 Analysis of time-dependent clusters Mei & Zhai (KDD ’05) [6] Statistical approach for discovering major topics from a time-series of documents Probabilistic modeling MONIC (Spiliopoulou et al., KDD ’06) [7] Detects various types of patterns from cluster transitions Examples: splitting/merging of clusters, cluster size changes Based on the analysis of historical snapshots of clusters
12 Novelty-based document clustering (1) Developed by our group (ECDL ’01 [8], WWW Journal 2007 [10], etc.) Clusters documents incrementally based on their similarity and novelty Features Similarity takes novelty into account Assigns high weights to recent documents and low weights to old ones Document weights decay as time passes: based on the concept of obsolescence (aging) Deletes old documents whose weights are smaller than the threshold Incremental processing: low update cost
13 Novelty-based document clustering (2) Periodic clustering is performed on a time-series of documents
[Figure: clusters placed along a time axis, with example topics “New President Sarkozy”, “Yeltsin’s Death”, “Blair to Resign”, and “Other articles”]
“Yeltsin’s Death” and other old documents are obsolete!
14 Document similarity (1)
Assumption: each delivered document gradually loses its value as time passes
The weight of a document decreases exponentially as time passes:
$$dw_i = \lambda^{\,t - T_i}$$
$dw_i$: the weight of a document $d_i$ at time $t$ (the weight is 1 at acquisition time)
$T_i$: acquisition time of document $d_i$
$t$: current time
$\lambda$ ($0 < \lambda < 1$): forgetting factor, which determines the forgetting speed
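The decay rule can be illustrated with a minimal Python sketch, assuming the weight takes the form λ^(t − T_i) with 0 < λ < 1. The function names, the use of days as the time unit, and the half-life parameterization in the usage note are assumptions made for illustration only; the deletion step follows the "delete documents whose weights are smaller than the threshold" rule mentioned earlier in the slides.

```python
def document_weight(acquired_at, now, forgetting_factor):
    """Return forgetting_factor ** (now - acquired_at), i.e. the decayed weight."""
    if not 0.0 < forgetting_factor < 1.0:
        raise ValueError("forgetting factor must satisfy 0 < lambda < 1")
    age = now - acquired_at
    if age < 0:
        raise ValueError("document cannot be acquired in the future")
    return forgetting_factor ** age


def drop_obsolete(acquisition_times, now, forgetting_factor, threshold):
    """Keep only documents whose weight has not fallen below `threshold`.

    `acquisition_times` maps a (hypothetical) document id to its acquisition time.
    """
    return {
        doc_id: acquired_at
        for doc_id, acquired_at in acquisition_times.items()
        if document_weight(acquired_at, now, forgetting_factor) >= threshold
    }
```

As a usage note, choosing λ = 0.5 ** (1 / 7) would make a document lose half of its weight every 7 time units, so a document acquired 14 units ago has weight 0.25 and is dropped under a threshold of 0.3.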
15 Document similarity (2) Similarity score of documents d_i and d_j Based on the novelty of the documents and the word occurrence patterns in the documents Extension of the tf-idf method New documents have a high impact on the clustering result Document clustering: k-means method
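One way a novelty-aware extension of tf-idf could look is sketched below in Python. This is an illustrative assumption, not the similarity formula used in the slides: it scales a plain tf-idf cosine similarity by the normalized weights of both documents, so that pairs of recent documents score higher than pairs of old ones. All function names and the particular idf variant are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Build a sparse tf-idf vector (term -> weight) for each tokenized document."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        vectors.append(
            {term: count * math.log(1 + n_docs / doc_freq[term])
             for term, count in term_freq.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def novelty_similarity(vec_i, vec_j, weight_i, weight_j, total_weight):
    """Assumed form: (dw_i / total) * (dw_j / total) * cosine(vec_i, vec_j)."""
    return (weight_i / total_weight) * (weight_j / total_weight) * cosine(vec_i, vec_j)
```

Here `total_weight` stands for the sum of the weights of all documents currently kept, so the two leading factors behave like selection probabilities that shrink as a document ages.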
17 T-Scroll: Idea Periodic clustering results are displayed like a scroll Links represent related cluster pairs
19 System functionalities (1) Cluster labels: selected by a formula over Pr(d_i) (document weight) and tf_ij (term frequency count) Cluster sizes: ellipse size roughly corresponds to the number of documents Links: a link is shown between two clusters if their score is greater than the threshold
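The thresholding step for links can be sketched in Python as follows. The slides do not state how the link score is computed, so this sketch substitutes the Jaccard overlap of the document sets of two clusters from consecutive clustering results; that choice, and the names `jaccard` and `find_links`, are assumptions.

```python
def jaccard(docs_a, docs_b):
    """Jaccard overlap of two collections of document ids (stand-in link score)."""
    a, b = set(docs_a), set(docs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_links(previous, current, threshold):
    """Return (previous_label, current_label, score) for pairs scoring above threshold.

    `previous` and `current` map a cluster label to the ids of its documents.
    """
    links = []
    for prev_label, prev_docs in previous.items():
        for cur_label, cur_docs in current.items():
            score = jaccard(prev_docs, cur_docs)
            if score > threshold:
                links.append((prev_label, cur_label, score))
    return links
```

A document overlap measure is plausible here because incremental clustering keeps recent documents across consecutive results, but any other score could be plugged into the same threshold test.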
20 System functionalities (2) Cluster quality: visualized using different colors for the cluster border lines, from red (good) to purple (bad) A high score is achieved if (1) the cluster size is large, and (2) the documents contained in the cluster are similar
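A hypothetical Python sketch of such a quality indicator follows. The quality measure (sum of pairwise similarities, which grows with both cluster size and cohesion) and the linear purple-to-red interpolation are assumptions chosen to match the two stated properties; the slides do not give the actual definition or color scale.

```python
from itertools import combinations


def cluster_quality(similarity, doc_ids):
    """Sum of pairwise similarities: larger for big clusters of similar documents."""
    return sum(similarity(a, b) for a, b in combinations(doc_ids, 2))


def border_color(quality, worst, best):
    """Map a quality score to '#rrggbb', from purple (worst) to red (best)."""
    if best <= worst:
        raise ValueError("best must be greater than worst")
    # Clamp the normalized score to [0, 1] before interpolating.
    ratio = min(1.0, max(0.0, (quality - worst) / (best - worst)))
    purple = (128, 0, 128)
    red = (255, 0, 0)
    rgb = [round(p + (r - p) * ratio) for p, r in zip(purple, red)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)
```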
21 System functionalities (3) Drill-down/roll-up: the user can interactively specify the interval between two consecutive clusterings (e.g., one day, one week) Displaying keyword list: the user can browse the keyword list for a specified cluster Access to original documents Keyword-based emphasis: clusters that contain a user-specified keyword are emphasized
22 Demo
23 System implementation T-Scroll module Written in Perl: generates an SVG file Browser displays the generated SVG file SVG file includes scripts (JavaScript) Used for interactive manipulation Clustering module Written in Ruby Novelty-based incremental document clustering
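The SVG generation step can be sketched as follows. The actual module is written in Perl; this sketch uses Python instead, and the layout (one column per clustering time, fixed cell sizes, square-root scaling of ellipse radii) and the function name `render_scroll` are assumptions for illustration. It omits the embedded JavaScript used for interactive manipulation.

```python
import math
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def render_scroll(snapshots, links, column_width=160, row_height=80):
    """Render clustering snapshots as an SVG string.

    snapshots: one list per clustering time, each a list of (label, doc_count).
    links: pairs ((col, row), (col, row)) connecting related clusters.
    """
    ET.register_namespace("", SVG_NS)
    rows = max((len(clusters) for clusters in snapshots), default=0)
    svg = ET.Element(
        f"{{{SVG_NS}}}svg",
        {"width": str(column_width * len(snapshots)), "height": str(row_height * rows)},
    )

    def center(col, row):
        return column_width * (col + 0.5), row_height * (row + 0.5)

    # Draw links first so that ellipses are painted on top of them.
    for start, end in links:
        x1, y1 = center(*start)
        x2, y2 = center(*end)
        ET.SubElement(svg, f"{{{SVG_NS}}}line", {
            "x1": f"{x1:.1f}", "y1": f"{y1:.1f}",
            "x2": f"{x2:.1f}", "y2": f"{y2:.1f}", "stroke": "gray",
        })

    for col, clusters in enumerate(snapshots):
        for row, (label, doc_count) in enumerate(clusters):
            cx, cy = center(col, row)
            rx = 10 + 6 * math.sqrt(doc_count)  # radius grows with cluster size
            ET.SubElement(svg, f"{{{SVG_NS}}}ellipse", {
                "cx": f"{cx:.1f}", "cy": f"{cy:.1f}",
                "rx": f"{rx:.1f}", "ry": f"{rx / 2:.1f}",
                "fill": "white", "stroke": "red",
            })
            text = ET.SubElement(svg, f"{{{SVG_NS}}}text", {
                "x": f"{cx:.1f}", "y": f"{cy:.1f}", "text-anchor": "middle",
            })
            text.text = label
    return ET.tostring(svg, encoding="unicode")
```

Writing the returned string to a `.svg` file and opening it in a browser shows one column of labeled ellipses per clustering time, with lines joining linked clusters.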
24 System architecture
[Figure: system architecture diagram. Components: RSS Feed Module, Clustering Module, T-Scroll (T-Scroll Main Module, SVG Output Module, SVG Control Module; implemented in Perl and JavaScript), SVG file (includes JavaScript), browser with plug-in, user. Labeled flows: news articles, clustering result, command inputs, cluster display, interactive manipulation.]
26 Evaluation 10 users Data set Japanese news articles collected from news web sites from Sept to Feb articles per day Clustering was performed at six-hour intervals Evaluation criteria Overall impressions Evaluation of each function Observability of topics Comparison with ThemeRiver
27 Overall impression Users give scores from 0 to 5
28 Evaluation of each function
29 Observability of topics (1) Can users observe major topics in Nov. 2006? Five major topics are specified by us: each user scores how clearly he or she can observe each topic
30 Observability of topics (2) 10 users (different from the former experiments) Users report the topics they observe and their scores, with no prior information Topics 1 to 5 are the major topics used in the previous experiment Topic 2 (big hurricane) was regarded as a normal weather topic
31 Comparison with ThemeRiver (1) 11 users (different from previous experiments) A ThemeRiver-like display figure was manually created for news articles in Dec Questions to users Overall impressions Observability of topics
33 Comparison with ThemeRiver (2) Overall impression
Category                         No. of replies
T-Scroll is better               2
T-Scroll is slightly better      3
Almost same                      3
ThemeRiver is slightly better    3
ThemeRiver is better             0
34 Comparison with ThemeRiver (2) Can users observe five major topics that we selected?
Category      No. of replies
Good          0
Possible      3
No good       4
Impossible    4
35 Summary of experiments Overall impressions Good, but usability improvements are required Some users commented on the response speed System functionalities Several features (quality info, article lists, etc.) are useful in practice Appropriate labels are necessary: label selection should be improved Comparison with ThemeRiver ThemeRiver has visual impact, but its display tends to become complicated when there are many topics
37 Conclusions and future work Development and evaluation of T-Scroll system Based on novelty-based incremental clustering method Scroll-like display for showing changing trends Several features for interactive analysis Evaluation Overall impression Observability of topics Comparison with ThemeRiver Future work Sophisticated keyword (label) selection Improvement of interactive speed