Online Multiscale Dynamic Topic Models
Best Research Paper Award Honorable Mention
Tomoharu Iwata, Yasushi Sakurai, Takeshi Yamada, Naonori Ueda
NTT Communication Science Laboratories, Japan
Introduction
  Topic models for analyzing document dynamics
  - Models: Dynamic topic model [Blei+06], Topics over Time [Wang+06], Dynamic mixture model [Wei+07], Topic tracking model [Iwata+09]
  - Data: scientific papers, news articles, blogs, e-mail
Multiscale dynamics
  Topics naturally evolve with multiple timescales
  Example: Politics topic in news articles
  - Long timescale (many years): constitution, congress, president
  - Middle timescale (tens of years): names of members in Congress
  - Short timescale (a few days): names of bills under discussion
Proposed model
  Multiscale Dynamic Topic Model (MDTM): a topic model for analyzing dynamics with multiple timescales
  - Robust: information loss is reduced by considering both short- and long-timescale dynamics
  - Efficient online inference: the model is updated using only newly obtained data; past data need not be stored
Standard topic model
  Latent Dirichlet Allocation (LDA) [Blei+03]
  - Basis of the proposed model
  - A document is modeled as a mixture of topics
  - Each word distribution is generated from a symmetric Dirichlet
  - No dynamics
  Graphical model (figure): topic proportions (Dirichlet), topic (Multinomial), word (Multinomial), word distribution (Dirichlet); plates over #docs, #words, #topics
  Legend: ○ latent variable, ● observed variable, □ repetition
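The LDA generative process summarized above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `generate_lda_corpus`, the fixed document length, and the hyper-parameter values `alpha` and `beta` are assumptions made for the example and do not come from the slides.

```python
import numpy as np

def generate_lda_corpus(n_docs, n_words_per_doc, n_topics, vocab_size,
                        alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus from the LDA generative process (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # One word distribution per topic, each drawn from a symmetric Dirichlet.
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs = []
    for _ in range(n_docs):
        # Topic proportions of this document, drawn from a Dirichlet.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # A latent topic for every word position (Multinomial over theta) ...
        topics = rng.choice(n_topics, size=n_words_per_doc, p=theta)
        # ... then the observed word (Multinomial over that topic's distribution).
        words = np.array([rng.choice(vocab_size, p=phi[z]) for z in topics])
        docs.append(words)
    return phi, docs
```

Nothing in this process depends on time, which is the "no dynamics" limitation the slide points out.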
Multiscale word distribution
  The word distribution at t is generated depending on the weighted sum of the multiscale word distributions at t-1
  - Word distribution at scale s: estimated from the epochs t-2^{s-1} to t-1
  - Figure: short-scale through long-scale word distributions at t-1 feeding into the word distribution at t
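Generative process of MDTM
  For the set of documents at epoch t:
  - The topic proportions' prior is generated from a Gamma distribution depending on its previous value
  - Each word distribution is generated from a Dirichlet distribution depending on the weighted sum of the multiscale word distributions ξ_{t-1} (one weight per scale, summed over #scales)
  - The topic proportions of each document are generated from a Dirichlet distribution with the topic proportions' prior
  - For each word, a topic is drawn from a Multinomial over the topic proportions, then the word is drawn from a Multinomial over that topic's word distribution
  * ξ_{t-1}: hyper-parameter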
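As a rough illustration of the word-distribution step, the sketch below forms the Dirichlet parameter as the weighted sum of the multiscale distributions at t-1 and draws one word distribution from it. The function names, the array layout (one row per scale), and the example numbers are assumptions for illustration; the slides do not show the exact parameterization, and the Gamma step for the topic proportions' prior is left out because its parameters are not given.

```python
import numpy as np

def dirichlet_parameter(xi_prev, weights):
    """Weighted sum over scales of the multiscale word distributions at t-1.

    xi_prev: array (n_scales, vocab_size), each row a word distribution.
    weights: array (n_scales,), one non-negative weight per scale.
    """
    xi_prev = np.asarray(xi_prev, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ xi_prev

def sample_word_distribution(xi_prev, weights, rng):
    """Draw the word distribution at epoch t for one topic (hypothetical helper).

    Assumes every entry of the weighted sum is strictly positive,
    which a Dirichlet parameter requires.
    """
    return rng.dirichlet(dirichlet_parameter(xi_prev, weights))
```

Because each row of `xi_prev` sums to one, the total of the Dirichlet parameter equals the sum of the weights, so larger weights pull the new word distribution more tightly toward the corresponding scale.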
Online inference
  Update the model at each epoch using
  - the newly obtained data
  - the previous model
  Stochastic EM
  - [E-step] collapsed Gibbs sampling of latent topics
  - [M-step] maximum joint likelihood estimation of parameters
  Figure: the model at t and the data at t+1 yield the model at t+1
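The E-step can be pictured with the standard collapsed Gibbs conditional for a Dirichlet-multinomial model whose word prior is the weighted sum of multiscale distributions. The sketch below is written under that assumption; the slides do not show the sampling equation, so the function name, argument layout, and the exact form are illustrative rather than the authors' implementation.

```python
import numpy as np

def topic_conditional(doc_topic_counts, topic_word_counts, topic_totals,
                      alpha, word_prior, w):
    """Probability of each topic for one token of word w (hypothetical helper).

    All counts must already exclude the token being resampled.
    doc_topic_counts:  (K,)   tokens of this document assigned to each topic
    topic_word_counts: (K, V) tokens of each word assigned to each topic at epoch t
    topic_totals:      (K,)   row sums of topic_word_counts
    alpha:             (K,)   topic proportions' prior
    word_prior:        (K, V) weighted sum of multiscale distributions per topic
    """
    doc_part = doc_topic_counts + alpha
    word_part = (topic_word_counts[:, w] + word_prior[:, w]) / (
        topic_totals + word_prior.sum(axis=1))
    p = doc_part * word_part
    return p / p.sum()

def resample_topic(rng, *args):
    """Draw a new topic for the token from its conditional distribution."""
    p = topic_conditional(*args)
    return int(rng.choice(len(p), p=p))
```

Only the counts of the current epoch and the prior built from the previous model enter the conditional, which is what lets the update run without revisiting past documents.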
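Estimation of multiscale distribution
  Maximum likelihood estimate
  - \xi^{(s)}_{t,w} = n^{(s)}_{t,w} / \sum_{w'} n^{(s)}_{t,w'}
    (word probability of scale s = word count of scale s, normalized over words)
  - n^{(s)}_{t,w} = \sum_{t'=t-2^{s-1}+1}^{t} n_{t',w}
    (word count of scale s = sum of the word counts at epochs t' from t-2^{s-1}+1 to t)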
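A direct, non-online version of this estimate is sketched below for a single topic. The function name, the 0-indexed epoch array, and the truncation of the window at the first epoch are assumptions made for the example.

```python
import numpy as np

def multiscale_distribution(counts_by_epoch, t, s):
    """Word distribution of scale s at epoch t (hypothetical helper).

    counts_by_epoch: array (n_epochs, vocab_size); row t' holds the word
    counts at epoch t' (0-indexed). The window of scale s covers the
    2**(s-1) most recent epochs ending at t, truncated at epoch 0.
    Assumes the window contains at least one word.
    """
    counts_by_epoch = np.asarray(counts_by_epoch, dtype=float)
    start = max(0, t - 2 ** (s - 1) + 1)
    window_count = counts_by_epoch[start:t + 1].sum(axis=0)
    return window_count / window_count.sum()
```

For example, with per-epoch counts `[[1, 1], [3, 1], [0, 2]]`, scale 1 at the last epoch uses only `[0, 2]`, while scale 2 sums the last two epochs to `[3, 3]`.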
Estimation of multiscale distribution
  Online update
  - n^{(s)}_{t,w} <- n^{(s)}_{t-1,w} + n_{t,w} - n_{t-2^{s-1},w}
    (current value = previous value + current word count - first word count in the scale)
    Here n^{(s)}_{t,w} is the word count of scale s and n_{t,w} is the word count at epoch t.
  - Required memory: the per-epoch word counts inside the longest scale must be kept so that the oldest one can be subtracted
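The online update amounts to a sliding-window sum: add the newest count and subtract the count that falls out of the window. A minimal sketch for one scale and one topic follows; the class name `SlidingScaleCount` and its interface are hypothetical.

```python
from collections import deque

import numpy as np

class SlidingScaleCount:
    """Running word count over the last 2**(scale-1) epochs (hypothetical helper)."""

    def __init__(self, scale, vocab_size):
        self.length = 2 ** (scale - 1)
        self.history = deque()              # per-epoch counts still in the window
        self.total = np.zeros(vocab_size)   # word count of this scale

    def update(self, new_count):
        new_count = np.asarray(new_count, dtype=float)
        # previous value + current word count
        self.total = self.total + new_count
        self.history.append(new_count)
        # - first word count in the scale, once the window is full
        if len(self.history) > self.length:
            self.total = self.total - self.history.popleft()
        return self.total
```

The `history` queue makes the memory cost visible: every per-epoch count inside the window has to be retained until it is subtracted.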
Approximated efficient estimation of multiscale word distribution
  - Decrease the update frequency for long-scale distributions
  - Store only the previous epoch count
Approximated efficient estimation of multiscale word distribution (example)
  Counts held at each epoch. Each number k stands for the word count at epoch k; for example, "1 2 3 4" in the scale=3 row is the count of s=3, the sum of the counts at t=1..4. The newly obtained counts at t=5..8 are 5, 6, 7, 8.

              t=4        t=5        t=6        t=7        t=8
  scale=3   1 2 3 4    1 2 3 4    1 2 3 4    1 2 3 4    5 6 7 8
  scale=2     3 4        3 4        5 6        5 6        7 8
  scale=1      4          5          6          7          8
  update                [s=1]     [s=1,2]     [s=1]    [s=1,2,3]
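One way to read the update schedule in this example is that scale s is refreshed only when t is a multiple of 2^{s-1}, by adding the newly formed count of scale s-1 to the count of scale s-1 that was held before. The sketch below implements that reading; it is inferred from the example rather than stated on the slide, and the function name and list layout are hypothetical.

```python
import numpy as np

def approximate_update(stored, new_count, t):
    """Refresh the stored per-scale counts in place for epoch t (1-indexed).

    stored[s-1] holds the count of scale s, for s = 1..S.
    Returns the list of scales that were refreshed at this epoch.
    """
    previous = [c.copy() for c in stored]    # values held before this epoch
    stored[0] = np.asarray(new_count, dtype=float)
    refreshed = [1]
    for s in range(2, len(stored) + 1):
        if t % 2 ** (s - 1) != 0:
            break                            # longer scales are not due either
        # New count of scale s-1 plus the count of scale s-1 held before.
        stored[s - 1] = stored[s - 2] + previous[s - 2]
        refreshed.append(s)
    return refreshed
```

Running it for t=5..8 from the t=4 state refreshes scales [1], [1, 2], [1], and [1, 2, 3] in turn, matching the update row of the example, and keeps one count array per scale between epochs instead of every per-epoch count.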
Experiments
  Data sets
  - NIPS: papers in NIPS from 1987 to 1999
  - PNAS: titles in PNAS from 1915 to 2005
  - Digg: blog posts in the social news site Digg from 1/29 to 2/20 in 2009
  - Addresses: the State of the Union addresses from 1790 to 2002
  Methods
  - MDTM: Online Multiscale Dynamic Topic Model
  - DTM: Online Dynamic Topic Model (MDTM with #scales=1)
  - LDAall: LDA that uses all past data for inference
  - LDAone: LDA that uses just the current data for inference
  - LDAonline: LDA with online inference
Average perplexity (standard deviation)
  Table: average perplexity of each method on each data set
  - MDTM can appropriately model the dynamics through its use of multiscale properties
  - DTM does not model the long-timescale dependencies
  - LDAall and LDAonline do not model the dynamics
  - LDAone ignores the past information
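For reference, perplexity is commonly computed as the exponential of the negative average log-likelihood per held-out word, with lower values indicating better prediction. The sketch below uses that standard definition; the slides do not describe the evaluation protocol, so the function name and inputs are assumptions.

```python
import numpy as np

def perplexity(theta, phi, docs):
    """Perplexity of held-out words under a topic model (hypothetical helper).

    theta: array (n_docs, n_topics), topic proportions per document.
    phi:   array (n_topics, vocab_size), word distribution per topic.
    docs:  list of integer arrays of word ids, one per document.
    """
    log_likelihood, n_tokens = 0.0, 0
    for d, words in enumerate(docs):
        # Probability of each word: mixture of the topic word distributions.
        p_words = theta[d] @ phi[:, words]
        log_likelihood += np.log(p_words).sum()
        n_tokens += len(words)
    return float(np.exp(-log_likelihood / n_tokens))
```

As a sanity check, a model that assigns uniform probability over a vocabulary of V words has perplexity V.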
Perplexity with different #scales
  Figure: perplexity versus #scales (panels: Digg, Addresses)
  - Perplexities decreased as #scales increased, which indicates the importance of considering multiscale dynamics
Estimated weights for each scale
  Figure: weight (λ) versus scale (panels: Addresses, Digg)
  - Weights decreased as the timescale lengthened: recent short-scale distributions are more informative for estimating the current distribution
Topic extraction 1 (NIPS)
  Speech recognition topic
  - 1992-1999: speech recognition word speaker training set tdnn time test speakers
  - 1992-1995: system data letter state letters neural utterance words phoneme classification
  - 1996-1999: state hmm system probabilities model words context hmms markov probability
  - 1992-1993: level phonetic segmentation language segment accuracy duration continuous unit male
  - 1994-1995: spectral feature false acoustic independent models normalization rate trained gradient
  - 1996-1997: log likelihood models sequence sequences hidden hybrid states frame transition
  - 1998-1999: hidden states models feature continuous modeling features adaption human acoustic
Topic extraction 2 (NIPS)
  Reinforcement learning topic
  - 1992-1999: learning state control action time policy reinforcement optimal actions recognition
  - 1992-1995: dynamic space model exploration states programming barto sutton goal task
  - 1996-1999: function state algorithm model agent decision step reward markov space
  - 1992-1993: robot based controller system forward level memory real jordan world
  - 1994-1995: skills policies singh adaptive iteration stochastic transition values expected based
  - 1996-1997: grid based memory controller continuous cost system temporal iteration interpolation
  - 1998-1999: rl machine policies environment iteration mdp singh finite update search
Conclusion
  - Topic model with multiscale dynamics
  - Efficient online inference
  - High predictive performance confirmed experimentally
  Future work
  - Automatic determination of the length of each scale and of #topics
  - Evaluation on other data, such as web access logs, blogs, and e-mail