A network approach to topic models

Slides:



Advertisements
Similar presentations
Fig. 5 Correlation of RNA expression and protein abundance.
Advertisements

Fig. 4 Altitudinal distributions of ice thickness change (m year−1) for the 650 glaciers. Altitudinal distributions of ice thickness change (m year−1)
Fig. 2 Transport properties of a BP transistor at low temperature.
Fig. 2 Global production, use, and fate of polymer resins, synthetic fibers, and additives (1950 to 2015; in million metric tons). Global production, use,
Fig. 2 CFD results. CFD results. Results of CFD simulations in horizontal (left column) and vertical (right column) cross-sections. All models oriented.
Fig. 2 Box plots of water use with lateral lengths.
Vibrational spectra of medieval human bones (Leopoli-Cencelle, Italy)
Fig. 3 Performance of HoloConvNet.
Fig. 5 Thermal conductivity of n-type ZrCoBi-based half-Heuslers.
Fig. 2 Meta-analysis: Country-specific effects on vote choice.
Fig. 1 Map of water stress and shale plays.
Fig. 1 Examples of experimental stimuli and behavioral performance.
Fig. 3 Saturation velocity of BP FETs.
Fig. 1 NP-free Ch-CNC droplets.
Fig. 3 Electron PSD in various regions.
Fig. 4 Resynthesized complex boronic acid derivatives based on different scaffolds on a millimole scale and corresponding yields. Resynthesized complex.
Fig. 6 Comparison of properties of water models.
Fig. 1 Mean and median RCR (Relative Citation Ratio) of Roadmap Epigenomics Program research articles for each year. Mean and median RCR (Relative Citation.
Fig. 2 Reference-fixing experiment, results.
Fig. 3 Scan rate effects on the layer edge current.
Fig. 6 Comparison of first- and second-generation predictions with HiTp experimental results for Co-Ti-Zr (first row), Co-Fe-Zr (second row), and Fe-Ti-Nb.
Fig. 3 Rotation experiment, setup.
Fig. 1 Product lifetime distributions for the eight industrial use sectors plotted as log-normal probability distribution functions (PDF). Product lifetime.
Fig. 1 Concept of the livestock transition in China between 1980 and Concept of the livestock transition in China between 1980 and The left-
Fig. 1 Distribution of total and fake news shares.
Fig. 3 Photon number statistics resulting from Fock state |l, S − l〉 interference. Photon number statistics resulting from Fock state |l, S − l〉 interference.
Fig. 2 2D QWs of different propagation lengths.
Electronic structure of the oligomer (n = 8) at the UB3LYP/6-31G
Fig. 1 Schematic illustration and atomic-scale rendering of a silica AFM tip sliding up and down a single-layer graphene step edge on an atomically flat.
Fig. 1 Histograms of the number of first messages received by men and women in each of our four cities. Histograms of the number of first messages received.
Fig. 5 Schematic phase diagrams of Ising spin systems and Mott transition systems. Schematic phase diagrams of Ising spin systems and Mott transition systems.
Fig. 1 Average contribution (million metric tons) of seafood-producing sectors, 2009–2014. Average contribution (million metric tons) of seafood-producing.
Fig. 4 The mechanical performances of thermally stable click-ionogels.
Fig. 2 Results of the learning and testing phases.
Fig. 5 The different fractions of the rates of energy flow in the oscillating thermal circuit for the experiment with L = 58.5 H shown in Fig. 4A. The.
Fig. 2 Comparison of laser-chemical tailored Ni(TPA/TEG) and Ni(TPA) composites. Comparison of laser-chemical tailored Ni(TPA/TEG) and Ni(TPA) composites.
Fig. 2 Folding motions of the TCO with strain-softening behavior.
Fig. 2 Desirability, quantified using the measures defined here, as a function of demographic variables of the user population. Desirability, quantified.
Fig. 4 Projected [CO2]-induced deficits in protein and minerals (Fe and Zn) and cumulative changes in vitamin B and cumulative changes in vitamin E derived.
Fig. 1 Empirical probability density functions of the estimated climatic drivers. Empirical probability density functions of the estimated climatic drivers.
Fig. 2 Top 10 countries, ecoregions, conservation hotspots, and KBAs with the largest area of restoration hotspots. Top 10 countries, ecoregions, conservation.
Fig. 4 Relationships between light and economic parameters.
Fig. 5 Comparison of the liquid products generated from photocatalytic CO2 reduction reactions (CO2RR) and CO reduction reactions (CORR) on two catalysts.
Schematic of the proposed brain-controlled assistive hearing device
Fig. 2 Behavioral modulation of brain change.
Fig. 1 Location of the Jirzankal Cemetery.
Fig. 3 Inference of African yam domestication history.
Fig. 4 CO2 emission changes triggered by the JJJ clean air policy.
Multiplexed four- and eight-channel devices for rapid processing
Fig. 4 The relationship between the total mean absolute momentum disturbance 〈∣p∣〉zB (in units of ℏ/D) and fringe visibility V. The relationship between.
Fig. 3 Comparisons of NDVI trends over the globally vegetated areas from 1982 to Comparisons of NDVI trends over the globally vegetated areas from.
Experimenter gender and replicability in science
Fig. 1 Effects of experimental warming on nematode communities across the gradient of plant species richness. Effects of experimental warming on nematode.
Fig. 1 Schematic depiction of a paradigm for rapid and guided discovery of materials through iterative combination of ML with HiTp experimentation. Schematic.
Fig. 4 Spatial mapping of the distribution and intensity of industrial fishing catch. Spatial mapping of the distribution and intensity of industrial fishing.
Fig. 6 MD simulations of assembled binary supraballs.
Fig. 3 Results of the proposed algorithm applied to a detail from the Adam panel. Results of the proposed algorithm applied to a detail from the Adam panel.
Fig. 1 Quantifying dissonance.
Fig. 3 Performance of the generative model G, with and without stack-augmented memory. Performance of the generative model G, with and without stack-augmented.
Fig. 2 Growth kinetics of borophene on silver.
Fig. 3 High-tide flood extent at water levels of 1. 73, 2. 03, 2
Fig. 4 Behavior of resistance peak near density nm = 5.
Fig. 2 Comparison between the different reflective metasurface proposals when θi = 0° and θr = 70°. Comparison between the different reflective metasurface.
Fig. 4 Scaling laws distinguish biochemical networks from random networks across levels of organization. Scaling laws distinguish biochemical networks.
Fig. 4 Analysis of charge storage mechanism in ZnxCo1−xO NRs.
Fig. 2 Spatial distribution of five city groups.
Fig. 2 Speaker-independent speech separation with ODAN.
Fig. 3 Calculated electronic structure of ZrCoBi.
Fig. 5 Enhancers and super-enhancers identified in various brain cell populations. Enhancers and super-enhancers identified in various brain cell populations.
Fig. 2 Longitudinal (local) resistance Rxx and nonlocal resistance Rnl with different terminal configurations on the six-terminal device measured without.
Presentation transcript:

A network approach to topic models by Martin Gerlach, Tiago P. Peixoto, and Eduardo G. Altmann Science Volume 4(7):eaaq1360 July 18, 2018 Copyright © 2018 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).

Fig. 1 Two approaches to extract information from collections of texts. Two approaches to extract information from collections of texts. Topic models represent the texts as a document-word matrix (how often each word appears in each document), which is then written as a product of two matrices of smaller dimensions with the help of the latent variable topic. The approach we propose here represents texts as a network and infers communities in this network. The nodes consists of documents and words, and the strength of the edge between them is given by the number of occurrences of the word in the document, yielding a bipartite multigraph that is equivalent to the word-document matrix used in topic models. Martin Gerlach et al. Sci Adv 2018;4:eaaq1360 Copyright © 2018 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).

Fig. 2 Parallelism between topic models and community detection methods. Parallelism between topic models and community detection methods. The pLSI and SBMs are mathematically equivalent, and therefore, methods from community detection (for example, the hSBM we propose in this study) can be used as alternatives to traditional topic models (for example, LDA). Martin Gerlach et al. Sci Adv 2018;4:eaaq1360 Copyright © 2018 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).

Fig. 3 LDA is unable to infer non-Dirichlet topic mixtures. LDA is unable to infer non-Dirichlet topic mixtures. Visualization of the distribution of topic mixtures logP(θd) for different synthetic and real data sets in the two-simplex using K = 3 topics. We show the true distribution in the case of the synthetic data (top) and the distributions inferred by LDA (middle) and SBM (bottom). (A) Synthetic data sets with Dirichlet mixtures from the generative process of LDA with document hyperparameters αd = 0.01 × (1/3, 1/3, 1/3) (left) and αd = 100 × (1/3, 1/3, 1/3) (right) leading to different true mixture distributions logP(θd). We fix the word hyperparameter βrw = 0.01, D = 1000 documents, V = 100 different words, and text length kd = 1000. (B) Synthetic data sets with non-Dirichlet mixtures from a combination of two Dirichlet mixtures, respectively: αd ϵ {100 × (1/3, 1/3, 1/3), 100 × (0.1, 0.8, 0.1)} (left) and αd ϵ {100 × (0.1, 0.2, 0.7), 100 × (0.1, 0.7, 0.2)} (right). (C) Real data sets with unknown topic mixtures: Reuters (left) and Web of Science (right) each containing D = 1000 documents. For LDA, we use hyperparameter optimization. For SBM, we use an overlapping, non-nested parametrization in which each document belongs to its own group such that B = D + K, allowing for an unambiguous interpretation of the group membership as topic mixtures in the framework of topic models. Martin Gerlach et al. Sci Adv 2018;4:eaaq1360 Copyright © 2018 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).

Fig. 4 Comparison between LDA and SBM for artificial corpora drawn from LDA. Comparison between LDA and SBM for artificial corpora drawn from LDA. Description length Σ of LDA and hSBM for an artificial corpus drawn from the generative process of LDA with K = 10 topics. (A) Difference in Σ, ΔΣ = Σi − ΣLDA−trueprior, compared to the LDA with true priors—the model that generated the data—as a function of the text length kd = m and D = 106 documents. (B) Normalized Σ (per word) as a function of the number of documents D for fixed text length kd = m = 128. The four curves correspond to different choices in the parametrization of the topic models: (i) LDA with noninformative (noninf) priors (light blue, ×), (ii) LDA with true priors, that is, the hyperparameters used to generate the artificial corpus (dark blue, •), (iii) hSBM with without clustering of documents (light orange, ▲), and (iv) hSBM with clustering of documents (dark orange, ▼). Martin Gerlach et al. Sci Adv 2018;4:eaaq1360 Copyright © 2018 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).

Fig. 5 Inference of hSBM to articles from the Wikipedia. Inference of hSBM to articles from the Wikipedia. Articles from three categories (chemical physics, experimental physics, and computational biology). The first hierarchical level reflects bipartite nature of the network with document nodes (left) and word nodes (right). The grouping on the second hierarchical level is indicated by solid lines. We show examples for nodes that belong to each group on the third hierarchical level (indicated by dotted lines): For word nodes, we show the five most frequent words; for document nodes, we show three (or fewer) randomly selected articles. For each word, we calculate the dissemination coefficient UD, which quantifies how unevenly words are distributed among documents (60): UD = 1 indicates the expected dissemination from a random null model; the smaller UD (0 < UD < 1), the more unevenly a word is distributed. We show the 5th, 25th, 50th, 75th, and 95th percentile for each group of word nodes on the third level of the hierarchy. Intl. Soc. for Comp. Biol., International Society for Computational Biology; RRKM theory, Rice-Ramsperger-Kassel-Marcus theory. Martin Gerlach et al. Sci Adv 2018;4:eaaq1360 Copyright © 2018 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).