Lecture 16: Unsupervised Learning from Text Padhraic Smyth Department of Computer Science University of California, Irvine

Outline General aspects of text mining Named-entity extraction, question-answering systems, etc Unsupervised learning from text documents Motivation Topic model and learning algorithm Results Extensions Author-topic models Applications Demo of topic browser Future directions

Different Aspects of Text Mining Named-entity extraction: parsers to recognize names of people, places, genes, etc. (e.g., the GATE system) Question-answering systems News summarization: Google News, Newsblaster Document clustering Standard algorithms: k-means, hierarchical Probabilistic approaches Topic modeling Representing documents as mixtures of topics And many more…

Named Entity-Extraction Often a combination of Knowledge-based approach (rules, parsers) Machine learning (e.g., hidden Markov model) Dictionary Non-trivial since entity-names can be confused with real names E.g., gene name ABS and abbreviation ABS Also can look for co-references E.g., “IBM today…… Later, the company announced…..” Very useful as a preprocessing step for data mining, e.g., use entity-names to train a classifier to predict the category of an article

Example: GATE/ANNIE extractor GATE: free software infrastructure for text analysis (University of Sheffield, UK) ANNIE: widely used entity-recognizer, part of GATE

Question-Answering Systems See additional slides on Dumais et al AskMSR system

Unsupervised Learning from Text Large collections of unlabeled documents: the Web, digital libraries, archives, etc. Often wish to organize/summarize/index/tag these documents automatically We will look at probabilistic techniques for clustering and topic extraction from sets of documents

Outline Background on statistical text modeling Unsupervised learning from text documents Motivation Topic model and learning algorithm Results Extensions Author-topic models Applications Demo of topic browser Future directions

Pennsylvania Gazette ,000 articles 25 million words

Enron data: 250,000 emails, 28,000 authors

Other Examples of Data Sets CiteSeer digital collection: 700,000 papers, 700,000 authors, MEDLINE collection 16 million abstracts in medicine/biology US Patent collection and many more....

Problems of Interest What topics do these documents “span”? Which documents are about a particular topic? How have topics changed over time? What does author X write about? Who is likely to write about topic Y? Who wrote this specific document? and so on…..

Probability Models for Documents Example: 50,000 possible words in our vocabulary Simple memoryless model, aka "bag of words" 50,000-sided die: each side of the die represents 1 word a non-uniform die: each side/word has its own probability to generate N words we toss the die N times gives a "bag of words" (no sequence information) This is a simple probability model: p(document | θ) = ∏_i p(word_i | θ) to "learn" the model we just count frequencies: p(word_i) = number of occurrences of word i / total number of words
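A minimal sketch (not from the slides) of estimating this bag-of-words multinomial by counting, with hypothetical toy documents:

```python
import math
from collections import Counter

docs = [
    "the bank approved the loan",
    "the river bank flooded",
]

# "Learning" = counting: p(word) = count(word) / total number of words
counts = Counter()
for doc in docs:
    counts.update(doc.split())
total = sum(counts.values())
theta = {word: n / total for word, n in counts.items()}

# Memoryless model: log p(doc | theta) = sum_i log p(w_i | theta)
def log_prob(doc, theta):
    return sum(math.log(theta[w]) for w in doc.split())

print(theta)
print(log_prob("the bank loan", theta))
```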

The Multinomial Model Example: tossing a 6-sided die P = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6] Multinomial model for documents: V-sided "die" = probability distribution over possible words Some words have higher probability than others Document with N words generated by N memoryless "draws" Typically interested in conditional multinomials, e.g., p(words | spam) versus p(words | non-spam)
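For reference, the multinomial likelihood described above can be written out as follows (a standard formula consistent with the slide, not shown on it): with vocabulary size V, document length N, and n_v the count of word v,

```latex
p(\mathbf{w} \mid \theta)
  = \frac{N!}{\prod_{v=1}^{V} n_v!} \;\prod_{v=1}^{V} \theta_v^{\,n_v},
  \qquad \sum_{v=1}^{V} \theta_v = 1 .
```

A conditional multinomial such as p(words | spam) simply uses a separate parameter vector θ per class.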

Real examples of Word Multinomials

Statistical Inference (diagram): a probabilistic model relates Parameters to Real-World Data through P(Data | Parameters); statistical inference runs in the opposite direction, from data back to P(Parameters | Data).

A Graphical Model Word nodes w_1, w_2, …, w_n each depend on a parameter node θ θ = "parameter vector" = set of probabilities, one per word p(doc | θ) = ∏_i p(w_i | θ)

Another view.... The same model in "plate notation": a single node w_i inside a plate labeled i = 1:n, with θ outside the plate p(doc | θ) = ∏_{i=1:n} p(w_i | θ) Items inside the plate are conditionally independent given the variable outside the plate There are "n" conditionally independent replicates represented by the plate

Being Bayesian.... Add a node α above θ: this is a prior on our multinomial parameters, e.g., a simple Dirichlet smoothing prior with symmetric parameter α, to avoid estimates of probabilities that are 0

Being Bayesian.... Learning: infer p(θ | words, α), which is proportional to p(words | θ) p(θ | α)
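A worked version of that posterior under a symmetric Dirichlet(α) prior (the standard conjugate result; the slide only sketches it):

```latex
p(\theta \mid \mathbf{w}, \alpha)
  \;\propto\; \prod_{v=1}^{V} \theta_v^{\,n_v + \alpha - 1}
  \;=\; \mathrm{Dirichlet}(n_1+\alpha,\ldots,n_V+\alpha),
\qquad
\mathbb{E}[\theta_v \mid \mathbf{w}, \alpha] \;=\; \frac{n_v + \alpha}{N + V\alpha}.
```

Because of the +α smoothing, no word receives an estimated probability of exactly 0.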

Multiple Documents The word plate i = 1:n is nested inside a document plate d = 1:D, with θ shared across documents p(corpus | θ) = ∏_{d=1:D} p(doc_d | θ)

Different Document Types Words w_i in a plate 1:n, generated from parameters θ p(w | θ) is a multinomial over words

Different Document Types The word plate 1:n is now nested inside a document plate 1:D p(w | θ) is a multinomial over words

Different Document Types Add an observed label z_d for each document p(w | θ, z_d) is a multinomial over words z_d is the "label" for each doc

Different Document Types p(w | θ, z_d) is a multinomial over words z_d is the "label" for each doc Different multinomials, depending on the value of z_d (discrete) θ now represents |z| different multinomials

Unknown Document Types Now z_d is a hidden (unobserved) variable Now the values of z for each document are unknown - hopeless?

Unknown Document Types Now the values of z for each document are unknown - hopeless? Not hopeless :) Can learn about both z and θ, e.g., with the EM algorithm (see the sketch below) This gives probabilistic clustering p(w | z=k, θ) is the kth multinomial over words
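A minimal EM sketch for this document-clustering model (a mixture of multinomials), assuming documents are given as a word-count matrix; the toy data and names below are illustrative, not from the slides:

```python
import numpy as np

def em_multinomial_mixture(X, K, n_iter=50, seed=0):
    """EM for a mixture of multinomials.
    X: (D, V) matrix of word counts; K: number of clusters."""
    rng = np.random.default_rng(seed)
    D, V = X.shape
    pi = np.full(K, 1.0 / K)                       # cluster priors p(z = k)
    theta = rng.dirichlet(np.ones(V), size=K)      # per-cluster word distributions

    for _ in range(n_iter):
        # E-step: responsibilities r[d, k] proportional to pi[k] * prod_w theta[k, w]^X[d, w]
        log_r = np.log(pi) + X @ np.log(theta).T
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate cluster priors and word distributions (tiny smoothing)
        pi = r.mean(axis=0)
        theta = r.T @ X + 1e-6
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta, r

# Toy usage: 4 documents over a 3-word vocabulary, 2 clusters
X = np.array([[5, 1, 0], [4, 2, 0], [0, 1, 5], [0, 2, 4]])
pi, theta, r = em_multinomial_mixture(X, K=2)
print(r.argmax(axis=1))   # hard cluster assignment per document
```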

Topic Model Each word w_i now has its own topic assignment z_i (inside the word plate), and each document has its own topic distribution θ_d z_i is a "label" for each word p(w | φ, z_i = k) = multinomial over words = a "topic" p(z_i | θ_d) = distribution over topics that is document-specific

Key Features of Topic Models Generative model for documents in form of bags of words Allows a document to be composed of multiple topics Much more powerful than 1 doc -> 1 cluster Completely unsupervised Topics learned directly from data Leverages strong dependencies at word level AND large data sets Learning algorithm Gibbs sampling is the method of choice Scalable Linear in number of word tokens Can be run on millions of documents

Document generation as a probabilistic process Each topic is a distribution over words, with parameters φ^(j) Each document is a mixture of topics, with parameters θ^(d) Each word is chosen from a single topic
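A sketch of this generative process in code (the topics φ, the hyperparameter α, and the vocabulary are made-up toy values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["money", "bank", "loan", "river", "stream"]
# Two toy topics: phi[j] is a distribution over the vocabulary
phi = np.array([
    [0.40, 0.35, 0.25, 0.00, 0.00],   # a "finance" topic
    [0.00, 0.35, 0.00, 0.35, 0.30],   # a "river" topic
])
alpha = 0.5   # Dirichlet hyperparameter for document-topic mixtures

def generate_document(n_words):
    theta = rng.dirichlet(alpha * np.ones(len(phi)))   # this document's topic mixture
    words, topics = [], []
    for _ in range(n_words):
        z = rng.choice(len(phi), p=theta)              # choose a topic for this word
        w = rng.choice(len(vocab), p=phi[z])           # choose a word from that topic
        words.append(vocab[w])
        topics.append(z)
    return words, topics, theta

print(generate_document(10))
```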

Example of generating words Two topics: topic 1 puts its probability on MONEY, BANK, LOAN; topic 2 on RIVER, STREAM, BANK Three documents with different topic mixtures θ: the first is generated entirely from topic 1 (MONEY BANK BANK LOAN BANK MONEY …), the second mixes both topics (RIVER MONEY BANK STREAM BANK …), the third is generated entirely from topic 2 (RIVER BANK STREAM BANK RIVER BANK …) Each word is tagged with the topic that generated it

Statistical inference The same documents, but now the topics, the mixtures θ, and the per-word topic assignments are all unknown (shown as "?"); the inference task is to recover them from the observed words

Bayesian Inference Three sets of latent variables: topic mixtures θ, word distributions φ, topic assignments z Integrate out θ and φ and estimate the topic assignments z directly (summing over the other terms) Use Gibbs sampling for approximate inference (the standard conditional is shown below)
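The collapsed conditional used for this is the standard Griffiths & Steyvers (2004) update, reproduced here for reference (counts with subscript -i exclude the current token; T topics, vocabulary size W):

```latex
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w})
  \;\propto\;
  \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}
  \;\cdot\;
  \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}
```

The first factor measures how often word w_i has been assigned to topic j; the second measures how often topic j appears in document d_i.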

Gibbs Sampling Start with random assignments of words to topics Repeat M iterations Repeat for all words i Sample a new topic assignment for word i conditioned on all other topic assignments Each sample is simple: draw from a multinomial represented as a ratio of appropriate counts
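A compact sketch of this sampler, assuming documents are given as lists of word ids (toy data below; this is an illustration, not the implementation used in the lecture):

```python
import numpy as np

def lda_gibbs(docs, V, T, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for a simple topic model.
    docs: list of lists of word ids in [0, V); T: number of topics."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndt = np.zeros((D, T))          # topic counts per document
    ntw = np.zeros((T, V))          # word counts per topic
    nt = np.zeros(T)                # total words per topic
    z = [rng.integers(T, size=len(d)) for d in docs]   # random initial assignments

    for d, doc in enumerate(docs):                     # initialize the count tables
        for i, w in enumerate(doc):
            t = z[d][i]
            ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                            # remove current assignment
                ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1
                # ratio of counts: unnormalized p(topic | everything else)
                p = (ntw[:, w] + beta) / (nt + V * beta) * (ndt[d] + alpha)
                t = rng.choice(T, p=p / p.sum())       # sample a new topic
                z[d][i] = t
                ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
    return z, ndt, ntw

# Toy corpus over a vocabulary of 5 word ids
docs = [[0, 1, 2, 0, 1], [3, 4, 1, 3, 4], [0, 2, 2, 1], [3, 3, 4, 1]]
z, ndt, ntw = lda_gibbs(docs, V=5, T=2)
print(ntw)   # learned topic-word counts
```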

16 Artificial Documents Can we recover the original topics and topic mixtures from this data?

Starting the Gibbs Sampling Assign word tokens randomly to topics (in the figure, each token is colored by its topic: topic 1 or topic 2)

After 1 iteration

After 4 iterations

After 32 iterations

More Details on Learning Gibbs sampling for x and z Typically run several hundred Gibbs iterations 1 iteration = full pass through all words in all documents Estimating θ and φ an x and z sample -> point estimates non-informative Dirichlet priors for θ and φ Computational efficiency Learning is linear in the number of word tokens Memory requirements can be a limitation for large corpora Predictions on new documents can average over θ and φ (from different samples, different runs)
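For reference, the point estimates obtained from a single sample of the assignments are the usual smoothed count ratios (standard formulas, not spelled out on the slide):

```latex
\hat{\phi}^{(w)}_{j} = \frac{n^{(w)}_{j} + \beta}{n^{(\cdot)}_{j} + W\beta},
\qquad
\hat{\theta}^{(d)}_{j} = \frac{n^{(d)}_{j} + \alpha}{n^{(d)}_{\cdot} + T\alpha},
```

where n^{(w)}_j is the number of times word w is assigned to topic j and n^{(d)}_j is the number of words in document d assigned to topic j.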

History of topic models origins in statistics: latent class models in social science admixture models in statistical genetics applications in computer science Hofmann, SIGIR, 1999 Blei, Ng, Jordan, JMLR 2003 Griffiths and Steyvers, PNAS, 2004 more recent work author-topic models: Steyvers et al, Rosen-Zvi et al, 2004 Hierarchical topics: McCallum et al, 2006 Correlated topic models: Blei and Lafferty, 2005 Dirichlet process models: Teh, Jordan, et al large-scale web applications: Buntine et al, 2004, 2005 undirected models: Welling et al, 2004

Topic = probability distribution over words Important point: these distributions are learned in a completely automated “unsupervised” fashion from the data

Examples of Topics from CiteSeer

Four example topics from NIPS

Examples of Topics from New York Times
Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN _STOCK_INDEX
Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP

History of topic models Latent class models in statistics (late 60's) "Aspect model", Hofmann (1999) Original application to documents LDA Model: Blei, Ng, and Jordan (2001, 2003) Variational methods Topics Model: Griffiths and Steyvers (2003, 2004) Gibbs sampling approach (very efficient) More recent work on alternative (but similar) models, e.g., by Max Welling (ICS), Buntine, McCallum, and others

Comparing Topics and Other Approaches Clustering documents Computationally simpler… But a less accurate and less flexible model LSI/LSA/SVD Linear projection of V-dim word vectors into lower dimensions Less interpretable Not generalizable E.g., to authors or other side-information Not as accurate E.g., precision-recall: Hofmann, Blei et al, Buntine, etc. Probabilistic models such as topic models "next-generation" text modeling, after LSI provide a modular extensible framework

Clusters v. Topics Hidden Markov Models in Molecular Biology: New Algorithms and Applications Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification.
One Cluster: [cluster 88] model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov
Multiple Topics: [topic 10] state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling; [topic 37] genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes

Examples of Topics learned from Proceedings of the National Academy of Sciences (Griffiths and Steyvers, PNAS, 2004)
Topic: FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
Topic: HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
Topic: MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
Topic: STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
Topic: NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
Topic: TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

Examples of PNAS topics
Topic: PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS
Topic: ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS
Topic: CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES
Topic: MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY
Topic: STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST
Topic: MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN
Topic: MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS

What can Topic Models be used for? Queries Who writes on this topic? e.g., finding experts or reviewers in a particular area What topics does this person do research on? Comparing groups of authors or documents Discovering trends over time Detecting unusual papers and authors Interactive browsing of a digital library via topics Parsing documents (and parts of documents) by topic and more…..

What is this paper about? Empirical Bayes screening for multi-item associations Bill DuMouchel and Daryl Pregibon, ACM SIGKDD 2001 Most likely topics according to the model are… 1. data, mining, discovery, association, attribute.. 2. set, subset, maximal, minimal, complete,… 3. measurements, correlation, statistical, variation, 4. Bayesian, model, prior, data, mixture,…..

3 of 300 example topics (TASA)

Automated Tagging of Words (numbers & colors indicate topic assignments)

Experiments on Various Data Sets Corpora CiteSeer: 160K abstracts, 85K authors NIPS: 1.7K papers, 2K authors Enron: 250K emails, 28K authors (senders) Medline: 300K abstracts, 128K authors Removed stop words; no stemming Ignore word order, just use word counts Processing time: NIPS: 2000 Gibbs iterations in about 8 hours CiteSeer: 2000 Gibbs iterations in about 4 days

Four example topics from CiteSeer (T=300)

More CiteSeer Topics

Temporal patterns in topics: hot and cold topics We have CiteSeer papers spanning a range of years For each year, calculate the fraction of words assigned to each topic -> a time-series for each topic Hot topics become more prevalent Cold topics become less prevalent
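A small sketch of how such a time series could be computed from the sampled assignments (the inputs `z` and `doc_year` are hypothetical names for the per-token topic assignments and per-document years; not code from the lecture):

```python
import numpy as np

def topic_fractions_by_year(z, doc_year, T):
    """z: list (one entry per document) of integer arrays of topic assignments;
    doc_year: year of each document; T: number of topics.
    Returns a dict: year -> length-T vector of topic fractions."""
    fractions = {}
    for year in sorted(set(doc_year)):
        counts = np.zeros(T)
        for zd, y in zip(z, doc_year):
            if y == year:
                counts += np.bincount(zd, minlength=T)
        fractions[year] = counts / max(counts.sum(), 1)
    return fractions

# Hot topics: fractions trending up over the years; cold topics: trending down.
```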

Four example topics from NIPS (T=100)

NIPS: support vector topic

NIPS: neural network topic

Pennsylvania Gazette ,000 articles (courtesy of David Newman & Sharon Block, UC Irvine)

Pennsylvania Gazette Data courtesy of David Newman (CS Dept) and Sharon Block (History Dept)

Topic trends from New York Times (330,000 articles)
Tour-de-France: TOUR RIDER LANCE_ARMSTRONG TEAM BIKE RACE FRANCE
Quarterly Earnings: COMPANY QUARTER PERCENT ANALYST SHARE SALES EARNING
Anthrax: ANTHRAX LETTER MAIL WORKER OFFICE SPORES POSTAL BUILDING

Enron data: 250,000 emails, 28,000 authors

Enron business topics

Enron: non-work topics…

Enron: public-interest topics...

PubMed-Query Topics

PubMed-Query: Topics by Country

Examples of Topics from New York Times: the same four topics shown earlier (Stock Market, Wall Street Firms, Terrorism, Bankruptcy)

Collocation Topic Model (graphical model: topic mixture -> topic -> word, with a binary switch x at each word position) For each document, choose a mixture of topics For every word slot, sample a topic If x=0, sample a word from the topic If x=1, sample a word from the distribution based on the previous word (a small generative sketch follows)
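A rough generative sketch of the switch variable described above (all distributions here are hypothetical toy values; the slide does not give an implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["dow", "jones", "rises", "bank", "money"]

phi = np.array([                    # topic-word distributions (toy)
    [0.4, 0.1, 0.2, 0.1, 0.2],
    [0.1, 0.1, 0.1, 0.4, 0.3],
])
# successor[w] = distribution over the word that follows word w (toy values)
successor = np.full((5, 5), 0.2)
successor[0] = [0.0, 0.9, 0.05, 0.025, 0.025]   # "jones" very likely after "dow"
p_collocation = 0.3                              # p(x = 1)

def generate(n_words, theta):
    words, prev = [], None
    for _ in range(n_words):
        x = prev is not None and rng.random() < p_collocation
        if x:                                   # x = 1: word depends on the previous word
            w = rng.choice(5, p=successor[prev])
        else:                                   # x = 0: word comes from a sampled topic
            z = rng.choice(2, p=theta)
            w = rng.choice(5, p=phi[z])
        words.append(vocab[w]); prev = w
    return words

print(generate(8, theta=np.array([0.7, 0.3])))
```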

Example: "DOW JONES RISES" In the model, DOW is drawn from a topic; for JONES the switch is x=1, so it is drawn from the distribution over words that follow DOW; for RISES the switch is x=0, so it is drawn from a topic JONES is more likely explained as a word following DOW than as a word sampled from a topic Result: DOW_JONES recognized as a collocation

Using Topic Models for Information Retrieval

Stability of Topics Content of topics is arbitrary across runs of model (e.g., topic #1 is not the same across runs) However, Majority of topics are stable over processing time Majority of topics can be aligned across runs Topics appear to represent genuine structure in data

Comparing NIPS topics from the same Markov chain KL distance between topics at t1 = 1000 and (re-ordered) topics at t2 = 2000 Best KL = 0.54, worst KL = 4.78

Comparing NIPS topics from two different Markov chains KL distance between topics from chain 1 and (re-ordered) topics from chain 2 Best KL = 1.03

Outline Background on statistical text modeling Unsupervised learning from text documents Motivation Topic model and learning algorithm Results Extensions Author-topic models Applications Demo of topic browser Future directions

Approach The author-topic model: an extension of the topic model linking authors and topics authors -> topics -> words learned from data completely unsupervised, no labels generative model Different questions or queries can be answered by appropriate probability calculus E.g., p(author | words in document) E.g., p(topic | author)

Graphical Model (built up over the next few slides): a node x for the Author and a node z for the Topic

Add a node w for the Word: x (Author) -> z (Topic) -> w (Word)

The x, z, w nodes sit inside a plate of size n (the words in a document)

Add the observed author set a for each document, and an outer document plate of size D

Add the parameter nodes: θ = Author-Topic distributions, φ = Topic-Word distributions

Generative Process Let's assume authors A1 and A2 collaborate and produce a paper A1 has multinomial topic distribution θ1 A2 has multinomial topic distribution θ2 For each word in the paper: 1. Sample an author x (uniformly) from {A1, A2} 2. Sample a topic z from θx 3. Sample a word w from the multinomial topic distribution φz (a code sketch of these steps follows)
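A sketch of these three steps in code (θ, φ, and the vocabulary are toy placeholders, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["data", "model", "query", "index", "learning"]

theta = {                                    # author-topic distributions (toy)
    "A1": np.array([0.8, 0.2]),
    "A2": np.array([0.3, 0.7]),
}
phi = np.array([                             # topic-word distributions (toy)
    [0.3, 0.3, 0.1, 0.1, 0.2],
    [0.1, 0.1, 0.4, 0.3, 0.1],
])

def generate_word(coauthors):
    x = coauthors[rng.integers(len(coauthors))]      # 1. sample an author uniformly
    z = rng.choice(len(phi), p=theta[x])             # 2. sample a topic from theta_x
    w = rng.choice(len(vocab), p=phi[z])             # 3. sample a word from phi_z
    return x, z, vocab[w]

paper = [generate_word(["A1", "A2"]) for _ in range(10)]
print(paper)
```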

Graphical Model (complete): for each of the D documents with author set a, every one of its n word slots has an author assignment x, a topic assignment z, and an observed word w; θ = Author-Topic distributions, φ = Topic-Word distributions

Learning Observed: W = observed words, A = sets of known authors Unknown: x, z = hidden variables; θ, φ = unknown parameters Interested in: p(x, z | W, A) p(θ, φ | W, A) But exact learning is not tractable

Step 1: Gibbs sampling of x and z, averaging over the unknown parameters θ and φ (they are integrated out of the model)

Step 2: estimates of θ and φ, conditioned on particular samples of x and z

Gibbs Sampling Need full conditional distributions for the hidden variables The probability of assigning the current word i to topic j and author k, given everything else, is proportional to the product of two count ratios: the number of times word w_i is assigned to topic j, and the number of times topic j is assigned to author k (see the equation below)
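In the standard author-topic formulation (Rosen-Zvi et al., 2004) that conditional is, for reference (counts exclude the current token; V = vocabulary size, T = number of topics, and author k must belong to the document's author set a_d):

```latex
P(z_i = j,\, x_i = k \mid w_i = m, \mathbf{z}_{-i}, \mathbf{x}_{-i}, \mathbf{w}_{-i}, \mathbf{a}_d)
\;\propto\;
\frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}
\;\cdot\;
\frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}
```

where C^{WT}_{mj} counts how often word m is assigned to topic j and C^{AT}_{kj} counts how often topic j is assigned to author k.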

Authors and Topics (CiteSeer Data)

Some likely topics per author (CiteSeer) Author = Andrew McCallum, U Mass: Topic 1: classification, training, generalization, decision, data,… Topic 2: learning, machine, examples, reinforcement, inductive,….. Topic 3: retrieval, text, document, information, content,… Author = Hector Garcia-Molina, Stanford: - Topic 1: query, index, data, join, processing, aggregate…. - Topic 2: transaction, concurrency, copy, permission, distributed…. - Topic 3: source, separation, paper, heterogeneous, merging….. Author = Paul Cohen, USC/ISI: - Topic 1: agent, multi, coordination, autonomous, intelligent…. - Topic 2: planning, action, goal, world, execution, situation… - Topic 3: human, interaction, people, cognitive, social, natural….

Finding unusual papers for an author Perplexity = exp[entropy(words | model)] = a measure of surprise for the model on the data Can calculate the perplexity of unseen documents, conditioned on the model for a particular author
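Written out, the quantity described above is the usual per-word perplexity (standard definition):

```latex
\mathrm{perplexity}(\mathbf{w}_{\mathrm{doc}})
  = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid \text{model, author}) \right)
```

so a low perplexity means the author's model finds the document unsurprising, and an unusually high perplexity flags an unusual paper for that author.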

Papers and Perplexities: M_Jordan
Factorial Hidden Markov Models: 687
Learning from Incomplete Data: 702
Median perplexity: 2567
Defining and Handling Transient Fields in Pjama (An Orthogonally Persistent Java): 16021

Papers and Perplexities: T_Mitchell
Explanation-based Learning for Mobile Robot Perception: 1093
Learning to Extract Symbolic Knowledge from the Web: 1196
Median perplexity: 2837
Text Classification from Labeled and Unlabeled Documents using EM: 3802
A Method for Estimating Occupational Radiation Dose…: 8814

Who wrote what? A method 1 is described which like the kernel 1 trick 1 in support 1 vector 1 machines 1 SVMs 1 lets us generalize distance 1 based 2 algorithms to operate in feature 1 spaces usually nonlinearly related to the input 1 space This is done by identifying a class of kernels 1 which can be represented as norm 1 based 2 distances 1 in Hilbert spaces It turns 1 out that common kernel 1 algorithms such as SVMs 1 and kernel 1 PCA 1 are actually really distance 1 based 2 algorithms and can be run 2 with that class of kernels 1 too As well as providing 1 a useful new insight 1 into how these algorithms work the present 2 work can form the basis 1 for conceiving new algorithms This paper presents 2 a comprehensive approach for model 2 based 2 diagnosis 2 which includes proposals for characterizing and computing 2 preferred 2 diagnoses 2 assuming that the system 2 description 2 is augmented with a system 2 structure 2 a directed 2 graph 2 explicating the interconnections between system 2 components 2 Specifically we first introduce the notion of a consequence 2 which is a syntactically 2 unconstrained propositional 2 sentence 2 that characterizes all consistency 2 based 2 diagnoses 2 and show 2 that standard 2 characterizations of diagnoses 2 such as minimal conflicts 1 correspond to syntactic 2 variations 1 on a consequence 2 Second we propose a new syntactic 2 variation on the consequence 2 known as negation 2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm 2 for computing consequences in NNF given a structured system 2 description We show that if the system 2 structure 2 does not contain cycles 2 then there is always a linear size 2 consequence 2 in NNF which can be computed in linear time 2 For arbitrary 1 system 2 structures 2 we show a precise connection between the complexity 2 of computing 2 consequences and the topology of the underlying system 2 structure 2 Finally we present 2 an algorithm 2 that enumerates 2 the preferred 2 diagnoses 2 characterized by a consequence 2 The algorithm 2 is shown 1 to take linear time 2 in the size 2 of the consequence 2 if the preference criterion 1 satisfies some general conditions Written by (1) Scholkopf_B Written by (2) Darwiche_A Test of model: 1) artificially combine abstracts from different authors 2) check whether assignment is to correct original author

The Author-Topic Browser Querying on author Pazzani_M Querying on topic relevant to author Querying on document written by author

Comparing Predictive Power Train models on part of a new document and predict remaining words

Outline Background on statistical text modeling Unsupervised learning from text documents Motivation Topic model and learning algorithm Results Extensions Author-topic models Applications Demo of topic browser Future directions

Online Demonstration of Topic Browser for UCI and UCSD Faculty

Outline Background on statistical text modeling Unsupervised learning from text documents Motivation Topic model and learning algorithm Results Extensions Author-topic models Applications Demo of topic browser Information retrieval Future directions

New Directions Cross-corpus browsing: being able to search across multiple document sets, multiple topic models Search and topics Using topics to improve search engines Scaling up to massive document streams online learning from blogs, news sources: Google News on steroids Change Detection automatically detecting new topics as they emerge over time Development of topic-based browsers faculty browser for Calit2 domain-specific browsers for medical specialties etc

Summary Probabilistic modeling of text can build realistic probability models that link words, documents, topics, authors, etc given such a model... can answer many queries just by computing appropriate conditional probabilities in the model Topic models topic-models are very flexible probability models for text can be learned efficiently from large sets of documents extract semantically interpretable content provide a framework with many possible extensions numerous possible applications This is just the tip of the iceberg.. likely to be many new models and applications over next few years

Software for Topic Modeling psiexp.ss.uci.edu/research/programs_data/toolbox.htm Mark Steyvers' public-domain MATLAB toolbox for topic modeling, available on the Web

References on Topic Models
Latent Dirichlet allocation. David Blei, Andrew Y. Ng, and Michael Jordan. Journal of Machine Learning Research, 3:993-1022, 2003.
Finding scientific topics. Griffiths, T., & Steyvers, M. (2004). Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.
Probabilistic author-topic models for information discovery. M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 2004.
Integrating topics and syntax. Griffiths, T.L., Steyvers, M., Blei, D.M., & Tenenbaum, J.B. (2005). In: Advances in Neural Information Processing Systems, 17.