1
Lecture 16: Unsupervised Learning from Text Padhraic Smyth Department of Computer Science University of California, Irvine
2
Outline General aspects of text mining Named-entity extraction, question-answering systems, etc Unsupervised learning from text documents Motivation Topic model and learning algorithm Results Extensions Author-topic models Applications Demo of topic browser Future directions
3
Different Aspects of Text Mining Named-entity extraction: Parsers to recognize names of people, places, genes, etc E.g., GATE system Question-answering systems News summarization Google news, Newsblaster (http://www1.cs.columbia.edu/nlp/newsblaster/) Document clustering Standard algorithms: k-means, hierarchical Probabilistic approaches Topic modeling Representing document as mixtures of topics And many more…
4
Named-Entity Extraction Often a combination of Knowledge-based approach (rules, parsers) Machine learning (e.g., hidden Markov model) Dictionary Non-trivial since entity-names can be confused with real names E.g., gene name ABS and abbreviation ABS Also can look for co-references E.g., “IBM today…… Later, the company announced…..” Very useful as a preprocessing step for data mining, e.g., use entity-names to train a classifier to predict the category of an article
5
Example: GATE/ANNIE extractor GATE: free software infrastructure for text analysis (University of Sheffield, UK) ANNIE: widely used entity-recognizer, part of GATE http://www.gate.ac.uk/annie/
7
Question-Answering Systems See additional slides on Dumais et al AskMSR system
8
Unsupervised Learning from Text Large collections of unlabeled documents.. Web Digital libraries Email archives, etc Often wish to organize/summarize/index/tag these documents automatically We will look at probabilistic techniques for clustering and topic extraction from sets of documents
9
Outline Background on statistical text modeling Unsupervised learning from text documents Motivation Topic model and learning algorithm Results Extensions Author-topic models Applications Demo of topic browser Future directions
10
Pennsylvania Gazette 1728-1800 80,000 articles 25 million words www.accessible.com
11
Enron email data 250,000 emails 28,000 authors 1999-2002
13
Other Examples of Data Sets CiteSeer digital collection: 700,000 papers, 700,000 authors, 1986-2005 MEDLINE collection 16 million abstracts in medicine/biology US Patent collection and many more....
14
Problems of Interest What topics do these documents “span”? Which documents are about a particular topic? How have topics changed over time? What does author X write about? Who is likely to write about topic Y? Who wrote this specific document? and so on…..
15
Probability Models for Documents Example: 50,000 possible words in our vocabulary Simple memoryless model, aka "bag of words" 50,000-sided die each side of the die represents 1 word a non-uniform die: each side/word has its own probability to generate N words we toss the die N times gives a "bag of words" (no sequence information) This is a simple probability model: p(document | θ) = ∏_i p(word_i | θ) to "learn" the model we just count frequencies p(word i) = number of occurrences of word i / total number of words
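To make the counting concrete, here is a minimal sketch of the "bag of words" idea (not from the lecture; the function names and toy documents are illustrative): estimate word probabilities by relative frequency, then score a document as the sum of log word probabilities.

from collections import Counter
import math

def fit_unigram(documents):
    """Estimate p(word) by simple relative frequency over all documents."""
    counts = Counter(word for doc in documents for word in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def log_prob(doc, theta):
    """Bag-of-words log-likelihood: sum of log p(word_i | theta)."""
    return sum(math.log(theta[w]) for w in doc if w in theta)

docs = [["bank", "money", "loan", "bank"], ["river", "bank", "stream"]]
theta = fit_unigram(docs)
print(log_prob(["money", "bank"], theta))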
16
The Multinomial Model Example: tossing a 6-sided die P = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6] Multinomial model for documents: V-sided “die” = probability distribution over possible words Some words have higher probability than others Document with N words generated by N memoryless “draws” Typically interested in conditional multinomials, e.g., p(words | spam) versus p(words | non-spam)
17
Real examples of Word Multinomials
18
Probabilistic modeling and statistical inference (diagram): a probabilistic model maps Parameters to Real-World Data via P(Data | Parameters); statistical inference goes the other way, from Data back to Parameters via P(Parameters | Data).
19
A Graphical Model: word nodes w_1, w_2, …, w_n all depend on a parameter node θ. p(doc | θ) = ∏_i p(w_i | θ); θ = "parameter vector" = set of probabilities, one per word.
20
Another view.... p(doc | θ) = ∏_{i=1:n} p(w_i | θ). This is “plate notation”: a single node w_i sits inside a plate indexed i = 1:n. Items inside the plate are conditionally independent given the variable outside the plate; there are “n” conditionally independent replicates represented by the plate.
21
Being Bayesian.... Add a prior on our multinomial parameters θ, e.g., a simple Dirichlet smoothing prior with a symmetric parameter, to avoid estimates of probabilities that are 0.
22
Being Bayesian.... Learning: infer p(θ | words, prior), which is proportional to p(words | θ) p(θ).
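A minimal sketch of the effect of the symmetric Dirichlet smoothing prior, assuming the usual posterior-mean estimate (the parameter name beta and its value are illustrative, not from the lecture):

from collections import Counter

def smoothed_estimates(docs, vocab, beta=0.01):
    """Posterior-mean word probabilities under a symmetric Dirichlet(beta) prior:
    p(w) = (count(w) + beta) / (total + V * beta), so no word gets probability 0."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    V = len(vocab)
    return {w: (counts.get(w, 0) + beta) / (total + V * beta) for w in vocab}

vocab = {"bank", "money", "loan", "river", "stream"}
print(smoothed_estimates([["bank", "money", "loan"]], vocab))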
23
Multiple Documents: the word plate i = 1:n is nested inside a document plate d = 1:D. p(corpus | θ) = ∏_d p(doc_d | θ).
24
Different Document Types: a single document with words w_i, i = 1:n; p(w | φ) is a multinomial over words.
25
Different Document Types: the same model with a plate over documents d = 1:D; p(w | φ) is a multinomial over words.
26
Different Document Types: add a node z_d per document; p(w | φ, z_d) is a multinomial over words, and z_d is the "label" for each doc.
27
Different Document Types: p(w | φ, z_d) is a multinomial over words; z_d is the "label" for each doc. Different multinomials, depending on the value of z_d (discrete); φ now represents |z| different multinomials.
28
Unknown Document Types: now the values of z_d for each document are unknown - hopeless?
29
Unknown Document Types: not hopeless :) Can learn about both z and φ, e.g., with the EM algorithm. This gives probabilistic clustering: p(w | z = k, φ) is the kth multinomial over words.
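The slides do not give the EM details; the following is a compact sketch of one standard way to do this probabilistic clustering with a mixture of multinomials (the matrix shapes, iteration count, and smoothing constant are my assumptions, not the lecture's code):

import numpy as np

def em_mixture_of_multinomials(X, K, iters=50, seed=0):
    """EM for clustering documents with a mixture of multinomials.
    X: (D, V) matrix of word counts. Returns mixing weights, word distributions,
    and soft cluster responsibilities p(z = k | doc)."""
    rng = np.random.default_rng(seed)
    D, V = X.shape
    pi = np.full(K, 1.0 / K)                      # p(z = k)
    phi = rng.dirichlet(np.ones(V), size=K)       # p(w | z = k), one row per cluster
    for _ in range(iters):
        # E-step: unnormalized log p(z = k | doc), then normalize per document
        log_r = np.log(pi) + X @ np.log(phi).T    # (D, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and word distributions
        pi = r.mean(axis=0)
        phi = r.T @ X + 1e-3                      # small smoothing to avoid zeros
        phi /= phi.sum(axis=1, keepdims=True)
    return pi, phi, r

X = np.array([[3, 1, 0, 0], [0, 0, 2, 2], [2, 2, 0, 1]])  # toy word-count matrix
pi, phi, r = em_mixture_of_multinomials(X, K=2)
print(np.round(r, 2))   # soft cluster membership per document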
30
Topic Model: now z_i is a "label" for each word, not for each document. p(w | φ, z_i = k) = multinomial over words = a "topic"; p(z_i | θ_d) = distribution over topics that is document-specific. Plates: words i = 1:n nested inside documents d = 1:D.
31
Key Features of Topic Models Generative model for documents in form of bags of words Allows a document to be composed of multiple topics Much more powerful than 1 doc -> 1 cluster Completely unsupervised Topics learned directly from data Leverages strong dependencies at word level AND large data sets Learning algorithm Gibbs sampling is the method of choice Scalable Linear in number of word tokens Can be run on millions of documents
32
Document generation as a probabilistic process Each topic is a distribution over words, from parameters φ^(j) Each document is a mixture of topics, from parameters θ^(d) Each word is chosen from a single topic
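A short sketch of this generative process (the toy topics and the Dirichlet parameter alpha are illustrative, not from the lecture):

import numpy as np

def generate_document(phi, alpha, n_words, rng):
    """Generate one bag-of-words document from the topic model:
    draw a document-specific topic mixture theta, then for each word
    draw a topic z from theta and a word from that topic's distribution phi[z]."""
    T, V = phi.shape
    theta = rng.dirichlet(np.full(T, alpha))      # p(topic | document)
    words, topics = [], []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)
        w = rng.choice(V, p=phi[z])
        topics.append(z)
        words.append(w)
    return words, topics, theta

# Toy example: 2 topics over a 4-word vocabulary ("money", "bank", "river", "stream")
rng = np.random.default_rng(0)
phi = np.array([[0.4, 0.4, 0.1, 0.1],   # "finance" topic
                [0.1, 0.3, 0.3, 0.3]])  # "river" topic
print(generate_document(phi, alpha=0.5, n_words=10, rng=rng))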
33
Example of generating words (figure): two topics (topic 1: MONEY, BANK, LOAN, …; topic 2: RIVER, STREAM, BANK, …), per-document topic mixtures θ (all topic 1, a .4/.6 mix, and all topic 2 in the figure), and the resulting documents with topic assignments shown as superscripts: “MONEY^1 BANK^1 BANK^1 LOAN^1 BANK^1 MONEY^1 BANK^1 MONEY^1 BANK^1 LOAN^1 LOAN^1 BANK^1 MONEY^1 …”, “RIVER^2 MONEY^1 BANK^2 STREAM^2 BANK^2 BANK^1 MONEY^1 RIVER^2 MONEY^1 BANK^2 LOAN^1 MONEY^1 …”, “RIVER^2 BANK^2 STREAM^2 BANK^2 RIVER^2 BANK^2 …”.
34
Inference (figure): given only the documents, the topics, the mixtures θ, and the word-level topic assignments are all unknown (shown as “?”): “MONEY^? BANK^? BANK^? LOAN^? …”, “RIVER^? MONEY^? BANK^? STREAM^? …”, “RIVER^? BANK^? STREAM^? …”. The task is to infer them from the words alone.
35
Bayesian Inference Three sets of latent variables: topic mixtures θ, word distributions φ, topic assignments z. Integrate out θ and φ and estimate the topic assignments z; the required sum over terms cannot be computed exactly, so use Gibbs sampling for approximate inference.
36
Gibbs Sampling Start with random assignments of words to topics Repeat M iterations Repeat for all words i Sample a new topic assignment for word i conditioned on all other topic assignments Each sample is simple: draw from a multinomial represented as a ratio of appropriate counts
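A compact sketch of what the collapsed Gibbs iterations look like for the topic model, assuming symmetric Dirichlet hyperparameters alpha and beta (values, variable names, and the toy data are illustrative, not the lecture's code). Each resampling step draws from exactly the "ratio of appropriate counts" the slide mentions:

import numpy as np

def gibbs_lda(docs, V, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for the topic model: repeatedly resample each
    word token's topic from a distribution proportional to
    (word-topic count + beta) / (topic total + V*beta) * (doc-topic count + alpha)."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(T, size=len(d)) for d in docs]   # random initial assignments
    nwt = np.zeros((V, T)) + beta                      # word-topic counts (smoothed)
    ndt = np.zeros((len(docs), T)) + alpha             # doc-topic counts (smoothed)
    nt = np.zeros(T) + V * beta                        # tokens per topic (smoothed)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            nwt[w, z[d][i]] += 1; ndt[d, z[d][i]] += 1; nt[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                nwt[w, k] -= 1; ndt[d, k] -= 1; nt[k] -= 1   # remove current token
                p = nwt[w] / nt * ndt[d]                     # full conditional (unnormalized)
                k = rng.choice(T, p=p / p.sum())
                z[d][i] = k
                nwt[w, k] += 1; ndt[d, k] += 1; nt[k] += 1
    return z, nwt, ndt

docs = [[0, 1, 1, 2], [2, 3, 3, 1], [0, 0, 1, 2]]   # word ids, vocabulary size V=4
z, nwt, ndt = gibbs_lda(docs, V=4, T=2, iters=100)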
37
16 Artificial Documents Can we recover the original topics and topic mixtures from this data?
38
Starting the Gibbs Sampling Assign word tokens randomly to topics (●=topic 1; ●=topic 2 )
39
After 1 iteration
40
After 4 iterations
41
After 32 iterations
42
More Details on Learning Gibbs sampling for x and z Typically run several hundred Gibbs iterations 1 iteration = full pass through all words in all documents Estimating θ and φ: x and z sample -> point estimates; non-informative Dirichlet priors for θ and φ Computational Efficiency Learning is linear in the number of word tokens Memory requirements can be a limitation for large corpora Predictions on new documents can average over θ and φ (from different samples, different runs)
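A small sketch of how a single Gibbs sample of topic assignments is turned into point estimates of θ and φ, assuming count matrices like those in the earlier sketch (names and smoothing values are illustrative):

import numpy as np

def point_estimates(word_topic_counts, doc_topic_counts, beta=0.01, alpha=0.1):
    """Point estimates from one Gibbs sample (hypothetical raw count matrices):
    phi[k, w] proportional to count(w assigned to k) + beta,
    theta[d, k] proportional to count(tokens in d assigned to k) + alpha."""
    phi = word_topic_counts.T + beta          # (T, V)
    phi /= phi.sum(axis=1, keepdims=True)
    theta = doc_topic_counts + alpha          # (D, T)
    theta /= theta.sum(axis=1, keepdims=True)
    return phi, theta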
43
History of topic models origins in statistics: latent class models in social science admixture models in statistical genetics applications in computer science Hofmann, SIGIR, 1999 Blei, Ng, Jordan, JMLR 2003 Griffiths and Steyvers, PNAS, 2004 more recent work author-topic models: Steyvers et al, Rosen-Zvi et al, 2004 Hierarchical topics: McCallum et al, 2006 Correlated topic models: Blei and Lafferty, 2005 Dirichlet process models: Teh, Jordan, et al large-scale web applications: Buntine et al, 2004, 2005 undirected models: Welling et al, 2004
44
Topic = probability distribution over words Important point: these distributions are learned in a completely automated “unsupervised” fashion from the data
45
Examples of Topics from CiteSeer
46
Four example topics from NIPS
47
Examples of Topics from New York Times. Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX. Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS. Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS. Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP.
48
History of topic models Latent class models in statistics (late 60’s) “Aspect model”, Hofmann (1999) Original application to documents LDA Model: Blei, Ng, and Jordan (2001, 2003) Variational methods Topics Model: Griffiths and Steyvers (2003, 2004) Gibbs sampling approach (very efficient) More recent work on alternative (but similar) models, e.g., by Max Welling (ICS), Buntine, McCallum, and others
49
Comparing Topics and Other Approaches Clustering documents Computationally simpler… But a less accurate and less flexible model LSI/LSA/SVD Linear projection of V-dim word vectors into lower dimensions Less interpretable Not generalizable E.g., to authors or other side-information Not as accurate E.g., precision-recall: Hofmann, Blei et al, Buntine, etc Probabilistic models such as topic models “next-generation” text modeling, after LSI provide a modular extensible framework
52
Clusters v. Topics Hidden Markov Models in Molecular Biology: New Algorithms and Applications Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification. One cluster: [cluster 88] model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov. Multiple topics: [topic 10] state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling; [topic 37] genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes.
53
Examples of Topics learned from Proceedings of the National Academy of Sciences Griffiths and Steyvers, PNAS, 2004 FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN
54
PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST Examples of PNAS topics MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS
56
What can Topic Models be used for? Queries Who writes on this topic? e.g., finding experts or reviewers in a particular area What topics does this person do research on? Comparing groups of authors or documents Discovering trends over time Detecting unusual papers and authors Interactive browsing of a digital library via topics Parsing documents (and parts of documents) by topic and more…..
57
What is this paper about? Empirical Bayes screening for multi-item associations Bill DuMouchel and Daryl Pregibon, ACM SIGKDD 2001 Most likely topics according to the model are… 1. data, mining, discovery, association, attribute.. 2. set, subset, maximal, minimal, complete,… 3. measurements, correlation, statistical, variation, 4. Bayesian, model, prior, data, mixture,…..
58
3 of 300 example topics (TASA)
59
Automated Tagging of Words (numbers & colors topic assignments)
60
Experiments on Various Data Sets Corpora: CiteSeer: 160K abstracts, 85K authors; NIPS: 1.7K papers, 2K authors; Enron: 250K emails, 28K authors (senders); Medline: 300K abstracts, 128K authors Removed stop words; no stemming Ignore word order, just use word counts Processing time: NIPS: 2000 Gibbs iterations, 8 hours; CiteSeer: 2000 Gibbs iterations, 4 days
61
Four example topics from CiteSeer (T=300)
62
More CiteSeer Topics
63
Temporal patterns in topics: hot and cold topics We have CiteSeer papers from 1986-2002 For each year, calculate the fraction of words assigned to each topic -> a time-series for topics Hot topics become more prevalent Cold topics become less prevalent
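A sketch of the trend computation described on this slide, assuming per-document years and sampled topic assignments are available (the data structures are illustrative, not the lecture's code):

import numpy as np
from collections import defaultdict

def topic_trends(doc_years, doc_topic_assignments, T):
    """For each year, compute the fraction of word tokens assigned to each topic.
    doc_years[d] is the year of document d; doc_topic_assignments[d] is the list
    of sampled topic labels for that document's tokens."""
    totals = defaultdict(lambda: np.zeros(T))
    for year, z in zip(doc_years, doc_topic_assignments):
        for k in z:
            totals[year][k] += 1
    return {year: counts / counts.sum() for year, counts in sorted(totals.items())}

# A topic whose fraction rises over the years is "hot"; one that falls is "cold".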
71
Four example topics from NIPS (T=100)
72
NIPS: support vector topic
73
NIPS: neural network topic
74
Pennsylvania Gazette 1728-1800 80,000 articles (courtesy of David Newman & Sharon Block, UC Irvine)
75
Pennsylvania Gazette Data courtesy of David Newman (CS Dept) and Sharon Block (History Dept)
76
Topic trends from New York Times (330,000 articles, 2000-2002) Tour-de-France: TOUR RIDER LANCE_ARMSTRONG TEAM BIKE RACE FRANCE Quarterly Earnings: COMPANY QUARTER PERCENT ANALYST SHARE SALES EARNING Anthrax: ANTHRAX LETTER MAIL WORKER OFFICE SPORES POSTAL BUILDING
77
Enron email data 250,000 emails 28,000 authors 1999-2002
78
Enron email: business topics
79
Enron: non-work topics…
80
Enron: public-interest topics...
82
PubMed-Query Topics
84
PubMed-Query: Topics by Country
86
Examples of Topics from New York Times. Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX. Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS. Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS. Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP.
88
Collocation Topic Model: for each document, choose a mixture of topics; for every word slot, sample a topic; if x = 0, sample the word from the topic; if x = 1, sample the word from a distribution based on the previous word.
89
Collocation Topic Model, example: “DOW JONES RISES”. JONES is more likely explained as a word following DOW than as a word sampled from a topic. Result: DOW_JONES is recognized as a collocation.
90
Using Topic Models for Information Retrieval
91
Stability of Topics The ordering/labeling of topics is arbitrary across runs of the model (e.g., topic #1 is not the same topic across runs) However, Majority of topics are stable over processing time Majority of topics can be aligned across runs Topics appear to represent genuine structure in data
92
Comparing NIPS topics from the same Markov chain: KL distance between topics at t1 = 1000 and (re-ordered) topics at t2 = 2000; best-matching pair KL = 0.54, worst KL = 4.78.
93
Comparing NIPS topics from two different Markov chains: KL distance between topics from chain 1 and (re-ordered) topics from chain 2; best-matching pair KL = 1.03.
94
Outline Background on statistical text modeling Unsupervised learning from text documents Motivation Topic model and learning algorithm Results Extensions Author-topic models Applications Demo of topic browser Future directions
95
Approach The author-topic model: an extension of the topic model linking authors and topics authors -> topics -> words learned from data completely unsupervised, no labels generative model Different questions or queries can be answered by appropriate probability calculus E.g., p(author | words in document) E.g., p(topic | author)
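A sketch of how such queries might be computed from learned author-topic (theta) and topic-word (phi) matrices. Reading off p(topic | author) is direct; the Bayes-rule scoring of candidate authors shown here is a standard calculation under these assumptions, not necessarily the exact computation used in the papers:

import numpy as np

def p_topic_given_author(theta, author):
    """theta[a, k] = p(topic k | author a); simply read off the author's row."""
    return theta[author]

def p_author_given_words(words, theta, phi, author_prior=None):
    """Score candidate authors for a bag of words:
    p(a | words) proportional to p(a) * prod_i sum_k p(w_i | k) p(k | a)."""
    A = theta.shape[0]
    log_p = np.zeros(A) if author_prior is None else np.log(author_prior)
    for w in words:
        log_p += np.log(theta @ phi[:, w])    # (A, T) @ (T,) -> (A,)
    p = np.exp(log_p - log_p.max())
    return p / p.sum()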
96
Graphical Model (author-topic): for each word, an author x is chosen from the document's author set a, a topic z is drawn from that author's Author-Topic distribution, and the word w is drawn from the Topic-Word distribution for that topic; plates over the n words in a document and the D documents.
101
Generative Process Let’s assume authors A1 and A2 collaborate and produce a paper A1 has multinomial topic distribution θ_A1 A2 has multinomial topic distribution θ_A2 For each word in the paper: 1. Sample an author x (uniformly) from {A1, A2} 2. Sample a topic z from θ_x 3. Sample a word w from the multinomial topic distribution φ_z
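A direct transcription of this generative process as a sketch (the array shapes are assumptions: theta is indexed by author, phi by topic; the toy values are illustrative):

import numpy as np

def generate_coauthored_doc(authors, theta, phi, n_words, rng):
    """Author-topic generative process for a paper with the given co-authors:
    for each word, pick an author uniformly, a topic from that author's
    topic distribution theta[author], then a word from phi[topic]."""
    words = []
    for _ in range(n_words):
        x = rng.choice(authors)               # sample an author uniformly
        z = rng.choice(len(phi), p=theta[x])  # sample a topic from that author
        w = rng.choice(phi.shape[1], p=phi[z])
        words.append((x, z, w))
    return words

rng = np.random.default_rng(0)
theta = np.array([[0.9, 0.1], [0.2, 0.8]])                     # two authors' topic mixtures
phi = np.array([[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]])   # two topics over 4 words
print(generate_coauthored_doc([0, 1], theta, phi, n_words=6, rng=rng))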
102
Graphical Model (recap): x (Author) -> z (Topic) -> w (Word), with Author-Topic distributions and Topic-Word distributions; plates over the n words in a document and the D documents.
103
Learning Observed: W = observed words, A = sets of known authors Unknown: x, z = hidden variables; Θ, Φ = unknown parameters Interested in: p(x, z | W, A) and p(Θ, Φ | W, A) But exact learning is not tractable
104
Step 1: Gibbs sampling of x and z, averaging over the unknown parameters Θ and Φ (which are integrated out).
105
Step 2: estimates of Θ and Φ, conditioned on particular samples of x and z.
106
Gibbs Sampling Need full conditional distributions for the hidden variables The probability of assigning the current word i to topic j and author k, given everything else, is proportional to (the number of times word w is assigned to topic j) times (the number of times topic j is assigned to author k), each smoothed and normalized
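The equation itself did not survive the slide extraction; based on the counts named above, the full conditional in the author-topic model has the following standard form (my notation: C^WT and C^AT are the word-topic and author-topic count matrices, excluding the current token; V is the vocabulary size, T the number of topics):

P(z_i = j, x_i = k \mid w_i = m, \mathbf{z}_{-i}, \mathbf{x}_{-i}) \;\propto\; \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta} \cdot \frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}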
107
Authors and Topics (CiteSeer Data)
108
Some likely topics per author (CiteSeer) Author = Andrew McCallum, U Mass: Topic 1: classification, training, generalization, decision, data,… Topic 2: learning, machine, examples, reinforcement, inductive,….. Topic 3: retrieval, text, document, information, content,… Author = Hector Garcia-Molina, Stanford: - Topic 1: query, index, data, join, processing, aggregate…. - Topic 2: transaction, concurrency, copy, permission, distributed…. - Topic 3: source, separation, paper, heterogeneous, merging….. Author = Paul Cohen, USC/ISI: - Topic 1: agent, multi, coordination, autonomous, intelligent…. - Topic 2: planning, action, goal, world, execution, situation… - Topic 3: human, interaction, people, cognitive, social, natural….
109
Finding unusual papers for an author Perplexity = exp[entropy(words | model)] = a measure of surprise for the model on the data Can calculate the perplexity of unseen documents, conditioned on the model for a particular author
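A minimal sketch of this perplexity calculation under a learned model (argument names are illustrative; theta_author is the topic mixture estimated for the author in question, phi the topic-word matrix):

import numpy as np

def perplexity(words, theta_author, phi):
    """Perplexity of a document under one author's topic mixture:
    exp of the negative mean log-likelihood of the words.
    theta_author[k] = p(topic k | author), phi[k, w] = p(word w | topic k)."""
    log_probs = [np.log(theta_author @ phi[:, w]) for w in words]
    return float(np.exp(-np.mean(log_probs)))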
112
Papers and Perplexities: M_Jordan
Factorial Hidden Markov Models: 687
Learning from Incomplete Data: 702
MEDIAN PERPLEXITY: 2567
Defining and Handling Transient Fields in Pjama: 14555
An Orthogonally Persistent JAVA: 16021
115
Papers and Perplexities: T_Mitchell
Explanation-based Learning for Mobile Robot Perception: 1093
Learning to Extract Symbolic Knowledge from the Web: 1196
MEDIAN PERPLEXITY: 2837
Text Classification from Labeled and Unlabeled Documents using EM: 3802
A Method for Estimating Occupational Radiation Dose…: 8814
116
Who wrote what? A method 1 is described which like the kernel 1 trick 1 in support 1 vector 1 machines 1 SVMs 1 lets us generalize distance 1 based 2 algorithms to operate in feature 1 spaces usually nonlinearly related to the input 1 space This is done by identifying a class of kernels 1 which can be represented as norm 1 based 2 distances 1 in Hilbert spaces It turns 1 out that common kernel 1 algorithms such as SVMs 1 and kernel 1 PCA 1 are actually really distance 1 based 2 algorithms and can be run 2 with that class of kernels 1 too As well as providing 1 a useful new insight 1 into how these algorithms work the present 2 work can form the basis 1 for conceiving new algorithms This paper presents 2 a comprehensive approach for model 2 based 2 diagnosis 2 which includes proposals for characterizing and computing 2 preferred 2 diagnoses 2 assuming that the system 2 description 2 is augmented with a system 2 structure 2 a directed 2 graph 2 explicating the interconnections between system 2 components 2 Specifically we first introduce the notion of a consequence 2 which is a syntactically 2 unconstrained propositional 2 sentence 2 that characterizes all consistency 2 based 2 diagnoses 2 and show 2 that standard 2 characterizations of diagnoses 2 such as minimal conflicts 1 correspond to syntactic 2 variations 1 on a consequence 2 Second we propose a new syntactic 2 variation on the consequence 2 known as negation 2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm 2 for computing consequences in NNF given a structured system 2 description We show that if the system 2 structure 2 does not contain cycles 2 then there is always a linear size 2 consequence 2 in NNF which can be computed in linear time 2 For arbitrary 1 system 2 structures 2 we show a precise connection between the complexity 2 of computing 2 consequences and the topology of the underlying system 2 structure 2 Finally we present 2 an algorithm 2 that enumerates 2 the preferred 2 diagnoses 2 characterized by a consequence 2 The algorithm 2 is shown 1 to take linear time 2 in the size 2 of the consequence 2 if the preference criterion 1 satisfies some general conditions Written by (1) Scholkopf_B Written by (2) Darwiche_A Test of model: 1) artificially combine abstracts from different authors 2) check whether assignment is to correct original author
117
The Author-Topic Browser Querying on author Pazzani_M Querying on topic relevant to author Querying on document written by author
118
Comparing Predictive Power Train models on part of a new document and predict remaining words
119
Outline Background on statistical text modeling Unsupervised learning from text documents Motivation Topic model and learning algorithm Results Extensions Author-topic models Applications Demo of topic browser Future directions
120
Online Demonstration of Topic Browser for UCI and UCSD Faculty
128
Outline Background on statistical text modeling Unsupervised learning from text documents Motivation Topic model and learning algorithm Results Extensions Author-topic models Applications Demo of topic browser Information retrieval Future directions
129
New Directions Cross-corpus browsing: being able to search across multiple document sets, multiple topic models Search and topics Using topics to improve search engines Scaling up to massive document streams online learning from blogs, news sources: Google News on steroids Change Detection automatically detecting new topics as they emerge over time Development of topic-based browsers faculty browser for Calit2 domain-specific browsers for medical specialties etc
130
Summary Probabilistic modeling of text can build realistic probability models that link words, documents, topics, authors, etc given such a model... can answer many queries just by computing appropriate conditional probabilities in the model Topic models topic-models are very flexible probability models for text can be learned efficiently from large sets of documents extract semantically interpretable content provide a framework with many possible extensions numerous possible applications This is just the tip of the iceberg.. likely to be many new models and applications over next few years
131
Software for Topic Modeling psiexp.ss.uci.edu/research/programs_data/toolbox.htm Mark Steyvers’ public-domain MATLAB toolbox for topic modeling on the Web
132
References on Topic Models
Latent Dirichlet allocation. D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3:993-1022, 2003.
Finding scientific topics. T. Griffiths and M. Steyvers. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235, 2004.
Probabilistic author-topic models for information discovery. M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 2004.
Integrating topics and syntax. T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum. In Advances in Neural Information Processing Systems 17, 2005.