Download presentation
Published byAnastasia Thomas Modified over 9 years ago
1
Tutorial : Using Stanford Topic Modeling Toolbox Lili Lin
Method Seminar Tutorial : Using Stanford Topic Modeling Toolbox Lili Lin
2
Contents Introduction Getting Started Toolbox Running Prerequisites
Installation Toolbox Running Latent Dirichlet Allocation Model (LDA Model) Labeled LDA Model
3
Contents Introduction Getting Started Toolbox Running Prerequisites
Installation Toolbox Running Latent Dirichlet Allocation Model (LDA Model) Labeled LDA Model
4
Introduction http://nlp.stanford.edu/software/tmt/tmt- 0.4/
The Stanford Topic Modeling Toolbox was written at the Stanford NLP group by: Daniel Ramage and Evan Rosen, first released in September 2009 Topic models (e.g. LDA, Labeled LDA) training and inference to create summaries of the text
5
Introduction - LDA Model
LDA model is a unsupervised topic model User need to define some important parameters, such as number of topics It is hard to choose the number of topics Even with some top terms for each topic, it is still difficult to interpret the content of the extracted topics
6
Introduction – Labeled LDA Model
Labeled LDA is a supervised topic model for credit attribution in multi-labeled corpora. If one of the columns in your input text file contains labels or tags that apply to the document, you can use Labeled LDA to discover which parts of each document go with each label, and to learn accurate models of the words best associated with each label globally
7
Contents Introduction Getting Started Toolbox Running Prerequisites
Installation Simple Testing Toolbox Running LDA Model Labeled LDA Model
8
Prerequisites A text editor (e.g. TextWrangler) for creating TMT processing scripts. TMT scripts are written in Scala, but no knowledge of Scala is required to get started. An installation of Java 6SE or greater: Windows, Mac, and Linux are supported.
9
Installation Download the TMT executable (tmt jar) from / Double-click the jar file to open toolbox or run the toolbox with the command line : java -jar tmt jar You should see a simple GUI
10
Simple Testing Example data and scripts for simple testing
Download the example data file: pubmed-oa- subset.csv Download the first testing script: example-0- test.scala Note: the data file and the script should be put into the same folder
11
Simple Testing - GUI Load script: File Open script
12
Simple Testing - GUI Edit script:
val pubmed = CSVFile("pubmed-oa-subset.csv”)
13
Simple Testing - GUI Run the script: click the button ‘Run’
14
Simple Testing - Command Line
15
Contents Introduction Getting Started Toolbox Running Prerequisites
Installation Toolbox Running Latent Dirichlet Allocation Model (LDA Model) Labeled LDA Model
16
LDA Model – Data Preparation
173, 777 Astronomy papers were collected from the Web of Science (WOS) covering the period from 1992 to 2012 In the file ‘astro_wos_lda.csv’, every record includes paper ID (the first column), title (the second column) and published year (the third column)
17
LDA Training – Script Loading
File Open script Navigate to example-2-lda-learn.scala Open
18
LDA Training – Data Loading
Edit Script : ‘val source = CSVFile("astro_wos_lda.csv”)’ ‘Column(2) ~>’ Note: if your text cover 2 columns or more than 2 columns, such as the third and forth columns, you can use ‘Columns(3,4) ~> Join(" ") ~>’ to replace ’Column(2) ~>’
19
LDA Training – Parameter Selection
Edit Script : val params = LDAModelParams(numTopics = 30, dataset = dataset, topicSmoothing = 0.01, termSmoothing = 0.01)
20
LDA Training – Model Training
Run : Out of Memory due to the big data
21
LDA Training – Model Training
Change the size of Memory Run
22
LDA Training – Output Generation
lda-b2aa edefe description.txt : A description of the model saved in this folder document-topic-distributions.csv : A csv file containing the per- document topic distribution for each document in the training dataset : Snapshots of the model during training
23
LDA Training – Output Generation
/params.txt : Model parameters used during training /tokenizer.txt : Tokenizer used to tokenize text for use with this model /summary.txt : Human readable summary of the topic model, with top-20 terms per topic and how many words instances of each have occurred /log-probability estimate.txt : Estimate of the log probability of the dataset at this iteration /term-index.txt : Mapping from terms in the corpus to ID numbers /description.txt : A description of the model saved in this iteration /topic-termdistributions.csv.gz : For each topic, the probability of each term in that topic
24
LDA Training – Command Line
Java –Xmx4G –jar tmt jar example- 2-lda-learn.scala
25
LDA Inference – Script Loading
File Open script Navigate to example-3-lda-infer Open
26
LDA Inference – Trained Model Loading
Edit Script: val modelPath = file("lda-b2aa edefe”)
27
LDA Inference – Data Loading
Edit Script: ‘val source = CSVFile("astro_wos_lda.csv”)’ ‘Column(2) ~>’ Note: Here we just use the same dataset as the inference data, but actually it should be some new dataset
28
LDA Inference – Model Inference
Change the size of Memory Run
29
LDA Inference – Output Generation
Navigate to the folder ’lda-b2aa edefe’ astro_wos_lda-document-topic-distributuions.csv : A csv file containing the per-document topic distribution for each document in the inference dataset astro_wos_lda-top-terms.csv: A csv file containing the top terms in the inference dataset for each topic astro_wos_lda-usage.csv
30
LDA Inference – Command Line
Java –Xmx4G –jar tmt jar example- 3-lda-infer.scala
31
LLDA Model – Data Preparation
4,770 metformin papers were collected from pubMed covering the period from 1997 to 2011 Training data : metformin_train_data_llda.csv (2798 papers), every record includes paper ID (the first column), bio-term list (the second column), title (the third column) and abstract (the forth column), the number of bio-terms in very record is at least 3 Inference data: metformin_infer_data_llda.csv (4770 papers), every record includes paper ID (the first column), title (the second column) and abstract (the third column)
32
LLDA Training – Script Loading
File Open script Navigate to example-6-llda-learn.scala Open
33
LLDA Training – Data Loading
Edit Script : ‘val source = CSVFile("metformin_train_data_llda.csv")’ ‘Columns(3,4) ~> Join(" ") ~>’ ’Column(2) ~>’
34
LLDA Training – Model Training
Run
35
LLDA Training – Output Generation
llda-cvb0-bd54e9b c7f4-222a08a4 description.txt : A description of the model saved in this folder document-topic-distributions.csv : A csv file containing the per- document topic distribution for each document in the training dataset : Snapshots of the model during training
36
LLDA Training – Output Generation
/params.txt : Model parameters used during training /tokenizer.txt : Tokenizer used to tokenize text for use with this model /summary.txt : Human readable summary of the topic model, with top-20 terms per topic and how many words instances of each have occurred /term-index.txt : Mapping from terms in the corpus to ID numbers /description.txt : A description of the model saved in this iteration /label-index.txt : Topics extracted after LLDA training /topic-termdistributions.csv.gz : For each topic, the probability of each term in that topic
37
LLDA Training – Command Line
Java –Xmx4G –jar tmt jar example- 6-llda-learn.scala
38
LLDA Inference – Jar Script
The TMT toolbox doesn’t provide script for LLDA inference A java script, packaged into ‘llda-infer.jar’, was generated in order to conduct LLDA inference
39
LLDA Inference – Command Line
java -jar llda-infer.jar metformin_infer_data_llda.csv llda-cvb0- bd54e9b c7f4-222a08a4 metformin_infer_result.csv
40
LLDA Inference – Output Generation
A file named metformin_infer_result.csv will be generated after LLDA Inference
41
Thanks….. Any Question?
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.