Topic Modeling Nick Jordan.


1 Topic Modeling Nick Jordan

2 Introduction
Topic models are statistical models that uncover the hidden structure of “topics” across a collection of documents.
A topic is essentially a cluster of words that frequently appear together in the same documents.
For example, “dog” and “bone” would appear more often in a document about dogs, and “cat” and “meow” more often in a document about cats, while “the”, “it”, and “is” would appear about equally in both.
Words that appear frequently across the entire corpus are removed because they do not discriminate between topics.
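
One way to picture the removal of corpus-wide words is a document-frequency filter. The sketch below is only an illustration and not the presenter's pipeline: the function name `remove_common_words`, the `max_doc_fraction` threshold, and the three toy documents are all assumptions.

```python
from collections import Counter

def remove_common_words(docs, max_doc_fraction=0.9):
    """Drop words that occur in more than max_doc_fraction of the documents."""
    tokenized = [doc.lower().split() for doc in docs]
    # Count, for each word, how many documents contain it at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    limit = max_doc_fraction * len(tokenized)
    common = {word for word, count in doc_freq.items() if count > limit}
    filtered = [[w for w in tokens if w not in common] for tokens in tokenized]
    return filtered, common

# Toy corpus (hypothetical), echoing the dog/cat example on the slide.
docs = [
    "the dog chewed the bone",
    "the cat said meow",
    "the dog and the cat",
]
filtered, common = remove_common_words(docs, max_doc_fraction=0.9)
print(common)
print(filtered)
```

Here only "the" occurs in every document, so it is the only word dropped; a real corpus would usually combine such a filter with a standard stopword list.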

3 Topic Modeling in Bioinformatics
Data
Collection of nanomedical research papers detailing the various phases of nanomedicine clinical trials (FDA approved).
Topics can be extracted from the research papers about FDA-approved nanomedicines and compared with topics from papers detailing non-FDA-approved nanomedicines.
Topics from each phase can be analyzed for differences:
Phase I: Safety
Phase II: Efficacy
Phase III: Broader Efficacy/Safety/Side Effects

4 Method
Data Processing
Apache Tika converts the .pdf files into text format.
Stopwords and nondiscriminatory words are removed from the corpus.
Latent Dirichlet Allocation
Generative statistical model: generates observable data values given hidden (latent) parameters.
Assumes each document is a mixture of a small number of topics, and that each word in the document is drawn by first choosing a topic from the document's topic distribution and then choosing a word from that topic's word distribution.
Expectation-Maximization (David Blei's method for LDA)
Algorithm for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on hidden variables.
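
The generative assumption behind LDA can be sketched in a few lines. This is a minimal illustration of the forward (generating) direction only, not the inference step and not the presenter's code: the vocabulary, the two topic-word distributions, and the `alpha` values are made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document following the LDA generative story."""
    # Per-document topic mixture.
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(n_words):
        # Choose a topic for this word position from the document's mixture.
        z = rng.choices(range(len(topic_word)), weights=theta, k=1)[0]
        # Choose the word from that topic's word distribution.
        w = rng.choices(vocab, weights=topic_word[z], k=1)[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(0)
vocab = ["dog", "bone", "cat", "meow"]
# Hypothetical per-topic word distributions: topic 0 is "dogs", topic 1 is "cats".
topic_word = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
theta, topics, words = generate_document(
    topic_word, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng
)
print(theta)
print(list(zip(topics, words)))
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and topic-word distributions that most plausibly produced them.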

5 [Figure: LDA model diagram. Labels: per-topic word distributions; per-document topic distributions; ID of topic for specific word in document; topic distribution for specific document. Credit: David M. Blei, Andrew Y. Ng, Michael I. Jordan]
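
The quantities labeled on this slide fit together in the usual joint distribution for smoothed LDA. The symbols are assumptions, since the slide text does not name them: \varphi_k is the word distribution of topic k, \theta_d is the topic distribution of document d, z_{d,n} is the topic ID of the n-th word of document d, w_{d,n} is that word, and \alpha, \beta are the Dirichlet hyperparameters.

```latex
p(\mathbf{w}, \mathbf{z}, \theta, \varphi \mid \alpha, \beta)
  = \prod_{k=1}^{K} p(\varphi_k \mid \beta)
    \prod_{d=1}^{D} p(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\,
                      p(w_{d,n} \mid \varphi_{z_{d,n}})
```

Only the words w_{d,n} are observed; everything else is latent and has to be inferred.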

6 Evaluation
Evaluation based on topic differences in clinical trial phases:
Words in topics from Phase I should be safety related.
Words in topics from Phase II should be efficacy related.
Words in topics from Phase III should be related to both safety and efficacy, along with side effects and other treatments.
Manual evaluation/proof of concept:
Looking at clusters of words grouped together and manually determining similarity.
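
The slide describes a manual check. As a possible aid to that check, and not something the presentation itself proposes, one could score how many of a topic's top words fall in a hand-written seed list. The seed lists, the sample topic, and the function name `seed_overlap` below are all hypothetical.

```python
def seed_overlap(topic_words, seed_words):
    """Fraction of a topic's top words that appear in a seed list."""
    if not topic_words:
        return 0.0
    seeds = set(seed_words)
    return sum(1 for w in topic_words if w in seeds) / len(topic_words)

# Hypothetical seed lists and topic words, for illustration only.
seeds = {
    "safety": ["toxicity", "dose", "adverse", "tolerability"],
    "efficacy": ["response", "survival", "tumor", "remission"],
}
phase1_topic = ["dose", "toxicity", "patients", "adverse"]

scores = {label: seed_overlap(phase1_topic, words) for label, words in seeds.items()}
print(scores)
```

A Phase I topic that scores higher on the safety list than on the efficacy list would agree with the expectation above, though a score like this is only as good as the seed lists and does not replace reading the word clusters.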

