1 Curators’ Meeting Oct. 27, 2003 Clustering MeSH Representations of Medical Literature Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University

2 Overview
- Text Mining
- Goals of the project
- Medical Subject Headings (MeSH)
- Methods
- Results
- Conclusions
- Future Work

3 Text Mining
Looking for patterns in natural language text
- Information retrieval
  - “Find documents similar to …”
- Text classification
  - “This paper is about …”
- Text clustering
  - “There’s a group of papers like … in our database.”
- Mining associations
  - “X is related to Y by …”

4 Goals of the project
- Develop MeSH-based representations of medical literature for literature mining
  - Clustering
  - Classification
- Develop tools for RGD Curation Team
- Develop techniques for mining with other ontologies
  - GO, PO, UMLS, etc.

5 MeSH (Medical Subject Headings)
- The NLM (National Library of Medicine) indexes biomedical literature with MeSH terms to support subject indexing and searching of journal articles.
- The major components of MeSH are:
  - Descriptors (Main Headings)
  - Qualifiers (Subheadings)
- Descriptors indicate the main themes discussed in the article.
- Qualifiers indicate the specific aspect of the descriptor being discussed in the article.
- MeSH terms are assigned by trained indexers.

6 Why MeSH?
- Well-established controlled vocabulary.
- Alleviates some natural language processing challenges.
- Provides insights into applications of other controlled vocabularies.

7 MeSH Hierarchies
- Increasing specificity
- Can be present in more than one branch (avg. ~1.8 for descriptors)
- Similar hierarchy for qualifiers
[Diagram: two branches that both contain Common Cold]
Virus Diseases
  DNA Virus Diseases
  RNA Virus Diseases
    Picornaviridae Infections
      Common Cold
    Hepatitis D
Respiratory Tract Diseases
  Respiratory Tract Infections
    Common Cold

8 MeSH Terms
- Terms are tuples of descriptors and qualifiers
  - (chemotherapy, adverse effects)
- Total number of descriptors: 21,975
- Total number of qualifiers: 83
[Diagram labels]
Descriptors: therapeutics, diet therapy, drug therapy, premedication, chemotherapy
Qualifiers: therapeutic use, pharmacology, adverse effects, contraindications, poisoning
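A minimal sketch of the tuple idea, using the counts quoted on the slide. The `MeshTerm` name is hypothetical, and the product below is only an upper bound under the assumption that every qualifier could pair with every descriptor; MeSH itself restricts which qualifiers are allowable for a given descriptor.

```python
from typing import NamedTuple


class MeshTerm(NamedTuple):
    """A MeSH term as a (descriptor, qualifier) pair. Hypothetical helper type."""
    descriptor: str
    qualifier: str


# Counts taken from the slide.
NUM_DESCRIPTORS = 21_975
NUM_QUALIFIERS = 83

# Upper bound on distinct pairs, assuming unrestricted pairing (an assumption).
MAX_PAIRS = NUM_DESCRIPTORS * NUM_QUALIFIERS

term = MeshTerm("chemotherapy", "adverse effects")
print(term.descriptor, "/", term.qualifier)
print(MAX_PAIRS)  # 1823925
```

Even as a loose bound, a vector space with this many components hints at why pairing descriptors with qualifiers is a combinatorial problem.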

9 Related Work
- “TextQuest: Document Clustering of Medline Abstracts for Concept Discovery in Molecular Biology”, I. Iliopoulos, A.J. Enright, C.A. Ouzounis, Pac. Symp. Biocomput., 2001.
- “MeSHmap: A text-mining tool for MEDLINE”, Padmini Srinivasan, AMIA, 2001.
- “Exploring text mining from MEDLINE”, Padmini Srinivasan and Thomas Rindflesch, AMIA, 2002.
- “Association of genes to genetically inherited diseases using data mining”, C. Perez-Iratxeta, P. Bork, M.A. Andrade, Nature Genetics 31, 2002.
- “A MeSH term based Distance Measure for Document Retrieval and Labeling Assistance”, Jorg Ontrup, Tim W. Nattkemper, Olaf Gerstung, Helge Ritter, IEEE Engineering in Med. and Biol. Soc., 2003.

10 Methods
- Data: papers from RGD (2713)
  - 371 not contained in RGD
- Represent papers
  - Full text, MeSH (Descriptors, Qualifiers, Both)
  - Principal component analysis
- Calculate distances between papers
  - Cosine distance
- Cluster papers
  - Agglomerative hierarchical clustering, average linkage
- Summarize the clusters
  - Frequency of term appearance
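The distance and clustering steps can be sketched in plain Python as follows. This is an illustrative toy, not the project's code: the function names and the four example vectors are made up, the PCA step is omitted, and the quadratic search is only practical for small inputs.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; treated as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage(papers: List[Sequence[float]], k: int) -> List[List[int]]:
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise cosine distance until k clusters remain.
    Returns clusters as sorted lists of paper indices."""
    k = max(k, 1)
    clusters = [[i] for i in range(len(papers))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        total = sum(cosine_distance(papers[i], papers[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        best = None  # (distance, index of first cluster, index of second cluster)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Made-up paper vectors: two papers share the first terms, two share the last.
papers = [
    [2, 1, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]
print(average_linkage(papers, 2))  # [[0, 1], [2, 3]]
```

Average linkage scores a candidate merge by the mean distance over all cross-cluster pairs, which makes it less sensitive to a single outlying paper than single or complete linkage.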

11 Full-text Representation
- libbow/rainbow from McCallum
- Title and abstracts of papers
- Stop word removal, stemming
- No term weighting
ID   Rats   Amino acid   Male   …   Cell Line
1    2      4            0      …   1
2    0      0            1      …   2
…    …      …            …      …   …
n    1      1            1      …   1
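A rough sketch of how such a count table could be built. It does not use libbow/rainbow; the stop-word list, the tokenizer, and the two sample titles are all assumptions for illustration, and stemming is left out to keep the example short.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Tiny illustrative stop-word list (an assumption, not rainbow's list).
STOP_WORDS = {"the", "of", "in", "a", "and", "is", "to"}


def tokenize(text: str) -> List[str]:
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def count_matrix(docs: Dict[str, str]) -> Tuple[List[str], Dict[str, List[int]]]:
    """Return (vocabulary, rows), where each row holds raw term counts
    with no term weighting applied."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    rows = {doc_id: [c[t] for t in vocab] for doc_id, c in counts.items()}
    return vocab, rows


# Made-up titles standing in for paper titles and abstracts.
docs = {
    "1": "Mapping of a gene in the rat",
    "2": "The rat gene and the rat genome",
}
vocab, rows = count_matrix(docs)
print(vocab)      # ['gene', 'genome', 'mapping', 'rat']
print(rows["1"])  # [1, 0, 1, 1]
print(rows["2"])  # [1, 1, 0, 2]
```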

12 MeSH Representation
MeSH descriptors:
- Assigned: 2
- Inferred: 1
- Not present: 0
Example: Common Cold = 2, Respiratory Diseases = 1, Respiratory Tract Diseases = 1
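The 2/1/0 encoding can be sketched as below, under the assumption that "inferred" means "an ancestor of an assigned descriptor in the MeSH hierarchy". The `PARENTS` map is a hand-written fragment for illustration, not data loaded from MeSH, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical child -> parents map; a descriptor may sit in several branches.
PARENTS: Dict[str, List[str]] = {
    "Common Cold": ["Picornaviridae Infections", "Respiratory Tract Infections"],
    "Picornaviridae Infections": ["RNA Virus Diseases"],
    "RNA Virus Diseases": ["Virus Diseases"],
    "Respiratory Tract Infections": ["Respiratory Tract Diseases"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect every ancestor of `term` by following all parent links."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def encode(assigned: Iterable[str], vocabulary: List[str],
           parents: Dict[str, List[str]]) -> List[int]:
    """One component per vocabulary entry: 2 = assigned, 1 = inferred, 0 = absent."""
    assigned_set = set(assigned)
    inferred: Set[str] = set()
    for term in assigned_set:
        inferred |= ancestors(term, parents)
    vector = []
    for term in vocabulary:
        if term in assigned_set:
            vector.append(2)
        elif term in inferred:
            vector.append(1)
        else:
            vector.append(0)
    return vector


VOCAB = ["Common Cold", "Respiratory Tract Infections",
         "Respiratory Tract Diseases", "Hepatitis D"]
print(encode(["Common Cold"], VOCAB, PARENTS))  # [2, 1, 1, 0]
```

An assigned term takes precedence over an inferred one, so a descriptor that is both assigned directly and an ancestor of another assigned descriptor still gets 2.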

13 MeSH Representation: Example
PMID       Rats   Molecular Sequence Data   Metabolism   Amino Acid Sequence   Male
10027300   2      1                         0            2                     2
10028916   2      0                         2            0                     2

14 Results (Full-Text)

15 Results (Descriptors Only)

16 Results (Qualifiers Only)

17 Results (Descriptors + Qualifiers)

18 Results (Two Cluster)

19 Summarizing Clusters
- Compute cluster center using original representation
  - Mean of each component
- Return qualifier/descriptor if component > 0.5
  - “At least half papers have the term inferred”
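A small sketch of this summary rule, assuming papers are encoded with 2 (assigned), 1 (inferred), 0 (not present). The function name and the two sample rows are made up for illustration.

```python
from typing import List, Sequence


def summarize(cluster: List[Sequence[float]], vocabulary: List[str],
              threshold: float = 0.5) -> List[str]:
    """Return the terms whose mean component over the cluster exceeds `threshold`."""
    if not cluster:
        return []
    n = len(cluster)
    center = [sum(row[i] for row in cluster) / n for i in range(len(vocabulary))]
    return [term for term, value in zip(vocabulary, center) if value > threshold]


vocab = ["Rats", "Molecular Sequence Data", "Metabolism",
         "Amino Acid Sequence", "Male"]
cluster = [
    [2, 1, 0, 2, 2],
    [2, 0, 2, 0, 2],
]
print(summarize(cluster, vocab))
# ['Rats', 'Metabolism', 'Amino Acid Sequence', 'Male']
```

Here "Molecular Sequence Data" has a mean of exactly 0.5 and is dropped because the comparison is strict. Since an assigned term contributes 2, a term assigned in only a minority of papers can still clear the threshold, so the "at least half" reading is a rule of thumb and not an exact guarantee.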

20 Conclusions
- Using full-text representation
  - No visually obvious clusters in PCA space
  - Produces small, narrowly focused clusters with larger diameters
- Using MeSH representations
  - Visually obvious clusters in PCA space
  - Summaries are easier to obtain
  - Combining descriptors and qualifiers is helpful

21 Future Work
- Associate MeSH descriptors with their qualifiers
  - What does it mean to “infer” terms?
  - Combinatorial challenge
- Evaluate with different distance methods
  - Investigated: Manhattan, Euclidean, cosine
  - Try: Tree-based
- Use LSI/SVD representation
  - Early results are similar
- Term weighting
- Investigate leveled representations
  - Iteratively refine clustering and classification based on levels in MeSH hierarchy
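For reference, the three distance measures listed as investigated can be written as follows. This is a generic sketch on made-up vectors, not the project's implementation; the tree-based measure is not shown because the slides do not define it.

```python
import math
from typing import Sequence


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute component differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; treated as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Two made-up vectors pointing in the same direction with different magnitudes.
a = [2, 1, 0]
b = [4, 2, 0]
print(manhattan(a, b))        # 3
print(euclidean(a, b))        # sqrt(5), about 2.236
print(cosine_distance(a, b))  # approximately 0: same direction
```

Manhattan and Euclidean distance grow with the difference in magnitude, while cosine distance depends only on direction, which matters when papers differ mainly in how many terms they carry.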

22 Acknowledgements
- Susan Bromberg, RGD
- Chitti Dharmonalla
  - Current Masters student, clustering
- David Diggs, Yun Guan, Kevin Indrebo, Arti Mann
  - Classification
- Jacob Buchholz
  - Classification with boosting

23 Rat Genome Database
- “The Rat Genome Database (RGD) collects and integrates rat genetic and genomic data and makes it widely available to the scientific community using rat as a genetic model to study human disease.”
- Researchers at the Medical College of Wisconsin (MCW) read papers related to rat and decide whether each one contains relevant rat-related information.
- Many papers related to rat research are published every month (approximately 1200).
- RGD curation is a time-consuming process.
- Most papers do not contain RGD-related information.

