Semi-Automatic Indexing of Full Text Biomedical Articles Washington D.C. October 25, 2005 Clifford W. Gay Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA
2 Acknowledgments u Alan R. Aronson, PhD. u Mehmet Kayaalp, M.D., PhD.
3 Outline u Introduction l The System: Medical Text Indexer (MTI) l The Data: Online biomedical journals l The Task: Emulate Medline indexing using full text u Results l Observations on PubMed Central articles l Model selection results l Recent work
5 Why Semi-Automatic Indexing? u U.S. National Library of Medicine indexes 5000 journal titles l Supports over 60 million PubMed searches each month l Has 130 indexers l Indexed 570,000 articles in 2004 n Will need to index 1,000,000 very soon l Automated support is helping to meet this demand –MTI was used on 26% of articles in 2004 u More about MTI l l Aronson AR, Mork JG, Gay CW, Humphrey SM, Rogers WJ. The NLM Indexing Initiative's Medical Text Indexer. Medinfo. 2004; 11(Pt 1): PMID:
Title + Abstract et al. Ordered list of MeSH Terms MeSH Headings UMLS Concepts Postprocessing Restrict to MeSH Trigram Phrase Matching Rel. Cits. PubMed Related Citations Extract MeSH Phrasex MetaMap Phrases Medical Text Indexer (MTI)
7 DCMS with MTI Suggestions
9 Why Full Text? u Medical Text Indexer uses article title and abstract u However l Human indexers taught not to use abstract l Author’s complete intent may not be in abstract l Check tags may only appear in a table or methods section. u If MTI indexes from full text articles it may l Find central concepts missing from abstract l Identify terms when article has no abstract l More accurately select check tags l Be in better compliance with indexing policy
10 Test Collection Selection u Available online from PubMed Central u Consistent XML format l Identifies title, abstract, sections, tables, figures, references, etc. u 500 articles from 17 diverse biomedical journals u Did not use: l References l Graphics l Math
11 Test Collection u 5 Clinical journals (165): l Breast Cancer Research (11) l Journal of Clinical Microbiology (80) u 3 Organization based journals (28): l Journal of American Medical Informatics Assoc. (10) l Proceeding of the National Academy of Sciences (11) u 9 Journals in other categories: Pharmacology (65); Biochemistry (65); Plants (46); Molecular Biology (45); Learning (30); Hospitals (22) Pharmacology (65); Biochemistry (65); Plants (46); Molecular Biology (45); Learning (30); Hospitals (22)
13 Indexing Task
u Medline Indexing beta-Lactamases /*genetics /*metabolism Enterobacteriaceae/drug effects /*enzymology/genetics Plasmids/*genetics Genes, Bacterial/genetics GenotypeKinetics Microbial Sensitivity Tests Molecular Sequence Data Research Support, Non-U.S. Gov't Example Article DNA Transposable Elements DNA Transposable Elements Escherichia coli Escherichia coli Genes, Bacterial Genes, Bacterial Cloning, Molecular Cloning, Molecular Klebsiella pneumoniae Klebsiella pneumoniae Amino Acid Sequence Amino Acid Sequence Microbial Sensitivity Tests Microbial Sensitivity Tests Cephalothin Cephalothin Proteus mirabilis Proteus mirabilis Erwinia Erwinia Salmonella typhimurium Salmonella typhimurium Enterobacteriaceae Infections Enterobacteriaceae Infections Lactams Lactams beta-Lactamases beta-Lactamases Plasmids Plasmids Enterobacteriaceae Enterobacteriaceae beta-Lactam Resistance beta-Lactam Resistance Conjugation, Genetic Conjugation, Genetic Cephalosporin Resistance Cephalosporin Resistance Cefotaxime Cefotaxime Nucleotide Sequences Nucleotide Sequences Molecular Sequence Data Molecular Sequence Data Cephalosporins Cephalosporins Chromosomes, Bacterial Chromosomes, Bacterial DNA, Bacterial DNA, Bacterial u MTI Indexing MMI MMI REL REL MMI & REL MMI & REL Recall = 0.67 Precison = 0.24 F 2 measure = 0.492
15 Evaluation u Measure u F 2 Measure l Weighted harmonic mean of Recall and Precision l Weights Recall twice as important as Precision l Values: 0.0 to 1.0 u Computed for each article and averaged
17 Section Header Classes u Semantically equivalent section headers u u MATERIALS AND METHODS class: l l Materials and Method(s) l l Method(s) l l Scoring Methods l l Experimental Procedures l l Other Methods Tested u CAPTIONS class: l the titles and captions from tables and figures
18 Section Class Average F 2 CAPTIONS ABSTRACT INTRODUCTION RESULTS DISCUSSION NO HEADER …… CONCLUSIONS ABBREVIATIONS Section Class Performance
20 Experiments u Varied MTI components used l MetaMap Indexing (MMI) l Related Citations (REL) u Varied section classes processed l Used model selection l Used binary weighting for sections u A model is l A selection of section classes and l The text in those sections l That represents the article
21 Production Baseline Title+Abstract MMI REL F 2 = 0.457
22 Naive Mode Title+Abstract MMI REL Materials and Methods Results and Discussion No Header F 2 = ( - 0.9%) All Section Classes
23 MetaMap Indexing Mode Title+Abstract MMI REL Introduction Results Discussion Other No Header F 2 = (-18.4%) Captions
24 Augmented Mode Title+Abstract MMI REL Introduction Results Discussion Other No Header F 2 = (+3.9%) Captions
25 Refined Augmented Mode Title+Abstract MMI REL Captions Results Background F 2 = (+ 6.1%)
26 Full MTI Mode Title+Abstract MMI REL Introduction Results Discussion Other No Header F 2 = (+ 6.8%) MMI model Captions
27 Refined Full MTI Title+Abstract MMI REL Results Results and Discussion No Header F 2 = (+ 7.4%) Captions Conclusions
28 MTI Performance Summary Indexing Model RecallPrecision Avg. F 2 Production Baseline (Ti, Ab) Naive Mode (full text) Augmented Mode (MMI + REL (Ti, Ab)) Augmented Mode (refined) Full MTI (MMI + REL common sections) Full MTI (refined)
30 Improvement Potential u With current model l No cut off at 25 terms yields maximum recall of 0.79 u If all good terms prioritized correctly l l F 2 = 0.64 l l Improvement over baseline 7% 40%
31 Increase REL Citations u MTI currently uses 10 Related Citations u Optimal number for full text articles is 15 u Best model confirmed for this setting u Additional Improvement in F 2 = 0.01
32 Summarization u Selecting important text before MTI processing u Using Yeh, Ke, Yang, Meng approach u Combines l Latent Semantic Analysis and l Salton’s Text Relationship Map u Start with current model u Document representation includes l Bag of words l MetaMap identified concepts
NLM Indexing Initiative Clifford W. Gay Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA
34 NONE Sections u Most appear in articles that have no abstract l 20/23 u Some are errors l 4 have “Introduction” header in publisher version l 2 appear within other sections with headers. u Many contain the primary text of the article l Comments, Editorials, Letters (11/23)
35 Other Sections u Other section class has 525 sections (16%) u Non-standard article organization l Common in Review articles u Example l ß-Lactamases of Kluyvera ascorbata, Probable Progenitors of Some Plasmid-Encoded CTX-M Types n Bacterial strains. n Antimicrobial agents and susceptibility testing. n Kinetic and IEF analyses. n Genetic characterization of blaKLUA. n Genetic environment of blaKLUA-1. n Arguments for mobilization of chromosomal blaKLUA gene.
36 Ranking Function u Made ranking function for Related Citations more like MetaMap Indexing. u Resulted in a more inclusive model l Materials and Methods l Introduction u F2 measure =
37 Tuning Path Weight u Ratio of weights between the two indexing paths l MetaMap Indexing – 7 l Related Citations – 2 u No improvement possible
38 Partial Weight for Singleton Headers u OTHER section class l Header is unique l Contain content terms u Gave section class weight between 0 and 1 l Some recall improvement l No collection wide improvement in F 2