Automatically Identifying Candidate Treatments from Existing Medical Literature Catherine Blake Information & Computer Science University.

Presentation transcript:

Automatically Identifying Candidate Treatments from Existing Medical Literature Catherine Blake Information & Computer Science University of California, Irvine Wanda Pratt Information School and Division of Biomedical & Health Informatics University of Washington

Motivation
Information overload
– MEDLINE = 11 million citations
– additional 8,000 each week
Specialization of research
– low communication between scientific areas
– little focus on ‘big picture’

Goal
Provide scientists with promising new treatment strategies
Assumptions
– Medical literature has implicit links
– Deductive logic can identify these links: if A then B, and if B then C, then A → C

Previous Approach
Swanson and Smalheiser (1997)
Source Literature C: Migraine
B-terms: Calcium Channel Blockers, Platelet Activity, Serotonin, ...
Target Literature A: Magnesium
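
The C → B → A idea on this slide can be sketched in a few lines. This is a minimal illustration, not Swanson and Smalheiser's system: the titles are invented, the function name `candidate_links` is hypothetical, and plain substring matching stands in for whatever term matching a real system would use.

```python
def candidate_links(c_titles, a_titles, b_terms):
    """Return the B-terms that occur in both the C (source) and A (target) titles."""
    def terms_in(titles):
        found = set()
        for title in titles:
            lowered = title.lower()
            # Naive substring matching; an assumption made for brevity.
            found.update(b for b in b_terms if b in lowered)
        return found

    return sorted(terms_in(c_titles) & terms_in(a_titles))


# Toy titles, invented for illustration only.
migraine_titles = [
    "Serotonin levels in migraine patients",
    "Platelet activity during migraine attacks",
]
magnesium_titles = [
    "Magnesium modulates platelet activity",
    "Effects of magnesium on serotonin receptors",
]
b_terms = ["serotonin", "platelet activity", "calcium channel blockers"]

links = candidate_links(migraine_titles, magnesium_titles, b_terms)
print(links)
```

A B-term is proposed only when it appears on both sides, which is the "if A then B and if B then C" reasoning applied to co-occurrence in titles.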

Current Pruning
                Words     Distinct Words
No Pruning      14,051    2,762
Stemmed         13,112    2,492
Manual Pruning
– Remove ‘redundancies and non-useful terms’
– ~92-94% of B-terms are manually pruned!

Our Approach
Semantic representation
– Unify synonymous text expressions
– e.g. Serotonin = {5-HT, 5HT, Enteramine, 5-Hydroxytryptamine, 3-(2-Aminoethyl)-1H-indol-5-ol}
Prune using semantic types
– e.g. Serotonin is a {Organic Chemical, Pharmacologic Substance, Neuroreactive Substance or Biogenic Amine}
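
Unifying synonymous expressions amounts to a lookup from each surface form to one preferred concept name. The sketch below assumes a small hand-built table seeded with the serotonin example from the slide; in the talk this information comes from the UMLS, and the names `SYNONYMS` and `unify` are hypothetical.

```python
# Preferred concept name -> alternative surface forms (from the slide's example).
SYNONYMS = {
    "Serotonin": ["5-HT", "5HT", "Enteramine", "5-Hydroxytryptamine"],
}

# Reverse index: lowercased surface form -> preferred concept name.
SURFACE_TO_CONCEPT = {
    form.lower(): concept
    for concept, forms in SYNONYMS.items()
    for form in forms + [concept]
}


def unify(term):
    """Map a term to its preferred concept name; unknown terms pass through unchanged."""
    return SURFACE_TO_CONCEPT.get(term.lower(), term)


print(unify("5-HT"))       # Serotonin
print(unify("Magnesium"))  # Magnesium
```

With this mapping, titles that say "5-HT" and titles that say "Serotonin" contribute to the same feature instead of two separate ones.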

Unified Medical Language System (UMLS)
(1) Metathesaurus
– 311 vocabularies
– 776,940 concepts
– ~11 million relationships
– 2.10 million strings
(2) Semantic Network
– 134 semantic types
– 54 semantic relations
(3) SPECIALIST lexicon
– POS + morphological entries
– nouns
– verbs

Methodology
Collect migraine citations
Generate alternative features
– word
– concept
– semantically pruned concepts
Evaluate C → B connections

Word Representation
Domain independent
Common choice
Title words (to compare with Swanson)
Removed
– 417 generic stopwords*, e.g. a, and, between, their, really, room, said, think, the, ...
– 31 medical stopwords, e.g. clinical, observed, provide, selection, study, therapy, test, ...
* Source: Sanderson, M. (1999) Available at
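
A rough sketch of the word representation: tokenize each title and drop stopwords. Only the example stopwords quoted on the slide are included here, not the full 417 generic and 31 medical lists, and the tokenizer pattern and the name `title_words` are assumptions made for illustration.

```python
import re

# Only the examples quoted on the slide; the full lists are not reproduced here.
GENERIC_STOPWORDS = {"a", "and", "between", "their", "really", "room", "said", "think", "the"}
MEDICAL_STOPWORDS = {"clinical", "observed", "provide", "selection", "study", "therapy", "test"}
STOPWORDS = GENERIC_STOPWORDS | MEDICAL_STOPWORDS


def title_words(title):
    """Lowercase a title, split it into word tokens, and remove stopwords."""
    # Keeps internal hyphens so that forms such as "5-ht" stay in one token.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())
    return [t for t in tokens if t not in STOPWORDS]


words = title_words("A clinical study between serotonin and migraine")
print(words)  # ['serotonin', 'migraine']
```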

Concept Representation
Medical specific
Titles mapped to UMLS concepts
Mapped automatically
(1) partition title sentences into phrases
(2) for each phrase
(2a) try a direct concept match (UMLS API)
(2b) if not found, try an approximate match (UMLS API) and select the best concept
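
The mapping steps above can be sketched as follows. This is not the UMLS API: a tiny invented concept table stands in for the Metathesaurus, a crude split on function words stands in for phrase partitioning, and Python's `difflib` stands in for the approximate match. All names (`CONCEPTS`, `split_phrases`, `map_phrase`, `map_title`) are hypothetical.

```python
import difflib
import re

# Toy concept table (lowercased phrase -> concept name), invented for illustration.
CONCEPTS = {
    "serotonin": "Serotonin",
    "migraine": "Migraine Disorders",
    "calcium channel blockers": "Calcium Channel Blockers",
}


def split_phrases(title):
    """Step 1: partition a title into phrases at function words and punctuation."""
    parts = re.split(r"\b(?:in|of|and|with|for|the)\b|[,;:]", title.lower())
    return [p.strip() for p in parts if p.strip()]


def map_phrase(phrase):
    """Step 2: try a direct match, then fall back to the best approximate match."""
    if phrase in CONCEPTS:  # (2a) direct match
        return CONCEPTS[phrase]
    # (2b) approximate match; the 0.8 cutoff is an arbitrary choice for this sketch.
    close = difflib.get_close_matches(phrase, list(CONCEPTS), n=1, cutoff=0.8)
    return CONCEPTS[close[0]] if close else None


def map_title(title):
    """Map every phrase of a title, skipping phrases with no matching concept."""
    concepts = []
    for phrase in split_phrases(title):
        concept = map_phrase(phrase)
        if concept is not None:
            concepts.append(concept)
    return concepts


mapped = map_title("Calcium channel blocker in migraine")
print(mapped)  # ['Calcium Channel Blockers', 'Migraine Disorders']
```

Here the singular "calcium channel blocker" has no direct entry, so it is resolved by the approximate step, while "migraine" matches directly.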

Semantically Pruned Concept
Used 37 of 134 semantic types in UMLS
– Substance
– Chemical
– Hormone
– Gene or Genome
– Enzyme
– Cell
– Amino Acid, Peptide or Protein
– Neuroreactive Substance or Biogenic Amine
– ...
Goal: generalize semantic types, not blinded to B-terms
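
Semantic pruning can be read as a filter: keep a concept only if at least one of its semantic types is on the retained list. The sketch below uses the types named on the slide (a subset of the 37) and a toy concept-to-type table. The serotonin entry follows the earlier slide; the "Headache" entry and the names `KEPT_TYPES`, `CONCEPT_TYPES`, and `prune` are assumptions for illustration.

```python
# Subset of the retained semantic types, as listed on the slide.
KEPT_TYPES = {
    "Substance",
    "Chemical",
    "Hormone",
    "Gene or Genome",
    "Enzyme",
    "Cell",
    "Amino Acid, Peptide or Protein",
    "Neuroreactive Substance or Biogenic Amine",
}

# Toy concept -> semantic types table; a real system would look these up in the UMLS.
CONCEPT_TYPES = {
    "Serotonin": {
        "Organic Chemical",
        "Pharmacologic Substance",
        "Neuroreactive Substance or Biogenic Amine",
    },
    "Headache": {"Sign or Symptom"},  # illustrative entry
}


def prune(concepts, concept_types=CONCEPT_TYPES, kept_types=KEPT_TYPES):
    """Keep concepts that have at least one retained semantic type."""
    return [c for c in concepts if concept_types.get(c, set()) & kept_types]


print(prune(["Serotonin", "Headache"]))  # ['Serotonin']
```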

Evaluation - Number of Relevant Items
Step 1: Find potentially relevant titles
– any representation + synonyms
– e.g. calcium channel blockers: any word in { calcium, channel, blockers, blocker }
Step 2: Verify each title
– Not all relevant B-terms indicated relevant links
– E.g. “Timolol maleate, a beta blocker, in the treatment of common migraine headache” ≠ calcium channel blocker

Evaluation - Metrics
(1) Precision = (Number of relevant B-terms) / (Number of B-terms returned)
(2) Recall = (Number of relevant B-terms) / (Number of relevant titles)
(3) Number of C → B links identified
(4) Feature space dimensionality
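
As a worked example of metrics (1) and (2), the sketch below computes both ratios from raw counts. The counts are invented for illustration and are not results from the talk; the function names are hypothetical.

```python
def precision(relevant_b_terms, b_terms_returned):
    """Metric (1): relevant B-terms divided by B-terms returned."""
    return relevant_b_terms / b_terms_returned if b_terms_returned else 0.0


def recall(relevant_b_terms, relevant_titles):
    """Metric (2): relevant B-terms divided by relevant titles."""
    return relevant_b_terms / relevant_titles if relevant_titles else 0.0


# Invented counts: 8 relevant B-terms among 50 returned, with 20 relevant titles.
p = precision(8, 50)
r = recall(8, 20)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.16 recall=0.40
```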

Interpolated Precision

Number of Links Identified

Dimensionality

Future Work
Extend to B → A connections
Use abstracts
– dimensionality consequences
Generalize
– Raynaud’s disease and fish oil
– other research questions

Conclusions
Concept vs Words
– improved precision and recall
– more of the 11 connections in top 50 B-terms
Semantic Pruning vs Concept
– degraded recall
– improved precision
– more of the 11 connections in top 50 B-terms

Catherine Blake Wanda Pratt

References
Davis, R. (1989). The Creation of New Knowledge by Information Retrieval and Classification. Journal of Documentation 45(4)
Lindsay, R. K. and M. D. Gordon (1999). Literature-Based Discovery by Lexical Statistics. Journal of the American Society for Information Science 50(7):
Sanderson, M. (1999). Stop word list. Available at:
Swanson, D. R. (1988). Migraine and magnesium: eleven neglected connections. Perspect. Biol. Med. 31:
Swanson, D. R. and N. R. Smalheiser (1997a). An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artificial Intelligence:
Weeber, M., Klein, H., Mork, J. G., Jong-van den Berg, L., Vos, R. (2000). Text-Based Discovery in Biomedicine: The Architecture of the DAD-system. AMIA.