Automated recognition of malignancy mentions in biomedical literature. BMC Bioinformatics 2006, 7:492. Speaker: Yu-Ching Fang. Advisors: Hsueh-Fen Juan and Hsin-Hsi Chen.

Presentation transcript:

1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan and Hsin-Hsi Chen

2 Outline Background Methods Results Discussion Conclusion

3 Background - Motivation The rapid proliferation of biomedical literature makes it increasingly difficult for researchers to peruse, query, and synthesize it for biomedical knowledge gain. Comparatively little biomedical text mining work has been performed to identify disease-related objects and concepts. Related work on automated disease entity recognition often does not perform well. More extensive work on medical entity class recognition is necessary.

4 Related works 1. Automated extractors for the identification of gene and protein names. 2. Automated entity recognition applied to the identification of phenotypic and disease objects. 3. A machine-learning algorithm to extract gene-disorder relations. 4. Extraction of phenotypic attributes from Online Mendelian Inheritance in Man (OMIM).

5 Goal Develop a named entity recognizer (MTag), an entity tagger for recognizing clinical descriptions of malignancy presented in text. MTag is based upon the probability model Conditional Random Fields (CRFs). Minimize manual effort while still performing with high accuracy.

6 Conditional Random Fields (CRFs) CRFs are probabilistic tagging models that give the conditional probability of a possible tag sequence $t = t_1, \ldots, t_n$ given the input token sequence $o = o_1, \ldots, o_n$ (Ryan McDonald and Fernando Pereira, 2005). For example, the identification of gene mentions in text can be implemented as a tagging task in which each token begins (B), continues (I), or is outside (O) of a gene mention. [Figure: example token sequence o with its tag sequence t; not preserved in the transcript.]
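The B/I/O scheme described on this slide can be sketched in a few lines of Python. The sentence, the token span, and the helper name `bio_tags` below are hypothetical and chosen only for illustration; they do not come from the presentation.

```python
def bio_tags(tokens, mention_spans):
    """Assign B/I/O tags from (start, end) token spans; end is exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in mention_spans:
        tags[start] = "B"                 # first token of the mention
        for i in range(start + 1, end):
            tags[i] = "I"                 # remaining tokens of the mention
    return tags


# Hypothetical example sentence with one assumed gene mention at tokens 3-4.
tokens = ["Mutations", "in", "the", "MYCN", "oncogene", "were", "found", "."]
tags = bio_tags(tokens, [(3, 5)])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Turning mention spans into one tag per token is what lets a sequence model such as a CRF treat entity recognition as a tagging problem.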

7 Conditional Random Fields (CRFs) (John Lafferty et al., 2001) [Figure: CRF formula relating the input token sequence to the tag sequence; not preserved in the transcript.]
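For reference, a linear-chain CRF is commonly written as below. This is the standard textbook form and not necessarily the exact notation shown on the slide. The weights $\lambda_i$ and the normalizer $Z(o)$ are symbols introduced here, and the feature functions $f_i$ are written with the same arguments used later in the talk.

```latex
P(t \mid o) = \frac{1}{Z(o)} \exp\Big( \sum_{j=1}^{n} \sum_{i} \lambda_i \, f_i(t_j, t_{j-1}, o) \Big),
\qquad
Z(o) = \sum_{t'} \exp\Big( \sum_{j=1}^{n} \sum_{i} \lambda_i \, f_i(t'_j, t'_{j-1}, o) \Big)
```

Here $Z(o)$ sums over all candidate tag sequences $t'$, so the probabilities of all tag sequences for a given input sum to one.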

8 Methods – Task definition To develop an automated method that would accurately identify and extract strings of text corresponding to a clinician’s or researcher’s reference to cancer (malignancy). Label “Malignancy”: the full noun phrase encompassing a mention of a cancer subtype. For example, “neuroblastoma”, “localized neuroblastoma” and “primary extracranial neuroblastoma” were considered to be distinct mentions of malignancy. Directly adjacent prepositional phrases were not allowed, such as “cancer ”.

9 Two corpus combination 1. The first corpus concentrated upon a specific malignancy (neuroblastoma) and consisted of 1,000 randomly selected abstracts identified by querying PubMed with the query terms "neuroblastoma" and "gene". 2. The second corpus consisted of 600 abstracts previously selected as likely containing gene mutation instances for genes commonly mutated in a wide variety of malignancies.

10 Two corpus combination (cont.) 1,000 + 600 - 158 = 1,442 abstracts (after eliminating 158 abstracts that appeared to be non-topical, had no abstract body, or were not written in English). The abstracts were manually annotated for tokenization, part-of-speech assignments, and malignancy named entity recognition.

11 Two corpus combination (cont.) Annotations were performed on all documents by experienced annotators with biomedical knowledge. Discrepancies were resolved through forum discussions. A total of 7,303 malignancy mentions were identified in the document set.

12 MTag algorithm MTag was developed using the probability model Conditional Random Fields (CRFs). CRFs model the conditional probability of a tag sequence given an observation sequence. O is an observation sequence, or a sequence of tokens in the text. t is a corresponding tag sequence in which each tag labels the corresponding token with either Malignancy (meaning that the token is part of a malignancy mention) or Other. Example O: "Lung cancer may be related to gene mutation." [The tag sequence t for this example is not preserved in the transcript.]

13 MTag algorithm (cont.) CRFs are based on a set of feature functions, $f_i(t_j, t_{j-1}, O)$. For example, a feature can capture how likely the token "cancer" is to be tagged with the label Malignancy given the presence of "lung" as the previous token. Example O: "Lung cancer may be related to gene mutation." [The tag sequence t for this example is not preserved in the transcript.]
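A feature function of this kind can be sketched as a small indicator in Python. Everything below is an illustrative assumption rather than MTag's actual code: the function names, the weight of 1.5, and the tag sequence, which is inferred from the slide's definition of Malignancy and Other because the original tags are missing from the transcript.

```python
def f_lung_then_cancer(t_j, t_prev, tokens, j):
    """Return 1 when token j is "cancer" tagged Malignancy and token j-1 is "lung".

    t_prev is kept in the signature to mirror f_i(t_j, t_{j-1}, O),
    although this particular feature does not use it.
    """
    return int(
        j > 0
        and t_j == "Malignancy"
        and tokens[j].lower() == "cancer"
        and tokens[j - 1].lower() == "lung"
    )


def sequence_score(tags, tokens, weight):
    """Sum the weighted feature over all positions (an unnormalized score)."""
    total = 0.0
    for j in range(len(tokens)):
        t_prev = tags[j - 1] if j > 0 else None
        total += weight * f_lung_then_cancer(tags[j], t_prev, tokens, j)
    return total


tokens = "Lung cancer may be related to gene mutation .".split()
# Assumed tags: the first two tokens form the malignancy mention.
tags = ["Malignancy", "Malignancy"] + ["Other"] * 7
print(sequence_score(tags, tokens, weight=1.5))
```

In a trained CRF there are many such features, each with a learned weight, and the tag sequence with the highest total score is the one predicted.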

14 MTag algorithm (cont.) MTag considers many textual features when deciding whether a word comprises all or part of a malignancy mention. Word-based features: the frequency of each string of 2, 3, or 4 adjacent characters (character n-grams) within each word of the training text was calculated. For example, "lung" yields lu, un, ng, lun, ung, and lung. The differential frequency of each n-gram within words manually tagged as being malignancy mentions was considered as a series of features. For example, for "lung": bigrams 3/6, trigrams 2/6, four-grams 1/6.
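The character n-gram example for "lung" can be reproduced with a short Python sketch. The helper names `char_ngrams` and `size_shares` are hypothetical. The sketch only enumerates n-grams and their shares by length; it does not attempt the differential-frequency calculation, whose exact definition the slide does not give.

```python
from collections import Counter
from fractions import Fraction


def char_ngrams(word, sizes=(2, 3, 4)):
    """List every substring of 2, 3, or 4 adjacent characters in the word."""
    return [word[i:i + n] for n in sizes for i in range(len(word) - n + 1)]


def size_shares(word):
    """Share of the word's n-grams that have each length."""
    grams = char_ngrams(word)
    counts = Counter(len(g) for g in grams)
    return {n: Fraction(counts[n], len(grams)) for n in sorted(counts)}


print(char_ngrams("lung"))
print(size_shares("lung"))
```

`Fraction` reduces automatically, so the slide's 3/6, 2/6, and 1/6 appear as 1/2, 1/3, and 1/6.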

15 MTag algorithm (cont.) Orthographic features included the usage and distribution of punctuation, alternative spellings, and case usage. Domain-specific features comprised a lexicon of 5,555 malignancies and a regular expression for tokens containing the suffix -oma.
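Orthographic and domain-specific features of this kind are often computed per token. The Python sketch below is an assumption-laden illustration: the regular expression, the feature names, and the one-entry lexicon are all hypothetical, since the slide gives neither MTag's actual pattern nor its feature list.

```python
import re

# Hypothetical pattern for the -oma suffix; the real expression is not shown on the slide.
OMA_SUFFIX = re.compile(r"oma$", re.IGNORECASE)


def token_features(token, lexicon):
    """Compute a few illustrative orthographic and domain features for one token."""
    return {
        "has_oma_suffix": bool(OMA_SUFFIX.search(token)),       # domain-specific
        "in_lexicon": token.lower() in lexicon,                  # domain-specific
        "is_capitalized": token[:1].isupper(),                   # case usage
        "is_all_caps": token.isupper(),                          # case usage
        "has_punctuation": any(not ch.isalnum() for ch in token),
    }


lexicon = {"neuroblastoma"}  # stand-in for the 5,555-term malignancy lexicon
print(token_features("Neuroblastoma", lexicon))
print(token_features("non-Hodgkin", lexicon))
```

Keeping each feature as a named boolean makes it easy to switch the domain-specific group on or off, which is the comparison the results slides report.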

16 Evaluation The evaluation set comprised 432 abstracts: 2,031 sentences containing mentions of malignancy and 3,752 sentences without mentions. A mention was counted as correctly identified only if the predicted and manually labeled tags were exactly the same in content and in both boundary determinations. The performance of MTag was calculated according to precision, recall and F-measure.
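Exact-match scoring can be sketched as set intersection over mention tuples. In the Python sketch below, the tuple layout `(doc_id, start, end, text)`, the function name `evaluate`, and the toy mentions with their offsets are assumptions for illustration; the F-measure is the usual harmonic mean of precision and recall.

```python
def evaluate(gold, predicted):
    """Exact-match precision, recall, and F-measure.

    gold and predicted are sets of (doc_id, start, end, text) tuples, so a
    prediction counts only if its content and both boundaries agree exactly.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy data with made-up offsets: the third prediction has a wrong left boundary.
gold = {(1, 0, 11, "lung cancer"), (1, 30, 43, "neuroblastoma"), (2, 6, 14, "leukemia")}
predicted = {(1, 0, 11, "lung cancer"), (1, 30, 43, "neuroblastoma"), (2, 0, 14, "acute leukemia")}
print(evaluate(gold, predicted))
```

Under this criterion a boundary error is penalized twice, once as a false positive and once as a missed gold mention.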

17 Results - MTag performance Two separate training experiments were performed, either with or without the inclusion of malignancy-specific features, which were the addition of a lexicon of malignancy mentions and a list of indicative suffixes.

18 MTag performance (cont.) Table columns: MTag model evaluation set; all biological feature sets: Yes; all biological feature sets: No.
- Neuroblastoma-specific and genome-specific: Precision, Recall, F-measure (values for both the Yes and No columns are missing from the transcript)
- Neuroblastoma-specific: Precision 0.88, Recall 0.87, F-measure 0.88
- Genome-specific: Precision 0.77, Recall 0.69, F-measure 0.73

19 MTag performance (cont.) As expected, the extractor performed with higher accuracy with the more narrowly defined corpus (neuroblastoma). At least for this class of entities, the extractor performs the task of identifying malignancy mentions efficiently without the use of a specialized lexicon.

20 Extraction versus string matching String matching: the NCI (National Cancer Institute) neoplasm ontology, a term list of 5,555 malignancies, was used as a lexicon to identify malignancy mentions. Lexicon terms were individually queried against text by case-insensitive exact string matching.
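The string-matching baseline can be approximated with a short Python sketch. The function name, the two-term lexicon, and the example sentence are hypothetical. Matching on word boundaries is also an assumption, since the slide says only that terms were queried by case-insensitive exact string matching.

```python
import re


def match_lexicon(text, lexicon):
    """Return (start, end, matched_text) for each case-insensitive exact match."""
    hits = []
    for term in lexicon:
        # re.escape keeps punctuation in lexicon terms from acting as regex syntax.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        hits.extend((m.start(), m.end(), m.group(0)) for m in pattern.finditer(text))
    return sorted(hits)


lexicon = ["acute myeloid leukemia", "neuroblastoma"]  # stand-in for the NCI term list
text = "Acute myeloid leukaemia (AML) and Neuroblastoma were studied."
print([hit[2] for hit in match_lexicon(text, lexicon)])
```

In this example only "Neuroblastoma" is found: the British spelling "leukaemia" and the acronym "AML" do not match any lexicon entry exactly.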

21 Extraction versus string matching (cont.) From the testing set (432 abstracts), 39 abstracts (202 malignancy mentions) were randomly selected and processed by both MTag (the automated extractor) and string matching.

22 Extraction versus string matching (cont.) MTag identified 190 of the 202 mentions correctly (94.1%), while the NCI list identified only 85 mentions (42.1%), all of which were also identified by the extractor.

23 Extraction versus string matching (cont.) Changing the lexicon used for string matching: the NCI list matched 85 mentions (42.1%); a lexicon of the malignancy mentions identified in the manually curated training set annotations (1,010 documents) matched 79 of 202 mentions (39.1%). Combining the manually-derived lexicon with the NCI lexicon yielded 124 of 202 matches (61.4%).

24 Extraction versus string matching (cont.) 202 - 124 = 78 malignancy mentions were missed by string matching with the combined lists; 68 of these were positively identified by MTag. They included acronyms (e.g., "AML" in place of "acute myeloid leukemia"), minor variations in spelling and form (e.g., "leukaemia" versus "leukemia"), and new mentions of malignancies that were in neither the NCI list nor the training set. This suggests that MTag contributes a significant learning component.

25 Application to MEDLINE MTag was used to extract mentions of malignancy from all MEDLINE abstracts through ,433,668 documents A total of 9,153,340 redundant mentions and 580,002 unique mentions (ignoring case) were identified.
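The distinction between redundant and unique mentions (ignoring case) can be sketched with a counter keyed on a case-folded form. The function name and the toy list of mentions in this Python sketch are hypothetical.

```python
from collections import Counter


def mention_counts(mentions):
    """Return (total mentions including repeats, unique mentions ignoring case)."""
    counts = Counter(mention.casefold() for mention in mentions)
    return sum(counts.values()), len(counts)


# Toy stand-in for the extractor's output over many abstracts.
mentions = ["Breast cancer", "breast cancer", "Neuroblastoma", "AML", "aml"]
total, unique = mention_counts(mentions)
print(total, unique)
```

The same counter can also rank mentions by frequency with `Counter.most_common`.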

26 Application to MEDLINE (cont.) The 25 mentions found in the greatest number of abstracts by MTag

27 Application to MEDLINE (cont.) Six false positives: pulmonary, fibroblasts, neoplastic, neoplasm metastasis, extramural, and abdominal. Only "extramural" is not frequently associated with malignancy descriptions. The remaining five phrases are likely the result of the extractor: - failing to properly define mention boundaries in certain cases (for example, "neoplasm" vs. "neoplasm metastasis"); - being misled by the shared use of an otherwise indicative character string (e.g., "opl" in "brain neoplasm" and "neoplastic") between a true positive and a false positive.

28 Application to MEDLINE (cont.) To assess document-level precision, 100 abstracts identified by MTag were randomly selected for each of the malignancies "breast cancer" and "adenocarcinoma". Manual evaluation of these abstracts showed that all of the articles were true positives.

29 MTag input and output MTag directly accepts files downloaded from PubMed and formatted in MEDLINE format as input. The extractor's output is available as text or HTML files.

30 MTag HTML output

31 Discussion It is evident that an F-measure of 0.83 is not sufficient as a stand-alone approach for curation tasks. However, such an approach provides highly enriched material for manual curators to utilize further, offering a substantial improvement in efficiency. MTag appeared to predict malignancy mentions accurately.

32 Discussion (cont.) Analysis of mis-annotations would likely suggest additional features and/or heuristics that could boost performance considerably. There may be no need for extensive domain-specific lexicons, because the addition of biological features provided very little boost to the recall rate.

33 Conclusion MTag is one of the first directed efforts to automatically extract entity mentions in a disease-oriented domain with high accuracy. MTag substantially outperformed information retrieval methods using specialized lexicons. When combined with expert evaluation of output, MTag can assist with vocabulary building for the cancer entity class.

34 Thank you for your attention