Biomedical Information Extraction

Outline
  Intro to biomedical information extraction
  PASTA [Demetriou and Gaizauskas]
  Biomedical named entities
    Name variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter]
    Name tagging [Tanabe and Wilbur]

PASTA [Demetriou and Gaizauskas]
  Protein Active Site Template Acquisition

Extraction Tasks
  Terminological tagging: “entities”
  Template filling: “relationships”

Terminology Tagging
  Term classes: protein, species, residue, site, region, secondary structure, supersecondary structure, quaternary structure, base, atom, non-protein compound, interaction

Template Filling
  residue := NAME:string SITE/FUN:string SEC_STRUCT:string QUAT_STRUCT:string REGION:string INTERACTION:string
  in_protein := RESIDUE:residue PROTEIN:protein
  protein := NAME:string
  species := NAME:string
  in_species := PROTEIN:protein SPECIES:species
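One way to picture these templates is as typed records. The sketch below is an illustration only: the Python class names, the choice to make every slot except NAME optional, and the example values are assumptions, not PASTA's actual data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Protein:
    name: str


@dataclass
class Species:
    name: str


@dataclass
class Residue:
    name: str
    # Treating the remaining slots as optional is an assumption of this sketch.
    site_fun: Optional[str] = None
    sec_struct: Optional[str] = None
    quat_struct: Optional[str] = None
    region: Optional[str] = None
    interaction: Optional[str] = None


@dataclass
class InProtein:
    # Relation template: a residue occurs in a protein.
    residue: Residue
    protein: Protein


@dataclass
class InSpecies:
    # Relation template: a protein occurs in a species.
    protein: Protein
    species: Species


# Hypothetical, made-up values used only to show how the slots link up.
example_protein = Protein(name="example protein")
example_residue = Residue(name="example residue", site_fun="active site")
example_relations = [
    InProtein(residue=example_residue, protein=example_protein),
    InSpecies(protein=example_protein, species=Species(name="example species")),
]
```

The entity templates (residue, protein, species) hold strings, while the relation templates (in_protein, in_species) point at other templates, which is what separates template filling from plain term tagging.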

PASTA Architecture
  Text Preprocessing
    Title, author, abstract
    Tokenization, sentence boundaries

PASTA Architecture
  Terminological Processing
    Morphological analysis: biochemical morphemes such as “-ase”
    Lexical lookup: token lookup in databases, token grammatical class tagging
    Terminology parsing: create multi-token terms by rule-based parsing over the grammatical tags
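The three terminological steps can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the two mini-lexicons stand in for the database lookups, the class labels are simplified, and the single merge rule (species followed by protein) is invented here and is not one of PASTA's grammar rules.

```python
# Hypothetical mini-lexicons standing in for the database lookups.
PROTEIN_LEXICON = {"lysozyme"}
SPECIES_LEXICON = {"rat", "human"}


def tag_token(token):
    """Return a term class for one token, or None if no evidence is found."""
    lower = token.lower()
    # Lexical lookup.
    if lower in PROTEIN_LEXICON:
        return "protein"
    if lower in SPECIES_LEXICON:
        return "species"
    # Morphological analysis: the biochemical morpheme "-ase".
    if len(lower) > 3 and lower.endswith("ase"):
        return "protein"
    return None


def parse_terms(tokens):
    """Combine tagged tokens into (term, class) pairs with one toy rule."""
    tagged = [(tok, tag_token(tok)) for tok in tokens]
    terms = []
    i = 0
    while i < len(tagged):
        tok, tag = tagged[i]
        nxt = tagged[i + 1] if i + 1 < len(tagged) else None
        # Toy rule (assumption): species + protein -> one multi-token protein term.
        if tag == "species" and nxt is not None and nxt[1] == "protein":
            terms.append((f"{tok} {nxt[0]}", "protein"))
            i += 2
        else:
            if tag is not None:
                terms.append((tok, tag))
            i += 1
    return terms
```

A bare suffix test like this also fires on ordinary words such as "base" or "phase", which is one reason a real system combines the morphological cue with lexical lookup and grammar rules.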

PASTA Architecture
  Syntactic and Semantic Processing
    Part-of-speech tags
    Phrase structure
    Compositional semantics
  Discourse Processing
    Semantic representations incorporated into a discourse model of concept hierarchy and inference rules

PASTA Architecture
  Template Extraction
    Scan the discourse model for template instances, check slots, build the template

Performance (R = recall, P = precision)
                Dev        Inter-annotator   Test
  Terminology   88R/94P    92R/86P           82R/84P
  Template      69R/79P    78R/80P           69R/64P
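To compare the columns with a single number, recall and precision can be combined into a balanced F-measure (their harmonic mean). The slide reports only R and P, so any F value is derived here and is not a figure from the presentation; the function name and dictionary layout are assumptions of this sketch.

```python
def f_measure(recall, precision):
    """Balanced F-measure: the harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) pairs copied from the performance table.
SCORES = {
    ("terminology", "dev"): (88, 94),
    ("terminology", "inter-annotator"): (92, 86),
    ("terminology", "test"): (82, 84),
    ("template", "dev"): (69, 79),
    ("template", "inter-annotator"): (78, 80),
    ("template", "test"): (69, 64),
}

# Derived values, one per table cell.
F_SCORES = {key: f_measure(r, p) for key, (r, p) in SCORES.items()}
```

Because the harmonic mean is pulled toward the lower of the two inputs, the drop in template precision on the test set lowers the combined score more than an arithmetic average would.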

PASTAWeb
  Index
    document -> terminology, template
    terms -> templates from multiple documents
  IE tools need to be incorporated into effective interfaces for biology researchers

Indexing Problem
  Variations in the expression of the same protein name

Contrast and Variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter]
  Named entities: location vs. identification
  Variability: somatotropin, rat somatotropin, growth hormone

Variability
  Non-contrast (synonyms): tumor protein homolog vs. tumour protein homologue
  Contrast (diffonyms?): ACE1 vs. ACE2

Transformations
  1. Remove first character
  2. Remove first word
  3. Remove last character
  4. Remove last word
  5. Replace sequence of vowels with one letter
  6. Replace hyphen with space
  7. Remove parenthesized material
  8. Convert to lowercase
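The eight transformations are simple string operations, so they can be sketched directly. The code below is a minimal illustration, not the authors' implementation: the function names, the placeholder letter "V" for a vowel sequence, and the rule that an empty result never counts as a match are all assumptions.

```python
import re


def remove_first_char(name):
    return name[1:]


def remove_first_word(name):
    return " ".join(name.split()[1:])


def remove_last_char(name):
    return name[:-1]


def remove_last_word(name):
    return " ".join(name.split()[:-1])


def collapse_vowels(name):
    # Assumption: every run of vowels becomes the single placeholder letter "V".
    return re.sub(r"[aeiouAEIOU]+", "V", name)


def hyphen_to_space(name):
    return name.replace("-", " ")


def remove_parenthesized(name):
    return re.sub(r"\s*\([^)]*\)", "", name).strip()


def to_lowercase(name):
    return name.lower()


TRANSFORMATIONS = {
    "remove first character": remove_first_char,
    "remove first word": remove_first_word,
    "remove last character": remove_last_char,
    "remove last word": remove_last_word,
    "replace vowel sequence": collapse_vowels,
    "replace hyphen with space": hyphen_to_space,
    "remove parenthesized material": remove_parenthesized,
    "convert to lowercase": to_lowercase,
}


def unifying_transformations(a, b):
    """List the transformations that turn two different names into the same string."""
    if a == b:
        return []
    hits = []
    for label, transform in TRANSFORMATIONS.items():
        result = transform(a)
        # Ignore empty results, e.g. removing the only word of a one-word name.
        if result and result == transform(b):
            hits.append(label)
    return hits
```

Running `unifying_transformations` over pairs of names shows which single edit separates them, which is the kind of evidence needed to judge whether a given difference is contrastive.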

Experiment
  Collect groups of synonymous gene names
  Get mouse, rat, and human genes from LocusLink
  Group the OFFICIAL GENE NAME, PREFERRED GENE NAME, OFFICIAL SYMBOL, PREFERRED SYMBOL, PRODUCT, PREFERRED PRODUCT, ALIAS SYMBOL, and ALIAS PROT entries together as synonyms

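Results
  LMW, RMC, RMW (remove first word, last character, last word) identify contrastive variability
  Contrasts are likely marked at name boundaries
  VS, HYPH, CASE, PM (vowel sequences, hyphens, case, parenthesized material) identify non-contrastive variability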

Pattern Heuristics
  1. Equivalence of vowel sequences
  2. Optionality of hyphens
  3. Optionality of parenthesized material
  4. Case insensitivity
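Taken together, the four heuristics amount to a normalization key: two names that map to the same key are treated as variants of one name. The sketch below is one possible reading, with assumptions stated in the comments: hyphens become spaces, vowel runs become the placeholder "V", and the function name `normalize` is invented for this example.

```python
import re


def normalize(name):
    """Map a gene or protein name to a key that ignores non-contrastive variation."""
    key = name.lower()                     # 4. case insensitivity
    key = re.sub(r"\([^)]*\)", " ", key)   # 3. parenthesized material is optional
    key = key.replace("-", " ")            # 2. hyphens are optional (assumption: treat as space)
    key = re.sub(r"[aeiou]+", "V", key)    # 1. vowel sequences are equivalent (placeholder "V")
    return " ".join(key.split())           # tidy up leftover whitespace


def same_name(a, b):
    """True if the two names differ only in non-contrastive ways."""
    return normalize(a) == normalize(b)
```

Nothing here touches the start or end of a name, so boundary differences such as ACE1 vs. ACE2 survive normalization and remain distinct.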

Tagging Genes and Proteins [Tanabe and Wilbur]
  ABGene
    Trained on MEDLINE abstracts
    Tested on PUBMED full texts

ABGene
  Transformation-based tagger
  False-positive and false-negative filters
  Compound term recovery
  Document ranking

Transformation-Based Tagging
  Learns a sequence of transformation rules of the form A -> B / C (change tag A to tag B in context C)
  Rules are chosen greedily, based on the number of tagging errors each corrects in the training data
  Applies the learned rules sequentially to tag new text
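The learn-then-apply loop can be sketched compactly. This is a simplified illustration rather than ABGene's or Brill's actual code: the `Rule` class, the helper names, the fixed candidate list, and the choice to test each context against the tags as they stood before the rule fired are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rule:
    from_tag: str  # A; "*" matches any tag
    to_tag: str    # B
    context: Callable[[Sequence[str], Sequence[str], int], bool]  # C


def apply_rule(rule, words, tags):
    """Return a new tag list with the rule applied at every matching position."""
    new_tags = list(tags)
    for i, tag in enumerate(tags):
        if rule.from_tag in ("*", tag) and rule.context(words, tags, i):
            new_tags[i] = rule.to_tag
    return new_tags


def count_errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))


def learn(words, initial_tags, gold, candidates, max_rules=10):
    """Greedily pick the rule that corrects the most errors, then repeat."""
    tags: List[str] = list(initial_tags)
    learned = []
    for _ in range(max_rules):
        best, best_gain = None, 0
        for rule in candidates:
            gain = count_errors(tags, gold) - count_errors(apply_rule(rule, words, tags), gold)
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:  # no candidate reduces the error count any further
            break
        learned.append(best)
        tags = apply_rule(best, words, tags)
    return learned


def tag_text(words, initial_tags, rules):
    """Apply the learned rules in the order they were learned."""
    tags = list(initial_tags)
    for rule in rules:
        tags = apply_rule(rule, words, tags)
    return tags


def prev1or2wd(word):
    """Context: one of the two preceding words equals `word`."""
    return lambda words, tags, i: word in words[max(0, i - 2):i]


def always(words, tags, i):
    return True
```

Given an over-general candidate such as "NNP -> GENE everywhere" and a contextual one such as "NNP -> GENE / prev1or2wd genes", the greedy scorer keeps the contextual rule whenever the general one introduces as many errors as it fixes.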

Gene Transformations
  GENE added as an additional POS tag
  Example rules:
    NNP -> GENE / gene fgoodleft
    * -> GENE / hassuf -A
    * -> GENE / haspref c-
    NNP -> GENE / prev1or2wd genes
    NNP -> GENE / nextbigram ( GENE
    VBG -> JJ / nexttag GENE

Results
  Precision up to 0.74
  Recall up to 0.64
  Both depend on the score threshold

Problems in Full Text
  Terms that do not appear in abstracts: restriction enzyme sites, lab protocol kits, primers, vectors, supply companies, chemical reagents
  Figures and tables

Summary
  Common thread in biomedical information extraction: normalization is hard!