Identifying Abbreviation Definitions in Biomedical Text
Ariel Schwartz, Marti Hearst


The Problem
The volume of biomedical text is growing rapidly. New abbreviations are introduced frequently. Manually compiled abbreviation dictionaries are out of date.
The goal is a simple, fast and accurate algorithm to identify abbreviations and their definitions in biomedical text.
We are interested in this algorithm as one of many preprocessing steps we apply to biomedical texts in order to extract meaningful information from them.

Abbreviation Examples
“Heat-shock protein 40 (Hsp40) enables Hsp70 to play critical roles in a number of cellular processes, such as protein folding, assembly, degradation and translocation in vivo.”
“Glutathione S-transferase pull-down experiments showed the direct interaction of in vitro translated p110, p64, and p58 of the essential CBF3 kinetochore protein complex with Cbf1p, a basic region helix-loop-helix zipper protein (bHLHzip) that specifically binds to the CDEI region on the centromere DNA.”
“Hpa2 is a member of the Gcn5-related N-acetyltransferase (GNAT) superfamily, a family of enzymes with diverse substrates including histones, other proteins, arylalkylamines and aminoglycosides.”

Related Work
Pustejovsky et al. present a solution based on hand-built regular expressions and syntactic information. It achieved 72% recall at 98% precision.
Chang et al. use linear regression on a pre-selected set of features. It achieved 83% recall at 80% precision*, and 75% recall at 95% precision.
Park and Byrd present a rule-based algorithm for extracting abbreviation definitions from general text.
Yoshida et al. present an approach close to ours, which first tries to match characters on word and syllable boundaries.
* Counting partial matches and abbreviations missing from the “gold-standard”, their algorithm achieved 83% recall at 98% precision.

The Algorithm
Much simpler than other approaches.
Extracts abbreviation-definition candidates adjacent to parentheses.
Finds correct definitions by matching characters in the abbreviation to characters in the definition, starting from the right.
The first character in the abbreviation must match a character at the beginning of a word in the definition.
To increase precision, a few simple heuristics are applied to eliminate incorrect pairs.
Example: Heat shock transcription factor (HSF). The algorithm finds the correct definition, but not the correct alignment: scanning from the right, it matches the S to the “s” in “transcription” rather than to the first letter of “shock”.
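
The right-to-left matching step described on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `find_definition` is hypothetical, the candidate text is assumed to be whatever precedes the parenthesis, and the precision heuristics mentioned on the slide are omitted.

```python
from typing import Optional


def find_definition(short_form: str, candidate: str) -> Optional[str]:
    """Return the shortest suffix of `candidate` that matches `short_form`.

    Hypothetical helper: scans both strings from the right, as the slide
    describes. Returns None when no match exists.
    """
    pos = len(candidate) - 1  # current position in the candidate text
    for s in range(len(short_form) - 1, -1, -1):
        ch = short_form[s].lower()
        if not ch.isalnum():
            continue  # skip punctuation inside the abbreviation
        # Move left until the character matches. For the first character
        # of the abbreviation, additionally require a word-initial match.
        while pos >= 0 and (
            candidate[pos].lower() != ch
            or (s == 0 and pos > 0 and candidate[pos - 1].isalnum())
        ):
            pos -= 1
        if pos < 0:
            return None  # ran out of candidate text
        pos -= 1
    # Start the definition at the beginning of the word just matched.
    start = candidate.rfind(" ", 0, pos + 1) + 1
    return candidate[start:]


if __name__ == "__main__":
    print(find_definition("HSF", "Heat shock transcription factor"))
    print(find_definition(
        "GNAT", "Hpa2 is a member of the Gcn5-related N-acetyltransferase"))
```

On the HSF example the sketch returns the full phrase, although internally the S is aligned with the “s” in “transcription”, which is the alignment error the slide points out. Only the definition boundary matters for the extracted pair.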

Results
On the “gold-standard”, the algorithm achieved 83% recall at 96% precision.*
On a larger test collection, the results were 90% recall at 95% precision.
An alternative algorithm, based on a modification of the Park and Byrd algorithm using decision lists, achieved only slightly better results: 83% recall at 97% precision, and 90% recall at 96% precision.
These results show that a very simple algorithm produces results comparable to those of the existing, more complex algorithms.
* Counting partial matches and abbreviations missing from the “gold-standard”, our algorithm achieved 83% recall at 99% precision.