1
Biomedical Information Extraction
2
Outline
Intro to biomedical information extraction
PASTA [Demetriou and Gaizauskas]
Biomedical named entities
  Name variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter]
  Name tagging [Tanabe and Wilbur]
3
PASTA [Demetriou and Gaizauskas]
Protein Active Site Template Acquisition
4
Extraction Tasks
Terminological Tagging: “entities”
Template Filling: “relationships”
5
Terminology Tagging
Term classes: protein, species, residue, site, region, secondary structure, supersecondary structure, quaternary structure, base, atom, non-protein compound, interaction
6
Template Filling
residue := NAME:string SITE/FUN:string SEC_STRUCT:string QUAT_STRUCT:string REGION:string INTERACTION:string
in_protein := RESIDUE:residue PROTEIN:protein
protein := NAME:string
species := NAME:string
in_species := PROTEIN:protein SPECIES:species
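The template definitions above map naturally onto record types. The following is a minimal sketch in Python, not PASTA's own implementation: the class and field names are lowercased versions of the slot names on the slide, the choice to make every slot except NAME optional is an assumption, and the example values are illustrative rather than taken from the PASTA corpus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Residue:
    """residue template: only NAME is assumed mandatory in this sketch."""
    name: str
    site_fun: Optional[str] = None      # SITE/FUN slot
    sec_struct: Optional[str] = None    # SEC_STRUCT slot
    quat_struct: Optional[str] = None   # QUAT_STRUCT slot
    region: Optional[str] = None        # REGION slot
    interaction: Optional[str] = None   # INTERACTION slot


@dataclass
class Protein:
    name: str


@dataclass
class Species:
    name: str


@dataclass
class InProtein:
    """in_protein relation: links a residue to the protein it occurs in."""
    residue: Residue
    protein: Protein


@dataclass
class InSpecies:
    """in_species relation: links a protein to its source species."""
    protein: Protein
    species: Species


# Illustrative values only (assumed for the example, not from the slides).
residue = Residue(name="Ser-195", site_fun="active site")
protein = Protein(name="chymotrypsin")
species = Species(name="bovine")
relations = [InProtein(residue, protein), InSpecies(protein, species)]
```

Keeping the relations (in_protein, in_species) separate from the entity templates mirrors the slide's split between "entities" and "relationships".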
7
PASTA Architecture
Text Preprocessing
  Title, author, abstract
  Tokenization, sentence boundaries
8
PASTA Architecture
Terminological Processing
  Morphological analysis: biochemical morphemes, e.g. “-ase”
  Lexical lookup: token lookup in databases, token grammatical class tagging
  Terminology parsing: create multi-token terms by rule-based parsing over the grammatical tags
9
PASTA Architecture
Syntactic and Semantic Processing
  Part-of-speech tags
  Phrase structure
  Compositional semantics
Discourse Processing
  Semantic representations are incorporated into a discourse model made up of a concept hierarchy and inference rules
10
PASTA Architecture
Template Extraction
  Scan the discourse model for template instances, check slots, build the template
11
Performance (R = recall, P = precision)
              Dev       Inter-annotator   Test
Terminology   88R/94P   92R/86P           82R/84P
Template      69R/79P   78R/80P           69R/64P
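The slide reports recall and precision separately. To compare columns with a single number, one common option is the F-measure (the harmonic mean of the two). The sketch below computes it from the figures on the slide; the F-measure itself is not reported in the slides, and the function and variable names are hypothetical.

```python
def f_measure(recall, precision, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (recall, precision) pairs copied from the Performance slide.
SCORES = {
    "Terminology": {"Dev": (88, 94), "Inter-annotator": (92, 86), "Test": (82, 84)},
    "Template": {"Dev": (69, 79), "Inter-annotator": (78, 80), "Test": (69, 64)},
}


def f1_table(scores):
    """Return the same table with each (R, P) pair replaced by its F1."""
    return {
        task: {col: round(f_measure(r, p), 1) for col, (r, p) in cols.items()}
        for task, cols in scores.items()
    }


if __name__ == "__main__":
    for task, cols in f1_table(SCORES).items():
        print(task, cols)
```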
12
PASTAWeb
Index: document -> terminology, template
Index: terms -> templates from multiple documents
IE tools need to be incorporated into effective interfaces for biology researchers
13
Indexing Problem
Variations in expression of the same protein name
14
Contrast and Variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter]
Named entities: location vs. identification
Variability: somatotropin, rat somatotropin, growth hormone
15
Variability
Non-contrast (synonyms): tumor protein homolog vs. tumour protein homologue
Contrast (diffonyms?): ACE1 vs. ACE2
16
Transformations
1. Remove first character
2. Remove first word
3. Remove last character
4. Remove last word
5. Replace sequence of vowels with one letter
6. Replace hyphen with space
7. Remove parenthesized material
8. Convert to lowercase
17
Experiment
Collect groups of synonymous gene names
Get mouse, rat, and human genes from LocusLink
Group OFFICIAL GENE NAME, PREFERRED GENE NAME, OFFICIAL SYMBOL, PREFERRED SYMBOL, PRODUCT, PREFERRED PRODUCT, ALIAS SYMBOL, and ALIAS PROT entries together as synonyms
18
Results
LMW, RMC, RMW identify contrastive variability
Contrasts are likely marked at name boundaries
VS, HYPH, CASE, PM identify non-contrastive variability
19
Pattern Heuristics
1. Equivalence of vowel sequences
2. Optionality of hyphens
3. Optionality of parenthesized material
4. Case insensitivity
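One way to use the four heuristics above for the indexing problem is to fold them into a single matching key and group name variants that share a key. This is a sketch under assumptions, not the authors' procedure: `match_key` and `group_variants` are hypothetical names, vowel runs are collapsed to the letter "a", and hyphens are treated as spaces.

```python
import re
from collections import defaultdict


def match_key(name):
    """Canonical key that ignores the four non-contrastive differences."""
    name = re.sub(r"\([^)]*\)", " ", name)   # 3. parenthesized material is optional
    name = name.replace("-", " ")            # 2. hyphens are optional
    name = name.lower()                      # 4. case is ignored
    name = re.sub(r"[aeiou]+", "a", name)    # 1. vowel sequences are equivalent
    return " ".join(name.split())            # normalize leftover whitespace


def group_variants(names):
    """Map each key to the list of surface forms that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[match_key(name)].append(name)
    return dict(groups)
```

With this key, "Tumour-protein" and "tumor protein" fall into one group, while "ACE1" and "ACE2" stay apart because the digits at the name boundary are untouched. Note that this sketch does not merge "homolog" with "homologue": the final "ue" adds a vowel sequence rather than varying an existing one.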
20
Tagging Genes and Proteins [Tanabe and Wilbur]
ABGene
Trained on MEDLINE abstracts
Tested on PUBMED full texts
21
ABGene
Transformation-based tagger
False-positive and false-negative filters
Compound term recovery
Document ranking
22
Transformation-Based Tagging
Learns a sequence of transformation rules of the form A -> B / C (change tag A to tag B in context C)
Rules are chosen greedily, based on the number of errors corrected in the training data tags
Applies the rules sequentially to tag new text
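The two phases described above (greedy rule learning, then sequential application) can be sketched as follows. This is a toy illustration, not ABGene or Brill's tagger: the `Rule` class, the helper names, and the example sentence are all hypothetical, candidate rules are supplied by hand rather than generated from templates, and a rule's changes take effect immediately as it sweeps left to right.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, str]  # (word, tag)


@dataclass
class Rule:
    name: str
    from_tag: Optional[str]                       # None plays the role of "*"
    to_tag: str
    context: Callable[[List[Token], int], bool]   # the "/ C" part of A -> B / C


def prev1or2wd(word):
    """Context: one of the two preceding words equals `word`."""
    return lambda toks, i: any(toks[j][0] == word for j in (i - 1, i - 2) if j >= 0)


def nexttag(tag):
    """Context: the following token carries `tag`."""
    return lambda toks, i: i + 1 < len(toks) and toks[i + 1][1] == tag


def apply_rules(tokens, rules):
    """Apply each rule in order, sweeping the sentence left to right."""
    tokens = list(tokens)
    for rule in rules:
        for i, (word, tag) in enumerate(tokens):
            if rule.from_tag in (None, tag) and rule.context(tokens, i):
                tokens[i] = (word, rule.to_tag)
    return tokens


def count_errors(tokens, gold_tags):
    return sum(tag != gold for (_, tag), gold in zip(tokens, gold_tags))


def learn(tokens, gold_tags, candidates):
    """Greedily pick the rule that corrects the most errors, until none helps."""
    learned, current = [], list(tokens)
    while True:
        best, best_gain, best_out = None, 0, None
        for rule in candidates:
            out = apply_rules(current, [rule])
            gain = count_errors(current, gold_tags) - count_errors(out, gold_tags)
            if gain > best_gain:
                best, best_gain, best_out = rule, gain, out
        if best is None:
            return learned
        learned.append(best)
        current = best_out


# Hypothetical sentence; "Foo1" is a made-up gene name.
SENTENCE = [("the", "DT"), ("genes", "NNS"), ("encoding", "VBG"), ("Foo1", "NNP")]
GOLD = ["DT", "NNS", "JJ", "GENE"]
GENE_RULE = Rule("NNP -> GENE / prev1or2wd genes", "NNP", "GENE", prev1or2wd("genes"))
JJ_RULE = Rule("VBG -> JJ / nexttag GENE", "VBG", "JJ", nexttag("GENE"))
```

Here `learn(SENTENCE, GOLD, [JJ_RULE, GENE_RULE])` selects the GENE rule first, because the JJ rule cannot fire until a GENE tag exists. This is why the order of the learned sequence matters when the rules are applied to new text.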
23
Gene Transformations
GENE added as an additional POS tag
NNP -> GENE / gene fgoodleft
* -> GENE / hassuf -A
* -> GENE / haspref c-
NNP -> GENE / prev1or2wd genes
NNP -> GENE / nextbigram ( GENE
VBG -> JJ / nexttag GENE
24
Results
Precision up to 0.74
Recall up to 0.64
Both depend on the score threshold
25
Problems in Full Text
Terms that do not appear in abstracts: restriction enzyme sites, lab protocol kits, primers, vectors, supply companies, chemical reagents
Figures and tables
26
Summary
Common thread in biomedical information extraction: normalization is hard!