Biomedical Information Extraction. Outline Intro to biomedical information extraction PASTA [Demetriou and Gaizauskas] Biomedical named entities Name.

1 Biomedical Information Extraction

2 Outline
Intro to biomedical information extraction
PASTA [Demetriou and Gaizauskas]
Biomedical named entities
Name variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter]
Name tagging [Tanabe and Wilbur]

3 PASTA [Demetriou and Gaizauskas]
Protein Active Site Template Acquisition

4 Extraction Tasks
Terminological tagging: “entities”
Template filling: “relationships”

5 Terminology Tagging
protein, species, residue, site, region, secondary structure, supersecondary structure, quaternary structure, base, atom, non-protein compound, interaction

6 Template Filling
residue := NAME:string SITE/FUN:string SEC_STRUCT:string QUAT_STRUCT:string REGION:string INTERACTION:string
in_protein := RESIDUE:residue PROTEIN:protein
protein := NAME:string
species := NAME:string
in_species := PROTEIN:protein SPECIES:species
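The template definitions on this slide map naturally onto record types. The following is a minimal sketch, assuming Python dataclasses as the representation; the class and field names mirror the slide, but the optional defaults and the example values (Ser195, chymotrypsin, Bos taurus) are illustrative assumptions, not PASTA output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Protein:
    name: str


@dataclass
class Species:
    name: str


@dataclass
class Residue:
    name: str
    # Assumption: slots other than NAME may be left unfilled.
    site_fun: Optional[str] = None
    sec_struct: Optional[str] = None
    quat_struct: Optional[str] = None
    region: Optional[str] = None
    interaction: Optional[str] = None


@dataclass
class InProtein:
    """Relation template: a residue located in a protein."""
    residue: Residue
    protein: Protein


@dataclass
class InSpecies:
    """Relation template: a protein found in a species."""
    protein: Protein
    species: Species


# Illustrative values only, chosen for the example.
chymotrypsin = Protein(name="chymotrypsin")
ser195 = Residue(name="Ser195", site_fun="active site")
in_protein = InProtein(residue=ser195, protein=chymotrypsin)
in_species = InSpecies(protein=chymotrypsin, species=Species(name="Bos taurus"))
```

Keeping the entity records (residue, protein, species) separate from the relation records (in_protein, in_species) follows the slide's split between terminology and relationships.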

7 PASTA Architecture
Text Preprocessing
  Title, author, abstract
  Tokenization, sentence boundaries

8 PASTA Architecture
Terminological Processing
  Morphological analysis: biochemical morphemes (“-ase”)
  Lexical lookup: token lookup in databases; token grammatical class tagging
  Terminology parsing: create multi-token terms; rule-based parsing using grammatical tags

9 PASTA Architecture
Syntactic and Semantic Processing
  Part-of-speech tags
  Phrase structure
  Compositional semantics
Discourse Processing
  Semantic representations incorporated into discourse model of concept hierarchy and inference rules

10 PASTA Architecture
Template Extraction
  Scan discourse model for template instances, check slots, build template

11 Performance (R = recall, P = precision)
              Dev       Inter-annotator   Test
Terminology   88R/94P   92R/86P           82R/84P
Template      69R/79P   78R/80P           69R/64P
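The slide reports recall and precision separately. To compare the columns with a single number, one option is the balanced F-measure (the harmonic mean of the two). The sketch below is an added convenience, not something the slide reports; the function name `f_measure` and the `SCORES` layout are assumptions, and the column labels follow a reading of the table as Dev / Inter-annotator / Test.

```python
def f_measure(recall: float, precision: float) -> float:
    """Balanced F-measure: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) pairs copied from the Performance slide.
SCORES = {
    "Terminology": {"Dev": (88, 94), "Inter-annotator": (92, 86), "Test": (82, 84)},
    "Template": {"Dev": (69, 79), "Inter-annotator": (78, 80), "Test": (69, 64)},
}

for task, columns in SCORES.items():
    for column, (recall, precision) in columns.items():
        print(f"{task:12s} {column:16s} F1 = {f_measure(recall, precision):.1f}")
```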

12 PASTAWeb
Index document -> terminology, template terms -> templates from multiple documents
IE tools need to be incorporated into effective interfaces for biology researchers

13 Indexing Problem
Variations in expression of same protein name

14 Contrast and Variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter]
Named entities: location vs. identification
Variability: somatotropin, rat somatotropin, growth hormone

15 Variability
Non-contrast (synonyms): tumor protein homolog vs tumour protein homologue
Contrast (diffonyms?): ACE1 vs ACE2

16 Transformations
1. Remove first character
2. Remove first word
3. Remove last character
4. Remove last word
5. Replace sequence of vowels with one letter
6. Replace hyphen with space
7. Remove parenthesized material
8. Convert to lowercase
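The eight transformations are simple string operations, so they can be sketched directly. This is a minimal illustration, assuming whitespace-delimited words, the five ASCII vowels, and an arbitrary placeholder letter `V` for a collapsed vowel sequence; the function names are hypothetical and the original study's exact definitions may differ.

```python
import re


def remove_first_char(name: str) -> str:
    return name[1:]


def remove_first_word(name: str) -> str:
    return " ".join(name.split()[1:])


def remove_last_char(name: str) -> str:
    return name[:-1]


def remove_last_word(name: str) -> str:
    return " ".join(name.split()[:-1])


def collapse_vowels(name: str) -> str:
    # Assumption: any run of vowels becomes the single placeholder letter "V".
    return re.sub(r"[aeiouAEIOU]+", "V", name)


def hyphen_to_space(name: str) -> str:
    return name.replace("-", " ")


def remove_parenthesized(name: str) -> str:
    # Drops "(...)" groups together with the whitespace before them.
    return re.sub(r"\s*\([^)]*\)", "", name).strip()


def to_lowercase(name: str) -> str:
    return name.lower()
```

For example, `remove_first_word("rat somatotropin")` yields `"somatotropin"`, and `remove_last_char` maps both `"ACE1"` and `"ACE2"` to `"ACE"`, which is exactly the kind of edit that erases a contrast.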

17 Experiment
Collect groups of synonym gene names
Get mouse, rat, and human genes from LocusLink
Group OFFICIAL GENE NAME, PREFERRED GENE NAME, OFFICIAL SYMBOL, PREFERRED SYMBOL, PRODUCT, PREFERRED PRODUCT, ALIAS SYMBOL, ALIAS PROT entries together as synonyms

18 Results
LMW, RMC, RMW identify contrastive variability
  Contrasts likely marked at name boundaries
VS, HYPH, CASE, PM identify non-contrastive variability

19 Pattern Heuristics
1. Equivalence of vowel sequences
2. Optionality of hyphens
3. Optionality of parenthesized material
4. Case insensitivity
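The four heuristics can be combined into a single normalization key, so that names differing only in non-contrastive ways index to the same entry. The sketch below is one possible reading, not the authors' implementation: the function name `normalize`, the placeholder letter `V`, the choice to treat a hyphen as a space, and the order of the steps are all assumptions.

```python
import re


def normalize(name: str) -> str:
    """Build a matching key that ignores non-contrastive variation."""
    name = re.sub(r"\([^)]*\)", " ", name)   # parenthesized material is optional
    name = name.replace("-", " ")            # hyphens are optional (treated as spaces)
    name = name.lower()                      # case insensitivity
    name = re.sub(r"[aeiou]+", "V", name)    # vowel sequences are equivalent
    return " ".join(name.split())            # collapse leftover whitespace


def same_entity(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)


print(same_entity("Tumour protein", "tumor protein"))
print(same_entity("growth-hormone (GH)", "growth hormone"))
print(same_entity("ACE1", "ACE2"))
```

Because none of the four heuristics touches the first or last character or word, contrasts marked at name boundaries such as ACE1 vs ACE2 survive normalization. Note that a trailing vowel sequence still counts, so under these assumptions "homolog" and "homologue" do not collapse to the same key.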

20 Tagging Genes and Proteins [Tanabe and Wilbur]
ABGene
  Trained on MEDLINE abstracts
  Tested on PUBMED full texts

21 ABGene
Transformation-based tagger
False-positive and false-negative filters
Compound term recovery
Document ranking

22 Transformation-Based Tagging
Learns a sequence of transformation rules of the form A -> B / C (change tag A to tag B in context C) greedily, based on the number of errors each rule corrects in the training data tags
Applies the rules sequentially to tag new text
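The greedy learning loop can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the tagger used by ABGene: rules are restricted to the hypothetical form (from_tag, to_tag, previous_word), candidates are supplied by hand rather than generated from templates, and the toy sentence and tags are made up.

```python
from typing import List, Tuple

Rule = Tuple[str, str, str]  # (from_tag, to_tag, required previous word)


def apply_rule(words: List[str], tags: List[str], rule: Rule) -> List[str]:
    from_tag, to_tag, prev_word = rule
    return [
        to_tag if tag == from_tag and i > 0 and words[i - 1] == prev_word else tag
        for i, tag in enumerate(tags)
    ]


def count_errors(tags: List[str], gold: List[str]) -> int:
    return sum(t != g for t, g in zip(tags, gold))


def learn(words: List[str], tags: List[str], gold: List[str],
          candidates: List[Rule]) -> List[Rule]:
    """Repeatedly pick the candidate rule that corrects the most errors."""
    learned: List[Rule] = []
    tags = list(tags)
    while True:
        best_rule, best_gain, best_tags = None, 0, tags
        for rule in candidates:
            new_tags = apply_rule(words, tags, rule)
            gain = count_errors(tags, gold) - count_errors(new_tags, gold)
            if gain > best_gain:
                best_rule, best_gain, best_tags = rule, gain, new_tags
        if best_rule is None:  # no rule reduces the error count any further
            return learned
        learned.append(best_rule)
        tags = best_tags


# Toy training data (made up for illustration).
words = ["the", "gene", "Sox2", "binds", "the", "gene", "Oct4"]
initial = ["DT", "NN", "NNP", "VBZ", "DT", "NN", "NNP"]
gold = ["DT", "NN", "GENE", "VBZ", "DT", "NN", "GENE"]
candidates = [("NNP", "GENE", "gene"), ("NN", "GENE", "the")]

rules = learn(words, initial, gold, candidates)
```

The loop terminates because every accepted rule strictly lowers the error count; the order in which rules are learned is also the order in which they are later applied.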

23 Gene Transformations
GENE added as additional POS tag
NNP -> GENE / gene fgoodleft
* -> GENE / hassuf –A
* -> GENE / haspref c-
NNP -> GENE / prev1or2wd genes
NNP -> GENE / nextbigram ( GENE
VBG -> JJ / nexttag GENE
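To make the rule notation concrete, here is a small sketch that applies two of the contextual rules above in order. It is an illustration only: the helper names `prev1or2wd` and `nexttag` borrow the slide's condition labels, their semantics here are an assumed reading, each rule is evaluated against the tags as they stood before that rule ran (one possible convention), and the three-token input is made up.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]                        # (word, tag)
Condition = Callable[[List[Token], int], bool]
Rule = Tuple[str, str, Condition]              # (from_tag or "*", to_tag, context)


def prev1or2wd(word: str) -> Condition:
    """True if one of the two preceding words equals `word`."""
    def cond(tokens: List[Token], i: int) -> bool:
        return any(tokens[j][0] == word for j in (i - 1, i - 2) if j >= 0)
    return cond


def nexttag(tag: str) -> Condition:
    """True if the following token carries `tag`."""
    def cond(tokens: List[Token], i: int) -> bool:
        return i + 1 < len(tokens) and tokens[i + 1][1] == tag
    return cond


RULES: List[Rule] = [
    ("NNP", "GENE", prev1or2wd("genes")),  # NNP -> GENE / prev1or2wd genes
    ("VBG", "JJ", nexttag("GENE")),        # VBG -> JJ / nexttag GENE
]


def apply_rules(tokens: List[Token], rules: List[Rule]) -> List[Token]:
    current = list(tokens)
    for from_tag, to_tag, cond in rules:
        snapshot = current  # conditions see the tags from before this rule
        current = [
            (word, to_tag) if from_tag in ("*", tag) and cond(snapshot, i) else (word, tag)
            for i, (word, tag) in enumerate(snapshot)
        ]
    return current


tagged = apply_rules([("genes", "NNS"), ("encoding", "VBG"), ("Sox2", "NNP")], RULES)
```

The second rule only fires because the first one has already introduced a GENE tag, which is why the order of the rule sequence matters.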

24 Results
Precision up to 0.74 and recall up to 0.64, depending on score threshold

25 Problems in Full Text
Terms that do not appear in abstracts: restriction enzyme site, lab protocol kits, primers, vectors, supply companies, chemical reagents
Figures and tables

26 Summary
Common thread in biomedical information extraction: normalization is hard!

