Biomedical Information Extraction

Outline
  Intro to biomedical information extraction
  PASTA [Demetriou and Gaizauskas]
  Biomedical named entities
  Name variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter]
  Name tagging [Tanabe and Wilbur]

PASTA [Demetriou and Gaizauskas]
  Protein Active Site Template Acquisition

Extraction Tasks
  Terminological tagging: “entities”
  Template filling: “relationships”

Terminology Tagging
  protein, species, residue, site, region, secondary structure, supersecondary structure, quaternary structure, base, atom, non-protein compound, interaction

Template Filling
  residue := NAME:string SITE/FUN:string SEC_STRUCT:string QUAT_STRUCT:string REGION:string INTERACTION:string
  in_protein := RESIDUE:residue PROTEIN:protein
  protein := NAME:string
  species := NAME:string
  in_species := PROTEIN:protein SPECIES:species
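The template definitions above map naturally onto record types. The following is a minimal sketch in Python, assuming that every slot other than NAME may be left unfilled; the class names, the optional defaults, and the example values are illustrative and do not come from PASTA itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Protein:
    name: str


@dataclass
class Species:
    name: str


@dataclass
class Residue:
    name: str
    # Assumption: slots other than NAME may stay empty when the text gives no value.
    site_fun: Optional[str] = None
    sec_struct: Optional[str] = None
    quat_struct: Optional[str] = None
    region: Optional[str] = None
    interaction: Optional[str] = None


@dataclass
class InProtein:
    """Relation template: a residue located in a protein."""
    residue: Residue
    protein: Protein


@dataclass
class InSpecies:
    """Relation template: a protein found in a species."""
    protein: Protein
    species: Species


# Hypothetical filled templates; the values are made up for illustration.
residue = Residue(name="SER-87", site_fun="active site")
protein = Protein(name="example hydrolase")
species = Species(name="example species")
in_protein = InProtein(residue=residue, protein=protein)
in_species = InSpecies(protein=protein, species=species)
```

Keeping the relation templates (in_protein, in_species) separate from the entity templates mirrors the slide's split between entities and relationships.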

PASTA Architecture: Text Preprocessing
  Title, author, abstract
  Tokenization, sentence boundaries

PASTA Architecture: Terminological Processing
  Morphological analysis: biochemical morphemes such as “-ase”
  Lexical lookup: token lookup in databases; token grammatical class tagging
  Terminology parsing: create multi-token terms by rule-based parsing using grammatical tags
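The first two terminological steps can be sketched as a lexicon lookup with a suffix fallback. This is only an illustration: the lexicon entries, the mapping of "-ase" to the protein class, and the minimum-length guard are assumptions, and a bare suffix test will still mislabel ordinary words such as "disease".

```python
import re

# Hypothetical lexicon standing in for the database lookups mentioned on the slide.
LEXICON = {
    "lysozyme": "protein",
    "serine": "residue",
    "helix": "secondary_structure",
}

# "-ase" is the morpheme named on the slide; mapping it to "protein" is an assumption.
SUFFIX_CLASSES = {"ase": "protein"}


def classify_token(token):
    """Return a terminology class for one token, or None if nothing matches."""
    lower = token.lower()
    if lower in LEXICON:  # lexical lookup comes first
        return LEXICON[lower]
    for suffix, term_class in SUFFIX_CLASSES.items():
        # Crude guard against very short words such as "case" or "base".
        if lower.endswith(suffix) and len(lower) > len(suffix) + 2:
            return term_class
    return None


def tag_tokens(text):
    """Tokenize on simple word characters and tag each token."""
    tokens = re.findall(r"[A-Za-z0-9-]+", text)
    return [(token, classify_token(token)) for token in tokens]
```

For example, `tag_tokens("The kinase binds serine")` labels "kinase" through the suffix rule and "serine" through the lexicon. Terminology parsing would then combine adjacent tagged tokens into multi-token terms, which this sketch leaves out.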

PASTA Architecture
  Syntactic and Semantic Processing: part-of-speech tags, phrase structure, compositional semantics
  Discourse Processing: semantic representations incorporated into a discourse model of concept hierarchy and inference rules

PASTA Architecture: Template Extraction
  Scan discourse model for template instances, check slots, build template

Performance
                Dev        Inter-annotator   Test
  Terminology   88R/94P    92R/86P           82R/84P
  Template      69R/79P    78R/80P           69R/64P
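The R and P figures are recall and precision. As a reminder of how such scores are computed, here is a minimal sketch that compares a set of predicted annotations against a gold standard under exact matching; the annotation tuples are invented for illustration and are not PASTA data, and the actual evaluation may have used a different matching scheme.

```python
def precision_recall(gold, predicted):
    """Precision = correct / predicted, recall = correct / gold (exact match)."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations as (start, end, class) tuples.
gold = {(0, 2, "protein"), (5, 6, "residue"), (9, 10, "species"), (12, 13, "site")}
predicted = {
    (0, 2, "protein"),
    (5, 6, "residue"),
    (9, 10, "species"),
    (15, 16, "atom"),
    (20, 21, "base"),
}
precision, recall = precision_recall(gold, predicted)
```

Here three of five predictions are correct and three of four gold annotations are found.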

PASTAWeb
  Index: document -> terminology and templates; terms -> templates from multiple documents
  IE tools need to be incorporated into effective interfaces for biology researchers
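The two index directions can be pictured as a pair of dictionaries. The sketch below is an assumption about the general shape only, not a description of how PASTAWeb is built; templates are simplified to flat slot-to-string dictionaries, and the document ids and slot values are hypothetical.

```python
from collections import defaultdict


def build_indexes(doc_templates):
    """Build document -> (terms, templates) and term -> [(doc_id, template)] indexes."""
    by_document = {}
    by_term = defaultdict(list)
    for doc_id, templates in doc_templates.items():
        terms = set()
        for template in templates:
            # Index each distinct slot value once per template.
            for value in set(template.values()):
                terms.add(value)
                by_term[value].append((doc_id, template))
        by_document[doc_id] = {"terms": sorted(terms), "templates": templates}
    return by_document, dict(by_term)


# Hypothetical extraction output for two documents.
docs = {
    "doc1": [{"RESIDUE": "SER-87", "PROTEIN": "lysozyme"}],
    "doc2": [{"RESIDUE": "HIS-15", "PROTEIN": "lysozyme"}],
}
by_document, by_term = build_indexes(docs)
```

Looking up `by_term["lysozyme"]` then returns templates from both documents, which is the cross-document view the slide describes.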

Indexing Problem
  Variations in the expression of the same protein name

Contrast and Variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter]
  Named entities: location vs. identification
  Variability: somatotropin, rat somatotropin, growth hormone

Variability
  Non-contrast (synonyms): tumor protein homolog vs. tumour protein homologue
  Contrast (diffonyms?): ACE1 vs. ACE2

Transformations
  1. Remove first character
  2. Remove first word
  3. Remove last character
  4. Remove last word
  5. Replace sequence of vowels with one letter
  6. Replace hyphen with space
  7. Remove parenthesized material
  8. Convert to lowercase
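Each of the eight transformations is a small string function. The sketch below is one possible reading, not the authors' code: words are split on whitespace, and the single letter that replaces a vowel sequence is arbitrarily chosen to be "a", since the slide does not say which letter is used.

```python
import re


def remove_first_char(name):
    return name[1:]


def remove_first_word(name):
    return " ".join(name.split()[1:])


def remove_last_char(name):
    return name[:-1]


def remove_last_word(name):
    return " ".join(name.split()[:-1])


def collapse_vowels(name):
    # Assumption: any run of vowels becomes the single letter "a".
    return re.sub(r"[aeiouAEIOU]+", "a", name)


def hyphen_to_space(name):
    return name.replace("-", " ")


def remove_parenthesized(name):
    return re.sub(r"\s*\([^)]*\)", "", name).strip()


def lowercase(name):
    return name.lower()
```

Using the slide examples, `remove_first_word("rat somatotropin")` gives "somatotropin", and `remove_last_char` maps both ACE1 and ACE2 to "ACE", which shows why a change at the name boundary can erase a real contrast.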

Experiment
  Collect groups of synonymous gene names
  Get mouse, rat, and human genes from LocusLink
  Group OFFICIAL GENE NAME, PREFERRED GENE NAME, OFFICIAL SYMBOL, PREFERRED SYMBOL, PRODUCT, PREFERRED PRODUCT, ALIAS SYMBOL, ALIAS PROT entries together as synonyms

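Results
  LMW, RMC, RMW (removing the first word, last character, or last word) identify contrastive variability
    Contrasts are likely marked at name boundaries
  VS, HYPH, CASE, PM (vowel sequences, hyphens, case, parenthesized material) identify non-contrastive variability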

Pattern Heuristics
  1. Equivalence of vowel sequences
  2. Optionality of hyphens
  3. Optionality of parenthesized material
  4. Case insensitivity
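The four heuristics can be combined into a single normalization key, so that two names count as variants of each other when their keys match. This is a sketch under assumptions: hyphens are treated as spaces (following the "replace hyphen with space" transformation), every vowel run is mapped to "a", and the function name `normalize` is made up here.

```python
import re


def normalize(name):
    """Map a gene or protein name to a key that ignores non-contrastive variation."""
    name = name.lower()                     # 4. case insensitivity
    name = re.sub(r"\([^)]*\)", " ", name)  # 3. parenthesized material is optional
    name = name.replace("-", " ")           # 2. hyphens are optional (treated as spaces)
    name = re.sub(r"[aeiou]+", "a", name)   # 1. all vowel sequences are equivalent
    return " ".join(name.split())           # tidy whitespace left by the steps above


def same_name(a, b):
    return normalize(a) == normalize(b)
```

With this key, "Tumour protein" and "tumor protein" match, while ACE1 and ACE2 stay distinct because the contrast sits in the final character, which none of the four heuristics touches.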

Tagging Genes and Proteins [Tanabe and Wilbur]
  ABGene
  Trained on MEDLINE abstracts
  Tested on PUBMED full texts

ABGene
  Transformation-based tagger
  False-positive and false-negative filters
  Compound term recovery
  Document ranking

Transformation-Based Tagging
  Learns a sequence of transformation rules of the form A -> B / C (change tag A to tag B in context C) greedily, based on the number of errors each rule corrects in the training data tags
  Applies the rules sequentially to tag new text
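The learn-then-apply loop can be sketched in a few dozen lines. This is a simplified illustration, not ABGene: candidate rules are supplied as a fixed list instead of being generated from templates, each rule's context is checked against the tags as they stood before that rule's pass, and the example sentence, tags, and helper names (`prev1or2wd`, `prevtag`) are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    from_tag: str  # A ("*" matches any tag)
    to_tag: str    # B
    context: Callable[[List[str], List[str], int], bool]  # C


def apply_rule(rule, words, tags):
    """Return a new tag list with the rule applied at every matching position."""
    new_tags = list(tags)
    for i, tag in enumerate(tags):
        if rule.from_tag in ("*", tag) and rule.context(words, tags, i):
            new_tags[i] = rule.to_tag
    return new_tags


def count_errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(words, initial_tags, gold_tags, candidates):
    """Greedily pick the rule that corrects the most errors until none helps."""
    tags, learned = list(initial_tags), []
    while True:
        base = count_errors(tags, gold_tags)
        best, best_gain = None, 0
        for rule in candidates:
            gain = base - count_errors(apply_rule(rule, words, tags), gold_tags)
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:
            return learned
        tags = apply_rule(best, words, tags)
        learned.append(best)


def tag(words, initial_tags, rules):
    """Apply the learned rules in order to newly tagged text."""
    tags = list(initial_tags)
    for rule in rules:
        tags = apply_rule(rule, words, tags)
    return tags


def prev1or2wd(word):
    return lambda words, tags, i: word in words[max(0, i - 2):i]


def prevtag(wanted):
    return lambda words, tags, i: i > 0 and tags[i - 1] == wanted


# Hypothetical training sentence with initial POS tags and gold tags.
words = ["mutations", "in", "genes", "BRCA1", "and", "BRCA2", "were", "found"]
initial = ["NNS", "IN", "NNS", "NNP", "CC", "NNP", "VBD", "VBN"]
gold = ["NNS", "IN", "NNS", "GENE", "CC", "GENE", "VBD", "VBN"]
candidates = [
    Rule("NNP", "GENE", prev1or2wd("genes")),
    Rule("NNP", "GENE", prevtag("CC")),
    Rule("NNS", "GENE", lambda words, tags, i: True),  # harmful, never selected
]
rules = learn_rules(words, initial, gold, candidates)
```

On this toy data the learner keeps the two helpful rules, in the order they were found, and discards the harmful one. Each accepted rule removes at least one error, so the loop always terminates.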

Gene Transformations
  GENE added as an additional POS tag
  NNP -> GENE / gene fgoodleft
  * -> GENE / hassuf -A
  * -> GENE / haspref c-
  NNP -> GENE / prev1or2wd genes
  NNP -> GENE / nextbigram ( GENE
  VBG -> JJ / nexttag GENE

Results
  Precision up to 0.74 and recall up to 0.64, depending on the score threshold

Problems in Full Text
  Terms that do not appear in abstracts: restriction enzyme sites, lab protocol kits, primers, vectors, supply companies, chemical reagents
  Figures and tables

Summary
  Common thread in biomedical information extraction: normalization is hard!
