BioNLP-related talks and demos at ACL and CoNLL ’05. Presented by Beatrice Alex, BioNLP meeting, 11th of July 2005
Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction (R. McDonald, F. Pereira, S. Kulick, S. Winters, Y. Jin, P. White) Complex relations: “John Smith is the CEO at Inc. Corp.” -> (John Smith, CEO, Inc. Corp.); “John Smith goes to his office at Inc. Corp.” -> (John Smith, , Inc. Corp.), with the middle argument left empty.
Complex relation extraction: 1. Recognition of pairs of entity mentions (binary relations are edges in a graph and named entities are nodes). Create the set of positive (valid) and negative (invalid) relations using a standard maxent classifier (Berger et al. ’96, McCallum ’02). 2. Reconstruction of complex relations by making tuples from maximal cliques in the graph.
Complex relation reconstruction methods: 1. Maximal cliques (MC): Consider all cliques in the graph that are consistent with the definition of the relation. For overlapping cliques, return only the maximal cliques (those that are not a subset of another clique). A branch and bound algorithm (Bron and Kerbosch ’73) finds all maximal cliques very efficiently.
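The maximal clique step can be sketched as follows. This is a minimal illustration of the basic Bron-Kerbosch recursion (without pivoting), not the authors' implementation. The helper names `build_graph` and `maximal_cliques` and the entity pairs (including "Mary Jones") are made up for the example, and the check that a clique is consistent with the relation definition is left out.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from positively classified entity pairs."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def maximal_cliques(graph: Dict[str, Set[str]]) -> List[FrozenSet[str]]:
    """Enumerate all maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    found: List[FrozenSet[str]] = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            if r:  # r cannot be extended any further, so it is maximal
                found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}  # v has been fully explored
            x = x | {v}  # exclude v from later cliques in this branch

    expand(set(), set(graph), set())
    return found


# Hypothetical output of the binary classifier (positive pairs only).
edges = [
    ("John Smith", "CEO"),
    ("CEO", "Inc. Corp."),
    ("John Smith", "Inc. Corp."),
    ("Mary Jones", "Inc. Corp."),
]
cliques = maximal_cliques(build_graph(edges))
```

Each returned clique is one candidate tuple; the pair (John Smith, CEO) is not returned on its own because it is a subset of the three-entity clique.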
2. Probabilistic cliques (PC): Assign a weight to each binary relation (taken from the classifier). The weight of a clique, w(C), is the mean weight of the edges in the clique. A clique is valid if w(C) ≥ 0.5.
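A small sketch of the clique weighting, under two assumptions that the slide does not settle: "mean" is taken to be the arithmetic mean, and the validity test is taken to be w(C) ≥ 0.5. The function names and the edge weights are hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, FrozenSet, Iterable


def clique_weight(clique: Iterable[str],
                  edge_weight: Dict[FrozenSet[str], float]) -> float:
    """Mean classifier weight over all edges (entity pairs) in the clique.

    Assumption: arithmetic mean; the slide only says "mean".
    """
    return mean(edge_weight[frozenset(pair)]
                for pair in combinations(sorted(clique), 2))


def is_valid(clique: Iterable[str],
             edge_weight: Dict[FrozenSet[str], float],
             threshold: float = 0.5) -> bool:
    """Assumption: a clique is kept when its weight is at least the threshold."""
    return clique_weight(clique, edge_weight) >= threshold


# Hypothetical classifier scores for each entity pair.
weights = {
    frozenset({"John Smith", "CEO"}): 0.9,
    frozenset({"CEO", "Inc. Corp."}): 0.8,
    frozenset({"John Smith", "Inc. Corp."}): 0.4,
}
clique = {"John Smith", "CEO", "Inc. Corp."}
w = clique_weight(clique, weights)
keep = is_valid(clique, weights)
```

With these numbers the clique is kept although one of its edges scores below 0.5 on its own, which is what distinguishes PC from MC.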
Extraction of genomic variation events from biomedical text as tuples (variation type, location, initial state, altered state). Example: “At codons 12 and 16, the occurrence of point mutations from G/A to T/G were observed.” -> (point mutation, codon 12, G, T) and (point mutation, codon 16, A, G)
Data: 447 Medline abstracts with 4691 sentences, 4773 entities and 1218 relations (38% not binary). Gold standard named entities are used (56% of entity pairs are not related).
Results: MC and PC are significantly faster and more accurate than NE (naïve enumeration). Table of precision, recall and F-score for the binary classifier, NE, MC and PC (values not preserved).
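For reference, precision, recall and F-score over extracted tuples can be computed as in the sketch below. It assumes exact-match scoring of whole tuples, which may differ from the evaluation in the paper; the function name and the predicted tuples are invented for illustration.

```python
from typing import Iterable, Tuple


def precision_recall_f(predicted: Iterable[tuple],
                       gold: Iterable[tuple]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and balanced F-score over relation tuples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


gold = {
    ("point mutation", "codon 12", "G", "T"),
    ("point mutation", "codon 16", "A", "G"),
}
# Hypothetical system output: one correct tuple, one with the wrong states.
predicted = {
    ("point mutation", "codon 12", "G", "T"),
    ("point mutation", "codon 16", "G", "T"),
}
scores = precision_recall_f(predicted, gold)
```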
Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing (P. Nakov and M. Hearst) Unsupervised method for noun compound bracketing: [[liver cell] antibody] vs. [liver [cell line]]. Use of bigram estimates with the χ² measure. Use of surface features for querying web search engines. Experiments with paraphrases. Evaluation on encyclopaedia and bioscience text.
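One way bigram estimates and a χ² measure could be combined for bracketing is sketched here. It uses the standard χ² statistic for a 2x2 contingency table and a dependency-style comparison of (w1, w2) against (w1, w3). All counts are invented, and the function names and the tie-breaking towards left bracketing are assumptions of this sketch.

```python
from typing import Dict, Tuple


def chi_square(n12: int, n1: int, n2: int, total: int) -> float:
    """Chi-square association of two words from a 2x2 contingency table.

    n12: count of the bigram, n1 / n2: counts of each word, total: all bigrams.
    """
    a = n12                      # both words
    b = n1 - n12                 # first word without the second
    c = n2 - n12                 # second word without the first
    d = total - n1 - n2 + n12    # neither word
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return total * (a * d - b * c) ** 2 / denom


def bracket_dependency(w1: str, w2: str, w3: str,
                       unigram: Dict[str, int],
                       bigram: Dict[Tuple[str, str], int],
                       total: int) -> str:
    """Left bracketing if w1 associates more strongly with w2 than with w3."""
    left = chi_square(bigram.get((w1, w2), 0), unigram[w1], unigram[w2], total)
    right = chi_square(bigram.get((w1, w3), 0), unigram[w1], unigram[w3], total)
    return "left" if left >= right else "right"  # ties go to left (assumption)


# Invented counts, standing in for search engine page hits.
unigram = {"liver": 1000, "cell": 5000, "antibody": 2000}
bigram = {("liver", "cell"): 300, ("liver", "antibody"): 5}
decision = bracket_dependency("liver", "cell", "antibody",
                              unigram, bigram, total=1_000_000)
```

With these counts "liver" is far more strongly associated with "cell" than with "antibody", so the sketch yields [[liver cell] antibody].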
Web-driven surface features: Dash: cell-cycle analysis, donor T-cell; Possessive marker: brain’s stem cell, brain stem’s cells; Internal capitalisation: Plasmodium vivax Malaria, brain Stem cells; Embedded slashes: leukaemia/lymphoma cell; Brackets: growth factor (beta), (brain) stem cells. Surface features were collected using regular expressions over the summaries of documents returned for exact NC queries.
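A rough sketch of collecting some of these surface features with regular expressions. Only the dash, possessive and bracket cues are covered. The bracketing direction assigned to each pattern is this sketch's reading of the example pairs above, and the function name and the snippets are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable


def surface_votes(w1: str, w2: str, w3: str, snippets: Iterable[str]) -> Counter:
    """Count left/right bracketing cues for the compound "w1 w2 w3"."""
    a, b, c = (re.escape(w) for w in (w1, w2, w3))
    patterns = {
        "left": [
            rf"\b{a}-{b}\s+{c}\b",          # dash: cell-cycle analysis
            rf"\b{a}\s+{b}['’]s\s+{c}\b",   # possessive: brain stem's cells
            rf"\b{a}\s+{b}\s+\({c}\)",      # brackets: growth factor (beta)
        ],
        "right": [
            rf"\b{a}\s+{b}-{c}\b",          # dash: donor T-cell
            rf"\b{a}['’]s\s+{b}\s+{c}\b",   # possessive: brain's stem cells
            rf"\({a}\)\s+{b}\s+{c}\b",      # brackets: (brain) stem cells
        ],
    }
    votes: Counter = Counter()
    for text in snippets:
        for direction, regexes in patterns.items():
            for regex in regexes:
                votes[direction] += len(re.findall(regex, text, flags=re.IGNORECASE))
    return votes


# Made-up result summaries for two exact noun compound queries.
cycle_votes = surface_votes("cell", "cycle", "analysis",
                            ["We performed cell-cycle analysis by flow cytometry."])
brain_votes = surface_votes("brain", "stem", "cells",
                            ["Isolated from the brain's stem cells.",
                             "(brain) stem cells were cultured.",
                             "Injury to the brain stem's cells."])
```

The direction with more votes would then serve as the prediction of this feature model.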
Other features: Abbreviations: “tumor necrosis factor (NF)”, “tumor necrosis (TN) factor”; Concatenation: “health care reform” -> healthcare, carereform; Reordering; Internal inflection variability.
Paraphrases: “brain stem cells” -> “stem cells in the brain”, “cells from the brain stem”. Queries with a set of selected paraphrase patterns were used to see how often each paraphrase occurs, as evidence for bracketing prediction.
Evaluation: Lauer’s data set (Lauer ’95): 244 three-noun NCs. Biomedical data set: 500 three-noun NCs extracted from Medline abstracts, of which 430 are unambiguous (361 with left, 69 with right bracketing). Inter-annotator agreement: 88% and 82% (kappa: .606 and .442).
Results: Surface features perform best. Enc.: P=85.51% with 87.70% coverage. Bio: P=88.84% with 100% coverage. Best overall scores are obtained by combining the most reliable models (majority vote).
Results on the encyclopaedia data:
Model                                  Acc. % (Enc. data)
Baseline (LEFT)                        66.80
Lauer ’95 dependency                   77.50
χ² dependency                          (value not preserved)
Lauer ’95 tuned                        80.70
“Upper bound” (humans, Lauer ’95)      (value not preserved)
Majority vote -> left                  89.34
Keller & Lapata: best AltaVista        78.68
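The majority vote combination might look like the sketch below. It reads "Majority vote -> left" as: take the majority decision of the individual models and fall back to left bracketing when there is a tie or no model makes a prediction. That reading, the function name and the example predictions are assumptions.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_vote(predictions: Iterable[Optional[str]],
                  default: str = "left") -> str:
    """Combine per-model bracketing decisions; None means "no prediction"."""
    counts = Counter(p for p in predictions if p in ("left", "right"))
    if counts["left"] == counts["right"]:
        return default  # tie or no votes at all: back off to the default
    return "left" if counts["left"] > counts["right"] else "right"


# Hypothetical decisions from three models for one noun compound.
combined = majority_vote(["right", None, "right"])
fallback = majority_vote([None, None])
```

Backing off to a default gives every compound a decision, so coverage is 100% by construction.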
Dynamically Generating a Protein Entity Dictionary Using Online Resources (H. Liu, Z. Hu and C. Wu) Available at: saurus 4,046,733 terms and 1,640,082 entities
Use of large biological databases, incl.: 3 NCBI databases (GenPept, RefSeq, Entrez Gene); the PSD database from the Protein Information Resource (PIR); UniProt; model organism databases; nomenclature databases.
Automatically gathered fields containing annotation information for each iProClass record. Extracted terms associated with one or more UniProt unique identifiers => raw dictionary. Automated curation using UMLS to flag UMLS semantic types and remove high-frequency nonsensical terms.
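The dictionary construction can be pictured with the toy sketch below: terms are gathered from the annotation fields of each record, mapped to the identifiers they occur with, and then filtered. The record layout, field names, identifiers, lowercasing and the stop-term list are all invented for illustration; the real pipeline works on database records and uses UMLS for curation, which is not modelled here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_raw_dictionary(records: Iterable[dict],
                         fields: List[str]) -> Dict[str, Set[str]]:
    """Map each term found in the given fields to the identifiers it occurs with."""
    raw: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        for field in fields:
            for term in record.get(field, []):
                raw[term.strip().lower()].add(record["id"])
    return raw


def curate(raw: Dict[str, Set[str]], stop_terms: Set[str]) -> Dict[str, Set[str]]:
    """Drop empty terms and terms on a (hypothetical) nonsensical-term list."""
    return {term: ids for term, ids in raw.items()
            if term and term not in stop_terms}


# Made-up records; "X00001" and "X00002" are placeholder identifiers.
records = [
    {"id": "X00001", "protein_name": ["Cellular tumor antigen p53"],
     "gene_name": ["TP53"], "synonyms": ["p53", "hypothetical protein"]},
    {"id": "X00002", "protein_name": ["Cellular tumor antigen p53"],
     "gene_name": ["Tp53"], "synonyms": ["p53"]},
]
raw = build_raw_dictionary(records, ["protein_name", "gene_name", "synonyms"])
dictionary = curate(raw, stop_terms={"hypothetical protein"})
```

A term such as "p53" ends up associated with more than one identifier, which matches the slide's "one or more UniProt unique identifiers".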