BioNLP-related talks and demos at ACL and CoNLL ’05. Presented by Beatrice Alex, BioNLP meeting, 11th of July 2005.


Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction (R. McDonald, F. Pereira, S. Kulick, S. Winters, Y. Jin, P. White)
Complex relations:
John Smith is the CEO at Inc. Corp. (John Smith, CEO, Inc. Corp.)
John Smith goes to his office at Inc. Corp. (John Smith, , Inc. Corp.)

Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction
Complex relation extraction:
1. Recognition of pairs of entity mentions (binary relations are edges in a graph and named entities are nodes)
   - Create a set of positive (valid) and negative (invalid) relations using a standard maxent classifier (Berger et al. ’96, McCallum ’02)
2. Reconstruction of complex relations by making tuples from maximal cliques in the graph

Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction
Complex relation reconstruction methods:
1. Maximal cliques (MC)
Consider all cliques in the graph that are consistent with the definition of the relation and add them
   - For overlapping cliques, only return maximal cliques (those that are not a subset of other cliques)
Use a branch and bound algorithm to find all maximal cliques (Bron and Kerbosch ’73) = very efficient
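The maximal-clique step can be sketched as follows. This is a minimal illustration of the basic Bron-Kerbosch recursion (without pivoting), not the authors' implementation; the helper names and the toy edge list are hypothetical, and the check that a clique is consistent with the relation definition (argument types) is left out.

```python
def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def maximal_cliques(adj):
    """Return every maximal clique as a frozenset (basic Bron-Kerbosch)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend it, x: already explored
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical positive binary relations for the codon 12 / codon 16 sentence.
edges = [
    ("point mutation", "codon 12"), ("point mutation", "G (initial)"),
    ("point mutation", "T (altered)"), ("codon 12", "G (initial)"),
    ("codon 12", "T (altered)"), ("G (initial)", "T (altered)"),
    ("point mutation", "codon 16"), ("point mutation", "A (initial)"),
    ("point mutation", "G (altered)"), ("codon 16", "A (initial)"),
    ("codon 16", "G (altered)"), ("A (initial)", "G (altered)"),
]
for clique in maximal_cliques(build_graph(edges)):
    print(sorted(clique))
```

With these assumed edges the graph has two maximal cliques, one per variation event, which is how binary decisions are turned back into 4-tuples.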

Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction
2. Probabilistic cliques (PC)
Assign a weight to each binary relation (taken from the classifier)
The weight of a clique, w(C), is the mean weight of the edges in the clique
A clique is valid if w(C) ≥ 0.5
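The clique weight described on this slide can be sketched as below. The function names and edge weights are hypothetical, the mean is taken as the plain arithmetic mean, and the validity test is assumed to be "at least 0.5".

```python
from itertools import combinations
from statistics import mean


def clique_weight(clique, weights):
    """Mean classifier weight over all edges (node pairs) of the clique.

    `weights` maps frozenset({u, v}) to the binary classifier's score.
    The clique must contain at least two nodes.
    """
    return mean(weights[frozenset(pair)] for pair in combinations(clique, 2))


def is_valid(clique, weights, threshold=0.5):
    """Assumed validity rule: w(C) >= threshold."""
    return clique_weight(clique, weights) >= threshold


# Hypothetical scores for one candidate 3-ary relation.
weights = {
    frozenset({"point mutation", "codon 12"}): 0.9,
    frozenset({"point mutation", "G"}): 0.6,
    frozenset({"codon 12", "G"}): 0.3,
}
print(is_valid(["point mutation", "codon 12", "G"], weights))
```

A weak edge (0.3 here) no longer breaks the clique outright, as it would under a hard positive/negative decision per edge; it only lowers the average.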

Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction
Extraction of genomic variation events from biomedical text (variation type, location, initial state, altered state)
“At codons 12 and 16, the occurrence of point mutations from G/A to T/G were observed.”
(point mutation, codon 12, G, T)
(point mutation, codon 16, A, G)

Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction
447 Medline abstracts
4691 sentences, 4773 entities, 1218 relations (38% not binary)
Relation counts per arity [figures missing]
Gold standard named entities (56% of entity pairs not related)

Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction
Results: MC and PC significantly faster and more accurate than NE (naïve enumeration)
Table of Precision, Recall and F-score for the binary classifier, NE, MC and PC [values missing]

Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing (P. Nakov and M. Hearst)
Unsupervised method for noun compound bracketing: [[liver cell] antibody] vs. [liver [cell line]]
Use of bigram estimates with the χ² measure
Use of surface features for querying web search engines
Experiments with paraphrases
Evaluation on encyclopaedia and bioscience text
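As a rough illustration of a χ² association score computed from bigram estimates, the sketch below uses the standard 2x2 contingency formula. The function names are hypothetical, and the decision rule shown (a dependency-style comparison of w1-w2 against w1-w3, with ties going left) is an assumption for illustration, not necessarily the exact model used in the paper.

```python
def chi_square(n12, n1, n2, total):
    """Chi-square association of two words from a 2x2 contingency table.

    n12: count of the bigram (w1 w2); n1, n2: counts of w1 and w2;
    total: total number of bigrams (all of these would come from page hits).
    """
    a = n12                    # w1 with w2
    b = n1 - n12               # w1 without w2
    c = n2 - n12               # w2 without w1
    d = total - n1 - n2 + n12  # neither
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return total * (a * d - b * c) ** 2 / denom


def bracket_dependency(assoc_w1_w2, assoc_w1_w3):
    """Assumed dependency-style rule: left bracketing if w1 binds more to w2."""
    return "left" if assoc_w1_w2 >= assoc_w1_w3 else "right"


# Hypothetical counts: independent words score 0, perfectly tied words score N.
print(chi_square(10, 100, 100, 1000))
print(chi_square(100, 100, 100, 1000))
```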

Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing
Web-driven surface features
Dash: cell-cycle analysis, donor T-cell
Possessive marker: brain’s stem cell, brain stem’s cells
Internal capitalisation: Plasmodium vivax Malaria, brain Stem cells
Embedded slashes: leukaemia/lymphoma cell
Brackets: growth factor (beta), (brain) stem cells
Collected surface features using regular expressions in summaries of returned documents of exact NC queries
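Two of these surface features, the dash and the possessive marker, can be counted with regular expressions roughly as follows. The function name, the snippets, and the reading of each pattern as a vote for left or right bracketing are assumptions made for this sketch; only straight apostrophes are matched.

```python
import re


def surface_votes(w1, w2, w3, snippets):
    """Count dash and possessive evidence for bracketing the compound w1 w2 w3."""
    w1, w2, w3 = (re.escape(w) for w in (w1, w2, w3))
    patterns = {
        # "w1-w2 w3" and "w1 w2's w3" group the first two nouns together.
        "left": [rf"\b{w1}-{w2}\s+{w3}", rf"\b{w1}\s+{w2}'s\s+{w3}"],
        # "w1 w2-w3" and "w1's w2 w3" group the last two nouns together.
        "right": [rf"\b{w1}\s+{w2}-{w3}", rf"\b{w1}'s\s+{w2}\s+{w3}"],
    }
    votes = {"left": 0, "right": 0}
    for text in snippets:
        for side, regexes in patterns.items():
            for regex in regexes:
                votes[side] += len(re.findall(regex, text, flags=re.IGNORECASE))
    return votes


# Hypothetical result summaries for the query "brain stem cell".
snippets = [
    "Damage to brain-stem cells was observed.",
    "The brain stem's cells control breathing.",
    "The brain's stem cells can regenerate.",
]
print(surface_votes("brain", "stem", "cell", snippets))
```

Because the last noun is matched as a prefix, plural forms such as "cells" are counted too.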

Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing
Other features:
Abbreviations: “tumor necrosis factor (NF)”, “tumor necrosis (TN) factor”
Concatenation: “health care reform” -> healthcare, carereform
Reordering
Internal inflection variability

Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing
Paraphrases: “brain stem cells”, “stem cells in the brain”, “cells from the brain stem”
Used queries with a set of selected paraphrase patterns to see how often they occurred for bracketing prediction

Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing
Evaluation
Lauer’s data set (Lauer ’95)
   - 244 three-noun NCs
Biomedical data set
   - Extracted 500 three-noun NCs from Medline abstracts
430 unambiguous (361 with left, 69 with right bracketing)
Inter-annotator agreement: 88% and 82% (kappa: .606 and .442)

Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing
Results: Surface features perform best
Enc.: P=85.51% with 87.70% coverage
Bio: P=88.84% with 100% coverage
Best overall scores by combining most reliable models (majority vote)

Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing
Model                                   Acc. % (Enc. data)
Baseline (LEFT)                         66.80
Lauer ’95 dependency                    77.50
χ² dependency                           [missing]
Lauer ’95 tuned                         80.70
“Upper bound” (humans - Lauer ’95)      [missing]
Majority vote -> left                   89.34
Keller & Lapata: best Alta Vista        78.68
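The "Majority vote -> left" row can be read as: let the individual models vote, and fall back to left bracketing when they do not decide. A minimal sketch under that reading follows; the function name and the representation of an abstaining model as None are assumptions.

```python
from collections import Counter


def majority_vote(predictions, default="left"):
    """Combine per-model bracketing predictions; ties and no votes go to `default`."""
    counts = Counter(p for p in predictions if p in ("left", "right"))
    if counts["left"] == counts["right"]:
        return default
    return "left" if counts["left"] > counts["right"] else "right"


# Hypothetical outputs of three models for one noun compound; None = no prediction.
print(majority_vote(["right", "right", "left"]))
print(majority_vote(["right", "left", None]))
```

Falling back to left matches the LEFT baseline in the table, which is the more frequent bracketing in both data sets.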

Dynamically Generating a Protein Entity Dictionary Using Online Resources (H. Liu, Z. Hu and C. Wu)
Available at: saurus
4,046,733 terms and 1,640,082 entities

Dynamically Generating a Protein Entity Dictionary Using Online Resources
Use of large biological databases, including:
3 NCBI databases (GenPept, RefSeq, Entrez Gene)
PSD database from the Protein Information Resource (PIR)
UniProt
Model organism databases
Nomenclature databases

Dynamically Generating a Protein Entity Dictionary Using Online Resources
Automatically gathered fields containing annotation information for each iProClass record
Extracted terms associated with one or more UniProt unique identifiers => raw dictionary
Automated curation using UMLS to flag UMLS semantic types and remove high-frequency nonsensical terms
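The raw-dictionary step can be pictured as a mapping from each extracted term to the identifiers it is associated with, followed by a filtering pass. The sketch below is only an illustration: the record layout, field names, identifiers, lower-casing of terms and the flagged-term list are all hypothetical, and the UMLS lookup itself is not modelled.

```python
from collections import defaultdict


def build_raw_dictionary(records, name_fields):
    """Map each term found in the chosen fields to the set of record identifiers."""
    term_to_ids = defaultdict(set)
    for record in records:
        for field in name_fields:
            for term in record.get(field, []):
                term = term.strip().lower()  # case folding is an assumption
                if term:
                    term_to_ids[term].add(record["id"])
    return dict(term_to_ids)


def drop_flagged(raw_dictionary, flagged_terms):
    """Remove terms flagged during curation (e.g. high-frequency nonsensical ones)."""
    return {t: ids for t, ids in raw_dictionary.items() if t not in flagged_terms}


# Hypothetical records; real entries would carry UniProt identifiers.
records = [
    {"id": "ID0001", "names": ["Tumor necrosis factor", "TNF"], "genes": ["TNF"]},
    {"id": "ID0002", "names": ["Hypothetical protein", "TNF"], "genes": []},
]
raw = build_raw_dictionary(records, ["names", "genes"])
curated = drop_flagged(raw, {"hypothetical protein"})
print(curated)
```

A term that maps to several identifiers, like "tnf" here, stays ambiguous in the dictionary; resolving it is left to whatever uses the dictionary.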