SMBM Talks SMBM, Cambridge, April 11-13 (Edinburgh May 2) NLP for Biomedical Text Mining.



Resources and Tools for Biomedical Text Mining
Junichi Tsujii (U of Tokyo)
Keywords: GENIA corpus; annotation
Main point: progress in text mining depends on integrating the growing GENIA annotation (e.g. coreference) with lexical resources for domain knowledge (ontologies) and with software development.
Take home message: see main point above

annotated corpus
- POS
- NER
- coreference (670 abstracts, Singapore)
- interaction (biological events; cooperation with CNRS)
- parse trees (1.5 million GENIA abstracts parsed in 10 days using a 100 PC cluster)
ontology
- top nodes: substance; source; other
software development
- POS tagger
- NER tagger
- parser
- IR system (Medusa)
- IE system (event extraction: gene/disease relation)

POS tagger
- MaxEnt model (Kazama and Tsujii 2003, 2005)
- Trained on WSJ (>39,000 sent.) and GENIA (18,500 sent.)
[Table: training set (WSJ, GENIA, WSJ+GENIA) against test set (WSJ, GENIA); values not preserved]
NER tagger
- combines a rule-based and a statistical approach
- on BioNLP: 70.8% (?); our system got 70.1%

HPSG-based parser (Enju)
- see Miyao et al. ACL05
- available on website
- XML output
- dependency relations
- predicate-argument accuracy: PTB: prec=88.3%, rec=87.2%; GENIA: lower...
gene/disease relation extraction
- pred/arg works better than bag of words or local context (gives best precision)

Recognising noun phrases in biomedical text: an evaluation of lab prototypes and a commercial chunker
J. Wermter, J. Fluck, J. Stroetgen, S. Geissler, U. Hahn (U. Jena, Temis)
Keywords: chunking, portability
Main point: take several existing chunkers trained on (or developed for) newspaper text and evaluate their performance on biomedical data (beta version of GENIA syntactic annotation).
Take home messages:
- overall performance drop (~3-6 points) for ML systems when shifting to bio domain
- no significant difference between statistical and rule-based systems

Three statistical chunkers:
- YamCha (support vector machine)
- Tbl (transformation-based error-driven learning)
- BoSS (boundary predictor that combines the observed probabilities of NP boundaries and POS patterns in the training set)
One rule-based commercial system: Temis
1. Uses words rather than GENIA POS tags
2. Computes morphological information (XeLDA toolkit)
3. HMM POS tagger disambiguates chain of POS tags
- hand-coded grammar had to be modified (on PTB)
- tagset had to be translated (not straightforward)

Training and Test Sets
Train
- sections of Penn Treebank for training (over 200,000 POS-tagged and IOB-chunked tokens)
Test
- GENIA treebank (beta version): 200 MedLine abstracts with syntactic annotation
- the GENIA treebank was automatically converted into the IOB format
- just under 45,000 tokens
- ~11,000 = devtest for setting Temis’ IOB output
- ~34,000 = actual test set
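The conversion of bracketed noun phrases into IOB tags can be illustrated with a minimal Python sketch. The function name `spans_to_iob` and the example sentence are illustrative assumptions, not the authors' tooling, and the IOB2 variant (every chunk opens with B-NP) is assumed because the slides do not say which variant was used.

```python
def spans_to_iob(tokens, np_spans):
    """Convert NP chunk spans to IOB2 tags.

    tokens   : list of token strings
    np_spans : list of (start, end) token offsets, end exclusive,
               assumed to be non-overlapping base NPs
    """
    tags = ["O"] * len(tokens)          # default: outside any chunk
    for start, end in np_spans:
        tags[start] = "B-NP"            # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-NP"            # remaining tokens of the chunk
    return list(zip(tokens, tags))


# Hypothetical example sentence, not taken from the GENIA treebank.
tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B"]
tagged = spans_to_iob(tokens, [(0, 3), (4, 6)])
```

Here `tagged` pairs each token with its chunk tag, which is the token-per-line form that chunkers such as the ones compared above typically consume.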

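Results and Errors
[Table: Rec / Prec / F for YamCha, BoSS, Tbl and Temis on the GENIA corpus and the PTB corpus; values not preserved]
Errors
- Coordination
- bracketed elements
- ...
[Table: Temis and BoSS after domain adaptations; values not preserved]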

Automatic Term List Generation for Entity Tagging
Ted Sandler, Andrew Schein, and Lyle Ungar (CS, UPenn)
Keywords: NER, automatic gazetteer creation
Main point: term lists can be obtained automatically and, when integrated into a CRF-based NER (gene) tagger, boost its performance to a level comparable with hand-modelled lists.
Take home messages:
- unsupervised gazetteer creation is feasible and useful
- supervised methods for obtaining terms outperform unsupervised methods

4 related methods for generating term lists; they differ wrt (see table):
- word representation
- clustering algorithms to partition the words
- choice of feature weighting
Overall Approach
- choose set of vocabulary items (nouns) to partition into classes
- choose set of useful syntactic relations: frequent, informative, relatively noise-free
- parse corpus to extract relations and collect statistics
- use clustering algorithm to partition the vocabulary
- resulting partitions are term lists

Representation of the base vocabulary
- vector space, where each item is represented by the set of syntactic configurations it occurs in
- affinity matrix, where each item is represented by its similarities to the other items in the vocabulary
Weighting Schemes
- Pearson’s chi-square test
- Generalized Likelihood Ratio (G-square; Dunning 1993; better with sparse data)
- the first is better at “common sense” generalisations; the second is better at domain-specific generalisations
Clustering Algorithms
- k-means clustering for words in vector space (high recall)
- agglomerative clustering for data in affinity matrix (high prec)
Corpus
- 15,000 sentences from BioCreative + 1,800,547 Medline abstracts
- parsed using Minipar; vocabulary = 7782 single-token nouns
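The two weighting schemes can be compared on a single 2x2 contingency table of word/context co-occurrence counts. The sketch below is a generic textbook formulation in Python, not the authors' code; the function names and the cell layout (k11 = word with context, k12 = word without context, k21 = other words with context, k22 = neither) are assumptions, and all row and column totals are assumed to be non-zero.

```python
import math


def expected_counts(k11, k12, k21, k22):
    """Expected cell counts under independence of word and context."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    return [row1 * col1 / n, row1 * col2 / n,
            row2 * col1 / n, row2 * col2 / n]


def pearson_chi_square(k11, k12, k21, k22):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E."""
    observed = [k11, k12, k21, k22]
    expected = expected_counts(k11, k12, k21, k22)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def g_square(k11, k12, k21, k22):
    """Log-likelihood ratio statistic: 2 * sum of O * ln(O / E).

    Empty cells contribute nothing to the sum.
    """
    observed = [k11, k12, k21, k22]
    expected = expected_counts(k11, k12, k21, k22)
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


# Made-up counts for illustration only.
chi2 = pearson_chi_square(20, 10, 10, 20)
g2 = g_square(20, 10, 10, 20)
```

Both statistics are zero when the observed counts match the independence expectation and grow as the word and the context become more strongly associated; they diverge most on tables with very small cells, which is where the sparse-data remark above applies.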

NER (Gene) Tagging
- McDonald and Pereira’s CRF tagger
- automatically generated 2,164 overlapping term lists incorporated as features in the model
- binary feature (0/1) for each term list (in=1; not=0)
- baseline tagger without lists
- tagger augmented with hand-compiled lists of genes (57,563)
- tagger augmented with large list of genes obtained via supervised learning (Tanabe and Wilbur Gene.Lexicon: 1,145,913)
TRAIN/TEST: 5-fold cross-validation on 394,661 words of BioCreative (1/5 for training and 4/5 for testing)
[Table: prec / rec / f-score for Baseline, Unsupervised, Supervised and Manual; values not preserved]
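The binary list-membership features can be sketched as follows. This is a minimal Python illustration and not the feature code of the CRF tagger itself; the function name, the feature naming scheme and the two tiny term lists are hypothetical.

```python
def term_list_features(token, term_lists):
    """Return one 0/1 feature per term list for a single token.

    term_lists maps a list name to a set of lower-cased terms.
    Lists may overlap, so several features can fire for one token.
    """
    word = token.lower()
    return {f"in_list_{name}": int(word in terms)
            for name, terms in term_lists.items()}


# Hypothetical, hand-made lists standing in for the generated clusters.
term_lists = {
    "c17": {"p53", "brca1", "kinase"},
    "c42": {"kinase", "phosphatase"},
}
features = term_list_features("Kinase", term_lists)
```

In a sequence tagger these features would simply be appended to the other per-token features, so that the model can learn how much weight each list deserves.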

Protein-Protein Interaction Extraction: A Supervised Learning Approach
J. Xiao, J. Su, G. Zhou, C. Tan (Inst. for Infocomm Research, Singapore)
Keywords: relation extraction
Main point: a MaxEnt approach to protein-protein relation extraction that exploits simple local features performs better than co-occurrence and rule-based approaches, achieving nearly 94% recall and 88% precision on 303 MedLine abstracts.
Take home message: supervised learning with shallow features works well for protein-protein interaction extraction

Task: extract pairs of interacting proteins
- no direction
- perfect NER (manual annotation)
Procedure
- tokenisation and morphological analysis
- POS tagging
- NER
- sentence analysis (parsing)
- coreference resolution (including abbreviations and aliases)
- MaxEnt classifier

Features
Words
- all words that appear in the two protein names
- words in between the two protein names
- previous/next words in an n-word window (unordered)
Overlap
- number of protein names in between the 2 protein names
Keywords
- occurrence of a word from the keyword list in the surroundings
Chunks
- all heads of base phrases in between the 2 protein names
- all heads surrounding the protein name pair
- all phrase types between the 2 protein names
Parse Tree
Dependency Tree
- dependency between two proteins
Pair of heads of protein names
Pair of abbreviations of two proteins
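Some of the surface features listed above (words, overlap, keywords) can be sketched for one candidate protein pair. This Python fragment is an illustration under stated assumptions: the function name, the feature naming scheme, the window size of 2, the keyword set and the example sentence are all hypothetical, and the chunk, parse and dependency features are left out.

```python
def pair_features(tokens, p1, p2, all_proteins, keywords, window=2):
    """Surface features for a candidate pair of protein names.

    p1, p2       : (start, end) token spans, end exclusive, p1 before p2
    all_proteins : spans of every protein name in the sentence
    keywords     : set of lower-cased interaction keywords
    """
    (s1, e1), (s2, e2) = p1, p2
    feats = {}
    for w in tokens[s1:e1] + tokens[s2:e2]:       # words inside the two names
        feats[f"name_word={w.lower()}"] = 1
    for w in tokens[e1:s2]:                       # words between the names
        feats[f"between={w.lower()}"] = 1
    for w in tokens[max(0, s1 - window):s1]:      # unordered left window
        feats[f"before={w.lower()}"] = 1
    for w in tokens[e2:e2 + window]:              # unordered right window
        feats[f"after={w.lower()}"] = 1
    # Overlap: number of other protein names lying between the pair.
    feats["overlap"] = sum(1 for s, e in all_proteins if s >= e1 and e <= s2)
    # Keyword: does any keyword occur in the pair's surroundings?
    context = tokens[max(0, s1 - window):e2 + window]
    feats["has_keyword"] = int(any(w.lower() in keywords for w in context))
    return feats


# Hypothetical sentence with placeholder protein names.
tokens = ["ProtA", "binds", "to", "ProtB", "in", "vitro"]
feats = pair_features(tokens, (0, 1), (3, 4), [(0, 1), (3, 4)],
                      {"binds", "interacts"})
```

A feature dictionary of this kind, one per candidate pair, is the usual input to a MaxEnt (logistic regression) classifier that decides whether the pair interacts.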

Experiment and Results
corpus: IEPA (Iowa University)
- 303 Medline abstracts
- 633 positive instances
- 1080 negative instances
POS tagger trained on GENIA using an HMM model
Collins’ parser
10-fold cross-validation
best result: rec=93.9%; prec=88%; f=90.9%
GOOD Features
- words (esp. surrounding)
- chunks
- pairs of protein heads
- pairs of abbreviations
- keywords (so and so)
NOT SO GOOD Features
- overlap
- parse trees
- dependency relations
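The reported F-score is consistent with the balanced harmonic mean of the stated precision and recall, as a short Python check shows. The function name `f_score` and the `beta` parameter are illustrative; only the two percentages come from the slide.

```python
def f_score(precision, recall, beta=1.0):
    """F-measure; beta=1 gives the balanced harmonic mean (F1)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported for the best run.
f1 = f_score(0.88, 0.939)
f1_percent = round(f1 * 100, 1)
```

With these inputs `f1_percent` comes out at 90.9, matching the figure quoted above.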

Challenges of Information Mining in a Pharmaceutical Environment
Philippe Sanseau (Glaxo-Smith-Kline, UK)
Main point:
Q: How do you see the role of NLP in your field?
A: Excuse me, could someone explain what NLP is, please?
Take home question: are the NLP and pharmaceutical communities on the same track?