Recognizing Names in Biomedical Texts: a Machine Learning Approach. GuoDong Zhou, Jie Zhang, Jian Su, Dan Shen and ChewLim Tan.

Presentation transcript:

Recognizing Names in Biomedical Texts: a Machine Learning Approach. GuoDong Zhou (1,*), Jie Zhang (1,2), Jian Su (1), Dan Shen (1,2) and ChewLim Tan (2). (1) Institute for Infocomm Research, Singapore; (2) School of Computing, National Univ. of Singapore. (Bioinformatics, Vol. 20, No. 7, 2004)

2/29 Abstract
Presents PowerBioNE, a named entity recognition system for the biomedical domain.
Evidential features: (1) word formation pattern; (2) morphological pattern, such as prefix and suffix; (3) POS; (4) head noun trigger; (5) special verb trigger; and (6) name alias feature.
A hidden Markov model (HMM) is the underlying recognizer.
A k-nearest-neighbor (k-NN) algorithm is used to resolve the data sparseness problem.
Pattern-based post-processing deals with the cascaded entity name phenomenon.

3/29 Special Naming Conventions (1/2)
Descriptive naming convention: e.g. 'normal thymic epithelial cells' makes it difficult to identify the left boundary; 18.6% of entity names in GENIA 3.0 consist of at least four words.
Conjunction and disjunction: e.g. '91 and 84 kDa proteins'; 2.06% of entity names in GENIA 3.0 have such a construction.
Non-standardized naming convention: N-acetylcysteine, N-acetyl-cysteine, NAcetylCysteine.

4/29 Special Naming Conventions (2/2)
Abbreviation: frequently used in the biomedical domain and highly ambiguous; 81.2% of abbreviations are ambiguous, with an average of 16.6 senses in MEDLINE abstracts.
Cascaded construction: e.g. 'kappa 3 binding factor'; 16.7% of entity names in GENIA 3.0 have such a construction.

5/29 GENIA Corpus
GENIA V1.1: 670 MEDLINE abstracts of 123K words. Used to compare this work with others.
GENIA V2.1: adds POS annotation to GENIA V1.1. Used to train the POS tagger and evaluate the usefulness of POS.
GENIA V3.0: 2,000 MEDLINE abstracts of 360K words. Used for the bulk of the experiments.
The GENIA ontology includes 23 distinct classes.

6/29 Features
Word formation pattern (F_WFP)
Morphological pattern (F_MP)
Part-of-speech (F_POS)
Semantic triggers (F_HEAD, F_VERB)
Name alias feature (F_ALIAS)

7/29 Word Formation Pattern
The word formation pattern of a token is useful for distinguishing biomedical entity names from other words.
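As a rough illustration of this feature, a word formation pattern could be computed along the lines of the sketch below; the class inventory, their names, and the ordering of the tests are assumptions, not the paper's exact definition.

```python
import re

def word_formation_pattern(word):
    """Map a token to a coarse word formation pattern class.
    The class inventory and ordering below are illustrative assumptions."""
    greek = {"alpha", "beta", "gamma", "delta", "kappa"}
    if word.lower() in greek:
        return "GreekLetter"
    if re.fullmatch(r"\d+", word):
        return "AllDigits"
    if re.fullmatch(r"[A-Z]+", word):
        return "AllCaps"
    if re.fullmatch(r"[A-Z][a-z]+", word):
        return "InitCap"
    if re.search(r"\d", word) and re.search(r"[A-Za-z]", word):
        return "LettersAndDigits"
    if "-" in word:
        return "Hyphenated"
    if re.fullmatch(r"[a-z]+", word):
        return "AllLower"
    return "Other"

# e.g. ['NF-kappa', 'B', 'IL-2', 'cells'] ->
#      ['Hyphenated', 'AllCaps', 'LettersAndDigits', 'AllLower']
print([word_formation_pattern(w) for w in ["NF-kappa", "B", "IL-2", "cells"]])
```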

8/29 Morphological Pattern (1/2)

9/29 Morphological Pattern (2/2)
The frequency of each prefix/suffix is counted in each entity class, and prefixes/suffixes with similar distributions among the entity classes are grouped into one category.
On average, 37 prefixes/suffixes are selected from the training data and further grouped into 23 categories.
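A minimal sketch of the counting step described above, under the assumption that the class distribution of each suffix is what gets compared when grouping; the function name and the fixed suffix length are illustrative, and the grouping itself is only indicated in a comment because the slide does not say how it is done.

```python
from collections import Counter, defaultdict

def suffix_class_distributions(annotated_tokens, suffix_len=3):
    """annotated_tokens: iterable of (word, entity_class) pairs from the training data.
    Returns, for each suffix of length `suffix_len`, its normalized distribution
    over entity classes."""
    counts = defaultdict(Counter)                 # suffix -> Counter over classes
    for word, cls in annotated_tokens:
        if len(word) > suffix_len:
            counts[word[-suffix_len:].lower()][cls] += 1
    distributions = {}
    for suffix, per_class in counts.items():
        total = sum(per_class.values())
        distributions[suffix] = {c: n / total for c, n in per_class.items()}
    return distributions

# Suffixes (and, symmetrically, prefixes) whose class distributions are similar
# could then be merged into one category, e.g. by clustering the distribution
# vectors; the slide does not specify how the grouping is done.
```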

10/29 Part-of-Speech
POS may provide useful evidence about the boundaries of biomedical entity names.
The authors adapt an HMM-based POS tagger to GENIA V2.1 by training it on the Penn TreeBank (2,500 WSJ articles, 1M words) plus 590 POS-annotated GENIA abstracts.
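Purely as an illustration of the idea of pooling newswire and GENIA training data (the paper uses its own HMM-based tagger, not NLTK), a supervised HMM tagger trained on mixed POS-annotated sentences might look like this; the toy sentences and tags are made up.

```python
from nltk.tag import hmm

# Toy stand-ins for the two training sources; in the paper these are Penn TreeBank
# WSJ sentences plus 590 POS-annotated GENIA abstracts. The sentences below are
# invented for illustration only.
wsj_sents = [[("The", "DT"), ("company", "NN"), ("expanded", "VBD"), (".", ".")]]
genia_sents = [[("NF-kappa", "NN"), ("B", "NN"), ("activation", "NN"), (".", ".")]]

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(wsj_sents + genia_sents)
print(tagger.tag(["NF-kappa", "B", "activation", "."]))
```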

11/29 Semantic Triggers (1/2)
Head noun trigger (F_HEAD): the major noun of a noun phrase, which often describes the function or property of the NP, e.g. 'activated human B cells'.
Unigram and bigram head nouns are extracted from the training data and ranked by frequency; the top 60% are selected as head noun triggers.
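One plausible reading of this selection step, sketched below; treating "select 60%" as keeping the most frequent 60% of the distinct head nouns is an assumption, and the function name is made up.

```python
from collections import Counter

def head_noun_triggers(entity_mentions, keep_ratio=0.6):
    """entity_mentions: tokenized entity names from the training data,
    e.g. ['activated', 'human', 'B', 'cells'].
    Collects unigram and bigram head nouns (the rightmost one or two tokens),
    ranks them by frequency, and keeps the most frequent `keep_ratio` of the
    distinct head nouns as triggers."""
    counts = Counter()
    for tokens in entity_mentions:
        if not tokens:
            continue
        counts[tokens[-1].lower()] += 1                                # unigram head
        if len(tokens) >= 2:
            counts[" ".join(t.lower() for t in tokens[-2:])] += 1      # bigram head
    ranked = [head for head, _ in counts.most_common()]
    return set(ranked[: int(len(ranked) * keep_ratio)])

# e.g. head_noun_triggers([['activated', 'human', 'B', 'cells'], ['IL-2', 'gene']])
```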

12/29 Semantic Triggers (2/2)
Special verb trigger (F_VERB): special verbs may provide evidence about the boundaries and the classes of biomedical entity names.

13/29 Name Alias Feature
Inter-sentential name alias phenomenon: when 'TCF' is proposed as an entity name candidate, the name alias algorithm is invoked. If 'T cell Factor' is a 'Protein' name recognized earlier in the document, 'TCF' is determined to be an alias of 'T cell Factor' and receives the name alias feature Protein3L3.
Inner-sentential abbreviation: when an abbreviation with parentheses is detected, the abbreviation and the parentheses are removed, the HMM-based named entity recognizer is applied to the sentence, and the abbreviation with parentheses is then restored to its original position. The abbreviation is classified with the same class as its expanded form, and both the expanded form and the abbreviation are stored in the list of recognized biomedical entity names.
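A toy sketch of the inter-sentential alias check and the resulting feature label; the initial-letter matching and the reading of 'Protein3L3' as entity class + full-form word count + 'L' + abbreviation character count are assumptions that happen to reproduce the slide's example, not the paper's exact algorithm.

```python
def is_alias(abbrev, full_form):
    """Rough alias test: does the abbreviation match the initial letters of the
    words of the full form? (An assumption; the slide does not spell out the
    alias algorithm.)"""
    initials = "".join(word[0] for word in full_form.split() if word)
    return abbrev.lower() == initials.lower()

def alias_feature(candidate, recognized):
    """recognized: dict mapping previously recognized names to their entity class.
    Returns a label such as 'Protein3L3'; the '3L3' encoding used here is a
    hypothetical reading of the slide's example."""
    for name, entity_class in recognized.items():
        if is_alias(candidate, name):
            return f"{entity_class}{len(name.split())}L{len(candidate)}"
    return None

# 'T cell Factor' recognized earlier as a Protein, so 'TCF' gets 'Protein3L3'
print(alias_feature("TCF", {"T cell Factor": "Protein"}))
```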

14/29 HMM-based Biomedical Named Entity Recognition (1/2)
Given an output sequence O = o_1 o_2 ... o_n, the purpose of the HMM is to find the most likely tag (state) sequence S = s_1 s_2 ... s_n that maximizes P(S|O).
Here o_i = <f_i, w_i>, where w_i is the word and f_i is the feature set of the word w_i, and s_i = BOUNDARY_i_ENTITY_i_FEATURE_i, where BOUNDARY_i denotes the position of the current word in the entity, ENTITY_i indicates the class of the entity, and FEATURE_i is the feature set.

15/29 HMM-based Biomedical Named Entity Recognition (2/2)
Assuming mutual information (MI) independence between the individual states and the output sequence:
log P(S|O) = log P(S) - sum_i log P(s_i) + sum_i log P(s_i|O)
The first term is the state-transition model, and the last term is the output model, in which P(s_i|O) is estimated with the k-NN algorithm described next.
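For concreteness, a sketch of Viterbi decoding under this decomposition, with log P(S) approximated by first-order state transitions; the table and callback names are assumptions supplied by the caller, not the paper's interfaces.

```python
def viterbi_mi(n, states, log_trans, log_prior, log_state_given_obs):
    """Viterbi decoding under the MI-independence decomposition
        log P(S|O) ~ sum_i [ log P(s_i|s_{i-1}) - log P(s_i) + log P(s_i|O) ].
    n                        : length of the output sequence O
    states                   : list of candidate states
    log_trans[p][s]          : log P(s | p); log_trans['<START>'][s] starts the sequence
    log_prior[s]             : log P(s)
    log_state_given_obs(i,s) : log P(s | O) at position i (here, the k-NN output model)
    All of these are assumed to be precomputed by the caller."""
    best = [{} for _ in range(n)]
    back = [{} for _ in range(n)]
    for s in states:
        best[0][s] = log_trans["<START>"][s] - log_prior[s] + log_state_given_obs(0, s)
        back[0][s] = None
    for i in range(1, n):
        for s in states:
            prev = max(states, key=lambda p: best[i - 1][p] + log_trans[p][s])
            best[i][s] = (best[i - 1][prev] + log_trans[prev][s]
                          - log_prior[s] + log_state_given_obs(i, s))
            back[i][s] = prev
    # recover the highest-scoring state sequence
    last = max(states, key=lambda s: best[n - 1][s])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```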

16/29 k-NN Algorithm for Computing P(s_i|O) (1/2)
Assume P(s_i|O) ≈ P(s_i|E_i), where the pattern entry E_i = o_{i-N} ... o_i ... o_{i+N}.
The k-NN algorithm estimates P(·|E_i) by first finding the K nearest neighbors among the frequently occurring pattern entries to the initial pattern entry E_i and then aggregating them to make a proper estimation of P(·|E_i).

17/29 k-NN Algorithm for Computing P(s_i|O) (2/2)
The conditional state probability distribution P(·|E_i) is obtained by combining the distributions of the K nearest neighbors, weighted by likelihood(E, E_i), the likelihood of a pattern entry E that is one of the K nearest neighbors to the initial pattern entry E_i.
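A sketch of how such a k-NN estimate might be aggregated; the similarity function below stands in for likelihood(E, E_i), and the weighted-average scheme is an assumption since the slide does not give the exact formula.

```python
from collections import Counter

def knn_state_distribution(E_i, pattern_bank, similarity, K=5):
    """Estimate P(state | E_i) for a pattern entry E_i = o_{i-N} ... o_i ... o_{i+N}.
    pattern_bank: dict mapping frequently occurring pattern entries to Counters of
                  the states they were observed with in the training data.
    similarity:   function scoring how close a stored entry E is to E_i; it stands
                  in for likelihood(E, E_i), whose exact form is assumed here."""
    neighbors = sorted(pattern_bank, key=lambda E: similarity(E, E_i), reverse=True)[:K]
    weighted = Counter()
    for E in neighbors:
        w = similarity(E, E_i)
        total = sum(pattern_bank[E].values())
        for state, count in pattern_bank[E].items():
            weighted[state] += w * count / total
    z = sum(weighted.values()) or 1.0
    return {state: v / z for state, v in weighted.items()}
```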

18/29 Post-processing: Cascaded Entity Name Resolution (1/2)
Six patterns are extracted from GENIA:
<ENTITY> := <ENTITY> + head noun, e.g. <ENTITY> binding motif
<ENTITY> := <ENTITY> + <ENTITY>
<ENTITY> := modifier + <ENTITY>, e.g. anti <ENTITY>
<ENTITY> := <ENTITY> + word + <ENTITY>, e.g. <ENTITY> infected <ENTITY>

19/29 Post-processing: Cascaded Entity Name Resolution (2/2)
<ENTITY> := modifier + <ENTITY> + head noun
<ENTITY> := <ENTITY> + <ENTITY> + head noun
In the experiments, all rules instantiating the above six patterns are extracted from the cascaded entity names in the training data and applied to deal with the cascaded entity name phenomenon.
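A toy sketch of applying such rules for the simplest pattern, '<ENTITY> + head noun'; the rule representation, the span convention, and the example rule are assumed simplifications, and a full implementation would cover all six patterns.

```python
def apply_cascade_rules(spans, tokens, rules):
    """Toy post-processing pass for the pattern <ENTITY> + head noun.
    spans : list of (start, end, entity_class) with `end` exclusive, as produced
            by the recognizer over `tokens`.
    rules : dict mapping (inner_entity_class, head_noun) -> outer_entity_class,
            extracted from cascaded entity names in the training data (an assumed
            simplification of the six patterns).
    Returns spans with matching ones expanded to absorb the head noun and relabeled."""
    resolved = []
    for start, end, cls in spans:
        head = tokens[end].lower() if end < len(tokens) else None
        if head is not None and (cls, head) in rules:
            resolved.append((start, end + 1, rules[(cls, head)]))
        else:
            resolved.append((start, end, cls))
    return resolved

# e.g. a learned rule might map ('Protein', 'motif') to some DNA-region class;
# the concrete class names here are hypothetical.
```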

20/29 Experiments (1/5)
PowerBioNE is evaluated on GENIA V3.0 and V1.1.
For GENIA V1.1, 80 abstracts are selected for testing and the remaining 590 abstracts are used as training data.
For GENIA V3.0, 200 abstracts are selected as test data and the remaining 1,800 abstracts as training data.
All experiments are run 10 times and the evaluation results are averaged over the test data.
On average, 63 rules are extracted from the cascaded entity names in GENIA V1.1, while on average 102 rules are extracted from the cascaded entity names in the training data of GENIA V3.0.
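A sketch of this evaluation protocol for the GENIA V3.0 setting (random held-out split, repeated 10 times, scores averaged); the training and scoring callables are placeholders, not part of the paper.

```python
import random

def repeated_holdout(abstracts, train_fn, score_fn, n_test=200, runs=10, seed=0):
    """Hold out n_test abstracts at random, train on the rest, repeat `runs`
    times, and average the scores.
    train_fn(train_set) and score_fn(model, test_set) are placeholder callables;
    score_fn is assumed to return (precision, recall, f_measure)."""
    rng = random.Random(seed)
    totals = [0.0, 0.0, 0.0]
    for _ in range(runs):
        shuffled = list(abstracts)
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        model = train_fn(train)
        p, r, f = score_fn(model, test)
        totals = [t + x for t, x in zip(totals, (p, r, f))]
    return tuple(t / runs for t in totals)
```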

21/29 Experiments (2/5)

22/29 Experiments (2/4)

23/29 Experiments (3/5)

24/29 Analysis for Table 8
The contribution of the word formation pattern feature in the biomedical domain is very limited compared with its contribution in the newswire domain.
The morphological pattern feature is useful.
POS, after adaptation, proves very useful in the biomedical domain.
The head noun trigger feature is very useful.
The special verb trigger feature decreases the recall rate while keeping the precision.
The name alias feature only slightly improves the F-measure, by 0.6. This may be due to the complexity of the name alias phenomenon and the simple strategy applied in the system.
The pattern-based post-processing for cascaded entity name resolution proves very useful.

25/29 Experiments (4/5)
Adding more verb triggers only decreases performance further.

26/29 Experiments (5/5)

27/29 Analysis for Table 10
Stable and significant performance improvement can only be achieved when POS is included with sufficient accuracy.
The performance of the HMM is the highest: HMMs are better at capturing the locality of various biomedical entity names.
Feature vector-based classifiers, such as SVM, C4.5, C4.5 rules and RIPPER, cannot effectively capture the local context dependence because they assume independence between the features, while the baseline naive Bayes classifier fails to capture local context dependence by assuming conditional probability independence among the local context.

28/29 Error Analysis
100 errors are randomly chosen from the results. Errors due to the strict annotation scheme and the annotation inconsistency in the GENIA corpus can be considered acceptable. (total/acceptable)
Left boundary errors (15/12)
Cascaded entity name errors (17/13)
Misclassification errors (16/3)
True negative (29/12)
False positive (18/10)
Miscellaneous (11/1)

29/29 Conclusion
Proposes and integrates various evidential features, including the word formation pattern, morphological pattern, POS, head noun trigger, special verb trigger and name alias feature.
The k-NN algorithm effectively resolves the data sparseness problem.
The pattern-based post-processing deals with the cascaded entity name phenomenon; this is the first system to deal with the cascaded entity name phenomenon.