
1 GAPSCORE: Finding Gene and Protein Names one Word at a Time Jeffrey T. Chang 1, Hinrich Schütze 2 & Russ B. Altman 1 1 Department of Genetics, Stanford Medical Center, 2 Enkata Technologies, USA (Bioinformatics, Vol. 20(2), 2004)

2 Abstract GAPSCORE identifies gene and protein names in text. It scores words based on a statistical model of gene names that quantifies their appearance, morphology and context. Evaluated on the Yapex data set, it achieves an F-score of 82.5% (83.3% recall, 81.5% precision) for partial matches and 57.6% (58.5% recall, 56.7% precision) for exact matches.

3 1. Introduction (1/4) Gene and protein name identification algorithms use combinations of approaches, including: –Dictionary: search against a list of known genes. –Appearance: deduce a word's type from its makeup of characters. –Syntax: filter words based on POS. –Context: use nearby words to infer gene and protein names. –Abbreviation: use abbreviations in the text to help identify names.

4 1. Introduction (2/4) Dictionary –Easy to understand and implement. –Maintaining dictionaries is difficult. Appearance –Many gene names ‘look like’ other gene names. –Some scientific naming conventions, such as those for cell lines or viruses, are similar to those of genes.

5 1. Introduction (3/4) Context –An NP containing a gene name often contains related words, such as those that describe molecular function or interactions. –Such heuristics miss the many occurrences of gene names without context clues. Morphology –The cdk4 and cdk7 genes share the stem ‘cdk’. –Morphology is analogous to appearance.

6 1. Introduction (4/4) GAPSCORE combines syntax, appearance, context and morphology using supervised machine learning methods: Naïve Bayes (NB), Maximum Entropy (ME) and Support Vector Machines (SVMs). Accessible from the web at

7 2. Methods GAPSCORE does not distinguish between genes and proteins. The algorithm consists of five steps: –Tokenize: split the document into sentences and words. –Filter: remove from consideration any word that is clearly not a gene name. –Score: score each remaining word with a machine learning classifier. –Extend: extend each word to the full gene name. –Match abbreviations: score abbreviations of the gene names identified.

8

9 2.1 Tokenize Sentence boundaries: a period, question mark or exclamation point followed by a space and then a capitalized letter is a sentence boundary (periods in ‘e.g.’ are exceptions). Any space and most punctuation are word boundaries. Dashes are handled separately since many gene names contain them (e.g. c-jun, IL-2): a dash is not a boundary when the previous token is a single letter, or the next token is a number or Roman numeral.
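The boundary heuristics above can be sketched in Python (a minimal illustration of the slide's rules, not the authors' code; the `<DOT>` shielding trick for the ‘e.g.’ exception is an assumption):

```python
import re

def split_sentences(text):
    # Heuristic from the slide: '.', '?' or '!' followed by a space and a
    # capital letter ends a sentence; periods in 'e.g.' are an exception,
    # shielded here before splitting and restored afterwards.
    text = text.replace("e.g.", "e<DOT>g<DOT>")
    parts = re.split(r"(?<=[.?!])\s+(?=[A-Z])", text)
    return [p.replace("<DOT>", ".") for p in parts]

def dash_is_boundary(prev_tok, next_tok):
    # A dash is NOT a boundary when the previous token is a single letter,
    # or the next token is a number or Roman numeral (so c-jun and IL-2
    # stay whole).
    roman = re.fullmatch(r"[ivxlcdm]+", next_tok.lower())
    return not (len(prev_tok) == 1 or next_tok.isdigit() or bool(roman))
```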

10 2.2 Filter Apply Brill’s tagger and remove words that are not nouns, adjectives, participles, proper nouns or foreign words. Discard numbers, Roman numerals, Greek letters, amino acids, seven virus names and 13 chemical compounds. Discard names of organisms found in the SWISS-PROT database. Discard words matching a manually created list of 49 regular expression patterns: e.g. protein, DNA, peptide, ATP, receptor.
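A toy version of this filtering step. The POS tag set (Penn Treebank tags) and the discard patterns below are illustrative stand-ins, not the authors' actual 49 patterns or stop lists:

```python
import re

# Stand-ins for the slide's discard list (assumed examples, not the real 49).
DISCARD_PATTERNS = [r"proteins?", r"DNA", r"peptides?", r"ATP", r"receptors?"]

# Nouns, adjectives, participles, proper nouns, foreign words (Penn tags).
KEEP_POS = {"NN", "NNS", "NNP", "NNPS", "JJ", "VBG", "VBN", "FW"}

def passes_filter(word, pos_tag):
    # Reject words whose POS tag is outside the allowed classes, then
    # reject anything matching a discard pattern.
    if pos_tag not in KEEP_POS:
        return False
    return not any(re.fullmatch(p, word, re.IGNORECASE)
                   for p in DISCARD_PATTERNS)
```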

11 2.3 Score They separately score two classes of proteins that are common and easy to recognize unambiguously: enzyme names and cytochrome P450 proteins. –‘-ase’: 327 words that end with ‘-ase’ or ‘-ases’ from Webster’s Second International Dictionary; they manually removed gene names from the list and added ‘gases’, leaving 196 words that are not gene names. –Four regular expression patterns recognize names of the form ‘cytochrome P450 2D6’, ‘p450 IID6’, ‘CYP2d6’, ‘CYPs’. Most words do not match these two special cases; their appearance, morphology and context are encoded as a feature vector for a machine learning classifier.

12 Appearance (1/2) These features encode a 13-dimension vector that describes the appearance of a word. For a specific word, the value of each feature is 1 if it describes the word and 0 otherwise.
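Binary appearance features of this kind can be computed with regular expressions. The patterns below are illustrative examples of such features; the paper's actual 13 dimensions are listed in its table and are not reproduced here:

```python
import re

# Each pattern contributes one 0/1 dimension (illustrative, not the
# authors' exact feature list).
APPEARANCE_PATTERNS = [
    ("has_digit",     r".*\d.*"),
    ("all_caps",      r"[A-Z]+"),
    ("has_dash",      r".*-.*"),
    ("single_letter", r"[A-Za-z]"),
    ("ends_in_ase",   r".*ases?"),
]

def appearance_vector(word):
    # 1 if the pattern matches the whole word, else 0, per feature.
    return [1 if re.fullmatch(p, word) else 0
            for _, p in APPEARANCE_PATTERNS]
```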

13 Appearance (2/2) Recognize gene names that end with ‘-in’: –Use a generic statistical model that learns variable-length N-grams to classify phrases. To train the N-gram model: –Create a training set of words ending with ‘-in’ from Medical Subject Headings (MeSH): 708 unique words. –A word was a protein if it belonged to one of 15 MeSH classes. Words that do not end with ‘-in’ are assigned a score of 0; otherwise the classifier’s score is used. These scores constitute the final dimension of the appearance feature vector.

14 Morphology (1/2) This table shows the variations of gene and protein names that they score in a feature vector. Each variant is either a prefix or suffix added to the word stem.

15 Morphology (2/2) The value of each morphology feature is log(max(1/1000, #Vars/#Stems)), where #Stems is the number of times a stem appears by itself in MEDLINE, and #Vars is the total number of times the stem appears with a variation. Empirically, the ratio of these counts, when plotted for all words in MEDLINE, follows an exponential distribution; therefore, to improve discrimination in machine learning, they take the log of that ratio.
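The feature value can be written directly from the formula (the guard against a zero stem count is an assumption added for safety; the 1/1000 floor keeps the log finite when a stem never appears with a variation):

```python
import math

def morphology_feature(n_vars, n_stems):
    # value = log(max(1/1000, #Vars/#Stems)); the log compresses the
    # exponentially distributed ratio to improve discrimination.
    n_stems = max(n_stems, 1)  # guard against division by zero (assumption)
    return math.log(max(1.0 / 1000, n_vars / n_stems))
```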

16 Context (1/3) Gene names should appear most often next to positive signal words and least often next to negative ones. To find the signal words, they created a training set of 1,025 words, which includes 574 gene names: –They randomly chose 500 nouns that appeared in year-2001 MEDLINE abstracts containing the word ‘gene’ or ‘protein’. –To increase the prevalence of gene names, they added 525 more words that appeared before ‘gene’, ‘protein’ or ‘mRNA’.

17 Context (2/3) A 2×2 contingency table: (A) is the number of genes from the training set found before ‘expression’ anywhere in MEDLINE, (B) is the number of genes never found before ‘expression’, (C) is the number of non-genes found before ‘expression’ and (D) is the number of non-genes never found before ‘expression’. If ‘expression’ is a strong signal that the previous word is a gene name, then the ratio of genes to non-genes will be higher in the first column than in the second.

18 Context (3/3) Calculate the significance of the difference in the ratios using a χ² test; this yields 2,567 words. Select the most common signal words: 9 positive and 9 negative. Each feature is the number of times a word occurs with each signal word across all of MEDLINE. The distribution across signal words is obtained by normalizing the feature vector to sum to 1.0.
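The 2×2 test and the normalization step can be sketched as follows, using the closed-form Pearson χ² statistic for a 2×2 table (a sketch of the computation, not the authors' implementation):

```python
def chi_square_2x2(a, b, c, d):
    # a: genes seen before the signal word, b: genes never seen before it,
    # c: non-genes seen before it,          d: non-genes never seen before it.
    # Closed-form Pearson chi-square statistic for a 2x2 table.
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def normalize(counts):
    # Turn per-signal-word co-occurrence counts into a distribution
    # summing to 1.0, as the slide describes.
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)
```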

19

20 2.4 Classifier From 634 MEDLINE abstracts cited by a review article on pharmacogenomics, they manually categorized each word as either a gene name or a non-gene. For multi-word gene names, only the core gene-meaning words were labeled. Including 8,617 words from MeSH, this gives 19,952 unique labeled words for training. Three types of classifiers: NB, ME and SVM.

21 2.5 Extend to NP Identify multi-word gene names using heuristics similar to those of Fukuda et al. (1998): –Include the nouns, adjectives and participles preceding the putative gene name. –Lengthen the name to include following words that are single letters, Greek letters or Roman numerals. –Remove extraneous punctuation at the beginning or end of the name, except for open or close parenthesis characters required to complete a pair.
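A sketch of the rightward-extension heuristic above (the Greek-letter set is a tiny illustrative subset; leftward extension and punctuation trimming are omitted):

```python
import re

GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}  # illustrative subset

def extend_right(tokens, i):
    # Starting from the putative gene name at index i, absorb following
    # tokens that are single letters, Greek letters or Roman numerals.
    j = i + 1
    while j < len(tokens):
        t = tokens[j].lower()
        if len(tokens[j]) == 1 or t in GREEK or re.fullmatch(r"[ivx]+", t):
            j += 1
        else:
            break
    return tokens[i:j]
```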

22 2.6 Match Abbreviations Search for abbreviations in the document using the algorithm described in Chang et al. (2002). If the long form of an abbreviation has a higher score, that score is transferred to the abbreviation.

23 2.7 Evaluation Evaluate against the Yapex gold-standard test set. –Exact match: the prediction is equivalent to the corresponding name in the gold standard. –Sloppy match: the prediction overlaps the name in the gold standard.
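The two criteria, interpreted on character spans (an assumption about representation; the official Yapex scorer may differ in details):

```python
def exact_match(pred, gold):
    # (start, end) character offsets must be identical.
    return pred == gold

def sloppy_match(pred, gold):
    # Partial credit: the predicted span overlaps the gold span at all.
    return pred[0] < gold[1] and gold[0] < pred[1]
```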

24 3. Results (1/4) Using optimal parameters, compare the performance of NB, ME and SVM on the Yapex training set, scoring with sloppy matches.

25 3. Results (2/4) With the best classifier parameters, test the algorithm with various modules disabled.

26 3. Results (3/4) Sloppy matches – Yapex: 75.4% F-score (70.3% recall, 81.4% precision); GAPSCORE: 82.5% F-score (83.3% recall, 81.8% precision)

27 3. Results (4/4) Exact matches – Yapex: 54.3% F-score (50.1% recall, 59.3% precision); GAPSCORE: 57.6% F-score (58.5% recall, 56.7% precision)

28 6. Conclusion GAPSCORE finds gene and protein names by combining novel formulations of features in a machine learning framework. SVMs slightly outperform other popular methods. When applied to the Yapex text collection, the method achieves high performance due to its sophisticated analysis of single words and the high prevalence of single word gene names. The algorithm produces confidence scores that can be adjusted for either high recall or precision.