1 GAPSCORE: Finding Gene and Protein Names one Word at a Time
Jeffery T. Chang 1, Hinrich Schutze 2 & Russ B. Altman 1
1 Department of Genetics, Stanford Medical Center, 2 Enkata Technologies, USA
(Bioinformatics, Vol. 20(2), pp. 216-225, 2004)

2 Abstract
GAPSCORE identifies gene and protein names in text. It scores words based on a statistical model of gene names that quantifies their appearance, morphology and context. Evaluated on the Yapex data set, it achieves an F-score of 82.5% (83.3% recall, 81.5% precision) for partial matches and 57.6% (58.5% recall, 56.7% precision) for exact matches.

3 1. Introduction (1/4)
Gene and protein name identification algorithms use combinations of approaches, including:
–Dictionary: search a list of known genes.
–Appearance: deduce the word type from its makeup of characters.
–Syntax: filter words based on part of speech (POS).
–Context: use nearby words to infer gene and protein names.
–Abbreviation: use abbreviations in the text to help identify names.

4 1. Introduction (2/4)
Dictionary
–Easy to understand and implement.
–Maintaining dictionaries is difficult.
Appearance
–Many gene names ‘look like’ other gene names.
–Some scientific naming conventions, such as those for cell lines or viruses, are similar to those of genes.

5 1. Introduction (3/4)
Context
–A noun phrase (NP) with a gene name often contains related words, such as those that describe molecular function or interactions.
–Such heuristics miss the many occurrences of gene names without context clues.
Morphology
–The cdk4 and cdk7 genes share the stem ‘cdk’.
–Morphology is analogous to appearance.

6 1. Introduction (4/4)
GAPSCORE combines syntax, appearance, context and morphology using supervised machine learning methods: Naïve Bayes (NB), Maximum Entropy (ME) and Support Vector Machines (SVMs).
Accessible from the web at http://bionlp.stanford.edu/gapscore/

7 2. Methods
GAPSCORE does not distinguish between genes and proteins. The algorithm consists of five steps:
–Tokenize: Split the document into sentences and words.
–Filter: Remove from consideration any word that is clearly not a gene name.
–Score: Score each remaining word using a machine learning classifier.
–Extend: Extend each word to the full gene name.
–Match abbreviation: Score abbreviations of the gene names identified.

9 2.1 Tokenize
Sentence boundaries: a period, question mark or exclamation point followed by a space and then a capitalized letter is a sentence boundary (periods in ‘e.g.’ are exceptions).
Word boundaries: any space and most punctuation are word boundaries.
Dashes are handled separately since many gene names contain them (e.g. c-jun, IL-2). A dash is not a boundary when the previous token is a single letter, or the next token is a number or roman numeral.
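
The sentence and dash rules above can be sketched in Python. This is a minimal illustration, not the authors' code: the function names, the extra ‘i.e.’ exception and the roman-numeral pattern are assumptions.

```python
import re

# '.', '?' or '!' followed by one space and then a capital letter.
_BOUNDARY = re.compile(r"(?<=[.?!]) (?=[A-Z])")
# Abbreviations whose final period must not end a sentence.
# Only 'e.g.' is named in the slides; 'i.e.' is an assumed addition.
_EXCEPTIONS = ("e.g.", "i.e.")
# Assumed pattern for small roman numerals.
_ROMAN = re.compile(r"^[IVX]+$")


def split_sentences(text):
    """Split text at sentence boundaries, skipping the listed exceptions."""
    sentences, start = [], 0
    for m in _BOUNDARY.finditer(text):
        candidate = text[start:m.start()]
        if candidate.endswith(_EXCEPTIONS):
            continue  # period belongs to an abbreviation, keep going
        sentences.append(candidate)
        start = m.end()
    sentences.append(text[start:])
    return sentences


def keep_dash(prev_part, next_part):
    """True when the dash between two parts is NOT a word boundary."""
    single_letter = len(prev_part) == 1 and prev_part.isalpha()
    return (single_letter
            or next_part.isdigit()
            or bool(_ROMAN.match(next_part)))


def split_on_dashes(word):
    """Split a dashed word, keeping dashes that look gene-name internal."""
    parts = word.split("-")
    tokens = [parts[0]]
    for prev_part, next_part in zip(parts, parts[1:]):
        if keep_dash(prev_part, next_part):
            tokens[-1] += "-" + next_part
        else:
            tokens.append(next_part)
    return tokens
```

With these rules, `split_on_dashes("c-jun")` and `split_on_dashes("IL-2")` each stay a single token, while `split_on_dashes("DNA-binding")` is split into two.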

10 2.2 Filter
–Apply Brill’s tagger and remove words that are not nouns, adjectives, participles, proper nouns or foreign words.
–Discard numbers, roman numerals, greek letters, amino acids, seven virus names and 13 chemical compounds.
–Discard names of organisms found in the SWISS-PROT database.
–Discard words matching a manually created list of 49 regular expression patterns, e.g. protein, DNA, peptide, ATP, receptor.
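
A rough sketch of such a filter, assuming the tagger emits Penn Treebank tags. The tag set, the short greek-letter list and the five patterns below are illustrative stand-ins; the actual 49 patterns and the amino acid, virus, compound and organism lists are not given in the slides and are omitted here.

```python
import re

# Assumed Penn Treebank tags for nouns, proper nouns, adjectives,
# participles and foreign words.
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "VBG", "VBN", "FW"}
# Illustrative subset of greek letter names.
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}
NUMBER = re.compile(r"^\d+(\.\d+)?$")
ROMAN = re.compile(r"^[IVX]+$")
# Hypothetical patterns built from the five example words on the slide.
STOP_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"^proteins?$", r"^DNA$", r"^peptides?$", r"^ATP$", r"^receptors?$")
]


def is_candidate(word, tag):
    """Return True if the tagged word survives the filter step."""
    if tag not in KEEP_TAGS:
        return False
    if NUMBER.match(word) or ROMAN.match(word):
        return False
    if word.lower() in GREEK:
        return False
    return not any(p.match(word) for p in STOP_PATTERNS)
```

For example, `is_candidate("cdk4", "NN")` passes the filter, while `is_candidate("protein", "NN")` and `is_candidate("binds", "VBZ")` do not.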

11 2.3 Score
They score separately two classes of proteins that are common and easy to recognize unambiguously: enzyme names and cytochrome P450 proteins.
–-ase: They collected 327 words ending in ‘-ase’ or ‘-ases’ from Webster’s Second International Dictionary, manually removed the gene names from the list and added ‘gases’, leaving 196 words that are not gene names.
–They use 4 regular expression patterns to recognize names of the form ‘cytochrome P450 2D6’, ‘p450 IID6’, ‘CYP2d6’ and ‘CYPs’.
Most words do not match these two special cases. For those words, they encode appearance, morphology and context as a feature vector for a machine learning classifier.
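
The two special cases might look like the following sketch. The exclusion set is a tiny illustrative sample of the 196-word list, and the four regular expressions are guesses written to match the four example forms; the authors' actual patterns are not shown in the slides.

```python
import re

# Illustrative sample of '-ase' words that are not gene names.
NON_GENE_ASE = {"base", "bases", "phase", "phases",
                "disease", "diseases", "increase", "release", "gases"}


def is_enzyme_name(word):
    """'-ase' / '-ases' words count as enzymes unless explicitly excluded."""
    w = word.lower()
    return w.endswith(("ase", "ases")) and w not in NON_GENE_ASE


# Hypothetical patterns, one per example form on the slide.
P450_PATTERNS = [
    re.compile(r"^cytochrome P450 \d+[A-Z]\d+$", re.IGNORECASE),  # cytochrome P450 2D6
    re.compile(r"^p450 [IVX]+[A-Z]\d+$", re.IGNORECASE),          # p450 IID6
    re.compile(r"^CYP\d+[A-Z]\d+$", re.IGNORECASE),               # CYP2d6
    re.compile(r"^CYPs?$"),                                       # CYPs
]


def is_p450_name(name):
    """True if the name matches any of the assumed P450 patterns."""
    return any(p.match(name) for p in P450_PATTERNS)
```

Here `is_enzyme_name("kinase")` is accepted and `is_enzyme_name("gases")` is rejected, and all four example forms from the slide satisfy `is_p450_name`.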

12 2.3.1 Appearance (1/2)
These features form a 13-dimensional vector that describes the appearance of a word. For a specific word, the value of each feature is 1 if it describes the word and 0 otherwise.

13 2.3.1 Appearance (2/2)
Recognize gene names that end with ‘-in’:
–Use a generic statistical model that learns variable-length N-grams to classify phrases.
To train the N-gram model:
–Create a training set of words ending in ‘-in’ from Medical Subject Headings (MeSH): 708 unique words.
–A word was labeled a protein if it belonged to one of 15 MeSH classes.
For words that do not end with ‘-in’, assign a score of 0. Otherwise, use the score from the classifier; those scores constitute the final dimension of the appearance feature vector.

14 2.3.2 Morphology (1/2) This table shows variations of gene and protein names that they score in a feature vector. Each variant is either a prefix or suffix of the word stem.

15 2.3.2 Morphology (2/2)
The value of each morphology feature is
log(max(1/1000, #Vars / #Stems))
where #Stems is the number of times a stem appears by itself in MEDLINE, and #Vars is the total number of times the stem appears with a variation.
Empirically, the ratio of these counts, when plotted for all words in MEDLINE, follows an exponential distribution. Therefore, to improve discrimination in machine learning, they take the log of that ratio.
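
As a worked example, the feature can be computed directly from the two counts. The natural logarithm and the error raised for a stem that never appears alone are assumptions; the slides specify neither the log base nor that edge case.

```python
import math


def morphology_feature(n_vars, n_stems, floor=1 / 1000):
    """log(max(floor, #Vars / #Stems)), using the natural log (assumed)."""
    if n_stems <= 0:
        # Behaviour for a stem never seen alone is not given in the slides.
        raise ValueError("n_stems must be positive")
    return math.log(max(floor, n_vars / n_stems))
```

The floor of 1/1000 keeps the feature finite when a stem never occurs with the variation: `morphology_feature(0, 50)` gives log(1/1000), the same value as any ratio below the floor.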

16 2.3.3 Context (1/3)
Gene names should appear most often next to positive signals and least often next to negative ones.
To find the signal words, they created a training set of 1,025 words, which includes 574 gene names.
–They randomly chose 500 nouns that appeared in year 2001 MEDLINE abstracts containing the word ‘gene’ or ‘protein’.
–To increase the prevalence of gene names, they added 525 more words that appeared before ‘gene’, ‘protein’ or ‘mrna’.

17 2.3.3 Context (2/3)
For a candidate signal word such as ‘expression’, a 2x2 contingency table contains:
–(A) the number of genes from the training set found before ‘expression’ anywhere in MEDLINE,
–(B) the number of genes never found before ‘expression’,
–(C) the number of non-genes found before ‘expression’,
–(D) the number of non-genes never found before ‘expression’.
If ‘expression’ is a strong signal that the previous word is a gene name, then the ratio of genes to non-genes would be higher in the 1st column (A, C) than in the 2nd (B, D).

18 2.3.3 Context (3/3)
Calculate the significance of the difference in the ratio using a χ² test; this yields 2,567 words.
From these, select the most common signal words: 9 positive and 9 negative.
Each feature is the number of times that a word occurs with each signal word across all of MEDLINE. They calculated the distribution across signal words by normalizing the feature vector to 1.0.
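
The test on the 2x2 table and the normalization can be sketched as below. The statistic is the standard Pearson χ² for a 2x2 table without continuity correction; whether the authors applied a correction is not stated. Normalizing so that the counts sum to 1.0 is also an assumed reading of ‘normalizing to 1.0’.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]].

    a: genes found before the signal word, b: genes never found before it,
    c: non-genes found before it,          d: non-genes never found before it.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a degenerate table carries no evidence either way
    return n * (a * d - b * c) ** 2 / denominator


def normalize(counts):
    """Scale co-occurrence counts so that they sum to 1.0 (assumed reading)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [count / total for count in counts]
```

A table with no association, such as `chi_square_2x2(10, 10, 10, 10)`, scores 0, while a perfectly separating signal word, `chi_square_2x2(20, 0, 0, 20)`, scores 40.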

20 2.3.4 Classifier
From 634 MEDLINE abstracts cited by a review article on pharmacogenomics, they manually categorized each word as either a gene name or a non-gene.
–For a multi-word gene name, only the core gene-meaning words are labeled as gene names.
–Include 8,617 words from MeSH.
–19,952 unique labeled words for training.
Three types of classifiers: NB, ME and SVM.

21 2.4 Extend to NP
Identify multi-word gene names using heuristics similar to those in Fukuda et al. (1998):
–Include the nouns, adjectives and participles preceding the putative gene name.
–Lengthen the name to include the following words that are single letters, greek letters and roman numerals.
–Remove extraneous punctuation at the beginning or end of the name, except for open or close parenthesis characters required to complete a pair.
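
The first two heuristics can be sketched over a list of (word, tag) pairs. Penn Treebank tags, the greek-letter subset and the function names are assumptions; punctuation trimming is left out to keep the sketch short.

```python
import re

# Assumed Penn Treebank tags for nouns, adjectives and participles.
LEFT_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "VBG", "VBN"}
# Illustrative subset of greek letter names.
GREEK = {"alpha", "beta", "gamma", "delta", "epsilon", "kappa"}
ROMAN = re.compile(r"^[IVX]+$")


def extends_right(word):
    """Single letters, greek letters and roman numerals extend a name."""
    single_letter = len(word) == 1 and word.isalpha()
    return single_letter or word.lower() in GREEK or bool(ROMAN.match(word))


def extend_name(tagged, i):
    """Return the inclusive (start, end) span of the name around index i.

    tagged: list of (word, tag) pairs; i: index of the putative gene word.
    """
    start = i
    while start > 0 and tagged[start - 1][1] in LEFT_TAGS:
        start -= 1  # absorb preceding nouns, adjectives, participles
    end = i
    while end + 1 < len(tagged) and extends_right(tagged[end + 1][0]):
        end += 1    # absorb trailing letters and numerals
    return start, end
```

For the tagged phrase ‘the human interferon gamma binds’, starting from ‘interferon’ the span grows left to ‘human’ and right to ‘gamma’.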

22 2.5 Match Abbreviations Search for abbreviations in the document using the algorithm described in Chang et al. (2002). If the long form of an abbreviation has a higher score, it transfers that score to the abbreviation.
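
The score transfer itself is simple once the abbreviation pairs are known. The sketch below assumes the pairs have already been found by some abbreviation detector; the function name and data layout are hypothetical.

```python
def transfer_scores(scores, abbreviations):
    """Copy a long form's score to its abbreviation when it is higher.

    scores: dict mapping a name to its gene-name score.
    abbreviations: list of (long_form, short_form) pairs found in the document.
    """
    updated = dict(scores)
    for long_form, short_form in abbreviations:
        long_score = scores.get(long_form)
        if long_score is None:
            continue  # long form was never scored
        if long_score > updated.get(short_form, float("-inf")):
            updated[short_form] = long_score
    return updated
```

An abbreviation that already scores higher than its long form keeps its own score.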

23 2.6 Evaluation
Evaluate against the Yapex test gold standard.
–Exact match: the predicted name is equivalent to the corresponding name in the gold standard.
–Sloppy match: the predicted name overlaps the name in the gold standard.
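
The two matching criteria can be illustrated with character spans. Representing names as half-open (start, end) offsets, and the way precision and recall are counted here, are assumptions rather than the Yapex scoring procedure.

```python
def exact_match(pred, gold):
    """Spans are (start, end) offsets; exact means identical boundaries."""
    return pred == gold


def sloppy_match(pred, gold):
    """Sloppy means the two half-open spans share at least one character."""
    return pred[0] < gold[1] and gold[0] < pred[1]


def evaluate(predicted, gold, match):
    """Return (precision, recall) under the given matching criterion."""
    matched_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(match(p, g) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall
```

A prediction with a slightly wrong boundary counts as a hit under the sloppy criterion but as a miss under the exact one, which is why the exact-match figures are lower.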

24 3. Results (1/4)
Using optimal parameters, compare the performance of NB, ME and SVM on the Yapex training set, scoring with sloppy matches.

25 3. Results (2/4) With the best classifier parameters, test the algorithm with various modules disabled.

26 3. Results (3/4)
Sloppy matches:
–Yapex: 75.4% F-score (70.3% recall, 81.4% precision)
–GAPSCORE: 82.5% F-score (83.3% recall, 81.8% precision)

27 3. Results (4/4)
Exact matches:
–Yapex: 54.3% F-score (50.1% recall, 59.3% precision)
–GAPSCORE: 57.6% F-score (58.5% recall, 56.7% precision)
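
The F-scores on this slide are consistent with the balanced F-score, the harmonic mean of recall and precision. The slides do not state the formula, so its use here is an assumption; the small helper below recomputes it from the reported percentages.

```python
def f_score(recall, precision):
    """Balanced F-score: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Exact-match figures from the slide, in percent.
yapex_f = round(f_score(50.1, 59.3), 1)     # 54.3
gapscore_f = round(f_score(58.5, 56.7), 1)  # 57.6
```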

28 6. Conclusion GAPSCORE finds gene and protein names by combining novel formulations of features in a machine learning framework. SVMs slightly outperform other popular methods. When applied to the Yapex text collection, the method achieves high performance due to its sophisticated analysis of single words and the high prevalence of single word gene names. The algorithm produces confidence scores that can be adjusted for either high recall or precision.
