Predicting Gene Functions from Text Using a Cross-Species Approach Emilia Stoica and Marti Hearst School of Information University of California, Berkeley.

Research supported by NSF DBI and a gift from Genentech.

Goal: Annotate genes with functional information derived from journal articles.

Gene Ontology (GO): A controlled vocabulary for functional annotation, with ~17,600 terms (circa July 2004), organized into three distinct acyclic graphs: molecular functions, biological processes, and cellular locations. More general terms are “parents” of less general terms: development (GO: ) is the parent of embryonic development (GO: ).

Challenges: GO tokens might not appear explicitly. Example: PubMed GO: : negative regulation of cell proliferation occurs as “inhibition of cell proliferation”. GO tokens might also not occur contiguously. Example: PubMed , GO: : G-protein coupled receptor protein signaling pathway occurs as “Results indicate that CCR1-mediated responses are regulated … in the signaling pathway, by receptor phosphorylation at the level of receptor G/protein coupling … CCR1 binds MIP-1 alpha.”

Challenges: The simplest strategy (assigning GO codes to genes simply because the GO tokens occur near the gene) yields a large number of false positives. Issues: (a) the text does not contain evidence to support the annotation; (b) the text contains evidence for the annotation, but the curator knows the gene to be involved in a function that is more general or more specific than the GO code matched in the text.

Challenges: GO contains hints about what kinds of evidence are required for annotation, e.g., the text should mention co-purification or co-immunoprecipitation experiments. Requiring these evidence terms does not seem to improve algorithms.

Related Work: Mainly in the context of the BioCreative competition (2004).
Chiang and Yu 2003, 2004: Find phrase patterns commonly used in sentences describing gene functions (e.g., “gene plays an important role in”, “gene is involved in”). Final assignments are made with a Naïve Bayes classifier.
Ray and Craven 2004, 2005: Learn a statistical model for each GO code (which words are likely to co-occur in the paragraphs containing GO codes). Decide among candidates via a multinomial Naïve Bayes classifier.
Rice et al. 2004: Train an SVM for each GO code. Target genes are assigned the best-scoring GO code.

Related Work, cont.
Couto et al.: Determine whether the “information content” of the matching GO terms is larger than for all the candidate GO terms.
Verspoor et al.: Expand GO tokens with words that frequently co-occur in a training set; use a categorizer that explores the structure of the Gene Ontology to find the best hits.
Ehler and Ruch 2004: Treat each document as a query to be categorized. Create a score based on a combination of pattern matching and TF*IDF weighting. Annotate the gene with the top-scoring GO codes.

Our Approach: Two main contributions: (1) use cross-species information (CSM); (2) check for biological (in)consistencies (CSC).

Cross-Species Match, Main Idea: Use orthologous genes (genes of different species that have evolved directly from a common ancestor). Assumption: Since there is an overlap between the genomes of the two species, their orthologs may share some functions, and consequently some GO codes. Idea: To predict GO codes for target genes in the target species, use the GO codes assigned to their orthologous genes. We use mouse vs. human genes.

General procedure: Analyze text at the sentence level. Eliminate stop words and punctuation characters, and divide the text into tokens using space as the delimiter. Normalize and match different variations of gene names using the algorithm of Bhalotia et al. ’03. For every sentence that contains the target gene, a GO code is matched if the fraction of its GO tokens found in the sentence reaches a threshold (0.75 for CSM and 1 for CSC).
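The sentence-level matching step above can be sketched roughly as follows. This is only an illustration: the stop-word list, the regular expression, and the function names `tokenize`, `go_match_fraction`, and `matches` are assumptions, not taken from the slides, and gene-name normalization (Bhalotia et al.) is left out. The threshold is read as "at least", since a fraction cannot exceed 1.

```python
import re

# Tiny illustrative stop-word list (assumption; the slides do not name one).
STOP_WORDS = {"a", "an", "and", "by", "in", "of", "or", "the", "to"}


def tokenize(text):
    """Lowercase, replace punctuation (except hyphens) with spaces,
    split on whitespace, and drop stop words."""
    cleaned = re.sub(r"[^\w\s-]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def go_match_fraction(go_term, sentence):
    """Fraction of the GO term's tokens that occur anywhere in the sentence."""
    go_tokens = set(tokenize(go_term))
    if not go_tokens:
        return 0.0
    sentence_tokens = set(tokenize(sentence))
    return len(go_tokens & sentence_tokens) / len(go_tokens)


def matches(go_term, sentence, threshold):
    """True if the matched fraction reaches the threshold
    (0.75 for CSM, 1.0 for CSC according to the slide)."""
    return go_match_fraction(go_term, sentence) >= threshold
```

Because tokens are compared as sets, the GO tokens do not need to be contiguous in the sentence, which addresses the second challenge listed earlier; synonyms such as "inhibition" for "negative regulation" are still missed.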

Cross-Species Match Algorithm, CSM(g, a): For a target gene g, search in article a for only the GO codes annotated to its ortholog. If at least 75% of the GO code terms are found in a sentence containing the gene name, the code is matched. Note: We must eliminate annotations of orthologs marked with IEA and ISS codes to avoid circular references.
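A minimal sketch of CSM(g, a) under simplifying assumptions: the article is already split into sentences, the gene name is a single token, and no stop-word removal or gene-name normalization is applied. The function name `csm`, the data layout, and the GO identifiers and term names in the usage example are hypothetical placeholders.

```python
import re

# Ortholog annotations carrying these evidence codes are skipped to avoid circularity.
EXCLUDED_EVIDENCE = {"IEA", "ISS"}


def _tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def csm(gene, sentences, ortholog_annotations, go_names, threshold=0.75):
    """Return the ortholog's GO ids that are matched in a sentence mentioning `gene`.

    ortholog_annotations: iterable of (go_id, evidence_code) pairs
    go_names: dict mapping go_id -> GO term name
    """
    candidates = {go_id for go_id, evidence in ortholog_annotations
                  if evidence not in EXCLUDED_EVIDENCE}
    matched = set()
    for sentence in sentences:
        sentence_tokens = _tokens(sentence)
        if gene.lower() not in sentence_tokens:
            continue  # only sentences containing the target gene are considered
        for go_id in candidates:
            go_tokens = _tokens(go_names[go_id])
            if go_tokens and len(go_tokens & sentence_tokens) / len(go_tokens) >= threshold:
                matched.add(go_id)
    return matched


# Toy data: identifiers and annotations are made up for illustration.
example = csm(
    "vhl",
    ["vhl inhibits transcription elongation, mRNA stability and PKC activity",
     "Transcription is discussed here without the gene."],
    [("GO:T", "IDA"), ("GO:E", "IEA"), ("GO:B", "TAS")],
    {"GO:T": "transcription", "GO:E": "transcription elongation", "GO:B": "protein binding"},
)
```

Restricting the search to the ortholog's codes is what keeps the candidate set small; the IEA-annotated code is dropped even though its tokens appear in the sentence.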

Cross-Species Correlation, Main Idea: Observation: Since GO codes indicate gene function, it is logical for some to often co-occur in annotations and for others to rarely do so. Assumption: If one GO code tends to occur in the orthologous genes’ annotations when another one does not, then assume the second is not a valid assignment for the target species. Example: If the text seems to contain evidence for rRNA transcription (GO: ), nucleolus (GO: ), and extracellular (GO: ), then extracellular is suspicious. The algorithm identifies the “suspicious” cases.

Cross-Species Correlation Algorithm: For every pair of GO codes in the orthologous genes database, compute a χ² coefficient.
N: the total number of GO codes
O11: # of times the ortholog is annotated with both GO1 and GO2
O12: # of times the ortholog is annotated with GO1 but not GO2
O21: # of times the ortholog is annotated with GO2 but not GO1
O22: # of times the ortholog is not annotated with GO1 or GO2
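The formula itself did not survive in the transcript. One plausible reading, given the four cell counts defined above, is the standard Pearson χ² statistic for a 2×2 contingency table; the sketch below assumes that form and takes N to be the sum of the four cells, which may differ from how the slide defines N. The function name `chi_square_2x2` is illustrative.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square for a 2x2 contingency table (assumed form):

        chi2 = N * (O11*O22 - O12*O21)^2
               / ((O11+O12) * (O11+O21) * (O12+O22) * (O21+O22))

    with N = O11 + O12 + O21 + O22.
    """
    n = o11 + o12 + o21 + o22
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        return 0.0  # a row or column is empty, so no association can be measured
    return n * (o11 * o22 - o12 * o21) ** 2 / denominator


# Two codes that always appear together are strongly associated;
# codes that co-occur at chance level score zero.
always_together = chi_square_2x2(20, 0, 0, 20)
independent = chi_square_2x2(10, 10, 10, 10)
```

With one degree of freedom, a value above 3.84 is significant at the 0.05 level, which is the cutoff the CSC pseudocode later in the deck uses.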

Cross-Species Correlation Algorithm
M(g,a) = GO codes matched in article a for gene g
O(g) = GO codes assigned to the ortholog of g
o = size of O(g), p = percentage (0.2)
For every potentially matching GO code GO1 in M(g,a):
  For every GO code GO2 in O(g):
    Count how often χ²(GO1, GO2) is significant
  If this count is < p*o, then assume GO1 is not valid; else assign GO1 to g
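A rough Python sketch of this filter, assuming the pairwise χ² values are available through a caller-supplied function `chi2(go1, go2)` (a hypothetical interface) and that "significant" means exceeding 3.84, as in the CSC pseudocode slide at the end of the deck. That slide uses a strict `count > p*o` test, whereas this slide keeps a code when the count is not below p*o; the sketch follows this slide.

```python
CHI2_CRITICAL = 3.84  # 0.05 significance level, 1 degree of freedom


def csc(matched, ortholog_codes, chi2, p=0.2):
    """Keep each matched GO code that correlates significantly with at least
    a fraction p of the ortholog's GO codes.

    matched: GO codes matched in the article for the gene, M(g, a)
    ortholog_codes: GO codes assigned to the gene's ortholog, O(g)
    chi2: function (go1, go2) -> chi-square coefficient for the pair
    """
    o = len(ortholog_codes)
    accepted = set()
    for go1 in matched:
        count = sum(
            1 for go2 in ortholog_codes
            if go1 != go2 and chi2(go1, go2) > CHI2_CRITICAL
        )
        if count >= p * o:  # "count < p*o" means GO1 is assumed not valid
            accepted.add(go1)
    return accepted


# Toy example with made-up codes and coefficients.
_table = {frozenset(("X", "A")): 10.0, frozenset(("X", "B")): 7.5}
toy_chi2 = lambda a, b: _table.get(frozenset((a, b)), 0.0)
kept = csc({"X", "Y"}, ["A", "B", "C", "D", "E"], toy_chi2)
```

Here "X" correlates with two of the five ortholog codes (above the 20% bar) and is kept, while "Y" correlates with none and is treated as suspicious.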

Information Flow

Evaluation using BioCreative: Task 2.2: Annotate 138 human genes with GO codes using 99 full-text articles; for each annotation, provide the passage of text that the annotation was based upon. Annotations from participants were manually judged by human curators. A prediction was considered “perfect” if the text passage contained the gene name and provided evidence for annotating the gene with the GO code.

Results on BioCreative: Our research was conducted after the competition had passed, so our annotations could not be judged by the same curators. Instead, we used the “perfect predictions” as the gold standard (this is unfair to our system, since it ignores relevant predictions we find that other systems do not). Our prediction is correct if it matches a perfect prediction (e.g., vhl is annotated with transcription (GO: ) in PubMed “vhl inhibits transcription elongation, mRNA stability and PKC activity”).
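Scoring against the set of perfect predictions could look like the sketch below. It assumes that each prediction is a (gene, GO code) pair, that recall is measured against the number of perfect predictions, and that the F-measure is the balanced F1; none of these details is spelled out on the slides, and the function name `score` is illustrative.

```python
def score(predictions, perfect_predictions):
    """Return (true positives, precision, recall, F-measure) for a set of
    (gene, go_code) predictions judged against the 'perfect' set."""
    predictions = set(predictions)
    perfect_predictions = set(perfect_predictions)
    tp = len(predictions & perfect_predictions)
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(perfect_predictions) if perfect_predictions else 0.0
    if precision + recall == 0:
        return tp, precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return tp, precision, recall, f_measure


# Toy example with made-up gene and GO identifiers.
result = score(
    {("g1", "GO:1"), ("g1", "GO:2"), ("g2", "GO:3"), ("g3", "GO:4")},
    {("g1", "GO:1"), ("g2", "GO:3"), ("g2", "GO:5"), ("g4", "GO:6")},
)
```

A prediction that is biologically valid but absent from the perfect set counts as a false positive here, which is the source of the unfairness noted above.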

BioCreative Results
System            Precision   TP (Recall)   F-measure
CSM                           (0.07)        0.11
CSC                           (0.19)        0.18
CSM+CSC                       (0.21)        0.23
Ray and Craven                (0.22)        0.22
Chiang and Yu                 (0.16)        0.21
Ehler and Ruch                (0.33)        0.18
Couto et al.                  (0.25)        0.13
Verspoor et al.               (0.08)        0.07
Rice et al.                   (0.07)        0.05

Results on Larger Dataset: A much larger test set has been made publicly available by Chiang and Yu. EBI human test set: 4,410 genes, 13,626 GO code annotations. MGI mouse test set: 2,188 genes, 6,338 GO code annotations. Note that Chiang and Yu used the same data for both training and testing.

Results on EBI Human and MGI datasets
EBI human: 4,410 genes and 5,714 abstracts
MGI: 2,188 genes and 1,947 abstracts
Dataset   System          Precision   Recall   F-measure
EBI       CSM
EBI       CSM+CSC
EBI       Chiang and Yu
MGI       CSM
MGI       CSM+CSC
MGI       Chiang and Yu

Conclusions and Future Work: We propose an algorithm that annotates genes with GO codes using the information available from other species. Experimental results on three datasets show that our algorithm consistently achieves higher F-measures than other solutions. Future improvements to our algorithm: (1) combine, or use a voting scheme between, the predictions our system makes and the predictions of a machine learning system; (2) investigate how effective other genes with sequences similar to the target gene (but not orthologous to it) are for predicting the GO codes.

Thank you! Research Supported by NSF DBI and a gift from Genentech

Example: The marked accumulation of lipid droplets in LNCaP cells ... is accompanied by an increase in phospholipid synthesis. The increase in PAP-2 might be related to changes in lipid metabolism … Since PAP-2 plays a pivotal role in the control of signal transduction by lipid mediators, the ability of androgens to stimulate this enzyme in prostatic cells may provide opportunity for cross-talk between signaling pathways involving lipid mediators and androgens.

CSC Algorithm
M(g,a) = GO codes matched in article a for gene g
O(g) = GO codes annotated to the ortholog of g
o = size of O(g), p = percentage (0.2)
CSC(g,a) = {};
for every GO1 in M(g,a)
  count = 0;
  for every GO2 in O(g)
    if ((χ²(GO1, GO2) > 3.84) && (GO1 != GO2)) count++;
  if (count > p*o) add GO1 to CSC(g,a);