Answering Gene Ontology terms to proteomics questions by supervised macro reading in MEDLINE Julien Gobeill 1, Emilie Pasche 2, Douglas Teodoro 2, Anne-Lise.

Slides:



Advertisements
Similar presentations
FP7 meeting - Gent - Carlos Rodríguez - April 18 WP4: Conceptual Mining from Text for Knowledge Engineering State of the Art WP Coordinators: Alfonso Valencia.
Advertisements

Microarray Data Analysis Day 2
Towards a zoomable cell abstract cell natural coordinate system Data > D Protein Structures from PDB ? A IHGFBCDE > Images from scientific.
Provenance in a Collaborative Bio-database RAASWiki Donald Dunbar & Jon Manning Queen’s Medical Research Institute University of Edinburgh Use Cases for.
Ke Liu1, Junqiu Wu2, Shengwen Peng1,Chengxiang Zhai3, Shanfeng Zhu1
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
Ping-Tsun Chang Intelligent Systems Laboratory Computer Science and Information Engineering National Taiwan University Text Mining with Machine Learning.
LESSONS FROM THE BIOCREATIVE PROTEIN- PROTEIN INTERACTION (PPI) TASK RegCreative Jamboree, Friday, December, 1st, (2006) MARTIN KRALLINGER, 2006 LESSONS.
Chapter 5 Ligand gated ion channels, intracellular receptors and phosphorylation cascades.
The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology.
Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Research supported in part by grants from the National.
1 Question Answering in Biomedicine Student: Andreea Tutos Id: Supervisor: Diego Molla.
Protein databases Morten Nielsen. Background- Nucleotide databases GenBank, National Center for Biotechnology Information.
Gene Ontology Luis Tari. Gene Ontology (GO) URL: Gene Ontology is A hierarchy of roles of genes.
COG and GO tutorial.
DI FC UL1 Gene Function Prediction by Mining Biomedical Literature Pooja Jain Master in Bioinformatics Supervisor - Mário Jorge Costa Gaspar.
Modeling Functional Genomics Datasets CVM Lesson 1 13 June 2007Bindu Nanduri.
Signaling Pathways and Summary June 30, 2005 Signaling lecture Course summary Tomorrow Next Week Friday, 7/8/05 Morning presentation of writing assignments.
UniProt - The Universal Protein Resource
B IOMEDICAL T EXT M INING AND ITS A PPLICATION IN C ANCER R ESEARCH Henry Ikediego
1 iProLINK: An integrated protein resource for literature mining and literature-based curation 1. Bibliography mapping - UniProt mapped citations 2. Annotation.
Analysis Environments For Scientific Communities From Bases to Spaces Bruce R. Schatz Institute for Genomic Biology University of Illinois at Urbana-Champaign.
Introduction to databases Tuomas Hätinen. Topics File Formats Databases -Primary structure: UniProt -Tertiary structure: PDB Database integration system.
Semantic Similarity over Gene Ontology for Multi-label Protein Subcellular Localization Shibiao WAN and Man-Wai MAK The Hong Kong Polytechnic University.
RLIMS-P: A Rule-Based Literature Mining System for Protein Phosphorylation Hu ZZ 1, Yuan X 1, Torii M 2, Vijay-Shanker K 3, and Wu CH 1 1 Protein Information.
CACAO Training Fall Community Assessment of Community Annotation with Ontologies (CACAO)
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
IProLINK – A Literature Mining Resource at PIR (integrated Protein Literature INformation and Knowledge ) Hu ZZ 1, Liu H 2, Vijay-Shanker K 3, Mani I 4,
Biology 224 Instructor: Tom Peavy Feb 21 & 26, Protein Structure & Analysis.
Intralab Workshop - Reactome CMAP Chang-Feng Quo June 29 th, 2006.
UniProt Non-redundant Reference Cluster (UniRef) Databases Swiss Institute of Bioinformatics (SIB) European Bioinformatics Institute (EMBL-EBI)
1 Bio-Trac 40 (Protein Bioinformatics) October 8, 2009 Zhang-Zhi Hu, M.D. Associate Professor Department of Oncology Department of Biochemistry and Molecular.
I529: Lab5 02/20/2009 AI : Kwangmin Choi. Today’s topics Gene Ontology prediction/mapping – AmiGo –
Analysis of Complex Proteomic Datasets Using Scaffold Free Scaffold Viewer can be downloaded at:
The Gene Ontology project Jane Lomax. Ontology (for our purposes) “an explicit specification of some topic” – Stanford Knowledge Systems Lab Includes:
Cell Signaling Ontology Takako Takai-Igarashi and Toshihisa Takagi Human Genome Center, Institute of Medical Science, University of Tokyo.
Playing Biology ’ s Name Game: Identifying Protein Names In Scientific Text Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen and Ralf Zimmer Pac Symp.
From Functional Genomics to Physiological Model: Using the Gene Ontology Fiona McCarthy, Shane Burgess, Susan Bridges The AgBase Databases, Institute of.
Benchmarking ontology-based annotation tools for the Semantic Web Diana Maynard University of Sheffield, UK.
Ontology based analyses methods ++ develop a grammar for making productions using mf, bp, cl: –derive a higher level grammar for next level of productions.
Protein and RNA Families
Expanding GO annotations with text classification Nicko Goncharoff Reel Two, Inc.
Other biological databases and ontologies. Biological systems Taxonomic data Literature Protein folding and 3D structure Small molecules Pathways and.
PREDICTION OF CATALYTIC RESIDUES IN PROTEINS USING MACHINE-LEARNING TECHNIQUES Natalia V. Petrova (Ph.D. Student, Georgetown University, Biochemistry Department),
1 EMBL Outstation — The European Bioinformatics Institute Removing redundancy in SWISS-PROT and TrEMBL.
A collaborative tool for sequence annotation. Contact:
Computer Science Ph. D. Seminar Gene Ontology (GO) Based Search for Protein Structure Similarity Clustering Metrics Ph.D. Candidate Steve Johnson Committee.
Ferran Sanz – GRIB (IMIM-UPF) Bioinformatics: How it can support the Family of International Classifications? Ferran Sanz Research Programme on Biomedical.
Automatic Assignment of Biomedical Categories: Toward a Generic Approach Patrick Ruch University Hospitals of Geneva, Medical Informatics Service, Geneva.
Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation Bioinformatics, July 2003 P.W.Load,
An example of GO annotation from a primary paper Rebecca E. Foulger (UniProt Curator) GO Annotation Camp, June 2005 PMID:
University of Illinois at Urbana-Champaign. BeeSpace Project 5-year NSF-funded project Project Goals  Develop open bioinformatics resources  Support.
Tools in Bioinformatics Ontologies and pathways. Why are ontologies needed? A free text is the best way to describe what a protein does to a human reader.
An example of GO annotation from a primary paper GO Annotation Camp, July 2006 PMID:
Protein. Protein and Roles 1: biological process unknown 1.1 Structural categories 1.2 organism categories 1.3 cellular component o unlocalized.
BioCreAtIvE Critical Assessment for Information Extraction in Biology Granada, Spain, March28-March 31, 2004 Task 2: Functional annotation of gene products.
Gene Annotation & Gene Ontology May 24, Gene lists from RNAseq analysis What do you do with a list of 100s of genes that contain only the following.
` Comparison of Gene Ontology Term Annotations Between E.coli K12 Databases REDDYSAILAJA MARPURI WESTERN KENTUCKY UNIVERSITY.
11 Thoughts on STS regarding Machine Reading Ralph Weischedel 12 March 2012.
Hierarchical Semi-supervised Classification with Incomplete Class Hierarchies Bhavana Dalvi ¶*, Aditya Mishra †, and William W. Cohen * ¶ Allen Institute.
Universal descriptions of gene and gene product attributes
TDM in the Life Sciences Application to Drug Repositioning *
BME435 BIOINFORMATICS.
Protein databases Henrik Nielsen
Triage by Ranking to Support the Curation of Protein Interactions
What is Pattern Recognition?
UniProt: the Universal Protein Resource
PANTHER (Protein Analysis Through Evolutionary Relationships): Trees, Hidden Markov Models, Biological Annotations Paul Thomas, Ph.D. Division of Bioinformatics.
Volume 125, Issue 1, Pages (April 2006)
Literature retrieval for personalized cancer treatment
Presentation transcript:

Answering Gene Ontology terms to proteomics questions by supervised macro reading in MEDLINE Julien Gobeill 1, Emilie Pasche 2, Douglas Teodoro 2, Anne-Lise Veuthey 3, Patrick Ruch 1 1 University of Applied Sciences, Information Sciences, Geneva 2 Hospitals and University of Geneva, Geneva 3 Swiss-Prot group, Swiss Institute of Bioinformatics, Geneva

2 Data deluge… “ What is the subcellular location of protein MEN1 ? ” “ What molecular functions are affected by Ryanodine ? ”

3 Ontology-based search engines

Question Answering (EAGLi system) Redundancy hypothesis: The number of associated/co-occurring answers dominate other dimensions

Comparison based in two categorizers : – Thesaurus-Based (EAGL) Competitive with MetaMap (Trieschnigg et al., 2009) Compute lex. similarity between text and GO terms – Machine Learning (GOCat) k-NN Similarity between inpur text and already curated abstracts KB derived from GOA : ~90’000 instances Best way for extracting GO terms from a set of abstracts ? (1/3)

Two tasks : – Classical categorization (micro reading ~ biocuration) – Redundancy-based QA (macro reading) Best way for extracting GO terms from a set of abstracts ? (2/3) one abstract/paper GO terms a set of n (=100) abstracts GO terms Σ

One benchmark for micro reading evaluation – 1’000 abstracts and GO descriptors from GOA Two benchmarks for macro reading evaluation – 50 questions derived from a set of biological databases: What molecular functions are affected by [chemical] ? What cellular component is the location of [protein] ? Best way for extracting GO terms from a set of abstracts ? (3/3)

Results micro reading task macro reading task Benchmark1’000 abstractsCTDUniProt MetricsP0R10P0R100P0R10 EAGL (Thesaurus Based) GOCat (k-NN).43 (+86%).47 (+193%).69 (+102%).33 (+120%).58 (+75%).73 (+62%) + 75/120% for k-NN (sup. learning)  Redundancy hypothesis insufficient Why/Where is the power ? Size does or does not matter ?

Deluge is self-compensated GOCat EAGL

Deluge is self-compensated GOCat EAGL

Magic ! The automatic categorization based on a PMID 2007 performed in 2011 is of higher quality than a categorization on the same PMID 2007 performed in 2007 No concept drift at all and even some improvement!

Example in toxicogenomics: CTD vs. GOCat GO Level GO Term 9 GO : ryanodine-sensitive calcium- release channel activity 7 GO : calcium-release channel activity 7GO : calcium channel activity 6GO : ligand-gated channel activity 6 GO : ligand-gated ion channel activity 3GO : calmodulin binding “ What molecular functions are affected by Ryanodine ? ” RankGO Term 1.GO : protein binding 2.GO : ryanodine-sensitive calcium- release channel activity 3.GO : voltage-gated calcium channel activity 4.GO : calcium ion binding 5.GO : calcium channel activity 6.GO : receptor binding 7.GO : calmodulin binding 8.GO calcium-transporting ATPase activity 9.GO : calcium-release channel activity 10.GO : FK506 binding GOCat

Example in UniProt GO Level GO Term 6 GO : histone methyltransferase complex 5GO : chromatin 5GO : nuclear matrix 4GO : cytosol 3GO : cleavage furrow “ What is the subcellular location of protein MEN1 ? ” RankGO Term 1.GO : nucleus 2.GO : cytoplasm 3.GO : plasma membrane 4.GO : extracellular space 5.GO : integral to plasma membrane 6.GO : mitochondrion 7.GO : cytosol 8.GO : extracellular region 9.GO : histone methyltransferase complex 10.GO : chromatin … 15.GO : nuclear matrix GOCat

Qualitative evaluation Relevant vs irrelevant : 82% - 18% Guha R., Gobeill J. and Ruch P. Automatic Functional Annotation of PubChem BioAssays

Automatic assignment of GO categories ~ 43% [Camon et al 2003: GO kappa ~ 40%] Classification model improves faster than drift [  Consistency of annotation guidelines ] Next: Effective integration into the EAGLi’ question-answering platform Conclusion and future work

Collaborations Automatic Functional Annotation of PubChem BioAssays  Generates semantic similarity clusters Automatically populating large protein datasets Genes with unvalidated predicted functions

Please visit EAGLi, the Bio-medical question answering engine !

The Gene Ontology Categorizer: Other resources… TWINC (patent retrieval…)

Acknowledgments Swiss-prot group (SIB): Anne-Lise Veuthey, Yoannis Yenarios U. Indiana/SCRIPPS: Rajarshi Guha / Stephan Schurer The COMBREX project: Martin Steffen NextProt: Pascale Gaudet SNF Grant: EAGL # EU FP7: # www.KHRESMOI.eu