Maintaining Ontologies as They Scale Across Multiple Species Darren A. Natale Protein Information Resource.


The Issue Many ontologies are designed, at least in part, to address entities in a cross-species manner – Examples: GO, IDO, PRO How does one account for species with disparate biological mechanisms? Regardless of solution chosen, the problem becomes more acute as we try to account for more and more species

The Approaches: GO ~40000 terms Originally, used “sensu” (“in the sense of”) to indicate that there are differences based on taxa (these have been removed) – e.g., secretin (sensu Bacteria is a protein transporter, sensu Mammalia is a hormone) Currently, definitions are refined to ensure that they can apply to all species (by removing any taxa-specific information) GO strives to have no species-specific terms at all

GO: traversing start control point of mitotic cell cycle OLD def: "Passage through a cell cycle control point late in G1 phase of the mitotic cell cycle just before entry into S phase; in most organisms studied, including budding yeast and animal cells, passage through start normally commits the cell to progressing through the entire cell cycle." NEW def: “A cell cycle process by which a cell commits to entering S phase via a positive feedback mechanism between the regulation of transcription and G1 CDK activity.”

The Approaches: IDO ~500 terms, 800, 1700… IDO has both generic and specific terms, but these are maintained separately: IDO-Core is restricted to those terms that can apply to anything – e.g., host, toxin. IDO extensions contain terms specific to a particular species or closely related species – e.g., Malaria, Influenza, Brucellosis. [Diagram labels: organism host, malaria host, IDO-core, IDOMAL]

The Approaches: PRO PRO also allows for both generic and specific terms, but these are maintained together. For the most part, only the generic (organism non-specific) terms are explicit; the classification of species-specific terms is inferred.


Eh? PR: explicitly states that ORC6 = A protein that is a translation product of the human ORC6L gene or a 1:1 ortholog thereof. Thus, if we can identify 1:1 orthologs of the human ORC6L gene, we can infer that the resulting proteins are instances of this class
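
The inference described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a hypothetical list of (human gene, species, gene) orthology pairs, the gene names other than ORC6L are made up, and "1:1" is taken to mean that the human gene has exactly one partner in a species and that partner has exactly one human partner. How PRO actually computes orthology is not stated in the slides.

```python
from collections import defaultdict


def one_to_one_orthologs(pairs, human_gene):
    """Return {species: gene} for genes that are 1:1 orthologs of human_gene.

    pairs: iterable of (human_gene, species, gene) tuples (hypothetical format).
    """
    by_human = defaultdict(set)  # (human gene, species) -> genes in that species
    by_other = defaultdict(set)  # (species, gene) -> human genes paired with it
    for h, species, g in pairs:
        by_human[(h, species)].add(g)
        by_other[(species, g)].add(h)

    members = {}
    for (h, species), genes in by_human.items():
        if h != human_gene or len(genes) != 1:
            continue  # no partner, or one-to-many in this species
        (g,) = genes
        if len(by_other[(species, g)]) == 1:
            members[species] = g  # its protein is inferred to be in the class
    return members


# Illustrative data only; none of these pairings come from the slides.
pairs = [
    ("ORC6L", "mouse", "mouse_gene_1"),
    ("ORC6L", "fly", "fly_gene_a"),
    ("ORC6L", "fly", "fly_gene_b"),      # one-to-many: excluded
    ("ORC6L", "worm", "worm_gene_c"),
    ("OTHER", "worm", "worm_gene_c"),    # many-to-one: excluded
]
inferred = one_to_one_orthologs(pairs, "ORC6L")
```

Only the mouse gene survives the filter here, which is the sense in which class membership is inferred rather than stated explicitly.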

Growth of PRO [Chart; series: mapped entities (inferred), main PRO]

What was mapped 12 reference organisms: 7.5% = pitiful

Filling the Gaps
Fit UniProtKB entries into the PRO hierarchy – genes and isoforms
Possible approaches:
– Allow generation skipping (i.e., not require mapping to 1:1 ortholog) and allow mapping to family-level terms
  We’ll need a good relation from protein -> family
– Define some classes based on paralogs (to handle lineage-specific expansions in plants)
– Add function-based hierarchy in addition to evolution-based hierarchy

The New Relation?
x sequence_matches_hmm y = [def] if x is a linear sequence of letters and y is a hidden Markov model (HMM) that describes the probability of observing a particular sequence, then, given the parameters of the model, the probability of observing x (or some significant portion thereof) falls above the threshold defined for y.
x matches_hmm y = [def] if x is an amino acid chain with a sequence representation s and y is a hidden Markov model (HMM) that describes the probability of observing a particular sequence, then, given the parameters of the model, the probability of observing s (or some significant portion thereof) falls above the threshold defined for y.
x belongs_to y = [def] if x is an amino acid chain with a sequence representation s and y is a protein family for which a hidden Markov model h has been derived, then s sequence_matches_hmm h, and there is no other HMM o for which s exhibits a better match over the part of s that sequence_matches_hmm h.
x has_domain y = [def] if x is an amino acid chain with a sequence representation s and y is a protein domain for which a hidden Markov model h has been derived, then s sequence_matches_hmm h, and there is no other HMM o for which s exhibits a better match over the part of s that sequence_matches_hmm h.
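
A rough way to read the belongs_to definition operationally is sketched below. The assumptions are mine, not the slides': each HMM hit is a hypothetical record with a matched region and a single score standing in for the probability, a "better match" is a higher score, scores are treated as comparable across models, and "the part of s that matches" is approximated by region overlap. The model names and numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    hmm: str      # identifier of the HMM (hypothetical)
    start: int    # first matched residue, inclusive
    end: int      # last matched residue, inclusive
    score: float  # stand-in for the probability in the definition


def sequence_matches_hmm(hit, thresholds):
    """True if the hit's score falls above the threshold defined for its HMM."""
    return hit.score > thresholds[hit.hmm]


def overlaps(a, b):
    """True if the two matched regions share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def belongs_to(hits, family_hmm, thresholds):
    """True if some above-threshold hit to family_hmm is not beaten by any
    other HMM over the region it matches."""
    for hit in hits:
        if hit.hmm != family_hmm or not sequence_matches_hmm(hit, thresholds):
            continue
        rivals = [o for o in hits if o.hmm != family_hmm and overlaps(o, hit)]
        if all(o.score <= hit.score for o in rivals):
            return True
    return False


# Invented hits for one sequence.
hits = [
    Hit("famA", 1, 100, 50.0),
    Hit("famB", 20, 90, 70.0),
    Hit("domC", 150, 200, 30.0),
]
thresholds = {"famA": 25.0, "famB": 25.0, "domC": 25.0}
```

The has_domain relation has the same shape; only the kind of thing the HMM represents (domain rather than family) differs.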

Problems The calculation of 1:1 orthology, when based on proteins, strongly depends on the protein set used. The accessions for the mapped entities (from UniProtKB) sometimes cease to exist – In some cases, they disappear completely – In some cases, they change (e.g., when a TrEMBL entry is merged into a Swiss-Prot entry to become a new isoform)
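
One way to cope with the second problem is to re-resolve every stored accession before use. The sketch below assumes two hypothetical inputs that the slides do not describe: a set of accessions that currently exist and a table recording which accession each merged entry was folded into. It does not call any UniProtKB service, and the accessions are placeholders.

```python
def resolve_accession(acc, current, merged_into):
    """Follow merge records until a live accession is found.

    current: set of accessions that still exist (hypothetical input).
    merged_into: dict mapping an old accession to its successor (hypothetical).
    Returns the live accession, or None if the entry disappeared completely.
    """
    seen = set()
    while acc not in current:
        if acc in seen or acc not in merged_into:
            return None  # deleted outright, or a cyclic merge record
        seen.add(acc)
        acc = merged_into[acc]
    return acc


# Placeholder accessions for illustration.
current = {"ACC_LIVE_1", "ACC_LIVE_2"}
merged_into = {"ACC_OLD_1": "ACC_LIVE_1", "ACC_OLD_0": "ACC_OLD_1"}

resolved = resolve_accession("ACC_OLD_0", current, merged_into)
gone = resolve_accession("ACC_DELETED", current, merged_into)
```

Mappings that resolve to None would need manual review, since the slides note that some entries vanish without a successor.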