BTN323: INTRODUCTION TO BIOLOGICAL DATABASES Day2: Specialized Databases Lecturer: Junaid Gamieldien, PhD

WHAT YOU NEED TO LEARN:
- What are protein pattern/fingerprint/motif databases and why are they important?
- What are the benefits of using ontologies in database design?
- How do model organism databases support human health research?

PATTERN DATABASES
- Sometimes alignment-based methods find no hits that give us clues about a novel gene's or protein's function.
- Then we turn to finding MOTIFS: common conserved sequence elements in protein families.
- In many cases a motif consists of distinct subparts that are highly conserved across the sequences, while the regions between these subparts have little in common.
- If we have a database of these patterns, we can assign a potential function to a novel protein by finding one or more known motifs in it.

PROTEIN
- Similar sequence => similar function
- Also true for subsections of a protein
- Motifs or signature sequences, e.g. DNA-binding motifs
- EVOLUTIONARY CONSTRAINT!
[Figure: Sequence A and Sequence B]

INTERPRO: INTEGRATED PATTERN DATABASE
- Integrated resource for protein families, domains, regions and sites
- Combines several databases that use different methodologies and varying degrees of biological information on well-characterised proteins to derive protein signatures
- Capitalises on their individual strengths => powerful integrated database and diagnostic tool (InterProScan)

MEMBER DATABASES
- ProDom: provider of sequence clusters
- PROSITE patterns: regular expressions
- PRINTS: provider of protein 'fingerprints'
- PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: providers of hidden Markov models (HMMs)
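Because PROSITE patterns are regular expressions, they can be translated into the regex syntax of a general-purpose language. The following is a minimal sketch in Python, not PROSITE's own tooling: the function names `prosite_to_regex` and `find_motifs` and the test sequence are made up for illustration, and only the core syntax is handled (`x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` repeats, and `<` / `>` anchors at the start or end of the pattern). The example pattern `N-{P}-[ST]-{P}` is the N-glycosylation site pattern.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if element.endswith(")"):        # repeat count such as (3) or (2,4)
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{"):    # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                            # a single residue or a [..] set
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motifs(sequence: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (1-based start, matched text) for every match, overlaps included."""
    regex = prosite_to_regex(pattern)
    # A lookahead lets overlapping occurrences be reported as well.
    return [(m.start() + 1, m.group(1))
            for m in re.finditer("(?=(" + regex + "))", sequence)]


# Made-up sequence, used only to exercise the functions.
hits = find_motifs("MKNGTALNPSQNCSA", "N-{P}-[ST]-{P}")
print(hits)
```

On this sequence the candidate at position 8 (`NPSQ`) is rejected because proline follows the asparagine, which is exactly what the `{P}` element forbids.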

INTERPRO PROTEIN 'SITES'
- Conserved site: any short sequence pattern that may contain one or more unique residues
- Active site: one or more signatures cover all the active-site residues
- Binding site: binds chemical compounds
- Post-translational modification: modifies the primary protein structure, e.g. glycosylation, phosphorylation, etc.

INTERPRO SEQUENCE ANALYSIS: INTERPROSCAN
- Searching against different functional site databases has become vital for predicting protein function (e.g. where BLAST fails).
- Different databases have different strengths and weaknesses, which stem from their underlying analysis methods.
- Ideally, all of the secondary databases should be searched to ensure the best results.
- This is exactly what InterProScan does (part of today's practical topic).
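The idea of running one sequence through several methods and pooling the hits can be sketched in a few lines of Python. This is a toy illustration only and does not reproduce how InterProScan is implemented or invoked: the two "databases" (`toy_patterns`, `toy_windows`), the scanner functions and the sequence are all hypothetical stand-ins for methods with different strengths.

```python
import re
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, int, int]  # (signature name, start, end), 1-based inclusive
Scanner = Callable[[str], List[Hit]]


def regex_scanner(name: str, regex: str) -> Scanner:
    """Pattern-style method: reports every regex match."""
    compiled = re.compile(regex)

    def scan(seq: str) -> List[Hit]:
        return [(name, m.start() + 1, m.end()) for m in compiled.finditer(seq)]
    return scan


def hydrophobic_window_scanner(name: str, width: int,
                               min_fraction: float) -> Scanner:
    """Composition-style method: reports windows rich in hydrophobic residues."""
    hydrophobic = set("AILMFVW")

    def scan(seq: str) -> List[Hit]:
        hits = []
        for i in range(len(seq) - width + 1):
            window = seq[i:i + width]
            if sum(r in hydrophobic for r in window) / width >= min_fraction:
                hits.append((name, i + 1, i + width))
        return hits
    return scan


def scan_all(seq: str, scanners: Dict[str, Scanner]) -> Dict[str, List[Hit]]:
    """Run every scanner on the sequence and keep the results per database."""
    return {db: scan(seq) for db, scan in scanners.items()}


# Hypothetical member "databases", each wrapping a different method.
scanners = {
    "toy_patterns": regex_scanner("glyco_like_site", "N[^P][ST][^P]"),
    "toy_windows": hydrophobic_window_scanner("hydrophobic_stretch", 8, 1.0),
}
results = scan_all("MKNGTALLLVVAILFAGNCSA", scanners)
for db, hits in results.items():
    print(db, hits)
```

Keeping the results keyed by source makes it easy to see which methods agree on a region and which one is the only source of a hit.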

BIO-ONTOLOGIES
- Community-developed agreements on the terms/concepts that describe a topic and on the relationships between them
- The Gene Ontology (GO) is the most widely used
- The GO provides a common language to describe a gene product's biology in terms of:
  - Molecular Function
  - Biological Process
  - Cellular Component (cellular location)
- Several others exist, e.g. for anatomy, cell types, disease, phenotype, pathway, ...

[Figure: example annotation in which GENE-X "involves" a GO term]

ADVANTAGES OF GO (AND MANY OTHER BIO-ONTOLOGIES) IN DB DESIGN
- A common language applicable to any organism
- Represents and organises information in a way that both humans and machines can understand
- GO terms can be used to annotate gene products from any species
- Enables easy comparison of information across species

ADVANTAGES OF GO (AND MANY OTHER BIO-ONTOLOGIES) IN DB DESIGN (2)
- Terms make good entry points for database searches
- Researchers can search for what they really mean (and meaning is more consistent between individuals)
- Transitive links from biological objects to a query term via its child terms ensure that ALL relevant results are returned automatically
- 'Reverse' queries can easily be done to return terms when biological objects are used as queries

[Figure: GO term hierarchy in which GENE-X "involves" the term 'cytokinesis']
- GENE-X will be returned even if the query is done at the level of a parent term
- Using GENE-X as the query can return 'cytokinesis' and even all its parent terms
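Both query directions can be sketched with a small parent lookup table. This is a minimal Python illustration under stated assumptions: the term hierarchy is a simplified toy fragment, not the real GO graph, GENE-Y is a hypothetical second gene added for contrast, and the function names are made up.

```python
from typing import Dict, List, Set

# Toy hierarchy: each term maps to its parent terms (simplified, not real GO).
PARENTS: Dict[str, List[str]] = {
    "cytokinesis": ["cell division"],
    "cell division": ["cellular process"],
    "cellular process": ["biological process"],
    "biological process": [],
}

# Direct annotations only; GENE-Y is a hypothetical second gene.
ANNOTATIONS: Dict[str, List[str]] = {
    "GENE-X": ["cytokinesis"],
    "GENE-Y": ["cell division"],
}


def ancestors(term: str) -> Set[str]:
    """All parent terms reachable from a term by following links upwards."""
    seen: Set[str] = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def terms_for_gene(gene: str) -> Set[str]:
    """'Reverse' query: direct terms of a gene plus all their parent terms."""
    terms: Set[str] = set()
    for term in ANNOTATIONS.get(gene, []):
        terms.add(term)
        terms |= ancestors(term)
    return terms


def genes_for_term(term: str) -> List[str]:
    """Forward query: genes annotated to the term or to any of its child terms."""
    return sorted(g for g in ANNOTATIONS if term in terms_for_gene(g))


print(genes_for_term("cell division"))
print(sorted(terms_for_gene("GENE-X")))
```

Querying at 'cell division' returns GENE-X even though it is annotated only to the child term 'cytokinesis', and querying with GENE-X returns 'cytokinesis' together with every parent term above it.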

MODEL ORGANISM GENETIC DATABASES
- Very useful for collecting results from genetic (and other) experiments that cannot be done on humans:
  - Disease models
  - Gene knockouts
  - Drug testing
  - Environmental manipulation
- In terms of genomics, model organism data is invaluable for unravelling:
  - Gene and protein functions
  - Gene-to-phenotype relationships
  - Gene-to-disease associations
- The aim of these databases is to integrate all relevant information in one place:
  - Easier to mine the database for novel associations
  - Enables linking between databases

RAT AND MOUSE GENOME DATABASES – DATA TYPES
- Genes, proteins and their annotations, including Gene Ontology links and expression information
- Phenotypes, described by terms in the Mammalian Phenotype Ontology:
  - From gene knockout models produced by the project and their partners
  - From evidence mined from the literature
- Disease, Pathway and Behaviour ontologies and relevant gene associations are also present in RGD

DESIGNED FOR EASE OF USE
- Web query interfaces are intuitive
- Several traditional ways to query: gene names, symbols, chromosomal location
- Query interfaces for ontologies (Disease, Phenotype, Pathway, Behaviour)
- Ontology annotations can easily be retrieved for any gene or protein
- Both databases have links to human genes, which simplifies in-silico exploration of human diseases and phenotypes driven by mouse and rat evidence
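The value of the links to human genes can be sketched as a simple join between phenotype annotations and an ortholog table. This is a toy Python illustration, not the query interface of either database: the gene symbols, phenotype labels, table names and the function `human_candidates` are all hypothetical placeholders.

```python
from typing import Dict, List, Tuple

ModelGene = Tuple[str, str]  # (species, gene symbol)

# Hypothetical ortholog links from model organism genes to human genes.
HUMAN_ORTHOLOG: Dict[ModelGene, str] = {
    ("mouse", "GeneA"): "GENE-A",
    ("rat", "GeneA"): "GENE-A",
    ("mouse", "GeneB"): "GENE-B",
}

# Hypothetical phenotype annotations, e.g. from knockouts or the literature.
PHENOTYPES: Dict[ModelGene, List[str]] = {
    ("mouse", "GeneA"): ["phenotype-1"],
    ("rat", "GeneA"): ["phenotype-1", "phenotype-2"],
    ("mouse", "GeneB"): ["phenotype-2"],
}


def human_candidates(phenotype: str) -> Dict[str, List[str]]:
    """Map a phenotype term to human genes, listing the supporting species."""
    candidates: Dict[str, List[str]] = {}
    for (species, gene), terms in PHENOTYPES.items():
        human = HUMAN_ORTHOLOG.get((species, gene))
        if human and phenotype in terms:
            candidates.setdefault(human, []).append(species)
    return candidates


print(human_candidates("phenotype-1"))
print(human_candidates("phenotype-2"))
```

A human gene supported by evidence from both species is a stronger candidate for follow-up than one supported by a single model.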