Protein databases Petri Törönen Shamelessly copied from material done by Eija Korpelainen and from CSC bio-opas



Why protein sequences? Most laboratory analysis is done with nucleotide sequences, so analysis at the nucleotide level seems natural.

But there are drawbacks: -divergence in codons => same protein, different nucleotide sequence! -similarity between different amino acids is not captured. Therefore not all of the similarity is visible at the nucleotide level!
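
The codon point can be illustrated with a small sketch. The two DNA sequences below are invented for this example, and the codon table contains only the codons they use (taken from the standard genetic code). The sequences agree at fewer than half of their nucleotide positions, yet encode the same peptide.

```python
# Partial codon table (standard genetic code), limited to the codons
# used by the two invented example sequences below.
CODON_TABLE = {
    "ATG": "M",
    "CTT": "L", "TTA": "L",
    "TCT": "S", "AGC": "S",
    "CGT": "R", "AGA": "R",
}


def translate(dna):
    """Translate a DNA string codon by codon using CODON_TABLE."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)


seq_a = "ATGCTTTCTCGT"  # invented example sequence
seq_b = "ATGTTAAGCAGA"  # invented example sequence, different codons

print(translate(seq_a), translate(seq_b))
print("nucleotide identity:", round(identity(seq_a, seq_b), 2))
print("protein identity:", identity(translate(seq_a), translate(seq_b)))
```

A nucleotide-level comparison would score these two sequences as weakly similar, while a protein-level comparison sees them as identical.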

…more… Protein databases often also include more detailed information. The protein (not the RNA) is usually the actual functional unit that has a biological function. -note the exceptions, such as structural RNAs.

Various protein (related) databases: Databases including protein sequences –UniProt. Databases including protein domains –Pfam –PROSITE. Databases including protein sequence patterns, motifs –PROSITE

Differences between databases: "size" of the included data components. "Large" components: –whole sequences. "Medium" components: –protein domains. "Small" components: –protein sequence motifs. A protein sequence can include many domains, and a domain can include many motifs.

Differences between databases: Some include all the available information (more or less reliable) –large coverage, everything is stored in the database –low reliability, information has not been confirmed –computer annotation => fast updating. Some cover only the reliable information –small coverage –information is reliable –expert curation => slow updating. Swiss-Prot (curated) ↔ TrEMBL (uncurated)

Differences between databases: Why the previous division? Some protein features/functions are linked to domains. Some features/functions are linked to specific sequence motifs. Some features are best described at the whole-sequence level.

Protein sequence databases: UniProt (Swiss-Prot + TrEMBL), PIR-PSD. Let's focus on Swiss-Prot.

Why is Swiss-Prot nice? Sequences are manually annotated and checked. No multiple entries for the same sequence. Annotations include protein function, post-translational modifications, active sites etc. Linked to many other databases. Similarity to RefSeq.

So how do you search for protein sequences in the available databases? Search with a protein name. Search with a protein's function or descriptive words. Search with a protein/RNA sequence. WWW link for the first two options…

Searching UniProt: We demonstrate the search by looking for human protein kinases.

Type the query here. Choose the database. Here you limit the search to Swiss-Prot. Let's first go to Advanced Search.

Select the field here. Type the query here. 1. Select the field "protein name". 2. Type the query: protein kinase. We get all sequences that have both words (protein AND kinase) in their description.

After the previous results, open a new search row in Advanced Search. Next select "organism" as the field and type homo sapiens. Click Add&Search.
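
The same two-row search can also be written as a single query string. The sketch below only builds the string and a URL; it does not contact any server. The endpoint and the field names (`protein_name`, `organism_name`, `reviewed`) are assumptions based on the current UniProt REST interface and should be checked against the UniProt documentation before use. `build_query` and `build_url` are hypothetical helper names.

```python
from urllib.parse import urlencode

# Assumed endpoint of the UniProt REST service (verify before relying on it).
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_query(protein_words, organism, reviewed_only=True):
    """Join one term per search row with AND, as Advanced Search does."""
    # One term per word, so both "protein" AND "kinase" must be present.
    terms = [f"protein_name:{word}" for word in protein_words.split()]
    terms.append(f'organism_name:"{organism}"')
    if reviewed_only:
        # Assumed field for restricting hits to curated (Swiss-Prot) entries.
        terms.append("reviewed:true")
    return " AND ".join(terms)


def build_url(query, fmt="tsv", size=25):
    """Encode the query and the output options into a request URL."""
    return BASE_URL + "?" + urlencode({"query": query, "format": fmt, "size": size})


query = build_query("protein kinase", "homo sapiens")
url = build_url(query)
print(query)
print(url)
```

Building the query from separate terms mirrors the search rows of the web form: each row adds one condition, and dropping `reviewed_only` widens the search from Swiss-Prot to all of UniProt.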

RESULTS: Here you can look at common features among the obtained sequences. Here you limit the results to Swiss-Prot. Get more info on a hit by clicking the gene name. Let's open one for a better view…

Different fields of information can be found when scrolling down the page. NOTICE: Detailed description of function → General annotation. Alternative splice variants and mutations reported → Alternative products → Natural variations.

The result obtained demonstrates the detailed information available from Swiss-Prot. Note that the stored information includes –information on the organism –gene name, gene description –links to the articles discussing the sequence –a comment part with a detailed description of function and tissue localization –a features part with a detailed description of domains and various functional components

Extra slide: Go back to the search results. Select keyword, and open the Disease list for better viewing… Test these.

Extra slide: You can view which genes have been reported to be involved in some diseases. Note that 18 are linked to tumor suppressors and 36 to proto-oncogenes.

Summary: Protein databases show detailed information on protein sequences. UniProt/Swiss-Prot is the recommended protein database -manually curated -non-overlapping. Swiss-Prot can show very detailed information on sequences.

Sequence motifs: Motifs are conserved regions in functionally similar proteins. They are crucial parts for protein function –the protein cannot change them without changing its function. Analysis of sequences with motifs can be more effective when no close sequence relatives are found –recommended when a normal sequence search gives no results
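
A motif search can be sketched by turning a PROSITE-style pattern into a regular expression. The converter below handles only the common pattern syntax (separators, `x`, allowed and forbidden residue sets, repeat counts, sequence-end anchors). The pattern `N-{P}-[ST]-{P}` is the PROSITE N-glycosylation site pattern; the test sequence is invented, and `prosite_to_regex` and `find_motif` are hypothetical helper names.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern to a Python regular expression.

    Supports '-' separators, x (any residue), [..] allowed residues,
    {..} forbidden residues, (n) or (n,m) repeats, and the '<' / '>'
    anchors for the start and end of the sequence.
    """
    regex = pattern.rstrip(".")           # patterns end with a period
    regex = regex.replace("-", "")        # drop element separators
    regex = regex.replace("x", ".")       # x matches any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # forbidden set
    regex = regex.replace("(", "{").replace(")", "}")    # repeat counts
    regex = regex.replace("<", "^").replace(">", "$")    # sequence ends
    return regex


def find_motif(pattern, sequence):
    """Return (1-based start, matched text) for every match, overlaps included."""
    regex = prosite_to_regex(pattern)
    # The lookahead lets overlapping occurrences be reported as well.
    return [(m.start() + 1, m.group(1))
            for m in re.finditer(f"(?=({regex}))", sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_motif("N-{P}-[ST]-{P}.", "MKNGSAANPTQ"))  # invented sequence
```

In the invented sequence the second N is followed by P, which the pattern forbids, so only the first site is reported.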

What is a motif? Multiple sequence alignment of sequences with similar function. Areas with strong conservation between the aligned sequences. Modified from Terri Attwood, 2002; modified from Eija Korpelainen...
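
Finding strongly conserved areas in an alignment can be sketched as follows. The four aligned sequences are invented for illustration, and "conserved" is taken here in its strictest sense (an identical residue in every sequence, no gaps). Real motif databases use softer criteria.

```python
# Invented toy alignment; '-' marks a gap.
alignment = [
    "MKV-GDSGKST",
    "MRVLGDSGKSS",
    "MKI-GDSGKTT",
    "LKVAGDSGKST",
]


def conserved_columns(aln):
    """Indices of columns where all sequences share the same residue."""
    return [i for i, column in enumerate(zip(*aln))
            if len(set(column)) == 1 and column[0] != "-"]


def runs(indices):
    """Group consecutive column indices into (start, end) blocks, end inclusive."""
    blocks = []
    for i in indices:
        if blocks and i == blocks[-1][1] + 1:
            blocks[-1] = (blocks[-1][0], i)   # extend the current block
        else:
            blocks.append((i, i))             # start a new block
    return blocks


for start, end in runs(conserved_columns(alignment)):
    print(start, end, alignment[0][start:end + 1])
```

The single conserved block found in this toy alignment is the kind of region a motif database would describe with a pattern or profile.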

Domain databases: A domain is a sub-component of a protein. It can exist and function independently of the rest of the protein sequence. Domains often act as building blocks in evolution that are combined to form proteins. The same domain can occur in various proteins.

Domain and motif databases PFAM PROSITE PRINTS TIGRFAM PRODOM … and many more

Domain and motif databases PFAM PROSITE PRINTS TIGRFAM PRODOM … All are combined into one service → InterPro

What is InterPro? A collection of many protein-related databases. All aim to report various features that can be used to analyze sequences. Features: domains, sequence motifs, global sequence homology. The different databases can be queried simultaneously via InterPro.

What is InterPro? This generates a large amount of information for a single query. Good chance of getting useful information for an unknown sequence. Some databases are well annotated. A drawback is the repetition in the results from different databases. Queries are also SLOW.

How to use InterPro: Sequence queries go to InterProScan. Paste the sequence here. Let's use the Serine/threonine protein kinase N1 sequence as the query. This sequence was in the UniProt results.

Results: Query name. Sequence here. Visualization of results. Domain associated with one region of the sequence. Click titles for more info. Let's check more information on the reported domains…

Results Sequence signatures, found by InterProScan, usually have a detailed description Contributing signatures from many databases

Results InterProScan gives us matches in the sequence to various sequence features –Domains, motifs These features are often well annotated Features associate functions to specific regions of sequence
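
How features tie functions to regions can be sketched by storing each match as an interval on the sequence. Everything below is hypothetical: the database names, feature names and coordinates are placeholders, not real InterProScan output, and `Match` and `features_at` are illustrative names.

```python
from typing import NamedTuple


class Match(NamedTuple):
    database: str
    feature: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


# Placeholder matches with made-up coordinates.
matches = [
    Match("db_1", "domain_A", 10, 120),
    Match("db_2", "domain_A", 12, 118),   # same region, reported by a second database
    Match("db_1", "motif_B", 40, 52),
    Match("db_3", "domain_C", 200, 310),
]


def features_at(position, hits):
    """All reported features whose region covers the given residue position."""
    return [h for h in hits if h.start <= position <= h.end]


for hit in features_at(45, matches):
    print(hit.database, hit.feature, hit.start, hit.end)
```

The two `domain_A` entries show the repetition that arises when several member databases describe the same region.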

Other Databases Databases describing gene functions –Gene Ontology databases –Reaction pathway databases Databases describing associations to phenotypes –Disease gene databases –Phenotype databases

Databases describing functions: Why do we need these databases? The earlier databases are helpful when the analysis starts from a single unknown gene. These databases help us find all genes known to be linked to a certain task –say, all apoptosis-related genes in human. They are also helpful when we analyze large sets of genes –is there something in common among the 100 genes that are most active in a cancer cell?
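
One common way to ask whether a gene set has "something in common" is an over-representation test based on the hypergeometric distribution. This is a sketch of that general technique, not a method taken from these slides, and all the counts in the example are invented.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn at random from
    `population` genes, of which `annotated` belong to the class."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


# Invented numbers: 20000 genes in the genome, 400 annotated to a class,
# 100 genes in our list, 10 of which carry the annotation.
p = enrichment_p_value(population=20000, annotated=400, sample=100, hits=10)
print(p)
```

With these invented counts about two annotated genes would be expected by chance, so seeing ten gives a small p-value. In practice the test is repeated for many classes, and the p-values then need a multiple-testing correction.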

Databases describing functions Gene Ontology databases –Classify genes into categories that describe gene function –Standardized classification applicable to all species –Classes represent involvement in biological tasks (like protein synthesis), chemical activities (like carbohydrate binding) or localization in cell (like nucleus)

Databases describing functions: Pathway databases –classify genes into biochemical pathways –classify genes into signalling pathways. Example databases: –KEGG –REACTOME

The Gene Ontology (GO) is a hierarchical structure for categorizing gene products in terms of their association with: 1. biological processes 2. cellular components 3. molecular functions in a species-independent manner

Structure of Gene Ontology: Hierarchical structure of linked nodes. Smaller classes: child classes, with precise, detailed information. Larger classes: parent classes, with broad, unspecific information. Smaller classes belong to larger classes: Viral protein biosynthesis => Protein biosynthesis => Biosynthesis. The starting node is the root of the hierarchical structure.
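
The parent/child structure can be sketched as a small graph in which each class lists its parents. The first three class names come from the slide; the root name "biological process" and the dictionary layout are assumptions for this toy example. Real GO classes can have several parents, which the traversal below allows for.

```python
# Toy hierarchy: child class -> list of parent classes.
PARENTS = {
    "viral protein biosynthesis": ["protein biosynthesis"],
    "protein biosynthesis": ["biosynthesis"],
    "biosynthesis": ["biological process"],  # assumed root for the example
    "biological process": [],
}


def ancestors(term, parents):
    """All broader classes reachable from `term` by following parent links."""
    found = []
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(parents[current])
    return found


print(ancestors("viral protein biosynthesis", PARENTS))
```

Because smaller classes belong to larger ones, a gene annotated to a child class is implicitly annotated to every class this function returns.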

Gene Ontology databases AmiGO bin/amigo/go.cgi QuickGO

AmiGO: Server maintained by the GO consortium for analysing gene annotations across species

AmiGO: Query here. Select: GO terms or gene names. This limits the search to exact matches.

AmiGO: We get the precise definition of the class. Associated genes.

AmiGO: Here you can limit the species. Let's view the genes associated with apoptosis in yeast (Saccharomyces cerevisiae). Selected genes could be taken to a more detailed laboratory analysis…

Databases describing functions: These group genes into classes or pathways. The databases can be queried to see which genes are in a certain class / pathway. You can also check which classes a certain gene belongs to.
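
The two query directions (class to genes, gene to classes) can be sketched with a pair of lookup tables. The gene names and annotations below are placeholders, not real database content.

```python
from collections import defaultdict

# Placeholder annotations: gene -> set of classes / pathways.
gene_to_classes = {
    "geneA": {"apoptosis", "nucleus"},
    "geneB": {"apoptosis"},
    "geneC": {"protein synthesis"},
}


def invert(mapping):
    """Turn a gene -> classes table into a class -> genes table."""
    inverted = defaultdict(set)
    for gene, classes in mapping.items():
        for cls in classes:
            inverted[cls].add(gene)
    return dict(inverted)


class_to_genes = invert(gene_to_classes)

print(sorted(class_to_genes["apoptosis"]))   # which genes are in a class?
print(sorted(gene_to_classes["geneA"]))      # which classes does a gene belong to?
```

Keeping both tables makes either question a single lookup, which is essentially what the web interfaces offer through their search forms.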

Databases summary Nucleotide databases Genome databases Protein databases Protein motif / domain databases Function related databases

WAKE UP!