Protein sequence databases. Petri Törönen. Shamelessly copied from material done by Eija Korpelainen. This also includes old material from my thesis (www.hytti.uku.fi/~toronen/Gradu_verkkoon.zip).


Protein sequence databases
Petri Törönen
Shamelessly copied from material done by Eija Korpelainen. This also includes old material from my thesis and from CSC bio-opas.

Why protein sequences?
- Most (laboratory) analysis is done with nucleotide sequences.
- Therefore analysis at the nucleotide level is natural.

But there are drawbacks
- Divergence in codons => same protein, different nucleotide sequence!
- Similarity between different amino acids
Therefore not all of the similarity is visible at the nucleotide level!
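
The first drawback can be illustrated with a small sketch. The two nucleotide sequences below are made up for this example; they use different codons for the same amino acids, so they differ at the nucleotide level but translate to the same protein. The standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, written in the usual TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA string codon by codon, starting at the first base."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# Two made-up sequences that use different codons in three positions.
seq_a = "ATGGCTACCTGGCTG"
seq_b = "ATGGCAACATGGTTA"

# Count the nucleotide positions where the two sequences disagree.
mismatches = sum(a != b for a, b in zip(seq_a, seq_b))

print("nucleotide mismatches:", mismatches)
print("protein A:", translate(seq_a))
print("protein B:", translate(seq_b))
```

A nucleotide-level comparison sees several mismatches here, while a protein-level comparison sees two identical sequences.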

…more…
- Protein databases often also include more detailed information.
- The protein (not the RNA) is often the actual functional unit that has a biological function.
  - Note the exceptions, like structural RNAs.

Protein databases
- SwissProt
- TrEMBL
- PIR-PSD
SwissProt and TrEMBL (Translated EMBL) have been unified to UniProt. THIS INFO IN PART ERRONEOUS! SwissProt is still also available as a separate entity.

Differences between databases
- Some include all the available information (more or less reliable information)
  - large coverage, everything is stored in the database
  - lower reliability, information has not been confirmed
  - computer annotation => fast updating
- Some cover only the reliable information
  - small coverage
  - information is reliable
  - expert curation => slow updating
SwissProt – TrEMBL – RemTrEMBL

Why is SwissProt nice?
- Sequences are manually annotated and checked
- No multiple entries for the same sequence
- Annotations include protein function, post-translational modifications, active sites etc.
- Linked to many other databases

So how to search protein sequences from the available databases?
- Search with a protein name
- Search with a protein's function / descriptive words
- Search with a protein/RNA sequence
The next slides handle the first two options…

Ways to access Swiss/UniProt
- Expasy server for UniProt. Note that the page includes links to ’full text search’ and to ’advanced search’.
- Power Search to the UniProt database
- One of the SRS servers available on the WWW

SRS: Sequence Retrieval System
- Allows searching several databases, not limited to SwissProt!
- AND, OR, BUTNOT type boolean operations can be used in the search (useful with keywords) => works with sequence names and with complex keyword queries.
- The obtained results can be processed further:
  - linking to a new set of databases
  - sequence analysis, sequence alignment

Select ’start a temporary project’

Select database(s). Here I select SwissProt. Note that other databases can also be searched with SRS! The available databases vary between the different SRS servers.

Enter the query for finding the sequence. Here I search with the sequence name (csk_mouse). The search goes through all the text fields (AllText) in the SwissProt files. These are the available fields that can be searched with the search term.

Obtained result: available information on the sequence. More information from here.

The obtained result demonstrates the detailed information available from SwissProt. Note that the stored information includes
- information on the organism
- gene name, gene description
- links to the articles discussing the sequence
- the comments part, with a detailed description of function and tissue localization
- the features part, with a detailed description of domains and various functional components

SRS: search with boolean operators (AND, OR, BUTNOT)
- Queries can be combined with & (= AND), | (= OR), ! (= BUTNOT)
- Different rows are also combined (by default) with AND
- The example looks for proteins whose organism name is either mouse OR rat. In addition, the description field must include the words receptor AND kinase BUTNOT tyrosine.
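
The logic of the example query can be sketched in a few lines of Python. This is not how SRS is implemented; the records, their field names and their contents are invented here just to show how the boolean operators combine.

```python
# Hypothetical, simplified records: an organism field and a description field.
records = [
    {"id": "REC1", "organism": "mouse", "description": "receptor serine/threonine kinase"},
    {"id": "REC2", "organism": "rat", "description": "receptor tyrosine kinase"},
    {"id": "REC3", "organism": "human", "description": "receptor kinase"},
    {"id": "REC4", "organism": "rat", "description": "activin receptor kinase"},
    {"id": "REC5", "organism": "mouse", "description": "tyrosine phosphatase"},
]


def matches(record):
    """Return True if the record satisfies the example query."""
    words = set(record["description"].lower().replace("/", " ").split())
    # Row 1: organism is mouse | rat
    organism_ok = record["organism"] in {"mouse", "rat"}
    # Row 2: description is receptor & kinase ! tyrosine
    description_ok = ("receptor" in words and "kinase" in words
                      and "tyrosine" not in words)
    # Different rows are combined with AND
    return organism_ok and description_ok


hits = [record["id"] for record in records if matches(record)]
print(hits)
```

REC2 is dropped because of BUTNOT tyrosine, REC3 because of the organism row, and REC5 because it lacks the required words.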

Further linking to other databases
- We can link the obtained results with other databases by going further from this link.
- Go to the results of the previous search.

Selection of sequences that have a known 3D structure
1. The subfolder with protein databases is opened by selecting ’protein function structure and interactions databases’.
2. The box next to the PDB database is selected with the mouse.
3. Here we filter the obtained results down to the ones that have a link to a 3D structure.
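
The same kind of filtering can also be done on downloaded entries. The sketch below assumes SwissProt-style flat-file text, where cross-references are on DR lines and entries end with //. The two entries are heavily shortened and made up (identifiers and accessions are placeholders), and the function name is hypothetical.

```python
# Two shortened, made-up flat-file entries; real entries have many more lines.
flat_file = """\
ID   EXAMPLE1_MOUSE          Reviewed;         450 AA.
DR   EMBL; X00001; -; mRNA.
DR   PDB; 1ABC; X-ray; 2.50 A; A=1-450.
//
ID   EXAMPLE2_RAT            Reviewed;         300 AA.
DR   EMBL; X00002; -; mRNA.
//
"""


def entries_with_pdb_link(text):
    """Return the IDs of entries that carry a PDB cross-reference line."""
    ids = []
    for entry in text.split("\n//"):
        lines = [line for line in entry.splitlines() if line.strip()]
        if not lines:
            continue
        # The entry name is the second token on the ID line.
        entry_id = next((line.split()[1] for line in lines
                         if line.startswith("ID ")), None)
        has_pdb = any(line.startswith("DR   PDB;") for line in lines)
        if entry_id and has_pdb:
            ids.append(entry_id)
    return ids


print(entries_with_pdb_link(flat_file))
```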

Summary
- Protein databases show detailed information on protein sequences.
- UniProt/SwissProt is the recommended protein database
  - manually curated
  - non-overlapping
- SRS is a method for searching information from selected databases with search terms.
- Word of warning: sometimes SRS does not work as nicely as hoped!

Search of the protein databases with sequences
So what can be done if we have a sequence that we know nothing about? We can look for similar known proteins in the databases. This can be done directly with protein sequences. (Database searching is probably handled in more detail later. Sorry for the wrong order!)

Nucleotide to amino acids
If you have produced a nucleotide seq. in the laboratory, you might still want to compare it to protein sequences for the reasons given earlier (slide 3). You have two options:

1. Use tools (like BLASTX, FastX) that automatically compare the nucleotide seq. to amino acid databases. These can find sequence similarities even when they move from one reading frame to another.
   => Simple: you don’t have to worry about translating the sequence (see below). BLASTX and FastX are explained in more detail later.
2. Translate the seq. using available tools (for example )
   - required with tools that accept only protein sequences
   - remember that you do not know the reading frame! The correct reading frame can also shift from one frame to another (sequencing errors like addition or deletion of nucleotides)!
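
The warning about shifting reading frames can be made concrete with a small sketch. The sequence is made up and the standard genetic code is assumed. One nucleotide is deleted to mimic a sequencing error; after the deletion, frame 1 no longer gives the original protein, and the rest of the protein reappears in another frame.

```python
from itertools import product

# Standard genetic code, written in the usual TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna, offset=0):
    """Translate a DNA string, starting at the given 0-based offset."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(offset, len(dna) - 2, 3))


original = "ATGGCTACCTGGCTGAAAGGT"        # made-up sequence
mutated = original[:6] + original[7:]     # delete one nucleotide (simulated error)

print("original, frame 1:", translate(original))
print("mutated,  frame 1:", translate(mutated, 0))
print("mutated,  frame 3:", translate(mutated, 2))
```

In the mutated sequence frame 1 runs into a stop codon (*), while frame 3 recovers the end of the original protein. A tool that only looks at one frame would miss this.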

Automatic tools comparing nucl. seq. with a protein database
BLASTX
- looks for the most similar protein sequences for your nucleotide sequence by comparing all possible reading frames
- member of the BLAST program family

For nucleotide sequences, BLASTX can be found here. If you do a query with a protein sequence, then use this.

SEQUENCE: >embl|AB029485|AB Mus musculus ARIP1 mRNA for activin receptor interacting protein
The protein database (SwissProt) can be selected here.
You can find the seq. with Google using AB029485.

The next window is opened here.

Web page that is shown while waiting for the results.

- The colour figure shows where in our query sequence the matches to the database are located.
- The colour shows how good the score is.
- The E-value tells how many hits with an equally good score can be expected by chance.
- The alignment can be viewed from this link.
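
To get a feeling for the E-value, the sketch below uses the commonly quoted relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total length of the database. The lengths in the example are made up, and real BLAST uses corrected (effective) lengths, so this is only an approximation of what the program reports.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value from a bit score and the size of the search space."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Made-up search space: a 1024-residue query against about one million residues.
query_length = 2 ** 10
database_length = 2 ** 20

for score in (30, 40, 50):
    e_value = expected_hits(score, query_length, database_length)
    print(f"bit score {score}: E = {e_value:g}")
```

Every extra bit of score halves the E-value, and a larger database gives a larger E-value for the same score.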

- The alignment enables manual evaluation of the result.
- This is the link to the database that we searched, giving the full information on the sequence.

Changing the nucleotides to amino acids
Transeq requires you to paste the nucleotide sequence, to select the reading frame (1, 2 or 3) and to select the forward or reverse direction.
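
What the frame and direction choices mean can be sketched as follows. This is not Transeq itself: the function names are hypothetical, the standard genetic code is assumed, and the reverse frames here are simply counted from the start of the reverse complement, which may differ from the numbering a particular tool uses.

```python
from itertools import product

# Standard genetic code, written in the usual TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate_frame(dna, frame=1, reverse=False):
    """Translate one reading frame (1, 2 or 3) in forward or reverse direction."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    seq = reverse_complement(dna) if reverse else dna.upper()
    # Unknown codons (for example with N) are shown as X.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(frame - 1, len(seq) - 2, 3))


def six_frames(dna):
    """Translate all six reading frames of a DNA string."""
    return {(direction, frame): translate_frame(dna, frame, direction == "reverse")
            for direction in ("forward", "reverse")
            for frame in (1, 2, 3)}


frames = six_frames("ATGGCTACCTGGCTG")    # made-up sequence
for (direction, frame), protein in frames.items():
    print(direction, frame, protein)
```

Each of the six translations can then be used as a separate query against a protein database.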

An example sequence obtained from randomly typed g, a, c, t: DQLTCQSTVSAGLAWLAG MA
The sequences obtained from the different reading frames can be used to search protein databases...

Motif databases
- Motifs are conserved areas in functionally similar proteins.
- These are crucial parts for protein function
  - the protein cannot change them without changing the function
- Analysis of sequences with motifs can be more efficient when no close sequence relatives are found
  - recommended when a normal sequence search gives no results

What is a motif? (modified from Terri Attwood, 2002; modified from Eija Korpelainen...)
Areas with strong conservation between aligned sequences
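
The idea of "areas with strong conservation" can be sketched with a toy alignment. The four aligned sequences are made up, the function names are hypothetical, and gaps are not handled. Conservation is measured here simply as the fraction of sequences sharing the most common residue in each column; motif databases use more refined measures.

```python
from collections import Counter

# Made-up alignment: four sequences of equal length, no gaps.
alignment = ["MKVLAGCH",
             "MRVLSGCH",
             "MKILAGCH",
             "MQVLTGCH"]


def column_conservation(aln):
    """Fraction of sequences that share the most common residue, per column."""
    scores = []
    for column in zip(*aln):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


def conserved_runs(scores, threshold=1.0, min_length=3):
    """Return (start, end) pairs of consecutive columns at or above the threshold."""
    runs, start = [], None
    # A trailing 0.0 closes a run that reaches the end of the alignment.
    for i, score in enumerate(scores + [0.0]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs


scores = column_conservation(alignment)
print(scores)
for start, end in conserved_runs(scores):
    print("conserved block:", alignment[0][start:end])
```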

Motif databases
- BLOCKS
- PROSITE
- more...
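
PROSITE describes many motifs as short patterns, and such a pattern can be turned into a regular expression. The sketch below covers only the basic syntax elements (x for any residue, [..] for alternatives, {..} for excluded residues, (n) or (n,m) for repeats, < and > for the sequence ends). The function name, the pattern and the test sequence are illustrative and made up for this example.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern to a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")   # {P}   -> [^P]
        element = element.replace("(", "{").replace(")", "}")    # x(3)  -> x{3}
        element = element.replace("x", ".")                      # x     -> any residue
        element = element.replace("<", "^").replace(">", "$")    # sequence ends
        parts.append(element)
    return "".join(parts)


# Illustrative pattern in PROSITE-style syntax and a made-up sequence.
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
sequence = "AACKPCGKSFSRSDELTRHIRTHTG"

regex = prosite_to_regex(pattern)
match = re.search(regex, sequence)

print(regex)
if match:
    print("motif found at", match.span(), match.group())
```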

The subgroup ’Pattern and profile searches’ shows the list of protein motif analysis tools.

INTERPRO
- Combines many motif databases in one search.
- Can take a DNA or protein sequence.
Fragment of the BLASTX test sequence

Kinase-associated motifs
- PDZ domains: important for protein interactions
- WW domains: important for binding proteins