Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)

PSU Projects Organism Annotated genome Finished genome Database entry Artemis & ACT

Annotation workflow (flowchart):
Primary DNA sequence
- Dotter, BlastN, BlastX, gene finders, tRNA scan
- Repeats, pseudogenes, rRNA, CDSs, tRNA
- Fasta, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM
- Preannotation
- Manual curation
Annotated sequence

Gene model annotation Protein function

Annotation of protein-coding genes (from gene model to protein function):
- Search programs: local (BLAST) and global (FASTA) alignments, EST hits
- Protein domains and motifs: InterPro (Pfam, Prosite, SMART etc.)
- Transmembrane / signal peptide prediction (TMHMM, SignalP)
- Base annotation on characterised proteins where possible (manually curated UniProt entry)
- Read the literature (PubMed)
Use several lines of evidence!

Annotation of non-protein-coding genes (tRNAs, rRNAs, snRNAs, other ncRNAs):
- Initial searches: BlastN, GC plots, tRNA scan, sno scan, others
- Search in specialised databases: Rfam scan, microRNAdb etc.
- Comparative ncRNA prediction tools: RNAz, EvoFold, QRNA etc.
- Structure prediction of ncRNAs: Mfold, others
Use several lines of evidence.
Structural conservation of ncRNAs

FASTA (global) versus BLAST (local)

Statistical significance of database hits
- P-value: the estimated probability that the observed match could have occurred by chance.
- E-value: the number of hits with this score or better expected by chance (assuming a specific distribution of residues). An E-value of 5 means that you would expect 5 alignments with an equivalent or higher score to occur by random chance.
- Both are more reliable than the % ID.
- Statistical estimates like these are strongly influenced by the size and composition of both the query sequence and the database.
Caution:
- Repeat regions
- Transitive annotation of non-curated protein sequences
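The slide gives no formula, so the following is only a minimal sketch of how an E-value depends on score and search-space size. It assumes the standard Karlin-Altschul relation for ungapped local alignments, E = K * m * n * exp(-lambda * S), and the usual conversion P = 1 - exp(-E). The function names and the default `lam` and `k` values are illustrative placeholders, not parameters taken from the workshop material or from any particular BLAST run.

```python
import math


def expected_hits(score, query_len, db_len, lam=0.318, k=0.134):
    """E-value under the Karlin-Altschul model: E = K * m * n * exp(-lambda * S).

    lam and k are placeholder values; real searches use parameters that
    depend on the scoring matrix and gap penalties.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of seeing at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Same score and query, but the second database is twice as large.
    small = expected_hits(score=80, query_len=300, db_len=1_000_000)
    large = expected_hits(score=80, query_len=300, db_len=2_000_000)
    print(f"E (small db) = {small:.3g}, E (large db) = {large:.3g}")
    print(f"P for E = 5: {p_value(5):.4f}")
```

Doubling the database size doubles the E-value for the same score, which is one way to see why these estimates depend on the size of the database searched.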

Sequence similarity searching
BLAST (Basic Local Alignment Search Tool) analysis:
Nucleotide queries:
- blastn: nucleotide query vs nucleotide database
- blastx: nucleotide query translated in all 6 frames vs protein database
- tblastx: translated nucleotide query vs translated nucleotide database (all 6 frames each)
Protein queries:
- blastp: protein query vs protein database
- tblastn: protein query vs translated nucleotide database (all 6 frames)
FastA: provides sequence similarity and homology searching against nucleotide and protein databases using the Fasta programs. Fasta can be very specific when identifying long regions of low similarity, especially for highly diverged sequences.
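As an illustration of what "all 6 frames" means for the translated searches, here is a minimal sketch of six-frame translation using the standard genetic code. It is not how BLAST itself is implemented; the function names and the frame labels (+1 to +3, -1 to -3) are assumptions made for this example, and codons containing ambiguous bases are simply reported as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# (query translated?, database translated?) for each BLAST program.
TRANSLATED = {
    "blastn": (False, False),
    "blastp": (False, False),
    "blastx": (True, False),
    "tblastn": (False, True),
    "tblastx": (True, True),
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate three forward and three reverse-complement reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCAAGTAA").items():
        print(frame, protein)
```

A blastx-style search compares each of these six protein translations of the query against a protein database; tblastx does the same for both query and database, which is why it is the most expensive of the five programs.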

Protein profiling
Aligned sequences:
..HMPLKHPLHP..
..RMLLKHRPHP..
..GMRLKHGHHP..
..PMGLKHAGHP..
Profile:
..-M-LKH--HP..

Aligned sequences:
..HMPLKHRLHP..
..RMPLKHRPHP..
..GMRLKHRHHP..
..PMGLKHAGHP..
Profile:
..-MPLKHR-HP..
More sophisticated protein profiles score each amino acid in the motif.
Hidden Markov Models (HMMs): The HMM is a statistical model that considers all possible combinations of matches, mismatches and gaps to generate an alignment of a set of sequences.
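The difference between a strict consensus and a profile that scores each amino acid can be sketched with the four aligned sequences from this slide. This is only a simple column-frequency profile, assuming ungapped sequences of equal length; it is not an HMM, and the function names and the `min_fraction` threshold are hypothetical choices for illustration.

```python
from collections import Counter


def column_frequencies(aligned):
    """Return, for each alignment column, a dict of residue -> frequency."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    return [
        {residue: count / len(aligned) for residue, count in Counter(column).items()}
        for column in zip(*aligned)
    ]


def consensus(aligned, min_fraction=1.0):
    """Report the most common residue per column, or '-' below the threshold."""
    result = []
    for freqs in column_frequencies(aligned):
        residue, fraction = max(freqs.items(), key=lambda item: item[1])
        result.append(residue if fraction >= min_fraction else "-")
    return "".join(result)


if __name__ == "__main__":
    sequences = ["HMPLKHRLHP", "RMPLKHRPHP", "GMRLKHRHHP", "PMGLKHAGHP"]
    print(consensus(sequences))                    # only fully conserved columns
    print(consensus(sequences, min_fraction=0.5))  # also majority residues
    print(column_frequencies(sequences)[6])        # per-residue scores, column 7
```

With the strict threshold only the invariant positions survive; lowering it recovers the partly conserved P and R positions, and the per-column frequencies are the kind of information a scored profile retains.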

Profile-based predictors of protein domains / motifs
- Motif database in the form of regular expressions. Not necessarily the whole domain. K-x(12)-[DE] = lysine, any 12 residues, then aspartic acid or glutamic acid. Returns 1 or 0, i.e. very rigid, and can be very inaccurate for small simple motifs.
- Motif search tools based on Prosite but with multiple alignment profiling.
- Collection of HMMs, usually covering the whole domain.
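To make the "returns 1 or 0" behaviour of pattern matching concrete, the sketch below converts a pattern such as K-x(12)-[DE] into a Python regular expression. It handles only a subset of the pattern syntax (single residues, x, [..] alternatives, {..} exclusions, repeat counts and the < and > anchors); the function name is hypothetical and this is not the code used by any motif server.

```python
import re

# One pattern element: optional '<', a residue / x / [set] / {excluded set},
# an optional repeat count (n) or (n,m), and an optional '>'.
ELEMENT = re.compile(
    r"(<?)([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, residue, low, high, end = match.groups()
        if residue == "x":
            residue = "."                          # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"   # any residue except these
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + residue + repeat + ("$" if end else ""))
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("K-x(12)-[DE]")
    print(regex)
    hit = "MS" + "K" + "A" * 12 + "D" + "LL"    # 12 residues between K and D
    miss = "MS" + "K" + "A" * 11 + "Q" + "LL"   # only 11 residues, then Q
    print(bool(re.search(regex, hit)), bool(re.search(regex, miss)))
```

A sequence either contains the pattern or it does not; a single substitution at a defined position turns a hit into a miss, which is the rigidity the slide warns about.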

Functional assignment: domain architecture A B A B C A B C

InterPro Server: The ‘one-stop shop’ for accessing all major protein databases InterPro provides an integrated view of the commonly used signature databases, and has an interface for text- and sequence-based searches.

InterPro: member databases

Retrieving a sequence using SRS

The SignalP 3.0 Server:

The SignalP 3.0 output:

The TMHMMv2.0 Server:

The TMHMM v2.0 output: tabular part and graphical part

Module 3 Exercises:
Section A:
- Sequence retrieval of a P. falciparum protein (cyclophilin) using SRS
- BLAST and Fasta searches by cutting & pasting the sequence
Section B:
- Exercise 1 Part I (row 1): Search PROSITE server by cutting & pasting the cyclophilin sequence
- Exercise 1 Part II (row 2): Pfam server
- Exercise 1 Part III (row 3): SMART server
- Exercise 1 Part IV (row 4): InterPro server
- Exercise 2: Sequence retrieval of P. falciparum PFC0125w protein using SRS. TMHMM v2.0 server. SignalP v3.0 server.
Section C: Other web resources