Protein Bioinformatics Course


Matthew Betts & Rob Russell, AG Russell (Protein Evolution)

Course overview
  Day 1 - Modularity
  Day 2 - Interactions
  Day 3 - Modularity & Interactions
  Day 4 - Structure
  Day 5 - Structure & Interactions

Daily schedule
  10:00-11:00 lecture
  11:00-12:00 work on exercises in pairs
  12:00-13:00 lunch
  13:00-15:30 work on exercises in pairs
  16:00-17:00 presentations by you

Protein Sequence Databases

Database Searching
  Homologues = proteins with a common ancestor
  Homology --> similar function
  Sequence similarity --> homology
  Find homologues using:
    BLAST
    Profile searching
  www.proteinmodelportal.org

Scores and E-values
  Score: how similar is my sequence to one in the database?
  E-value: how often would I expect to get >= this score by chance alone (cf. random sequences)?
    E = 1: one such match by chance
    E < 0.01: significant
  Depends on:
    Database
      size: larger = better
      composition (random assumed)
    Alignment
      Substitution matrix
      Gap penalties
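The dependence of the E-value on score and database size can be made concrete with the standard Karlin-Altschul formula, E = K * m * n * exp(-lambda * S). The sketch below is illustrative only: the function names are hypothetical, and the values of lambda and K are assumed (they are typical of BLOSUM62 with gap penalties 11/1; real values depend on the substitution matrix and gap penalties used).

```python
import math

# Assumed Karlin-Altschul parameters, for illustration only.
# Real values depend on the substitution matrix and gap penalties.
LAMBDA = 0.267
K = 0.041


def expected_hits(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """E-value: expected number of chance matches scoring >= raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalised score that is comparable across scoring systems."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    query_len = 300          # residues in the query (hypothetical)
    db_len = 100_000_000     # residues in the database (hypothetical)
    for score in (60, 80, 100, 120):
        e = expected_hits(score, query_len, db_len)
        print(f"raw score {score:4d}  bits {bit_score(score):6.1f}  E = {e:.3g}")
```

For a fixed raw score, E scales linearly with the size of the search space and falls exponentially as the score rises, which is why the same alignment can have different E-values in different databases.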

Homology comes in two main types: Orthology and Paralogy What is the difference and why does this matter?

[Figure: gene trees combining duplication and speciation events; branches are labelled Orthologues or Paralogues]

Different Fates
  Orthologues: both copies required (one in each species)
    conservation of function ('same gene')
    adaptation to new environment
    Easier to transfer knowledge of function between orthologues
  Paralogues:
    Both copies useful
      conservation of function
    One copy freed from selection
      disabled
      new function
    Different parts of each free from selection
      function split between them

Assignment of orthology / paralogy can be complicated by:
  duplication preceding speciation
  lineage-specific deletions of paralogues
  complete genome duplications
  many-to-one relationships
  multi-domain proteins

Homology is usually found by sequence similarity, but proteins with dissimilar sequences can still be homologous.
Betts, Guigo, Agarwal, Russell, EMBO J 2001

Proteins are modular Since the early 1970s it has been observed that protein structures are divided into discrete elements or domains that appear to fold, function and evolve independently.

Given a sequence, what should you look for?
  Functional domains (Pfam, SMART, COGs, CDD, etc.)
  Intrinsic features
    Signal peptide, transit peptides (SignalP)
    Transmembrane segments (TMpred, etc.)
    Coiled-coils (COILS server)
    Low complexity regions, disorder (e.g. SEG, DisEMBL)
  Hints about structure?
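To give a feel for what a "low complexity region" is, the sketch below flags sequence windows with low Shannon entropy. This is a simplified illustration and not the SEG algorithm itself; the function names are hypothetical, and the window size (12) and threshold (2.2 bits) are assumptions borrowed loosely from commonly quoted SEG defaults.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, size=12, threshold=2.2):
    """Return 0-based start positions of windows with entropy <= threshold.

    size and threshold are assumed values, not a validated parameterisation.
    """
    return [
        i
        for i in range(len(seq) - size + 1)
        if shannon_entropy(seq[i:i + size]) <= threshold
    ]


if __name__ == "__main__":
    # Hypothetical sequence: a diverse stretch followed by a poly-Q/poly-S run.
    seq = "MKTAYIAKQRQISFVKSHFSRQ" + "QQQQQQSSSSQQQQQQ" + "LEERLGLIEVQAPILS"
    print(low_complexity_windows(seq))
```

A repetitive stretch uses few distinct residues, so its entropy is low; a typical globular region uses many and scores much higher.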

Given a sequence, what should you look for?
[Figure: SMART domain ‘bubblegram’ for human fibroblast growth factor (FGF) receptor 1 (type P11362 into web site: smart.embl.de), with labels:]
  “Low sequence complexity” (linker regions? flexible? junk?)
  Transmembrane segment (crosses the membrane)
  Signal peptide (secreted or membrane attached)
  Tyrosine kinase (phosphorylates Tyr)
  Immunoglobulin domains (bind ligands?)

Protein Modularity
  Discrete structural and functional units found in different combinations in different proteins
  [Figure: receptor-related tyrosine kinase and non-receptor tyrosine kinases]
  Consider separately in predictions

Finding Protein Domains
  Through partial matches to whole sequences
  Compare to databases of domains (Pfam, SMART, InterPro)
  Domains can be separated by:
    low-complexity and disordered regions (SEG)
    trans-membrane regions (TMAP)
    coiled-coils (COILS)
  [Figure: query sequence and match]
  Repeat searches using each domain separately

12 000 domain alignments make sequence searching easier
[Figure: WPP domain alignment]
Alignments provide more information about a protein family than a single sequence, and thus allow for more sensitive searches. Domain alignments also (normally) lack low-complexity or disordered regions and other domains that can make single-sequence searches confusing.

Finding domains in a sequence

Cryptic domains: at the border of sequence detectability
Identified using more sensitive fold recognition methods that use structure to help find weak members of sequence families. If Pfam or SMART or similar do not find a domain, and the region is probably not disordered, then fold recognition might help.
Gallego et al, Mol Sys Biol 2010

Domain peptide interactions Recognition of ligands or targeting signals Post-translational modifications

Linear motifs
Peptides interacting with a common domain often show a common pattern or motif, usually 3-8 aas.

SH3-interacting motif: PxxP
  3BP1_MOUSE/528-537    APTMPPPLPP
  PTN8_MOUSE/612-629    IPPPLPERTP
  SOS1_HUMAN/1149-1157  VPPPVPPRRR
  NCF1_HUMAN/359-390    SKPQPAVPPRPSA
  PEXE_YEAST/85-94      MPPTLPHRDW

[Figure labels: “instance”, “motif”, “perpetrator”, “victim”]
Puntervoll et al, NAR, 2003; www.elm.org (Eukaryotic Linear Motif DB)
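A motif such as PxxP can be written as a regular expression, with "x" meaning any residue. The sketch below scans the five example peptides for PxxP instances. The function name is hypothetical, positions are 1-based within each peptide fragment (not within the full-length protein), and a lookahead is used so that overlapping instances are all reported.

```python
import re

# PxxP as a regular expression; the lookahead lets overlapping hits be found.
PXXP = re.compile(r"(?=(P..P))")

# Example peptides from the slide (fragments, not full-length proteins).
PEPTIDES = {
    "3BP1_MOUSE": "APTMPPPLPP",
    "PTN8_MOUSE": "IPPPLPERTP",
    "SOS1_HUMAN": "VPPPVPPRRR",
    "NCF1_HUMAN": "SKPQPAVPPRPSA",
    "PEXE_YEAST": "MPPTLPHRDW",
}


def find_motif(seq, pattern=PXXP):
    """Return (1-based start, matched residues) for every motif instance."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(seq)]


if __name__ == "__main__":
    for name, seq in PEPTIDES.items():
        print(f"{name:12s} {seq:14s} {find_motif(seq)}")
```

Every peptive in this set contains at least one PxxP instance, but a pattern match on its own is not evidence that the site is functional.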

Linear motifs versus domains Domains: large globular segments of the proteome that fold into discrete structures and belong in sequence families. Linear motifs: small, non-globular segments that do not adopt a regular structure, and aren’t homologous to each other in the way domains are. Motifs lie in the disordered part of the proteome.

Intrinsically unstructured or disordered proteins or protein fragments

Disorder predictors (IUPred, RONN, DISOPRED, etc.)

Linear motif mediated interactions are everywhere
  Include motifs for:
    Targeting – e.g. KDEL
    Modifications – e.g. phosphorylation
    Signaling – e.g. SH3
  About 200 are currently known, likely many more still to be discovered
Neduva & Russell, Curr. Opin. Biotech, 2006

Finding linear motifs in a sequence
Linear motifs are much harder to find than domains.
  Domains: long (>30 AA), belong to sequence families that help detect new family members
  Linear motifs: short (typically <8 AA), simple patterns; e.g. PxxP will occur in most sequences randomly
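A rough calculation shows why a pattern like PxxP turns up by chance. The sketch below assumes residues are independent and that proline has a frequency of 0.05 (a uniform 1-in-20 assumption, not a measured value); the function names are hypothetical.

```python
def expected_matches(seq_len, motif_len=4, fixed_positions=2, residue_freq=0.05):
    """Expected number of chance matches to a motif in a random sequence.

    Assumes independent residues and the same (assumed) frequency for each
    fixed position; wildcard positions match anything.
    """
    windows = max(seq_len - motif_len + 1, 0)
    return windows * residue_freq ** fixed_positions


def prob_at_least_one(seq_len, motif_len=4, fixed_positions=2, residue_freq=0.05):
    """Approximate chance of >= 1 match, treating windows as independent."""
    windows = max(seq_len - motif_len + 1, 0)
    p = residue_freq ** fixed_positions
    return 1.0 - (1.0 - p) ** windows


if __name__ == "__main__":
    for length in (100, 403, 1000):
        print(length, expected_matches(length), round(prob_at_least_one(length), 2))
```

Under these assumptions a protein of about 400 residues is expected to contain roughly one PxxP by chance alone, so a match needs supporting context (for example, location in a disordered region) before it is taken seriously.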

www.russelllab.org/wiki