Repeats and composition bias

Slides:

Advertisements

Similar presentations

The FREAKS Session 3.1: Repeats Session 3.2: Biased regions Miguel Andrade Johannes-Gutenberg University of Mainz of PROTEIN SEQUENCE.

Advertisements

Bioinformatics Unit 1: Data Bases and Alignments Lecture 2: “Homology” Searches and Sequence Alignments.

©CMBI 2005 Exploring Protein Sequences - Part 2 Part 1: Patterns and Motifs Profiles Hydropathy Plots Transmembrane helices Antigenic Prediction Signal.

Secondary structure prediction. Amino acid sequence -> Secondary structure Alpha helix Beta strand Disordered/coil 70% accuracy 1991, 81% accuracy in.

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Homology Modeling Anne Mølgaard, CBS, BioCentrum, DTU.

Expect value Expect value (E-value) Expected number of hits, of equivalent or better score, found by random chance in a database of the size.

Multiple alignment June 29, 2007 Learning objectives- Review sequence alignment answer and answer questions you may have. Understand how the E value may.

Protein Modules An Introduction to Bioinformatics.

Sequence comparisons June 23, 2009 Learning objectives-Understand the concept of sliding window programs. Understand difference between identity, similarity.

Computational Biology, Part 2 Sequence Comparison with Dot Matrices Robert F. Murphy Copyright  1996, All rights reserved.

PREDICTION OF PROTEIN FEATURES Beyond protein structure (TM, signal/target peptides, coiled coils, conservation…)

Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.

Assessment of sequence alignment Lecture Introduction The Dot plot Matrix visualisation matching tool: – Basics of Dot plot – Examples of Dot plot.

© Wiley Publishing All Rights Reserved.

Inferring function by homology The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology.

Sequence Analysis Alignments dot-plots scoring scheme Substitution matrices Search algorithms (BLAST)

Assessment of sequence alignment Lecture Introduction The Dot plot Matrix visualisation matching tool: – Basics of Dot plot – Examples of Dot plot.

Protein Bioinformatics Course

Protein domains. Protein domains are structural units (average 160 aa) that share: Function Folding Evolution Proteins normally are multidomain (average.

Multiple Sequence Alignment School of B&I TCD May 2010.

Practical session 2b Introduction to 3D Modelling and threading 9:30am-10:00am 3D modeling and threading 10:00am-10:30am Analysis of mutations in MYH6.

Protein Sequence Alignment and Database Searching.

Low-complexity and Repetitive Regions n OraLee Branch n John Wootton n NCBI n

The FREAKS Session 3.1: Repeats Session 3.2: Biased regions Miguel Andrade Johannes-Gutenberg University of Mainz of PROTEIN SEQUENCE.

Sequence Alignment Techniques. In this presentation…… Part 1 – Searching for Sequence Similarity Part 2 – Multiple Sequence Alignment.

Sequence analysis: Macromolecular motif recognition Sylvia Nagl.

Repeats and composition bias. Repeats Frequency 14% proteins contains repeats (Marcotte et al, 1999) 1: Single amino acid repeats. 2: Longer imperfect.

Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.

Sequence-based Similarity Module (BLAST & CDD only ) & Horizontal Gene Transfer Module (Ortholog Neighborhood & GC content only)

Web Databases for Drosophila Introduction to FlyBase and Ensembl Database Wilson Leung6/06.

Copyright OpenHelix. No use or reproduction without express written consent1.

Manually Adjusting Multiple Alignments Chris Wilton.

Basic Local Alignment Search Tool BLAST Why Use BLAST?

Russell Group, Protein Evolution _________ ____ Rob Russell Cell Networks University of Heidelberg Interactions and Modules: the how and why of molecular.

Repeats and composition bias. Repeats Frequency 14% proteins contains repeats (Marcotte et al, 1999) 1: Single amino acid repeats. 2: Longer imperfect.

Point Specific Alignment Methods PSI – BLAST & PHI – BLAST.

Blast 2.0 Details The Filter Option: –process of hiding regions of (nucleic acid or amino acid) sequence having characteristics.

V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.

V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.

Multiple Sequence Alignment Carlow IT Bioinformatics November 2006.

Homology 3D modeling and effect of mutations. X-ray crystallography (70,714 in PDB) need crystals Nuclear Magnetic Resonance (NMR) (9,312) proteins in.

InterPro Sandra Orchard.

Graphical representation of an amino acid or nucleic acid multiple sequence alignment developed by Tom Schneider and Mike.

Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,

Homology 3D modeling and effect of mutations Miguel Andrade Faculty of Biology, Johannes Gutenberg University Institute of Molecular Biology Mainz, Germany.

Sequence: PFAM Used example: Database of protein domain families. It is based on manually curated alignments.

Protein domains Miguel Andrade Mainz, Germany Faculty of Biology,

Repeats and composition bias

Homology 3D modeling Miguel Andrade Mainz, Germany Faculty of Biology,

Protein 3D representation

Prediction of protein features. Beyond protein structure

Secondary structure prediction

polyQ and other homorepeats

Protein domains Miguel Andrade Mainz, Germany Faculty of Biology,

Protein 3D representation

Repeats and composition bias

Secondary structure prediction

Protein 3D representation

Protein domains Miguel Andrade Mainz, Germany Faculty of Biology,

Homology 3D modeling and effect of mutations

Protein Bioinformatics Course

Dot Plots Dot Plots provide a graphic view of the amount of similarity between two sequences. The two axes represent the two sequences. In its simplest.

Basic Local Alignment Search Tool

Protein structure prediction.

Homology 3D modeling Miguel Andrade Mainz, Germany Faculty of Biology,

Protein domains Miguel Andrade Mainz, Germany Faculty of Biology,

Protein 3D representation

Examining repeats with databases

Nora Pierstorff Dept. of Genetics University of Cologne

SUR-8, a Conserved Ras-Binding Protein with Leucine-Rich Repeats, Positively Regulates Ras-Mediated Signaling in C. elegans Derek S Sieburth, Qun Sun,

Presentation transcript:

Repeats and composition bias Miguel Andrade Faculty of Biology, Johannes Gutenberg University Mainz, Germany andrade@uni-mainz.de

Repeats

Frequency 14% proteins contains repeats (Marcotte et al, 1999) 1: Single amino acid repeats. 2: Longer imperfect tandem repeats. Assemble in structure.

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIMCHEKSPSVCSPLNMTSSVCSPAGINSVSSTTASF GSFPVHSPITQGTPLTCSPNVENRGSRSHSPAHASNVGSPLSSPLSSMKSSISSPPSHCSVKSPVSSPNNVTLRSSVSSPANINN

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIMCHEKSPSVCSPLNMTSSVCSPAGINSVSSTTASFGSFPVHSPITQGTPLTCSPNVENRGSRSHSPAHASNVGSPLSSPLSSMKSSISSPPSHCSVKSPVSSPNNVTLRSSVSSPANINN

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIM CHE KSPSVCSPLN MTSSVCSPAG INSVSSTTASF GSFPVHSPIT Q GTPLTCSPNV EN RGSRSHSPAH ASN VGSPLSSPLS S MKSSISSPPS HCS VKSPVSSPNN VT LRSSVSSPAN INN

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIM CHE KSPSVCSPLN MTSSVCSPAG INSVSSTTASF GSFPVHSPIT Q GTPLTCSPNV EN RGSRSHSPAH ASN VGSPLSSPLS S MKSSISSPPS HCS VKSPVSSPNN VT LRSSVSSPAN INN

Tandem repeats fold together

Tandem repeats fold together

Tandem repeats fold together

Tandem repeats fold together

Tandem repeats fold together

Tandem repeats fold together

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIM CHE KSPSVCSPLN MTSSVCSPAG INSVSSTTASF GSFPVHSPIT Q GTPLTCSPNV EN RGSRSHSPAH ASN VGSPLSSPLS S MKSSISSPPS HCS VKSPVSSPNN VT LRSSVSSPAN INN

http://weblogo.berkeley.edu (Vlassi et al, 2013) 15

A subunit PP2A structure PDB:1b3u Groves et al. (1999) Cell

Ap1 Clathrin Adaptor Core PDB:1w63 Heldwein et al. (2004) PNAS

Ap1 Clathrin Adaptor Core PDB:1w63 Heldwein et al. (2004) PNAS

i-TASSER model of D. melanogaster thr protein Based on PDB 4BUJ chain B

PDB 4BUJ Ski complex (yeast)

Andrade et al. (2001) J Struct Biol

Definition CBRs Perfect repeat: QQQQQQQQQQQ Imperfect: QQQQPQQQQQQ Amino acid type: DDDDDEEEDEDEED Compositionally biased regions (CBRs) High frequency of one or two amino acids in a region. Particular case of low complexity region

Detection CBRs Sometimes straightforward. N-terminal human Huntingtin. How many CBRs can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection CBRs Sometimes straightforward. N-terminal human Huntingtin. How many CBRs can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection CBRs Sometimes straightforward. N-terminal human Huntingtin. How many CBRs can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection CBRs Sometimes straightforward. N-terminal human Huntingtin. How many CBRs can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection repeats Sometimes straightforward. N-terminal human Huntingtin. How many repeats can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection repeats Often NOT straightforward. N-terminal human Huntingtin. How many repeats can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection repeats Often NOT straightforward. N-terminal human Huntingtin. How many repeats can you find? EFQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKA CRPYLVNLLPCLTRTSKRP-EESVQETLAAAVPKIMAS NDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHS TQYFYSWLLNVLLGLLVPVEDEHSTLLILGVLLTLRYL PSAEQLVQVYELTLHHTQHQDHNVVTGALELLQQLFRT

Detection repeats Often NOT straightforward. N-terminal human Huntingtin. How many repeats can you find? EFQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKA CRPYLVNLLPCLTRTSKRP-EESVQETLAAAVPKIMAS NDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHS TQYFYSWLLNVLLGLLVPVEDEHSTLLILGVLLTLRYL PSAEQLVQVYELTLHHTQHQDHNVVTGALELLQQLFRT

Detection of repeats Dotplots Comparing a sequence against itself

Detection of repeats Dotplots TLRSSVSSPANINNS NMTSSVCSPANISV

Detection of repeats Dotplots TLRSSVSSPANINNS | 1 match NMTSSVCSPANISV

Detection of repeats Dotplots TLRSSVSSPANINNS ||| ||||| NMTSSVCSPANISV 8 matches NMTSSVCSPANISV

Detection of repeats Dotplots TLRSSVSSPANINNS | | NMTSSVCSPANISV | | 2 matches NMTSSVCSPANISV

Detection of repeats Dotplots TLRSSVSSPANINNS | 1 match NMTSSVCSPANISV

Detection of repeats Dotplots TLRSSVSSPANINNS NMTSSVCSPANISV 8

Detection of repeats Dotplots TLRSSVSSPANINNS NMTSSVCSPANISV 1821

Exercise 1

Exercise 1. Using Dotlet with the human mineralocorticoid receptor (MR) Go to the Dotlet web page: http://dotlet.vital-it.ch Click on the input button and paste the sequence of the human mineralocorticoid receptor (UniProt id P08235) Click on the “compute” button Try to find combinations of parameters that show patterns in the dot plot (Hint: You can adjust this finely using the arrows) Find repetitions clicking in the diagonal patterns

Exercise 1. Using Dotlet with the human mineralocorticoid receptor (MR)

Detection of repeats Using a multiple sequence alignment helps. Conserved repeated patterns JalView with Regular Expression searches

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches Regular Expressions: [LS]P.A matches L or S, followed by P, followed by anything, followed by A

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches Regular Expressions: [LS]P.A matches L or S, followed by P, followed by anything, followed by A Which one is not matched? LPTA, SPAA, LPPA, LPAP, SPLA

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches Regular Expressions: [LS]P.A matches L or S, followed by P, followed by anything, followed by A Which one is not matched? LPTA, SPAA, LPPA, LPAP, SPLA

Exercise 2. Using JalView with a MSA of the MR with orthologs Load the multiple sequence alignment of the MR in JalView: MR1_fasta.txt (from URL: https://cbdm.uni-mainz.de/ files/2015/02/MR1_fasta.txt) Use the “Select > find" (of Ctrl+F) option with a regular expression and mark all matches (click the “Find all” option!) Try to find the expression that matches more repeats. How many repeats do you see? How long are they? Would you correct the alignment based on these findings?

Vlassi et al. (2013) BMC Struct. Biol. * * * * #F1 #F2 #F3 #F4 #T8 #T9 #T10 #T11 #T12 #T13 #T14 #T15 * * * #F5 #F6 #F7 #F8 #F9 #F10 #F11 Vlassi et al. (2013) BMC Struct. Biol. 49

Mineralocorticoid receptor Repeat region AF1a ID AF1b DBD LBD 984 aa NTD 100 200 300 400 500 600 700 800 900 1000 aa Vlassi et al. (2013) BMC Struct. Biol. 50

51 51

Composition bias

Definition 14% proteins contains repeats (Marcotte et al, 1999) 1: Single amino acid repeats. 2: Longer imperfect tandem repeats. Assemble in structure.

Definition CBRs Perfect repeat: QQQQQQQQQQQ Imperfect: QQQQPQQQQQQ Amino acid type: DDDDDEEEDEDEED Compositionally biased regions (CBRs) High frequency of one or two amino acids in a region. Particular case of low complexity region

Function CBRs Conservation => Function Length, amino acid type not necessarily conserved Frequency: 1 in 3 proteins contains a compositionally biased region (Wootton, 1994), ~11% conserved (Sim and Creamer, 2004)

Function CBRs Conservation => Function Length, amino acid type not necessarily conserved Functions: Passive: linkers Active: binding, mediate protein interaction, structural integrity (Sim and Creamer, 2004)

Structure of CBRs Often variable or flexible: do not easily crystalize

1CJF: profilin bound to polyP

2IF8: Inositol Phosphate Multikinase Ipk2

2IF8: Inositol Phosphate Multikinase Ipk2 RVSETTTSGSL

2CX5: mitochondrial cytochrome c B subunit N-terminal

2CX5: mitochondrial cytochrome c B subunit N-terminal FFFFIFVFNF

Amino acid repeats Distribution is not random: Eukaryota: Most common: poly-Q, poly-N, poly-A, poly-S, poly-G Prokaryota: Most common: poly-S, poly-G, poly-A, poly-P Relatively rare: poly-Q, poly-N Very rare or absent in both eukaryota and prokaryota: Poly-I, Poly-M, Poly-W, Poly-C, Poly-Y Toxicity of long stretches of hydrophobic residues. (Faux et al 2005)

Amino acid repeats Pablo Mier Mier et al. (2017) Proteins

Filtering out CBRs Normally filtered out as low complexity region: they give spurious BLAST hits QQQQQQQQQQ |||||||||| QQQQQQQQQQ 10/10 id IDENTITIES IDENTITIES 10/10 id

Filtering out CBRs Normally filtered out as low complexity region: they give spurious BLAST hits QQQQQQQQQQ |||||||||| QQQQQQQQQQ Shuffle: 10/10 id IDENTITIES IDENTITIES 10/10 id

Filtering out CBRs Normally filtered out as low complexity region: they give spurious BLAST hits QQQQQQQQQQ |||||||||| QQQQQQQQQQ Shuffle: 10/10 id IDENTITIES | | SIINDIETTE Shuffle: 2/10 id

Filtering out CBRs Option for pre-BLAST treatment SEG algorithm: 1) Identify sequence regions with low information content over a sequence window 2) Merge neighbouring regions Eliminates hits against common acidic-, basic- or proline-rich regions (Wootton and Federhen, 1993)

AIR9 Microtubule localization of Δx-GFP (1708 aa) Ser rich + basic conserved region LRR A9 repeats Δ1 Δ3 Δ2 Δ15 Δ9 Δ10 Δ6 Δ11 Δ12 Δ14 Δ16 Buschmann, et al (2006). Current Biology. Buschmann, et al (2007). Plant Signaling & Behavior Microtubule localization of Δx-GFP

Homorepeats are frequent but difficult to characterize Pablo Mier e.g. polyQ: MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQP 10% of human proteins have homorepeats lack sequence conservation not possible to predict function by homology 30% residues are disordered 50% human proteins have one long disordered segment Homorepeats need to be studied in context

Schaefer et al (2012) Nucleic Acids Res. Function of polyQ Martin Schaefer polyQ in Huntingtin Human Dog Mouse Opossum Chicken Frog Zebrafish Trout Fugu Stickleback Lancelet Capitella Limpet Nematostella Trichoplax Ciona intestinalis Ciona savignyi D. melanogaster D. mojavensis D. sechellia D. erecta D. yakuba D. grimshawi D. pseudoobscura D. persimilis D. ananassae D. willistoni D. virilis Schaefer et al (2012) Nucleic Acids Res.

Exercise 3. Search for a polyQ insertion in the MR family Open in jalview the alignment of the mineralocorticoid receptor: MR1_fasta.txt Find a polyQ insertion. Do you see any other biased region nearby?