Repeats and composition bias

Slides:



Advertisements
Similar presentations
Using phylogenetic profiles to predict protein function and localization As discussed by Catherine Grasso.
Advertisements

1 Genome information GenBank (Entrez nucleotide) Species-specific databases Protein sequence GenBank (Entrez protein) UniProtKB (SwissProt) Protein structure.
Integration of Protein Family, Function, Structure Rich Links to >90 Databases Value-Added Reports for UniProtKB Proteins iProClass Protein Knowledgebase.
The FREAKS Session 3.1: Repeats Session 3.2: Biased regions Miguel Andrade Johannes-Gutenberg University of Mainz of PROTEIN SEQUENCE.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 2: “Homology” Searches and Sequence Alignments.
©CMBI 2005 Exploring Protein Sequences - Part 2 Part 1: Patterns and Motifs Profiles Hydropathy Plots Transmembrane helices Antigenic Prediction Signal.
Secondary structure prediction. Amino acid sequence -> Secondary structure Alpha helix Beta strand Disordered/coil 70% accuracy 1991, 81% accuracy in.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Homology Modeling Anne Mølgaard, CBS, BioCentrum, DTU.
Protein Modules An Introduction to Bioinformatics.
Computational Biology, Part 2 Sequence Comparison with Dot Matrices Robert F. Murphy Copyright  1996, All rights reserved.
PREDICTION OF PROTEIN FEATURES Beyond protein structure (TM, signal/target peptides, coiled coils, conservation…)
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
Protein Sequence Analysis - Overview Raja Mazumder Senior Protein Scientist, PIR Assistant Professor, Department of Biochemistry and Molecular Biology.
Assessment of sequence alignment Lecture Introduction The Dot plot Matrix visualisation matching tool: – Basics of Dot plot – Examples of Dot plot.
© Wiley Publishing All Rights Reserved.
Inferring function by homology The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology.
Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)
Assessment of sequence alignment Lecture Introduction The Dot plot Matrix visualisation matching tool: – Basics of Dot plot – Examples of Dot plot.
Protein Bioinformatics Course
Protein domains. Protein domains are structural units (average 160 aa) that share: Function Folding Evolution Proteins normally are multidomain (average.
Practical session 2b Introduction to 3D Modelling and threading 9:30am-10:00am 3D modeling and threading 10:00am-10:30am Analysis of mutations in MYH6.
Protein Sequence Alignment and Database Searching.
Low-complexity and Repetitive Regions n OraLee Branch n John Wootton n NCBI n
Computational Biology, Part 3 Sequence Alignment Robert F. Murphy Copyright  1996, All rights reserved.
The FREAKS Session 3.1: Repeats Session 3.2: Biased regions Miguel Andrade Johannes-Gutenberg University of Mainz of PROTEIN SEQUENCE.
Sequence Alignment Techniques. In this presentation…… Part 1 – Searching for Sequence Similarity Part 2 – Multiple Sequence Alignment.
1 PyMOL Evolutionary Trace Viewer 1.1 Lichtarge Lab Sept. 13, 2010.
Repeats and composition bias. Repeats Frequency 14% proteins contains repeats (Marcotte et al, 1999) 1: Single amino acid repeats. 2: Longer imperfect.
BIOINFORMATICS IN BIOCHEMISTRY Bioinformatics– a field at the interface of molecular biology, computer science, and mathematics Bioinformatics focuses.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Copyright OpenHelix. No use or reproduction without express written consent1.
Motif discovery and Protein Databases Tutorial 5.
Manually Adjusting Multiple Alignments Chris Wilton.
PROTEIN PATTERN DATABASES. PROTEIN SEQUENCES SUPERFAMILY FAMILY DOMAIN MOTIF SITE RESIDUE.
Repeats and composition bias. Repeats Frequency 14% proteins contains repeats (Marcotte et al, 1999) 1: Single amino acid repeats. 2: Longer imperfect.
Point Specific Alignment Methods PSI – BLAST & PHI – BLAST.
Sequence Alignment.
Sequence Search Abhishek Niroula Department of Experimental Medical Science Lund University
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Homology 3D modeling and effect of mutations. X-ray crystallography (70,714 in PDB) need crystals Nuclear Magnetic Resonance (NMR) (9,312) proteins in.
InterPro Sandra Orchard.
Graphical representation of an amino acid or nucleic acid multiple sequence alignment developed by Tom Schneider and Mike.
Homology 3D modeling and effect of mutations Miguel Andrade Faculty of Biology, Johannes Gutenberg University Institute of Molecular Biology Mainz, Germany.
Protein domains Miguel Andrade Mainz, Germany Faculty of Biology,
Repeats and composition bias
Homology 3D modeling Miguel Andrade Mainz, Germany Faculty of Biology,
Protein 3D representation
Prediction of protein features. Beyond protein structure
Secondary structure prediction
polyQ and other homorepeats
Protein domains Miguel Andrade Mainz, Germany Faculty of Biology,
Protein 3D representation
Getting the Most out of the PDBe
Secondary structure prediction
Protein 3D representation
Protein domains Miguel Andrade Mainz, Germany Faculty of Biology,
Homology 3D modeling and effect of mutations
Dot Plots Dot Plots provide a graphic view of the amount of similarity between two sequences. The two axes represent the two sequences. In its simplest.
Protein Sequence Analysis - Overview -
BLAST.
Basic Local Alignment Search Tool
The Crystal Structure of the Human Hepatitis B Virus Capsid
Protein domains Miguel Andrade Mainz, Germany Faculty of Biology,
Protein 3D representation
Repeats and composition bias
Examining repeats with databases
Volume 95, Issue 2, Pages (October 1998)
Presentation transcript:

Repeats and composition bias Miguel Andrade Faculty of Biology, Johannes Gutenberg University Institute of Molecular Biology Mainz, Germany andrade@uni-mainz.de

Repeats

Frequency 14% proteins contains repeats (Marcotte et al, 1999) 1: Single amino acid repeats. 2: Longer imperfect tandem repeats. Assemble in structure.

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIMCHEKSPSVCSPLNMTSSVCSPAGINSVSSTTASF GSFPVHSPITQGTPLTCSPNVENRGSRSHSPAHASNVGSPLSSPLSSMKSSISSPPSHCSVKSPVSSPNNVTLRSSVSSPANINN

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIMCHEKSPSVCSPLNMTSSVCSPAGINSVSSTTASFGSFPVHSPITQGTPLTCSPNVENRGSRSHSPAHASNVGSPLSSPLSSMKSSISSPPSHCSVKSPVSSPNNVTLRSSVSSPANINN

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIM CHE KSPSVCSPLN MTSSVCSPAG INSVSSTTASF GSFPVHSPIT Q GTPLTCSPNV EN RGSRSHSPAH ASN VGSPLSSPLS S MKSSISSPPS HCS VKSPVSSPNN VT LRSSVSSPAN INN

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIM CHE KSPSVCSPLN MTSSVCSPAG INSVSSTTASF GSFPVHSPIT Q GTPLTCSPNV EN RGSRSHSPAH ASN VGSPLSSPLS S MKSSISSPPS HCS VKSPVSSPNN VT LRSSVSSPAN INN

Tandem repeats fold together

Tandem repeats fold together

Tandem repeats fold together

Tandem repeats fold together

Tandem repeats fold together

Tandem repeats fold together

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIM CHE KSPSVCSPLN MTSSVCSPAG INSVSSTTASF GSFPVHSPIT Q GTPLTCSPNV EN RGSRSHSPAH ASN VGSPLSSPLS S MKSSISSPPS HCS VKSPVSSPNN VT LRSSVSSPAN INN

http://weblogo.berkeley.edu (Vlassi et al, 2013) 15

A subunit PP2A structure PDB:1b3u Groves et al. (1999) Cell

Ap1 Clathrin Adaptor Core PDB:1w63 Heldwein et al. (2004) PNAS

Ap1 Clathrin Adaptor Core PDB:1w63 Heldwein et al. (2004) PNAS

i-TASSER model of D. melanogaster thr protein Based on PDB 4BUJ chain B

PDB 4BUJ Ski complex (yeast)

Andrade et al. (2001) J Struct Biol

Frequency repeats Fraction of proteins annotated with the keyword REPEAT in SwissProt % Archaea 27/3428 0.79 Viruses 81/8048 1.00 Bacteria 299/28438 1.05 Fungi 232/8334 2.78 Viridiplantae 153/6963 2.20 Metazoa 1538/28948 5.31 Rest of Eukaryota 92/2434 3.78 (Andrade et al 2001)

Definition CBRs Perfect repeat: QQQQQQQQQQQ Imperfect: QQQQPQQQQQQ Amino acid type: DDDDDEEEDEDEED Compositionally biased regions (CBRs) High frequency of one or two amino acids in a region. Particular case of low complexity region

Detection CBRs Sometimes straightforward. N-terminal human Huntingtin. How many CBRs can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection CBRs Sometimes straightforward. N-terminal human Huntingtin. How many CBRs can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection CBRs Sometimes straightforward. N-terminal human Huntingtin. How many CBRs can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection CBRs Sometimes straightforward. N-terminal human Huntingtin. How many CBRs can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection repeats Sometimes straightforward. N-terminal human Huntingtin. How many repeats can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection repeats Often NOT straightforward. N-terminal human Huntingtin. How many repeats can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection repeats Often NOT straightforward. N-terminal human Huntingtin. How many repeats can you find? EFQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKA CRPYLVNLLPCLTRTSKRP-EESVQETLAAAVPKIMAS NDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHS TQYFYSWLLNVLLGLLVPVEDEHSTLLILGVLLTLRYL PSAEQLVQVYELTLHHTQHQDHNVVTGALELLQQLFRT

Detection repeats Often NOT straightforward. N-terminal human Huntingtin. How many repeats can you find? EFQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKA CRPYLVNLLPCLTRTSKRP-EESVQETLAAAVPKIMAS NDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHS TQYFYSWLLNVLLGLLVPVEDEHSTLLILGVLLTLRYL PSAEQLVQVYELTLHHTQHQDHNVVTGALELLQQLFRT

Detection of repeats Dotplots Comparing a sequence against itself

Detection of repeats Dotplots TLRSSVSSPANINNS NMTSSVCSPANISV

Detection of repeats Dotplots TLRSSVSSPANINNS | 1 match NMTSSVCSPANISV

Detection of repeats Dotplots TLRSSVSSPANINNS ||| ||||| NMTSSVCSPANISV 8 matches NMTSSVCSPANISV

Detection of repeats Dotplots TLRSSVSSPANINNS | | NMTSSVCSPANISV | | 2 matches NMTSSVCSPANISV

Detection of repeats Dotplots TLRSSVSSPANINNS | 1 match NMTSSVCSPANISV

Detection of repeats Dotplots TLRSSVSSPANINNS NMTSSVCSPANISV 8

Detection of repeats Dotplots TLRSSVSSPANINNS NMTSSVCSPANISV 1821

Exercise 1

Exercise 1/4. Using Dotlet with the human mineralocorticoid receptor (MR) Go to the Dotlet web page: http://myhits.isb-sib.ch/cgi-bin/dotlet Click on the input button and paste the sequence of the human mineralocorticoid receptor (UniProt id P08235) Click on the “compute” button Try to find combinations of parameters that show patterns in the dot plot (Hint: You can adjust this finely using the arrows) Find repetitions clicking in the diagonal patterns

Exercise 1/4. Using Dotlet with the human mineralocorticoid receptor (MR)

Detection of repeats Using a multiple sequence alignment helps. Conserved repeated patterns JalView with Regular Expression searches

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches Regular Expressions: [LS]P.A matches L or S, followed by P, followed by anything, followed by A

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches Regular Expressions: [LS]P.A matches L or S, followed by P, followed by anything, followed by A Which one is not matched? LPTA, SPAA, LPPA, LPAP, SPLA

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches Regular Expressions: [LS]P.A matches L or S, followed by P, followed by anything, followed by A Which one is not matched? LPTA, SPAA, LPPA, LPAP, SPLA

Exercise 2/4. Using JalView with a MSA of the MR with orthologs Load the multiple sequence alignment of the MR in JalView: MR1_fasta.txt Use the “Select > find" (of Ctrl+F) option with a regular expression and mark all matches (click the “Find all” option!) Try to find the expression that matches more repeats. How many repeats do you see? How long are they? Would you correct the alignment based on these findings?

Vlassi et al. (2013) BMC Struct. Biol. * * * * #F1 #F2 #F3 #F4 #T8 #T9 #T10 #T11 #T12 #T13 #T14 #T15 * * * #F5 #F6 #F7 #F8 #F9 #F10 #F11 Vlassi et al. (2013) BMC Struct. Biol. 50

Mineralocorticoid receptor Repeat region AF1a ID AF1b DBD LBD 984 aa NTD 100 200 300 400 500 600 700 800 900 1000 aa Vlassi et al. (2013) BMC Struct. Biol. 51

52 52

Composition bias

Definition 14% proteins contains repeats (Marcotte et al, 1999) 1: Single amino acid repeats. 2: Longer imperfect tandem repeats. Assemble in structure.

Definition CBRs Perfect repeat: QQQQQQQQQQQ Imperfect: QQQQPQQQQQQ Amino acid type: DDDDDEEEDEDEED Compositionally biased regions (CBRs) High frequency of one or two amino acids in a region. Particular case of low complexity region

Function CBRs Conservation => Function Length, amino acid type not necessarily conserved Frequency: 1 in 3 proteins contains a compositionally biased region (Wootton, 1994), ~11% conserved (Sim and Creamer, 2004)

Function CBRs Conservation => Function Length, amino acid type not necessarily conserved Functions: Passive: linkers Active: binding, mediate protein interaction, structural integrity (Sim and Creamer, 2004)

Structure of CBRs Often variable or flexible: do not easily crystalize

1CJF: profilin bound to polyP

2IF8: Inositol Phosphate Multikinase Ipk2

2IF8: Inositol Phosphate Multikinase Ipk2 RVSETTTSGSL

2CX5: mitochondrial cytochrome c B subunit N-terminal

2CX5: mitochondrial cytochrome c B subunit N-terminal FFFFIFVFNF

Amino acid repeats Distribution is not random: Eukaryota: Most common: poly-Q, poly-N, poly-A, poly-S, poly-G Prokaryota: Most common: poly-S, poly-G, poly-A, poly-P Relatively rare: poly-Q, poly-N Very rare or absent in both eukaryota and prokaryota: Poly-I, Poly-M, Poly-W, Poly-C, Poly-Y Toxicity of long stretches of hydrophobic residues. (Faux et al 2005)

Amino acid repeats Pablo Mier

Filtering out CBRs Normally filtered out as low complexity region: they give spurious BLAST hits QQQQQQQQQQ |||||||||| QQQQQQQQQQ 10/10 id IDENTITIES IDENTITIES 10/10 id

Filtering out CBRs Normally filtered out as low complexity region: they give spurious BLAST hits QQQQQQQQQQ |||||||||| QQQQQQQQQQ Shuffle: 10/10 id IDENTITIES IDENTITIES 10/10 id

Filtering out CBRs Normally filtered out as low complexity region: they give spurious BLAST hits QQQQQQQQQQ |||||||||| QQQQQQQQQQ Shuffle: 10/10 id IDENTITIES | | SIINDIETTE Shuffle: 2/10 id

Filtering out CBRs Option for pre-BLAST treatment SEG algorithm: 1) Identify sequence regions with low information content over a sequence window 2) Merge neighbouring regions Eliminates hits against common acidic-, basic- or proline-rich regions (Wootton and Federhen, 1993)

AIR9 A particular analysis… Microtubule localization of Δx-GFP (1708 aa) Ser rich + basic conserved region LRR A9 repeats Δ1 Δ3 Δ2 Δ15 Δ9 Δ10 Δ6 Δ11 Δ12 Buschmann, et al (2006). Current Biology. Buschmann, et al (2007). Plant Signaling & Behavior Δ14 Δ16 Microtubule localization of Δx-GFP

A particular analysis… …triggers a tool

A particular analysis… …triggers BiasViz A particular analysis… http://matthuska.github.io/biasviz/ Huska, et al. (2007). Bioinformatics

A particular analysis… …triggers BiasViz A particular analysis… http://matthuska.github.io/biasviz/ Huska, et al. (2007). Bioinformatics

http://matthuska.github.io/biasviz/ ADAM15 Huska, et al. (2007). Bioinformatics

Binds SH3 of endophilin and SH3 PX1 PMID:10531379 Binds SH3 of endophilinI and SH3 PX1 PMID:10531379 Binds SH3 of Fish PMID:12615925 Binds SH3 of Grb2 PMID:11127814 Binds SH3 of Fish PMID:12615925 Binds SH3 of Fish PMID:12615925 Binds SH3 of ArgBP1/ABI2 PMID:12463424

a b 0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4 ADAM19 ADAM11 0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4 ADAM9 ADAM20 c

Allowing BiasViz2 to run Type JavaRE8 Settings on start screen and run javaRE8 Settings v (NOT: JavaRE Settings!) Go to Security Add an exception by clicking for http://matthuska.github.io/ Run “Firefox with JRE”

Exercise 3/4. Viewing CBRs in an alignment with BiasViz2 Go to the BiasViz2 web page: http://matthuska.github.io/biasviz/ Launch BiasViz2 Load the alignment little_MSA_aln.txt on the step 1 section Hit the "Go to graphical view" button Try to find combinations of parameters that reveal CBRs Try hydrophobic residues and window size 10. Remember that this is a transmembrane protein. What is this result telling you? Can you see other biased regions?

Exercise 4/4. Viewing CBRs in an alignment with BiasViz2 Launch BiasViz2 Load the alignment MR1_fasta.txt on the step 1 section Try to detect the region with repeats as a compositionally biased region using a selection of two amino acids over-represented in the repeats. Use display option “Threshold” What is the optimal window size and threshold value to represent the region with repeats?