Repeats and composition bias. Repeats Frequency 14% proteins contains repeats (Marcotte et al, 1999) 1: Single amino acid repeats. 2: Longer imperfect.

Slides:

Advertisements

Similar presentations

1 Genome information GenBank (Entrez nucleotide) Species-specific databases Protein sequence GenBank (Entrez protein) UniProtKB (SwissProt) Protein structure.

Advertisements

Integration of Protein Family, Function, Structure Rich Links to >90 Databases Value-Added Reports for UniProtKB Proteins iProClass Protein Knowledgebase.

Measuring the degree of similarity: PAM and blosum Matrix

DNA sequences alignment measurement

The FREAKS Session 3.1: Repeats Session 3.2: Biased regions Miguel Andrade Johannes-Gutenberg University of Mainz of PROTEIN SEQUENCE.

Bioinformatics Unit 1: Data Bases and Alignments Lecture 2: “Homology” Searches and Sequence Alignments.

©CMBI 2005 Exploring Protein Sequences - Part 2 Part 1: Patterns and Motifs Profiles Hydropathy Plots Transmembrane helices Antigenic Prediction Signal.

Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.

Secondary structure prediction. Amino acid sequence -> Secondary structure Alpha helix Beta strand Disordered/coil 70% accuracy 1991, 81% accuracy in.

Expect value Expect value (E-value) Expected number of hits, of equivalent or better score, found by random chance in a database of the size.

Sequence Comparison Intragenic - self to self. -find internal repeating units. Intergenic -compare two different sequences. Dotplot - visual alignment.

Protein Modules An Introduction to Bioinformatics.

Computational Biology, Part 2 Sequence Comparison with Dot Matrices Robert F. Murphy Copyright  1996, All rights reserved.

PREDICTION OF PROTEIN FEATURES Beyond protein structure (TM, signal/target peptides, coiled coils, conservation…)

Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.

Sequencing a genome and Basic Sequence Alignment

Assessment of sequence alignment Lecture Introduction The Dot plot Matrix visualisation matching tool: – Basics of Dot plot – Examples of Dot plot.

© Wiley Publishing All Rights Reserved.

Inferring function by homology The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology.

Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)

Assessment of sequence alignment Lecture Introduction The Dot plot Matrix visualisation matching tool: – Basics of Dot plot – Examples of Dot plot.

Protein Bioinformatics Course

Protein domains. Protein domains are structural units (average 160 aa) that share: Function Folding Evolution Proteins normally are multidomain (average.

Multiple Sequence Alignment School of B&I TCD May 2010.

Protein 3D-structure analysis Exercises. Practicals Find update frequency for RCSB PDB: weekly. When was the last update? How many protein structures.

Protein Sequence Alignment and Database Searching.

Low-complexity and Repetitive Regions n OraLee Branch n John Wootton n NCBI n

Computational Biology, Part 3 Sequence Alignment Robert F. Murphy Copyright  1996, All rights reserved.

The FREAKS Session 3.1: Repeats Session 3.2: Biased regions Miguel Andrade Johannes-Gutenberg University of Mainz of PROTEIN SEQUENCE.

Sequence Alignment Techniques. In this presentation…… Part 1 – Searching for Sequence Similarity Part 2 – Multiple Sequence Alignment.

Tweaking BLAST Although you normally see BLAST as a web page with boxes to place data in and tick boxes, etc., it is actually a command line program that.

Sequence analysis: Macromolecular motif recognition Sylvia Nagl.

BIOINFORMATICS IN BIOCHEMISTRY Bioinformatics– a field at the interface of molecular biology, computer science, and mathematics Bioinformatics focuses.

Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.

SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Sequence-based Similarity Module (BLAST & CDD only ) & Horizontal Gene Transfer Module (Ortholog Neighborhood & GC content only)

1 P6a Extra Discussion Slides Part 1. 2 Section A.

HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Basic terms:  Similarity - measurable quantity. Similarity- applied to proteins using concept of conservative substitutions Similarity- applied to proteins.

Copyright OpenHelix. No use or reproduction without express written consent1.

Protein Sequence Analysis - Overview - NIH Proteomics Workshop 2007 Raja Mazumder Scientific Coordinator, PIR Research Assistant Professor, Department.

Motif discovery and Protein Databases Tutorial 5.

PROTEIN PATTERN DATABASES. PROTEIN SEQUENCES SUPERFAMILY FAMILY DOMAIN MOTIF SITE RESIDUE.

Repeats and composition bias. Repeats Frequency 14% proteins contains repeats (Marcotte et al, 1999) 1: Single amino acid repeats. 2: Longer imperfect.

Point Specific Alignment Methods PSI – BLAST & PHI – BLAST.

Sequence Alignment.

Tweaking BLAST Although you normally see BLAST as a web page with boxes to place data in and tick boxes, etc., it is actually a command line program that.

Sequence Search Abhishek Niroula Department of Experimental Medical Science Lund University

V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.

V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.

Homology 3D modeling and effect of mutations. X-ray crystallography (70,714 in PDB) need crystals Nuclear Magnetic Resonance (NMR) (9,312) proteins in.

InterPro Sandra Orchard.

Graphical representation of an amino acid or nucleic acid multiple sequence alignment developed by Tom Schneider and Mike.

Protein domains Miguel Andrade Mainz, Germany Faculty of Biology,

Repeats and composition bias

Homology 3D modeling Miguel Andrade Mainz, Germany Faculty of Biology,

Prediction of protein features. Beyond protein structure

polyQ and other homorepeats

Protein domains Miguel Andrade Mainz, Germany Faculty of Biology,

Repeats and composition bias

Secondary structure prediction

Lab 8.3: RNA Secondary Structure

Protein domains Miguel Andrade Mainz, Germany Faculty of Biology,

Homology 3D modeling and effect of mutations

Protein Bioinformatics Course

Dot Plots Dot Plots provide a graphic view of the amount of similarity between two sequences. The two axes represent the two sequences. In its simplest.

Protein domains Miguel Andrade Mainz, Germany Faculty of Biology,

Repeats and composition bias

Examining repeats with databases

Presentation transcript:

Repeats and composition bias

Repeats

Frequency 14% proteins contains repeats (Marcotte et al, 1999) 1: Single amino acid repeats. 2: Longer imperfect tandem repeats. Assemble in structure.

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIMCHEKSPSVCSPLNMTSSVCSPAGINSVSSTTASF GSFPVHSPITQGTPLTCSPNVENRGSRSHSPAHASNVGSPLSSP LSSMKSSISSPPSHCSVKSPVSSPNNVTLRSSVSSPANINN

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIMCHEKSPSVCSPLNMTSSVCSPAGINSVSSTTASF GSFPVHSPITQGTPLTCSPNVENRGSRSHSPAHASNVGSPLSSP LSSMKSSISSPPSHCSVKSPVSSPNNVTLRSSVSSPANINN

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIM CHE KSPSVCSPLN MTSSVCSPAG INSVSSTTASF GSFPVHSPIT Q GTPLTCSPNV EN RGSRSHSPAH ASN VGSPLSSPLS S MKSSISSPPS HCS VKSPVSSPNN VT LRSSVSSPAN INN

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIM CHE KSPSVCSPLN MTSSVCSPAG INSVSSTTASF GSFPVHSPIT Q GTPLTCSPNV EN RGSRSHSPAH ASN VGSPLSSPLS S MKSSISSPPS HCS VKSPVSSPNN VT LRSSVSSPAN INN

Tandem repeats fold together

Definition repeats Sequence, long, imperfect, tandem MRAVVKSPIM CHE KSPSVCSPLN MTSSVCSPAG INSVSSTTASF GSFPVHSPIT Q GTPLTCSPNV EN RGSRSHSPAH ASN VGSPLSSPLS S MKSSISSPPS HCS VKSPVSSPNN VT LRSSVSSPAN INN

(Vlassi et al, 2013)

Andrade et al. (2001) J Struct Biol

Definition CBRs Perfect repeat: QQQQQQQQQQQ Imperfect: QQQQPQQQQQQ Amino acid type: DDDDDEEEDEDEED Compositionally biased regions (CBRs) High frequency of one or two amino acids in a region. Particular case of low complexity region

Detection CBRs Sometimes straightforward. N-terminal human Huntingtin. How many CBRs can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection CBRs Sometimes straightforward. N-terminal human Huntingtin. How many CBRs can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection CBRs Sometimes straightforward. N-terminal human Huntingtin. How many CBRs can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection CBRs Sometimes straightforward. N-terminal human Huntingtin. How many CBRs can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection repeats Sometimes straightforward. N-terminal human Huntingtin. How many repeats can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection repeats Often NOT straightforward. N-terminal human Huntingtin. How many repeats can you find? >sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQP LLPQPQPPPPPPPPPPGPAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNSPE FQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKALMDSNLPRLQLELYKEIKKNGAP RSLRAALWRFAELAHLVRPQKCRPYLVNLLPCLTRTSKRPEESVQETLAAAVPKIMASFG NFANDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHSRRTQYFYSWLLNVLLGLLV PVEDEHSTLLILGVLLTLRYLVPLLQQQVKDTSLKGSFGVTRKEMEVSPSAEQLVQVYEL TLHHTQHQDHNVVTGALELLQQLFRTPPPELLQTLTAVGGIGQLTAAKEESGGRSRSGSI VELIAGGGSSCSPVLSRKQKGKVLLGEEEALEDDSESRSDVSSSALTASVKDEISGELAA SSGVSTPGSAGHDIITEQPRSQHTLQADSVDLASCDLTSSATDGDEEDILSHSSSQVSAV PSDPAMDLNDGTQASSPISDSSQTTTEGPDSAVTPSDSSEIVLDGTDNQYLGLQIGQPQD EDEEATGILPDEASEAFRNSSMALQQAHLLKNMSHCRQPSDSSVDKFVLRDEATEPGDQE NKPCRIKGDIGQSTDDDSAPLVHCVRLLSASFLLTGGKNVLVPDRDVRVSVKALALSCVG AAVALHPESFFSKLYKVPLDTTEYPEEQYVSDILNYIDHGDPQVRGATAILCGTLICSIL

Detection repeats Often NOT straightforward. N-terminal human Huntingtin. How many repeats can you find? EFQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKA CRPYLVNLLPCLTRTSKRP-EESVQETLAAAVPKIMAS NDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHS TQYFYSWLLNVLLGLLVPVEDEHSTLLILGVLLTLRYL PSAEQLVQVYELTLHHTQHQDHNVVTGALELLQQLFRT

Detection repeats Often NOT straightforward. N-terminal human Huntingtin. How many repeats can you find? EFQKLLGIAMELFLLCSDDAESDVRMVADECLNKVIKA CRPYLVNLLPCLTRTSKRP-EESVQETLAAAVPKIMAS NDNEIKVLLKAFIANLKSSSPTIRRTAAGSAVSICQHS TQYFYSWLLNVLLGLLVPVEDEHSTLLILGVLLTLRYL PSAEQLVQVYELTLHHTQHQDHNVVTGALELLQQLFRT

Repeats

Frequency repeats Fraction of proteins annotated with the keyword REPEAT in SwissProt % Archaea 27/ Viruses 81/ Bacteria 299/ Fungi 232/ Viridiplantae 153/ Metazoa 1538/ Rest of Eukaryota 92/ (Andrade et al 2001)

Detection of repeats Dotplots Comparing a sequence against itself

Detection of repeats Dotplots TLRSSVSSPANINNS NMTSSVCSPANISV

Detection of repeats Dotplots TLRSSVSSPANINNS NMTSSVCSPANISV | 1 match

Detection of repeats Dotplots TLRSSVSSPANINNS NMTSSVCSPANISV ||| ||||| 8 matches

Detection of repeats Dotplots TLRSSVSSPANINNS NMTSSVCSPANISV | | 2 matches

Detection of repeats Dotplots TLRSSVSSPANINNS NMTSSVCSPANISV | 1 match

Detection of repeats Dotplots TLRSSVSSPANINNS NMTSSVCSPANISV 8

Detection of repeats Dotplots TLRSSVSSPANINNS NMTSSVCSPANISV 1821

Exercise 1

Go to the Dotlet web page: sib.ch/cgi-bin/dotlethttp://myhits.isb- sib.ch/cgi-bin/dotlet Click on the input button and paste the sequence of the human mineralocorticoid receptor (UniProt id P08235) Click on the “compute” button Try to find combinations of parameters that show patterns in the dot plot (Hint: You can adjust this finely using the arrows) Find repetitions clicking in the diagonal patterns Exercise 1/3. Using Dotlet with the human mineralocorticoid receptor (MR)

Detection of repeats Using a multiple sequence alignment helps. Conserved repeated patterns JalView with Regular Expression searches

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches Regular Expressions: [LS]P.A matches L or S, followed by P, followed by anything, followed by A

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches Regular Expressions: [LS]P.A matches L or S, followed by P, followed by anything, followed by A Which one is not matched? LPTA, SPAA, LPPA, LPAP, SPLA

Detection of repeats Using a multiple sequence alignment helps Conserved repeated patterns JalView with Regular Expression searches Regular Expressions: [LS]P.A matches L or S, followed by P, followed by anything, followed by A Which one is not matched? LPTA, SPAA, LPPA, LPAP, SPLA

Load the multiple sequence alignment of the MR in JalView: MR1_fasta.txt Use the “Select > find" (of Ctrl+F) option with a regular expression and mark all matches (click the “Find all” option!) Try to find the expression that matches more repeats. How many repeats do you see? How long are they? Would you correct the alignment based on these findings? Exercise 2/3. Using JalView with a MSA of the MR with orthologs

#T1 #T13#T12#T11#T10#T9#T8 #T7 #T2#T3#T4#T5#T6 #F1 #F2#F3 #F4 #F5#F10#F9#F8#F7#F6 #T14 #T15 #F11 *** **** (Vlassi et al, 2013)

Composition bias

Definition 14% proteins contains repeats (Marcotte et al, 1999) 1: Single amino acid repeats. 2: Longer imperfect tandem repeats. Assemble in structure.

Definition CBRs Perfect repeat: QQQQQQQQQQQ Imperfect: QQQQPQQQQQQ Amino acid type: DDDDDEEEDEDEED Compositionally biased regions (CBRs) High frequency of one or two amino acids in a region. Particular case of low complexity region

Conservation => Function Length, amino acid type not necessarily conserved Frequency: 1 in 3 proteins contains a compositionally biased region (Wootton, 1994), ~11% conserved (Sim and Creamer, 2004) Function CBRs

Conservation => Function Length, amino acid type not necessarily conserved Functions: Passive: linkers Active: binding, mediate protein interaction, structural integrity (Sim and Creamer, 2004)

Structure of CBRs Often variable or flexible: do not easily crystalize

1CJF: profilin bound to polyP

2IF8: Inositol Phosphate Multikinase Ipk2

RVSETTTSGSL

2CX5: mitochondrial cytochrome c B subunit N-terminal

2CX5: mitochondrial cytochrome c B subunit N-terminal FFFFIFVFNF

Types of CBRs More than 6 aa in length, 1.4% of all, 87% of them in Euk (Faux et al 2005)

Types of CBRs (Faux et al 2005) Distribution is not random: Eukaryota: Most common: poly-Q, poly-N, poly-A, poly-S, poly-G Prokaryota: Most common: poly-S, poly-G, poly-A, poly-P Relatively rare: poly-Q, poly-N Very rare or absent in both eukaryota and prokaryota: Poly-I, Poly-M, Poly-W, Poly-C, Poly-Y Toxicity of long stretches of hydrophobic residues.

Filtering out CBRs Normally filtered out as low complexity region: they give spurious BLAST hits QQQQQQQQQQ |||||||||| QQQQQQQQQQ 10/10 id IDENTITIES |||||||||| IDENTITIES 10/10 id

Filtering out CBRs Normally filtered out as low complexity region: they give spurious BLAST hits QQQQQQQQQQ |||||||||| QQQQQQQQQQ Shuffle: 10/10 id IDENTITIES |||||||||| IDENTITIES 10/10 id

Filtering out CBRs Normally filtered out as low complexity region: they give spurious BLAST hits QQQQQQQQQQ |||||||||| QQQQQQQQQQ Shuffle: 10/10 id IDENTITIES | | SIINDIETTE Shuffle: 2/10 id

Filtering out CBRs Option for pre-BLAST treatment SEG algorithm: 1) Identify sequence regions with low information content over a sequence window 2) Merge neighbouring regions Eliminates hits against common acidic-, basic- or proline-rich regions (Wootton and Federhen, 1993)

A particular analysis… AIR9 (1708 aa) Ser rich + basic LRR A9 repeats conserved region Δ1Δ1 Δ15 Δ9Δ9 Δ12 Δ14 Δ10 Δ11 Δ16 Δ3Δ3 Δ2Δ2 Δ6Δ6 Buschmann, et al (2006). Current Biology. Buschmann, et al (2007). Plant Signaling & Behavior Microtubule localization of Δx-GFP

…triggers a tool A particular analysis…

Huska, et al. (2007). Bioinformatics …triggers BiasViz A particular analysis…

…triggers BiasViz Huska, et al. (2007). Bioinformatics A particular analysis…

ADAM15 Huska, et al. (2007). Bioinformatics

Binds SH3 of endophilin and SH3 PX1 PMID: Binds SH3 of endophilinI and SH3 PX1 PMID: Binds SH3 of Fish PMID: Binds SH3 of Grb2 PMID: Binds SH3 of Fish PMID: Binds SH3 of ArgBP1/ABI2 PMID:

ADAM19 ADAM9 ADAM11 ADAM20 a b c

Go to the BiasViz2 web page: Launch BiasViz2 Load the alignment little_MSA_fasta.txt on the step 1 section Hit the "Go to graphical view" button Try to find combinations of parameters that reveal CBRs Try hydrophobic residues and window size 10. Remember that this is a transmembrane protein. What is this result telling you? Can you see other biased regions? Exercise 3/3. Viewing CBRs in an alignment with BiasViz2

Allowing BiasViz2 to run Type JavaRE8 Settings on start screen and run javaRE8 Settings v Go to Sicherheit Add an exception for Restart Firefox mit JRE Exercise 3/3. Viewing CBRs in an alignment with BiasViz2