Low-complexity and Repetitive Regions
• OraLee Branch
• John Wootton
• NCBI
Sequence Composition
• DNA sequences
  – What would be the expected number of occurrences of a particular sequence in a genome?
    Size: human genome, 6×10^9 considering both strands
    Base frequency: equal
    Sequence length: 20 nucleotides
  – Bernoulli model: expected occurrences = 6×10^9 × (1/4)^20 ≈ 0.005
  – But: (GT)n with n>10 = 10^5
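The Bernoulli estimate on this slide can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `expected_occurrences` is an illustrative choice, and it approximates the number of start positions by the genome size.

```python
def expected_occurrences(genome_size, motif_length, base_freq=0.25):
    """Expected count of one specific motif under a Bernoulli model.

    Assumes independent positions with equal base frequencies and
    approximates the number of start positions by genome_size.
    """
    return genome_size * base_freq ** motif_length


# Slide values: 6*10^9 positions (both strands), 20-nucleotide sequence.
n = expected_occurrences(6e9, 20)
print(f"{n:.4f}")  # about 0.0055, far below the ~10^5 observed for (GT)n
```

A specific 20-mer is therefore expected less than once per genome by chance, which is why the observed abundance of simple repeats is so striking.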
Low-complexity Regions
• Simple Sequence Regions (SSR)
  – MICRO- or MINISATELLITES
  – Regions that have significant biases in AA or nucleotide composition: repeats of simple motifs
  – (GT)n, (AAC)n, (P)n, (NANP)n
• Low-Complexity Regions/Segments
  – Complexity can be measured by Shannon’s entropy
  – Regarding an amino acid sequence: for each composition of a complexity state, there exists a large number of possible sequences
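To make the entropy measure concrete, here is a small sketch that computes Shannon's entropy (in bits) from the residue composition of a sequence. The function name is an illustrative choice, and the test strings are simply the motifs from the slide repeated a few times.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy in bits of the residue composition of seq."""
    n = len(seq)
    # Sum p * log2(1/p) over the observed residue types.
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())


print(shannon_entropy("PPPPPPPPPPPP"))  # single residue type: 0.0
print(shannon_entropy("GTGTGTGTGTGT"))  # two equally frequent types: 1.0
print(shannon_entropy("NANPNANPNANP"))  # N 50%, A 25%, P 25%: 1.5
```

For comparison, a window in which all 20 amino acids are equally frequent reaches the maximum of log2(20), about 4.32 bits.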
Low-Complexity Regions
• Locally abundant residues may be
  – continuous or loosely clustered
  – irregular or aperiodic
• >25% of AA in currently sequenced genomes is in LC regions
  – non-globular domains
  – SSR
• Examples: myosins, pilins, segments in antigens, short subsequences of residues with unknown function
  – Beta-pleated sheets
  – Alpha helices
  – Coiled-coils
Detecting Low-Complexity
• SEG and PSEG/NSEG algorithms
  – Wootton and Federhen, Methods in Enzymology 266:33 (1996); Computers and Chemistry 17:149 (1993)
• SEG
  – UNIX executable available on NCBI servers
  – seg FASTAfile Window TriggerComplexity Extension, where TriggerComplexity is K2(1) and Extension is K2(2)
  – Longer window lengths define more sustained regions, but overlook short biased subsequences
clobber> seg hu.piron.fa
>gi|730388|sp|P40250|PRIO_CERAE MAJOR PRION PROTEIN PRECURSOR (PRP)
                                  1-49  MANLGCWMLVVFVATWSDLGLCKKRPKPGG
                                        WNTGGSRYPGQGSPGGNRY
ppqggggwgqphgggwgqphgggwgqphgg
gwgqggg
                                        THNQWHKPSKPKTSMKHM
agaaaagavvgglggymlgsams
                                        RPLIHFGNDYEDRYYRENMYRYPNQVYYRP
                                        VDQYSNQNNFVHDCVNITIKQH
tvttttkgenftet
                                        DVKMMERVVEQMCITQYEKESQAYYQRGSS
                                        MVLFS
sppvillisflifliv
                                        G

clobber> seg hu.piron.fa -l
>gi|730388|sp|P40250|PRIO_CERAE(50-86) complexity=1.90 (12/2.20/2.50)
ppqggggwgqphgggwgqphgggwgqphgggwgqggg
>gi|730388|sp|P40250|PRIO_CERAE( ) complexity=2.47 (12/2.20/2.50)
agaaaagavvgglggymlgsams
>gi|730388|sp|P40250|PRIO_CERAE( ) complexity=2.26 (12/2.20/2.50)
tvttttkgenftet
>gi|730388|sp|P40250|PRIO_CERAE( ) complexity=2.50 (12/2.20/2.50)
sppvillisflifliv
SEG on the prion protein with different window lengths
• Question-based
  – exploratory tool
  – optimization step
Detecting Low-Complexity
• Intuitive explanation: take a 20-residue-long sequence
  –( ) –( ) –( )
• Complexity can be described by Shannon’s entropy (K2)
• Regarding an amino acid sequence: for each composition of a complexity state, there exists a large number of possible sequences (K1)
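The "large number of possible sequences" for a given composition is a multinomial coefficient, and the K1 measure of Wootton and Federhen is its logarithm per residue. The sketch below assumes that definition; the three count vectors are hypothetical 20-residue examples chosen for illustration, not the ones shown on the original slide.

```python
import math


def sequences_per_composition(counts):
    """Number of distinct sequences sharing one composition: L! / prod(n_i!)."""
    omega = math.factorial(sum(counts))
    for c in counts:
        omega //= math.factorial(c)  # exact at every step
    return omega


def k1(counts, alphabet_size=20):
    """Per-residue log (base = alphabet size) of the sequence count."""
    return math.log(sequences_per_composition(counts), alphabet_size) / sum(counts)


# Hypothetical compositions of a 20-residue window (counts per residue type).
for counts in ([20], [10, 10], [1] * 20):
    print(counts, sequences_per_composition(counts), round(k1(counts), 3))
```

A homopolymer window admits exactly one sequence, two residue types at 10 each admit 184756, and 20 different residues admit 20! arrangements, so the count grows very quickly with compositional diversity.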
How SEG works
• seg FASTAfile Window TriggerComplexity Extension, where TriggerComplexity is K2(1) and Extension is K2(2)
• Looks within the window length: if complexity is at or below K2(1), the window triggers, and the segment is then extended for as long as complexity stays at or below K2(2)
• Uniform prior probabilities
  – The protein sequence database is a heterogeneous statistical mixture, so the initially unknown AA frequencies in low-complexity subsets need have no similarity to the frequencies in the total database
  – Unbiased view of low-complexity regions
  – Gives equiprobable compositions for any complexity state
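The trigger-and-extend idea can be sketched in a few lines. This is a simplified illustration, not the SEG program itself: it omits SEG's final step of trimming each raw segment to its most improbable subsequence, and the function names are made up for this example. The defaults 12/2.2/2.5 are the parameters shown in the seg output earlier in these slides.

```python
import math
from collections import Counter


def k2(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return sum((c / n) * math.log2(n / c) for c in Counter(window).values())


def low_complexity_segments(seq, window=12, trigger=2.2, extension=2.5):
    """Return half-open, 0-based (start, end) raw low-complexity segments."""
    seq = seq.upper()
    if len(seq) < window:
        return []
    # Entropy of every window, sliding in single-residue steps.
    ent = [k2(seq[i:i + window]) for i in range(len(seq) - window + 1)]
    segments = []
    i = 0
    while i < len(ent):
        if ent[i] <= trigger:
            # Extend over neighbouring windows that pass the looser cutoff.
            lo = hi = i
            while lo > 0 and ent[lo - 1] <= extension:
                lo -= 1
            while hi + 1 < len(ent) and ent[hi + 1] <= extension:
                hi += 1
            start, end = lo, hi + window
            if segments and start <= segments[-1][1]:
                # Merge with the previous segment when they touch or overlap.
                segments[-1] = (segments[-1][0], max(end, segments[-1][1]))
            else:
                segments.append((start, end))
            i = hi + 1
        else:
            i += 1
    return segments


diverse = "ACDEFGHIKLMNPQRSTVWY"
print(low_complexity_segments(diverse))                      # no segment
print(low_complexity_segments(diverse + "G" * 20 + diverse))  # one segment around the G run
```

Because the extension cutoff is looser than the trigger, the reported segment reaches a few residues into the flanking sequence on each side of the poly-G run.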
How SEG works, continued
• How do you correct for the background AA/nucleotide composition bias?
  – After randomly shuffling all the residues, determine the trigger complexity that results in 4% of the database being within low-complexity regions
  – Then use this trigger complexity and subtract 4% from the %AA in low-complexity regions
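The shuffling calibration could be sketched as follows. This is only an illustration under simplifying assumptions: it counts residues covered by trigger windows without any extension step, scans the trigger cutoff on a fixed grid, and all function names are hypothetical rather than taken from SEG or any related tool.

```python
import math
import random
from collections import Counter


def window_entropy(window):
    n = len(window)
    return sum((c / n) * math.log2(n / c) for c in Counter(window).values())


def shuffle_database(seqs, seed=0):
    """Pool all residues, shuffle them, and cut back into the original lengths."""
    pool = [r for s in seqs for r in s]
    random.Random(seed).shuffle(pool)
    out, pos = [], 0
    for s in seqs:
        out.append("".join(pool[pos:pos + len(s)]))
        pos += len(s)
    return out


def fraction_low_complexity(seqs, trigger, window=12):
    """Fraction of residues covered by windows with entropy <= trigger."""
    covered = total = 0
    for s in seqs:
        total += len(s)
        mask = [False] * len(s)
        for i in range(len(s) - window + 1):
            if window_entropy(s[i:i + window]) <= trigger:
                mask[i:i + window] = [True] * window
        covered += sum(mask)
    return covered / total if total else 0.0


def calibrate_trigger(seqs, target=0.04, window=12, step=0.05):
    """Largest grid value of the trigger that keeps the shuffled database at or below target."""
    shuffled = shuffle_database(seqs)
    best = trigger = 0.0
    while trigger <= math.log2(20):
        if fraction_low_complexity(shuffled, trigger, window) > target:
            break
        best = trigger
        trigger += step
    return best
```

With the calibrated cutoff, the corrected estimate for the real database would be `fraction_low_complexity(seqs, cutoff) - 0.04`, matching the subtraction described on the slide.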
Detecting Low-complexity with repetitive motif: SSR
• PSEG or NSEG
• Repetition of residue types or k-grams
• Period 3
  (n V E n K N n V D n K D n V N n K S n K)
  (n m i n m i n m i n m i n m i n m i n m)
  (n m E n m N n m D n m D n m N n m S n m)
• Sliding window along sequence in single residue steps
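One way to see why periodicity matters is to split a sequence into its phases and measure the complexity of each phase separately. The sketch below illustrates that idea only; it is not the PSEG algorithm, and the example sequence is a hypothetical one in which every third residue is N.

```python
import math
from collections import Counter


def entropy(residues):
    n = len(residues)
    return sum((c / n) * math.log2(n / c) for c in Counter(residues).values())


def phase_entropies(seq, period):
    """Entropy of the residues found at each phase of the given period."""
    return [entropy(seq[phase::period]) for phase in range(period)]


seq = "NVENKQNVDNKSNADNRT"  # hypothetical: N at positions 0, 3, 6, ...
print(round(entropy(seq), 2))  # whole sequence looks moderately complex
print([round(h, 2) for h in phase_entropies(seq, 3)])  # phase 0 has entropy 0.0
```

The perfectly conserved phase is invisible to a plain composition measure over the whole sequence, which is the situation a period-aware scan is meant to catch.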
Evolutionary Mechanisms
• Evolution of sequences in general
  – Evolution rate of
  – Base pair substitution (10^-9)
  – Insertions/deletions
  – Recombination
• In SSR and low-complexity regions, mutations are in length
  – with steps typically +/- one repeat unit
  – Evolution rate
  – Biased nucleotide substitution due to increased recombination in repetitive regions
  – Unequal crossing over (recombination)
  – Replication slippage
• Alignment of repeats does not imply relationships/ancestry
Low-Complexity and BLAST searches
• Low-complexity regions cause BLAST search results to be dominated by low-complexity matches
  – biased AA/nucleotide composition
• BLAST added “mask low-complexity” by default
  – Seg parameters:
• BLAST now also uses a compositional bias filter on the whole database
  – Masks compositionally biased regions using seg
• YOU MAY WANT TO TURN THESE OPTIONS OFF and use your own organism-specific seg parameters when doing protein homology searching
• YOU WILL NEED TO TURN THESE OPTIONS OFF if you are interested in looking at sequence similarities of repetitive/low-complexity regions
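As one possible way to act on this advice from a script, the sketch below assembles a command line for the NCBI BLAST+ `blastp` program, which is an assumption here since the slides do not say which BLAST interface is meant. The `-seg` option takes `yes`, `no`, or a quoted "window locut hicut" triple, and `-comp_based_stats 0` switches off composition-based statistics. The file and database names are hypothetical, and the function only builds the argument list without running anything.

```python
def blastp_command(query, db, seg="no", comp_based_stats="0"):
    """Build a blastp argument list with explicit low-complexity settings.

    seg may be "yes", "no", or custom parameters such as "12 2.2 2.5".
    """
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-seg", seg,
        "-comp_based_stats", comp_based_stats,
    ]


# Filtering off entirely, e.g. to compare repetitive regions themselves.
print(blastp_command("msp1.fa", "pfal_proteins"))
# Custom, organism-specific SEG parameters (values here are only placeholders).
print(blastp_command("msp1.fa", "pfal_proteins", seg="12 2.2 2.5"))
```

The resulting list can be passed to `subprocess.run` on a machine where BLAST+ is installed.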
Example: Plasmodium falciparum
• Using whole genome sequences is important to limit PCR sequencing bias for antigens: hydrophilic proteins
• Considering GC-content / AA bias
  – P. falciparum is approximately 28% GC
• Visualization of individual proteins
A helpful tool here and in general
• SEALS: A System for Easy Analysis of Lots of Sequences, R. Walker and E. Koonin, NCBI
• CBBresearch/Walker/SEALS/index.html
• Demonstrate getting an appropriate data set
  – Taxnode2gi, gi2fasta
  – Daffy
  – Purge
  – Gref
  – Fanot
• Use cleaned data set of P. falciparum proteins
Protein Analysis
• Setting the trigger complexity:
  – Dbcomp
  – Shuffledb
  – Seg
• Run SEG on P. falciparum MSP1, PfEMP2, Cg2
  – Options: -p (tree form output), -l (only report Low-C segs), -h (don’t report Low-C segs), -x (substitute Low-C with x)
• Run PSEG on P. falciparum MSP1, PfEMP2, Cg2 with different -z (periodicity)
Usefulness of studying Low-Complexity
• Within a protein
  – secondary structure
  – homology searches
  – protein location
  – genetic disorders
• Within taxa
  – microsatellite markers
  – polymorphism comparisons between proteins
• Between taxa
  – synteny, orthologs
  – different selection pressures upon different organisms
  – parasites: immunogenicity, rapid evolution of antigens, recombination