Low-complexity and Repetitive Regions
• OraLee Branch
• John Wootton
• NCBI

Sequence Composition
• DNA Sequences
  – What would be the expected number of occurrences of a particular sequence in a genome?
      Size: human genome, 6 × 10^9 bases considering both strands
      Base frequency: equal
      Sequence length: 20 nucleotides
  – Bernoulli model: 6 × 10^9 × (1/4)^20 ≈ 0.005
  – But: (GT)n with n > 10: 10^5 occurrences
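The Bernoulli estimate on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name and its parameters are illustrative, and it uses the genome size as the number of possible start positions, ignoring the small correction for motif length.

```python
def expected_occurrences(genome_size: float, motif_length: int,
                         base_freq: float = 0.25) -> float:
    """Expected count of one specific motif when every base is drawn
    independently with the same probability (Bernoulli model)."""
    return genome_size * base_freq ** motif_length


# Values from the slide: 6 * 10^9 positions (both strands), 20-mer, equal base frequencies.
expected_20mer = expected_occurrences(6e9, 20)
print(f"expected occurrences of a given 20-mer: {expected_20mer:.4f}")
```

A specific 20-mer is therefore expected far less than once per genome, which is why roughly 10^5 occurrences of (GT)n tracts signal strongly non-random composition.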

Low-complexity Regions
• Simple Sequence Regions (SSR)
  – Micro- or minisatellites
  – Regions that have significant biases in amino acid (AA) or nucleotide composition: repeats of simple motifs
  – (GT)n, (AAC)n, (P)n, (NANP)n
• Low-Complexity Regions/Segments
  – Complexity can be measured by Shannon's entropy
  – Regarding an amino acid sequence: for each composition of a complexity state, there exists a large number of possible sequences
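As a rough illustration of measuring complexity with Shannon's entropy, the sketch below computes the entropy (in bits) of the residue composition of a segment. The function name is an assumption made for this example; it is not taken from SEG.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits) of the residue composition of a segment."""
    if not segment:
        raise ValueError("segment must not be empty")
    length = len(segment)
    counts = Counter(segment.upper())
    return sum(-(n / length) * math.log2(n / length) for n in counts.values())


# Simple repeats score low; a segment using many residue types evenly scores high.
for motif in ("GTGTGTGTGTGT", "NANPNANPNANP", "ACDEFGHIKLMN"):
    print(motif, round(shannon_entropy(motif), 3))
```

A homopolymer such as (P)n has entropy 0, while twelve distinct residues give log2(12) bits, the maximum for a 12-residue window.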

Low-Complexity Regions
• Locally abundant residues may be
  – continuous or loosely clustered
  – irregular or aperiodic
• >25% of AA in currently sequenced genomes is in LC regions
  – non-globular domains, SSR
• Examples: myosins, pilins, segments in antigens, short subsequences of residues with unknown function
  – Beta-pleated sheets
  – Alpha helices
  – Coiled-coils

Detecting Low-Complexity
• SEG and PSEG/NSEG algorithms
  – Wootton and Federhen, Methods in Enzymology 266:33 (1996); Computers and Chemistry 17:149 (1993)
• SEG
  – UNIX executable available on NCBI servers
      seg FASTAfile Window TriggerComplexity Extension
      (TriggerComplexity = K2(1), Extension = K2(2))
  – Longer window lengths define more sustained regions, but overlook short biased subsequences

clobber> seg hu.piron.fa
>gi|730388|sp|P40250|PRIO_CERAE MAJOR PRION PROTEIN PRECURSOR (PRP)
1-49 MANLGCWMLVVFVATWSDLGLCKKRPKPGG WNTGGSRYPGQGSPGGNRY
ppqggggwgqphgggwgqphgggwgqphgg gwgqggg
THNQWHKPSKPKTSMKHM
agaaaagavvgglggymlgsams
RPLIHFGNDYEDRYYRENMYRYPNQVYYRP VDQYSNQNNFVHDCVNITIKQH
tvttttkgenftet
DVKMMERVVEQMCITQYEKESQAYYQRGSS MVLFS
sppvillisflifliv
G

clobber> seg hu.piron.fa -l
>gi|730388|sp|P40250|PRIO_CERAE(50-86) complexity=1.90 (12/2.20/2.50)
ppqggggwgqphgggwgqphgggwgqphgggwgqggg
>gi|730388|sp|P40250|PRIO_CERAE( ) complexity=2.47 (12/2.20/2.50)
agaaaagavvgglggymlgsams
>gi|730388|sp|P40250|PRIO_CERAE( ) complexity=2.26 (12/2.20/2.50)
tvttttkgenftet
>gi|730388|sp|P40250|PRIO_CERAE( ) complexity=2.50 (12/2.20/2.50)
sppvillisflifliv

SEG on the prion protein with different window lengths
• Question-based
  – exploratory tool
  – optimization step

Detecting Low-Complexity
  – Intuitive explanation: take a 20-residue long sequence
      –( )
      –( )
      –( )
  – Complexity can be described by Shannon's entropy (K2)
  – Regarding an amino acid sequence: for each composition of a complexity state, there exists a large number of possible sequences (K1)
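The two measures on this slide can be sketched for a 20-residue window as follows. The definitions used here are assumptions based on the usual reading of Wootton and Federhen (1993): K2 is the Shannon entropy of the residue counts, and K1 is (1/L) log_N of the multinomial count L!/∏ n_i!, with N = 20 residue types. Function names are illustrative, and zero counts are dropped from the complexity state.

```python
import math
from collections import Counter


def complexity_state(window: str) -> tuple:
    """Residue counts sorted largest first; residue identities are discarded."""
    return tuple(sorted(Counter(window.upper()).values(), reverse=True))


def sequences_per_composition(state) -> int:
    """Multinomial L! / prod(n_i!): distinct sequences sharing one composition."""
    omega = math.factorial(sum(state))
    for n in state:
        omega //= math.factorial(n)
    return omega


def k1(state, alphabet_size: int = 20) -> float:
    """Assumed K1: log (base alphabet_size) of the sequence count, per residue."""
    return math.log(sequences_per_composition(state), alphabet_size) / sum(state)


def k2(state) -> float:
    """Assumed K2: Shannon entropy (bits) of the count vector."""
    length = sum(state)
    return sum(-(n / length) * math.log2(n / length) for n in state)


for window in ("A" * 20, "AG" * 10, "ACDEFGHIKLMNPQRSTVWY"):
    state = complexity_state(window)
    print(state, sequences_per_composition(state), round(k1(state), 3), round(k2(state), 3))
```

The three windows span the range from a single possible sequence (one residue type) to 20! possible sequences (every residue type once).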

How SEG works
• seg FASTAfile Window TriggerComplexity Extension
  (TriggerComplexity = K2(1), Extension = K2(2))
• Looks within the window length: if complexity < K2(1), the window triggers a segment, which is then extended for as long as complexity stays below K2(2)
• Uniform prior probabilities
  – The protein sequence database is a heterogeneous statistical mixture, so the initially unknown AA frequencies in low-complexity subsets need have no similarity to the frequencies in the total database
  – Unbiased view of low-complexity regions
  – Gives equiprobable compositions for any complexity state
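The trigger-and-extend idea can be sketched in Python as below. This is a simplified illustration, not the SEG program: it uses window entropy with "less than or equal" comparisons, merges overlapping hits, and omits any later refinement of segment boundaries. The function names are hypothetical; the defaults 12/2.2/2.5 are the parameters shown in the seg output above.

```python
import math
from collections import Counter


def entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    length = len(window)
    return sum(-(n / length) * math.log2(n / length)
               for n in Counter(window).values())


def find_low_complexity(seq: str, window: int = 12,
                        trigger: float = 2.2, extension: float = 2.5):
    """Return merged half-open (start, end) intervals of low-complexity sequence."""
    if len(seq) < window:
        return []
    k2 = [entropy(seq[i:i + window]) for i in range(len(seq) - window + 1)]
    segments = []
    i = 0
    while i < len(k2):
        if k2[i] <= trigger:
            # Extend left and right over windows that stay under the looser threshold.
            lo = i
            while lo > 0 and k2[lo - 1] <= extension:
                lo -= 1
            hi = i
            while hi + 1 < len(k2) and k2[hi + 1] <= extension:
                hi += 1
            start, end = lo, hi + window
            if segments and start <= segments[-1][1]:
                segments[-1] = (segments[-1][0], max(end, segments[-1][1]))
            else:
                segments.append((start, end))
            i = hi + 1
        else:
            i += 1
    return segments


flank = "ACDEFGHIKLMNPQRSTVWY"
print(find_low_complexity(flank + "G" * 15 + flank))
```

Because the extension threshold is looser than the trigger, a reported segment reaches somewhat beyond the windows that set it off.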

How SEG works, continued
• How do you correct for the background AA/nucleotide composition bias?
  – After randomly shuffling all the residues, determine the trigger complexity that results in 4% of the database being within low-complexity regions
  – Then use this trigger complexity and subtract 4% from the %AA in low-complexity regions
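One way to picture this calibration is the sketch below: shuffle the residues, then scan candidate trigger values and keep the largest one that leaves no more than 4% of residues inside triggered windows. All names, the candidate grid, and the use of trigger windows alone (no extension step) are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    length = len(window)
    return sum(-(n / length) * math.log2(n / length)
               for n in Counter(window).values())


def shuffle_residues(seq: str, seed: int = 0) -> str:
    """Randomly permute residues, keeping the overall composition."""
    residues = list(seq)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def masked_fraction(seq: str, window: int, trigger: float) -> float:
    """Fraction of residues covered by windows with entropy <= trigger."""
    if not seq:
        return 0.0
    covered = [False] * len(seq)
    for i in range(len(seq) - window + 1):
        if entropy(seq[i:i + window]) <= trigger:
            covered[i:i + window] = [True] * window
    return sum(covered) / len(seq)


def calibrate_trigger(seq: str, window: int = 12, target: float = 0.04,
                      candidates=None, seed: int = 0):
    """Largest candidate trigger whose masked fraction on shuffled data is <= target."""
    if candidates is None:
        candidates = [x / 10 for x in range(10, 36)]
    shuffled = shuffle_residues(seq, seed)
    best = None
    for trigger in sorted(candidates):
        if masked_fraction(shuffled, window, trigger) <= target:
            best = trigger
        else:
            break
    return best
```

In practice the input would be the concatenated organism-specific protein set, so that the calibrated trigger reflects that organism's background composition.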

Detecting Low-complexity with repetitive motif: SSR
• PSEG or NSEG
• Repetition of residue types or k-grams
• Period 3
    (n V E n K N n V D n K D n V N n K S n K)
    (n m i n m i n m i n m i n m i n m i n m)
    (n m E n m N n m D n m D n m N n m S n m)
• Sliding window along the sequence in single-residue steps
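The period-3 example can be explored by splitting a sequence into its phases and measuring the entropy of each phase separately. This sketch only illustrates the idea of periodicity; it is not the PSEG algorithm, and the function names are hypothetical. The test sequence is the first parenthesized example on the slide with spaces removed.

```python
import math
from collections import Counter


def entropy(residues: str) -> float:
    """Shannon entropy (bits) of a residue composition."""
    length = len(residues)
    return sum(-(n / length) * math.log2(n / length)
               for n in Counter(residues).values())


def phase_entropies(seq: str, period: int):
    """Entropy of the residues found at each phase of the given period."""
    seq = seq.upper()
    return [entropy(seq[phase::period]) for phase in range(period)]


example = "NVENKNNVDNKDNVNNKSNK"
print("whole sequence:", round(entropy(example), 3))
for phase, value in enumerate(phase_entropies(example, 3)):
    print("phase", phase, example[phase::3], round(value, 3))
```

Phase 0 contains only N, so its entropy is zero even though the sequence as a whole looks moderately complex; that contrast is what a periodicity-aware search picks up.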

Evolutionary Mechanisms
• Evolution of sequences in general
  – Evolution rate of
      Base pair substitution (10^-9)
      Insertions/deletions
      Recombination
• In SSR and low-complexity regions, mutations are in length
  – with steps typically +/- one repeat unit
  – Evolution rate
      Biased nucleotide substitution due to increased recombination in repetitive regions
      Unequal crossing over (recombination)
      Replication slippage
• Alignment of repeats does not imply relationships/ancestry

Low-Complexity and BLAST searches
• Low-complexity regions cause BLAST search results to be dominated by low-complexity matches
  – biased AA/nucleotide composition
• BLAST added "mask low-complexity" by default
  – Seg parameters:
• BLAST now also uses a compositional bias filter on the whole database
  – Masks compositionally biased regions using seg
• YOU MAY WANT TO TURN THESE OPTIONS OFF and use your own organism-specific seg parameters when doing protein homology searching
• YOU WILL NEED TO TURN THESE OPTIONS OFF if you are interested in looking at sequence similarities of repetitive/low-complexity regions

Example: Plasmodium falciparum
• Using whole genome sequences is important to limit PCR sequencing bias for antigens: hydrophilic proteins
• Considering GC-content / AA bias
  – P. falciparum is approximately 28% GC
• Visualization of individual proteins

A helpful tool here and in general
• SEALS: A System for Easy Analysis of Lots of Sequences, R. Walker and E. Koonin, NCBI
• CBBresearch/Walker/SEALS/index.html
• Demonstrate getting an appropriate data set
  – Taxnode2gi, gi2fasta
  – Daffy
  – Purge
  – Gref
  – Fanot
• Use cleaned data set of P. falciparum proteins

Protein Analysis
• Setting the trigger complexity:
  – Dbcomp
  – Shuffledb
  – Seg
• Run SEG on P. falciparum MSP1, PfEMP2, Cg2
  – Options:
      -p (tree form output)
      -l (only report Low-C segs)
      -h (don't report Low-C segs)
      -x (substitute Low-C with x)
• Run PSEG on P. falciparum MSP1, PfEMP2, Cg2 with different -z (periodicity)

Usefulness of studying Low-Complexity
• Within a protein
  – secondary structure, homology searches, protein location
  – genetic disorders
• Within taxa
  – microsatellite markers
  – polymorphism comparisons between proteins
• Between taxa
  – synteny, orthologs
  – different selection pressures upon different organisms
  – parasites: immunogenicity, rapid evolution of antigens, recombination