Sequence Analysis Topics to be covered What is sequence analysis? Why we need it? How we do it? What are the tools available?

Sequence Databases GenBank at the National Center of Biotechnology Information, National Library of Medicine, Washington, DC accessible from: http://www.ncbi.nlm.nih.gov/Entrez European Molecular Biology Laboratory (EMBL) Outstation at Hinxton, England http://www.ebi.ac.uk/embl/index.html DNA DataBank of Japan (DDBJ) at Mishima, Japan http://www.ddbj.nig.ac.jp/ Protein International Resource (PIR) database at the National Biomedical Research Foundation in Washington, DC http://www-nbrf.georgetown.edu/pirwww/ 5.

Sequence Databases The SwissProt protein sequence database at ISREC, Swiss Institute for Experimental Cancer Research in Epalinges/Lausanne http://www.expasy.ch/cgi- bin/sprot-search-de

Servers for Sequence Alignment Bytes Block Aligner http://www.wadsworth.org/res&res/bioinfo BCM Search Launcher: Pair-wise sequence alignment http://searchlauncher.bcm.tmc.edu/seq- search/alignment.html SIM—Local similarity program for finding alternative alignments http://www.expasy.ch/tools/sim.html Global alignment programs (GAP, NAP) http://genome.cs.mtu.edu/align/align.html FASTA program suite http://fasta.bioch.virginia.edu/fasta/fasta_list.html BLAST 2 sequence alignment (BLASTN, BLASTP) http://www.ncbi.nlm.nih.gov/gorf/bl2.html Likelihood-weighted sequence alignment (lwa) http://stateslab.bioinformatics.med.umich.edu/service/l wa.html

ALIGNMENT How do we tell whether two macromolecules are similar? Why?

Sequence Comparison Similarity between and among sequences is at the heart of sequence comparison methods. Q: What causes sequences to be similar? A: Evolutionary OR developmental causes.

Similarity, Homology, Orthology, Paralogy Similarity cannot be equated with Homology, Orthology, or Paralogy. Different organisms may have similar characteristics for different causes/reasons: evolutionary OR developmental. We need to concern ourselves with one set only, the set whose similarities are due to evolutionary causes/events; in other words, we take into account ONLY those characteristics that are similar among organisms due to evolution (common ancestry).

Similarity,Homology, Orthology, Paralogy

Homology Homology: designates a relationship of common descent between any entities, without further specification of the evolutionary scenario. Accordingly, the entities related by homology, in particular, genes, are called homologs.

Sequences are said to be homologous if they are related by divergence from a common ancestor Homology is not a measure of similarity, but an absolute statement that sequences have a divergent rather than a convergent relationship.

Orthology Orthologs are homologous genes in different species with analogous functions.

Paralogy Paralogs are similar genes that are the result of a gene duplication. Paralogs are therefore neither homologs nor orthologs.

Orthology vs. Paralogy A phylogeny that includes both orthologs and paralogs is likely to be incorrect. The figure above shows how gene duplication gave rise to two paralogous branches, α and β, species within each branch are orthologs of each other.

ALIGNMENT One-to-One One-to-Database Many-to-Many

Origins of Sequence Similarity Homology –common evolutionary descent Similarity in function –convergence Chance

Identity GAACAAT |||||||7/7 OR 100% GAACAAT ||| |||6/7 OR 84% GAATAAT MISMATCH

Mismatches GAACAAT ||| |||6/7 OR 84% GAATAAT GAACAAT ||| |||6/7 OR 84% GAAGAAT

Terminal Mismatch GAACAATttttt ||| ||| aaaccGAATAAT 6/7 OR 84%

Types of Sequence Analysis Knowledge-based single/multiple sequence analysis for sequence characteristics Pair-wise sequence alignment Multiple sequence alignment ~ Sequence motif discovery in multiple sequence alignment ~Phylogenetic analysis

Sequence Alignment Procedure for comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.

Sequence Alignment I Fundamental to inferring homology (common ancestry) and function. IIIf two sequences are in alignment  part or all of the pattern of nucleotides and polypeptides match, then they are similar and can be said to be homologous. IIIIf the sequence of a protein or other molecule ‘significantly’ matches the sequence of a protein with a known structure and function, then the molecules may share structure and function.

Pair-wise Sequence Alignment One pair of elements at a time Challenge – Find optimum alignment of 2 sequences with some degree of similarity Optimality is based on SCORE Score reflects the no. of paired characters in the 2 sequences and the no. and length of gaps introduced to adjust the sequences so that max no. of characters are in alignment

Pair-wise Sequence Alignment Example 1 Consider the ideal case of 2 identical nucleotide sequences Score – 1 point per pair of aligned characters SCORE = 20 points ATTCGGCATTCAGTGCTAGA ATTCGGCATTCAGTGCTAGA

Pair-wise Sequence Alignment Example 2 Consider the case when several of the characters are not aligned A) SCORE = 11 points?? Note that the last 6 characters are identical! ATTCGGCATTCAGTGCTAGA ATTCGGCATTGCTAGA

Pair-wise Sequence Alignment Example 2 contd. SCORE = 16 points?? Any cost/penalty of inserting gaps? Penalty = -0.5 / gap Final Score = 16 – 0.5*4 = 14 ATTCGGCATTCAGTGCTAGA ATTCGGCATT----GCTAGA

Pair-wise Sequence Alignment Example 3 Areas of similarity and dissimilarity not obvious as opposed to the previous example where there were two blocks of identical characters SCORE = 12 points ATTCGGCATTCAGAGCTAGA ATTCGACATTGCTAGTGGTA

Pair-wise Sequence Alignment Example 3 contd… Areas of similarity and dissimilarity not obvious SCORE = 14 – 0.5*4 = 12 SCORE = 14 (1 for exact match) – 0.5*4 ( - 0.5 for gap) – 0.5*1 ( - 0.5 for inexact match) = 14 – 2 – 0.5 =11.5 ATTCGGCATTCAGAGCTAGa ATTCGACATT----GCTAGt

Pair-wise Sequence Alignment Example 3 contd. SCORE = 14 – 0.5*4 – 0.5*1 (inexact matches) = 11.5 Penalty(gap)=Cost(opening) + Cost(per ext)* Length(gap)  in Example 2 and above, gap penalty = -0.5 + (-0.5*4) = -2.5  score in the above case becomes 8.5 ATTCGGCATTCAGAGCTAGa ATTCGACATT----GCTAGt

Two types of Sequence Alignment Global Alignment: An attempt is made to align the entire sequence using as many characters as possible, till the ends of both the sequences. Local Alignment: Stretches of sequence with the highest density of matches are aligned generating one or more islands of matches. This is particularly relevant for multi-domain proteins that share a conserved domain.

Local vs. Global Alignment Global alignment An attempt to line up two sequences matching as many characters as possible Considers all characters in a sequence. Bases alignment on the total score, even at the expense of stretches that that share obvious similarity Used for determining whether two protein sequences are in the same family

Local vs. Global Alignment Local Alignment More meaningful – points out conserved regions between two sequences Aligns two partially overlapping sequences Aligns two sequences where one is a subsequence of another

Methods Dot Matrix/Dot Plot Bayesian Method Dynamic Programming Smith-Waterman Algorithm Hidden Markov Models Genetic Algorithms Neural Networks Word-based techniques FASTA BLAST

Dot Matrix Method Visual inspection of linear sequences with 100 or more characters is impractical Dot Matrix method is visually more informative Makes similarities in patterns more obvious to visual inspection A dot is placed at the intersection of matching character pairs Dotter is a graphical dot plot program for detailed comparison of two sequences. Reference A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis" Erik L.L. Sonnhammer and Richard Durbin Gene 167:GC1-10 (1995 )

Dot Matrix Method

The biggest asset of dot matrix analysis is it allows you to visualize the entire comparison at once, not concentrating on any one ‘optimal’ region. Since your own mind and eyes are still better than computers at discerning complex visual patterns, especially when more than one pattern is being considered, you can see all these ‘less than best’ comparisons as well as the main one and then you can ‘zoom-in’ on those regions of interest using more detailed procedures.

Dot Matrix Method ‘mutated’ inter-sequence comparison

Dot Matrix Method: Detection of Repeats

Dot Matrix Method: detecting complicated mutations

Dot Matrix Method Complicated MUTATIONS Again, notice the diagonals. However, they have now been displaced off of the center diagonal of the plot. This is an example of a ‘transposition’. Dot matrix analysis is one of the only sensible ways to locate such transpositions in sequences. The ‘deletion’ of ‘PRIMER’ is shown by the lack of a corresponding diagonal.

Noise in Dot plots and filtering SHIBASISHBITSCHOWDHURY SSSSS HHHHH IIII BBB AA SSSSS IIII SSSSS HHHHH BBB IIII TT SSSSS

Dot Matrix Method: filtering noise using the sliding window approach Reconsider the same plot. Notice the extraneous dots that neither indicate runs of identity between the two sequences nor inverted repeats. These merely contribute ‘noise’ to the plot and are due to the ‘random’ occurrence of the letters in the sequences, the composition of the sequences themselves. How can we ‘clean up’ the plots so that this noise does not detract from our interpretations? Consider the implementation of a filtered windowing approach; a dot will only be placed if some ‘stringency’ is met within a user-defined window. What is meant by this is that if within some defined window size, some defined criterion is met, then and only then, will a dot be placed at the middle of that window. Then the window is shifted one position and the entire process is repeated. This very successfully rids the plot of unwanted noise.

Sliding Window Algorithm T A C G G T A T G A C A G T A T C T A C G G T A T G A C A G T A T C T A C G G T A T G A C A G T A T C T A C G G T A T G A C A G T A T C CTAT GACATACGGTATGCTAT GACATACGGTATG Window Size = 3 Stringency = 3 

Hemoglobin  -chain Hemoglobin  -chain Dotplot (Window = 130 / Stringency = 9)

Dotplot (Window = 18 / Stringency = 10) Hemoglobin  -chain Hemoglobin  -chain

Parameters of Sequence Alignment Scoring Systems: Each symbol pairing is assigned a numerical value, based on a symbol comparison table. Gap Penalties: Opening: The cost for opening a gap Extension: The cost for elongating a gap

Protein Scoring Systems The concept of amino acid substitution matrices : The concept of amino acid substitution matrices : A 20x20 matrix containing values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. –Pioneered by Margaret Dayhoff (1968) who first described a method to derive scoring matrices from amino acid replacements observed in aligned families –of present-day sequences. –In 1978 her group studied 1572 mutations in 71 families of closely-related protein sequences. Based on these they built matrices known as PAM (percent accepted mutation) matrices.

PAM 250 (log odds form or mutation data matrix or MDM) A R N D C Q E G H I L K M F P S T W Y V B Z A 2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -3 1 1 1 -6 -3 0 2 1 R -2 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 1 2 N 0 0 2 2 -4 1 1 0 2 -2 -3 1 -2 -3 0 1 0 -4 -2 -2 4 3 D 0 -1 2 4 -5 2 3 1 1 -2 -4 0 -3 -6 -1 0 0 -7 -4 -2 5 4 C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3 0 -2 -8 0 -2 -3 -4 Q 0 1 1 2 -5 4 2 -1 3 -2 -2 1 -1 -5 0 -1 -1 -5 -4 -2 3 5 E 0 -1 1 3 -5 2 4 0 1 -2 -3 0 -2 -5 -1 0 0 -7 -4 -2 4 5 G 1 -3 0 1 -3 -1 0 5 -2 -3 -4 -2 -3 -5 0 1 0 -7 -5 -1 2 1 H -1 2 2 1 -3 3 1 -2 6 -2 -2 0 -2 -2 0 -1 -1 -3 0 -2 3 3 I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 2 -2 2 1 -2 -1 0 -5 -1 4 -1 -1 L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 -3 4 2 -3 -3 -2 -2 -1 2 -2 -1 K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 0 -5 -1 0 0 -3 -4 -2 2 2 M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 0 -2 -2 -1 -4 -2 2 -1 0 F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 -5 -3 -3 0 7 -1 -3 -4 P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 1 0 -6 -5 -1 1 1 S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 1 -2 -3 -1 2 1 T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 -5 -3 0 2 1 W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 0 -6 -4 -4 Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 -2 -2 -3 V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 0 0 B 2 1 4 5 -3 3 4 2 3 -1 -2 2 -1 -3 1 2 2 -4 -2 0 6 5 Z 1 2 3 4 -4 5 5 1 3 -1 -1 2 0 -4 1 1 1 -4 -3 0 5 6 C -8 17 W W Each matrix value is calculated from an odds score, which is the probability that the amino acid pair will be found in alignments of homologous proteins divided by the probability that the pair will be found in alignments of unrelated proteins by random chance.

Using the matrix… 3 Alignment Sequence ATyrCysAspAla Sequence BPheMetGluGly PAM 250 matrix value 7 -5 3 1 Total score for alignment of sequence A with sequence B =?

Using the matrix (example 2) PTHPLASKTQILPEDLASEDLTI PTHPLAGERAIGLARLAEEDFGM Sequence 1 Sequence 2 Scoring matrix T:G= -2 T:T = 5 Score= 48 CSTPAGND.. C 9 S-1 4 T-1 1 5 P-3-1-1 7 A 0 1 0-1 4 G-3 0-2-2 0 6 N-3 1 0-2-2 0 5 D-3 0-1-1-2-1 1 6. CSTPAGND.. C 9 S-1 4 T-1 1 5 P-3-1-1 7 A 0 1 0-1 4 G-3 0-2-2 0 6 N-3 1 0-2-2 0 5 D-3 0-1-1-2-1 1 6.

First pairs of aligned amino acids in verified alignments are used to build a count matrix Then the count matrix is used to estimate a mutation matrix at 1 PAM (evolutionary unit). From the mutation matrix, a Dayhoff scoring matrix is constructed. This Dayhoff matrix along with a model of indel events is then used to score new alignments These alignments can then be used in an iterative process to construct new count matrices.

Extrapolating the PAM 1 matrix to obtain other matrices of the PAM family Percent Accepted Mutation. A unit introduced by Dayhoff et al. to quantify the amount of evolutionary change in a protein sequence. 1.0 PAM unit, is the amount of evolution which will change, on average, 1% of amino acids in a protein sequence.

PAM family PAM matrices are based on global alignments of closely related proteins. The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. Other PAM matrices can be extrapolated from PAM1. Each PAM matrix gives the substitutions expected for a given period of evolutionary time. It is rare that a PAM matrix would be used for an evolutionary distance any greater than 256 PAMS.

BLOSUM (Block Substitution Matrices) BLOSUM matrices are based on local alignments. BLOSUM 62 is a matrix calculated from comparisons of sequences with no less than 62% divergence. Unlike PAM matrices, all BLOSUM matrices are based on observed alignments, they are not extrapolated from comparisons of closely related sequences. BLOSUM 62 is the default matrix in BLAST. Though BLOSUM 62 is tailored for comparisons of moderately distant proteins, it performs well in detecting close relationships. A search for distant relatives may be more sensitive with a different matrix.

Blosum 62 substitution matrix S Henikoff, and JG Henikoff Amino Acid Substitution Matrices from Protein Blocks PNAS 89: 10915-10919, 1992.

The relationship between PAM and BLOSUM matrices BLOSUM 80BLOSUM 62BLOSUM 45 PAM 1 PAM 120 PAM 250 Less divergent More divergent

Nucleic Acid Scoring Systems -very simple actaccagttcatttgatacttctcaaa taccattaccgtgttaactgaaaggacttaaagact Sequence 1 Sequence 2 AGCTA1000G0100C0010T0001AGCTA1000G0100C0010T0001 Match: 1 Mismatch: 0 Score = 5 Unitary matrix

T A T G T G G A A T G A Scoring Insertions and Deletions and Gap Penalties A T G T - - A A T G C A A T G T A A T G C A T A T G T G G A A T G A The creation of a gap is penalized with a negative score value. insertion / deletion

1 GTGATAGACACAGACCGGTGGCATTGTGG 29 ||| | | ||| | || || | 1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29 Why Gap Penalties? Gaps allowed but not penalized Score: 88 Gaps not permitted Score: 0 1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29 ||| || | | | ||| || | | || || | 1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29 Match = 5 Mismatch = -4

The optimal alignment of two similar sequences is usually that which maximizes the number of matches and minimizes the number of gaps. There is a tradeoff between these two - adding gaps reduces mismatches Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences. Penalizing gaps forces alignments to have relatively few gaps. Why Gap Penalties?

Gap Penalties How to balance gaps with mismatches? Gaps must get a steep penalty, or else you’ll end up with nonsense alignments. In real sequences, multi-base (or amino acid) gaps are quit common genetic insertion/deletion events “Affine” gap penalties give a big penalty for each new gap, but a much smaller “gap extension” penalty.

Scoring Insertions and Deletions A T G T T A T A C T A T G T G C G T A T A Total Score: 4 Gap parameters: d = 3(gap opening) e = 0.1(gap extension) g = 3(gap length)  (g) = -3 - (3 -1) 0.1 = -3.2 T A T G T G C G T A T A A T G T - - - T A T A C insertion / deletion match = 1 mismatch = 0 Total Score:8 - 3.2 = 4.8 Scoring system 1 Scoring system 2

Building the PAM mutation data matrix Studied 1572 mutations in 71 families of closely-related protein sequences. 1 Similar sequences were organized into a phylogenetic tree in each family. In each of the aligned sequence families, the number of changes of each amino acid to every other amino acid were counted. Example Out of 1572 changes, there were 260 changes between F and Y. Multiply

Sequence Analysis Topics to be covered What is sequence analysis? Why we need it? How we do it? What are the tools available?

Similar presentations

Presentation on theme: "Sequence Analysis Topics to be covered What is sequence analysis? Why we need it? How we do it? What are the tools available?"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Sequence Analysis Topics to be covered What is sequence analysis? Why we need it? How we do it? What are the tools available?

Similar presentations

Presentation on theme: "Sequence Analysis Topics to be covered What is sequence analysis? Why we need it? How we do it? What are the tools available?"— Presentation transcript:

Similar presentations

About project

Feedback