Pairwise Alignment: How do we tell whether two sequences are similar? BIO520 Bioinformatics, Jim Lund. Assigned reading: Ch 4.1-4.7, Ch 5.1, get what you can out of 5.2, 5.4.


1 Pairwise Alignment: How do we tell whether two sequences are similar? BIO520 Bioinformatics, Jim Lund. Assigned reading: Ch 4.1-4.7, Ch 5.1, get what you can out of 5.2, 5.4

2 Pairwise alignment DNA:DNA polypeptide:polypeptide The BASIC Sequence Analysis Operation

3 Alignments Pairwise sequence alignments –One-to-One –One-to-Database Multiple sequence alignments –Many-to-Many

4 Origins of Sequence Similarity Homology –common evolutionary descent Chance –Short similar segments are very common. Similarity in function –Convergence (very rare)

5

6 Visual sequence comparison: Dotplot

7 Visual sequence comparison: Filtered dotplot 4 bp window, 75% identity cutoff

8 Visual sequence comparison: Dotplot 4 bp window, 75% identity cutoff
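
A filtered dotplot like the ones on slides 7 and 8 can be sketched in a few lines. This is only an illustration: the function names `filtered_dotplot` and `render` are made up for this example, and the rule used here (place a dot when a window starting at position i in one sequence and position j in the other reaches the identity cutoff) is one common way to filter, not necessarily the one used to draw the slides.

```python
def filtered_dotplot(seq_a, seq_b, window=4, min_identity=0.75):
    """Return the set of (i, j) pairs where the windows starting at
    seq_a[i] and seq_b[j] agree at >= min_identity of their positions."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches / window >= min_identity:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the dotplot as text: one row per position in seq_a."""
    rows = []
    for i in range(len(seq_a)):
        rows.append(
            "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        )
    return "\n".join(rows)


if __name__ == "__main__":
    a, b = "GAACAAT", "GAATAAT"
    print(render(a, b, filtered_dotplot(a, b)))
```

Raising `min_identity` to 1.0 or lengthening the window removes more of the background dots, at the cost of breaking the diagonal wherever the sequences differ.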

9 Dotplots of sequence rearrangements

10 Assessing similarity GAACAAT |||||||7/7 OR 100% GAACAAT GAACAAT | 1/7 or 14% GAACAAT Which is BETTER? How do we SCORE?

11 Similarity GAACAAT ||||||| 7/7 or 100% GAACAAT GAACAAT ||| ||| 6/7 or 86% GAATAAT MISMATCH

12 Mismatches GAACAAT ||| ||| 6/7 or 86% GAATAAT GAACAAT ||| ||| 6/7 or 86% GAAGAAT
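
The identity figures on these slides are simple position-by-position counts. A minimal sketch, assuming the two strings are already aligned and of equal length (the function name `percent_identity` is invented for this example):

```python
def percent_identity(top, bottom):
    """Count identical positions in two aligned, equal-length sequences.

    Returns (matches, aligned_length, percent)."""
    if len(top) != len(bottom) or not top:
        raise ValueError("sequences must be non-empty and the same length")
    matches = sum(1 for a, b in zip(top, bottom) if a == b)
    return matches, len(top), 100.0 * matches / len(top)


if __name__ == "__main__":
    matches, length, pct = percent_identity("GAACAAT", "GAATAAT")
    print(f"{matches}/{length} = {pct:.1f}%")
```

For GAACAAT against GAATAAT this gives 6/7, about 85.7%. Note that the count says nothing about which base replaced which: a C-to-T and a C-to-G mismatch score the same.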

13 Terminal Mismatch GAACAATttttt ||| ||| aaaccGAATAAT 6/7 or 86%

14 INDELS GAAgCAAT ||| ||||7/7 OR 100% GAA*CAAT

15 Indels, cont’d GAAgCAAT ||| |||| GAA*CAAT GAAggggCAAT ||| |||| GAA****CAAT

16 Similarity Scoring Common Method: Terminal mismatches (0) Match score (1) Mismatch penalty (-3) Gap penalty (-1) Gap extension penalty (-1) DNA Defaults

17 DNA Scoring GGGGGGAGAA 2 |||||*|*|| 8(1)+2(-3)= 2 GGGGGAAAAAGGGGG GGGGGGAGAA--GGG 3 |||||*|*|| ||| 11(1)+2(-3)+1(-1)+1(-1)= 3 GGGGGAAAAAGGGGG
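
The arithmetic on slide 17 can be checked with a small scorer that applies the slide 16 defaults. This is a sketch under a few assumptions: the function name `score_alignment` is invented, gaps are written as `-`, only the aligned columns are passed in (terminal overhangs score 0 and are simply left out), and the first position of a gap is charged the gap penalty while each further position is charged the extension penalty, as the slide does.

```python
MATCH = 1        # match score
MISMATCH = -3    # mismatch penalty
GAP_OPEN = -1    # penalty for the first position of a gap
GAP_EXTEND = -1  # penalty for each additional gap position


def score_alignment(top, bottom, gap="-"):
    """Score two aligned strings column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must be the same length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            # Simplification: a gap that switches strands mid-run is
            # still treated as an extension of the same gap.
            score += GAP_EXTEND if in_gap else GAP_OPEN
            in_gap = True
        else:
            score += MATCH if a == b else MISMATCH
            in_gap = False
    return score


if __name__ == "__main__":
    print(score_alignment("GGGGGGAGAA", "GGGGGAAAAA"))            # 2
    print(score_alignment("GGGGGGAGAA--GGG", "GGGGGAAAAAGGGGG"))  # 3
```

The two calls reproduce the slide: 8(1)+2(-3) = 2 for the ungapped alignment, and 11(1)+2(-3)+1(-1)+1(-1) = 3 once the two-base gap lets the trailing GGG align.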

18 Absurdity of Low Gap Penalty GATCGCTACGCTCAGC A.C.C..C..T Perfect similarity, Every time!

19 Sequence alignment algorithms Local alignment –Smith-Waterman Global alignment –Needleman-Wunsch

20 Alignment Programs Local alignment (Smith-Waterman) –BLAST (simplified Smith-Waterman) –FASTA (simplified Smith-Waterman) –BESTFIT (GCG program) Global alignment (Needleman-Wunsch) –GAP

21 Local vs. global alignment 10 gaggc 15 ||||| 3 gaggc 7 1 gggggaaaaagtggccccc 19 || |||| || 1 gggggttttttttgtggtttcc 22 Global alignment: alignment of the full length of the sequences Local alignment: alignment of regions of substantial similarity
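
The difference between the two algorithms is small in code. The sketch below computes only the optimal score (no traceback) for either mode; it is a textbook-style dynamic program, not the code of any particular package. The function name `alignment_score` is invented, and it assumes a linear gap cost of -1 per gap position together with the match (1) and mismatch (-3) values from slide 16.

```python
def alignment_score(a, b, match=1, mismatch=-3, gap=-1, local=False):
    """Optimal alignment score with a linear gap penalty.

    local=False: Needleman-Wunsch (global, end to end).
    local=True:  Smith-Waterman (best-scoring pair of substrings)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the ends of the sequences.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            cell = max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere: never go below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


if __name__ == "__main__":
    print(alignment_score("aaagaggcaaa", "tttgaggcttt", local=True))  # 5
    print(alignment_score("GAACAAT", "GAATAAT"))                      # 4
    print(alignment_score("GAACAAT", "GAATAAT", gap=-5))              # 3
```

The last two calls show how much the gap penalty matters: with a gap cost of -1, the best global alignment of GAACAAT and GAATAAT avoids the single mismatch by opening two gaps (6 matches, 2 gap positions, score 4), whereas a stiffer penalty of -5 brings back the ungapped alignment with one mismatch (score 3).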

22 Local vs. global alignment

23 BLAST Algorithm Look for local alignments, High-scoring Segment Pairs (HSPs). Find words of length W shared by query and subject with score > T. Extend each local alignment until the score drops X below its maximum. Keep HSPs with scores > S. Find multiple HSPs per query if present. Expectation value (E value) using Karlin-Altschul stats
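
The seed-and-extend idea can be illustrated with a toy version. This is a deliberately simplified sketch, not the BLAST implementation: it seeds on exact word matches only (no neighborhood words scored against T), extends to the right only, and uses hypothetical names (`find_seeds`, `extend_right`) and illustrative parameter values.

```python
def find_seeds(query, subject, w=4):
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    seeds = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds


def extend_right(query, subject, i, j, w, match=1, mismatch=-3, x_drop=5):
    """Ungapped rightward extension of a seed at (i, j).

    Stops when the running score falls more than x_drop below the best
    score seen so far. Returns (best_score, length_at_best_score)."""
    score = best = w * match
    best_len = w
    k = w
    while i + k < len(query) and j + k < len(subject):
        score += match if query[i + k] == subject[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break
    return best, best_len


if __name__ == "__main__":
    q, s = "GAACAAT", "TTGAACAATTT"
    for qi, sj in find_seeds(q, s):
        print(qi, sj, extend_right(q, s, qi, sj, 4))
```

Indexing the subject words once and looking up each query word is what makes the seeding step fast; the expensive alignment work is spent only around the seeds.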

24 BLAST statistical significance: assessing the likelihood a match occurs by chance. Karlin-Altschul statistic: E = k m N exp(-Lambda S), where m = size of query sequence, N = size of database, k = search space scaling parameter, Lambda = scoring scaling parameter, S = BLAST HSP score. Low E -> good match
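
The formula is easy to evaluate directly. In the sketch below the function name `expect_value` is invented, and the values of k and Lambda in the demo are placeholders chosen for illustration; the real values depend on the scoring matrix and gap penalties and are reported by BLAST in its output.

```python
import math


def expect_value(score, query_len, db_len, k, lam):
    """Karlin-Altschul expectation: E = k * m * N * exp(-Lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    # Placeholder parameters, for illustration only.
    k, lam = 0.1, 0.3
    for s in (30, 60, 90):
        print(s, expect_value(s, query_len=300, db_len=1_000_000, k=k, lam=lam))
```

Two things follow from the formula: E falls exponentially as the score S rises, and it grows linearly with the size of the query and of the database, so the same HSP is less significant in a larger database.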

25 BLAST statistical significance: Rule of thumb for a good match: Nucleotide match E < 1e-6 Identity > 70% Protein match E < 1e-3 Identity > 25%

26 Protein Similarity Scoring Identity - Easy WEAK Alignments Chemical Similarity –L vs I, K vs R… Evolutionary Similarity –How do proteins evolve? –How do we infer similarities?

27 BLOSUM62

28 Single-base evolution changes the encoded AA. Third position: CAU=H, CAC=H, CAA=Q, CAG=Q. Second position: CGU=R, CCU=P, CUU=L. First position: UAU=Y, GAU=D, AAU=N.
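
The neighbors of CAU shown on this slide can be generated from the standard genetic code. A minimal sketch, with invented names (`CODON_TABLE`, `single_base_neighbors`); the table is built from the conventional UCAG ordering of the standard code.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def single_base_neighbors(codon):
    """Map each codon one substitution away from `codon` to its amino acid."""
    neighbors = {}
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                neighbors[mutant] = CODON_TABLE[mutant]
    return neighbors


if __name__ == "__main__":
    print("CAU =", CODON_TABLE["CAU"])
    for mutant, aa in sorted(single_base_neighbors("CAU").items()):
        print(mutant, "=", aa)
```

For CAU (His), the nine single-base changes reach only seven other amino acids, and one of them (CAC) is silent. This is part of why some substitutions are seen far more often than others.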

29 Substitution Matrices Two main classes: PAM-Dayhoff BLOSUM-Henikoff

30 PAM-Dayhoff Built from closely related proteins, substitutions constrained by evolution and function “accepted” by evolution (Point Accepted Mutation=PAM) 1 PAM::1% divergence PAM120=closely related proteins PAM250=divergent proteins

31 BLOSUM- Henikoff&Henikoff Built from ungapped alignments in proteins: “BLOCKS” Sequences within a block that are more similar than a given % are merged and counted as one sequence. Calculate “target” frequencies BLOSUM62=blocks merged at 62% similarity –good general purpose BLOSUM30 –Detects weak similarities, used for distantly related proteins
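
Once the target frequencies are in hand, each matrix entry is a log-odds score: how much more often a pair of residues is seen aligned in the blocks than expected by chance. A minimal sketch, assuming the half-bit scaling (2 x log2, rounded to an integer) used for BLOSUM matrices; the function name `log_odds_score` and the frequencies in the demo are made up for illustration.

```python
import math


def log_odds_score(pair_freq, freq_i, freq_j, scale=2.0):
    """Log-odds score: scale * log2(observed / expected), rounded.

    pair_freq:      observed (target) frequency of the aligned pair
    freq_i, freq_j: background frequencies of the two residues"""
    expected = freq_i * freq_j
    return round(scale * math.log2(pair_freq / expected))


if __name__ == "__main__":
    # Hypothetical frequencies, not taken from the BLOCKS database.
    print(log_odds_score(0.04, 0.1, 0.1))    # pair seen 4x more than chance
    print(log_odds_score(0.01, 0.1, 0.1))    # pair seen as often as chance
    print(log_odds_score(0.0025, 0.1, 0.1))  # pair seen 4x less than chance
```

Positive entries mark substitutions that are tolerated more often than chance predicts, zero means no information, and negative entries mark pairs that evolution tends to avoid.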

32 BLOSUM62

33 Gapped alignments No general theory for significance of matches!! Gap penalty = G+L(n): G = gap opening penalty, L = gap extension penalty, n = gap length –indel mutations rare –variation in gap length “easy”, G > L
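
The G+L(n) cost is simple to tabulate. In this sketch the function name `gap_cost` is invented and the values G = 11, L = 1 are just an example of a large opening cost paired with a small extension cost.

```python
def gap_cost(n, gap_open, gap_extend):
    """Cost of a single gap of length n under the affine model G + L*n."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * n


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, gap_cost(n, gap_open=11, gap_extend=1))
```

Because G > L, one gap of length 10 (cost 21 here) is much cheaper than ten separate gaps of length 1 (cost 120), which matches the biology on the slide: indel events are rare, but a single event can insert or delete many residues.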

34 Real Alignments

35 Phylogeny

36 Cow-to-Pig Protein

37 Cow-to-Pig cDNA 80% Identity (88% at aa!)

38 DNA similarity reflects polypeptide similarity

39 Coding vs Non-coding Regions 90% in coding (70% in non-coding)

40 Third Base of Codon is Hypervariable

41 Cow-to-Fish Protein 42% identity, 51% similarity

42 Cow-to-Fish DNA 48% similarity

43 Protein vs. DNA Alignments Polypeptide similarity > DNA Coding DNA > Non-coding 3rd base of codon hypervariable Moderate Distance -> poor DNA similarity
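
The third-base effect can be measured by splitting identity by codon position. A minimal sketch, assuming two coding sequences that are already aligned in frame without gaps; the function name `identity_by_codon_position` and the toy sequences are invented for this example.

```python
def identity_by_codon_position(cds_a, cds_b):
    """Fraction of identical bases at codon positions 1, 2 and 3.

    Both inputs must be in-frame, ungapped, and the same length."""
    if not cds_a or len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("need equal-length, in-frame coding sequences")
    matches = [0, 0, 0]
    for idx, (a, b) in enumerate(zip(cds_a, cds_b)):
        if a == b:
            matches[idx % 3] += 1
    codons = len(cds_a) // 3
    return [m / codons for m in matches]


if __name__ == "__main__":
    # Toy example: both sequences encode His-Gln, differing only at base 3.
    print(identity_by_codon_position("CAUCAA", "CACCAG"))
```

In the toy pair the protein is 100% identical while the DNA is only 67% identical, with every difference at the third codon position.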

44 Rules of Thumb DNA-DNA similarities –50% significant if “long” –E < 1e-6, 70% identity Protein-protein similarities –80% end-end: same structure, same function –30% over domain, similar function, structure overall similar –15-30% “twilight zone” –Short, strong match…could be a “motif”

45 Basic BLAST Family BLASTN –DNA to DNA database BLASTP –protein to protein database TBLASTN –protein to DNA database (translated) BLASTX –DNA (translated) to protein database TBLASTX –DNA (translated) to DNA database (translated)
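
Choosing among the five programs comes down to three questions: what is the query, what is in the database, and should nucleotide sequence be compared at the protein level? A small lookup sketch, with invented names (`BLAST_PROGRAMS`, `choose_blast_program`), following the standard NCBI program definitions:

```python
# (query type, database type, compare via translation?) -> program
BLAST_PROGRAMS = {
    ("dna", "dna", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("dna", "protein", True): "blastx",    # query translated in six frames
    ("protein", "dna", True): "tblastn",   # database translated in six frames
    ("dna", "dna", True): "tblastx",       # both translated in six frames
}


def choose_blast_program(query_type, database_type, translate=False):
    """Pick the basic BLAST program for a query/database combination."""
    key = (query_type, database_type, translate)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no basic BLAST program for {key}")
    return BLAST_PROGRAMS[key]


if __name__ == "__main__":
    print(choose_blast_program("dna", "protein", translate=True))
    print(choose_blast_program("protein", "dna", translate=True))
```

The translated searches exist because protein similarity outlasts DNA similarity: a new cDNA with no good BLASTN hit may still find its homolog through BLASTX.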

46 DNA Databases nr (non-redundantish merge of Genbank, EMBL, etc…) –EXCLUDES HTGS0,1,2, EST, GSS, STS, PAT, WGS est (expressed sequence tags) htgs (high throughput genome seq.) gss (genome survey sequence) vector, yeast, ecoli, mito chromosome (complete genomes) And more http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#nucleotide_databases

47 Protein Databases nr (non-redundant Swiss-prot, PIR, PRF, PDB, Genbank CDS) swissprot ecoli, yeast, fly month And more

48 BLAST Input Program Database Options - see more Sequence –FASTA –gi or accession#

49 BLAST Options Algorithm and output options –# descriptions, # alignments returned –Probability cutoff –Strand Alignment parameters –Scoring Matrix PAM30, PAM70, BLOSUM45, BLOSUM62, BLOSUM80 –Filter (low complexity) PPPPP->XXXXX
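
The effect of the low-complexity filter (PPPPP->XXXXX) can be imitated with a very crude stand-in. This sketch masks only runs of a single repeated residue; the filters BLAST actually applies are more sophisticated and catch compositionally biased regions that are not simple runs. The function name `mask_runs` and the run-length threshold are assumptions for this example.

```python
import re


def mask_runs(seq, min_run=5, mask_char="X"):
    """Replace every run of >= min_run identical residues with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


if __name__ == "__main__":
    print(mask_runs("MKPPPPPLV"))  # MKXXXXXLV
    print(mask_runs("MKPPPPLV"))   # run of 4 is left alone
```

Masking keeps sequence length and coordinates unchanged, so hits elsewhere in the query still line up with the original sequence; the masked stretch simply cannot seed or inflate a match.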

50 Extended BLAST Family Gapped Blast (default) PSI-Blast (Position-specific iterated blast) –“self” generated scoring matrix PHI BLAST (motif plus BLAST) BLAST2 client (align two seqs) megablast (genomic sequence) rpsblast (search for domains)

