Introduction to Bioinformatics - Tutorial no. 2: Global Alignment, Local Alignment, FASTA, BLAST

1 Introduction to Bioinformatics - Tutorial no. 2: Global Alignment, Local Alignment, FASTA, BLAST

2 Sequence Alignment

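3 DP (dynamic programming): what does it mean? The principle reduces the number of paths that need to be examined: if the best path from X to Z passes through Y, it is made of the best path from X to Y followed by the best path from Y to Z, and the two parts can be found independently of each other.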

4 Global vs. Local alignment
Dotplot showing identities between the short name (DOROTHYHODGKIN) and the full name (DOROTHYCROWFOOTHODGKIN) of a famous protein crystallographer.
    S1 = DOROTHYHODGKIN
    S2 = DOROTHYCROWFOOTHODGKIN

5 Global vs. Local alignment
Dotplot showing identities between the short name (DOROTHYHODGKIN) and the full name (DOROTHYCROWFOOTHODGKIN) of a famous protein crystallographer.
Global alignment:
    DOROTHY--------HODGKIN
    DOROTHYCROWFOOTHODGKIN
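The global alignment on this slide can be reproduced with a minimal Python sketch of dynamic-programming global alignment (the Needleman-Wunsch recurrence). It assumes a simple scoring of match +1, mismatch -1, indel -2; the function name `needleman_wunsch` and its parameters are illustrative and do not come from the slides.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, indel=-2):
    """Return (score, aligned_s, aligned_t) for a global alignment of s and t."""
    n, m = len(s), len(t)
    # F[i][j] = best score for aligning the prefixes s[:i] and t[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * indel  # s[:i] against gaps only
    for j in range(1, m + 1):
        F[0][j] = j * indel  # t[:j] against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                          F[i - 1][j] + indel,    # s[i-1] against a gap
                          F[i][j - 1] + indel)    # t[j-1] against a gap

    # Trace back from the far corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + indel:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = needleman_wunsch("DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN")
print(score)   # -2: 14 matches and 8 indels
print(top)     # DOROTHY--------HODGKIN
print(bottom)  # DOROTHYCROWFOOTHODGKIN
```

Each cell is computed once from three neighbours, which is the path-reduction idea of dynamic programming: the best alignment of two prefixes is reused rather than recomputed.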

6 Local Alignment
The problem: we want to find the substrings of s and t with the highest similarity.
Scoring system, just as in global alignment:
  - Match: +1
  - Mismatch: -1
  - Indel: -2

7 Local Alignment (cont'd)
The differences:
1. We can start a new match instead of extending a previous alignment.
  - This means that at each cell we can start to calculate the score from 0 (even if this means ignoring the prefix).
  - We do this only if it is better than the alternatives (that is, only if all the alternatives are negative).
2. Instead of looking only at the far corner, we look anywhere in the table for the best score (even if this means ignoring the suffix).

8 Local alignment table for TACTAA (columns) and TAATA (rows); only the corner cell is set:
         0   T1  A2  C3  T4  A5  A6
    0    0
    T1
    A2
    A3
    T4
    A5

9 The next cell of row 0 is set to 0 (T aligned against a gap):
         0   T1  A2  C3  T4  A5  A6
    0    0   0

10 Row 0 is all zeros (TACTAA aligned against ------):
         0   T1  A2  C3  T4  A5  A6
    0    0   0   0   0   0   0   0

11 Column 0 is all zeros (----- aligned against TAATA):
         0   T1  A2  C3  T4  A5  A6
    0    0   0   0   0   0   0   0
    T1   0
    A2   0
    A3   0
    T4   0
    A5   0

12 Cell (T1, T1) = 1: T matches T.
         0   T1  A2  C3  T4  A5  A6
    0    0   0   0   0   0   0   0
    T1   0   1

13 Cell (T1, A2) = ? The three candidates:
    from the left:     TA over T-     1 - 2 = -1
    from above:        TA- over --T   0 - 2 = -2
    from the diagonal: TA over -T     0 - 1 = -1
All are negative, so the cell is set to 0.

14 Row T1 filled up to column T4; at (T1, T4) a new match starts (TACT against ---T):
         0   T1  A2  C3  T4  A5  A6
    0    0   0   0   0   0   0   0
    T1   0   1   0   0   1

15 Rows T1 and A2 filled:
         0   T1  A2  C3  T4  A5  A6
    0    0   0   0   0   0   0   0
    T1   0   1   0   0   1   0   0
    A2   0   0   2   0   0   2   1

16 Row A3 filled:
         0   T1  A2  C3  T4  A5  A6
    0    0   0   0   0   0   0   0
    T1   0   1   0   0   1   0   0
    A2   0   0   2   0   0   2   1
    A3   0   0   1   1   0   1   3

17 Row T4 filled:
         0   T1  A2  C3  T4  A5  A6
    0    0   0   0   0   0   0   0
    T1   0   1   0   0   1   0   0
    A2   0   0   2   0   0   2   1
    A3   0   0   1   1   0   1   3
    T4   0   1   0   0   2   0   1

18 Row A5 filled; the table is complete:
         0   T1  A2  C3  T4  A5  A6
    0    0   0   0   0   0   0   0
    T1   0   1   0   0   1   0   0
    A2   0   0   2   0   0   2   1
    A3   0   0   1   1   0   1   3
    T4   0   1   0   0   2   0   1
    A5   0   0   2   0   0   3   1

19 The completed table, shown with the two sequences TACTAA and TAATA:
         0   T1  A2  C3  T4  A5  A6
    0    0   0   0   0   0   0   0
    T1   0   1   0   0   1   0   0
    A2   0   0   2   0   0   2   1
    A3   0   0   1   1   0   1   3
    T4   0   1   0   0   2   0   1
    A5   0   0   2   0   0   3   1

20 The same completed table, again shown with the sequences TACTAA and TAATA.
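The table-filling procedure walked through in slides 8 to 20 can be sketched in Python as follows. This is a minimal sketch of local alignment (Smith-Waterman) under the tutorial's scoring (match +1, mismatch -1, indel -2); the function name `smith_waterman` and the choice to report only the first best-scoring cell are assumptions made for illustration.

```python
def smith_waterman(s, t, match=1, mismatch=-1, indel=-2):
    """Return (best_score, aligned_s, aligned_t, table) for a local alignment.

    s runs along the columns and t along the rows, as in the slides.
    """
    rows, cols = len(t) + 1, len(s) + 1
    # Row 0 and column 0 stay 0: an alignment may start anywhere.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if t[i - 1] == s[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new alignment here
                          H[i - 1][j - 1] + sub,  # extend along the diagonal
                          H[i - 1][j] + indel,    # gap in s
                          H[i][j - 1] + indel)    # gap in t
            if H[i][j] > best:                    # look anywhere for the best score
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a 0 is reached.
    a_s, a_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if t[i - 1] == s[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a_s.append(s[j - 1]); a_t.append(t[i - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + indel:
            a_s.append("-"); a_t.append(t[i - 1]); i -= 1
        else:
            a_s.append(s[j - 1]); a_t.append("-"); j -= 1
    return best, "".join(reversed(a_s)), "".join(reversed(a_t)), H


score, top, bottom, table = smith_waterman("TACTAA", "TAATA")
print(score, top, bottom)  # 3 TAA TAA
for row in table:
    print(row)
```

For these two strings the score 3 is reached in two cells; the sketch reports the first one it meets (TAA against TAA), while the other corresponds to TACTA against TAATA.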

21 How do you prefer it: right or fast?
Exact methods: the result is guaranteed to be (mathematically) optimal
  - Needleman-Wunsch (global)
  - Smith-Waterman (local)
Heuristic methods: make some assumptions that hold most, but not all, of the time
  - FASTA
  - BLAST
Still, a typical run takes minutes to complete.

22 FASTA

23 http://www.ebi.ac.uk/fasta33/
Performs a local alignment of the input sequence against a complete database. Finds the n subsequences with the best alignments. Speed-up: it does not really look at all the sequences, only at those that 'look similar' (details in the course Algorithms in Computational Biology). Still, a typical run takes minutes to complete.

24 FASTA Variations (programs)
  - fasta3: DNA sequence against DNA database, or protein sequence against protein database
  - fastx/y3: DNA sequence against protein database; the DNA is translated in forward and reverse frames
  - tfastx/y3: protein sequence against translated DNA database
  - ... and more

25 Databases
Depend on the type chosen (nucleic acid / protein).
EMBL: all the nucleotide databases of the European Molecular Biology Laboratory
Some are organism-type specific:
  - FUNGI
  - INVERTEBRATES
  - PLANTS
Some are content-specific:
  - ESTs
  - STSs
  - MAMMALS
  - MOUSE
  - HUMAN

26 More FASTA options
  - Gap penalties: different for opening gaps and for continuing them (residue = indel)
  - Scores and Alignments: how many (max) to retrieve?
  - KTUP: see the algorithm description in the lecture
  - DNA Strand
  - Matrix: for searches that involve proteins (next week)
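To illustrate why separate penalties for opening and continuing a gap matter, here is a small Python sketch of an affine gap cost. The convention used (every gap pays the opening cost once, plus the extension cost for each residue in it) and the default values 10 and 1 are assumptions for illustration; individual programs define and default these penalties differently.

```python
def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Cost of a single gap of `length` residues under an affine model.

    Assumed convention: pay gap_open once, then gap_extend per gapped residue.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


# One gap of 4 residues versus four separate gaps of 1 residue each.
one_long_gap = affine_gap_cost(4)
four_short_gaps = 4 * affine_gap_cost(1)
print(one_long_gap, four_short_gaps)  # 14 44
```

Under this model a single long indel is much cheaper than the same number of gapped residues scattered along the alignment, which a flat per-residue indel penalty cannot express.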

27 E-values
The number of hits (with the same similarity score) one can "expect" to see just by chance when searching the given string in a database of a particular size.
A higher E-value means lower similarity.
From the FASTA documentation:
  - "sequences with E-value of less than 0.01 are almost always found to be homologous"
  - "sequences with E-value between 1 and 10 frequently turn out to be related as well"
FASTA defaults for the upper limit:
  - 10 for FASTA with protein searches
  - 5 for translated DNA/protein comparisons
  - 2 for DNA/DNA searches
The lower bound is normally 0 (we want to find the best).

28 BLAST

29 BLAST: Outline
  - Sequence Alignment
  - Complexity and indexing
  - BLASTN and BLASTP
      - Basic parameters
  - PAM and BLOSUM matrices
  - Affine gap model
  - E Values (once again)

30 Advanced BLAST
  - Databases
  - BLAST options
  - BLAST output
  - Taxonomic BLAST
  - Pairwise BLAST

31 BLAST Variations
    Name      Query type           Database
    blastn    Genomic              Genomic
    blastp    Protein              Protein
    blastx    Translated genomic   Protein
    tblastn   Protein              Translated genomic
    tblastx   Translated genomic   Translated genomic
Genomic translations test all 6 possibilities: 3x for codon frames, 2x for reverse complement.
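The "6 possibilities" (three codon frames on each of the two strands) can be sketched in Python as below. This is an illustrative sketch only: the helper names `reverse_complement`, `translate` and `six_frames` are made up for this example, the standard genetic code is assumed, and any codon containing a letter other than A, C, G or T is rendered as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the six translations keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCC").items():
    print(name, protein)
```

For the toy input ATGGCC, frame +1 reads ATG GCC (MA) and frame -1 reads the reverse complement GGC CAT (GH).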

32 BLASTN Databases
    nr       GenBank, EMBL, DDBJ, PDB and NCBI reference sequences (RefSeq)
    htgs     High-throughput genomic sequences (draft)
    pat      Patented nucleotide sequences
    mito     Mitochondrial sequences
    vector   Vector subset of GenBank
    month    GenBank, EMBL, DDBJ, PDB from the last 30 days
    chrom    Contigs and chromosomes from RefSeq

33 BLASTP Databases
    nr          GenBank CDS translations, RefSeq, PDB, SWISS-PROT, PIR, PRF
    swissprot   SWISS-PROT
    pat         Patented protein sequences
    pdb         Protein Data Bank
    month       GenBank CDS translations, PDB, SWISS-PROT, PIR, PRF from the last 30 days

34 BLASTN/P Options (1)
  - Only search part of the database, using NCBI Entrez query format
  - Search a specific organism
  - Remove low information content, e.g. short repeats or regions rich in only 2 nucleotides
  - Remove known human repeats (LINEs, SINEs)

35 BLASTN/P Options (2)
  - Threshold for results significance
  - Use index based on words of 7, 11 or 15 nucleotides
  - Costs to open and extend a gap, score for nucleotide match or mismatch. Allowed gap scores: 10/1, 10/2, 11/1, 8/2, 9/2
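The idea of an index based on fixed-length words can be sketched in Python as follows. This is a simplified illustration of word indexing and exact seed lookup, not BLAST's actual implementation; the function names `build_word_index` and `seed_hits` are hypothetical, and the default word size 11 is simply one of the sizes listed on the slide.

```python
from collections import defaultdict


def build_word_index(db_seq, w=11):
    """Map every word of length w in db_seq to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(db_seq) - w + 1):
        index[db_seq[pos:pos + w]].append(pos)
    return index


def seed_hits(query, index, w=11):
    """Return (query_pos, db_pos) pairs where a query word occurs exactly in the database."""
    hits = []
    for qpos in range(len(query) - w + 1):
        for dpos in index.get(query[qpos:qpos + w], []):
            hits.append((qpos, dpos))
    return hits


# Tiny example with a short word size so the result is easy to follow by hand.
index = build_word_index("ACGTACGT", w=3)
print(seed_hits("GTAC", index, w=3))  # [(0, 2), (1, 3)]
```

Both hits lie on the same diagonal (database position minus query position equals 2), which is the kind of signal a heuristic search can extend into a full alignment instead of filling the whole dynamic-programming table.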

36 BLASTP Options
  - Scoring matrix: PAM, etc.
  - Search for a motif (PSI-BLAST)
  - Costs to open and extend a gap

37 BLASTN/P Formatting (1)
  - Show colored bar chart
  - Number of sequences listed
  - Number of alignments shown
  - Other (less important) options on what to show

38 BLASTN/P Formatting (2)
  - How to display alignments
  - Only show results which match an Entrez search or are from a specific organism
  - Only show results with E values in this range

39 BLASTN Results
  - Query sequence representation
  - Matched areas of database sequences
  - Multiple matches on a sequence

40 BLAST Output
  - Header
  - Request ID for later retrieval
  - Query sequence details
  - Database details
  - Tax BLAST

41 BLAST Alignments (1)
  - Sequence identifier
  - Sequence description
  - Score and E value

42 BLAST Alignments (2)
  - Several alignments possible for one sequence match
  - Normalized score of alignment
  - Expected number of such hits (2e-11 = 2 × 10^-11)
  - Number of exact matches
  - Number of matches with positive score
  - Number of insertions / deletions

43 BLAST Alignments (3)
  - Query sequence
  - Exact match
  - Insertion / deletion
  - Matched sequence
  - Mismatch with positive score
  - Position within sequence
  - Masked low complexity region

44 Expectation Values
  - Increases linearly with the length of the query sequence
  - Increases linearly with the length of the database
  - Decreases exponentially with the score of the alignment
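These three dependencies match the usual form of the expectation value, E = K · m · n · e^(-λS), where m is the query length, n the database length and S the alignment score. The Python sketch below only illustrates that shape: the function name `expect_value` is made up, and the default K and λ are placeholder values chosen for illustration; in practice they depend on the scoring matrix and gap penalties.

```python
import math


def expect_value(score, query_len, db_len, K=0.13, lam=0.32):
    """Expected number of chance hits with at least this score.

    K and lam are illustrative placeholders, not values from the slides.
    """
    return K * query_len * db_len * math.exp(-lam * score)


base = expect_value(score=50, query_len=200, db_len=1_000_000)
print(expect_value(50, 400, 1_000_000) / base)  # query twice as long: E doubles
print(expect_value(50, 200, 2_000_000) / base)  # database twice as long: E doubles
print(expect_value(60, 200, 1_000_000) / base)  # score 10 higher: E shrinks by exp(-3.2)
```

The same raw score is therefore less significant in a larger database, which is why E-values, not scores, are compared against a threshold.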

45 Tax BLAST
  - Lineage of the organism with the strongest hit
  - Score of the organism's strongest hit
  - Number of organism hits
  - Shared ancestry in the taxonomic tree

46 BLAST2SEQ
  - Scoring scheme
  - Type of program
  - Gap model, Expect Value, Advanced options
  - Sequences
  - Scoring matrix
  - Sequences
  - GO!

