Sequence Alignment Dilvan Moreira (based on Prof. André Carvalho presentation)

1 Sequence Alignment Dilvan Moreira (based on Prof. André Carvalho presentation)

2 Reading: Introduction to Computational Genomics: A Case Studies Approach, Chapter 3

3 Topics (7/7/2016, André de Carvalho - ICMC/USP): Introduction; Homology; Global Alignment; Algorithm; Local Alignment; Multiple Alignment

4 Introduction: Sequence Alignment is one of the most basic operations in Bioinformatics. It serves as a base for several more complex operations. It consists of finding which parts of two sequences are similar and which parts are different. It may sound simple, but different applications and formalizations can lead to complex solutions.

5 Alignment Uses: Prediction of protein function (from similar proteins); Database search (look for similar sequences); Gene recognition (discover gene locations in sequences); Sequence divergence (compare sequences from the same species or from different species); Fragment assembly (allows genomes to be built).

6 Eye of Tiger: In 1994, Walter Gehring and colleagues (U. Basel) conducted a "Frankenstein" experiment. They switched on the "eyeless" gene in various parts of Drosophila melanogaster (fruit fly). Switching on "eyeless" induces the formation of an eye (not functional). Genes are usually named after the problems they cause when they suffer a mutation. Result: many "eyes" were formed. Eyeless is a master gene: it produces proteins that control several other genes. The eyeless gene controls approximately 2,000 other genes.

7 Eye of Tiger

8 Eye of Tiger: All multicellular organisms use master genes, with the same purpose in different species. Slightly different versions of the eyeless gene exist in humans, mice and... tigers. These different versions of the same gene are homologous genes: they have a common ancestor.

9 Sequence Similarities: Sequences of homologous genes originate from the same ancestor. High similarity is good evidence of homology, but the alignment measures sequence similarity, not homology. There are different types of homology; the most common are orthologous genes and paralogous genes.

10 Sequence Similarities: Orthologous genes are homologous genes in different species, that is, genes found in different species that share an ancestor. They reflect the history of species divergence: these sequences result from the evolution of the species. Differentiation occurs by insertion, deletion or substitution of nucleotides (indels: insertions or deletions of bases). Identical functions.

11 Sequence Similarities: Paralogous genes are genes duplicated in multiple copies, which can evolve and assume similar functions, forming a family of specialized genes. Paralogy: the relationship between members of a family of genes within a genome. Similar functions.

12 Homology: Different organisms with similar structures and the same embryological origin. Same ancestor. Function may or may not be the same. Examples: man's arm; bat's wing and bird's wing (same function).

13 Homology: Homologous structures in more detail. Source: http://www.logic.com.br/prof.cynara/ecologia.htm

14 Homology: Structural homology. Figure labels: seal, man, horse, bird, bat, turtle; humerus, radius and ulna, carpus, metacarpus, phalanges.

15 Homology: Wing versus forearm. General morphology: the relationship between bones and muscles is virtually identical; common ancestors shared these elements. Specific morphology: with feathers versus featherless; fingerless versus with fingers. Function: flying versus object manipulation.

16 Sequence Alignment: Advances have occurred in leaps, often separated by decades. Milestones: 1970: S. Needleman and C. Wunsch introduced the concept of Global Alignment; 1981: T. Smith and M. Waterman advanced to Local Alignment; 1990: S. Altschul, W. Gish and others published the fast heuristic method BLAST.

17 Sequence Alignment: Algorithmic and statistical questions to be answered: Given two sequences, what is the best way to align them? How can the quality of an alignment be evaluated? Can an alignment be explained by chance, or should it be concluded that it is due to a common ancestor?

18 Sequence Alignment: Situations where alignment is needed: When the same gene is sequenced by two different laboratories, it is necessary to compare the results. When the same long string is entered twice, it is necessary to look for typos.

19 Sequence Alignment: Situations where alignment may be needed: Comparing the complete genomes of two organisms, to identify similarities and differences. Comparing a DNA sequence of a hundred bases to a genome of thousands of bases, to examine whether the sequence appears in the genome.

20 Sequence Alignment: There are different types of alignment. Regarding the type of sequence: between nucleotide sequences (4 symbols); between amino acid sequences (20 symbols). Regarding the number of sequences: pairwise alignment (aligning one sequence with another); multiple alignment (aligning several sequences with each other).

21 Sequence Alignment: Alignments can be of different types regarding the aligned region. Global: uses the complete sequences. Local: searches for the best match between inner regions of the two sequences. [Figure: example alignments with gaps]

22 Example: Align the sequences GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three possibilities for each position: perfect match; divergence (mismatch); insertion or deletion (indel).

23 Global Alignment: The aligned sequences must have the same size. If the original sequences have different sizes, spaces must be included to match the sizes. The order of the nucleotides (or amino acids) of the sequences must be preserved. The alignment size is greater than or equal to the size of the largest sequence.
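The constraints on a global alignment can be checked mechanically. The sketch below is only an illustration: the function name `is_valid_global_alignment` and the use of "-" as the space symbol are assumptions, and rejecting a column with two spaces is an extra convention not stated on the slide.

```python
def is_valid_global_alignment(row_x, row_y, x, y, gap="-"):
    """Check the global-alignment constraints for two gapped rows."""
    # Both aligned rows must have the same size.
    if len(row_x) != len(row_y):
        return False
    # Removing the spaces must give back the original sequences, in order.
    if row_x.replace(gap, "") != x or row_y.replace(gap, "") != y:
        return False
    # Assumed convention: a column with two spaces is not allowed.
    if any(a == gap and b == gap for a, b in zip(row_x, row_y)):
        return False
    # The alignment is at least as long as the longer sequence.
    return len(row_x) >= max(len(x), len(y))


ok = is_valid_global_alignment(
    "-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A",
    "GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA",
)
print(ok)
```

The call uses the alignment from the earlier example, which satisfies all four checks.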

24 Example: Given the sequences below, define a reasonable alignment:
a c t g a
a t c t g

25 Example: Given the sequences below:
a c t g a
a t c t g
An acceptable alignment would be:
a - c t g a
a t c t g -

26 Global Alignment: Alignment of two sequences. Search for the best possible alignment, the one in which the sequences are most similar: the Optimal Alignment. It is necessary to evaluate the possible alignments systematically, so alignments must be scored automatically.

27 Sequence Alignment: It is necessary to define criteria for assessing the possible alignments. A grade, or Score, is assigned to every alignment. It is a similarity measure that compares the nucleotides (or amino acids) that appear in corresponding positions. An Optimal Alignment may not be unique: several alignments can have the same maximum score, and small variations in the Score function can change the ranking of the best alignments.

28 Score: The score of each position is calculated independently. The alignment cost of two symbols x_i and y_i is defined by a score function σ(x_i, y_i). The score of the alignment is the sum of the scores of its positions.

29 Score Functions: There are several alternatives. One of the most used, given two sequences x and y:

30 Example: Find the alignment between the sequences x = atctat and y = ctcat.
a t c t a t
- c t c a t
-2 -1 -1 -1 +1 +1   Score(x, y) = -3
a t c t a t
c t c a t -
-1 +1 +1 -1 -1 -2   Score(x, y) = -3

31 Example: Find the alignment between the sequences x = atctat and y = ctcat.
a t c t a t
c t c - a t
-1 +1 +1 -2 +1 +1   Score(x, y) = +1
Best solution found.

32 Score Functions: Score function used in the book's examples, given 2 sequences x and y

33 Substitution Matrix: It provides a more realistic Score function. It allows a different cost to be associated with each possible replacement (ND: not defined).
     A   G   C   T   -
A   +1  -1  -1  -1  -1
G   -1  +1  -1  -1  -1
C   -1  -1  +1  -1  -1
T   -1  -1  -1  +1  -1
-   -1  -1  -1  -1  ND
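A substitution matrix like the one above can be stored as a lookup table keyed by pairs of symbols. This is a minimal sketch; the names `SIGMA` and `matrix_score` are hypothetical, and the undefined space-against-space entry (ND) is represented by `None`.

```python
SYMBOLS = "AGCT-"
ROWS = [
    [+1, -1, -1, -1, -1],    # A
    [-1, +1, -1, -1, -1],    # G
    [-1, -1, +1, -1, -1],    # C
    [-1, -1, -1, +1, -1],    # T
    [-1, -1, -1, -1, None],  # space; space against space is not defined (ND)
]
# SIGMA[(a, b)] is the score of aligning symbol a with symbol b.
SIGMA = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(SYMBOLS)
    for j, b in enumerate(SYMBOLS)
}


def matrix_score(row_x, row_y, sigma=SIGMA):
    """Score a gapped alignment by summing substitution-matrix entries."""
    total = 0
    for a, b in zip(row_x, row_y):
        s = sigma[(a, b)]
        if s is None:
            raise ValueError("space aligned with space is not defined")
        total += s
    return total


print(matrix_score("ATCTAT", "CTC-AT"))  # 2
```

Changing individual entries of `ROWS` is enough to give each replacement its own cost.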

34 Score Functions: The dissimilarity between different pairs of nucleotides is generally treated as similar, although the weights associated with substitutions may differ. For proteins, different amino acid substitutions have different effects, depending on the physicochemical properties of the amino acids.

35 Optimal Alignment: The globally optimal alignment A*(s, t) of two strings s and t maximizes the Global Alignment Score over all possible alignments.

36 Optimal Alignment: Finding the Optimal Alignment A* can be posed as a combinatorial optimization problem: 1. Generate all possible alignments. 2. Calculate the Score M(A) for each alignment A. 3. Select the alignment with the maximum Score.
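The three steps can be written down directly for very short sequences. The sketch below assumes the +1 / -1 / -2 score values inferred from the earlier example; `all_alignments`, `score` and `best_by_enumeration` are hypothetical names. The number of alignments grows exponentially with sequence length, so this is only practical for a handful of symbols.

```python
def all_alignments(x, y):
    """Step 1: yield every global alignment of x and y as two gapped rows."""
    if not x and not y:
        yield "", ""
        return
    if x and y:  # align the first symbols with each other
        for a, b in all_alignments(x[1:], y[1:]):
            yield x[0] + a, y[0] + b
    if x:        # align the first symbol of x with a space
        for a, b in all_alignments(x[1:], y):
            yield x[0] + a, "-" + b
    if y:        # align the first symbol of y with a space
        for a, b in all_alignments(x, y[1:]):
            yield "-" + a, y[0] + b


def score(row_x, row_y, match=1, mismatch=-1, gap=-2):
    """Step 2: sum of the independent column scores."""
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total


def best_by_enumeration(x, y):
    """Step 3: pick an alignment with the maximum score."""
    return max(all_alignments(x, y), key=lambda rows: score(*rows))


print(len(list(all_alignments("ac", "gt"))))          # 13
print(score(*best_by_enumeration("atctat", "ctcat")))  # 1
```

Even two sequences of length 2 already have 13 alignments, which is why an exhaustive search is replaced by dynamic programming in practice.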

37 Statistical Analysis of Alignments: Is a good alignment due to chance or to biology? We perform a hypothesis test using random sequences (Chapter 2). We set a significance level α, for example 0.05 (5%). We compare the score achieved with the scores of alignments with randomized sequences. We calculate the p-value as the fraction of randomized alignments with score >= the original score. If the p-value < α, the alignment is considered significant.

38 Statistical Analysis of Alignments: Comparison with randomized sequences. One of the sequences is used to produce random sequences, either by shuffling its bases or by shuffling blocks of bases (which retains local relations). The other sequence is aligned with each of the random sequences. We rank the scores achieved together with the original score. We accept the original score as significant if it is among the top 5%.

39 Statistical Analysis of Alignments: Example: the alignment below has a score of 2. 1000 permutations of the second sequence are produced. The best Global Alignment between the first sequence and each random sequence is found.
A*(s, t) =
V I V A L A S V E G A S
V I V A D A - V - - I S

40 Statistical Analysis of Alignments: Only 2 of the 1000 permutations had a score >= 2. p-value = 0.002. As α = 0.05, the alignment is significant.
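The permutation test can be sketched as follows. The score values (+1 match, -1 mismatch, -1 space) are an assumption chosen so that the example alignment scores 2; `global_score` and `permutation_p_value` are hypothetical names, and the p-value obtained depends on the random seed, so it will not necessarily reproduce the 0.002 quoted above.

```python
import random


def global_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best global alignment score, keeping only two table rows in memory."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def permutation_p_value(s, t, n_perm=1000, seed=0):
    """Fraction of shuffled versions of t that score at least as well as t."""
    rng = random.Random(seed)
    observed = global_score(s, t)
    letters = list(t)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(letters)  # exchange the symbols of the second sequence
        if global_score(s, "".join(letters)) >= observed:
            hits += 1
    return observed, hits / n_perm


observed, p = permutation_p_value("VIVALASVEGAS", "VIVADAVIS")
print(observed, p, "significant" if p < 0.05 else "not significant")
```

Shuffling whole blocks instead of single symbols, as mentioned above, would only change the line that permutes `letters`.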

41 BLAST: Basic Local Alignment Search Tool. Other algorithms are slow for large-scale alignments, for example against public databases such as GenBank. It is the most used tool for alignment. Fast approximate local alignment.

42 BLAST - How it Works: The "query" sequence is split into "words" (pieces of the query). The words are compared to the sequence database (target sequences) and exact matches are identified. For each match, the alignment is extended until a score higher than a threshold is reached (MSPs). (Schneider and La Rota 2000)

43 BLAST Algorithm: The "query" sequence is divided into short regions of length W ("words"). The words that align with a sequence in the database ("target sequence") with a score >= T are kept (scores can use substitution matrices). For each pair of sequences (query and target) that has one or more words in common, the alignment is extended, increasing its score, up to a certain limit S. These alignments are high scoring pairs (HSPs); the HSPs with the highest scores are called MSPs. Target sequences with many, larger HSPs are the best.
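The seed-and-extend idea can be illustrated with a heavily simplified sketch. It is not the real BLAST implementation: words are matched exactly instead of being scored against a threshold T, the extension is ungapped, and it stops when the running score drops more than `x_drop` below the best score seen. All function and parameter names are hypothetical.

```python
def seed_hits(query, target, w=3):
    """Return (query_pos, target_pos) pairs where a word of length w matches."""
    index = {}
    for j in range(len(target) - w + 1):
        index.setdefault(target[j:j + w], []).append(j)
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits


def extend_hit(query, target, i, j, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed without spaces, first to the right and then to the left."""
    best = run = w * match
    best_right = k = 0
    while i + w + k < len(query) and j + w + k < len(target):
        run += match if query[i + w + k] == target[j + w + k] else mismatch
        k += 1
        if run > best:
            best, best_right = run, k
        elif best - run > x_drop:
            break
    run = best
    best_left = k = 0
    while i - 1 - k >= 0 and j - 1 - k >= 0:
        run += match if query[i - 1 - k] == target[j - 1 - k] else mismatch
        k += 1
        if run > best:
            best, best_left = run, k
        elif best - run > x_drop:
            break
    # Score plus half-open (start, end) ranges in the query and in the target.
    return (best,
            (i - best_left, i + w + best_right),
            (j - best_left, j + w + best_right))


query, target = "VIVADAVIS", "QUEVIVALASVEGAS"
for qi, tj in seed_hits(query, target):
    print((qi, tj), extend_hit(query, target, qi, tj))
```

Looking words up in an index of the target, rather than filling a full table, is what makes this kind of search fast on large databases.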

44 BLAST: The basic concept is: the greater the number of similar segments between two sequences, and the longer these similar segments are, the less divergent the sequences are and the more likely they are to be genetically related (homologous).

45 Multiple Alignment: CLUSTAL W (1.82) multiple sequence alignment

GLB1_GLYDI  ---------GLSAAQRQVIAATWKDIAGADNGAGVGKDCLIKFLSAHPQMAAVFGFSGAS 51
HBB_HUMAN   --------VHLTPEEKSAVTALWGKV----NVDEVGGEALGRLLVVYPWTQRFFESFGDL 48
HBA_HUMAN   ---------VLSPADKTNVKAAWGKVG--AHAGEYGAEALERMFLSFPTTKTYFPHF-DL 48
MYG_PHYCA   ---------VLSEGEWQLVLHVWAKVE--ADVAGHGQDILIRLFKSHPETLEKFDRFKHL 49
GLB5_PETMA  PIVDTGSVAPLSAAEKTKIRSAWAPVY--STYETSGVDILVKFFTSTPAAQEFFPKFKGL 58
GLB3_CHITP  ----------LSADQISTVQASFDKVK------GDPVGILYAVFKADPSIMAKFTQFAGK 44
LGB2_LUPLU  --------GALTESQAALVKSSWEEFN--ANIPKHTHRFFILVLEIAPAAKDLFSFLKGT 50

GLB1_GLYDI  DP--------GVAALGAKVLAQIGVAVSHLGDE--GKMVAQMKAVGVRHKGYGNKHIKAQ 101
HBB_HUMAN   STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHC--DKLHVDPE 101
HBA_HUMAN   SH-----GSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHA--HKLRVDPV 96
MYG_PHYCA   KTEAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHA--TKHKIPIK 102
GLB5_PETMA  TTADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHA--KSFQVDPQ 114
GLB3_CHITP  DLES-IKGTAPFETHANRIVGFFSKIIGELPN-----IEADVNTFVASHK---PRGVTHD 95
LGB2_LUPLU  SEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHV---SKGVADA 105

GLB1_GLYDI  YFEPLGASLLSAMEHRIGGKMNAAAKDAWAAAYADISGALISGLQS----- 147
HBB_HUMAN   NFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------ 146
HBA_HUMAN   NFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------ 141
MYG_PHYCA   YLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG 153
GLB5_PETMA  YFKVLAAVIADTVAAG---------DAGFEKLMSMICILLRSAY------- 149
GLB3_CHITP  QLNNFRAGFVSYMKAHTD---FAGAEAAWGATLDTFFGMIFSKM------- 136
LGB2_LUPLU  HFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA--- 153

46 Sequence Alignment: There are several techniques: Dot plot; FASTA; BLAST (several versions); Smith-Waterman algorithm; Needleman-Wunsch algorithm; Genetic algorithms.

47 Needleman-Wunsch Algorithm: 1. Create an (m+1) x (n+1) table for sequences s and t of sizes m and n. 2. Fill the first row and the first column of the table with the initial values. 3. From the top left, calculate the value of each entry using the recursion. 4. Perform the trace-back procedure from the lower right corner.
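One common way to write steps 2 and 3 is given below. It assumes a linear penalty d > 0 for each space and the score function σ from the Score slide; the table name F and the symbol d are notation chosen here, not taken from the slides.

```latex
\begin{aligned}
F(i,0) &= -i\,d, \qquad F(0,j) = -j\,d, \\
F(i,j) &= \max
\begin{cases}
F(i-1,j-1) + \sigma(s_i, t_j) & \text{(match or divergence)} \\
F(i-1,j) - d & \text{(space in } t\text{)} \\
F(i,j-1) - d & \text{(space in } s\text{)}
\end{cases}
\qquad 1 \le i \le m,\; 1 \le j \le n .
\end{aligned}
```

The score of the optimal global alignment is then F(m, n), the entry in the lower right corner.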

48 Tracing Back: It allows the alignment with the best Score to be recovered. Each cell has a pointer to the cell used to calculate its value. The pointers form a path. Every movement has a meaning: Diagonal: there is a match or a divergence; Vertical: a space is inserted in the top sequence; Horizontal: a space is inserted in the side sequence.
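A compact sketch of the table fill and the trace-back follows. The default score values (+1 match, -1 mismatch, -2 space) are assumptions, and `needleman_wunsch` is a hypothetical name. Instead of storing pointers, the trace-back recomputes which neighbouring cell produced each value, which recovers one optimal alignment when several exist.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment with x along the top of the table and y down the side."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        F[0][j] = j * gap  # first row
    for i in range(1, m + 1):
        F[i][0] = i * gap  # first column
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if y[i - 1] == x[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # diagonal
                          F[i - 1][j] + gap,    # vertical
                          F[i][j - 1] + gap)    # horizontal
    # Trace back from the lower right corner to the top left corner.
    row_x, row_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and y[i - 1] == x[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            row_x.append(x[j - 1]); row_y.append(y[i - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            row_x.append("-"); row_y.append(y[i - 1])  # space in the top sequence
            i -= 1
        else:
            row_x.append(x[j - 1]); row_y.append("-")  # space in the side sequence
            j -= 1
    return F[m][n], "".join(reversed(row_x)), "".join(reversed(row_y))


print(needleman_wunsch("AGC", "AAAC"))  # (-1, '-AGC', 'AAAC')
```

Checking the diagonal move first is an arbitrary tie-breaking choice; a different order would return another alignment with the same score.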

49 Example: Find the best alignment of sequences x and y: x = AGC, y = AAAC. Use as Score function:

50 Example: Table with x = AGC across the top and y = AAAC down the side; first row 0 -2 -4 -6, first column 0 -2 -4 -6 -8.

51 Example: Same table, with the first computed entry: 1.

52 Example: Same table, with the first computed entry: 1.

53 Example: Same table, with computed entries 1, 0, -3, -2, -5, -4.

54 Example: Same table; the trace-back begins with C.

55 Example: Same table; partial alignments -C/AC, GC/AC, GC/AC.

56 Example: Same table; partial alignments G-C/AAC, AGC/AAC, -GC/AAC.

57 Example: Same table; final alignments AG-C/AAAC, -AGC/AAAC, A-GC/AAAC.

58 NW: Time and Space Complexity. Time: filling the matrix O(n·m); backtracing O(n+m); overall O(n·m). Space: holding the matrix O(n·m).

59 Example: Find the best alignment of sequences x and y: x = VIVADAVIS, y = VIVALASVEGAS. Use as Score function:

60 Example:
       -   V   I   V   A   D   A   V   I   S
  -    0  -1  -2  -3  -4  -5  -6  -7  -8  -9
  V   -1   1   0  -1  -2  -3  -4  -5  -6  -7
  I   -2   0   2   1   0  -1  -2  -3  -4  -5
  V   -3  -1   1   3   2   1   0  -1  -2  -3
  A   -4  -2   0   2   4   3   2   1   0  -1
  L   -5  -3  -1   1   3   3   2   1   0  -1
  A   -6  -4  -2   0   2   2   4   3   2   1
  S   -7  -5  -3  -1   1   1   3   3   2   3
  V   -8  -6  -4  -2   0   0   2   4   3   2
  E   -9  -7  -5  -3  -1  -1   1   3   3   2
  G  -10  -8  -6  -4  -2  -2   0   2   2   2
  A  -11  -9  -7  -5  -3  -3  -1   1   1   1
  S  -12 -10  -8  -6  -4  -4  -2   0   0   2

61 Example

62 Example

63 Tracing Back: Starting from the bottom right corner:
A*(s, t) =
V I V A L A S V E G A S
V I V A D A - V - - I S

64 Smith-Waterman Algorithm: 1. Create an (m+1) x (n+1) table for sequences s and t of sizes m and n. 2. Fill the first row and the first column of the table with zeros. 3. From the top left, calculate the value of each entry using the recursion. 4. Perform the trace-back procedure from the largest table element to the first element with value zero.
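The four steps can be sketched as below. The score values (+1 match, -1 mismatch, -1 space) are assumptions, `smith_waterman` is a hypothetical name, and when several cells share the largest value the sketch starts the trace-back from the first one found.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-1):
    """Local alignment: returns (score, aligned part of s, aligned part of t)."""
    m, n = len(s), len(t)
    H = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay zero
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment
                          H[i - 1][j - 1] + sub,  # diagonal
                          H[i - 1][j] + gap,      # vertical
                          H[i][j - 1] + gap)      # horizontal
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the largest element until a zero is reached.
    row_s, row_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-")
            i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


print(smith_waterman("QUEVIVALASVEGAS", "VIVADAVIS"))  # (4, 'VIVA', 'VIVA')
```

The only differences from the global algorithm are the zero floor in the recursion, the zero-filled borders, and the start and stop points of the trace-back.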

65 Example: Find the best Local Alignment of sequences x and y: x = QUEVIVALASVEGAS, y = VIVADAVIS. Use as Score function:

66 Smith-Waterman Algorithm

67 Smith-Waterman Algorithm: Trace-back from the largest table element to the first element with zero value:

68 Questions?

