Published by Erika Lyons. Modified over 8 years ago.
1
Sequence Alignment
Dilvan Moreira (based on a presentation by Prof. André Carvalho)
2
Reading
Introduction to Computational Genomics: A Case Studies Approach, Chapter 3
3
7/7/2016 André de Carvalho - ICMC/USP
Topics
Introduction Homology Global Alignment Algorithm Local Alignment Multiple Alignment
4
Introduction
Sequence alignment is one of the most basic operations in bioinformatics and serves as the basis for several more complex operations. It finds which parts of two sequences are similar and which parts are different. It may sound simple, but different applications and formalizations can lead to complex solutions.
5
Uses of Alignment
Prediction of protein function: inferred from similar proteins.
Database search: look for similar sequences.
Gene recognition: discover the location of genes in sequences.
Sequence divergence: compare sequences from the same species or from different species.
Fragment assembly: allows genomes to be built.
6
Eye of Tiger
In 1994, Walter Gehring and colleagues (U. Basel) conducted a "Frankenstein" experiment: they switched on the "eyeless" gene in various parts of Drosophila melanogaster (fruit fly). Switching on eyeless induces the formation of an eye (not functional), and many "eyes" were formed as a result. Genes are usually named after the problems they cause when mutated. Eyeless is a master gene: it produces proteins that control several other genes, about 2,000 in total.
7
Eye of Tiger
8
Eye of Tiger
All multicellular organisms use master genes, with the same purpose in different species. Slightly different versions of the eyeless gene exist in humans, mice and... tigers. These different versions of the same gene are homologous genes: they have a common ancestor.
9
Sequence Similarities
Homologous genes originate from the same ancestor. High sequence similarity is good evidence of homology, but an alignment measures sequence similarity, not homology. There are different types of homology; the most common are orthologous genes and paralogous genes.
10
Sequence Similarities
Orthologous genes are homologous genes in different species: genes found in different species that share an ancestor. They reflect the splitting of species during evolutionary history, and their sequences result from the evolution of the species. Differentiation occurs by insertion, deletion or substitution of nucleotides (indels: insertions or deletions of bases). Functions: identical.
11
Sequence Similarities
Paralogous genes are genes duplicated into multiple copies within a genome. The copies can evolve and assume similar functions, forming a family of specialized genes. Paralogy is the relationship between members of a gene family within a genome. Functions: similar.
12
Homology
Different organisms have similar structures with the same embryological origin, inherited from the same ancestor. The function may or may not be the same. Examples: a man's arm, a bat's wing and a bird's wing (the last two have the same function).
13
Homology
Homologous structures in more detail. Source: http://www.logic.com.br/prof.cynara/ecologia.htm
14
Homology
Structural homology (figure): forelimbs of seal, man, horse, bird, bat and turtle, with the humerus, radius and ulna, carpus, metacarpus and phalanges labeled.
15
Homology
Wing X forearm
General morphology: the relationship between bones and muscles is virtually identical; common ancestors shared these elements.
Specific morphology: with feathers X featherless; fingerless X with fingers.
Function: flying X object manipulation.
16
Sequence Alignment
Advances have come in leaps, often separated by decades. Milestones:
1970: S. Needleman and C. Wunsch introduced the concept of global alignment.
1981: T. Smith and M. Waterman advanced to local alignment.
1990: S. Altschul, W. Gish and others published a fast heuristic method, BLAST.
17
Sequence Alignment
Algorithmic and statistical questions to be answered: Given two sequences, what is the best way to align them? How can the quality of an alignment be evaluated? Can an alignment be explained by chance, or should it be concluded that it is due to a common ancestor?
18
Sequence Alignment
Situations where alignment is needed: when the same gene is sequenced by two different laboratories, the results have to be compared; when the same long string is typed in twice, it is necessary to look for typos.
19
Sequence Alignment
More situations where alignment may be needed: comparing the complete genomes of two organisms, to identify similarities and differences; comparing a DNA sequence of a hundred bases with a genome of thousands of bases, to examine whether the sequence appears in the genome.
20
Sequence Alignment
There are different types of alignment.
Regarding the type of sequence: between nucleotide sequences (alphabet of 4 symbols) or between amino acid sequences (alphabet of 20 symbols).
Regarding the number of sequences: pairwise alignment aligns one sequence with another; multiple alignment aligns several sequences with each other.
21
Sequence Alignment
Alignments can also be of different types regarding the region that is aligned.
Global: uses the complete sequences.
Local: searches for the best match between inner regions of the two sequences.
(Figure: example alignments with spaces.)
22
Example
Align the sequences GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA:

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

There are three possibilities for each position: a perfect match, a divergence (mismatch), or an insertion or deletion (indel).
23
Global Alignment
The aligned sequences must have the same size: if the original sequences have different sizes, spaces must be included to match them. The order of the nucleotides (or amino acids) of each sequence must be preserved. The size of the alignment is greater than or equal to the size of the largest sequence.
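The constraints on a global alignment can be checked mechanically. The sketch below is only an illustration: the function name `is_valid_global_alignment` is hypothetical, and the rule that no column may hold a space in both rows is an added assumption that the slide does not state.

```python
GAP = "-"

def is_valid_global_alignment(row_x, row_y, x, y):
    """Check the global-alignment constraints for two rows that may contain spaces."""
    # Both rows of the alignment must have the same size.
    same_length = len(row_x) == len(row_y)
    # Removing the spaces must give back the original sequences, in order.
    keeps_x = row_x.replace(GAP, "") == x
    keeps_y = row_y.replace(GAP, "") == y
    # Assumption (not on the slide): a column with two spaces is not allowed.
    no_double_gap = all(not (a == GAP and b == GAP) for a, b in zip(row_x, row_y))
    return same_length and keeps_x and keeps_y and no_double_gap

x = "GCGCATGGATTGAGCGA"
y = "TGCGCCATTGATGACCA"
print(is_valid_global_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A", x, y))  # True
```

The alignment checked here is the one from the earlier example; changing the order of any symbol makes the check fail.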
24
Example
Given the sequences below, define a reasonable alignment:

a c t g a
a t c t g
25
Example
Given the sequences a c t g a and a t c t g, an acceptable alignment would be:

a - c t g a
a t c t g -
26
Global Alignment
Aligning two sequences means searching for the best possible alignment, the one in which the sequences are most similar: the optimal alignment. The possible alignments have to be evaluated systematically, so the evaluation must be automatic.
27
Sequence Alignment
Criteria are needed to assess the possible alignments: a score is assigned to every alignment. The score is a similarity measure that compares the nucleotides (or amino acids) that appear in corresponding positions. An optimal alignment may not be unique, since several alignments can have the same maximum score, and small variations in the score function can change the ranking of the best alignments.
28
Score
The score of each position is calculated independently. The score of aligning two symbols $x_i$ and $y_i$ is defined by a score function $\sigma(x_i, y_i)$. The score of the alignment is the sum of the scores of its positions: $M(A) = \sum_i \sigma(x_i, y_i)$.
29
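Score Functions
There are several alternatives. One of the most used, given two sequences x and y:
$\sigma(x_i, y_i) = +1$ if $x_i = y_i$ (match), $-1$ if $x_i \neq y_i$ (mismatch), and $-2$ if $x_i$ or $y_i$ is a space (indel).
These are the values used in the next example.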
30
Example
Find the alignment between the sequences x = atctat and y = ctcat.

 a  t  c  t  a  t
 -  c  t  c  a  t
-2 -1 -1 -1 +1 +1     Score(x, y) = -3

 a  t  c  t  a  t
 c  t  c  a  t  -
-1 +1 +1 -1 -1 -2     Score(x, y) = -3
31
Example
Find the alignment between the sequences x = atctat and y = ctcat.

 a  t  c  t  a  t
 c  t  c  -  a  t
-1 +1 +1 -2 +1 +1     Score(x, y) = +1

Best solution found.
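The position-by-position scoring used in these examples can be written as a short sketch. It assumes the values that the per-column numbers of the examples imply (+1 for a match, -1 for a mismatch, -2 for a space); the function names `sigma` and `alignment_score` are illustrative only.

```python
def sigma(a, b, match=1, mismatch=-1, gap=-2):
    """Score of one aligned position; '-' stands for a space."""
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def alignment_score(row_x, row_y):
    """Sum of the independent position scores of an alignment."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same size")
    return sum(sigma(a, b) for a, b in zip(row_x, row_y))

print(alignment_score("atctat", "-ctcat"))  # -3
print(alignment_score("atctat", "ctc-at"))  # 1
```

Because every position is scored independently, changing the penalties only requires changing the defaults of `sigma`.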
32
Score Functions
Score function used in the examples of the book, given two sequences x and y:
33
Substitution Matrix
A substitution matrix provides a more realistic score function: it allows a different cost to be associated with each possible replacement.

     A   G   C   T   -
A   +1  -1  -1  -1  -1
G   -1  +1  -1  -1  -1
C   -1  -1  +1  -1  -1
T   -1  -1  -1  +1  -1
-   -1  -1  -1  -1  ND
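A substitution matrix can be stored as a lookup table keyed by pairs of symbols. The sketch below copies the values of the matrix above and treats the ND entry (space against space) as an error; the names `MATRIX` and `matrix_score` are hypothetical.

```python
SYMBOLS = "AGCT-"
ROWS = [
    [+1, -1, -1, -1, -1],    # A
    [-1, +1, -1, -1, -1],    # G
    [-1, -1, +1, -1, -1],    # C
    [-1, -1, -1, +1, -1],    # T
    [-1, -1, -1, -1, None],  # space; space against space is not defined (ND)
]
MATRIX = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(SYMBOLS)
    for j, b in enumerate(SYMBOLS)
}

def matrix_score(row_x, row_y):
    """Score an alignment by looking up every aligned pair in the matrix."""
    total = 0
    for a, b in zip(row_x, row_y):
        value = MATRIX[(a, b)]
        if value is None:
            raise ValueError("a space aligned with a space is not defined")
        total += value
    return total

print(matrix_score("AG-C", "AAAC"))  # 0
```

Replacing individual entries of `ROWS` is enough to give each substitution its own cost.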
34
Score Functions
The dissimilarity between different nucleotides is generally similar, so the weights associated with substitutions need not differ. For proteins, different amino acid substitutions have different effects, because of the physicochemical properties of the amino acids.
35
Optimal Alignment
The optimal global alignment $A^*(s, t)$ of two strings s and t is the alignment that maximizes the global alignment score over all possible alignments.
36
Optimal Alignment
Finding the optimal alignment A* is a combinatorial optimization problem:
1. Generate all possible alignments.
2. Calculate the score M(A) of each alignment.
3. Select the alignment with the maximum score.
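To see why generating all possible alignments quickly becomes impractical, the number of candidates can be counted. The sketch below assumes that a column with two spaces is not allowed, in which case the last column of an alignment is either a pair of symbols or a symbol against a space; the function name `count_alignments` is illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of global alignments of two sequences of sizes m and n."""
    if m == 0 or n == 0:
        # Only one option is left: align the remaining symbols with spaces.
        return 1
    return (
        count_alignments(m - 1, n - 1)  # last column: symbol against symbol
        + count_alignments(m - 1, n)    # last column: symbol of s against a space
        + count_alignments(m, n - 1)    # last column: space against symbol of t
    )

print(count_alignments(3, 4))    # 129
print(count_alignments(17, 17))  # grows very fast with the sequence sizes
```

Even for sequences of sizes 3 and 4 there are 129 candidate alignments, which motivates algorithms that avoid explicit enumeration.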
37
Statistical Analysis of Alignments
Is a good alignment due to chance or to biology? We run a hypothesis test using random sequences (Chapter 2). We fix a value for α, for example 0.05 (5%). We compare the score achieved with the scores of alignments against randomized sequences. The p-value is the fraction of randomized alignments with a score >= the original score. If the p-value < α, the alignment is considered significant.
38
Statistical Analysis of Alignments
Comparing with randomized sequences: one of the sequences is used to produce random sequences, either by shuffling its bases or by shuffling blocks of bases (which retains local relations). The other sequence is aligned with each random sequence. The scores achieved are ranked together with the original score, and the original score is accepted as significant if it is among the top 5%.
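The two ways of randomizing a sequence can be sketched as follows. The function names and the block size are illustrative assumptions; the slide does not say how large the blocks should be.

```python
import random

def shuffle_bases(seq, rng):
    """Randomize a sequence by shuffling its individual symbols."""
    symbols = list(seq)
    rng.shuffle(symbols)
    return "".join(symbols)

def shuffle_blocks(seq, block_size, rng):
    """Randomize a sequence by shuffling fixed-size blocks, keeping each block intact."""
    blocks = [seq[i:i + block_size] for i in range(0, len(seq), block_size)]
    rng.shuffle(blocks)
    return "".join(blocks)

rng = random.Random(0)  # fixed seed so that the run is repeatable
print(shuffle_bases("VIVADAVIS", rng))
print(shuffle_blocks("VIVADAVIS", 3, rng))
```

Both functions keep the composition of the sequence; only the block version also keeps the order of neighboring symbols inside each block.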
39
Statistical Analysis of Alignments
Example:

A*(s, t) =
V I V A L A S V E G A S
V I V A D A - V - - I S

The alignment has a score of 2. 1000 permutations of the second sequence are produced, and the best global alignment between each random sequence and the first sequence is found.
40
Statistical Analysis of Alignments
Only 2 of the 1000 permutations had a score >= 2, so the p-value is 2/1000 = 0.002. Since 0.002 < α = 0.05, the alignment is significant.
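The permutation test can be sketched in a few lines. This is an illustration under assumptions: the score function (+1 match, -1 mismatch, -1 space) is chosen because it reproduces the score of 2 for this pair, the global score is computed with a standard dynamic programming recursion, and the names `nw_score` and `permutation_p_value` are hypothetical. The p-value obtained depends on the random permutations drawn, so it will not necessarily be 0.002.

```python
import random

def nw_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of s and t, keeping two table rows at a time."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def permutation_p_value(s, t, n_perm=1000, seed=0):
    """Fraction of shuffled versions of t that align to s at least as well as t does."""
    rng = random.Random(seed)
    observed = nw_score(s, t)
    symbols = list(t)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(symbols)
        if nw_score(s, "".join(symbols)) >= observed:
            hits += 1
    return observed, hits / n_perm

score, p_value = permutation_p_value("VIVALASVEGAS", "VIVADAVIS")
print(score, p_value)
print("significant" if p_value < 0.05 else "not significant")
```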
41
BLAST
Basic Local Alignment Search Tool. Other algorithms are too slow for large-scale alignments, for example against public databases such as GenBank. BLAST is the most used alignment tool: it is fast and computes approximate local alignments.
42
BLAST – How it Works
The "query" sequence is broken into "words" (pieces of the query). The words are compared to the sequence database (target sequences) and exact matches are identified. For each match, the alignment is extended until it reaches a score higher than a threshold (MSPs). (Schneider and La Rota 2000)
43
BLAST Algorithm
The "query" sequence is divided into short regions of length W ("words"). The words that align with a sequence in the database ("target sequence") with a score >= T are kept (scores can use substitution matrices). For each pair of sequences (query and target) that has one or more words in common, the alignment is extended, increasing its score, up to a certain limit S. These alignments are high-scoring pairs (HSPs); the HSPs with the highest scores are called MSPs. The target sequences with the most and the largest HSPs are the best matches.
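The seed-and-extend idea can be illustrated with a heavily simplified sketch. It is not BLAST itself: it only finds exact word matches (no threshold T or substitution matrix), extends them without spaces, and stops an extension when the score falls a fixed amount below the best value seen. The function names and the parameters `w` and `drop` are assumptions made for this illustration.

```python
from collections import defaultdict

def find_seeds(query, target, w):
    """Positions (i, j) where a word of length w of the query occurs in the target."""
    index = defaultdict(list)
    for j in range(len(target) - w + 1):
        index[target[j:j + w]].append(j)
    return [
        (i, j)
        for i in range(len(query) - w + 1)
        for j in index.get(query[i:i + w], [])
    ]

def extend_seed(query, target, i, j, w, match=1, mismatch=-1, drop=2):
    """Extend a seed to the right and to the left without spaces."""
    def extend(qi, tj, step):
        score = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= tj < len(target):
            score += match if query[qi] == target[tj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= drop:
                break  # the score dropped too far below the best value
            qi += step
            tj += step
        return best, best_len

    right_score, right_len = extend(i + w, j + w, +1)
    left_score, left_len = extend(i - 1, j - 1, -1)
    total = w * match + right_score + left_score
    return (
        total,
        query[i - left_len:i + w + right_len],
        target[j - left_len:j + w + right_len],
    )

query, target = "VIVADAVIS", "QUEVIVALASVEGAS"
seeds = find_seeds(query, target, 3)
print(seeds)                                   # [(0, 3), (1, 4)]
print(extend_seed(query, target, 0, 3, 3))     # (4, 'VIVA', 'VIVA')
```

Indexing the target words once is what makes the seed search fast compared with filling a complete alignment table for every database sequence.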
44
BLAST
The basic concept is: the greater the number of similar segments between two sequences, and the longer these segments are, the less divergent the sequences are and the more likely they are to be genetically related (homologous).
45
Multiple Alignment
CLUSTAL W (1.82) multiple sequence alignment

GLB1_GLYDI  ---------GLSAAQRQVIAATWKDIAGADNGAGVGKDCLIKFLSAHPQMAAVFGFSGAS 51
HBB_HUMAN   --------VHLTPEEKSAVTALWGKV----NVDEVGGEALGRLLVVYPWTQRFFESFGDL 48
HBA_HUMAN   ---------VLSPADKTNVKAAWGKVG--AHAGEYGAEALERMFLSFPTTKTYFPHF-DL 48
MYG_PHYCA   ---------VLSEGEWQLVLHVWAKVE--ADVAGHGQDILIRLFKSHPETLEKFDRFKHL 49
GLB5_PETMA  PIVDTGSVAPLSAAEKTKIRSAWAPVY--STYETSGVDILVKFFTSTPAAQEFFPKFKGL 58
GLB3_CHITP  ----------LSADQISTVQASFDKVK------GDPVGILYAVFKADPSIMAKFTQFAGK 44
LGB2_LUPLU  --------GALTESQAALVKSSWEEFN--ANIPKHTHRFFILVLEIAPAAKDLFSFLKGT 50

GLB1_GLYDI  DP--------GVAALGAKVLAQIGVAVSHLGDE--GKMVAQMKAVGVRHKGYGNKHIKAQ 101
HBB_HUMAN   STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHC--DKLHVDPE 101
HBA_HUMAN   SH-----GSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHA--HKLRVDPV 96
MYG_PHYCA   KTEAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHA--TKHKIPIK 102
GLB5_PETMA  TTADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHA--KSFQVDPQ 114
GLB3_CHITP  DLES-IKGTAPFETHANRIVGFFSKIIGELPN-----IEADVNTFVASHK---PRGVTHD 95
LGB2_LUPLU  SEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHV---SKGVADA 105

GLB1_GLYDI  YFEPLGASLLSAMEHRIGGKMNAAAKDAWAAAYADISGALISGLQS----- 147
HBB_HUMAN   NFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------ 146
HBA_HUMAN   NFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------ 141
MYG_PHYCA   YLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG 153
GLB5_PETMA  YFKVLAAVIADTVAAG---------DAGFEKLMSMICILLRSAY------- 149
GLB3_CHITP  QLNNFRAGFVSYMKAHTD---FAGAEAAWGATLDTFFGMIFSKM------- 136
LGB2_LUPLU  HFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA--- 153
46
Sequence Alignment
There are several techniques: dot plot; FASTA; BLAST (several versions); the Smith-Waterman algorithm; the Needleman-Wunsch algorithm; genetic algorithms.
47
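Algorithm Needleman-Wunsch
1. Create a table F of size (m+1) x (n+1) for sequences s and t of sizes m and n.
2. Fill the first column and the first row of the table with the accumulated space penalties: $F(0,0) = 0$, $F(i,0) = F(i-1,0) + \sigma(s_i, -)$ and $F(0,j) = F(0,j-1) + \sigma(-, t_j)$.
3. From the top left, calculate the value of each entry using the recursion $F(i,j) = \max\{F(i-1,j-1) + \sigma(s_i, t_j),\ F(i-1,j) + \sigma(s_i, -),\ F(i,j-1) + \sigma(-, t_j)\}$.
4. Perform the trace-back procedure from the lower right corner.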
48
Tracing Back
Tracing back recovers the alignment with the best score. Each cell has a pointer to the cell used to calculate its value, and the pointers form a path. Every movement has a meaning. Diagonal: a match or a divergence. Vertical: a space is included in the top sequence. Horizontal: a space is inserted in the side sequence.
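The table filling and the trace-back can be sketched together. This is a minimal illustration, not the slides' own code: it assumes a simple score function (+1 match, -1 mismatch, -2 space by default), and instead of storing a pointer in each cell it re-derives the pointer during the trace-back by checking which neighbor produced the value. When several optimal alignments exist, only one is returned, with the diagonal move tried first.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment: returns (score, aligned s, aligned t)."""
    m, n = len(s), len(t)
    # F[i][j] is the best score for aligning s[:i] with t[:j].
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Trace back from the lower right corner to the top left corner.
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1])  # match or divergence
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-")       # space in t
            i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1])       # space in s
            j -= 1
    return F[m][n], "".join(reversed(row_s)), "".join(reversed(row_t))

print(needleman_wunsch("AGC", "AAAC"))  # (-1, '-AGC', 'AAAC')
```

Storing explicit pointers, as the slide describes, would make it possible to enumerate every optimal alignment instead of just one.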
49
Example
Find the best alignment of sequences x and y: x = AGC, y = AAAC. Use as score function: +1 for a match, -1 for a mismatch and -2 for a space.
50
Example
The first row and the first column are initialized with the accumulated space penalties:

         A    G    C
    0   -2   -4   -6
A  -2
A  -4
A  -6
C  -8
51
Example
The first entry aligns A with A (match): 0 + 1 = 1.

         A    G    C
    0   -2   -4   -6
A  -2    1
A  -4
A  -6
C  -8
52
Example
Table as on slide 51, with the first entry (1) filled.
53
Example
The remaining entries are filled with the recursion:

         A    G    C
    0   -2   -4   -6
A  -2    1   -1   -3
A  -4   -1    0   -2
A  -6   -3   -2   -1
C  -8   -5   -4   -1
54
Example
Trace-back starts at the lower right corner, where C is aligned with C:

C
C
55
Example
Trace-back, second step. Three paths are possible:

-C     GC     GC
AC     AC     AC
56
Example
Trace-back, third step:

G-C    AGC    -GC
AAC    AAC    AAC
57
Example
Trace-back, final step. There are three optimal alignments:

AG-C    -AGC    A-GC
AAAC    AAAC    AAAC
58
NW – Time and Space Complexity
Time: filling the matrix takes O(n·m) and backtracing takes O(n+m), so the overall time is O(n·m).
Space: holding the matrix takes O(n·m).
59
Example
Find the best alignment of sequences x and y: x = VIVADAVIS, y = VIVALASVEGAS. Use as score function: +1 for a match, -1 for a mismatch and -1 for a space.
60
Example

          V    I    V    A    D    A    V    I    S
     0   -1   -2   -3   -4   -5   -6   -7   -8   -9
V   -1    1    0   -1   -2   -3   -4   -5   -6   -7
I   -2    0    2    1    0   -1   -2   -3   -4   -5
V   -3   -1    1    3    2    1    0   -1   -2   -3
A   -4   -2    0    2    4    3    2    1    0   -1
L   -5   -3   -1    1    3    3    2    1    0   -1
A   -6   -4   -2    0    2    2    4    3    2    1
S   -7   -5   -3   -1    1    1    3    3    2    3
V   -8   -6   -4   -2    0    0    2    4    3    2
E   -9   -7   -5   -3   -1   -1    1    3    3    2
G  -10   -8   -6   -4   -2   -2    0    2    2    2
A  -11   -9   -7   -5   -3   -3   -1    1    1    1
S  -12  -10   -8   -6   -4   -4   -2    0    0    2
61
Example
62
Example
63
Tracing Back
Starting from the bottom right corner:

A*(s, t) =
V I V A L A S V E G A S
V I V A D A - V - - I S
64
Algorithm Smith-Waterman
1. Create a table F of size (m+1) x (n+1) for sequences s and t of sizes m and n.
2. Fill the first row and the first column of the table with zeros.
3. From the top left, calculate the value of each entry using the recursion $F(i,j) = \max\{0,\ F(i-1,j-1) + \sigma(s_i, t_j),\ F(i-1,j) + \sigma(s_i, -),\ F(i,j-1) + \sigma(-, t_j)\}$.
4. Perform the trace-back procedure from the largest table element to the first element with value zero.
65
Example
Find the best local alignment of sequences x and y: x = QUEVIVALASVEGAS, y = VIVADAVIS. Use as score function:
66
Algorithm Smith-Waterman
67
Algorithm Smith-Waterman
Trace-back from the largest table element to the first element with zero value:
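A minimal sketch of local alignment follows. The score function of the example is not preserved in this transcript, so the values used here (+1 match, -1 mismatch, -1 space) are an assumption. Table entries never go below zero, the trace-back starts at the first largest entry found and stops at the first entry with value zero, and pointers are re-derived instead of stored.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-1):
    """Local alignment: returns (score, aligned part of s, aligned part of t)."""
    m, n = len(s), len(t)
    # First row and first column stay at zero.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:  # keep the first largest entry
                best, best_pos = H[i][j], (i, j)

    # Trace back from the largest entry until an entry with value zero is reached.
    row_s, row_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-")
            i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))

print(smith_waterman("QUEVIVALASVEGAS", "VIVADAVIS"))  # (4, 'VIVA', 'VIVA')
```

With these assumed penalties several local alignments tie at a score of 4; choosing a different tie-breaking rule for the largest entry would return a longer one, such as VIVALA aligned with VIVADA.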
68
Questions?