Genomics and Personalized Care in Health Systems
Lecture 3: Sequence Alignment
Leming Zhou
School of Health and Rehabilitation Sciences
Department of Health Information Management

Outline
Pairwise sequence alignment
Multiple sequence alignment
Phylogenetic tree

Similarity Search
Find statistically significant matches to a protein or DNA sequence of interest.
Obtain information on the inferred function of the gene.
Sequence identity/similarity is a quantitative measurement of the number of nucleotides or amino acids that are identical/similar in two aligned sequences.
  – Calculated from a sequence alignment
  – Can be expressed as a percentage
  – In proteins, some residues are chemically similar but not identical
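The percentage mentioned above can be sketched in a few lines. This is a minimal illustration, not a standard tool: the function name `percent_identity` is hypothetical, and it assumes the denominator is the full alignment length (gap columns included); other conventions divide by the shorter sequence or by ungapped columns only.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent of alignment columns where both rows carry the same residue.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * identical / len(row_a)


# Two hypothetical aligned rows: 8 columns, 6 identical, 1 mismatch, 1 gap.
print(percent_identity("ACGTACGT", "ACGAAC-T"))  # 75.0
```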

Sequence Alignment
A linear, one-to-one correspondence between some of the symbols in one sequence and some of the symbols in another sequence.
  – Four possible outcomes at each position when aligning two sequences: identity; mismatch; gap in one sequence; gap in the other sequence
The sequences may be DNA or protein sequences.
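The four outcomes can be tallied column by column. The sketch below is illustrative only: `column_outcomes` and its labels are hypothetical names, and "-" is assumed to be the gap symbol.

```python
from collections import Counter


def column_outcomes(row_a: str, row_b: str, gap: str = "-") -> Counter:
    """Classify every alignment column into one of the four outcomes."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = Counter()
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            raise ValueError("a column cannot pair a gap with a gap")
        if a == gap:
            counts["gap in first sequence"] += 1
        elif b == gap:
            counts["gap in second sequence"] += 1
        elif a == b:
            counts["identity"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# One possible alignment of ACGGACT with ATCGGATCT.
print(column_outcomes("A-C-GG-ACT", "ATCGGAT-CT"))
```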

Evolutionary Basis of Alignment
The simplest molecular mechanisms of evolution are substitution, insertion, and deletion.
If a sequence alignment represents the evolutionary relationship of two sequences, residues that are aligned but do not match represent substitutions.
Residues that are aligned with a gap in the other sequence represent insertions or deletions.

Alignment Algorithms
Sequences often contain highly conserved regions.
These regions can be used for an initial alignment.

Alignments
Two sequences
    Seq 1: ACGGACT
    Seq 2: ATCGGATCT
There may be multiple ways of creating the alignment. Which alignment is the best?
    Seq 1: A-C-GG-ACT
           | | |   ||
    Seq 2: ATCGGAT-CT

    Seq 2: ATCGGATCT
           | |||  ||
    Seq 1: A-CGG-ACT

Optimal vs. Correct Alignment
For a given group of sequences, there is no single “correct” alignment, only an alignment that is “optimal” according to some set of calculations.
This is partly due to:
  – the complexity of the problem,
  – limitations of the scoring systems used,
  – our limited understanding of life and evolution
Success of the alignment will depend on the similarity of the sequences. If sequence variation is great, it will be very difficult to find an optimal alignment.

Optimal Alignment
Every alignment has a score.
Choose the alignment with the highest score.
Must choose an appropriate scoring function.
The scoring function is based on an evolutionary model with insertions, deletions, and substitutions.
Use a substitution score matrix, which contains an entry for every amino acid pair.

Gaps
Positions at which a letter is paired with a null are called gaps.
Gap scores are typically negative.
Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.
Biologically, it should in general be easier for a sequence to accept a different residue in a position than to have parts of the sequence deleted or inserted.
Gaps (insertions and deletions) should therefore be rarer than point mutations (substitutions).

Gaps in Sequence Alignment
A gap can occur
  – Before the first character of a string
  – Inside a string
  – After the last character of a string
    CTGCGGG---GGTAAT
      ||||    || ||
    --GCGG-AGAGG-AA-

Gap penalties
There is no suitable theory for gap penalties.
The simplest gap penalty is a constant penalty for each gap.
The most common type of gap penalty is the affine gap penalty: g = a + bx
  – a is the gap opening penalty
  – b is the gap extension penalty
  – x is the number of gapped-out residues
A contiguous block of residues is more likely to be inserted or deleted in a single event, so the scoring scheme should penalize opening a new gap more than extending an existing one.
Typical values: e.g. a = 10 and b = 1 for BLAST.
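A short sketch of the affine formula g = a + bx, using the example values a = 10 and b = 1 from the slide as defaults. The function name `affine_gap_penalty` is hypothetical; the point is only to show why one long gap costs less than several short ones.

```python
def affine_gap_penalty(length: int, gap_open: int = 10, gap_extend: int = 1) -> int:
    """Penalty g = a + b*x for one gap spanning `length` residues."""
    if length <= 0:
        return 0  # no gap, no penalty
    return gap_open + gap_extend * length


# One gap of four residues versus four separate single-residue gaps.
one_long_gap = affine_gap_penalty(4)
four_short_gaps = 4 * affine_gap_penalty(1)
print(one_long_gap, four_short_gaps)  # 14 44
```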

Pairwise Sequence Alignment

Pairwise Alignment
The process of lining up two sequences to achieve maximal levels of identity or conservation, for the purpose of assessing the degree of similarity and the possibility of homology.
It is used to
  – Decide if two genes are related structurally or functionally
      Find the similarities between two sequences with the same evolutionary background
  – Identify domains or motifs that are shared between proteins
  – Analyze genomes
      Identify genes, search large databases, determine overlaps of sequences (DNA assembly)

DNA and Protein Sequences
DNA alphabet: { A, C, G, T }+
  – Four discrete possibilities: it’s either a match or a mismatch
Protein alphabet: { A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y }+
  – 20 possibilities which fall into several categories
  – Residues can be similar without being identical
In some cases, protein sequence is more informative
  – Codons are degenerate: changes in the third position often do not alter the amino acid that is specified
In some cases, DNA alignments are appropriate
  – To confirm the identity of a cDNA; to study noncoding regions of DNA; to study DNA polymorphisms, …

Translating a DNA Sequence into Proteins
DNA sequences can be translated into protein, and then used in pairwise alignments.
One DNA sequence can be translated into six potential proteins.
Forward-strand reading frames:
    5’ CAT CAA
    5’ ATC AAC
    5’ TCA ACT
Reverse-strand reading frames:
    5’ GTG GGT
    5’ TGG GTA
    5’ GGG TAG
    5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
    3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
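The six reading frames can be generated programmatically. The sketch below assumes the standard genetic code and an input containing only A, C, G, T; the names `six_frames`, `translate`, and `reverse_complement` are illustrative, not taken from any particular library. Stop codons are written as "*".

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict:
    """Translate three frames on each strand; keys look like '+1' or '-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


dna = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for name, protein in six_frames(dna).items():
    print(name, protein)
```

The first two residues of each frame correspond to the codon pairs listed on the slide (for example, CAT CAA gives HQ).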

DNA Alignment Score
    CGAAGACTTGAGCTGAT
    || |||| ||| ||||
    CGCAGACATGA-CTGAC
Each column is a match (|), a mismatch, or a gap (-).

Alignment Scoring Scheme
Possible scoring scheme:
  – match: +5
  – mismatch: -3
  – indel: -4
Example:
    G  A  A  T  T  C  A  G  T  T  A
    |     |     |  |     |        |
    G  G  A  -  T  C  -  G  -  -  A
   +5 -3 +5 -4 +5 +5 -4 +5 -4 -4 +5
S = 5 - 3 + 5 - 4 + 5 + 5 - 4 + 5 - 4 - 4 + 5 = 11
The match/mismatch scores as a matrix:
        A   C   G   T
    A   5  -3  -3  -3
    C  -3   5  -3  -3
    G  -3  -3   5  -3
    T  -3  -3  -3   5
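The worked example can be checked with a few lines of code. This is a minimal sketch using the slide's scheme (+5, -3, -4) as defaults; the function name `score_alignment` is hypothetical and every gap position is charged the same indel score.

```python
def score_alignment(row_a: str, row_b: str, match: int = 5,
                    mismatch: int = -3, indel: int = -4, gap: str = "-") -> int:
    """Sum the column scores of an existing pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            total += indel
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("GAATTCAGTTA", "GGA-TC-G--A"))  # 11
```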

Amino Acid Sequence Alignment
No exact match/mismatch scores.
The match state score is calculated by table lookup.
The lookup table is a substitution matrix (or scoring matrix).

Substitution Matrix
A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j, for all pairs of amino acids.
Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids.
Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution.
The two major types of substitution matrices are Point Accepted Mutation (PAM) and BLOcks SUbstitution Matrix (BLOSUM).

Sequence Alignment Algorithms
Dynamic programming:
  – Needleman-Wunsch global alignment (1970)
  – Smith-Waterman local alignment (1981)
  – Guaranteed to find the best-scoring alignment
  – Slow, especially when used to compare against a large database
Heuristics:
  – FASTA, BLAST: heuristic approximations to Smith-Waterman
  – Fast, with results comparable to the Smith-Waterman algorithm
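To make the local-alignment idea concrete, here is a score-only sketch of Smith-Waterman. It assumes a simple linear gap score and the +5/-3/-4 scheme used elsewhere in this lecture; the function name is hypothetical and no traceback is performed.

```python
def smith_waterman_score(a: str, b: str, match: int = 5,
                         mismatch: int = -3, indel: int = -4) -> int:
    """Best local alignment score; cell values are never allowed below zero."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a local alignment restart anywhere in the matrix.
            curr.append(max(0, diag, prev[j] + indel, curr[j - 1] + indel))
        best = max(best, max(curr))
        prev = curr
    return best


# ACGT occurs exactly inside the longer sequence: four matches.
print(smith_waterman_score("ACGT", "TTACGTTT"))  # 20
```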

Dynamic Programming
Solves optimization problems by dividing the problem into smaller subproblems whose solutions are stored and reused.
Sequence alignment has the optimal substructure property
  – Subproblem: alignment of prefixes of the two sequences
  – Each subproblem is computed once and stored in a matrix
Optimal score: built upon the optimal alignment computed to that point.
Aligns two sequences beginning at the ends, attempting to align all possible pairs of characters
  – Alignment contains matches, mismatches, and gaps
  – Scoring scheme for matches, mismatches, and gaps
  – The highest set of scores defines the optimal alignment between the sequences
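The prefix-by-prefix idea can be sketched as a small Needleman-Wunsch implementation. This is an illustrative version only: it assumes a linear gap score (not affine) and the +5/-3/-4 scheme from the scoring slide, and when several optimal alignments exist it returns just one of them.

```python
def needleman_wunsch(a: str, b: str, match: int = 5,
                     mismatch: int = -3, indel: int = -4):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel
    for j in range(1, m + 1):
        score[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + indel,
                              score[i][j - 1] + indel)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("ACGGACT", "ATCGGATCT")
print(best)
print(top)
print(bottom)
```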

The Big O Notation
The computational complexity of an algorithm describes how its execution time increases as the problem is made larger (e.g. more sequences to align).
The big-O notation
  – If we have a problem of size n, then an algorithm takes O(n) time if the time increases linearly with n. If the algorithm needs time proportional to the square of n, then it is O(n^2).
  – More examples, where c is a constant:
      O(c)      utopian
      O(log n)  excellent
      O(n)      very good
      O(n^2)    not so good
      O(n^3)    pretty bad
      O(c^n)    disaster

Drawbacks to DP Approaches
Compute intensive
Memory intensive
Complexity of the DP algorithm
  – Time O(nm); space O(nm), where n and m are the lengths of the two sequences
  – Space complexity can be reduced to O(n) by not storing the entries of the dynamic programming table that are no longer needed for the computation (keep the current row and the previous row only)
A fast heuristic (BLAST) will be discussed next week.
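The two-row trick can be sketched as follows. Note the trade-off: keeping only two rows yields the optimal score but not the alignment itself, since the traceback information is discarded. The function name and the +5/-3/-4 linear scoring are assumptions for illustration.

```python
def global_score_two_rows(a: str, b: str, match: int = 5,
                          mismatch: int = -3, indel: int = -4) -> int:
    """Optimal global alignment score using only two rows of the DP table."""
    prev = [j * indel for j in range(len(b) + 1)]  # row 0: empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * indel]  # column 0: empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + indel, curr[j - 1] + indel))
        prev = curr  # the older row is no longer needed
    return prev[-1]


print(global_score_two_rows("ACGT", "AGT"))  # 11
```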

Two Sequences
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC
AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG
CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC
TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT
CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA
CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA
CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT
GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA
TGCTTCTGACATAGTTGTGTTGACTCACAAACTCAGAAACAGACACCATGGTGCACCTGACTGATGCTGA
GAAGGCTGCTGTTAATGGCCTGTGGGGAAAGGTGAACCCTGATGATGTTGGTGGCGAGGCCCTGGGCAGG
CTGCTGGTTGTCTACCCTTGGACCCAGAGGTACTTTGATAGCTTTGGGGACCTGTCCTCTGCCTCTGCTA
TCATGGGTAACCCTAAGGTGAAGGCCCATGGCAAGAAGGTGATAAACGCCTTCAATGATGGCCTGAAACA
CTTGGACAACCTCAAGGGCACCTTTGCTCATCTGAGTGAACTCCACTGTGACAAGCTGCATGTGGATCCT
GAGAACTTCAGGCTCCTGGGCAATATGATTGTGATTGTGTTGGGCCACCACCTGGGCAAGGAATTCACCC
CCTGTGCACAGGCTGCCTTCCAGAAGGTGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTA
AACCTCTTTTCCTGCTCTTGTCTTTGTGCAATGGTCAATTGTTCCCAAGAGAGCATCTGTCAGTTGTTGT
CAAAATGACAAAGACCTTTGAAAATCTGTCCTACTAATAAAAGGCATTTACTTTCACTGC

Pairwise Sequence Alignment
FASTA: http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi
DNA vs. DNA comparison
Default parameters:
  – Match: +5
  – Mismatch: -4
  – Gap open penalty: -12
  – Gap extension penalty: -4
BLAST search will be covered next week.

Multiple Sequence Alignment

Multiple Sequence Alignment
Multiple sequence alignment (MSA) is a generalization of pairwise sequence alignment: instead of aligning two sequences, n (>2) sequences are aligned simultaneously.
A multiple sequence alignment is obtained by inserting gaps (“-”) into the sequences such that the resulting sequences all have length L and can be arranged in a matrix of n rows and L columns, where each column represents a homologous position.
MSA applies to both DNA and protein sequences.

Why Do We Need MSA?
MSA can help to develop a sequence “fingerprint” which allows the identification of members of a distantly related protein family (motifs).
Formulate and test hypotheses about protein 3-D structure.
MSA can help us to reveal biological facts about proteins, e.g. how protein function has changed or the evolutionary pressure acting on a gene.
Crucial for genome sequencing:
  – Random fragments of a large molecule are sequenced and those that overlap are found by a multiple sequence alignment program
To establish homology for phylogenetic analyses.
Identify homologous sequences in other organisms.

Multiple Sequence Alignment
Difficulty: the introduction of multiple sequences increases the combinations of matches, mismatches, and gaps.
In pairwise alignments, one has a 2D matrix with the sequences on each axis. The number of operations required to locate the best “path” through the matrix is approximately proportional to the product of the lengths of the two sequences.
A possible general method would be to extend the pairwise alignment method into a simultaneous N-wise alignment, using a DP algorithm in N dimensions. Algorithmically, this is not difficult to do.

Example
fly       GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human     GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant     GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast     GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon  GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly       KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human     KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant     KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast     KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon  KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly       GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human     GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant     GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast     GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon  GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

MSA
How do we generate a multiple alignment?
Given a pairwise alignment, just add the third sequence, then the fourth, and so on, until all have been aligned.
Does it work?
It is not self-evident how these sequences are to be aligned together.
It depends not only on the various alignment parameters but also on the order in which sequences are added to the multiple alignment.

Dynamic Programming for MSA
Dynamic programming with two sequences
  – Relatively easy to code
  – Guaranteed to obtain the optimal alignment
An extension of pairwise sequence alignment
  – Alignment of K sequences: K(K-1)/2 possible pairwise sequence comparisons
Alignment algorithms operate in a similar manner to pairwise alignment, but now the distance matrix is K-dimensional and the weight function compares K letters.

Time Complexity of Optimal MSA
Space complexity (hyperlattice size): O(n^k) for k sequences, each of length n.
Computing a hyperlattice node: O(2^k).
Time complexity: O(2^k n^k).
Finding the optimal solution takes time exponential in k (non-polynomial, NP-hard).
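A quick calculation shows how fast these bounds grow. The helper below is purely illustrative (`msa_dp_cost` is a hypothetical name) and simply evaluates n^k cells and 2^k n^k elementary steps, ignoring constant factors.

```python
def msa_dp_cost(n: int, k: int) -> tuple:
    """Return (cells, steps) for exact DP over k sequences of length n."""
    cells = n ** k             # hyperlattice size, O(n^k)
    steps = (2 ** k) * cells   # O(2^k) work per cell
    return cells, steps


for k in (2, 3, 5, 10):
    cells, steps = msa_dp_cost(100, k)
    print(f"k={k:2d}  cells={cells:.1e}  steps={steps:.1e}")
```

Even for short sequences of length 100, ten sequences would already require on the order of 10^20 cells, which is why heuristics are used in practice.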

Heuristics for Optimal MSA
Reduction of space and time
Heuristic alignment: not guaranteed to be optimal
Alignment provides a limit to the volume within which optimal alignments are likely to be found
Heuristics:
  – Progressive alignments (ClustalW)

Progressive Alignment
Works by progressive alignment: it aligns a pair of sequences, then aligns the next one onto the first pair.
The most closely related sequences are aligned first, and then additional sequences and groups of sequences are added, guided by the initial alignments.
Uses alignment scores to produce a guide tree.
Aligns the sequences sequentially, guided by the relationships indicated by the tree
  – If the order is wrong and distantly related sequences are merged too soon, errors in the alignment may occur and propagate
Gap penalties can be adjusted based on the specific sequences.

CLUSTALW
http://www.ebi.ac.uk/clustalw/
Perform pairwise alignments of all sequences
Use alignment scores to produce a guide tree
Align sequences sequentially, guided by the tree
Enhanced dynamic programming is used to align sequences
Genetic distance is determined by the number of mismatches divided by the number of matches
Gaps are added to an existing profile in progressive methods
CLUSTALW incorporates a statistical model in order to place gaps where they are most likely to occur

ClustalW MSA Procedure
[Figure: all pairwise alignments → similarity matrix → cluster analysis → dendrogram. From Higgins (1991) and Thompson (1994).]

Three Protein Sequences
>sp|P25454|RAD51_YEAST DNA repair protein RAD51 OS=Saccharomyces cerevisiae GN=RAD51 PE=1 SV=1
MSQVQEQHISESQLQYGNGSLMSTVPADLSQSVVDGNGNGSSEDIEATNGSGDGGGLQEQAEAQGEMEDEAYDEAAL
GSFVPIEKLQVNGITMADVKKLRESGLHTAEAVAYAPRKDLLEIKGISEAKADKLLNEAARLVPMGFVTAADFHM
RRSELICLTTGSKNLDTLLGGGVETGSITELFGEFRTGKSQLCHTLAVTCQIPLDIGGGEGKCLYIDTEGTFRPVRLV
SIAQRFGLDPDDALNNVAYARAYNADHQLRLLDAAAQMMSESRFSLIVVDSVMALYRTDFSGRGELSARQMHLA
KFMRALQRLADQFGVAVVVTNQVVAQVDGGMAFNPDPKKPIGGNIMAHSSTTRLGFKKGKGCQRLCKVVDSPC
LPEAECVFAIYEDGVGDPREEDE
>sp|P25453|DMC1_YEAST Meiotic recombination protein DMC1 OS=Saccharomyces cerevisiae GN=DMC1 PE=1 SV=1
MSVTGTEIDSDTAKNILSVDELQNYGINASDLQKLKSGGIYTVNTVLSTTRRHLCKIKGLSEVKVEKIKEAAGKIIQVGFI
PATVQLDIRQRVYSLSTGSKQLDSILGGGIMTMSITEVFGEFRCGKTQMSHTLCVTTQLPREMGGGEGKVAYIDTE
GTFRPERIKQIAEGYELDPESCLANVSYARALNSEHQMELVEQLGEELSSGDYRLIVVDSIMANFRVDYCGRGELS
ERQQKLNQHLFKLNRLAEEFNVAVFLTNQVQSDPGASALFASADGRKPIGGHVLAHASATRILLRKGRGDERVAK
LQDSPDMPEKECVYVIGEKGITDSSD
>sp|P48295|RECA_STRVL Protein recA OS=Streptomyces violaceus GN=recA PE=3 SV=1
MAGTDREKALDAALAQIERQFGKGAVMRMGDRTQEPIEVISTGSTALDIALGVGGLPRGRVVEIYGPESSGKTTLTLHA
VANAQKAGGQVAFVDAEHALDPEYAKKLGVDIDNLILSQPDNGEQALEIVDMLVRSGALDLIVIDSVAALVPRAEI
EGEMGDSHVGLQARLMSQALRKITSALNQSKTTAIFINQLREKIGVMFGSPETTTGGRALKFYASVRLDIRRIETLK
DGTDAVGNRTRVKVVKNKVAPPFKQAEFDILYGQGISREGGLIDMGVEHGFVRKAGAWYTYEGDQLGQGKENA
RNFLKDNPDLADEIERKIKEKLGVGVRPDAAKAEAATDAAAADTAGTDDAAKSVPAPASKTAKATKATAVKS

An Alignment from ClustalW
sp|P25454|RAD51_YEAST  MSQVQEQHISESQLQYGNGSLMSTVPADLSQSVVDGNGNGSSEDIEATNG 50
sp|P25453|DMC1_YEAST   ---------------------MSVTGTEIDSDTAKN-------------- 15
sp|P48295|RECA_STRVL   ------------MAGTDREKALDAALAQIERQFGKG-------------- 24

sp|P25454|RAD51_YEAST  SGDGGGLQEQAEAQGEMEDEAYDEAALGSFVPIEKLQVNGITMADVKKLR 100
sp|P25453|DMC1_YEAST   -----------------------------ILSVDELQNYGINASDLQKLK 36
sp|P48295|RECA_STRVL   -------------------------------AVMRMGDRTQEPIEVISTG 43

sp|P25454|RAD51_YEAST  ESGLHTAEAVAYAPRKDLLEIKG-ISEAKADKLLNEAARLVPMG----FV 145
sp|P25453|DMC1_YEAST   SGGIYTVNTVLSTTRRHLCKIKG-LSEVKVEKIKEAAGKIIQVG----FI 81
sp|P48295|RECA_STRVL   STALDIALGVGGLPRGRVVEIYGPESSGKTTLTLHAVANAQKAGGQVAFV 93

sp|P25454|RAD51_YEAST  TAADFHMRRSELICLTTGSKNLDTLLGGGVETGSITELFGEFRTGKSQLC 195
sp|P25453|DMC1_YEAST   PATVQLDIRQRVYSLSTGSKQLDSILGGGIMTMSITEVFGEFRCGKTQMS 131
sp|P48295|RECA_STRVL   DAEHALDPEYAKKLGVDIDNLILSQPDNGEQALEIVDML--VRSGALDLI 141
In ClustalW output, a conservation line under each block marks identical columns with “*”, strongly similar columns with “:”, and weakly similar columns with “.”.

Phylogenetic Analysis

Evolution
At the molecular level, evolution is a process of mutation with selection.
Molecular evolution is the study of changes in genes and proteins throughout different branches of the tree of life.
Phylogeny is the inference of evolutionary relationships.
Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are also used for phylogenetic analyses.

Phylogenetic Trees
Phylogenetic trees are trees that describe the “relations” among species (genes, sequences)
  – Evolutionary relationships are shown as branches
      The most closely related sequences are drawn as neighboring branches
  – Length and nesting reflect the degree of similarity between any two items (sequences, species, etc.)
Objective of phylogenetic analysis: determine branch lengths and figure out how the tree should be drawn
  – Dependent upon good multiple sequence alignment programs
  – Group sequences with similar patterns of substitutions

Uses of Phylogenetic Analysis
Phylogeny can answer questions such as:
  – How many genes are related to the gene I am working on?
  – Are humans really closest to chimps and gorillas?
  – How related are chicken, dog, and mouse to zebrafish?
  – Where and when did HIV originate?
  – What is the history of life on earth?
Given a set of genes, determine which genes are likely to have equivalent functions.
Follow changes occurring in a rapidly changing species.
Example: influenza
  – Study rapidly changing genes in the influenza genome, predict next year’s strain, and develop the flu vaccination accordingly

Difficulties With Phylogenetic Analysis
Horizontal or lateral transfer of genetic material (for instance through viruses) makes it difficult to determine the phylogenetic origin of some evolutionary events.
Genes under selective pressure can evolve rapidly, masking earlier changes that had occurred phylogenetically.
Two sites within comparative sequences may be evolving at different rates.
Rearrangements of genetic material can lead to false conclusions.
Duplicated genes can evolve along separate pathways, leading to different functions.

Rooted Trees
One sequence (the root) is defined to be the common ancestor of all other sequences.
The root is chosen as a sequence thought to have branched off earliest.
A rooted tree specifies the evolutionary path for each sequence.
A tree can be rooted using an outgroup (that is, a sequence known to be distantly related to all other sequences).
[Figure: rooted tree drawn from past to present, nodes numbered 1 to 9. Source: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html]

Unrooted Tree
Indicates evolutionary relationships without revealing the location of the oldest ancestry.
[Figure: unrooted tree, nodes numbered 1 to 8]

http://www.ncbi.nlm.nih.gov/About/primer/phylo.html

4 Steps of Phylogenetic Analysis
Molecular phylogenetic analysis may be described in four steps:
  – Selection of sequences for analysis
  – Multiple sequence alignment
  – Tree building
  – Tree evaluation

Selection of Sequences (1/2)
For phylogeny, DNA can be more informative.
  – Protein-coding sequences have synonymous and nonsynonymous substitutions. Thus, some DNA changes do not have corresponding protein changes.
  – Some substitutions in a DNA sequence alignment can be directly observed: single nucleotide substitutions, sequential substitutions, coincidental substitutions.
  – Additional mutational events can be inferred by analysis of ancestral sequences. These changes include parallel substitutions, convergent substitutions, and back substitutions.
  – Pseudogenes and noncoding regions may be analyzed using DNA.

Selection of Sequences (2/2)
For phylogeny, protein sequences are also often used.
  – Proteins have 20 states (amino acids) instead of only four for DNA, so there is a stronger phylogenetic signal.
  – Nucleotides are unordered characters: any one nucleotide can change to any other in one step.
  – An ordered character must pass through one or more intermediate states before reaching the final state.
  – Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value.

Multiple Sequence Alignment
The fundamental basis of a phylogenetic tree is a multiple sequence alignment.
  – Confirm that all sequences are homologous
  – Adjust gap creation and extension penalties as needed to optimize the alignment
  – Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all sequences (species)

Building Tree
Two tree-building methods: distance-based and character-based
  – Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score
  – Character-based methods include maximum parsimony and maximum likelihood
In both distance-based and character-based methods for building a tree, the starting point is a multiple sequence alignment.

Maximum Parsimony
Predicts the evolutionary tree by minimizing the number of steps required to generate the observed variation.
For each position, the phylogenetic trees requiring the smallest number of evolutionary changes to produce the observed sequence changes are identified.
Columns representing greater variation dominate the analysis.
Trees producing the smallest number of changes for all sequence positions are identified.
Time-consuming algorithm.
Only works well if the sequences have a strong sequence similarity.

Maximum Parsimony Example
    1  A A G A G T G C A
    2  A G C C G T G C G
    3  A G A T A T C C A
    4  A G A G A T C C G
Four sequences, three possible unrooted trees:
    Tree 1: (1, 2) | (3, 4)
    Tree 2: (1, 3) | (2, 4)
    Tree 3: (1, 4) | (2, 3)

Maximum Parsimony Example
Some sites are informative, others are not.
A site is informative if there are at least two different kinds of letters at the site, each of which is represented in at least two of the sequences.
Only informative sites are considered.
    1  A A G A G T G C A
    2  A G C C G T G C G
    3  A G A T A T C C A
    4  A G A G A T C C G
Three informative columns (sites 5, 7, and 9).

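Maximum Parsimony Example
The three informative columns:
    1  G G A
    2  G G G
    3  A C A
    4  A C G
Substitutions required by each column on each tree, where Tree 1 = (1, 2) | (3, 4), Tree 2 = (1, 3) | (2, 4), Tree 3 = (1, 4) | (2, 3):
              Tree 1   Tree 2   Tree 3
    Col 1:      1        2        2
    Col 2:      1        2        2
    Col 3:      2        1        2
# of changes:
    Tree 1: 4
    Tree 2: 5
    Tree 3: 6
Tree 1 requires the fewest changes and is therefore the most parsimonious.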
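The counting in this example can be reproduced with a small script. It is a sketch under stated assumptions: the trees are encoded as pairs of sequence indices (0-based), changes are counted with Fitch's set-intersection rule, and the names `informative_sites`, `changes_on_tree`, and `tree_scores` are hypothetical. It handles only the four-sequence case shown on the slides.

```python
from collections import Counter

SEQUENCES = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
# Each unrooted four-taxon tree is described by its two neighbor pairs.
TREES = {
    "Tree 1": ((0, 1), (2, 3)),
    "Tree 2": ((0, 2), (1, 3)),
    "Tree 3": ((0, 3), (1, 2)),
}


def informative_sites(seqs):
    """Indices of columns with at least two letters that each occur at least twice."""
    sites = []
    for idx, column in enumerate(zip(*seqs)):
        frequent = [letter for letter, n in Counter(column).items() if n >= 2]
        if len(frequent) >= 2:
            sites.append(idx)
    return sites


def fitch_join(x, y):
    """Fitch rule: intersect the state sets if possible, else unite and count one change."""
    common = x & y
    return (common, 0) if common else (x | y, 1)


def changes_on_tree(column, tree):
    """Minimum number of substitutions one column needs on one four-taxon tree."""
    (a, b), (c, d) = tree
    left, cost_left = fitch_join({column[a]}, {column[b]})
    right, cost_right = fitch_join({column[c]}, {column[d]})
    _, cost_middle = fitch_join(left, right)
    return cost_left + cost_right + cost_middle


def tree_scores(seqs):
    columns = list(zip(*seqs))
    sites = informative_sites(seqs)
    return {name: sum(changes_on_tree(columns[i], tree) for i in sites)
            for name, tree in TREES.items()}


print(informative_sites(SEQUENCES))  # [4, 6, 8]
print(tree_scores(SEQUENCES))        # {'Tree 1': 4, 'Tree 2': 5, 'Tree 3': 6}
```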

Distance Methods
Look at the number of changes between each pair in a group of sequences.
Identify the tree that positions neighbors correctly and whose branch lengths reproduce the original data as closely as possible.
The distance score is counted as:
  – # of mismatched positions in the alignment
  – # of sequence positions changed to generate the second sequence
Success depends on the degree to which the distances are additive on a predicted evolutionary tree.

Example of Distance Analysis
Consider the alignment:
    A  ACGCGTTGGGCGATGGCAAC
    B  ACGCGTTGGGCGACGGTAAT
    C  ACGCATTGAATGATGATAAT
    D  ACACATTGAGTGATAATAAT
Calculate distances (# of differences):
         A   B   C   D
    A    -   3   7   8
    B        -   6   7
    C            -   3
    D                -
Using this information, a tree can be drawn in which A and B are neighbors and C and D are neighbors.
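The pairwise differences for this alignment can be counted directly. The snippet below is a minimal sketch: it assumes the sequences are already aligned and of equal length, and uses the plain count of mismatched positions as the distance (no correction for multiple substitutions). The name `count_differences` is hypothetical.

```python
from itertools import combinations

ALIGNMENT = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def count_differences(x: str, y: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(x, y) if a != b)


distances = {
    (p, q): count_differences(ALIGNMENT[p], ALIGNMENT[q])
    for p, q in combinations(ALIGNMENT, 2)
}
for pair, d in distances.items():
    print(pair, d)
```

The smallest distances (A-B and C-D) identify the pairs that end up as neighbors in the tree.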

Maximum Likelihood (ML)
A likelihood is calculated for the probability of each residue in an alignment, based upon some model of the substitution process.
A maximum likelihood method constructs a phylogenetic tree from DNA sequences whose likelihood is a maximum.
This corresponds to the tree that makes the data the most probable evolutionary outcome
  – Calculates the likelihood of a tree given an alignment
  – The probability of each tree is the product of the mutation rates in each branch
  – The likelihoods given by each column are multiplied to give the likelihood of the tree
This approach requires an explicit model of evolution, which is both a strength and a weakness because the results depend on the model used.
This method can also be very computationally expensive.
Can only be done for a handful of sequences.

Which Method to Choose?
Depends upon the sequences that are being compared
  – Strong sequence similarity: maximum parsimony
  – Clearly recognizable sequence similarity: distance methods
  – All others: maximum likelihood
Best to choose at least two approaches.
Compare the results: if they are similar, you can have more confidence.

Evaluating Trees
The main criteria by which the accuracy of a phylogenetic tree is assessed are consistency, efficiency, and robustness.
Bootstrapping is a commonly used approach to measuring the robustness of a tree topology
  – Given a branching order, how consistently does an algorithm find that branching order in a randomly resampled version of the original data set?
  – To bootstrap, make an artificial dataset by randomly sampling columns, with replacement, from your multiple sequence alignment.
  – Make the dataset the same size as the original.
  – Do 100 bootstrap replicates.
  – Observe the percent of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates.
  – >70% is considered significant.
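The column-resampling step can be sketched as below. This covers only the creation of bootstrap datasets; building a tree from each replicate and counting clade support is left to whatever tree-building method is in use. The function name `bootstrap_replicate`, the fixed seed, and the reuse of the four-sequence parsimony example are assumptions for illustration.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Sample alignment columns with replacement; output has the original size."""
    length = len(alignment[0])
    picked = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[c] for c in picked) for row in alignment]


alignment = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
rng = random.Random(42)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
print(len(replicates), len(replicates[0]), len(replicates[0][0]))  # 100 4 9
```

Because whole columns are sampled, each replicate keeps the residues of all sequences at a site together; only the set and multiplicity of sites changes.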

MEGA 5: Molecular Evolutionary Genetics Analysis
http://www.megasoftware.net/
Human, mouse, rat, and zebrafish CFTR gene
Multiple sequence alignment by ClustalW
Build a tree using maximum parsimony
The obtained phylogenetic tree

Homework 2
Retrieve the BRCA1 gene in human (Homo sapiens), mouse (Mus musculus), cow (Bos taurus), and dog (Canis lupus familiaris).
Use the FASTA program to perform all-against-all pairwise sequence alignments.
Create a multiple sequence alignment with ClustalW using the web server.
Build phylogenetic trees using different methods (such as neighbor joining, minimum evolution, UPGMA, and maximum parsimony, as implemented in MEGA).
