Alignment of Genomic Sequences Wen-Hsiung Li Ecology & Evolution Univ. of Chicago
Sequence Alignment
An alignment consists of:
(1) pairs of matched bases
(2) pairs of mismatched bases
(3) pairs consisting of a base from one sequence and a gap (null base) from the other sequence
Alignment as an Evolutionary Hypothesis
TCAGA
** *
TC-GT
Alignment I
TCAG-ACG-ATTG
|| | | |  | |
TC-GGA-GC-T-G
Matches = 7, Gaps = 6
Alignment II
TCAGACGATTG
|| ||
TCGGAGCTG--
Matches = 4, Gaps = 1
Alignment III
TCAG-ACGATTG
|| | | | |
TC-GGA-GCTG-
Matches = 6, Gaps = 4
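The match and gap counts on the three slides above can be checked mechanically. The following is a minimal sketch, assuming that a "gap" means a run of consecutive `-` characters in one row (so the `--` at the end of Alignment II counts once); the function name `matches_and_gaps` is hypothetical and not part of the original slides.

```python
def matches_and_gaps(top, bottom):
    """Count matched columns and gap runs in a pairwise alignment.

    Assumption: a gap is a maximal run of '-' within one row, so a
    run in the top row followed directly by a run in the bottom row
    counts as two gaps.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    matches = 0
    gaps = 0
    gap_row = None  # row holding the gap run we are currently inside
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            if row != gap_row:
                gaps += 1  # a new gap run starts here
            gap_row = row
        else:
            gap_row = None
            if x == y:
                matches += 1
    return matches, gaps


print(matches_and_gaps("TCAG-ACG-ATTG", "TC-GGA-GC-T-G"))  # Alignment I
print(matches_and_gaps("TCAGACGATTG", "TCGGAGCTG--"))      # Alignment II
print(matches_and_gaps("TCAG-ACGATTG", "TC-GGA-GCTG-"))    # Alignment III
```

The counts alone do not pick a winner: Alignment I has the most matches but also the most gaps, which is why a scoring scheme is needed.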
Which alignment is best?
Gap and Mismatch Penalties
Gap penalty: a factor by which gap values are multiplied to make the gaps equivalent to mismatches
Mismatch penalty: an assessment of how frequently substitutions occur
Similarity Index
$S = x - \sum_k w_k z_k$
$x$: number of matches
$z_k$: number of gaps of length $k$
$w_k$: positive number representing the penalty for gaps of length $k$
Distance (Dissimilarity) Index
$D = y + \sum_k w'_k z_k$
$y$: number of mismatches
$z_k$: number of gaps of length $k$
$w'_k$: positive number representing the penalty for gaps of length $k$
Gap penalty systems
Fixed: no gap extension penalty, so the cost does not depend on gap length
Affine or Linear: has two components, a gap opening penalty and a gap extension penalty
Logarithmic: also has two components, but the cost increases more slowly with gap length, allowing longer gaps than the affine system
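The three systems can be compared as functions of gap length. This is a minimal sketch with assumed functional forms: the slides name the systems but give no formulas, so the expressions below, the function names, and the default parameter values are illustrative choices only. Here the opening cost is charged for the first gapped position and the extension cost for each further position.

```python
import math


def fixed_penalty(k, gap_cost=3.0):
    """Fixed system: the same cost for a gap of any length k >= 1."""
    return gap_cost


def affine_penalty(k, opening=2.0, extension=1.0):
    """Affine (linear) system: cost grows linearly with gap length k.

    Assumed form: opening + extension * (k - 1).
    """
    return opening + extension * (k - 1)


def logarithmic_penalty(k, opening=2.0, extension=1.0):
    """Logarithmic system: cost grows with the log of gap length k.

    Assumed form: opening + extension * ln(k).
    """
    return opening + extension * math.log(k)


for k in (1, 2, 10, 100):
    print(k, fixed_penalty(k), affine_penalty(k), round(logarithmic_penalty(k), 2))
```

With these forms all three agree on the cost ordering for short gaps, but for long gaps the logarithmic cost stays far below the affine cost, which is what makes long gaps affordable.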
[Figure: Gap penalty systems. Gap penalty plotted against gap length for the linear, logarithmic, and fixed systems.]
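Scores of Alignments I-III under two gap penalty settings
Setting 1: Gap opening cost = 2, Gap extension cost = 6
Setting 2: Gap opening cost = 3, Gap extension cost = 0

TCAG-ACG-ATTG
|| | | |  | |
TC-GGA-GC-T-G
Setting 1: S = -5    Setting 2: S = -11

TCAGACGATTG
|| ||
TCGGAGCTG--
Setting 1: S = -4    Setting 2: S = 1

TCAG-ACGATTG
|| | | | |
TC-GGA-GCTG-
Setting 1: S = -2    Setting 2: S = -6

BEST (highest S): the third alignment under setting 1, the second alignment under setting 2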
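The six scores on this slide can be reproduced from the similarity index. This is a sketch under one assumption that the slide does not state: a gap of length k costs `opening + extension * (k - 1)`. That form was chosen because it reproduces all six numbers; the helper names are hypothetical.

```python
def gap_penalty(k, opening, extension):
    # Assumed form: the opening cost covers the first gapped position,
    # and each additional position adds the extension cost.
    return opening + extension * (k - 1)


def similarity(matches, gap_lengths, opening, extension):
    """S = x - sum of gap penalties, with x the number of matches."""
    return matches - sum(gap_penalty(k, opening, extension) for k in gap_lengths)


# (number of matches, list of gap lengths) for each alignment on the slide
alignments = {
    "I": (7, [1, 1, 1, 1, 1, 1]),
    "II": (4, [2]),
    "III": (6, [1, 1, 1, 1]),
}

for name, (x, gaps) in alignments.items():
    s1 = similarity(x, gaps, opening=2, extension=6)
    s2 = similarity(x, gaps, opening=3, extension=0)
    print(name, s1, s2)
```

The ranking flips between the two settings: a high extension cost punishes the single 2-bp gap of Alignment II, while a zero extension cost rewards it for having only one gap.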
Dynamic programming
Large searches are divided into a succession of small stages:
(1) the solution of the initial search stage is trivial
(2) each partial solution in a later stage can be calculated by reference to only a small number of solutions of the earlier stage
(3) the final stage contains the overall solution
Pointer values and paths connecting the pointers
    A  T  G  C  G
A   1  0  0  0  0
T   0  2  1  1  1
C   0  1  2  3  2
C   0  1  2  3  3
G   0  1  3  2  4
C   0  1  2  4  3
Traceback (through the matrix above)
ATGCG-
||
ATCCGC

AT--GCG
||
ATCCGC-
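The matrix values can be regenerated with a short dynamic programming routine. The slides do not state the recurrence, so the one below is inferred: each cell is 1 for a match (0 for a mismatch) plus the largest value among the earlier cells of the previous row and the previous column, with no gap penalty. It reproduces every number in the matrix for ATCCGC (rows) against ATGCG (columns); the function name `fill_matrix` is hypothetical.

```python
def fill_matrix(rows, cols):
    """Fill a match-count matrix stage by stage.

    Inferred recurrence (an assumption, not stated on the slides):
    m[i][j] = (1 if rows[i] == cols[j] else 0)
              + max(row i-1 at columns < j, column j-1 at rows < i)
    """
    m = [[0] * len(cols) for _ in rows]
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            best = 0
            if i > 0 and j > 0:
                prev_row = max(m[i - 1][:j])                    # row i-1, columns 0..j-1
                prev_col = max(m[k][j - 1] for k in range(i))   # column j-1, rows 0..i-1
                best = max(prev_row, prev_col)
            m[i][j] = (1 if r == c else 0) + best
    return m


for base, row in zip("ATCCGC", fill_matrix("ATCCGC", "ATGCG")):
    print(base, row)
```

The largest value, 4, appears twice (once in the last row and once in the last column), which is consistent with the traceback yielding two alignments.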
Similarity Index
$S = x - \sum_k w_k z_k$
$x$: number of matches
$z_k$: number of gaps of length $k$
$w_k$: a positive number representing the penalty for gaps of length $k$
(I) x = 6; a gap of 2 bp; S = 6 - (a + 2b)
TCAGACGAGTG
|| ||    ||
TCGGA--GCTG

(II) x = 7; 2 gaps of 1 bp; S = 7 - 2(a + b)
TCAGACGAGTG
|| || |  ||
TCGGA-GC-TG

(III) x = 7; 2 gaps of 1 bp; S = 7 - 2(a + b)
TCAGACGAGTG
|| || |  ||
TCGGA-G-CTG

(IV) x = 8; two 1-bp gaps and one 2-bp gap; S = 8 - 2(a + b) - (a + 2b)
TCAGACGAG-TG
|| |  ||| ||
TC-G--GAGCTG
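Which of the four alignments scores highest depends on the penalty parameters. The short derivation below works this out from the four expressions above, assuming (as those expressions imply) that a gap of length $k$ costs $w_k = a + bk$ with $a, b \ge 0$.

```latex
\begin{align*}
S_{\mathrm{I}} &= 6 - a - 2b, &
S_{\mathrm{II}} = S_{\mathrm{III}} &= 7 - 2a - 2b, &
S_{\mathrm{IV}} &= 8 - 3a - 4b \\[4pt]
S_{\mathrm{I}} - S_{\mathrm{II}} &= a - 1, &
S_{\mathrm{II}} - S_{\mathrm{IV}} &= a + 2b - 1, &
S_{\mathrm{I}} - S_{\mathrm{IV}} &= 2(a + b - 1)
\end{align*}
```

So alignment (I) is best when $a > 1$, alignments (II) and (III) tie for best when $a < 1$ and $a + 2b > 1$, and alignment (IV) is best when $a + 2b < 1$.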
How to align two long genomic sequences?
Traditional Seq. Alignment
The seqs. are usually known (coding or non-coding) and are homologous.
They are not very long, usually < 10,000 base pairs (bp).
They contain no inversions.
Relies on dynamic programming: the time and space required are $O(N^2)$, where $N$ is the sequence length.
The Human Genome Genome size: ~3.2 billion bp Only ~1.5% is coding. Contains numerous repetitive elements (more than 4 million). Introns are usually longer than exons. Non-coding regions evolve fast and are not well conserved.
Genomic Seq. Alignment
The seqs. can be > one million bp (Mb); e.g., the genome size of Mycobacterium tuberculosis is about 4 Mb.
Alignment takes a long time and requires a large amount of computer memory.
May contain inversions and many tandem repeats.
May contain non-alignable (too divergent) segments.
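A rough calculation shows why quadratic-space dynamic programming does not scale to genomes of this size. The sketch below assumes 4 bytes per matrix cell, which is an illustrative figure and not one given in the slides; the function name is hypothetical.

```python
def dp_matrix_bytes(n, m, bytes_per_cell=4):
    """Memory for a full n x m dynamic programming matrix.

    bytes_per_cell=4 is an assumption (one 32-bit integer per cell).
    """
    return n * m * bytes_per_cell


# Two sequences of about 4 Mb each, as in the M. tuberculosis example
size = dp_matrix_bytes(4_000_000, 4_000_000)
print(size / 1e12, "TB")
```

Under that assumption the full matrix for two 4 Mb sequences needs tens of terabytes, compared with about 400 MB for two 10,000 bp sequences.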
Genomic Seq. Alignment Strategy: Search for anchors that can divide the sequences into subregions. The gaps between anchors can then be aligned by a local alignment algorithm.
The System of Delcher et al. (1999) Three ideas: (1) Suffix trees; (2) the Longest Increasing Subsequence (LIS); and (3) the local alignment method of Smith and Waterman (1981) Two closely homologous long sequences or genomes (A and B).
Step 1: Perform a Maximal Unique Match (MUM) decomposition of the two sequences
A MUM is a subsequence that occurs exactly once in sequence A and once in sequence B, and is not contained in any longer such sequence.
Max. Unique Matches (MUMs)
MUM1 (uppercase)
Seq. A tcgatcaAGCTCACTGATatgtaccat
Seq. B cgagcgAGCTCACTGATcctgcatca
MUM2 (uppercase)
Seq. A -acgctgaATCGACGTAGTCCATGtactgta
Seq. B agtgc-agATCGACGTAGTCCATGatgaat
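The MUM definition can be made concrete with a brute-force search. This is a minimal sketch for short sequences only: it checks every pair of start positions, which is far slower than the suffix tree approach the system uses. The function names and the `min_len` parameter are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=1):
    """Return (pos_in_a, pos_in_b, match) for every MUM of length >= min_len.

    A match is reported only if it cannot be extended to the left or
    right and occurs exactly once in each sequence.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # extendable to the left, so not maximal here
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1  # extend to the right as far as possible
            match = a[i:i + k]
            if (k >= min_len
                    and count_occurrences(a, match) == 1
                    and count_occurrences(b, match) == 1):
                mums.append((i, j, match))
    return mums


seq_a = "tcgatcaAGCTCACTGATatgtaccat".upper()
seq_b = "cgagcgAGCTCACTGATcctgcatca".upper()
print(find_mums(seq_a, seq_b, min_len=10))
```

On the first pair of sequences from the slide, with a minimum length of 10, the only MUM found is the uppercase segment AGCTCACTGAT.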
Suffix Trees
A suffix is a subseq. that begins at any position in the seq. and extends to the seq. end.
Sequence: gaaccgacct
A suffix: ccgacct
A suffix tree is a compact representation that stores all possible suffixes of a seq.
[Figure: suffix tree, drawn from the root, for the sequence gaaccgacct]
[Figure: suffix tree, drawn from the root, for the two sequences gaaccgacct# and gaacctacct*, where # and * mark the ends of the two sequences]
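The idea of storing every suffix in one tree can be illustrated with a simpler structure. The sketch below builds an uncompressed suffix trie from nested dictionaries; it is not a true suffix tree, because it does not merge single-child paths into one edge and therefore uses quadratic space. The function names are hypothetical.

```python
def build_suffix_trie(seq):
    """Insert every suffix of seq into a trie of nested dicts.

    Simplification: one node per character, so space is O(n^2),
    unlike the compact suffix tree described on the slides.
    """
    root = {}
    for start in range(len(seq)):
        node = root
        for ch in seq[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """True if pattern is a substring of the indexed sequence."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("gaaccgacct")
print(contains(trie, "ccgacct"))  # the suffix shown on the slide
print(contains(trie, "gact"))     # not a substring of gaaccgacct
```

Because every substring is a prefix of some suffix, a lookup walks at most one character per step from the root, independent of the length of the indexed sequence.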
Step 2: Sort the MUMs
After finding the MUMs, we sort them according to their positions in genome A. See figure.
Longest Increasing Subsequence (LIS): If the order of B positions is given by the sequence [1,2,10,4,5,8,6,7,9,3], the LIS is [1,2,4,5,6,7,9].
The LIS gives a global MUM-alignment.
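The LIS in the example above can be computed in O(n log n) time. This is a minimal sketch of the standard binary-search method for a plain list of B positions; it ignores MUM lengths and overlaps, which a real genome aligner would also have to consider. The function name is hypothetical.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Return one longest strictly increasing subsequence of values."""
    tails = []      # tails[k]: smallest last value of any increasing run of length k+1
    tails_idx = []  # index in values of the element stored in tails[k]
    prev = [-1] * len(values)  # back-pointers used to rebuild the subsequence
    for i, v in enumerate(values):
        k = bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
            tails_idx.append(i)
        else:
            tails[k] = v
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    result = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        result.append(values[i])
        i = prev[i]
    return result[::-1]


print(longest_increasing_subsequence([1, 2, 10, 4, 5, 8, 6, 7, 9, 3]))
# prints [1, 2, 4, 5, 6, 7, 9]
```

MUMs whose B positions fall outside the LIS (here 10, 8, and 3) are out of order relative to genome A and are left out of the global MUM-alignment.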
[Figure: two panels showing MUM positions in Genome A and Genome B]
Step 3: Close the gaps between MUMs
Use the Smith-Waterman algorithm to close the gaps between MUMs.
Some regions may be very difficult to align. These regions are ignored and treated as non-alignable parts.
Default: If the gap between 2 MUMs exceeds 10 kb, no local alignment is attempted.
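The local alignment used to close a gap can be sketched as follows. This is a minimal Smith-Waterman scoring routine that returns only the best local score, with assumed parameters (match +2, mismatch -1, a linear gap cost of -2 per position); the slides do not give the scoring values used by the system, and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b.

    Scoring values are illustrative assumptions. Only two rows of the
    matrix are kept, so memory is O(len(b)) instead of O(len(a) * len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # shared segment ACG
```

A score of 0 means that no positively scoring local alignment exists, which is one simple way to flag a region between two MUMs as non-alignable.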