Alignment of Genomic Sequences Wen-Hsiung Li Ecology & Evolution Univ. of Chicago.

1 Alignment of Genomic Sequences Wen-Hsiung Li Ecology & Evolution Univ. of Chicago

3 Sequence Alignment
An alignment consists of:
(1) pairs of matched bases
(2) pairs of mismatched bases
(3) pairs consisting of a base from one sequence and a gap (null base) from the other sequence

4 Alignment as an Evolutionary Hypothesis
    TCAGA
    ** *
    TC-GT

5
    A: TCAGACGATTG    L_A = 11
    B: TCGGAGCTG      L_B = 9

6 Alignment I
    TCAG-ACG-ATTG
    || | | |  | |
    TC-GGA-GC-T-G
    Matches = 7, Gaps = 6

7 Alignment II
    TCAGACGATTG
    || ||
    TCGGAGCTG--
    Matches = 4, Gaps = 1

8 Alignment III
    TCAG-ACGATTG
    || | | | |
    TC-GGA-GCTG-
    Matches = 6, Gaps = 4

9 Which alignment is best?

10 Gap and Mismatch Penalties
Gap penalty: a factor by which gap values are multiplied to make the gaps equivalent to mismatches.
Mismatch penalty: an assessment of how frequently substitutions occur.

11 Similarity Index
$S = x - \sum_k w_k z_k$
x: number of matches
z_k: number of gaps of length k
w_k: positive number representing the penalty for gaps of length k

12 Distance (Dissimilarity) Index
$D = y + \sum_k w'_k z_k$
y: number of mismatches
z_k: number of gaps of length k
w'_k: positive number representing the penalty for gaps of length k

13 Gap penalty systems
Fixed: no gap extension penalty.
Affine or linear: has two components, a gap opening penalty and a gap extension penalty.
Logarithmic: also has two components, but the cost increases more slowly with gap length, allowing longer gaps than the affine system.

14 Gap penalty systems
[Figure: gap penalty plotted against gap length for the linear, logarithmic, and fixed systems]

15
    Scheme 1: gap opening cost = 2, gap extension cost = 6
    Scheme 2: gap opening cost = 3, gap extension cost = 0

    Alignment                         Scheme 1         Scheme 2
    TCAG-ACG-ATTG / TC-GGA-GC-T-G     S = -5           S = -11
    TCAGACGATTG   / TCGGAGCTG--       S = -4           S = 1 (BEST)
    TCAG-ACGATTG  / TC-GGA-GCTG-      S = -2 (BEST)    S = -6
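The scores for the three alignments and the two sets of gap costs above can be checked with a short script. This is a minimal sketch, not the author's code. It assumes that a gap of length k costs `gap_open + gap_extend * (k - 1)`, which is the convention that reproduces the numbers on this slide; the function names are hypothetical.

```python
import re

def gap_lengths(row):
    """Lengths of the runs of '-' in one row of an alignment."""
    return [len(m.group()) for m in re.finditer(r"-+", row)]

def similarity(top, bottom, gap_open, gap_extend):
    """S = matches - sum of gap penalties (assumed cost: open + extend * (k - 1))."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    matches = sum(1 for p, q in zip(top, bottom) if p == q and p != "-")
    penalty = sum(
        gap_open + gap_extend * (k - 1)
        for row in (top, bottom)
        for k in gap_lengths(row)
    )
    return matches - penalty

# The three candidate alignments of TCAGACGATTG and TCGGAGCTG.
ALIGNMENTS = {
    "I": ("TCAG-ACG-ATTG", "TC-GGA-GC-T-G"),
    "II": ("TCAGACGATTG", "TCGGAGCTG--"),
    "III": ("TCAG-ACGATTG", "TC-GGA-GCTG-"),
}

def score_table(gap_open, gap_extend):
    """Score every alignment under one pair of gap costs."""
    return {
        name: similarity(top, bottom, gap_open, gap_extend)
        for name, (top, bottom) in ALIGNMENTS.items()
    }
```

Calling `score_table(2, 6)` and `score_table(3, 0)` gives one score per alignment for each scheme. The highest-scoring alignment changes with the gap costs, which is the point of the slide.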

16 Dynamic programming
Large searches are divided into a succession of small stages:
- the solution of the initial search stage is trivial
- each partial solution in a later stage can be calculated by reference to only a small number of solutions of the earlier stage
- the final stage contains the overall solution

17 Pointer values and paths connecting the pointers
        A  T  G  C  G
    A   1  0  0  0  0
    T   0  2  1  1  1
    C   0  1  2  3  2
    C   0  1  2  3  3
    G   0  1  3  2  4
    C   0  1  2  4  3
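The slide does not state the scoring rule behind this matrix. One rule that reproduces every value is sketched below, as an assumption rather than the author's stated method: a match scores 1, a mismatch scores 0, gaps are free, and each cell adds its own match score to the best value found in the previous row (strictly to the left) or the previous column (strictly above). The function name is hypothetical.

```python
def score_matrix(column_seq, row_seq):
    """Fill a match-count matrix, one stage (cell) at a time.

    Assumed rule: cell = match score (1 or 0) + the best earlier cell,
    searched in the previous row (columns to the left) and in the
    previous column (rows above). Gaps carry no penalty in this sketch.
    """
    n_rows, n_cols = len(row_seq), len(column_seq)
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            match = 1 if row_seq[i] == column_seq[j] else 0
            best_previous = 0
            if i > 0 and j > 0:
                from_row_above = max(matrix[i - 1][:j])
                from_column_left = max(matrix[r][j - 1] for r in range(i))
                best_previous = max(from_row_above, from_column_left)
            matrix[i][j] = match + best_previous
    return matrix

# Column sequence ATGCG, row sequence ATCCGC, as in the matrix above.
for row in score_matrix("ATGCG", "ATCCGC"):
    print(row)
```

The largest value in the matrix, 4, is the number of matched bases in the best alignments, and a traceback starts from such a cell.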

18 Traceback
        A  T  G  C  G
    A   1  0  0  0  0
    T   0  2  1  1  1
    C   0  1  2  3  2
    C   0  1  2  3  3
    G   0  1  3  2  4
    C   0  1  2  4  3

    ATGCG-        AT--GCG
    ||            ||
    ATCCGC        ATCCGC-

19 Similarity Index
$S = x - \sum_k w_k z_k$
x: number of matches
z_k: number of gaps of length k
w_k: a positive number representing the penalty for gaps of length k

20 (each gap of length k is penalized by w_k = a + bk)
    (I)    TCAGACGAGTG     x = 6
           || ||    ||     a gap of 2 bp
           TCGGA--GCTG     S = 6 - (a + 2b)

    (II)   TCAGACGAGTG     x = 7
           || || |  ||     2 gaps of 1 bp
           TCGGA-GC-TG     S = 7 - 2(a + b)

    (III)  TCAGACGAGTG     x = 7
           || || |  ||     2 gaps of 1 bp
           TCGGA-G-CTG     S = 7 - 2(a + b)

    (IV)   TCAGACGAG-TG    x = 8
           || |  ||| ||    two 1-bp gaps; one 2-bp gap
           TC-G--GAGCTG    S = 8 - 2(a + b) - (a + 2b)

21 How to align two long genomic sequences?

22 Traditional Seq. Alignment
The seqs. are usually known (coding or non-coding) and are homologous.
They are not very long, usually < 10,000 base pairs (bp).
They contain no inversions.
Relies on dynamic programming: the time and space required are O(N^2), where N is the sequence length.

23 The Human Genome Genome size: ~3.2 billion bp Only ~1.5% is coding. Contains numerous repetitive elements (more than 4 million). Introns are usually longer than exons. Non-coding regions evolve fast and are not well conserved.

24 Genomic Seq. Alignment The seqs. can be > one million bp (Mb); e.g., the genome size of Mycobacterium tuberculosis is about 4 Mb. Long time to align. Large computer memory. May contain inversions and many tandem repeats. May contain non-alignable (too divergent) segments.
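A rough worked figure shows why the O(N^2) cost of dynamic programming becomes a problem at this scale. It takes N = 4 Mb from the M. tuberculosis example and assumes, purely for illustration, one byte of storage per matrix cell:

```latex
N^2 = (4 \times 10^{6})^2 = 1.6 \times 10^{13}\ \text{cells}
\quad\Rightarrow\quad
1.6 \times 10^{13}\ \text{bytes} \approx 16\ \text{TB at one byte per cell}
```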

25 Genomic Seq. Alignment Strategy: Search for anchors that can divide the sequences into subregions. The gaps between anchors can then be aligned by a local alignment algorithm.

26 The System of Delcher et al. (1999) Three ideas: (1) Suffix trees; (2) the Longest Increasing Subsequence (LIS); and (3) the local alignment method of Smith and Waterman (1981) Two closely homologous long sequences or genomes (A and B).

27 Step 1: Perform a Maximal Unique Match (MUM) decomposition of the two sequences.
A MUM is a subsequence that occurs exactly once in sequence A and once in sequence B, and is not contained in any longer such sequence.

28 Max. Unique Matches (MUMs)
    MUM1
    Seq. A   tcgatcaAGCTCACTGATatgtaccat
    Seq. B    cgagcgAGCTCACTGATcctgcatca

    MUM2
    Seq. A   -acgctgaATCGACGTAGTCCATGtactgta
    Seq. B   agtgc-agATCGACGTAGTCCATGatgaat
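The MUM definition can be made concrete with a brute-force search. This is a sketch for short sequences only. It deliberately replaces the suffix-tree method named in the talk with a quadratic scan over all start positions, and the function names and the `min_len` parameter are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count

def find_mums(a, b, min_len=1):
    """Return (start_in_a, start_in_b, length) for each MUM, 0-based positions.

    A match is kept if it cannot be extended to the left or right and
    its string occurs exactly once in a and exactly once in b.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the bases agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            match = a[i:i + k]
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, k))
    return mums

# First example pair from the slide, upper-cased so that case is ignored.
seq_a = "tcgatcaAGCTCACTGATatgtaccat".upper()
seq_b = "cgagcgAGCTCACTGATcctgcatca".upper()
for start_a, start_b, length in find_mums(seq_a, seq_b, min_len=8):
    print(start_a, start_b, seq_a[start_a:start_a + length])
```

A minimum length is used because very short unique matches are usually chance hits. The value 8 here is arbitrary.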

29 Suffix Trees
A suffix is a subseq. that begins at any position in the seq. & extends to the seq. end.
    g a a c c g a c c t
    1 2 3 4 5 6 7 8 9 10
A suffix: c c g a c c t
A suffix tree is a compact representation that stores all possible suffixes of a seq.
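To make the notion of a suffix concrete, the sketch below lists every suffix of the example sequence with its 1-based start position, then sorts the start positions by suffix. The sorted list is a suffix array, a simpler relative of the suffix tree. It is shown here only for illustration and is not the structure used in the talk; the function names are hypothetical.

```python
def suffixes(seq):
    """Return (start_position, suffix) pairs, using 1-based positions."""
    return [(i + 1, seq[i:]) for i in range(len(seq))]

def suffix_array(seq):
    """Start positions of all suffixes, ordered alphabetically by suffix."""
    return [pos for pos, _ in sorted(suffixes(seq), key=lambda item: item[1])]

example = "gaaccgacct"
by_position = dict(suffixes(example))
# by_position[4] is the suffix shown on the slide: "ccgacct"
```

Suffixes that share a prefix, such as `accgacct` and `acct`, end up next to each other in the sorted order. A suffix tree stores such a shared prefix only once, along a common path from the root.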

30 [Figure: suffix tree, drawn from the root, for the sequence g a a c c g a c c t (positions 1 to 10)]

31 [Figure: suffix tree, drawn from the root, for two sequences together: g a a c c g a c c t# and g a a c c t a c c t* (positions 1 to 10)]

32 Step 2: Sort the MUMs
After finding the MUMs, we sort them according to their positions in genome A. See figure.
Longest Increasing Subsequence (LIS): if the order of B positions is given by the sequence [1,2,10,4,5,8,6,7,9,3], the LIS is [1,2,4,5,6,7,9].
The LIS gives a global MUM-alignment.

33 [Figure: MUM order in the two genomes, before and after selecting the LIS]
    Genome A: 1 2 3 4 5 6 7
    Genome B: 1 3 2 4 6 7 5

    Genome A: 1 2 4 6 7
    Genome B: 1 2 4 6 7
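The LIS step can be sketched with the standard O(n log n) method based on binary search. This is a plain LIS over the B positions of the MUMs, listed in genome A order. Whether the original system also weights MUMs by length is not stated in these slides, so no weighting is attempted here, and the function name is hypothetical.

```python
from bisect import bisect_left

def longest_increasing_subsequence(values):
    """Return one longest strictly increasing subsequence of values."""
    tail_values = []   # tail_values[k]: smallest tail of a run of length k + 1
    tail_indices = []  # index in values of each of those tails
    previous = [-1] * len(values)  # back-pointers for reconstruction
    for index, value in enumerate(values):
        k = bisect_left(tail_values, value)
        if k == len(tail_values):
            tail_values.append(value)
            tail_indices.append(index)
        else:
            tail_values[k] = value
            tail_indices[k] = index
        previous[index] = tail_indices[k - 1] if k > 0 else -1
    # Walk the back-pointers from the tail of the longest run.
    result = []
    index = tail_indices[-1] if tail_indices else -1
    while index != -1:
        result.append(values[index])
        index = previous[index]
    return result[::-1]

print(longest_increasing_subsequence([1, 2, 10, 4, 5, 8, 6, 7, 9, 3]))
print(longest_increasing_subsequence([1, 3, 2, 4, 6, 7, 5]))
```

The MUMs left out of the LIS (10, 8, and 3 in the first example) are the ones whose order differs between the two genomes.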

34 Step 3: Close the gaps between MUMs
Use the Smith-Waterman algorithm to close the gaps between MUMs.
Some regions may be very difficult to align. These regions are ignored and considered as non-alignable parts.
Default: If the gap between 2 MUMs is 10 kb, no local alignment is attempted.
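For reference, a minimal Smith-Waterman scorer looks like the sketch below. The scoring values (match 2, mismatch -1, gap -2) and the linear gap cost are illustrative assumptions, not parameters taken from the talk. The function returns only the best local score and performs no traceback.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap cost, assumed values)."""
    previous_row = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current_row = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current_row[j] = max(
                0,                                  # start a new local alignment
                previous_row[j - 1] + substitution, # align a[i-1] with b[j-1]
                previous_row[j] + gap,              # gap in b
                current_row[j - 1] + gap,           # gap in a
            )
            best = max(best, current_row[j])
        previous_row = current_row
    return best

# The shared segment ACGT scores 4 matches * 2 = 8.
print(smith_waterman_score("TTTACGTAAA", "GGGACGTCCC"))
```

Keeping only two rows in memory is enough when the score alone is needed. Recovering the alignment itself requires either the full matrix or stored pointers.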

