Alignment of Genomic Sequences Wen-Hsiung Li Ecology & Evolution Univ. of Chicago.


Sequence Alignment
An alignment consists of three kinds of pairs:
(1) pairs of matched bases
(2) pairs of mismatched bases
(3) pairs consisting of a base from one sequence and a gap (null base) from the other sequence

Alignment as an Evolutionary Hypothesis

    TCAGA
    ** *
    TC-GT

A: TCAGACGATTG    L_A = 11
B: TCGGAGCTG      L_B = 9

Alignment I

    TCAG-ACG-ATTG
    || | | |  | |
    TC-GGA-GC-T-G

Matches = 7, Gaps = 6

Alignment II

    TCAGACGATTG
    || ||
    TCGGAGCTG--

Matches = 4, Gaps = 1

Alignment III

    TCAG-ACGATTG
    || | | | |
    TC-GGA-GCTG-

Matches = 6, Gaps = 4

Which alignment is best?

Gap and Mismatch Penalties
Gap penalty: a factor by which gap values are multiplied to make the gaps equivalent to mismatches.
Mismatch penalty: an assessment of how frequently substitutions occur.

Similarity Index
$S = x - \sum_k w_k z_k$
x: number of matches
z_k: number of gaps of length k
w_k: positive number representing the penalty for gaps of length k

Distance (Dissimilarity) Index
$D = y + \sum_k w'_k z_k$
y: number of mismatches
z_k: number of gaps of length k
w'_k: positive number representing the penalty for gaps of length k
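The two indices can be computed directly from a pair of aligned rows. The following is a minimal sketch, not part of the original slides: the function names and the example penalty w(k) = 1 + k are assumptions chosen for illustration, and a run of consecutive gap characters in one row is counted as a single gap of that length.

```python
from collections import Counter

def pair_counts(top, bottom, gap="-"):
    """Return (matches, mismatches, Counter{gap_length: count}) for two aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = 0
    gap_runs = Counter()
    run_row, run_len = None, 0          # row holding the current gap run, and its length
    for a, b in zip(top, bottom):
        row = "top" if a == gap else "bottom" if b == gap else None
        if row != run_row and run_len:  # a gap run just ended
            gap_runs[run_len] += 1
            run_len = 0
        run_row = row
        if row is None:
            if a == b:
                matches += 1
            else:
                mismatches += 1
        else:
            run_len += 1
    if run_len:                          # gap run reaching the end of the alignment
        gap_runs[run_len] += 1
    return matches, mismatches, gap_runs

def similarity(top, bottom, w):
    """S = x - sum_k w(k) * z_k."""
    x, _, z = pair_counts(top, bottom)
    return x - sum(w(k) * n for k, n in z.items())

def distance(top, bottom, w_prime):
    """D = y + sum_k w'(k) * z_k."""
    _, y, z = pair_counts(top, bottom)
    return y + sum(w_prime(k) * n for k, n in z.items())

# Alignment II from the slides, with a hypothetical penalty w(k) = 1 + k.
w = lambda k: 1 + k
print(pair_counts("TCAGACGATTG", "TCGGAGCTG--"))
print(similarity("TCAGACGATTG", "TCGGAGCTG--", w))
print(distance("TCAGACGATTG", "TCGGAGCTG--", w))
```

Passing the penalty as a function keeps the index definitions independent of the gap penalty system that is chosen.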

Gap penalty systems
Fixed: no gap extension penalty.
Affine (linear): has two components, a gap opening penalty and a gap extension penalty.
Logarithmic: also has two components, but the cost increases more slowly with gap length, allowing longer gaps than the affine system.
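The three systems can be written as functions of the gap length k. This is a sketch under assumed functional forms (a constant, a + b·k, and a + b·ln k); the slides do not give explicit formulas, and the default parameter values are arbitrary.

```python
import math

def fixed_penalty(k, opening=3.0):
    """Same cost for any gap length: no extension penalty."""
    return opening

def affine_penalty(k, opening=3.0, extension=1.0):
    """Opening cost plus a constant cost per gapped position."""
    return opening + extension * k

def log_penalty(k, opening=3.0, extension=1.0):
    """Opening cost plus a cost that grows with the logarithm of the gap length."""
    return opening + extension * math.log(k)

for k in (1, 10, 100):
    print(k, fixed_penalty(k), affine_penalty(k), round(log_penalty(k), 2))
```

With equal parameters, a long gap costs most under the affine system and least under the fixed system, with the logarithmic system in between.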

[Figure: Gap penalty systems. Gap penalty plotted against gap length for the linear, logarithmic, and fixed systems.]

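Similarity scores of the three alignments under two gap penalty schemes

                      Scheme 1                  Scheme 2
                      Gap opening cost = 2      Gap opening cost = 3
                      Gap extension cost = 6    Gap extension cost = 0

    TCAG-ACG-ATTG
    || | | |  | |     S = -5                    S = -11
    TC-GGA-GC-T-G

    TCAGACGATTG
    || ||             S = -4                    S = 1
    TCGGAGCTG--

    TCAG-ACGATTG
    || | | | |        S = -2                    S = -6
    TC-GGA-GCTG-

The scores are consistent with S = x - (sum of gap costs), where a gap of length k costs opening + extension × (k - 1). The best (highest-scoring) alignment depends on the scheme: the third alignment under scheme 1 (S = -2) and the second under scheme 2 (S = 1).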

Dynamic programming
Large searches are divided into a succession of small stages:
- the solution of the initial search stage is trivial;
- each partial solution in a later stage can be calculated by reference to only a small number of solutions of the earlier stage;
- the final stage contains the overall solution.

Dynamic programming matrix (columns: ATGCG; rows: ATCCGC)

          A  T  G  C  G
       A  1  0  0  0  0
       T  0  2  1  1  1
       C  0  1  2  3  2
       C  0  1  2  3  3
       G  0  1  3  2  4
       C  0  1  2  4  3

[Figure: pointer values and paths connecting the pointers, drawn on the matrix.]

Traceback

    ATGCG-
    || ||
    ATCCGC

    AT--GCG
    ||  ||
    ATCCGC-
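The fill-then-traceback procedure can be sketched as a Needleman-Wunsch global alignment. This is an illustration only: the scoring values (match +1, mismatch -1, gap -1 per position) are assumptions and are not the scheme used to fill the matrix on the slides, so the cell values differ from those shown.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap cost; returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    # F[i][j] = best score for aligning s[:i] with t[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap            # trivial initial stage: prefix against gaps
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # pair s[i-1] with t[j-1]
                          F[i - 1][j] + gap,       # s[i-1] against a gap
                          F[i][j - 1] + gap)       # t[j-1] against a gap
    # Traceback: follow one optimal path from the final cell to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))

score, a, b = needleman_wunsch("ATGCG", "ATCCGC")
print(score)
print(a)
print(b)
```

Several paths can be optimal, as the two tracebacks on the slide show; this sketch reports only one of them, preferring diagonal moves.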

Similarity Index
$S = x - \sum_k w_k z_k$
x: number of matches
z_k: number of gaps of length k
w_k: a positive number representing the penalty for gaps of length k

Four alignments of TCAGACGAGTG and TCGGAGCTG, scored with a gap of length k costing a + bk:

    (I)    TCAGACGAGTG       x = 6; one gap of 2 bp
           || ||    ||       S = 6 - (a + 2b)
           TCGGA--GCTG

    (II)   TCAGACGAGTG       x = 7; 2 gaps of 1 bp
           || || |  ||       S = 7 - 2(a + b)
           TCGGA-GC-TG

    (III)  TCAGACGAGTG       x = 7; 2 gaps of 1 bp
           || || |  ||       S = 7 - 2(a + b)
           TCGGA-G-CTG

    (IV)   TCAGACGAG-TG      x = 8; 2 gaps of 1 bp and 1 gap of 2 bp
           || |  ||| ||      S = 8 - 2(a + b) - (a + 2b)
           TC-G--GAGCTG
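Comparing the four scores pairwise shows how the gap parameters decide which alignment wins. This is a worked comparison, assuming (as read off the S expressions above) that a gap of length k costs a + bk with a, b ≥ 0:

```latex
\begin{align*}
S_{\mathrm{II}} - S_{\mathrm{I}}  &= [7 - 2(a+b)] - [6 - (a+2b)] = 1 - a,\\
S_{\mathrm{IV}} - S_{\mathrm{II}} &= [8 - 2(a+b) - (a+2b)] - [7 - 2(a+b)] = 1 - a - 2b,\\
S_{\mathrm{IV}} - S_{\mathrm{I}}  &= [8 - 3a - 4b] - [6 - a - 2b] = 2 - 2a - 2b.
\end{align*}
```

So alignment I scores highest when a > 1, alignments II and III (which always tie) beat I when a < 1, and alignment IV is best only when a + 2b < 1.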

How to align two long genomic sequences?

Traditional Seq. Alignment
The seqs. are usually known (coding or non-coding) and are homologous.
They are not very long, usually < 10,000 base pairs (bp).
They contain no inversions.
Relies on dynamic programming: the time and space required are O(N^2), where N is the sequence length.

The Human Genome
Genome size: ~3.2 billion bp.
Only ~1.5% is coding.
Contains numerous repetitive elements (more than 4 million).
Introns are usually longer than exons.
Non-coding regions evolve fast and are not well conserved.

Genomic Seq. Alignment
The seqs. can be > one million bp (Mb); e.g., the genome size of Mycobacterium tuberculosis is about 4 Mb.
Aligning them takes a long time and a large amount of computer memory.
They may contain inversions and many tandem repeats.
They may contain non-alignable (too divergent) segments.
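A rough estimate shows why full dynamic programming does not scale to this size. The sketch below assumes a complete N × M score matrix with 4 bytes per cell; both the function name and the cell size are illustrative assumptions.

```python
def dp_matrix_bytes(n, m, bytes_per_cell=4):
    """Memory for a full (n x m) dynamic programming matrix, in bytes."""
    return n * m * bytes_per_cell

# Two 10,000 bp sequences versus two 4 Mb genomes.
for length in (10_000, 4_000_000):
    size = dp_matrix_bytes(length, length)
    print(f"{length:>9} bp: {size / 1e9:,.1f} GB")
```

Under these assumptions the matrix for two 10,000 bp sequences needs about 0.4 GB, while two 4 Mb genomes would need about 64,000 GB.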

Genomic Seq. Alignment Strategy: Search for anchors that can divide the sequences into subregions. The gaps between anchors can then be aligned by a local alignment algorithm.
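The anchoring strategy can be sketched as follows: given an ordered chain of anchors, list the unaligned regions between consecutive anchors. The tuple layout (start in A, start in B, length) and the optional `max_gap` cutoff are hypothetical choices made for this example.

```python
def gap_regions(anchors, max_gap=None):
    """Return the regions between consecutive anchors.

    anchors: list of (start_a, start_b, length), sorted and collinear in both sequences.
    Each region is ((a_from, a_to), (b_from, b_to)) with half-open coordinates.
    Regions longer than max_gap in either sequence are skipped when max_gap is set.
    """
    regions = []
    for (a0, b0, length0), (a1, b1, _) in zip(anchors, anchors[1:]):
        region_a = (a0 + length0, a1)   # end of this anchor to start of the next, in A
        region_b = (b0 + length0, b1)   # the same interval in B
        size = max(region_a[1] - region_a[0], region_b[1] - region_b[0])
        if max_gap is None or size <= max_gap:
            regions.append((region_a, region_b))
    return regions

anchors = [(0, 0, 5), (10, 12, 4), (30, 20000, 3)]
print(gap_regions(anchors))
print(gap_regions(anchors, max_gap=10_000))
```

Each returned pair of intervals is a small independent alignment problem, which is what makes the overall task tractable.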

The System of Delcher et al. (1999)
Three ideas: (1) suffix trees; (2) the Longest Increasing Subsequence (LIS); and (3) the local alignment method of Smith and Waterman (1981).
Input: two closely homologous long sequences or genomes (A and B).

Step 1: Perform a Maximal Unique Match (MUM) decomposition of the two sequences
A MUM is a subsequence that occurs exactly once in sequence A and once in sequence B, and is not contained in any longer such sequence.
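The MUM definition can be checked with a brute-force search. This is a naive sketch for short sequences only, not the suffix-tree method the system uses; the function names and the `min_len` parameter are assumptions. It finds every match that cannot be extended to the left or right, then keeps those whose matched string is unique in both sequences.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count

def find_mums(a, b, min_len=1):
    """Return sorted (pos_in_a, pos_in_b, length) for all MUMs of length >= min_len."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue                      # extendable to the left: not maximal
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1                        # extend to the right as far as possible
            if k < min_len:
                continue
            word = a[i:i + k]
            if count_occurrences(a, word) == 1 and count_occurrences(b, word) == 1:
                mums.append((i, j, k))
    return sorted(mums)

# Toy input: one shared segment flanked by unrelated bases.
print(find_mums("TTACGTACGG", "CCACGTACAA", min_len=4))
```

The running time grows with the product of the sequence lengths, which is why a suffix tree is needed at genome scale.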

Max. Unique Matches (MUMs)

MUM1
    Seq. A  tcgatcaAGCTCACTGATatgtaccat
    Seq. B   cgagcgAGCTCACTGATcctgcatca

MUM2
    Seq. A  -acgctgaATCGACGTAGTCCATGtactgta
    Seq. B  agtgc-agATCGACGTAGTCCATGatgaat

Suffix Trees
A suffix is a subseq. that begins at any position in the seq. and extends to the seq. end.
Example seq.: g a a c c g a c c t
A suffix: c c g a c c t
A suffix tree is a compact representation that stores all possible suffixes of a seq.
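The idea of storing every suffix in one tree can be sketched with an uncompressed suffix trie built from nested dictionaries. This is a simplification, not the compact suffix tree itself: it uses quadratic space, whereas a true suffix tree merges single-child chains and can be built in linear time. The "$" end marker and the function names are assumptions.

```python
def build_suffix_trie(seq, terminator="$"):
    """Insert every suffix of seq + terminator into a trie of nested dicts."""
    text = seq + terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})   # follow or create the child for ch
    return root

def contains(trie, pattern):
    """True if pattern is a substring of the indexed sequence."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("gaaccgacct")
print(contains(trie, "ccgacct"))   # the suffix used as the example on the slide
print(contains(trie, "acg"))
```

Every substring is a prefix of some suffix, so a substring query only has to walk down from the root.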

[Figure: suffix tree, drawn from the root, for the sequence g a a c c g a c c t.]

[Figure: suffix tree, drawn from the root, for the two sequences g a a c c g a c c t# and g a a c c t a c c t*, where # and * mark the ends of the two sequences.]

Step 2: Sort the MUMs
After finding the MUMs, we sort them according to their positions in genome A (see figure).
Longest Increasing Subsequence (LIS): if the order of the B positions is given by the sequence [1,2,10,4,5,8,6,7,9,3], the LIS is [1,2,4,5,6,7,9].
The LIS gives a global MUM-alignment.
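The LIS step can be sketched with the standard patience-sorting approach, which runs in O(n log n) time. This is one common way to compute an LIS and is not claimed to be the exact variant used in the system; the function name is an assumption.

```python
from bisect import bisect_left

def longest_increasing_subsequence(values):
    """Return one longest strictly increasing subsequence of values."""
    tail_vals = []                 # tail_vals[k]: smallest tail of an increasing run of length k+1
    tail_idx = []                  # index in values of that tail
    prev = [None] * len(values)    # back-pointers for reconstruction
    for idx, v in enumerate(values):
        k = bisect_left(tail_vals, v)
        if k == len(tail_vals):
            tail_vals.append(v)
            tail_idx.append(idx)
        else:
            tail_vals[k] = v
            tail_idx[k] = idx
        prev[idx] = tail_idx[k - 1] if k > 0 else None
    result = []
    idx = tail_idx[-1] if tail_idx else None
    while idx is not None:         # walk the back-pointers from the last tail
        result.append(values[idx])
        idx = prev[idx]
    return result[::-1]

# B positions of the MUMs after sorting by position in genome A (slide example).
print(longest_increasing_subsequence([1, 2, 10, 4, 5, 8, 6, 7, 9, 3]))
# prints [1, 2, 4, 5, 6, 7, 9]
```

MUMs left out of the LIS (here those at B positions 10, 8, and 3) are out of order relative to the chain and are not used as anchors.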

[Figure: two panels, each showing the MUM positions in Genome A and Genome B.]

Step 3: Close the gaps between MUMs
Use the Smith-Waterman algorithm to close the gaps between MUMs.
Some regions may be very difficult to align. These regions are ignored and considered non-alignable.
Default: if the gap between 2 MUMs is longer than 10 kb, no local alignment is attempted.
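Smith-Waterman local alignment, used here for the gap regions, can be sketched as below. The scoring values (match +2, mismatch -1, gap -2 per position) are illustrative assumptions, and the sketch returns only one best local alignment.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap cost; returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # Scores are floored at 0 so an alignment may start anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback from the highest-scoring cell until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

score, a, b = smith_waterman("tttACGTACGTggg", "cccACGTACGTaaa")
print(score, a, b)
```

Because this uses a full matrix, its cost grows with the product of the two region lengths, which is one reason to skip very long gap regions.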