Algorithms for Alignment of Genomic Sequences Michael Brudno Department of Computer Science Stanford University PGA Workshop 07/16/2004.

Presentation transcript:

Algorithms for Alignment of Genomic Sequences Michael Brudno Department of Computer Science Stanford University PGA Workshop 07/16/2004

Conservation Implies Function [Figure labels: Exon; Gene; CNS: Other Conserved]

Edit Distance Model (1) Weighted sum of insertions, deletions & mutations to transform one string into another

AGGCACA--CA        AGGCACACA
|  ||||  ||   or   |  ||  ||
A--CACATTCA        ACACATTCA

Edit Distance Model (2) Given: x, y. Define: F(i,j) = score of best alignment of x_1…x_i to y_1…y_j. Recurrence: F(i,j) = max( F(i-1,j) - GAP_PENALTY, F(i,j-1) - GAP_PENALTY, F(i-1,j-1) + SCORE(x_i, y_j) )

Edit Distance Model (3) F(i,j) = score of best alignment ending at i,j. Each cell F(i,j) is computed from its neighbors F(i-1,j), F(i,j-1) and F(i-1,j-1). Time O(n^2) for two seqs, O(n^k) for k seqs.
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
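The recurrence on the Edit Distance Model slides can be sketched in a few lines of Python. This is a minimal illustration rather than code from the talk: the function name `global_alignment_score`, the match/mismatch values standing in for SCORE, the numeric gap penalty, and the boundary conditions F(i,0) = -i * GAP_PENALTY and F(0,j) = -j * GAP_PENALTY are all assumptions, since the slides do not state them.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap_penalty=2):
    """Fill the F(i,j) table and return F(len(x), len(y)).

    match, mismatch and gap_penalty are illustrative values (assumptions).
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary: aligning a prefix against nothing costs one gap per letter.
    for i in range(1, n + 1):
        F[i][0] = -i * gap_penalty
    for j in range(1, m + 1):
        F[0][j] = -j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j] - gap_penalty,      # gap in y
                          F[i][j - 1] - gap_penalty,      # gap in x
                          F[i - 1][j - 1] + score)        # x_i aligned to y_j
    return F[n][m]
```

The two nested loops over i and j are where the O(n^2) running time for two sequences comes from.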

Overview: Local Alignment (CHAOS); Multiple Global Alignment (LAGAN), including Whole Genome Alignment; Glocal Alignment (Shuffle-LAGAN); Biological Story

Local Alignment: use the same recurrence, but clamp each cell at zero: F(i,j) = max(F(i,j), 0). Return all paths with a position i,j where F(i,j) > C. Time O(n^2) for two seqs, O(n^k) for k seqs.
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
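A small sketch of the local variant, assuming the same illustrative scoring as a standard Smith-Waterman setup. The function name `local_alignment_hits` and the parameter `threshold` (standing in for C on the slide) are hypothetical, and for brevity the sketch reports the cells whose score exceeds the threshold instead of tracing back full paths.

```python
def local_alignment_hits(x, y, threshold, match=1, mismatch=-1, gap_penalty=2):
    """Return (i, j, score) for every cell with F(i,j) > threshold.

    Scoring values are assumptions; traceback of the paths is omitted.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    hits = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if x[i - 1] == y[j - 1] else mismatch
            best = max(F[i - 1][j] - gap_penalty,
                       F[i][j - 1] - gap_penalty,
                       F[i - 1][j - 1] + score)
            F[i][j] = max(best, 0)  # the clamp that makes the alignment local
            if F[i][j] > threshold:
                hits.append((i, j, F[i][j]))
    return hits
```

For example, `local_alignment_hits("GGACGTCC", "TTACGTAA", threshold=3)` reports the single cell where the shared word ACGT ends.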

Heuristic Local Alignment [Figure: the same sequence pair shown in two panels, labeled BLAST and FASTA]
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

CHAOS: CHAins Of Seeds. 1. Find short matching words (seeds) 2. Chain them 3. Rescore chain
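Step 1 can be illustrated with a simple word index. This is only a sketch under stated assumptions: it finds exact k-letter matches with a hash table, which may differ from the data structure and matching rules CHAOS actually uses, and the name `find_seeds` and the default `k=4` are hypothetical.

```python
from collections import defaultdict


def find_seeds(seq1, seq2, k=4):
    """Return (pos1, pos2) pairs where the same k-letter word starts in both sequences."""
    index = defaultdict(list)
    # Index every k-letter word of seq2 by its start position.
    for j in range(len(seq2) - k + 1):
        index[seq2[j:j + k]].append(j)
    seeds = []
    # Walk along seq1 and look each word up in the index.
    for i in range(len(seq1) - k + 1):
        for j in index.get(seq1[i:i + k], ()):
            seeds.append((i, j))
    return seeds
```

Because seq1 is scanned left to right, the seeds come out ordered by their location in seq1, which is the order the chaining step consumes them in.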

CHAOS: Chaining the Seeds. Find seeds at the current location in seq1. Find the previous seeds that fall into the search box. Do a range query: seeds are indexed by their diagonal. Pick the previous seed that maximizes the score of the chain. Time O(n log n), where n is the number of seeds. [Figure labels: seq1; seq2; seed; location in seq1; distance cutoff; gap cutoff; Search box; Range of search]
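The chaining step can be sketched as follows. Everything beyond the slide text is an assumption made for illustration: the name `chain_seeds`, the scoring rule (k per seed minus `gap_cost` per diagonal shifted), and the exact shape of the search box. The sketch keeps previous seeds in a list sorted by diagonal and uses `bisect` for the range query; it never evicts old seeds and relies on linear-time list insertion, so it does not reach the O(n log n) bound quoted on the slide.

```python
import bisect


def chain_seeds(seeds, k, distance_cutoff, gap_cutoff, gap_cost=1):
    """seeds: (pos1, pos2) starts of k-letter matches. Returns (score, chain)."""
    if not seeds:
        return 0, []
    seeds = sorted(seeds)                 # process in order of location in seq1
    active = []                           # previous seeds as (diagonal, pos1, pos2, index)
    score = [k] * len(seeds)
    prev = [None] * len(seeds)
    for n, (i, j) in enumerate(seeds):
        diag = i - j
        # Range query: previous seeds whose diagonal is within gap_cutoff.
        lo = bisect.bisect_left(active, (diag - gap_cutoff,))
        hi = bisect.bisect_right(active, (diag + gap_cutoff, float("inf")))
        for d, pi, pj, m in active[lo:hi]:
            ends_before = pi + k <= i and pj + k <= j
            if ends_before and i - (pi + k) <= distance_cutoff:
                cand = score[m] + k - gap_cost * abs(diag - d)
                if cand > score[n]:
                    score[n], prev[n] = cand, m
        bisect.insort(active, (diag, i, j, n))
    # Trace back from the best-scoring seed.
    n = max(range(len(seeds)), key=score.__getitem__)
    best, chain = score[n], []
    while n is not None:
        chain.append(seeds[n])
        n = prev[n]
    return best, chain[::-1]
```

A production version would drop seeds that have fallen behind the distance cutoff and store the active set in a balanced tree or skip list so that both the insert and the range query are logarithmic.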

CHAOS Scoring. Initial score = # matching bp - gaps. Rapid rescoring: extend all seeds to find the optimal location for gaps.

Overview: Local Alignment (CHAOS); Multiple Global Alignment (LAGAN), including Whole Genome Alignment; Glocal Alignment (Shuffle-LAGAN); Biological Story

Global Alignment [Figure labels: x, y, z]
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

LAGAN: 1. FIND Local Alignments. Steps: 1. Find Local Alignments 2. Chain Local Alignments 3. Restricted DP

LAGAN: 2. CHAIN Local Alignments. Steps: 1. Find Local Alignments 2. Chain Local Alignments 3. Restricted DP

LAGAN: 3. Restricted DP. Steps: 1. Find Local Alignments 2. Chain Local Alignments 3. Restricted DP
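Step 2 of the LAGAN outline, chaining the local alignments into a rough global map, can be illustrated with a simple quadratic dynamic program. This is a sketch under assumptions, not the method described in the talk: the tuple layout `(start1, end1, start2, end2, score)` with exclusive ends and the name `chain_local_alignments` are hypothetical, and a faster sparse chaining method would normally replace the double loop.

```python
def chain_local_alignments(alignments):
    """Pick the highest-scoring subset of local alignments that is
    increasing in both sequences. Returns (total_score, chain)."""
    if not alignments:
        return 0, []
    als = sorted(alignments)              # order by start in sequence 1
    best = [a[4] for a in als]            # best chain score ending at each alignment
    prev = [None] * len(als)
    for n, (s1, e1, s2, e2, sc) in enumerate(als):
        for m in range(n):
            _, pe1, _, pe2, _ = als[m]
            # The predecessor must end before this alignment starts in both sequences.
            if pe1 <= s1 and pe2 <= s2 and best[m] + sc > best[n]:
                best[n], prev[n] = best[m] + sc, m
    n = max(range(len(als)), key=best.__getitem__)
    total, chain = best[n], []
    while n is not None:
        chain.append(als[n])
        n = prev[n]
    return total, chain[::-1]
```

The resulting chain supplies the anchors; step 3 then only has to run the full recurrence in the regions between and around consecutive anchors rather than over the whole n x n table.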

MLAGAN: 1. Progressive Alignment. Given N sequences and a phylogenetic tree, align pairwise, in the order of the tree (using LAGAN). [Figure: tree with leaves Human, Baboon, Mouse, Rat]
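The order in which a progressive aligner visits the tree can be sketched with a post-order traversal. The nested-tuple tree encoding, the function name `progressive_order`, and the topology used in the usage note are assumptions for illustration; the slide only lists the four leaves.

```python
def progressive_order(tree):
    """Return the pairwise alignment steps for a binary guide tree, leaves first.

    A tree is either a species name (str) or a (left, right) tuple.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return node
        left, right = node
        l, r = visit(left), visit(right)
        steps.append((l, r))       # align the two children (e.g. with LAGAN)
        return f"{l}/{r}"          # the merged alignment acts as one sequence

    visit(tree)
    return steps
```

With the assumed topology `(("Human", "Baboon"), ("Mouse", "Rat"))`, the steps are Human with Baboon, Mouse with Rat, and finally Human/Baboon with Mouse/Rat.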

MLAGAN: 2. Multi-anchoring. To anchor the (X/Y) and (Z) alignments: [Figure labels: X, Z; Y, Z; X/Y, Z]

Cystic Fibrosis (CFTR), 12 species. Human sequence length: 1.8 Mb. Total genomic sequence: 13 Mb. Species: Human, Baboon, Cat, Dog, Cow, Pig, Mouse, Rat, Chimp, Chicken, Fugufish, Zebrafish

CFTR (cont'd) [Table: LAGAN vs. MLAGAN; columns: % Exons Aligned (Mammals; Chicken & Fishes), TIME (sec), MAX MEMORY (Mb); only surviving value: 98%]

Automatic computational system for comparative analysis of pairs of genomes. Alignments (all pair combinations): Human Genome (Golden Path Assembly); Mouse assemblies: Arachne, Phusion (2001), MGSC v3 (2002); Rat assemblies: January 2003, February; D. melanogaster vs D. pseudoobscura: February 2003

Tandem Local/Global Approach Finding a likely mapping for a contig (BLAT)

Progressive Alignment Scheme [Flowchart; box labels: Human, Mouse and Rat genomes; Pairwise M/R mapping; Aligned M&R fragments; Unaligned M&R sequences; Map to Human Genome; Mapping aligned fragments by union of M&R local BLAT hits on the human genome; H/M/R MLAGAN alignment; M/R pairwise alignment; M/H and R/H pairwise alignment; Unassigned M&R DNA fragments; decision branches labeled yes/no]

Computational Time 23 dual 2.2GHz Intel Xeon node PC cluster. Pair-wise rat/mouse – 4 hours Pair-wise rat/human and mouse/human – 2 hours Multiple human/mouse/rat – 9 hours Total wall time: ~ 15 hours

Distribution of Large Indels

Evolution Over a Chromosome

Overview: Local Alignment (CHAOS); Multiple Global Alignment (LAGAN), including Whole Genome Alignment; Glocal Alignment (Shuffle-LAGAN); Biological Story

Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion Translocation Duplication

Local & Global Alignment [Figure: the same sequence pair shown in two panels, labeled Local and Global]
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

Glocal Alignment Problem. Find the least-cost transformation of one sequence into another using new operations: sequence edits; inversions; translocations; duplications; combinations of the above.
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

Shuffle-LAGAN A glocal aligner for long DNA sequences

S-LAGAN: Find Local Alignments. Steps: 1. Find Local Alignments 2. Build Rough Homology Map 3. Globally Align Consistent Parts

S-LAGAN: Build Homology Map. Steps: 1. Find Local Alignments 2. Build Rough Homology Map 3. Globally Align Consistent Parts

Building the Homology Map. Chain the local alignments (using Eppstein-Galil); each alignment gets a score which is the MAX over 4 possible chains. Penalties are affine (event and distance components). Penalties (figure labels a-d): a) regular b) translocation c) inversion d) inverted translocation
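The idea of scoring a chain with a different affine penalty for each kind of transition can be sketched as below. This is a loose illustration, not the S-LAGAN procedure: the tuple layout `(start1, end1, start2, end2, score, strand)`, the rule used in `classify` to tell the four cases apart, the penalty values in `EVENT_PENALTY`, the distance measure, and the quadratic loop (in place of the sparse chaining named on the slide) are all assumptions.

```python
# Hypothetical event penalties; the slide gives no numbers.
EVENT_PENALTY = {"regular": 0, "translocation": 10,
                 "inversion": 10, "inverted translocation": 20}


def classify(prev, cur):
    """Assumed rule: compare strands, then check whether cur continues
    from prev in sequence 2 in the direction of prev's strand."""
    if prev[5] > 0:
        follows = cur[2] >= prev[3]
    else:
        follows = cur[3] <= prev[2]
    if prev[5] == cur[5]:
        return "regular" if follows else "translocation"
    return "inversion" if follows else "inverted translocation"


def glocal_chain(alignments, distance_rate=0.01):
    """Chain alignments that are increasing in sequence 1 only,
    charging an event penalty plus a distance penalty per transition."""
    if not alignments:
        return 0, []
    als = sorted(alignments)
    best = [a[4] for a in als]
    prev = [None] * len(als)
    for n, cur in enumerate(als):
        for m in range(n):
            p = als[m]
            if p[1] > cur[0]:          # overlap in sequence 1 is not allowed
                continue
            dist = (cur[0] - p[1]) + abs(cur[2] - p[3])
            cand = (best[m] + cur[4]
                    - EVENT_PENALTY[classify(p, cur)]
                    - distance_rate * dist)
            if cand > best[n]:
                best[n], prev[n] = cand, m
    n = max(range(len(als)), key=best.__getitem__)
    total, chain = best[n], []
    while n is not None:
        chain.append(als[n])
        n = prev[n]
    return total, chain[::-1]
```

Because the chain only has to be ordered along one sequence, an inverted or relocated block can join it at the cost of the corresponding event penalty instead of being dropped, which is the behavior a glocal map needs.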

S-LAGAN: Global Alignment. Steps: 1. Find Local Alignments 2. Build Rough Homology Map 3. Globally Align Consistent Parts

S-LAGAN Results (CFTR) [Figure panels: Local; Glocal]

[Figure panels: Hum/Mus; Hum/Rat]

S-LAGAN Results (IGF cluster)

S-LAGAN results (HOX) 12 paralogous genes Conserved order in mammals

S-LAGAN Results (Chr 20) Human Chr 20 v. homologous Mouse Chr Segments of conserved synteny 70 Inversions

S-LAGAN Results (Whole Genome)
% Human genome aligned with mouse sequence
            LAGAN      S-LAGAN
Total       37%        38%
Exon        93%        96%
Ups200      78%        81%
CPU Time    350 Hrs    450 Hrs
Used Berkeley Genome Pipeline. Evaluation criteria from Waterston et al. (Nature 2002)

Rearrangements in Human v. Mouse. Preliminary conclusions: Rearrangements come in all sizes. Duplications are less well conserved than other rearranged regions. Simple inversions tend to be the most common and the most conserved.

What is next? (Shuffle) Better algorithm and scoring; whole genome synteny mapping; Multiple Glocal Alignment(!?)

Overview: Local Alignment (CHAOS); Multiple Global Alignment (LAGAN), including Whole Genome Alignment; Glocal Alignment (Shuffle-LAGAN); Biological Story

Math1 (Mouse Atonal Homologue 1, also ATOH) is a gene that is responsible for nervous system development

Align Human, Mouse, Rat & Fugu

Detailed Alignment hum_a : 57336/ mus_a : 78565/ rat_a : / fug_a : 36013/68174 hum_a : 57386/ mus_a : 78615/ rat_a : / fug_a : 36063/68174

Can we align human & fly???
CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG
Melan  GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA

Putting it all together
CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG
Melan  GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA

Overview: Local Alignment (CHAOS); Multiple Global Alignment (LAGAN), including Whole Genome Alignment; Glocal Alignment (Shuffle-LAGAN); Biological Story

Acknowledgments. Stanford: Serafim Batzoglou, Arend Sidow, Matt Scott, Gregory Cooper, Chuong (Tom) Do, Sanket Malde, Kerrin Small, Mukund Sundararajan. Berkeley: Inna Dubchak, Alexander Poliakov. Göttingen: Burkhard Morgenstern. Rat Genome Sequencing Consortium.