Genome Alignment. Alignment Methods: Needleman-Wunsch (global) and Smith-Waterman (local) use dynamic programming and are guaranteed to find an optimal alignment.

Presentation transcript:

Genome Alignment

Alignment Methods
Needleman-Wunsch (global) and Smith-Waterman (local) use dynamic programming
Guaranteed to find an optimal alignment given a particular scoring function
Too computationally intensive for genome alignment, especially multiple genomes
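To make the dynamic-programming idea concrete, here is a minimal sketch of both recurrences, computing scores only (no traceback). The scoring values (match +1, mismatch -1, linear gap -2) and the function names are assumptions chosen for illustration, not values from the slides. Both functions fill one cell per pair of positions, so two 5-megabase genomes would need on the order of 2.5 × 10^13 cells, which is why these methods do not scale to whole genomes.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    # Row 0: aligning the empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # column 0: a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def sw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Local (Smith-Waterman) alignment score.

    Same recurrence, but every cell is floored at 0 and the answer is the
    best cell anywhere in the table rather than the bottom-right corner.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

Keeping only two rows at a time is enough for the score; recovering the alignment itself would require storing the full table or traceback pointers.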

Genome Alignment
Depending on the level of similarity, genome alignments may need to contend with rearrangements and large-scale duplications and deletions
Draft or partial genomes can both benefit from and confound alignment
Need to visualize results in summary form

Genome Alignment
Pair-wise
– Align two genomes
– Example: MUMmer
Multiple or complex samples and a reference genome
– All of one genome plus whatever parts match from the other genome(s)
– Example: PIPs
Multiple alignment
– All of all the genomes
– Example: Mauve

Some aligners

MUMmer (Maximal Unique Match)
Fast pair-wise comparison of draft or complete genomes using nucleotide or 6-frame translated sequences
MUMmer 3.0 can find all 20-basepair or longer exact matches between a pair of 5-megabase genomes in 13.7 seconds, using 78 MB of memory, on a 2.4 GHz Linux desktop computer
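The notion of a maximal unique match can be illustrated with a small sketch. This is not MUMmer's algorithm (MUMmer uses a suffix tree); it is a simplified stand-in that seeds on k-mers occurring exactly once in each sequence and extends each seed in both directions. The function names and the default k are hypothetical. Every match it reports is unique in both sequences and cannot be extended, but it can miss matches that contain no k-mer unique in both sequences, and it looks at the forward strand only.

```python
from collections import defaultdict


def unique_kmer_positions(seq, k):
    """Map each k-mer that occurs exactly once in seq to its start position."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos[0] for kmer, pos in positions.items() if len(pos) == 1}


def seeded_unique_matches(a, b, k=4):
    """Return sorted (start_in_a, start_in_b, length) exact matches.

    Seeds are k-mers unique in both a and b; each seed is extended left and
    right until the sequences disagree or an end is reached.
    """
    unique_a = unique_kmer_positions(a, k)
    unique_b = unique_kmer_positions(b, k)
    found = set()
    for kmer, i in unique_a.items():
        j = unique_b.get(kmer)
        if j is None:
            continue
        # Extend to the left while the preceding characters agree.
        while i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            i -= 1
            j -= 1
        # Extend to the right from the new start, counting the full length.
        length = 0
        while (i + length < len(a) and j + length < len(b)
               and a[i + length] == b[j + length]):
            length += 1
        found.add((i, j, length))  # seeds inside one match collapse to one entry
    return sorted(found)
```

For example, `seeded_unique_matches("TTACGTCAGG", "CCACGTCATT")` reports the single shared run `ACGTCA` starting at position 2 in both sequences.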

Suffix Tree
Delcher et al. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res Jun 1;30(11):

MUMmer plot (axes: Genome 1, Genome 2)

5 Campylobacter PROmer analysis
Fouts et al. Major structural differences and novel potential virulence mechanisms from the genomes of multiple campylobacter species. PLoS Biol Jan;3(1):e15.
One genome is used as the x-axis for all four pair-wise comparisons
X-shape characteristic of collinearity interrupted by inversions around the origin or terminus of replication
Loss of collinearity in more distant comparisons

Human Gut metagenome
Percent Identity Plot (PIP) of random shotgun reads to a complete Bifidobacterium genome and a good quality draft Methanobrevibacter genome
Gill et al. Metagenomic analysis of the human distal gut microbiome. Science Jun 2; 312(5778):

Mauve Multiple Genome Aligner
Able to identify and align collinear regions of multiple genomes even in the presence of rearrangements
Find and extend seed matches
Group into locally collinear blocks
Align intervening regions
Darling et al. Genome Res Jul;14(7):

Progressive Mauve alignment of 12 E. coli genomes
Aaron Darling 2006 Ph.D. thesis, darling_thesis.pdf

Figure 1. The difference between positional homology alignment and glocal alignment. Three example linear genomes are broken into genes labeled A, B, C, D, and R. R is a multi-copy (repetitive) gene, with different copies labeled using numeric subscripts. Each copy of R is assumed to be identical in sequence, so that orthology/paralogy is unknowable from nucleotide substitution (as is often the case with mobile DNA repeat elements). Genes shifted downward in a given genome are inverted (reverse complement) relative to the reference genome. The positional homology alignment would ideally create two local alignment blocks where each block has exactly one alignment row for each genome. Only positionally-conserved copies of the repetitive gene family R become aligned to each other. The glocal alignment would ideally create four local alignment blocks wherein all copies of the repetitive gene family become aligned to each other.

Progressive Genome Alignment
Similar to CLUSTAL (next week), with integrated synteny mapping, positional homology, and anchored alignment

Performance Metrics
actual \ predicted    negative    positive
Negative              TN          FP
Positive              FN          TP
Accuracy – proportion correct: (TN+TP)/total
Precision (PPV) – proportion of predicted positives that are correct: TP/(FP+TP)
Sensitivity (recall, TPR) – proportion of positives correctly predicted: TP/(FN+TP)
Specificity – proportion of negatives correctly predicted: TN/(TN+FP)
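The four ratios can be computed directly from the confusion-matrix counts. The sketch below is illustrative only: the function name is hypothetical, the counts in the usage note are made up, and it assumes every denominator is non-zero. Note that TP/(TP+FP) is precision (positive predictive value), while TP/(TP+FN) is sensitivity, also called recall or true positive rate.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, sensitivity and specificity from counts.

    Assumes each denominator is non-zero; no guarding is done here.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # proportion of all calls that are correct
        "precision": tp / (tp + fp),        # predicted positives that are correct
        "sensitivity": tp / (tp + fn),      # actual positives that were found
        "specificity": tn / (tn + fp),      # actual negatives that were found
    }
```

With made-up counts `tp=40, fp=10, tn=45, fn=5`, accuracy is 0.85 and precision is 0.8.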

Sensitivity and Positive Predictive Value (PPV)
Sensitivity = TP/(TP+FN)
PPV = TP/(TP+FP)
For nucleotide pairs, a TP is a pair aligned in both the calculated and correct alignments. An FP is a nucleotide pair in the calculated alignment that is absent from the correct alignment. Likewise, an FN is a pair in the correct alignment not present in the calculated alignment. We do not quantify True Negative (TN) alignments, as the number of TN possibilities is extremely large, growing with the product of sequence lengths.
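The pair-based definitions map naturally onto set operations. In this sketch an alignment is represented as a set of (position in sequence 1, position in sequence 2) tuples; that representation, the function name, and the choice to return 0.0 for an empty denominator are assumptions made for illustration. True negatives are never counted, in line with the definitions above.

```python
def pair_sensitivity_ppv(calculated, correct):
    """Score a calculated alignment against the correct one.

    Both arguments are iterables of (pos_in_seq1, pos_in_seq2) aligned pairs.
    Returns (sensitivity, ppv).
    """
    calculated, correct = set(calculated), set(correct)
    tp = len(calculated & correct)   # pairs present in both alignments
    fp = len(calculated - correct)   # pairs only in the calculated alignment
    fn = len(correct - calculated)   # pairs only in the correct alignment
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, ppv
```

For instance, if the correct alignment pairs positions (0,0), (1,1), (2,2), (3,3) and the calculated alignment pairs (0,0), (1,1), (2,3), then two pairs are shared, one is spurious and two are missed, giving a sensitivity of 0.5 and a PPV of 2/3.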

ENCODE project
Goal = to identify all functional elements in the human genome
Margulies et al. report results of the pilot project to analyze 1% of the genome, using genome alignment to detect which regions of the sequence are evolutionarily constrained.

4 aligners
– MAVID
– MLAGAN
– TBA
– PECAN
23 mammalian species
30 Mb; 44 regions

Alignment Breakpoints

Alignment Coverage
For example, vs. armadillo:
MAVID     27.4%
MLAGAN    42.4%
TBA       41.2%
PECAN     40.1%
17.4% covered by all 4 aligners
Of which 66.1% are aligned identically

Performance Metrics
Sensitivity – coverage of protein coding regions and ancestral repeats
Specificity – primate specific repeats (Alu) and periodicity of substitutions in protein coding regions