Alignment of large genomic sequences Fragment-based alignment approach (DIALIGN) useful for alignment of genomic sequences. Possible applications: Detection.

Presentation transcript:

Alignment of large genomic sequences. Fragment-based alignment approach (DIALIGN) useful for alignment of genomic sequences. Possible applications: detection of regulatory elements; identification of pathogenic microorganisms; gene prediction.

The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

The DIALIGN approach atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

The DIALIGN approach atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaa--gagtatcacccctgaattgaataa

The DIALIGN approach atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaa--gagtatcacc cctgaattgaataa

The DIALIGN approach atc------taatagttaaactcccccgtgc-ttag cagtgcgtgtattactaac gg-ttcaatcgcg caaa--gagtatcacc cctgaattgaataa

The DIALIGN approach atc------taatagttaaactcccccgtgc-ttag cagtgcgtgtattactaac gg-ttcaatcgcg caaa--gagtatcacc cctgaattgaataa Consistency!

The DIALIGN approach atc------TAATAGTTAaactccccCGTGC-TTag cagtgcGTGTATTACTAAc GG-TTCAATcgcg caaa--GAGTATCAcc CCTGaaTTGAATaa

First step in sequence comparison: alignment [figure labels: S1, S2, S3]

For genomic sequences: Neither local nor global methods appropriate [figure labels: S1, S2, S3 and S1', S2', S3']

First step in sequence comparison: alignment Local method finds single best local similarity [figure labels: S1, S2, S3 and S1', S2', S3']

First step in sequence comparison: alignment Multiple application of local methods possible [figure labels: S1, S2, S3 and S1', S2', S3']

First step in sequence comparison: alignment Threshold has to be applied to filter alignments: reduced sensitivity! [figure labels: S1, S2, S3 and S1', S2', S3']

First step in sequence comparison: alignment Alternative approach: During evolution few large-scale rearrangements -> relative order of homologies conserved. Search for chain of local homologies

First step in sequence comparison: alignment Genomic alignment: chain of homologies [figure labels: S1, S2, S3 and S1', S2', S3']

First step in sequence comparison: alignment Novel approaches for genomic alignment: WABA, PipMaker, MGA, TBA, Lagan, Avid, DIALIGN

Alignment of large genomic sequences Gene-regulatory sites identified by multiple sequence alignment (phylogenetic footprinting)

Alignment of large genomic sequences

Objective function for DIALIGN: Weight score for every possible fragment f, based on a P-value: P(f) = probability of finding a fragment "like f" by chance in random sequences of the same length as the input sequences; w(f) = -log P(f) ("weight score" of f). "Like f" means: at least the same number of matches (DNA, RNA) or at least the same sum of similarity values (proteins)
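The weight score above can be illustrated with a minimal sketch for DNA fragments. It assumes a plain binomial model with a match probability of 0.25 per position and ignores the dependence on the input sequence lengths that the slide mentions, so it is a simplification and not DIALIGN's actual probability calculation; the function name and the `match_prob` parameter are hypothetical.

```python
import math


def fragment_weight(length, matches, match_prob=0.25):
    """Return -log P for a gap-free DNA fragment.

    P is the binomial probability that a random segment pair of the
    given length shows at least `matches` matching positions.
    Assumption: independent positions with uniform match probability.
    """
    p = sum(
        math.comb(length, k) * match_prob**k * (1 - match_prob) ** (length - k)
        for k in range(matches, length + 1)
    )
    return -math.log(p)
```

Under this model a fragment with more matches at the same length is less likely to occur by chance and therefore gets a higher weight.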

Objective function for DIALIGN: Score of alignment: sum of weight scores of fragments – no gap penalty!

Optimization problem for DIALIGN: Find consistent collection of fragments with maximum total weight score!
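For two sequences, a consistent collection of fragments is simply a chain in which every fragment lies strictly after its predecessor in both sequences, so the optimization can be sketched as a small dynamic program. The tuple layout `(start1, start2, length, weight)` and the function name are assumptions made for illustration; the multiple-alignment case needs the greedy procedure discussed later in the slides and is not covered here.

```python
def best_chain(fragments):
    """Find the maximum-weight chain of non-crossing fragments.

    fragments: list of (start1, start2, length, weight) for two sequences.
    Returns (total_weight, chain). Simple O(n^2) dynamic programming.
    """
    # Sort by end position so every possible predecessor comes first.
    frags = sorted(fragments, key=lambda f: (f[0] + f[2], f[1] + f[2]))
    best = []  # best[i] = (score of best chain ending in frags[i], predecessor)
    for i, (s1, s2, ln, w) in enumerate(frags):
        score, prev = w, None
        for j in range(i):
            p1, p2, pl, _ = frags[j]
            # frags[j] must end before frags[i] starts in both sequences.
            if p1 + pl <= s1 and p2 + pl <= s2 and best[j][0] + w > score:
                score, prev = best[j][0] + w, j
        best.append((score, prev))
    if not best:
        return 0.0, []
    i = max(range(len(best)), key=lambda k: best[k][0])
    total = best[i][0]
    chain = []
    while i is not None:
        chain.append(frags[i])
        i = best[i][1]
    return total, chain[::-1]
```

Note that the total is a plain sum of fragment weights; unaligned stretches between fragments cost nothing, which matches the "no gap penalty" statement on the previous slide.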

Alternative fragment weight scores for genomic sequences: Calculate fragment scores at nucleotide level and at peptide level.

catcatatcttatcttacgttaactcccccgt cagtgcgtgatagcccatatccgg

catcatatcttatcttacgttaactcccccgt cagtgcgtgatagcccatatccgg Standard score: Consider length, # matches, compute probability of random occurrence

Translation option: catcatatcttatcttacgttaactcccccgt cagtgcgtgatagcccatatccgg

Translation option: catcatatc tta tct tac gtt aactcccccgt / cagtgcgtg ata gcc cat atc cgg. DNA segments translated to peptide segments (tta tct tac gtt -> L S Y V; ata gcc cat atc -> I A H I); fragment score based on peptide similarity: Calculate probability of finding a fragment of the same length with (at least) the same sum of BLOSUM values
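The translation step on this slide can be sketched as follows. The code uses the standard genetic code and only translates; scoring the resulting peptide pair with a BLOSUM matrix is left out, and the function name is chosen here for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA segment codon by codon (spaces are ignored).

    Trailing bases that do not fill a complete codon are dropped.
    """
    dna = dna.upper().replace(" ", "")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))
```

Applied to the two segments from the slide, `translate("tta tct tac gtt")` gives `"LSYV"` and `translate("ata gcc cat atc")` gives `"IAHI"`.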

P-fragment (in both orientations): catcatatc tta tct tac gtt aactcccccgt (L S Y V) against cagtgcgtg ata gcc cat atc cgg (I A H I). N-fragment: catcatatc ttatcttacgtt aactcccccgtgct against cagtgcgtg atagcccatatc cg. For each fragment f, three probability values are calculated; the score of f is based on the smallest P-value.

Alternative fragment weight scores for genomic sequences: Calculate fragment scores at nucleotide level and at peptide level.

DIALIGN alignment of human and murine genomic sequences

DIALIGN alignment of tomato and Thaliana genomic sequences

Alignment of large genomic sequences Evaluation of signal detection methods: Apply method to data with known signals (correct answer is known!), e.g. experimentally verified genes for gene finding. TP = true positives = # signals correctly predicted (i.e. signal present); FP = false positives = # signals predicted but wrong (i.e. no signal present); TN = true negatives = # no signal predicted, no signal present; FN = false negatives = # no signal predicted, signal present!

Alignment of large genomic sequences Sn = Sensitivity = correctly predicted signals / present signals = TP / (TP + FN) Sp = Specificity = correctly predicted signals / predicted signals = TP / (TP + FP)
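The two measures translate directly into code. This is a minimal sketch using the definitions exactly as given on the slide; note that Sp as defined here (TP / (TP + FP)) is what other fields call precision or positive predictive value. The function names are chosen for illustration.

```python
def sensitivity(tp, fn):
    """Sn = correctly predicted signals / present signals."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Sp = correctly predicted signals / predicted signals (slide definition)."""
    return tp / (tp + fp)
```

For example, 80 correctly predicted signals with 20 missed and 40 wrong predictions give Sn = 0.8 and Sp = 80 / 120.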

Alignment of large genomic sequences Comprehensive evaluation of signal prediction method: Method assigns score to predictions. Apply threshold parameter: high threshold -> high specificity (Sp), low sensitivity (Sn); low threshold -> high sensitivity, low specificity. ROC curve ("receiver operating characteristic"): vary threshold parameter, plot Sn against Sp
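The threshold sweep can be sketched as below. It assumes every candidate signal carries a score and a known true/false label, and that a candidate counts as predicted when its score reaches the threshold; the input layout and function name are hypothetical. Returning Sp = 1.0 when nothing is predicted is a convention chosen for this sketch.

```python
def sn_sp_curve(scored, thresholds):
    """Compute (threshold, Sn, Sp) for each threshold.

    scored: list of (score, is_real_signal) pairs.
    Sn = TP / (TP + FN), Sp = TP / (TP + FP), as defined on the slides.
    """
    points = []
    for t in thresholds:
        tp = sum(1 for s, real in scored if s >= t and real)
        fp = sum(1 for s, real in scored if s >= t and not real)
        fn = sum(1 for s, real in scored if s < t and real)
        sn = tp / (tp + fn) if tp + fn else 0.0
        sp = tp / (tp + fp) if tp + fp else 1.0  # convention: no predictions
        points.append((t, sn, sp))
    return points
```

Plotting the second column against the third for a dense range of thresholds gives the curve described on the slide.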

Performance of long-range alignment programs for exon discovery (human - mouse comparison)

DIALIGN alignment of tomato and Thaliana genomic sequences

AGenDA: Alignment-based Gene Detection Algorithm. Bridge small gaps between DIALIGN fragments -> cluster of fragments. Search for conserved splice sites and start/stop codons at cluster boundaries to identify candidate exons. Recursive algorithm finds biologically consistent chain of potential exons
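The first step, bridging small gaps between fragments to form clusters, can be sketched as an interval merge. This is an illustration in the coordinates of a single sequence, with half-open `(start, end)` intervals and a hypothetical `max_gap` parameter; it is not AGenDA's actual implementation, which works on aligned sequence pairs.

```python
def cluster_fragments(intervals, max_gap):
    """Merge fragments whose separating gap is at most max_gap.

    intervals: list of half-open (start, end) fragment positions.
    Returns the clusters as a sorted list of (start, end) tuples.
    """
    clusters = []
    for start, end in sorted(intervals):
        if clusters and start - clusters[-1][1] <= max_gap:
            # Gap is small enough: extend the current cluster.
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            clusters.append([start, end])
    return [tuple(c) for c in clusters]
```

The cluster boundaries are then the places where the splice-site and start/stop codon search described on the slide would begin.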

Identification of candidate exons Fragments in DIALIGN alignment

Identification of candidate exons Build cluster of fragments

Identification of candidate exons Identify conserved splice sites

Identification of candidate exons Candidate exons bounded by conserved splice sites

Construct gene models using candidate exons Score of candidate exon (E) based on DIALIGN scores for fragments, score of splice junctions and penalty for shortening / extending Find biologically consistent chain of candidate exons (starting with start codon, ending with stop codon, no internal stop codons …) with maximal total score

Find optimal consistent chain of candidate exons

atggtaggtagtgaatgtga

Find optimal consistent chain of candidate exons atggtaggtagtgaatgtga G1G2

Find optimal consistent chain of candidate exons Recursive algorithm calculates optimal chain of candidate exons in O(N log N) time
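One way to reach O(N log N) for chaining scored, non-overlapping candidates is the classic weighted interval scheduling recursion with binary search. The sketch below swaps in that generic technique as an illustration; it is not taken from AGenDA, and it ignores the biological constraints named earlier (start codon, stop codon, no internal stop codons). The `(start, end, score)` layout with half-open coordinates is an assumption.

```python
from bisect import bisect_right


def best_exon_chain(exons):
    """Return (total_score, chain) of non-overlapping candidate exons.

    exons: list of (start, end, score), half-open coordinates.
    Sorting plus one binary search per exon gives O(N log N).
    """
    exons = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    best = [0.0] * (len(exons) + 1)  # best[k]: optimum over the first k exons
    take = [False] * len(exons)
    prev = [0] * len(exons)
    for k, (start, _end, score) in enumerate(exons):
        # Number of earlier exons that end at or before this start.
        p = bisect_right(ends, start, 0, k)
        prev[k] = p
        if best[p] + score > best[k]:
            best[k + 1] = best[p] + score
            take[k] = True
        else:
            best[k + 1] = best[k]
    chain = []
    k = len(exons)
    while k > 0:
        if take[k - 1]:
            chain.append(exons[k - 1])
            k = prev[k - 1]
        else:
            k -= 1
    return best[-1], chain[::-1]
```

Adding reading-frame or codon constraints would require extra state per table entry, which this sketch deliberately leaves out.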

DIALIGN fragments

Candidate exons

Gene model

Result: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)

AGenDA GenScan 64 % 12 % 17 % Result: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)

Alignment of large genomic sequences DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004). Alignment of Hox gene cluster: DIALIGN able to identify small regulatory elements, but entire genes totally mis-aligned. Reason for mis-alignment: duplications!

Alignment of large genomic sequences The Hox gene cluster: 4 Hox gene clusters in pufferfish. 14 genes, different genes in different clusters!

Alignment of large genomic sequences The Hox gene cluster: Complete mis-alignment of entire genes!

Alignment of sequence duplications [figure: sequences S1, S2]

Conserved motifs; no similarity outside motifs [figure: sequences S1, S2]

Alignment of sequence duplications Duplication in two sequences [figure: sequences S1, S2]

Alignment of sequence duplications Mis-alignment would have lower score! [figure: sequences S1, S2]

Alignment of sequence duplications Duplication in one sequence [figure: sequences S1, S2]

Alignment of sequence duplications Duplication in one sequence: possible mis-alignment [figure: sequences S1, S2]

Alignment of sequence duplications Duplication in one sequence [figure: sequences S1, S2, S3]

Alignment of sequence duplications Consistency problem [figure: sequences S1, S2, S3]

Alignment of sequence duplications More plausible alignment, and higher score [figure: sequences S1, S2, S3]

Alignment of sequence duplications Alternative alignment; probably biologically wrong; lower numerical score! [figure: sequences S1, S2, S3]

Anchored sequence alignment Biologically meaningful alignment often not possible by automated approaches. Idea: use expert knowledge to guide alignment procedure. User defines a set of anchor points that are to be "respected" by the alignment procedure

Anchored sequence alignment NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

Anchored sequence alignment NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT Use known homology as anchor point Anchor point = anchored fragment (gap-free pair of segments) Remainder of sequences aligned automatically

Anchored sequence alignment NLF VALYDFVASG DNTLSITKGE klrvlgynhn iihredkGVI YALWDYEPQN DDELPMKEGD cmt Anchored alignment

Anchored sequence alignment NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS Anchor points in multiple alignment

Anchored sequence alignment NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQND DELPMKEGDCMT GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS Anchor points in multiple alignment

Anchored sequence alignment NLF V-ALYDFVAS GD NTLSITKGEk lrvLGYNhn iihredkGVI Y-ALWDYEPQ ND DELPMKEGDC MT GYQ YrALYDYKKE REedidlhlg DILTVNKGSL VA-LGFS-- Anchored multiple alignment

Algorithmic questions Goal: Find optimal alignment (= consistent set of fragments) under constraints given by user-specified anchor points!

Algorithmic questions Additional input file with anchor points:

NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS

Algorithmic questions Additional input file with anchor points; each entry gives: sequences, start positions, length, score
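Reading such an anchor file could look like the sketch below. The slide only gives the order of the fields (sequences, start positions, length, score); that each anchor sits on one line with six whitespace-separated values is an assumption made here, and the `Anchor` type and `parse_anchors` function are hypothetical names.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    seq1: int      # index of the first sequence
    seq2: int      # index of the second sequence
    start1: int    # start position in the first sequence
    start2: int    # start position in the second sequence
    length: int    # length of the gap-free anchored fragment
    score: float   # user-defined priority score


def parse_anchors(text):
    """Parse anchor lines; blank lines are skipped (assumed layout)."""
    anchors = []
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 6:
            raise ValueError(f"line {number}: expected 6 fields, got {len(fields)}")
        s1, s2, b1, b2, ln = map(int, fields[:5])
        anchors.append(Anchor(s1, s2, b1, b2, ln, float(fields[5])))
    return anchors
```

For example, `parse_anchors("1 2 5 11 4 4.5")` yields a single anchor between sequences 1 and 2 of length 4 with score 4.5.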

Requirements: Anchor points need to be consistent! – if necessary: select consistent subset from user-specified anchor points

Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Inconsistent anchor points!

Algorithmic questions atctaat---agttaaactcccccgtgcttag Cagtgcgtgtattac-taacggttcaatcgcg caaagagtatcacccctgaattgaataa Inconsistent anchor points!

Algorithmic questions Requirements: Anchor points need to be consistent! – if necessary: select consistent subset from user-specified anchor points Find alignment under constraints given by anchor points!

Algorithmic questions Use data structures from multiple alignment

Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Greedy procedure for multiple alignment

Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Question: which positions are still alignable ?

Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa [figure: sequence S_i and position x marked] For each position x and each sequence S_i there exist an upper bound ub(x,i) and a lower bound lb(x,i) for the residues y in S_i that are alignable with x

Algorithmic questions [figure: sequence S_i and position x marked] Initial values of lb(x,i), ub(x,i)

Algorithmic questions [figure: sequence S_i and position x marked] ub(x,i) and lb(x,i) updated during greedy procedure
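The idea of the bounds lb(x,i) and ub(x,i) can be illustrated for the simplest case of two sequences. The sketch recomputes the bounds from the list of already aligned position pairs instead of updating them incrementally, and it does not handle the transitive effects that arise with three or more sequences; the function name, 0-based positions, and inclusive bounds are assumptions made here.

```python
def alignable_bounds(x, aligned_pairs, len2):
    """Return (lb, ub): inclusive range of positions in S2 alignable with x.

    x: position in sequence S1 (0-based).
    aligned_pairs: (p, q) pairs, position p in S1 already aligned with q in S2.
    len2: length of sequence S2.
    """
    lb, ub = 0, len2 - 1
    for p, q in aligned_pairs:
        if p == x:
            return q, q            # x is already aligned: only q remains
        if p < x:
            lb = max(lb, q + 1)    # x must align to the right of q
        else:
            ub = min(ub, q - 1)    # x must align to the left of q
    return lb, ub
```

A fragment is consistent with the existing alignment exactly when each of its positions falls inside the bounds of its partner position.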

Algorithmic questions Anchor points treated like fragments in greedy algorithm: sorted according to user-defined scores; accepted if consistent with previously accepted anchors; ub(x,i) and lb(x,i) updated during greedy procedure. Resulting values of ub(x,i) and lb(x,i) used as initial values for alignment procedure
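The greedy selection of a consistent anchor subset can be sketched for two sequences as follows. Anchors are taken as `(start1, start2, length, score)` tuples, and two anchors count as consistent when one lies entirely before the other in both sequences; both the layout and this pairwise consistency test are simplifying assumptions, since the real procedure works on multiple sequences via the lb/ub bounds.

```python
def select_anchors(anchors):
    """Greedily pick a consistent subset of anchors, best score first.

    anchors: list of (start1, start2, length, score) for two sequences.
    Returns the accepted anchors sorted by position.
    """
    def compatible(a, b):
        a1, a2, al, _ = a
        b1, b2, bl, _ = b
        a_before_b = a1 + al <= b1 and a2 + al <= b2
        b_before_a = b1 + bl <= a1 and b2 + bl <= a2
        return a_before_b or b_before_a

    accepted = []
    for cand in sorted(anchors, key=lambda a: a[3], reverse=True):
        # Accept only if consistent with every previously accepted anchor.
        if all(compatible(cand, acc) for acc in accepted):
            accepted.append(cand)
    return sorted(accepted)
```

Because higher-scoring anchors are considered first, a low-priority anchor that crosses a high-priority one is the one that gets dropped.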

Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa [figure: sequence S_i and position x marked] Initial values of lb(x,i), ub(x,i) calculated using anchor points

Algorithmic questions Ranking of anchor points to prioritize anchor points, e.g. anchor points from verified homologies: higher priority; automatically created anchor points (using CHAOS, BLAST, ...): lower priority

Application: Hox gene cluster

Use gene boundaries as anchor points

Application: Hox gene cluster Use gene boundaries as anchor points + CHAOS / BLAST hits

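Application: Hox gene cluster Comparison of alignments without and with anchoring [table: aligned columns per number of sequences and alignment score; these values are not recoverable]. CPU time: 4:22 (no anchoring), 0:19 (anchoring)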

Application: Hox gene cluster Example: Teleost Hox gene cluster: Score of anchored alignment 15 % higher than score of non-anchored alignment! Conclusion: Greedy optimization algorithm does a bad job!

Application: Improvement of alignment programs Two possible reasons for mis-alignments: (1) Wrong objective function: biologically correct alignment gets bad numerical score. (2) Bad optimization algorithm: biologically correct alignment gets best numerical score, but algorithm fails to find this alignment

Application: Improvement of Alignment programs Two possible reasons for mis-alignments: Anchored alignments can help to decide

Application: RNA alignment

aa----CCCC AGC---GUAa gucgcuaucc a cacucuCCCA AGC---GGAG Aac ccg----CCA AaagauGGCG Acuuga non-anchored alignment

Application: RNA alignment aa----CCCC AGC---GUAa gucgcuaucc a cacucuCCCA AGC---GGAG Aac ccg----CCA AaagauGGCG Acuuga structural motif mis-aligned

Application: RNA alignment aaCCCCAGCG UAAGUCGCUA UCca-- --CACUCUCC CAAGCGGAGA AC CCGCCA AAAGAUGGCG ACuuga 3 conserved nucleotides as anchor points

WWW interface at GOBICS (Göttingen Bioinformatics Compute Server)