VL Algorithmische BioInformatik (19710) WS2015/2016 Woche 7 - Mittwoch Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie.

Slides:

Advertisements

Similar presentations

Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune

Advertisements

Comparative genomics: Overview & Tools Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune

Basics of Comparative Genomics Dr G. P. S. Raghava.

Next Generation Sequencing, Assembly, and Alignment Methods

1 Computational Molecular Biology MPI for Molecular Genetics DNA sequence analysis Gene prediction Gene prediction methods Gene indices Mapping cDNA on.

CS273a Lecture 4, Autumn 08, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector.

DNA Sequencing Lecture 9, Tuesday April 29, 2003.

CS262 Lecture 9, Win07, Batzoglou Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm.

CS262 Lecture 11, Win07, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the.

Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

DNA Sequencing. The Walking Method 1.Build a very redundant library of BACs with sequenced clone- ends (cheap to build) 2.Sequence some “seed” clones.

DNA Sequencing Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the circular genome (host)

DNA Sequencing.

DNA Sequencing. CS273a Lecture 3, Spring 07, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.

DNA Sequencing. Next few topics DNA Sequencing  Sequencing strategies Hierarchical Online (Walking) Whole Genome Shotgun  Sequencing Assembly Gene Recognition.

DNA Sequencing and Assembly

DNA Sequencing.

CS273a Lecture 4, Autumn 08, Batzoglou Fragment Assembly (in whole-genome shotgun sequencing) CS273a Lecture 5.

Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..

Sequencing and Assembly Cont’d. CS273a Lecture 5, Aut08, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..

DNA Sequencing. CS273a Lecture 3, Spring 07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..

DNA Sequencing. CS262 Lecture 9, Win07, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.

CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.

DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA.

DNA Sequencing.

DNA Sequencing. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT.

DNA Sequencing. CS262 Lecture 9, Win06, Batzoglou DNA Sequencing – gel electrophoresis 1.Start at primer(restriction site) 2.Grow DNA chain 3.Include.

DNA Sequencing. CS262 Lecture 9, Win06, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.

CS273a Lecture 2, Autumn 10, Batzoglou DNA Sequencing (cont.)

DNA Sequencing. CS273a Lecture 3, Autumn 08, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.

CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics.

CS273a Lecture 4, Autumn 08, Batzoglou DNA Sequencing.

Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.

CS262 Lecture 9, Win07, Batzoglou Conditional Random Fields A brief description.

Short Primer on Comparative Genomics Today: Special guest lecture 12pm, Alway M108 Comparative genomics of animals and plants Adam Siepel Assistant Professor.

DNA Sequencing. Next few topics DNA Sequencing  Sequencing strategies Hierarchical Online (Walking) Whole Genome Shotgun  Sequencing Assembly Gene Recognition.

Genome sequencing and assembling

Phylogenetic Tree Construction and Related Problems Bioinformatics.

Sequencing a genome and Basic Sequence Alignment

TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.

De-novo Assembly Day 4.

How to Build a Horse Megan Smedinghoff.

Mouse Genome Sequencing

CS 394C March 19, 2012 Tandy Warnow.

What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.

CS273a Lecture 4, Autumn 08, Batzoglou CS273a 2011 DNA Sequencing.

Genome Alignment. Alignment Methods Needleman-Wunsch (global) and Smith- Waterman (local) use dynamic programming Guaranteed to find an optimal alignment.

Genome sequencing Haixu Tang School of Informatics.

CZ5225: Modeling and Simulation in Biology Lecture 9: Next Generation Sequencing Prof. Chen Yu Zong Tel:

A Sequenciação em Análises Clínicas Polymerase Chain Reaction.

Sequencing a genome and Basic Sequence Alignment

Ch. 21 Genomes and their Evolution. New approaches have accelerated the pace of genome sequencing The human genome project began in 1990, using a three-stage.

Chapter 21 Eukaryotic Genome Sequences

Bioinformatic Tools for Comparative Genomics of Vectors Comparative Genomics.

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.

CS273a Lecture 4, Autumn 08, Batzoglou CS273a 2013 DNA Structure.

DNA Sequencing.

Johnson - The Living World: 3rd Ed. - All Rights Reserved - McGraw Hill Companies Genomics Chapter 10 Copyright © McGraw-Hill Companies Permission required.

1 MAVID: Constrained Ancestral Alignment of Multiple Sequence Author: Nicholas Bray and Lior Pachter.

Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of.

DNA Sequencing.

DNA Sequencing Project

Fragment Assembly (in whole-genome shotgun sequencing)

Basics of Comparative Genomics

Genomes and Their Evolution

Genomes and Their Evolution

A Sequenciação em Análises Clínicas

Basics of Comparative Genomics

Presentation transcript:

VL Algorithmische BioInformatik (19710) WS2015/2016 Woche 7 - Mittwoch Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie Universität Berlin Including slides by Serafim Batzoglou

Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 Vorlesungsthemen Part 1: Background Basics (4) 1. The Nucleic Acid World 2. Protein Structure 3. Dealing with Databases Part 2: Sequence Alignments (3) 4. Producing and Analyzing Sequence Alignments 5. Pairwise Sequence Alignment and Database Searching 6. Patterns, Profiles, and Multiple Alignments Part 3: Evolutionary Processes (3) 7. Recovering Evolutionary History 8. Building Phylogenetic Trees Part 4: Genome Characteristics (4) 9. Revealing Genome Features 10. Gene Detection and Genome Annotation Part 5: Secondary Structures (4) 11. Obtaining Secondary Structure from Sequence 12. Predicting Secondary Structures Part 6: Tertiary Structures (4) 13. Modeling Protein Structure 14. Analyzing Structure-Function Relationships Part 7: Cells and Organisms (8) 15. Proteome and Gene Expression Analysis 16. Clustering Methods and Statistics 17. Systems Biology 2

“If you take a sequence and just run a gene prediction program on it, the programs don’t usually do very well. But if you take human and mouse sequence, and compare them against each other — looking for similar regions — you get better predictions. And the more genomes we have, the better it will get.” Eugene Myers Was Professor at University of California, Berkeley Now at MPI Dresden His BLAST paper (1990) was cited more that times (among the most highly cited papers ever) Inventor of suffix array data structure Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 3

Two (main) Paradigms of Biology Molecular Paradigm Evolution Paradigm Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 4

Comparison of 1196 orthologous genes Sequence identity between genes in human/mouse –exons: 84.6% –protein: 85.4% –introns: 35% –5’ UTRs: 67% –3’ UTRs: 69% 27 proteins were 100% identical. Makalowski et al., 1996 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 5

Full Genomes are available Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 6

Full Genomes are available Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 7

Sequencing Growth Cost of one human genome 2004: $30,000, : $100, : $10, : $2,000 (today) Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 8

100 million species Each individual has different DNA Within individual, some cells have different DNA (e.g. cancer) Sequencing Applications Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 9

Typical Bioinformatics Flow Chart 6. Gene & Protein expression data 7. Drug screening Ab initio drug design OR Drug compound screening in database of molecules 8. Genetic variability 1a. Sequencing 1b. Analysis of nucleic acid seq. 2. Analysis of protein seq. (e.g. gene finding) 3. Molecular structure prediction 4. Molecular interaction 5. Metabolic and regulatory networks Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

So, if a new species/individual is available there are two tasks: Tim Conrad, VL Algorithmische Bioinformatik, WS2015/ Generate genomics data – Sequencing and Assembly 2.Compare to available data (genomes) – Improve gene finder – Evolution of organisms? – Why do certain bacteria cause diseases while others do not? – Identification and prioritization of drug targets 11

Genome Assembly Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Some Terminology read a long word that comes out of sequencer mate pair a pair of reads from two ends of the same insert fragment contig a contiguous sequence formed by several overlapping reads with no gaps supercontig an ordered and oriented set (scaffold) of contigs, usually by mate pairs consensus sequence derived from the sequene multiple alignment of reads in a contig Steps to Assemble a Genome..ACGATTACAATAGGTT.. Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Shotgun Sequencing cut many times at random (Shotgun) genomic segment Get one or two reads from each segment Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Reconstructing the Sequence (Fragment Assembly) Cover region with high redundancy Overlap & extend reads to reconstruct the original genomic region reads Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Definition of Coverage Length of genomic segment:G Number of reads: N Length of each read:L Definition: Coverage C = N * L / G How much coverage is enough? Lander-Waterman model:Prob[ not covered bp ] = e -C Assuming uniform distribution of reads, C=10 results in 1 gapped region /1,000,000 nucleotides C Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Repeats Bacterial genomes:5% Mammals:50% Repeat types: Low-Complexity DNA(e.g. ATATATATACATA…) Microsatellite repeats (a 1 …a k ) N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG) Transposons – SINE (Short Interspersed Nuclear Elements) e.g., ALU: ~300-long, 10 6 copies – LINE (Long Interspersed Nuclear Elements) ~4000-long, 200,000 copies – LTR retroposons (Long Terminal Repeats (~700 bp) at each end) cousins of HIV Gene Families genes duplicate & then diverge (paralogs) Recent duplications ~100,000-long, very similar copies Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Sequencing and Fragment Assembly AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT 3x10 9 nucleotides 50% of human DNA is composed of repeats Error! Glued together two distant regions Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

What can we do about repeats? Two main approaches: Cluster the reads Link the reads Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

What can we do about repeats? Two main approaches: Cluster the reads Link the reads Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

What can we do about repeats? Two main approaches: Cluster the reads Link the reads Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Sequencing and Fragment Assembly AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT 3x10 9 nucleotides C R D ARB, CRD or ARD, CRB ? ARB Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Sequencing and Fragment Assembly AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT 3x10 9 nucleotides Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Strategies for whole-genome sequencing 1.Hierarchical – Clone-by-clone i.Break genome into many long pieces ii.Map each long piece onto the genome iii.Sequence each piece with shotgun Example: Yeast, Worm, Human, Rat 2.Online version of (1) – Walking i.Break genome into many long pieces ii.Start sequencing each piece with shotgun iii.Construct map as you go Example: Rice genome 3.Whole genome shotgun One large shotgun pass on the whole genome Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Dog Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Whole Genome Shotgun Sequencing cut many times at random genome forward-reverse paired reads known dist Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Fragment Assembly (in whole-genome shotgun sequencing) Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT.. 2. Merge some “good” pairs of reads into longer contigs 3. Link contigs to form supercontigs Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

K-mer based overlap Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Sorting k-mers Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

1. Find Overlapping Reads aaactgcagtacggatct aaactgcag aactgcagt … gtacggatct tacggatct gggcccaaactgcagtac gggcccaaa ggcccaaac … actgcagta ctgcagtac gtacggatctactacaca gtacggatc tacggatct … ctactacac tactacaca (read, pos., word, orient.) aaactgcag aactgcagt actgcagta … gtacggatc tacggatct gggcccaaa ggcccaaac gcccaaact … actgcagta ctgcagtac gtacggatc tacggatct acggatcta … ctactacac tactacaca (word, read, orient., pos.) aaactgcag aactgcagt acggatcta actgcagta cccaaactg cggatctac ctactacac ctgcagtac gcccaaact ggcccaaac gggcccaaa gtacggatc tacggatct tactacaca Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

1. Find Overlapping Reads Find pairs of reads sharing a k-mer, k ~ 24 Extend to full alignment – throw away if not >98% similar TAGATTACACAGATTAC ||||||||||||||||| T GA TAGA | || TACA TAGT || Caveat: repeats  A k-mer that occurs N times, causes O(N 2 ) read/read comparisons  ALU k-mers could cause up to 1,000,000 2 comparisons Solution:  Discard all k-mers that occur “ too often ” Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

1. Find Overlapping Reads Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

1. Find Overlapping Reads Correct errors using multiple alignment TAGATTACACAGATTACTGA TAGATTACACAGATTATTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTACTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA insert A replace T with C correlated errors— probably caused by repeats  disentangle overlaps TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA In practice, error correction removes up to 98% of the errors Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

2. Merge Reads into Contigs Overlap graph: – Nodes: reads r 1 …..r n – Edges: overlaps (r i, r j, shift, orientation, score) Note: of course, we don’t know the “color” of these nodes Reads that come from two regions of the genome (blue and red) that contain the same repeat Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

2. Merge Reads into Contigs We want to merge reads up to potential repeat boundaries repeat region Unique Contig Overcollapsed Contig Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

2. Merge Reads into Contigs Remove transitively inferable overlaps – If read r overlaps to the right reads r 1, r 2, and r 1 overlaps r 2, then (r, r 2 ) can be inferred by (r, r 1 ) and (r 1, r 2 ) rr1r1 r2r2 r3r3 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

2. Merge Reads into Contigs Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Overlap graph after forming contigs Unitigs: Gene Myers, 95 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Repeats, errors, and contig lengths Repeats shorter than read length are easily resolved – Read that spans across a repeat disambiguates order of flanking regions Repeats with more base pair diffs than sequencing error rate are OK – We throw overlaps between two reads in different copies of the repeat To make the genome appear less repetitive, try to: – Increase read length – Decrease sequencing error rate Role of error correction: Discards up to 98% of single-letter sequencing errors decreases error rate  decreases effective repeat content  increases contig length Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

3. Link Contigs into Supercontigs Too dense  Overcollapsed Inconsistent links  Overcollapsed? Normal density Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Find all links between unique contigs 3. Link Contigs into Supercontigs Connect contigs incrementally, if  2 forward-reverse links supercontig (aka scaffold) Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Fill gaps in supercontigs with paths of repeat contigs Complex algorithmic step Exponential number of paths Forward-reverse links 3. Link Contigs into Supercontigs Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

De Brujin Graph formulation Given sequence x 1 …x N, k-mer length k, Graph of 4 k vertices, Edges between words with (k-1)-long overlap Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

4. Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting (Alternative: take maximum-quality letter) Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Genome Comparison Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Structure of tryptophan operon Numbers: Gene number Arrows: Direction of transcription //: Dispersion of operon by 50 genes Domain fusion trpD and trpG trpF and trpC trpB and trpA genetically linked separate genes Dandekar et al., 1998

Important observations with regard to gene order Order is highly conserved in closely related species but gets changed by rearrangements With more evolutionary distance, no correspondence between the gene order of orthologous genes Group of genes having similar biochemical function tend to remain localized – Genes required for synthesis of tryptophan (trp genes) in E. coli and other prokaryotes Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

CG: Unit of comparison Number Content (sequence) Location (map position) Gene Order Gene Cluster (genes that are part of a known metabolic pathway, are found to exist as a group) Colinearity of gene order is referred as synteny A conserved group of genes in the same order in two genomes is a syntenic groups Translocation: movement of genomic part from one position to another Tim Conrad, VL Algorithmische Bioinformatik, WS2015/ Unit of comparison: Gene/Genome

Synteny Refers to regions of two genomes that show considerable similarity in terms of – sequence and – conservation of the order of genes Likely to be related by common descent Tim Conrad, VL Algorithmische Bioinformatik, WS2015/ syn- = together, tenia = ribbon, band synteny : the relative ordering of genes on the same chromosomes

Major mechanisms of reshuffling of synteny are: inversions and transpositions Noise on synteny is caused by: insertions, duplications, and deletions Blocks of synteny: long stretches of DNA where the relative ordering of orthologous genes is conserved Synteny allows for annotation of non-coding sequences and identification of homologous intergenetic regions Synteny Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Visualizing Synteny Tim Conrad, VL Algorithmische Bioinformatik, WS2015/ Cat on Human

ACTTTTTGGGAC inversiontransposition GGG duplication GGG ACTTTTTGGTATATACATGTAGTACAAATAATCGAACCCCCGGACGGG TATATACATGTAGTACAAATAATCGAACCCCCG deletion TATATA Problem: Evolution of Genes and Chromosomes Tim Conrad, VL Algorithmische Bioinformatik, WS2015/ Idea: Align the entire genomes of two closely related organisms (billions of bps) AAATAATCGAACCCCCGACTTTTTGGTATATACATGTAGTACGACGGGCATGTAGTACAAATAATAACCCCCGACTTTTTGGTATATACGGACGGG Splitting into new chromosomes Reshuffling of genes over the chromosomes

More Problems: Sequencing & Gene Detection Tim Conrad, VL Algorithmische Bioinformatik, WS2015/ Missed Intron Missed ORF beginning or ending Missed ORF?

59 Two genomes can be formed by many smaller syntenic blocks rearranged by inversions or transpositions Can we define a metric for the syntenic distance of these two genomes? We are not interested in nucleotide differences but in the number of genomic rearrangements that separate the two species. METRIC = smallest number of operations (=inversion or transposition) that transform one genome into the other A metric for the syntenic distance Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Comparative Genomics Tools BLAST (compare one string to a genome) MUMmer (pairwise alignment) – “MUMmer 3 is the latest version, and is downloaded roughly 300 times every month. That's over 7,000 users in the 2 years after its release. In addition, the three versions of MUMmer have a combined citation count of over 700 papers.” (MUMmer homepage) PipMaker (reference and multiple samples) AVID/VISTA Mauve (multiple alignment) Comparisons and analyses at both – Nucleic acid and protein level Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

61 MUMmer (Maximal Unique Match) MUMmer, a system to align and compare very large DNA & Protein sequences in linear time Fast pair-wise comparison of draft or complete genomes using nucleotide or 6- frame translated sequences History – MUMmer 1.0, 1999 – MUMmer 2.1, 2002 – MUMmer 3.0, 2004 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Overview: MUMs String A: …CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAGSC TGAAGTGCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAG TCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGTC GAGGCTTTAGAGCTAGCTAGCGCGATCGAGGCTAGCTATGCTAGCTATCA TCGCAAGCTAGCTGAGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCTA GTCGAGCTGATCGTAGCTAGTAATGTATCCATCTACTCTAGTAGATCGATT AGTCGATCGATGCTAGATCGGATCGAGTCGAGATCGATGGAGTCGAGAT CGATCTAATCTATCTCTAAATGGAGCGA… String B: …GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCATC TATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGGA GTCCAAACTCGCAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGTC GCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCT AGTAATGTATCATAGCTAATCGCACTACTACGATGCGATCTCTAGTCGATC TATCTCGGCTTCGATCGTA… How align without Needleman/Wunsch ?

Overview: MUMs String A: …CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAGSC TGAAGTGCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAG TCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGTC GAGGCTTTAGAGCTAGCTAGCGCGATCGAGGCTAGCTATGCTAGCTATCA TCGCAAGCTAGCTGAGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCTA GTCGAGCTGATCGTAGCTAGTAATGTATCCATCTACTCTAGTAGATCGATT AGTCGATCGATGCTAGATCGGATCGAGTCGAGATCGATGGAGTCGAGAT CGATCTAATCTATCTCTAAATGGAGCGA… String B: …GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCAT CTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGG AGTCCAAACTCGCAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGT CGCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAG CTAGTAATGTATCATAGCTAATCGCACTACTACGATGCGATCTCTAGTCGA TCTATCTCGGCTTCGATCGTA… Easier with large exact matches highlighted?

Overview: MUMs Any optimal global alignment will probably use these two subsequences as “anchors” This is the shortcut needed to calculate global alignment quickly on large sequences Very intuitive in alignment process, but only a heuristic

Features of MUMmer The algorithm assumes that sequences are closely related Outputs: – Base to base alignment – Highlights the exact matches and differences in the genomes – Locates SNPs Large inserts Significant repeats Tandem repeats and reversals Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

66 MUMmer 1.0 Inputs: Two DNA sequences and length of the shortest MUM (Maximal Unique Match) Output: A base-to-base alignment of input sequences, highlights SNPs, large inserts, significant repeats, transpositions and reversals Techniques: Suffix trees, LIS (Longest Increasing Subsequences), Smith-Waterman alignment algorithm Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

MUMmer: Algorithm Read two genomes Perform Maximum Unique Match (MUM) of genomes using suffix tree Sort and order the MUMs using LIS Close the gaps in the Alignment Using SNPs, mutation regions, repeats, tandem repeats Output alignment MUMs regions that do not match exactly Tim Conrad, VL Algorithmische Bioinformatik, WS2015/

Tim Conrad AG Medical Bioinformatics Mehr Informationen im Internet unter medicalbioinformatics.de/teaching Vielen Dank! Weitere Fragen