VL Algorithmische BioInformatik (19710) WS2015/2016 Woche 7 - Mittwoch Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie Universität Berlin Including slides by Serafim Batzoglou
Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 Vorlesungsthemen Part 1: Background Basics (4) 1. The Nucleic Acid World 2. Protein Structure 3. Dealing with Databases Part 2: Sequence Alignments (3) 4. Producing and Analyzing Sequence Alignments 5. Pairwise Sequence Alignment and Database Searching 6. Patterns, Profiles, and Multiple Alignments Part 3: Evolutionary Processes (3) 7. Recovering Evolutionary History 8. Building Phylogenetic Trees Part 4: Genome Characteristics (4) 9. Revealing Genome Features 10. Gene Detection and Genome Annotation Part 5: Secondary Structures (4) 11. Obtaining Secondary Structure from Sequence 12. Predicting Secondary Structures Part 6: Tertiary Structures (4) 13. Modeling Protein Structure 14. Analyzing Structure-Function Relationships Part 7: Cells and Organisms (8) 15. Proteome and Gene Expression Analysis 16. Clustering Methods and Statistics 17. Systems Biology 2
“If you take a sequence and just run a gene prediction program on it, the programs don’t usually do very well. But if you take human and mouse sequence, and compare them against each other — looking for similar regions — you get better predictions. And the more genomes we have, the better it will get.” Eugene Myers Was Professor at University of California, Berkeley Now at MPI Dresden His BLAST paper (1990) was cited more that times (among the most highly cited papers ever) Inventor of suffix array data structure Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 3
Two (main) Paradigms of Biology Molecular Paradigm Evolution Paradigm Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 4
Comparison of 1196 orthologous genes Sequence identity between genes in human/mouse –exons: 84.6% –protein: 85.4% –introns: 35% –5’ UTRs: 67% –3’ UTRs: 69% 27 proteins were 100% identical. Makalowski et al., 1996 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 5
Full Genomes are available Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 6
Full Genomes are available Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 7
Sequencing Growth Cost of one human genome 2004: $30,000, : $100, : $10, : $2,000 (today) Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 8
100 million species Each individual has different DNA Within individual, some cells have different DNA (e.g. cancer) Sequencing Applications Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 9
Typical Bioinformatics Flow Chart 6. Gene & Protein expression data 7. Drug screening Ab initio drug design OR Drug compound screening in database of molecules 8. Genetic variability 1a. Sequencing 1b. Analysis of nucleic acid seq. 2. Analysis of protein seq. (e.g. gene finding) 3. Molecular structure prediction 4. Molecular interaction 5. Metabolic and regulatory networks Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
So, if a new species/individual is available there are two tasks: Tim Conrad, VL Algorithmische Bioinformatik, WS2015/ Generate genomics data – Sequencing and Assembly 2.Compare to available data (genomes) – Improve gene finder – Evolution of organisms? – Why do certain bacteria cause diseases while others do not? – Identification and prioritization of drug targets 11
Genome Assembly Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Some Terminology read a long word that comes out of sequencer mate pair a pair of reads from two ends of the same insert fragment contig a contiguous sequence formed by several overlapping reads with no gaps supercontig an ordered and oriented set (scaffold) of contigs, usually by mate pairs consensus sequence derived from the sequene multiple alignment of reads in a contig Steps to Assemble a Genome..ACGATTACAATAGGTT.. Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Shotgun Sequencing cut many times at random (Shotgun) genomic segment Get one or two reads from each segment Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Reconstructing the Sequence (Fragment Assembly) Cover region with high redundancy Overlap & extend reads to reconstruct the original genomic region reads Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Definition of Coverage Length of genomic segment:G Number of reads: N Length of each read:L Definition: Coverage C = N * L / G How much coverage is enough? Lander-Waterman model:Prob[ not covered bp ] = e -C Assuming uniform distribution of reads, C=10 results in 1 gapped region /1,000,000 nucleotides C Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Repeats Bacterial genomes:5% Mammals:50% Repeat types: Low-Complexity DNA(e.g. ATATATATACATA…) Microsatellite repeats (a 1 …a k ) N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG) Transposons – SINE (Short Interspersed Nuclear Elements) e.g., ALU: ~300-long, 10 6 copies – LINE (Long Interspersed Nuclear Elements) ~4000-long, 200,000 copies – LTR retroposons (Long Terminal Repeats (~700 bp) at each end) cousins of HIV Gene Families genes duplicate & then diverge (paralogs) Recent duplications ~100,000-long, very similar copies Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Sequencing and Fragment Assembly AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT 3x10 9 nucleotides 50% of human DNA is composed of repeats Error! Glued together two distant regions Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
What can we do about repeats? Two main approaches: Cluster the reads Link the reads Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
What can we do about repeats? Two main approaches: Cluster the reads Link the reads Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
What can we do about repeats? Two main approaches: Cluster the reads Link the reads Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Sequencing and Fragment Assembly AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT 3x10 9 nucleotides C R D ARB, CRD or ARD, CRB ? ARB Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Sequencing and Fragment Assembly AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT 3x10 9 nucleotides Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Strategies for whole-genome sequencing 1.Hierarchical – Clone-by-clone i.Break genome into many long pieces ii.Map each long piece onto the genome iii.Sequence each piece with shotgun Example: Yeast, Worm, Human, Rat 2.Online version of (1) – Walking i.Break genome into many long pieces ii.Start sequencing each piece with shotgun iii.Construct map as you go Example: Rice genome 3.Whole genome shotgun One large shotgun pass on the whole genome Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Dog Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Whole Genome Shotgun Sequencing cut many times at random genome forward-reverse paired reads known dist Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Fragment Assembly (in whole-genome shotgun sequencing) Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT.. 2. Merge some “good” pairs of reads into longer contigs 3. Link contigs to form supercontigs Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016
K-mer based overlap Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016
Sorting k-mers Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016
1. Find Overlapping Reads aaactgcagtacggatct aaactgcag aactgcagt … gtacggatct tacggatct gggcccaaactgcagtac gggcccaaa ggcccaaac … actgcagta ctgcagtac gtacggatctactacaca gtacggatc tacggatct … ctactacac tactacaca (read, pos., word, orient.) aaactgcag aactgcagt actgcagta … gtacggatc tacggatct gggcccaaa ggcccaaac gcccaaact … actgcagta ctgcagtac gtacggatc tacggatct acggatcta … ctactacac tactacaca (word, read, orient., pos.) aaactgcag aactgcagt acggatcta actgcagta cccaaactg cggatctac ctactacac ctgcagtac gcccaaact ggcccaaac gggcccaaa gtacggatc tacggatct tactacaca Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
1. Find Overlapping Reads Find pairs of reads sharing a k-mer, k ~ 24 Extend to full alignment – throw away if not >98% similar TAGATTACACAGATTAC ||||||||||||||||| T GA TAGA | || TACA TAGT || Caveat: repeats A k-mer that occurs N times, causes O(N 2 ) read/read comparisons ALU k-mers could cause up to 1,000,000 2 comparisons Solution: Discard all k-mers that occur “ too often ” Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
1. Find Overlapping Reads Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
1. Find Overlapping Reads Correct errors using multiple alignment TAGATTACACAGATTACTGA TAGATTACACAGATTATTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTACTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA insert A replace T with C correlated errors— probably caused by repeats disentangle overlaps TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA In practice, error correction removes up to 98% of the errors Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
2. Merge Reads into Contigs Overlap graph: – Nodes: reads r 1 …..r n – Edges: overlaps (r i, r j, shift, orientation, score) Note: of course, we don’t know the “color” of these nodes Reads that come from two regions of the genome (blue and red) that contain the same repeat Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
2. Merge Reads into Contigs We want to merge reads up to potential repeat boundaries repeat region Unique Contig Overcollapsed Contig Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
2. Merge Reads into Contigs Remove transitively inferable overlaps – If read r overlaps to the right reads r 1, r 2, and r 1 overlaps r 2, then (r, r 2 ) can be inferred by (r, r 1 ) and (r 1, r 2 ) rr1r1 r2r2 r3r3 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
2. Merge Reads into Contigs Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Overlap graph after forming contigs Unitigs: Gene Myers, 95 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Repeats, errors, and contig lengths Repeats shorter than read length are easily resolved – Read that spans across a repeat disambiguates order of flanking regions Repeats with more base pair diffs than sequencing error rate are OK – We throw overlaps between two reads in different copies of the repeat To make the genome appear less repetitive, try to: – Increase read length – Decrease sequencing error rate Role of error correction: Discards up to 98% of single-letter sequencing errors decreases error rate decreases effective repeat content increases contig length Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
3. Link Contigs into Supercontigs Too dense Overcollapsed Inconsistent links Overcollapsed? Normal density Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Find all links between unique contigs 3. Link Contigs into Supercontigs Connect contigs incrementally, if 2 forward-reverse links supercontig (aka scaffold) Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Fill gaps in supercontigs with paths of repeat contigs Complex algorithmic step Exponential number of paths Forward-reverse links 3. Link Contigs into Supercontigs Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
De Brujin Graph formulation Given sequence x 1 …x N, k-mer length k, Graph of 4 k vertices, Edges between words with (k-1)-long overlap Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
4. Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting (Alternative: take maximum-quality letter) Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Genome Comparison Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Structure of tryptophan operon Numbers: Gene number Arrows: Direction of transcription //: Dispersion of operon by 50 genes Domain fusion trpD and trpG trpF and trpC trpB and trpA genetically linked separate genes Dandekar et al., 1998
Important observations with regard to gene order Order is highly conserved in closely related species but gets changed by rearrangements With more evolutionary distance, no correspondence between the gene order of orthologous genes Group of genes having similar biochemical function tend to remain localized – Genes required for synthesis of tryptophan (trp genes) in E. coli and other prokaryotes Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
CG: Unit of comparison Number Content (sequence) Location (map position) Gene Order Gene Cluster (genes that are part of a known metabolic pathway, are found to exist as a group) Colinearity of gene order is referred as synteny A conserved group of genes in the same order in two genomes is a syntenic groups Translocation: movement of genomic part from one position to another Tim Conrad, VL Algorithmische Bioinformatik, WS2015/ Unit of comparison: Gene/Genome
Synteny Refers to regions of two genomes that show considerable similarity in terms of – sequence and – conservation of the order of genes Likely to be related by common descent Tim Conrad, VL Algorithmische Bioinformatik, WS2015/ syn- = together, tenia = ribbon, band synteny : the relative ordering of genes on the same chromosomes
Major mechanisms of reshuffling of synteny are: inversions and transpositions Noise on synteny is caused by: insertions, duplications, and deletions Blocks of synteny: long stretches of DNA where the relative ordering of orthologous genes is conserved Synteny allows for annotation of non-coding sequences and identification of homologous intergenetic regions Synteny Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Visualizing Synteny Tim Conrad, VL Algorithmische Bioinformatik, WS2015/ Cat on Human
ACTTTTTGGGAC inversiontransposition GGG duplication GGG ACTTTTTGGTATATACATGTAGTACAAATAATCGAACCCCCGGACGGG TATATACATGTAGTACAAATAATCGAACCCCCG deletion TATATA Problem: Evolution of Genes and Chromosomes Tim Conrad, VL Algorithmische Bioinformatik, WS2015/ Idea: Align the entire genomes of two closely related organisms (billions of bps) AAATAATCGAACCCCCGACTTTTTGGTATATACATGTAGTACGACGGGCATGTAGTACAAATAATAACCCCCGACTTTTTGGTATATACGGACGGG Splitting into new chromosomes Reshuffling of genes over the chromosomes
More Problems: Sequencing & Gene Detection Tim Conrad, VL Algorithmische Bioinformatik, WS2015/ Missed Intron Missed ORF beginning or ending Missed ORF?
59 Two genomes can be formed by many smaller syntenic blocks rearranged by inversions or transpositions Can we define a metric for the syntenic distance of these two genomes? We are not interested in nucleotide differences but in the number of genomic rearrangements that separate the two species. METRIC = smallest number of operations (=inversion or transposition) that transform one genome into the other A metric for the syntenic distance Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Comparative Genomics Tools BLAST (compare one string to a genome) MUMmer (pairwise alignment) – “MUMmer 3 is the latest version, and is downloaded roughly 300 times every month. That's over 7,000 users in the 2 years after its release. In addition, the three versions of MUMmer have a combined citation count of over 700 papers.” (MUMmer homepage) PipMaker (reference and multiple samples) AVID/VISTA Mauve (multiple alignment) Comparisons and analyses at both – Nucleic acid and protein level Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
61 MUMmer (Maximal Unique Match) MUMmer, a system to align and compare very large DNA & Protein sequences in linear time Fast pair-wise comparison of draft or complete genomes using nucleotide or 6- frame translated sequences History – MUMmer 1.0, 1999 – MUMmer 2.1, 2002 – MUMmer 3.0, 2004 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Overview: MUMs String A: …CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAGSC TGAAGTGCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAG TCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGTC GAGGCTTTAGAGCTAGCTAGCGCGATCGAGGCTAGCTATGCTAGCTATCA TCGCAAGCTAGCTGAGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCTA GTCGAGCTGATCGTAGCTAGTAATGTATCCATCTACTCTAGTAGATCGATT AGTCGATCGATGCTAGATCGGATCGAGTCGAGATCGATGGAGTCGAGAT CGATCTAATCTATCTCTAAATGGAGCGA… String B: …GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCATC TATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGGA GTCCAAACTCGCAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGTC GCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCT AGTAATGTATCATAGCTAATCGCACTACTACGATGCGATCTCTAGTCGATC TATCTCGGCTTCGATCGTA… How align without Needleman/Wunsch ?
Overview: MUMs String A: …CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAGSC TGAAGTGCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAG TCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGTC GAGGCTTTAGAGCTAGCTAGCGCGATCGAGGCTAGCTATGCTAGCTATCA TCGCAAGCTAGCTGAGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCTA GTCGAGCTGATCGTAGCTAGTAATGTATCCATCTACTCTAGTAGATCGATT AGTCGATCGATGCTAGATCGGATCGAGTCGAGATCGATGGAGTCGAGAT CGATCTAATCTATCTCTAAATGGAGCGA… String B: …GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCAT CTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGG AGTCCAAACTCGCAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGT CGCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAG CTAGTAATGTATCATAGCTAATCGCACTACTACGATGCGATCTCTAGTCGA TCTATCTCGGCTTCGATCGTA… Easier with large exact matches highlighted?
Overview: MUMs Any optimal global alignment will probably use these two subsequences as “anchors” This is the shortcut needed to calculate global alignment quickly on large sequences Very intuitive in alignment process, but only a heuristic
Features of MUMmer The algorithm assumes that sequences are closely related Outputs: – Base to base alignment – Highlights the exact matches and differences in the genomes – Locates SNPs Large inserts Significant repeats Tandem repeats and reversals Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
66 MUMmer 1.0 Inputs: Two DNA sequences and length of the shortest MUM (Maximal Unique Match) Output: A base-to-base alignment of input sequences, highlights SNPs, large inserts, significant repeats, transpositions and reversals Techniques: Suffix trees, LIS (Longest Increasing Subsequences), Smith-Waterman alignment algorithm Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
MUMmer: Algorithm Read two genomes Perform Maximum Unique Match (MUM) of genomes using suffix tree Sort and order the MUMs using LIS Close the gaps in the Alignment Using SNPs, mutation regions, repeats, tandem repeats Output alignment MUMs regions that do not match exactly Tim Conrad, VL Algorithmische Bioinformatik, WS2015/
Tim Conrad AG Medical Bioinformatics Mehr Informationen im Internet unter medicalbioinformatics.de/teaching Vielen Dank! Weitere Fragen