Genome sequence assembly

Presentation transcript:

Genome sequence assembly
Assembly concepts and methods (some slides courtesy of Mihai Pop)

Building a library
- Break DNA into random fragments (8-10x coverage)
- Actual situation

Building a library
- Break DNA into random fragments (8-10x coverage)
- Sequence the ends of the fragments
- Amplify the fragments in a vector
- Sequence 800-1000 (500-700) bases at each end of the fragment

Assembling the fragments
Note that contig orientation/order is not determined

Forward-reverse constraints
- The sequenced ends are facing towards each other
- The distance between the two sequenced ends is known (within certain experimental error)
[Figure: clone insert with forward (F) and reverse (R) end reads placed on contigs I and II]

Building Scaffolds
- Break DNA into random fragments (8-10x coverage)
- Sequence the ends of the fragments
- Assemble the sequenced ends
- Build scaffolds
We need to determine the relative order/orientation of contigs
Using forward-reverse constraints helps

Assembly gaps
- sequencing gap: we know the order and orientation of the contigs and have at least one clone spanning the gap
- physical gap: no information is known about the adjacent contigs, nor about the DNA spanning the gap
- Sequencing gap is "easy"
- Physical gap resolution takes more than 1/2 of closure effort
- Multiplex PCR

Unifying view of assembly Scaffolding

Shotgun sequencing statistics

Typical contig coverage
Imagine raindrops on a sidewalk

Lander-Waterman statistics
L = read length
T = minimum detectable overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 - T/L
$E(\#\text{islands}) = N e^{-c\sigma}$
$E(\text{island size}) = L\left(\frac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$
contig = island with 2 or more reads
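The two expectations above can be evaluated directly. The following is a minimal sketch, not part of the original slides; the function name `lander_waterman` is hypothetical, and it computes N from the target coverage as N = cG/L.

```python
import math

def lander_waterman(G, L, T, c):
    """Expected shotgun statistics under the Lander-Waterman model.

    G: genome size, L: read length, T: minimum detectable overlap,
    c: coverage. Returns (N, E(#islands), E(island size)).
    """
    N = c * G / L                          # reads needed to reach coverage c
    sigma = 1 - T / L
    islands = N * math.exp(-c * sigma)     # E(#islands) = N e^{-c sigma}
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return N, islands, island_size

# Parameters from the example slide: 1 Mbp genome, 600 bp reads, 40 bp overlap.
for c in (1, 3, 5, 8):
    N, islands, size = lander_waterman(G=1_000_000, L=600, T=40, c=c)
    print(f"c={c}: N={N:.0f}  islands={islands:.1f}  island size={size:.0f} bp")
```

With these parameters the island counts come out close to the #islands values listed on the example slide (about 655 at 1x and about 304 at 3x).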

Example
Genome size: 1 Mbp
Read length: 600
Detectable overlap: 40
Columns: c, N, #islands, #contigs, bases not in any read, bases not in contigs
1  1,667  655  614  698  367,806
3  5,000  304  250  121  49,787
5  8,334  78  57  20  6,735
8  13,334  7  335

Experimental data
Columns: X coverage, # ctgs, % > 2X, avg ctg size (L-W), max ctg size, # ORFs
1  284  54  1,234 (1,138)  3,337  526
3  597  67  1,794 (4,429)  9,589  1,092
5  548  79  2,495 (21,791)  17,977  1,398
8  495  85  3,294 (302,545)  64,307  1,762
complete  100  1.26 M  1,329
Caveat: numbers based on artificially chopping up the genome of Wolbachia pipientis dMel

Read coverage vs. clone coverage
[Figure: 4 kbp insert with 1 kbp end reads]
- Read coverage = 8X
- Clone (insert) coverage = 16X
- 2X coverage in BAC ends implies 100X coverage by BACs (1 BAC clone = approx. 100 kbp)
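The relation between the two coverages is simple arithmetic: each clone contributes two end reads but spans the whole insert. A small sketch follows; the function name is hypothetical, and the 1 kbp read length used for the BAC case is an assumption inferred from the slide's numbers rather than something it states.

```python
def clone_coverage(read_coverage, insert_len, read_len, reads_per_clone=2):
    """Clone (insert) coverage implied by a given read coverage.

    Each clone yields reads_per_clone * read_len sequenced bases while
    physically spanning insert_len bases of the genome.
    """
    return read_coverage * insert_len / (reads_per_clone * read_len)

# 4 kbp inserts with 1 kbp reads at each end, 8X read coverage
print(clone_coverage(8, 4_000, 1_000))

# BAC ends: 100 kbp clones, 2X read coverage, read length assumed to be 1 kbp
print(clone_coverage(2, 100_000, 1_000))
```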

Assembly paradigms
- Overlap-layout-consensus
  - greedy (TIGR Assembler, phrap, CAP3...)
  - graph-based (Celera Assembler, Arachne)
- Eulerian path (especially useful for short read sequencing)

TIGR Assembler/phrap: greedy
- Build a rough map of fragment overlaps
- Pick the highest-scoring overlap
- Merge the two fragments
- Repeat until no more merges can be done
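The greedy loop can be sketched in a few lines. This is a toy illustration only, not how TIGR Assembler or phrap are implemented: it scores an overlap by its length, requires exact suffix/prefix matches, and ignores reverse complements and base qualities. The names `overlap_len` and `greedy_assemble` are hypothetical.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))      # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs                    # no more merges can be done
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]

print(greedy_assemble(["ACGTAC", "TACGGA", "GGATTC"]))
```

Because every merge is chosen locally, a sketch like this will happily join reads across a repeat, which is the weakness the later slides on repeats address.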

Overlap-layout-consensus
- Main entity: read
- Relationship between reads: overlap
[Figure: overlap graph and layout of reads 1-9, with a consensus derived from the aligned sequences ACCTGA, AGCTGA, ACCAGA]

Paths through graphs and assembly
Hamiltonian circuit: visit each node (city) exactly once, returning to the start
[Figure label: Genome]

Implementation details

Overlap between two sequences
…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
       CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…
- overlap (19 bases): region of similarity between the two sequences
- overhang (6 bases): unaligned ends of the sequences
- % identity = 18/19 = 94.7%
The assembler screens merges based on:
- length of overlap
- % identity in the overlap region
- maximum overhang size
When a pair of sequences is considered, the two sequences are merged only if they meet these criteria.
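The three screening criteria can be checked mechanically once the overlap coordinates are known. The sketch below uses the two sequences from the slide; the function names, the coordinate convention, and the thresholds passed to `accept_merge` are all hypothetical and chosen only for illustration. Overhang lengths are counted from the strings as transcribed here.

```python
def overlap_stats(a, a_start, b, b_start, length):
    """Identity and overhangs for an ungapped overlap between a and b.

    a[a_start:a_start+length] is aligned to b[b_start:b_start+length];
    a is assumed to lie to the left of b.
    """
    seg_a = a[a_start:a_start + length]
    seg_b = b[b_start:b_start + length]
    matches = sum(x == y for x, y in zip(seg_a, seg_b))
    return {
        "length": length,
        "identity": matches / length,
        "overhang_a": len(a) - (a_start + length),  # unaligned tail of a
        "overhang_b": b_start,                      # unaligned head of b
    }

def accept_merge(stats, min_len, min_identity, max_overhang):
    """Apply the three screening criteria to one candidate overlap."""
    return (stats["length"] >= min_len
            and stats["identity"] >= min_identity
            and max(stats["overhang_a"], stats["overhang_b"]) <= max_overhang)

a = "AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC"
b = "CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT"
stats = overlap_stats(a, 14, b, 8, 19)
print(stats)
print(accept_merge(stats, min_len=15, min_identity=0.9, max_overhang=10))
```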

All pairs alignment
- Needed by the assembler
- Try all pairs: must consider ~n^2 pairs
- Smarter solution: only n x coverage (e.g. 8) pairs are possible
  - Build a table of k-mers contained in sequences (single pass through the genome)
  - Generate the pairs from the k-mer table (single pass through the k-mer table)
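A k-mer table of this kind can be sketched with a dictionary from k-mer to the set of reads containing it. The function name `candidate_pairs` is hypothetical, and the sketch considers the forward strand only.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return index pairs (i, j), i < j, of reads sharing at least one k-mer."""
    table = defaultdict(set)                 # k-mer -> indices of reads containing it
    for idx, read in enumerate(reads):       # single pass through the sequences
        for pos in range(len(read) - k + 1):
            table[read[pos:pos + k]].add(idx)
    pairs = set()
    for ids in table.values():               # single pass through the k-mer table
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["ACGTAC", "TACGGA", "GGATTC", "TTTTTT"]
print(sorted(candidate_pairs(reads, k=3)))
```

Only the pairs returned here need a full alignment. A k-mer that occurs in very many reads generates a quadratic number of pairs on its own, which is one practical reason to skip highly frequent k-mers.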

REPEATS

[Figure: reads 1-13 laid out across two repeat copies, RptA and RptB]

Non-repetitive overlap graph
[Figure: overlap graph with nodes 1, 2, 3, 4, (5,9), 7, 8, (6,10), 11, 12, 13]

Handling repeats
Repeat detection
- pre-assembly: find fragments that belong to repeats
  - statistically (most existing assemblers)
  - repeat database (RepeatMasker)
- during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
- post-assembly: find repetitive regions and potential mis-assemblies
  - Reputer, RepeatMasker
  - "unhappy" mate-pairs (too close, too far, mis-oriented)
Repeat resolution
- find DNA fragments belonging to the repeat
- determine correct tiling across the repeat
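The three kinds of "unhappy" mate-pairs can be told apart from the placement of the two end reads on a contig. The following is a sketch under simplifying assumptions: both reads sit on the same contig, coordinates are the outer ends of the pair, and a pair is correctly oriented when the left read is on the '+' strand and the right read on the '-' strand. The function name and the tolerance parameter are hypothetical.

```python
def classify_mate_pair(left_start, left_strand, right_end, right_strand,
                       mean_insert, tolerance):
    """Classify one mate pair as happy, mis-oriented, too close, or too far.

    left_start / right_end: outer coordinates of the leftmost and rightmost read.
    left_strand / right_strand: '+' or '-'.
    """
    if (left_strand, right_strand) != ("+", "-"):
        return "mis-oriented"                 # the ends do not face each other
    span = right_end - left_start
    if span < mean_insert - tolerance:
        return "too close"
    if span > mean_insert + tolerance:
        return "too far"
    return "happy"

print(classify_mate_pair(1_000, "+", 5_100, "-", mean_insert=4_000, tolerance=500))
print(classify_mate_pair(1_000, "+", 2_000, "-", mean_insert=4_000, tolerance=500))
print(classify_mate_pair(1_000, "+", 9_000, "-", mean_insert=4_000, tolerance=500))
print(classify_mate_pair(1_000, "-", 5_000, "-", mean_insert=4_000, tolerance=500))
```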

Statistical repeat detection
Significant deviations from average coverage are flagged as repeats.
- frequent k-mers are ignored
- "arrival" rate of reads in contigs is compared with the theoretical value (e.g., 800 bp reads & 8x coverage: reads "arrive" every 100 bp)
Problem 1: assumption of uniform distribution of fragments leads to false positives
- non-random libraries
- poor clonability regions
Problem 2: repeats with low copy number are missed, which leads to false negatives
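The arrival-rate comparison amounts to counting reads in a contig and comparing with what uniform sampling predicts. The sketch below is an illustration only: the function names and the `max_ratio` cutoff are hypothetical, and real assemblers use a proper statistical test rather than a fixed ratio.

```python
def expected_reads(contig_len, read_len, coverage):
    """Expected number of reads starting in a contig under uniform sampling."""
    spacing = read_len / coverage      # e.g. 800 bp / 8x = one read every 100 bp
    return contig_len / spacing

def looks_repetitive(n_reads, contig_len, read_len, coverage, max_ratio=2.0):
    """Flag a contig holding far more reads than uniform sampling predicts."""
    return n_reads > max_ratio * expected_reads(contig_len, read_len, coverage)

print(expected_reads(10_000, 800, 8))            # reads expected in a 10 kbp contig
print(looks_repetitive(250, 10_000, 800, 8))     # 2.5x the expected read count
print(looks_repetitive(120, 10_000, 800, 8))     # within normal variation
```

A two-copy repeat collapsed into one contig would show roughly twice the expected read count, which is why low-copy repeats sit close to any fixed cutoff and are easily missed.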

Mis-assembled repeats
[Figure panels: excision, collapsed tandem, rearrangement]

An assembly puzzle: contradictory data (discovered after publication)
- ribosomal RNA repeats, Ames Porton strain
[Figure labels: “chimeras”, “mates”]
How do you align the green pieces?

Puzzle solution
[Figure labels: Reference: Ames ‘ancestor’ strain; Ames Porton Down strain; Tandem duplication]

Anthrax attack strain history
[Figure: strain lineage diagram. Labels, in source order: 1981 “Ames” isolate; Ft. Detrick (Ames ancestor); 1982 …; Lab B; Lab C; Lab D; Porton Down; ? ? ?; Plasmids cured ?; Attack Strain; Porton Strain; UC Berkeley; Victim; 2001; 1998; 2001; Florida isolate; Porton 1; Porton 2]

Probability of base-calling error
B. anthracis SNP CS-1

GBX(g)
Cut of assembly 67452 from GBX0130.contig (11 bases)
Cut at ungapped consensus offset 15986 (from 1), +/- 5 positions:
Cut at gapped consensus offset 16009 (from 1) 11 positions
Ungapped consensus  TGAATGCACAC
Gapped consensus    TGAATGCACAC
                    T G A A T G C A C A C
Covering reads:
GBZEI27TF  TGAATGCACAC  26 30 34 36 33 36 36 37 36 36 36
GBXEZ08TR  TGAATGCACAC  27 30 33 35 41 37 36 23 36 36 36
GBZDA09TF  TGAATGCACAC  26 18 35 31 26 20 29 19 36 36 36
Summary info:
P-value (10^q)  -7.9 -7.8 -10.2 -10.2 -10.0 -9.3 -10.1 -7.9 -10.8 -10.8 -10.8
Quality Class   5 3 3 3 3 3 3 3 3 3 3
Coverage depth  3 3 3 3 3 3 3 3 3 3 3
Homogeneity     1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

GBA(a)
Cut of assembly 6264 from GBA0117.contig (11 bases)
Cut at ungapped consensus offset 725687 (from 1), +/- 5 positions:
Cut at gapped consensus offset 730468 (from 1) 11 positions
Ungapped consensus  TGAATACACAC
Gapped consensus    TGAATACACAC
                    T G A A T A C A C A C
GBIFW80TF  TGAATA-ACAC  34 33 15 16 13 09 00 11 11 11 15
GBICA33TR  TGAATACACAC  34 35 34 35 35 34 34 36 36 36 36
GBIFQ32TR  TGAATACACAC  10 13 12 18 24 13 21 12 13 14 13
GBICH40TF  TGAATACACAC  36 36 36 32 32 32 32 32 31 31 36
GBICU19TR  TGAATACACAC  21 30 33 18 19 24 21 11 36 10 29
P-value (10^q)  -42.8 -45.4 -45.5 -45.3 -46.7 -44.4 -46.7 -43.9 -46.8 -45.1 -42.7
Quality Class   1 1 1 1 1 1 1 1 1 1 1
Coverage depth  16 16 16 16 16 16 15 16 16 16 16
Not shown
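In the GBX block, the P-value row appears to equal minus one tenth of the column sum of the read qualities (for example, 26 + 27 + 26 = 79 gives -7.9). That is what one would get by treating the numbers as Phred-scaled qualities, where a quality Q corresponds to an error probability of 10^(-Q/10), and multiplying the error probabilities of independent reads. The sketch below rests on that assumption, which the slide does not state; the function name is hypothetical.

```python
def combined_log10_error(qualities):
    """log10 of the probability that every covering read miscalled the base.

    Assumes Phred-scaled qualities (error probability 10 ** (-Q / 10))
    and independent errors across reads.
    """
    return -sum(qualities) / 10

# Quality columns for the three reads covering the GBX consensus
gbx_columns = [
    (26, 27, 26),   # first position
    (30, 30, 18),   # second position
    (36, 36, 36),   # last position
]
for column in gbx_columns:
    print(column, combined_log10_error(column))
```

The same check cannot be reproduced for the GBA block from the slide alone, since it reports a coverage depth of 15 to 16 but lists only five reads.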