Computational Genomics: Genome assembly

Computational Genomics: Genome assembly
Andrey Kislyuk 23 January 2012 Please feel free to interrupt with questions, and let me know if i’m going too fast or something is unclear.

18 September 2018 · Computational Genomics
Key resources SEQanswers ( BioStar ( 18 September 2018 · Computational Genomics

Why do we need to assemble genomes?
DNA sequencing methods can’t sequence more than about 1000 nucleotides at a time Sanger method (1975) chain termination with labeled ddNTPs Maxam-Gilbert method (1976) cleaving agents Both require primers for DNA Polymerase, but we can’t make the primers until we know the sequence! Both limited to ~1000 nt on the gel/capillary Both require cloning and/or PCR amplification Large-scale sequencing: primer walking Slow, costly, and error-prone: not practical beyond ~10Kbp I wanted to backtrack a bit and look at the reasons genome sequencing is the way it is. We all know what shotgun sequencing means but personally I didn’t know the history and details of it until I took the class. Sequencing methods: Sanger/Maxam-Gilbert, primer walking Shotgun sequencing Diagram: basic idea behind assembly: giant jigsaw puzzle Sequencing strategies by genome size/complexity 18 September 2018 · Computational Genomics

Shotgun sequencing Whole genome shotgun sequencing (1995) Hydroshearing: prepare several libraries of random fragments approx. 2, 5, 10, 50… Kbp long Cloning: use bacterial plasmids to grow DNA – problems arise if DNA contains a gene harmful to the host bacteria Picking, amplification Sanger sequencing, capillary electrophoresis, read out fluorescent dyes with a laser – 4 different colors Result: lots of ~1000 nt Sanger reads Assemble them with pairwise sequence alignment Multiple coverage corrects errors Seems straightforward now, but many did not believe it could be done! Sequencing methods: Sanger/Maxam-Gilbert, primer walking Shotgun sequencing Diagram: basic idea behind assembly: giant jigsaw puzzle Sequencing strategies by genome size/complexity Perform shotgun assembly 18 September 2018 · Computational Genomics

How much shotgun sequencing?
So, can we really sequence the whole genome with this? (No, we can’t.) Lander and Waterman (1988): Assuming random distribution of reads and ignoring repeat resolution issues, Define: G = genome length L = length of a single read Then overall coverage is C = LN/G N = number of reads sequenced T = minimum overlap to align the reads together Coverage for any given base obeys the Poisson distribution: The number of gaps (bases with 0 coverage) is: Lander-Waterman Reads vs. singletons Predicting genome size from assembly 18 September 2018 · Computational Genomics

How much shotgun sequencing?
18 September 2018 · Computational Genomics

Pioneers of sequence assembly
J. Craig Venter’s group at TIGR (later JCVI) Created TIGR Assembler, Celera Assembler, and associated tools Jim Kent (UC Santa Cruz) Created GigAssembler Allowed the Human Genome Project to compete with Celera Philip Green’s group at the University of Washington Created Phred, Phrap and Consed tools Sequencing centers: JCVI, Sanger Institute, Whitehead/MIT, DOE JGI, Baylor HGSC, WUSTL 18 September 2018 · Computational Genomics

Why do we need to assemble genomes?
2nd Generation sequencing methods Cheaper and more processive (sequence more data), but shorter read length 454 Pyrosequencing: nt average read length Illumina: nt read length (up to 200 in development) ABI SOLiD: 50 nt read length Same idea: randomly hydrosheared library Random reads from across the genome form a big puzzle 18 September 2018 · Computational Genomics

Next Generation Sequencing Technologies
Sequencing by synthesis 2nd generation 454 Pyrosequencing Solexa/Illumina SOLiD 3rd generation Single-molecule sequencing Nanopore sequencing Method Read length Paired-end reads Coverage Sanger 1000 nt Yes, 2x500nt 5-10X 454 200 (gen 1) 500 (gen 2) Yes, 2x40nt 20-40X Solexa/ Illumina 50 nt 100X SOLiD 35 nt Yes, 2x25nt Table: seq tech comparison NGST enables sequencing outside major sequencing centers, ultimately to make sequencing a lab bench task like any other. That’s why we’re here. 18 September 2018 · Computational Genomics

“Break micro-reactors” Isolate DNA containing beads
454 Pyrosequencing A + PCR Reagents + Emulsion Oil B Mix DNA library & capture beads (limited dilution) Create “Water-in-oil” emulsion “Break micro-reactors” Isolate DNA containing beads Perform emulsion PCR 18 September 2018 · Computational Genomics

454 Pyrosequencing Load enzyme beads Load beads into PicoTiter™Plate PicoTiter™Plate Diameter = 44 μm Depth = 55 μm Well size = 75 pl Well density = 480 wells mm-2 1.6 million wells per slide 44 μm 18 September 2018 · Computational Genomics

Sequencing by synthesis
454 Pyrosequencing Sequencing by synthesis Photons generated are captured by CCD camera Reagent flow Margulies et al., 2005

Raw sequencer output Sanger: Trace (usually .ab1/.scf file format) 454: Flowgram (.sff file format) Flow Order TACG 1-mer 2-mer 3-mer 4-mer KEY (TCAG) Measures the presence or absence of each nucleotide at any given position What we get from the machine Split screen: L, sanger trace, R, flowgram; pull up demo (find some colorspace screenshot?) 18 September 2018 · Computational Genomics

Assembly algorithms Paradigms Overlap-Layout-Consensus De Bruijn graphs 18 September 2018 · Computational Genomics

Differences between an overlap graph and a de Bruijn graph
Differences between an overlap graph and a de Bruijn graph for assembly. Based on the set of 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruin graph (C), a node is created for every k-mer in all the reads; here the k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k − 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here we have only considered the forward orientation of each sequence to simplify the figure. Schatz M C et al. Genome Res. 2010;20: ©2010 by Cold Spring Harbor Laboratory Press

Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find potentially overlapping reads Layout: merge reads into contigs and contigs into supercontigs Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT.. Credit: Serafim Batzoglou

Overlap: A pairwise alignment problem
Find the best match between the suffix of one read and the prefix of another Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring Credit: Serafim Batzoglou

Sort all k-mers in reads (k ~ 24) Find pairs of reads sharing a k-mer
Overlapping Reads Sort all k-mers in reads (k ~ 24) Find pairs of reads sharing a k-mer Extend to full alignment – throw away if not >95% similar TACA TAGT || TAGATTACACAGATTAC T GA TAGA | || ||||||||||||||||| TAGATTACACAGATTAC Credit: Serafim Batzoglou

Overlapping Reads and Repeats
A k-mer that appears N times initiates N2 comparisons For an Alu that appears 106 times  1012 comparisons – too much Solution: Discard all k-mers that appear more than t  Coverage, (t ~ 10) Surprisingly, there are a lot (dozens) of transposable elements in N. meningitidis, so we run into the same type of problem Credit: Serafim Batzoglou

Finding Overlapping Reads
Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA Credit: Serafim Batzoglou, Masahiro Kasahara

Finding Overlapping Reads (cont’d)
Correct errors using multiple alignment C: 20 C: 20 C: 35 C: 35 T: 30 C: 0 C: 35 C: 35 TAGATTACACAGATTACTGA C: 40 C: 40 TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA A: 15 A: 15 A: 25 A: 25 - A: 0 A: 40 A: 40 A: 25 A: 25 Score alignments Accept alignments with good scores Credit: Serafim Batzoglou

Repeats are a major challenge
Layout Repeats are a major challenge Do two aligned fragments really overlap, or are they from two copies of a repeat? Credit: Serafim Batzoglou

The k-mer uniqueness ratio
The k-mer uniqueness ratio for five well-known organisms and one single-celled human parasite. The ratio is defined here as the percentage of the genome that is covered by unique sequences of length k or longer. The horizontal axis shows the length in base pairs of the sequences. For example, ∼92.5% of the grapevine genome is contained in unique sequences of 100 bp or longer. Schatz M C et al. Genome Res. 2010;20: ©2010 by Cold Spring Harbor Laboratory Press

Merge Reads into Contigs
repeat region Merge reads up to potential repeat boundaries Credit: Serafim Batzoglou

Merge Reads into Contigs (cont’d)
repeat region Ignore non-maximal reads Merge only maximal reads into contigs Credit: Serafim Batzoglou

repeat boundary??? sequencing error b a Ignore “hanging” reads when detecting repeat boundaries Credit: Serafim Batzoglou

????? Unambiguous Insert non-maximal reads whenever unambiguous Credit: Serafim Batzoglou

Link Contigs into Supercontigs (aka scaffolds)
Normal density Too dense: Overcollapsed? (Myers et al. 2000) Inconsistent links: Overcollapsed? Credit: Serafim Batzoglou

Link Contigs into Supercontigs (cont’d)
Find all links between unique contigs Connect contigs incrementally, if  2 links Credit: Serafim Batzoglou

Fill gaps in supercontigs with paths of overcollapsed contigs Credit: Serafim Batzoglou

d ( A, B ) Contig A Contig B Define G = ( V, E ) V := contigs E := ( A, B ) such that d( A, B ) < C Reason to do so: Efficiency; full shortest paths cannot be computed Credit: Serafim Batzoglou

Contig A Contig B Define T: contigs linked to either A or B Fill gap between A and B if there is a path in G passing only from contigs in T Credit: Serafim Batzoglou

Reading errors are corrected
Consensus A consensus sequence is derived from a profile of the assembled fragments A sufficient number of reads is required to ensure a statistically significant consensus Reading errors are corrected Credit: Serafim Batzoglou

Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive multiple alignment from pairwise read alignments Derive each consensus base by weighted voting Credit: Serafim Batzoglou

Mate pairs and paired-end reads
Mate pairs: Circularize and trim size-selected fragments during library preparation. Inserts can be approx. 1, 5, 10, 20 Kbp long. Paired-end reads: Sequence a short amplified fragment from both ends. Fragment length is more precise but limited to about 300 bp. 18 September 2018 · Computational Genomics

Mate pairs/Paired-end reads

Paired end reads (aka mate pairs)
18 September 2018 · Computational Genomics Credit: 454 Life Sciences

Base Calling and Trimming
Base Calling: the process of translating the raw sequencer output into The most likely nucleotide sequence Confidence scores for each position Trimming: the process of removing adapter, key, vector, and/or low quality sequence from a read .FASTQ or .SFF First step in processing: trimming and base calling 18 September 2018 · Computational Genomics

Reference-based assembly
Replaces overlap detection with alignment against a similar genome Also called mapping, mapped assembly Credit: M. Schatz 18 September 2018 · Computational Genomics

Reference-based assembly
Use a related genome to ease the layout task Much faster computationally Arranges reads with more confidence, so a better assembly is possible Allows other types of analysis: somatic mutations, organismal SNPs, structural variation, RNA-Seq, … 18 September 2018 · Computational Genomics

Assembly quality control
QC/QA Metrics: Size, number of contigs, N50 Diagnostic procedures 18 September 2018 · Computational Genomics

Genome size as predicted from the assembly

Read length, paired-end reads, coverage
Read length and paired-end reads matter. Long reads can span repeat regions Paired-end reads can reach into repeat regions and bridge gaps Combination of the two maximizes shotgun sequencing performance Coverage also matters. High coverage allows very high confidence in base calling Can do repeat resolution based on coverage fitting More likely that a read will span an ambiguous region 18 September 2018 · Computational Genomics

Scaffolding If paired end reads are available, scaffolding is already done. If not (our case)… Sequenced relatives may exist (our case) Use reference-based assembly to predict scaffolding No ready-made tools available for this Can be inaccurate Assemblers can get confused by repeats or overlaps that are too short May be able to join by hand Manual gap fill Automated gap fill (no tools exist yet) 18 September 2018 · Computational Genomics

Finishing Finishing is the process of completing the chromosome sequence. Close all gaps (usually by PCR, but large gaps in big genomes can be sent back to make BACs for resequencing) Re-sequence areas with less than 2x, 3x, 5x coverage (depending on quality standard) – same procedure as gaps Check and manually assemble unresolved repeat regions Check for mis-assembly by analyzing the overlap graph This is the most expensive and time-consuming part of sequencing. Lots of small projects omit finishing and work with draft genomes 18 September 2018 · Computational Genomics

Assemblers we used and our results
Strain Reads Bases Avg. coverage Avg. read length A_PH_M13220 197,000 47.6M 20× 240 B_OR_M10699 420,000 82M 35× 195 C_NYC_M15141 379,000 94.3M 40× 245 W135_BF_M9261 605,000 64M 27× 105 Strain Newbler Newbler + AMOS + manual gap fill Contigs Nt, max length % gapfilled A_PH_M13220 332 2.06M 68K 57 (41) 2.3M 398K 1.8% B_OR_M10699 169 2.10M 114K 40 2.18M 435K 1.1% C_NYC_M15141 176 2.05M 170K 50 (34) 2.28M 759K 2.0% W135_BF_M9261 205 2.0M 152K 27 (79) 2.21M 866K 1.6% 18 September 2018 · Computational Genomics

Homopolymer errors Specific to 454 pyrosequencing Sequencing errors usually result in frameshifts! Flow Order TACG 1-mer 2-mer 3-mer 4-mer KEY (TCAG) Measures the presence or absence of each nucleotide at any given position 18 September 2018 · Computational Genomics

Visualization tools: Hawkeye

Visualization tools: Mauve

More topics Currently relevant assemblers Newbler (demo) Velvet ALLPATHS-LG ABYSS SHRiMP Celera/WGS PHRAP Other visualization tools (Consed, MAQ, Prospector 2, ABySS-Explorer…) Microread assembly (Solexa and SOLiD) de Bruijn graph assembly paradigm 18 September 2018 · Computational Genomics

Computational Genomics: Genome assembly

Similar presentations

Presentation on theme: "Computational Genomics: Genome assembly"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Computational Genomics: Genome assembly

Similar presentations

Presentation on theme: "Computational Genomics: Genome assembly"— Presentation transcript:

Similar presentations

About project

Feedback