Celera Assembler Arthur L. Delcher Senior Research Scientist CBCB University of Maryland.

Presentation transcript:

Celera Assembler
Arthur L. Delcher, Senior Research Scientist, CBCB, University of Maryland

Slides by Art Delcher, Mike Schatz, and Adam Phillippy, Center for Bioinformatics and Computational Biology, Univ. of Maryland

Mate-Pair Shotgun DNA Sequencing
1. DNA target sample
2. SHEAR & SIZE (16 of these), e.g., 10Kbp ± 8% std.dev.
3. CLONE (16 of these) & END SEQUENCE (automated)
4. End Reads / Mate Pairs: 550bp reads at the two ends of a 10,000bp insert

Shotgun DNA Sequencing (Technology)
1. DNA target sample
2. SHEAR
3. SIZE SELECT, e.g., 10Kbp ± 8% std.dev.
4. LIGATE & CLONE (figure label: Vector)
5. SEQUENCE (figure label: Primer): End Reads (Mates), 550bp

Whole Genome Shotgun Sequencing
• Collect 10x sequence in a 1-to-1 ratio of two types of read pairs, Short (2Kbp) and Long (10Kbp): ~35 million reads for Human.
• Collect another 20X in clone coverage of 50Kbp end sequence pairs (figure labels: BAC 5’, BAC 3’): ~1.2 million pairs for Human.
• Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.
+ single highly automated process
+ only three library constructions
– assembly is much more difficult
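As a rough check on the clone-coverage figure on this slide, clone coverage can be computed as (number of pairs × insert length) / genome size. The sketch below assumes a 3 Gbp human genome, which the slide does not state; `clone_coverage` is a hypothetical helper name used only for illustration.

```python
def clone_coverage(num_pairs: float, insert_bp: float, genome_bp: float) -> float:
    """Clone coverage = total bases spanned by cloned inserts / genome size."""
    return num_pairs * insert_bp / genome_bp


# ~1.2 million pairs of 50Kbp inserts; the 3 Gbp genome size is an assumption.
print(clone_coverage(1.2e6, 50_000, 3e9))  # 20.0
```

Under that assumed genome size, the result agrees with the "20X in clone coverage" quoted on the slide.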

Clone-by-Clone Genome Sequencing
Target → Physical Mapping → Minimum Tiling Set (~33,000 BACs for human) → Shotgun Assembly
– 2 separate processes
– clone libraries unstable, maps hard to complete
– sequencing libraries must be made for every clone
+ assembly problem ‘easy’ and well understood

Celera’s Sequencing Factory

Celera’s Sequencing Factory (circa 2001)
• 300 ABI 3700 DNA Sequencers
• 50 Production Staff
• 20,000 sq. ft. of wet lab
• 20,000 sq. ft. of sequencing space
• 800 tons of A/C (160,000 cfm)
• $1 million / year for electrical service
• $10 million / month for reagents

Human Data (April 2000)
• Collected Million reads = 5.11X coverage
• Million are paired (77%) = Million pairs

  Insert size   Pairs     % true   Std.dev.
  2Kbp          5.045 M   98.6%    <6%
  10Kbp         4.401 M   98.6%    <8%
  50Kbp         1.071 M   90.0%    <15%

• Validated against finished Chrom. 21 sequence
• The clones cover the genome 38.7X
• Data is from 5 individuals (one at roughly 3X, 4 others at 0.5X)

Pairs Give Order & Orientation
• Assembly without pairs results in contigs whose order and orientation are not known.
• Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized.
Figure labels: Reads; Consensus (15-30Kbp); Contig; 2-pair; Scaffold; gap Mean & Std.Dev. is known.
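One way to see why gap sizes are well characterized: each mate pair that spans two contigs gives a gap estimate equal to the library's insert mean minus the bases of the insert that lie inside each contig, and corroborating pairs can be combined. The sketch below assumes the pairs are independent and combines them by inverse-variance weighting. The slides do not specify either choice, and `estimate_gap` and its tuple layout are hypothetical.

```python
import math
from typing import List, Tuple

# Each link: (insert_mean, insert_std, bases_in_left_contig, bases_in_right_contig)
Link = Tuple[float, float, float, float]


def estimate_gap(links: List[Link]) -> Tuple[float, float]:
    """Combine per-pair gap estimates into one (mean, std) by inverse-variance weighting."""
    weight_sum = 0.0
    weighted_gap = 0.0
    for insert_mean, insert_std, in_left, in_right in links:
        gap = insert_mean - in_left - in_right  # what is left of the insert spans the gap
        weight = 1.0 / (insert_std * insert_std)
        weight_sum += weight
        weighted_gap += weight * gap
    return weighted_gap / weight_sum, math.sqrt(1.0 / weight_sum)


# Two corroborating 10Kbp pairs (8% std.dev. = 800bp) spanning the same gap.
mean, std = estimate_gap([(10_000, 800, 4_000, 5_000), (10_000, 800, 3_500, 5_300)])
print(round(mean), round(std))
```

With two agreeing pairs the combined standard deviation drops below that of a single pair, which is the sense in which groups of corroborating pairs tighten the gap estimate.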

Anatomy of a WGS Assembly
Figure labels: Chromosome; STS; STS-mapped Scaffolds; Contig; Gap (mean & std. dev. known); Read pair (mates); Consensus; Reads (of several haplotypes); SNPs; External “Reads”

WGS Sequencing WGS Assembly Performance

Assembler Design Philosophy
• Detect repeats so as not to be misled by them; leave them for last.
• Make 1st order use of mate-pairs: first to circumnavigate and later to fill in repeats.
• Make all the sure moves first: tiered phases that get progressively more aggressive.
• Output a complete audit trail of the evidence for assembly.

Assembly Pipeline (circa 2006)
Stages: Trim & Screen → Overlapper → Unitigger → Scaffolder → Repeat Rez I, II
Trim & Screen:
• Reads (typically 800bp) are quality-trimmed so that average error rate is 0.5% with 1-in-1000 having more than 2% error. Average trim length is bp, depending on the genome (590bp for human in year 2000).
• Contaminant and vector sequence is removed.
• Repeat screening makes run time and overlap graph size reasonable, e.g., overlaps per Alu read must be avoided.
• Now we dynamically limit repetitive overlaps in the overlap phase.
• A gatekeeper program vets inputs and assigns IDs.
• Reads are stored in a compressed, random-access binary store.
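To make the quality-trimming target concrete, here is a minimal sketch that keeps the longest window of a read whose mean per-base error probability is at most 0.5%. It assumes Phred-style quality values (error = 10^(-Q/10)) and a longest-window rule; neither is stated on the slide, and this is not the Celera Assembler's actual trimming code. `trim_window` is a hypothetical name.

```python
from typing import List, Tuple


def trim_window(quals: List[int], max_mean_error: float = 0.005) -> Tuple[int, int]:
    """Return [start, end) of the longest window with mean error <= max_mean_error."""
    # Prefix sums of per-base error probabilities give O(1) window sums.
    prefix = [0.0]
    for q in quals:
        prefix.append(prefix[-1] + 10 ** (-q / 10))

    best = (0, 0)
    n = len(quals)
    for start in range(n):
        for end in range(n, start, -1):  # try the longest window first
            if end - start <= best[1] - best[0]:
                break  # cannot beat the current best from this start
            if prefix[end] - prefix[start] <= max_mean_error * (end - start):
                best = (start, end)
                break
    return best


# Low-quality ends (Q5) around a high-quality core (Q30).
print(trim_window([5] * 3 + [30] * 10 + [5] * 3))  # (3, 13)
```

The quadratic scan is acceptable for reads of a few hundred bases; a production trimmer would likely use a faster rule.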

Assembly Pipeline
Stages: Trim & Screen → Overlapper → Unitigger → Scaffolder → Repeat Rez I, II
Overlapper:
• Find all overlaps ≥ 40bp allowing 6% mismatch.
• An overlap between reads A and B implies either a TRUE overlap or a REPEAT-INDUCED one.
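The overlap criterion can be illustrated with a brute-force sketch that looks for the longest suffix of read A matching a prefix of read B, at least 40bp long, with at most 6% mismatches. It is a simplification under stated assumptions: only substitutions are counted (no insertions or deletions), both reads are on the same strand, and every pair is compared directly, which a real overlapper would avoid. `best_overlap` is a hypothetical name.

```python
from typing import Optional


def best_overlap(a: str, b: str, min_len: int = 40, max_err: float = 0.06) -> Optional[int]:
    """Length of the longest suffix of a matching a prefix of b within max_err mismatches."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches <= max_err * length:
            return length
    return None


core = "AT" * 25                       # 50bp shared region
read_a = "G" * 30 + core
read_b = core + "C" * 30
print(best_overlap(read_a, read_b))    # 50
```

Note that the function cannot tell a true overlap from a repeat-induced one; that distinction is what the later pipeline stages address.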

Assembly Pipeline
Stages: Trim & Screen → Overlapper → Unitigger → Scaffolder → Repeat Rez I, II
Unitigger:
• Compute all “overlap consistent” sub-assemblies: Unitigs (Uniquely Assembled Contig)

Overlap Graph
Edge Types: Regular Dovetail, Prefix Dovetail, Suffix Dovetail (figure: reads A and B drawn in each configuration)
Edges are annotated with deltas of overlaps.

The Unitig Reduction
1. Remove “Transitively Inferrable” Overlaps (figure: reads A, B, C before and after the reduction)
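A minimal sketch of this reduction step, assuming the overlap graph is a directed acyclic graph stored as a dict from each read to the set of reads it overlaps on its right: an edge A→C is dropped when some B exists with A→B and B→C. Overlap lengths and orientations, which a real implementation must check, are ignored here, and `remove_transitive_edges` is a hypothetical name.

```python
from typing import Dict, Set


def remove_transitive_edges(graph: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Drop every edge a->c that is implied by a pair of edges a->b and b->c."""
    reduced: Dict[str, Set[str]] = {}
    for a, succs in graph.items():
        implied: Set[str] = set()
        for b in succs:
            # Successors of b that a also reaches directly are inferable via b.
            implied |= graph.get(b, set()) & succs
        reduced[a] = succs - implied
    return reduced


overlaps = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(remove_transitive_edges(overlaps))
```

In the example, the A→C overlap is removed because it follows from A→B and B→C, leaving the simple chain A→B→C.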

The Unitig Reduction
2. Collapse “Unique Connector” Overlaps (figure: reads A and B collapse into AB)
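The collapsing step can be sketched as chain building: an edge A→B is treated as a unique connector when B is A's only successor and A is B's only predecessor, and maximal runs of such edges become one chain. This assumes an acyclic, already reduced graph in the same dict-of-sets form; that representation and the name `collapse_unique_connectors` are illustrative assumptions, not the original implementation.

```python
from typing import Dict, List, Optional, Set


def collapse_unique_connectors(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Merge reads joined by one-successor / one-predecessor edges into chains."""
    preds: Dict[str, Set[str]] = {v: set() for v in graph}
    for a, succs in graph.items():
        for b in succs:
            preds.setdefault(b, set()).add(a)

    def unique_next(a: str) -> Optional[str]:
        succs = graph.get(a, set())
        if len(succs) == 1:
            (b,) = succs
            if len(preds[b]) == 1:
                return b
        return None

    def unique_prev(b: str) -> Optional[str]:
        if len(preds[b]) == 1:
            (a,) = preds[b]
            if len(graph.get(a, set())) == 1:
                return a
        return None

    chains: List[List[str]] = []
    for v in sorted(preds):
        if unique_prev(v) is not None:
            continue  # v is reached through a unique connector, so it is not a chain start
        chain = [v]
        nxt = unique_next(v)
        while nxt is not None:
            chain.append(nxt)
            nxt = unique_next(nxt)
        chains.append(chain)
    return chains


reduced = {"A": {"B"}, "B": {"C"}, "C": {"D", "E"}, "D": set(), "E": set()}
print(collapse_unique_connectors(reduced))  # [['A', 'B', 'C'], ['D'], ['E']]
```

The chain stops at C because C has two successors, which is where the graph branches.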

Unitigs: Definition
Uniquely Assemble-able Contig: a chordal subgraph with no conflicting edges.
(figure label: Conflicting edge)

Unitig Theorem (Myers, JCB ‘95)
(1) Remove contained fragments
(2) Remove transitively inferred edges
(3) Collapse into unitigs
(*) Restore transitively inferred (t.i.) edges between unitig ends.
THM: Shortest Common Superstring of unitigs = Shortest Common Superstring of reads
Caveat: SCS is not the right objective for assembly.

Revised Unitigger Algorithm
• The preceding algorithm is computationally expensive.
• The current unitigger finds the “best” overlap on each end of each read: its “best buddy”.
• Unitigs are chains of mutually unique best buddies: adjacent reads are best buddies of each other and of no other read.
• This takes time and space linear in the number of reads.
• In rare cases the results differ from graph reduction.
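The best-buddy rule can be sketched as follows, assuming the best overlap off each end of every read has already been computed and that all reads are on the same strand (orientation is ignored for brevity). The two input dicts and the name `best_buddy_unitigs` are hypothetical. Every read is assumed to appear as a key in both dicts, and circular chains are not handled.

```python
from collections import Counter
from typing import Dict, List, Optional


def best_buddy_unitigs(best_right: Dict[str, Optional[str]],
                       best_left: Dict[str, Optional[str]]) -> List[List[str]]:
    """Chain reads whose facing ends are mutually unique best buddies."""
    # How many reads name x as the best buddy of their right (resp. left) end.
    named_right = Counter(b for b in best_right.values() if b is not None)
    named_left = Counter(a for a in best_left.values() if a is not None)

    def next_read(a: str) -> Optional[str]:
        b = best_right.get(a)
        if (b is not None and best_left.get(b) == a
                and named_right[b] == 1 and named_left[a] == 1):
            return b
        return None

    has_prev = set()
    for a in best_right:
        b = next_read(a)
        if b is not None:
            has_prev.add(b)

    unitigs: List[List[str]] = []
    for read in sorted(best_right):
        if read in has_prev:
            continue  # not the left-most read of its chain
        chain = [read]
        nxt = next_read(read)
        while nxt is not None:
            chain.append(nxt)
            nxt = next_read(nxt)
        unitigs.append(chain)
    return unitigs


# Reads C and E both pick D as their best right buddy, so the chain stops before D.
right = {"A": "B", "B": "C", "C": "D", "D": None, "E": "D"}
left = {"A": None, "B": "A", "C": "B", "D": "C", "E": None}
print(best_buddy_unitigs(right, left))  # [['A', 'B', 'C'], ['D'], ['E']]
```

Each read is visited a constant number of times, which matches the linear time and space noted on the slide.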

Branch Point Extension
• A repeat boundary is reflected on an underlying sequence read.
• Compare peers to detect branch points.
• Consider the graph without repeat-full edges and recompute unitigs.
• Makes sure you get a read-length into each repeat-induced gap (most Alu-sized elements are resolved).
Figure labels: Genome; reads A, B, C, D; Peers of A.

Bubble Smoothing

Assembly Pipeline
Stages: Trim & Screen → Overlapper → Unitigger → Scaffolder → Repeat Rez I, II
• Identify those unitigs that cover unique DNA = U-unitigs
Figure labels: Unique; Repetitive; Dist. For Unique; Dist. For Repetitive; Definitely Unique; Don’t Know; Definitely Repetitive.

Identifying Unique DNA Stretches
• The arrival rate statistic (A-stat) is the log-odds ratio of the probability that a unitig is unique DNA versus 2-copy DNA.
Figure labels: Arrival Intervals; Unique DNA unitig; Repetitive DNA unitig; Dist. For Unique; Dist. For Repetitive; Definitely Unique; Don’t Know; Definitely Repetitive.
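A worked sketch of such a log-odds statistic, assuming read starts arrive as a Poisson process (an assumption made here, not stated on the slide). If a unique unitig of a given length expects λ arrivals, a 2-copy unitig expects 2λ, and the natural-log odds for k observed arrivals reduce to λ − k·ln 2. The function name and parameters are hypothetical, and the exact formula used by the Celera Assembler may differ in detail.

```python
import math


def a_stat(interval_bp: float, arrivals: int, reads_per_bp: float) -> float:
    """Natural-log odds that a unitig is unique rather than 2-copy DNA.

    Poisson model (an assumption): with lam = interval_bp * reads_per_bp,
      P(k | unique) = exp(-lam)   * lam**k      / k!
      P(k | 2-copy) = exp(-2*lam) * (2*lam)**k  / k!
    so ln(P_unique / P_2copy) = lam - k * ln(2).
    """
    expected_if_unique = interval_bp * reads_per_bp
    return expected_if_unique - arrivals * math.log(2)


# Hypothetical numbers: 0.01 read starts per bp genome-wide, 5Kbp unitig.
print(a_stat(5_000, 50, 0.01) > 0)    # arrivals match the unique expectation: True
print(a_stat(5_000, 100, 0.01) < 0)   # twice as many arrivals, looks repetitive: True
```

Positive values favor unique DNA and negative values favor a repeat; values near zero correspond to the "Don’t Know" region in the figure.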

Assembly Pipeline
Stages: Trim & Screen → Overlapper → Unitigger → Scaffolder → Repeat Rez I, II
• Fill repeat gaps with doubly anchored positive unitigs (figure label: Unitig>0)

Assembly Pipeline
Stages: Trim & Screen → Overlapper → Unitigger → Scaffolder → Repeat Rez I, II
• Fill repeat gaps with assembled, singly anchored reads (figure label: Stones)

Surrogates
• Stones containing more than 1 read are added to contigs as consensus sequence only, without underlying reads. These are called “surrogates”.
• This allows repeat unitigs to be put in multiple positions in the assembly, but leaves regions without underlying read coverage.
• We later attempt to resolve surrogates by assigning reads from the original repeat unitig to the separate surrogate copies, based on mate pairs.