Assembly by alignment: instead of overlap-layout-consensus we use alignment-consensus

Presentation transcript:


Assembly by alignment
Instead of overlap-layout-consensus we use alignment-consensus.

Alignment algorithm
AMOScmp uses MUMmer.
MUMmer will be covered in detail by Adam Phillippy in a later lecture.
MUMmer provides very fast alignment of closely related sequences.

Assembly of a close relative

AMOScmp algorithm
Read alignment: Each shotgun read is aligned to the reference genome using MUMmer. Repetitive sequences and polymorphisms between the target and the reference cause some reads to align in a non-contiguous fashion. We used a modified version of the Longest Increasing Subsequence (LIS) algorithm to generate chains of mutually consistent matches between each read and the reference.
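The chaining step can be illustrated with a small sketch. This is not the AMOScmp implementation: the slide only says a modified LIS is used, so the length-weighted variant below, the `Match` record, and the `chain_matches` name are all assumptions made for illustration. Two matches are treated as mutually consistent when they appear in the same order, without overlap, in both the read and the reference.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """One exact match between a read and the reference (hypothetical record)."""
    read_start: int  # start of the match in the read
    ref_start: int   # start of the match in the reference
    length: int      # match length in bases


def chain_matches(matches: List[Match]) -> List[Match]:
    """Return the heaviest chain of matches that increases in both read and reference.

    A quadratic dynamic program in the spirit of Longest Increasing Subsequence,
    weighted by match length (an assumption, not the published modification).
    """
    ordered = sorted(matches, key=lambda m: (m.read_start, m.ref_start))
    n = len(ordered)
    if n == 0:
        return []
    best = [m.length for m in ordered]  # best[i]: total length of best chain ending at i
    prev = [-1] * n                     # back-pointers for reconstructing the chain
    for i in range(n):
        for j in range(i):
            a, b = ordered[j], ordered[i]
            consistent = (a.read_start + a.length <= b.read_start
                          and a.ref_start + a.length <= b.ref_start)
            if consistent and best[j] + b.length > best[i]:
                best[i] = best[j] + b.length
                prev[i] = j
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


# A read with three collinear matches and one stray match in a distant repeat copy.
example = [Match(0, 1000, 50), Match(60, 5000, 30), Match(100, 1100, 80), Match(200, 1200, 40)]
chain = chain_matches(example)
```

In this example the stray match at reference position 5000 is left out of the chain, because including it would break the ordering of the later matches.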

Repeat resolution
1. Check to see if the paired-end sequence (the “mate”) is uniquely anchored in the genome. If it is, we place the read in the location that satisfies the constraints imposed by the mate-pair information.
2. If a read and its mate are both ambiguously placed, we attempt to find whether the mate-pair information allows us to place them both in the assembly. In some cases, there exists only one placement of both a read and its mate that satisfies the mate-pair constraints on distance and orientation.
3. When the first two steps leave us with more than one placement for a pair of reads, we choose at random one of the possible placements that satisfy the mate-pair constraints.
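The three steps above can be sketched as a single function over candidate placements. Everything here is illustrative: the `Placement` tuple, the `compatible` and `place_pair` names, and the simplified constraint model (mates on opposite strands, forward-strand mate upstream, separation within a distance window) are assumptions, not AMOScmp's actual data structures.

```python
import random
from typing import List, Optional, Tuple

Placement = Tuple[int, str]  # (reference position, strand "+" or "-"); hypothetical


def compatible(a: Placement, b: Placement, min_d: int, max_d: int) -> bool:
    """True if two placements satisfy the assumed distance and orientation constraints."""
    (pos_a, strand_a), (pos_b, strand_b) = a, b
    if strand_a == strand_b:
        return False  # mates are assumed to lie on opposite strands
    fwd, rev = (pos_a, pos_b) if strand_a == "+" else (pos_b, pos_a)
    return min_d <= rev - fwd <= max_d  # forward mate must be upstream


def place_pair(read_cands: List[Placement], mate_cands: List[Placement],
               min_d: int, max_d: int,
               rng=random) -> Optional[Tuple[Placement, Placement, str]]:
    """Choose one placement for a read and its mate, or None if none is consistent."""
    pairs = [(r, m) for r in read_cands for m in mate_cands
             if compatible(r, m, min_d, max_d)]
    if not pairs:
        return None
    if len(pairs) == 1:
        # Step 1 (one side unique) or step 2 (both ambiguous, single consistent pair).
        anchored = len(read_cands) == 1 or len(mate_cands) == 1
        return pairs[0] + ("mate-anchored" if anchored else "unique pair",)
    # Step 3: several consistent placements remain, so pick one at random.
    return rng.choice(pairs) + ("random",)


# Read matches two repeat copies; its mate is uniquely anchored 2 kb downstream of one.
result = place_pair([(1000, "+"), (50000, "+")], [(3000, "-")], 1500, 2500)
```

Passing a seeded `random.Random` instance as `rng` makes the random choice in step 3 reproducible.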

Repeat resolution: example
All shotgun reads from Streptococcus agalactiae 2603 were aligned to the final, finished genome:
26,099 reads total
25,310 uniquely anchored in the genome
314 placed with the help of a uniquely anchored mate
22 placed as unique pairs, with neither read being unique on its own
442 placed in a randomly chosen copy of a repeat

Read alignment: anomalies
Reads don’t always align properly.
AMOScmp uses certain alignment patterns to detect differences in the new “target” genome.
Many of these can be resolved.

Mapping reads to the reference genome when the target genome contains an insertion. The bottom indicates the true layout of the reads (A, B, C) along the target. The top indicates the alignment of the reads to the reference. Slanted lines depict portions of the reads that do not match; in the case of read B, the entire read does not align to the reference.

The insertion in the target genome is shorter than a single read. The "bubbles" identify the portions of the two reads that do not align to the reference.

Insertion into the reference. The alignment of reads to the reference (top) indicates the presence of the insertion. Dashed lines indicate the “stretch” of the reads needed to align to the reference.

Signature of a genome rearrangement
Regions II and III from the target appear in a different order in the reference. Reads A, B, and C match the reference in disjoint locations; the dashed lines connect sections of a read that are adjacent in the target genome.
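The signatures described on these slides can be told apart by comparing how far apart two aligned pieces of one read are in the read versus in the reference. The sketch below is a simplified illustration under stated assumptions: the `Segment` record, the `classify` function, the tolerance `tol`, and the rule used for divergent regions are all hypothetical, both segments are assumed to align to the same strand, and the short flanking repeats that complicate real alignments are not handled.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One aligned piece of a read (hypothetical record; end coordinates exclusive)."""
    read_start: int
    read_end: int
    ref_start: int
    ref_end: int


def classify(a: Segment, b: Segment, tol: int = 10) -> str:
    """Classify the junction between two segments, where a precedes b in the read."""
    read_gap = b.read_start - a.read_end  # unaligned read bases between the pieces
    ref_gap = b.ref_start - a.ref_end     # reference bases skipped between the pieces
    if b.ref_start < a.ref_start:
        return "rearrangement"            # pieces appear in a different order in the reference
    if read_gap - ref_gap > tol:
        return "insertion in target"      # the read carries bases the reference lacks
    if ref_gap - read_gap > tol:
        return "insertion in reference"   # the read must stretch across extra reference bases
    if read_gap > tol:
        return "divergent region"         # similar-sized unaligned stretch on both sides
    return "collinear"


first = Segment(0, 300, 1000, 1300)
labels = [
    classify(first, Segment(500, 800, 1300, 1600)),  # 200 extra bases in the read
    classify(first, Segment(300, 600, 2300, 2600)),  # 1000 extra bases in the reference
    classify(first, Segment(300, 600, 200, 500)),    # second piece maps upstream
]
```

A production classifier would also need to consider strand changes and the pooled evidence of many reads before calling a difference in the target.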

Signature of a divergent region
The gray areas are divergent: they are not recognizably similar. Portions of the reads not matching the reference are shown at an angle.

Effect of short flanking repeats on the alignment of a read to the reference in the case of an insertion in the reference. The repeat is shown in gray. The dashed lines connect sections of read A that occur twice in the reference but once in A and in the target genome.

Assembly of 1 Mb of S. agalactiae 2603
The rows correspond (top to bottom) to:
CA: scratch assembly contigs created by Celera Assembler
2603: AMOS-Comp contigs created using strain 2603 as a reference
NEM: AMOS-Comp contigs created using strain NEM 316 as a reference
nucmer: the alignment of strain NEM 316 to strain 2603
Stacked arrows in the bottom row correspond to repeats.

Assemblies of strain 2603 produced by AMOScmp

Completeness of assembly (mapped back to finished strain 2603)
The total gap size indicates the total number of bases missing from the assembled contigs after mapping them to the finished genome. The column marked LW represents the theoretical estimate of coverage based on Lander-Waterman [19] statistics.
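For orientation, the basic Lander-Waterman expectations can be computed directly. The sketch below uses the standard formulas for reads placed uniformly at random: coverage c = N·L/G, expected uncovered fraction e^(-c), and expected number of contigs N·e^(-c·σ) with σ = 1 − T/L for a minimum detectable overlap T. The function name and the input values are illustrative assumptions, not the figures behind the LW column on the slide.

```python
import math


def lander_waterman(genome_len: int, n_reads: int, read_len: int,
                    min_overlap: int = 0) -> dict:
    """Lander-Waterman expectations for a random shotgun of a genome."""
    c = n_reads * read_len / genome_len   # average depth of coverage
    sigma = 1 - min_overlap / read_len    # fraction of a read usable for detecting overlap
    return {
        "coverage": c,
        "fraction_uncovered": math.exp(-c),
        "expected_gap_bases": genome_len * math.exp(-c),
        "expected_contigs": n_reads * math.exp(-c * sigma),
    }


# Hypothetical project: 1 Mb genome, 10,000 reads of 500 bases (5x coverage).
stats = lander_waterman(1_000_000, 10_000, 500)
```

Comparing `expected_gap_bases` with the observed total gap size shows how close an assembly comes to the theoretical limit for its depth of coverage.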

Limits on comparative assembly


Fishing in the Trace Archive
2,772,509 reads (traces) for Drosophila ananassae
2,214,248 traces for D. simulans
2,445,065 traces for D. mojavensis

Discovery of fruit-fly bacterial endosymbionts in published data
Wolbachia pipientis is an intracellular bacterial endosymbiont of fruit flies (genus Drosophila) and other insects, primarily found in the reproductive organs of females. The endosymbiont is often inadvertently sequenced as part of a fruit fly genome project.
Assembly strategy
Use the completed sequence of the Wolbachia endosymbiont of Drosophila melanogaster (wMel) to extract Wolbachia reads from Drosophila shotgun data deposited in the NCBI Trace Archive.

Strategy 1
Identify reads matching wMel with nucmer.
Assemble the extracted reads with Celera Assembler.
Strategy 2
Extract and assemble reads with the comparative assembler AMOScmp.
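The extraction step of Strategy 1 amounts to filtering the shotgun reads down to those whose identifiers appear in the alignment results. The sketch below assumes the set of matching read identifiers has already been collected from the nucmer output (that parsing is not shown) and that the reads are available as FASTA text. The function names and the toy records are hypothetical.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The identifier is the first whitespace-delimited token of the header.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def extract_matching(lines: Iterable[str], matching_ids: Set[str]) -> List[Tuple[str, str]]:
    """Keep only the reads whose identifier is in matching_ids."""
    return [(name, seq) for name, seq in parse_fasta(lines) if name in matching_ids]


# Toy input: three reads, two of which matched the reference.
reads = [">r1 some description", "ACGT", "AC", ">r2", "GGGG", ">r3", "TTTT"]
kept = extract_matching(reads, {"r1", "r3"})
```

The retained reads would then be handed to the assembler; with millions of traces per species, streaming the file line by line as done here avoids loading every read into memory at once.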

[Figure: AMOScmp takes the wMel reference and the Drosophila + Wolbachia reads and produces a Wolbachia assembly.]

                    wAna         wSim       wWil
Molecule length     1,440,650    896,761    922,146
# matching reads    32,720       3,727      2,291
# contigs
# scaffolds         32984
# genes             1,

wAna: Wolbachia endosymbiont of D. ananassae
wSim: Wolbachia endosymbiont of D. simulans
wWil: Wolbachia endosymbiont of D. willistoni
(NOTE: D. mojavensis turned out to be an erroneous submission; D. willistoni was discovered later)