Genomic sequencing and its data analysis Dong Xu Digital Biology Laboratory Computer Science Department Christopher S. Life Sciences Center University.

Slides:



Advertisements
Similar presentations
Mo17 shotgun project Goal: sequence Mo17 gene space with inexpensive new technologies Datasets in progress: Four-phases of 454-FLX sequencing to max of.
Advertisements

CS 336 March 19, 2012 Tandy Warnow.
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Genome Assembly: a brief introduction
Next Generation Sequencing, Assembly, and Alignment Methods
Lecture 14 Genome sequencing projects
Alignment Problem (Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one. Sub-optimal.
DNA Sequencing Lecture 9, Tuesday April 29, 2003.
Genome Sequence Assembly: Algorithms and Issues Fiona Wong Jan. 22, 2003 ECS 289A.
Finding approximate palindromes in genomic sequences.
CISC667, F05, Lec4, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Whole genome sequencing Mapping & Assembly.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
CS262 Lecture 11, Win07, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the.
DNA Sequencing. The Walking Method 1.Build a very redundant library of BACs with sequenced clone- ends (cheap to build) 2.Sequence some “seed” clones.
Assembly.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
Novel multi-platform next generation assembly methods for mammalian genomes The Baylor College of Medicine, Australian Government and University of Connecticut.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Aut08, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Genome sequencing and assembling
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment Ruibin Xi Peking University School of Mathematical Sciences.
Assembling Genomes BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
De-novo Assembly Day 4.
How to Build a Horse Megan Smedinghoff.
Mouse Genome Sequencing
Mon C222 lecture by Veli Mäkinen Thu C222 study group by VM  Mon C222 exercises by Anna Kuosmanen Algorithms in Molecular Biology, 5.
CS 394C March 19, 2012 Tandy Warnow.
Todd J. Treangen, Steven L. Salzberg
A hierarchical approach to building contig scaffolds Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp , 2004.
PE-Assembler: De novo assembler using short paired-end reads Pramila Nuwantha Ariyaratne.
Graphs and DNA sequencing CS 466 Saurabh Sinha. Three problems in graph theory.
1 Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee.
Next generation sequence data and de novo assembly For human genetics By Jaap van der Heijden.
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Steps in a genome sequencing project Funding and sequencing strategy source of funding identified / community drive development of sequencing strategy.
Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake.
DNA alphabet DNA is the principal constituent of the genome. It may be regarded as a complex set of instructions for creating an organism. Four different.
SIZE SELECT SHEAR Shotgun DNA Sequencing (Technology) DNA target sample LIGATE & CLONE Vector End Reads (Mates) SEQUENCE Primer.
JM - 1 Introduction to Bioinformatics: Lecture III Genome Assembly and String Matching Jarek Meller Jarek Meller Division of Biomedical.
The iPlant Collaborative
Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.
Fragment assembly of DNA A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.
Class 01 – Fragment assembly. DNA sequence data DNA sequence data is the motherlode of molecular biology. 10^10 base pairs. One human genome/year. It.
Fragment Assembly 蔡懷寬 We would like to know the Target DNA sequence.
Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
1. Assembly by alignment Instead of overlap-layout-consensus we use alignment-consensus 2.
OPERA highthroughput paired-end sequences Reconstructing optimal genomic scaffolds with.
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
Assembly S.O.P. Overlap Layout Consensus. Reference Assembly 1.Align reads to a reference sequence 2.??? 3.PROFIT!!!!!
CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.
Review: Graph Theory in Bioinformatics Yunkai Liu Assistant Professor Computer Science Department University of South Dakota.
Structural genomics includes the genetic mapping, physical mapping and sequencing of entire genomes.
Virginia Commonwealth University
Lesson: Sequence processing
DNA Sequencing Project
Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Genome sequence assembly
Introduction to Genome Assembly
CS 598AGB Genome Assembly Tandy Warnow.
How to Build a Horse: Final Report
CSCI 1810 Computational Molecular Biology 2018
Introduction to Sequencing
Assembling Genomes BCH339N Systems Biology / Bioinformatics – Spring 2016 Edward Marcotte, Univ of Texas at Austin.
Fragment Assembly 7/30/2019.
Presentation transcript:

Genomic sequencing and its data analysis Dong Xu Digital Biology Laboratory Computer Science Department Christopher S. Life Sciences Center University of Missouri, Columbia

Lecture Outline l Introduction to sequencing l Next-generation sequencers l Role of bioinformatics in sequencing l Theory of sequence assembly l Celera assembler l Assembly of short reads

What is DNA Sequencing? l A DNA sequence is the order of the bases on one strand. l By convention, we order the DNA sequence from 5’ to 3’, from left to right. l Often, only one strand of the DNA sequence is written, but usually both strands have been sequenced as a check.

Sequencing l Bacteria l Fungi, yeast l Insects: mosquito, fruit fly, moth, honey bee l Plants: Arabidopsis, rice, corn, grapevine, … l Animals: mouse, hedgehog, armadillo, cat, dog, horse, cow, elephant, platypus, … l Humans

Importance of Sequencing l Basic blueprint for life l Foundation of genomic studies l Vision: personalized medicine å Genetic disorders å Diagnostics å Therapies l $1000 genome

Lecture Outline l Introduction to sequencing l Next-generation sequencers l Role of bioinformatics in sequencing l Theory of sequence assembly l Celera assembler l Assembly of short reads

New Sequencers Illumina / Solexa Genetic Analyzer Applied Biosystems ABI 3730XL Roche / 454 Genome Sequencer FLX Applied Biosystems SOLiD

Illumina (Solexa) Workflow

Pair-end Reads l Paired-end sequencing (Mate pairs) å Sequence two ends of a fragment of known size. å Currently fragment length (insert size) can range from 200 bps – 10,000 bps

Accelerating Technology & Plummeting Cost Next Generation Sequencing

Lecture Outline l Introduction to sequencing l Next-generation sequencers l Role of bioinformatics in sequencing l Theory of sequence assembly l Celera assembler l Assembly of short reads

Analysis tasks l Initial analysis: base calling l Mapping to a reference genome l De novo or assisted genome assembly l SNP, detection/insertion, copy number l Transcriptome profiling l DNA methylation studies l CHIP-Seq

Initial Data Analysis workflow Images (.tif) Analysis Pipeline Image Analysis Base Calling Sequence Analysis For each tile: -Cluster intensities -Cluster noise For each tile: -Cluster sequence -Cluster probabilities -Corrected cluster intensities For all data: -Quality filtering -Sequence Alignment -Statistics Visualization Instrument PCAnalysis PC

Short read mapping l Input: å A reference genome å A collection of many bp tags å User-specified parameters l Output: å One or more genomic coordinates for each tag l In practice, only 70-75% of tags successfully map to the reference genome.

Multiple mapping l A single tag may occur more than once in the reference genome. l The user may choose to ignore tags that appear more than n times. l As n gets large, you get more data, but also more noise in the data.

Inexact matching l An observed tag may not exactly match any position in the reference genome. l Sometimes, the tag almost matches l Such mismatches may represent a SNP or a bad read-out. l The user can specify the maximum number of mismatches, or a quality score threshold. l As the number of allowed mismatches goes up, the number of mapped tags increases, but so does the number of incorrectly mapped tags. ?

Short-read analysis software

Lecture Outline l Introduction to sequencing l Next-generation sequencers l Role of bioinformatics in sequencing l Theory of sequence assembly l Celera assembler l Assembly of short reads

Library Creation Sequencing Assembly Gap Closure Finishing Annotation Sequencing Procedure

Genome Sequence Analysis - Step One Assemble Sequences into Contigs Sequenced fragmented DNA AAACGCGATCGATCGATCGA AAACGCGATCGATCGATCGATCGATCGATCGAT CGTAG CGATCGATCGATCGATCGTAG AAACGCGATCGATCGATCGA Assembled DNA Sequence CONTIG 1CONTIG 2CONTIG 3

Repeat Problems Repeats at read ends can be assembled in multiple ways. correct incorrect

Genome Sequence Analysis - Step One Initial Problem with Assembly Sequenced fragmented DNA Incorrectly Assembled DNA Sequence CONTIG 1 CONTIG 2

Genome Sequence Analysis - Step One Need to Mask Repeats Sequenced fragmented DNA Masked DNA Sequence CONTIG 1 CONTIG 3 CONTIG 5 CONTIG 2 CONTIG 4 Assembled DNA Sequence

Lander-Waterman Model Poisson Estimate Number of reads Average length of a read Probability of base read Lander ES, Waterman MS (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis“ Genomics 2 (3):

Lander–Waterman Assumptions 1.Sequencing reads will be randomly distributed in the genome 2. The ability to detect an overlap between two truly overlapping reads does not vary from clone to clone

In practice… Lander-Waterman is almost always an underestimate -cloning biases in shotgun libraries -repeats -GC/AT rich regions -other low complexity regions

Sequence Assembly Algorithms l Different than similarity searching l Look for ungapped overlaps at end of fragments å (method of Wilbur and Lipman, (SIAM J. Appl. Math. 44; , 1984) l High degree of identity over a short region l Want to exclude chance matches, but not be thrown off by sequencing errors

Sequence Reconstruction Algorithm In the shotgun approach to sequencing, small fragments of DNA are reassembled back into the original sequence. This is an example of the Shortest Common Superstring (SCS) problem where we are given fragments and we wish to find the shortest sequence containing all the fragments. A superstring of the set P is a single string that contains every string in P as a substring. For example: for The SCS is: GGCGCC F1 = GCGC F1 = GCGC F2 = CGCC F2 = CGCC F3 = GGCGF3 = GGCG

Greedy Algorithm for the Shortest Superstring Problem The shortest superstring problem can be examined as a Hamiltonian path and is shown to be equivalent to the Traveling Salesman problem. The shortest superstring problem is NP-complete. A greedy algorithm exists that sequentially merges fragments starting with the pair with the most overlap first. Let T be the set of all fragments and let S be an empty set. do { For the pair (s,t) in T with maximum overlap. [s=t is allowed] { If s is different from t, merge s and t. If s = t, remove s from T and add s to S. } } while ( T is not empty ); Output the concatenation of the elements of S. This greedy algorithm is of polynomial complexity and ignores the biological problems of: which direction a fragment is orientated, errors in data, insertions and deletions.

Lecture Outline l Introduction to sequencing l Next-generation sequencers l Role of bioinformatics in sequencing l Theory of sequence assembly l Celera assembler l Assembly of short reads

Celera Assembler Designed by Gene Myers, used to assemble the drosophilia, mouse and human genomes Steps:  Screener  Overlapper  Unitigger  Scaffolder & repeat resolution  Consensus

Screening reads Reads must be of very high reliability for assembly. Looking for 98%+ accuracy Vector contamination. Sequencing requires placing portions of the sequence to be determined in vectors (e.g. BACs or YACs). Need to avoid including any vector sequence Can also screen for known common repeats at this stage

Overlapper Compare every fragment to every other Criterion: at least 40bp overlap with no more than 6% mismatches Probability of a chance overlap so low that all of these are either true overlaps or part of a repeated sequence (“repeat overlap”) Key objective is to distinguish between these two possibilities as early as possible in the assembly process.

Unitigs Do the easy ones to assemble subset first. Fragments that have only one possible assembly are combined into longer sequences.  Reads which entirely match a subsegment of another  Fragment overlaps for which there are no conflicting overlaps  For Drosophila, 3.158M fragments collapse into 54,000 unitigs, going from 221M overlaps to 3.104M.

Celera Scaffolding Scaffold is a set of ordered, oriented contigs with gaps of approximately known size When the left and right reads of a mate are in different unitigs, their distance orients the unitigs and estimates the gap size. “Bundle” is a consistent set (2 or more) of mate pairs that place a pair of unitigs with respect to each other. The more mate pairs in a bundle, the higher the reliability

Scaffold picture At this point, errors are only in interiors of long repeating regions

Lecture Outline l Introduction to sequencing l Next-generation sequencers l Role of bioinformatics in sequencing l Theory of sequence assembly l Celera assembler l Assembly of short reads

Assembly for short reads l Challenging to assembly data. l Short fragment length = very small overlap therefore many false overlaps (while reads are getting longer) l Sequenced up to 100x coverage, increase in data size l Pair-end reads are helpful

Current approaches l Euler / De Bruijn approach. l More suited for short read assembly. l Implemented in Velvet, the mostly used short read assembly method at present (

De Bruijn graph method l Break each read sequence in to overlapping fragments of size k. (k-mers) l Form De Bruijn graph such that each (k- 1)-mer represents a node in the graph. l Edge exists between node a to b iff there exists a k-mer such that is prefix is a and suffix is b. l Traverse the graph in unambiguous path to form contigs.

De Bruijn graph

Summary l Is most active research area (for the next years) l Data rich; high quality (digital vs. analog) l Applicable to many studies l Promising to personalized medicine l Intensive developments for bioinformatics l Fast evolving l Assembly is challenging l Using pair-end reads is essential

Homework Read about the tools at Study Celera Assembler at assembler/index.php?title=Main_Page Study Verlet at

Homework Literature Reviews: 6.html ool=pubmed #section=777903&page=16 ool=pubmed ool=pubmed

Acknowledgments This file is for the educational purpose only. Some materials (including pictures and text) were taken from the Internet at the public domain.