1 Sequencing and Sequence Assembly: overview of the genome sequencing process. Presented by NIE, Lan. CSE497, Feb. 24, 2004.



2 Introduction
Q: What is sequencing? A: To sequence a DNA molecule is to obtain the string of bases that it contains. The resulting string is also known as a read.
Q: How do we sequence? A: Recall the Sanger sequencing technology mentioned in Chapter 1.

3 Introduction: Sanger Sequencing. Cut the DNA at each base: A, C, G, T. A fragment's migration distance is inversely related to its size, so shorter fragments travel farther. Run the gel and read off the sequence: TCGCGATAGCTGTGCTA
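As a rough illustration of "run gel and read off sequence", the sketch below models an idealized gel: each of the four lanes holds the lengths of the fragments that terminate in that base, and sorting all bands by length recovers the read. The names `run_gel` and `read_gel` are hypothetical, and a real gel is far noisier than this model.

```python
def run_gel(sequence):
    """Idealized gel: one lane per base, holding the lengths of the
    fragments that end in that base (length i = ends at position i)."""
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(sequence, start=1):
        lanes[base].append(position)
    return lanes


def read_gel(lanes):
    """Read the sequence back: shortest fragment first, one base per band."""
    bands = [(length, base)
             for base, lengths in lanes.items()
             for length in lengths]
    return "".join(base for _, base in sorted(bands))


lanes = run_gel("TCGCGATAGCTGTGCTA")
print(read_gel(lanes))
```

Each band length occurs in exactly one lane, so sorting the bands by length is unambiguous in this model.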

4 Introduction
Limitation: the size of DNA fragments that can be read in this way is about 700 bp.
Problem: most genomes are enormous (e.g. about 3 × 10^9 base pairs in the case of human), so they cannot be sequenced directly. Sequencing them is called large-scale sequencing.

5 Introduction: Solution
• Break the DNA into small fragments at random
• Sequence the readable fragments directly
• Assemble the fragments to reconstruct the original DNA
• Scaffold: bridge the gaps that remain between the assembled pieces
This is like solving a one-dimensional jigsaw puzzle with millions of pieces (without the box)!

6 1. Break 2. Sequence 3. Assemble 4. Scaffolder 5. Conclusion

7 Break: DNA can be cut into pieces by mechanical means.

8 Issues in Break
• How?
  Coverage: taken together, the fragments provide an 8X oversampling of the genome.
  Random: libraries with piece sizes of 2, 4, 6, 10, 12 and 40 kbp were produced.
  Clone: obtain several copies of the original genome and of the fragments.
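The coverage figure can be made concrete with a little arithmetic: coverage is the total number of sequenced bases divided by the genome length. The sketch below is a minimal illustration, assuming reads of a fixed length sampled uniformly at random; the 3,000,000 bp genome and the function names are hypothetical, while the 700 bp read length and the 8X target come from the slides.

```python
import math
import random


def coverage(num_reads, read_len, genome_len):
    """Average number of reads covering each base."""
    return num_reads * read_len / genome_len


def reads_needed(target_coverage, read_len, genome_len):
    """Smallest number of reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_len / read_len)


def shotgun_reads(genome, num_reads, read_len, seed=0):
    """Sample fixed-length reads at uniformly random start positions."""
    rng = random.Random(seed)
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(num_reads)]
    return [genome[s:s + read_len] for s in starts]


n = reads_needed(8, 700, 3_000_000)  # hypothetical 3 Mbp genome
print(n, round(coverage(n, 700, 3_000_000), 2))
```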

9 1. Break 2. Sequence 3. Assemble 4. Scaffolder 5. Conclusion

10 Sequence: clone, then directed sequencing (gel): GTCCAGCCT. Q: Can we read the fragment from both ends?

11 1. Break 2. Sequence 3. Assemble 4. Scaffolder 5. Conclusion

12 3. Assemble: A Simple Example
Fragments: ACCGT, CGTGC, TTAC
Overlap: the suffix of one fragment is the same as the prefix of another.
Assemble: align multiple fragments into a single continuous sequence based on fragment overlaps.
--ACCGT
----CGTGC
TTAC
TTACCGTGC
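The overlap and merge operations of this example can be sketched in a few lines. This is a minimal illustration that assumes exact (error-free) overlaps; the function names are hypothetical.

```python
def overlap_len(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a, b):
    """Glue b onto a, writing the shared overlap only once."""
    return a + b[overlap_len(a, b):]


print(overlap_len("TTAC", "ACCGT"))              # suffix AC = prefix AC
print(merge(merge("TTAC", "ACCGT"), "CGTGC"))
```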

13 3. Assemble [Figure: fragments of the original target are assembled into contig 1 and contig 2, which are separated by a gap.]

14 A simple model
The simplest, naive approximation of DNA assembly corresponds to the Shortest Common Superstring problem (SCS): given a set of strings s1, ..., sn, find the shortest string s such that each si appears as a substring of s.
--ACCGT
----CGTGC
TTAC
TTACCGTGC
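For a handful of fragments the shortest common superstring can be found by brute force, which makes the definition concrete. The sketch below tries every ordering and merges neighbours with their maximal overlap; it assumes no fragment is contained in another and takes factorial time, so it is an illustration only, not a practical assembler.

```python
from itertools import permutations


def overlap_len(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_in_order(fragments):
    """Merge fragments left to right, overlapping as much as possible."""
    result = fragments[0]
    for frag in fragments[1:]:
        result += frag[overlap_len(result, frag):]
    return result


def shortest_common_superstring(fragments):
    """Exhaustive search over all orderings (tiny inputs only)."""
    return min((merge_in_order(p) for p in permutations(fragments)), key=len)


print(shortest_common_superstring(["ACCGT", "CGTGC", "TTAC"]))
```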

15 (1) Overlap step: create an overlap graph in which every node is a fragment and every edge indicates an overlap.
(2) Layout step: determine which overlaps will be used in the final assembly by finding an optimal spanning forest on the overlap graph.

16 Overlap step: Finding overlaps. Compare each fragment with every other fragment to find out whether the end of one overlaps the beginning of the other. We say "a overlaps b" when a suffix of a equals a prefix of b.

17 Overlap step: Overlap graph. A directed, weighted graph G(V, E, w). V: the set of fragments. E: the set of directed edges, each indicating an overlap between two fragments. An edge from a to b with weight w means that a overlaps b by w characters, that is, suffix(a, w) = prefix(b, w).

18 Example [Figure: overlap graph on the nodes w, z, x, u, s, y]
W = AGTATTGGCAATC
Z = AATCGATG
U = ATGCAAACCT
X = CCTTTTGG
Y = TTGGCAATCA
S = AATCAGG
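The overlap graph for these six fragments can be built by comparing every ordered pair. The sketch below assumes exact overlaps and a minimum overlap of 3 characters; that threshold is an assumption chosen for illustration and is not stated on the slide.

```python
def overlap_len(a, b, min_len=3):
    """Longest suffix of a equal to a prefix of b, or 0 if below min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(fragments, min_len=3):
    """Map each directed edge (a, b) to its weight, the overlap length."""
    edges = {}
    for a, seq_a in fragments.items():
        for b, seq_b in fragments.items():
            if a != b:
                weight = overlap_len(seq_a, seq_b, min_len)
                if weight:
                    edges[(a, b)] = weight
    return edges


fragments = {
    "W": "AGTATTGGCAATC", "Z": "AATCGATG", "U": "ATGCAAACCT",
    "X": "CCTTTTGG", "Y": "TTGGCAATCA", "S": "AATCAGG",
}
for (a, b), weight in sorted(overlap_graph(fragments).items()):
    print(f"{a} -> {b}: {weight}")
```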

19 Layout step. Looking for the shortest common superstring is the same as looking for a path of maximum weight through all fragments of the overlap graph. Use a greedy algorithm that selects the edge with the best (largest) weight at every step. Each selected edge is checked against the rule below: if it passes the check it is accepted, otherwise it is omitted. Rule: for either node on the edge, indegree and outdegree must stay <= 1, and the accepted edges must remain acyclic.

20 In the end the fragments are merged together. From the graph's point of view, the result is a forest of disjoint paths: each path visits each of its nodes exactly once (a Hamiltonian path over the fragments of one contig), and each path corresponds to a contig.
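The greedy layout rule can be sketched as follows, using the fragments of the running example. This is a minimal illustration under assumptions not stated on the slides: overlaps are exact, the minimum overlap is 3, and ties between equal weights are broken arbitrarily.

```python
def overlap_len(a, b, min_len=3):
    """Longest suffix of a equal to a prefix of b, or 0 if below min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_layout(fragments, min_len=3):
    """Accept edges by decreasing weight while every node keeps
    indegree <= 1 and outdegree <= 1 and no cycle is created."""
    edges = sorted(
        ((overlap_len(sa, sb, min_len), a, b)
         for a, sa in fragments.items()
         for b, sb in fragments.items() if a != b),
        reverse=True)
    succ, pred, weight = {}, {}, {}
    for w, a, b in edges:
        if w == 0 or a in succ or b in pred:
            continue  # no overlap, or a degree constraint would break
        end = b
        while end in succ:  # walk to the end of the path starting at b
            end = succ[end]
        if end == a:
            continue  # adding a -> b would close a cycle
        succ[a], pred[b], weight[(a, b)] = b, a, w
    contigs = []
    for start in fragments:
        if start in pred:
            continue  # not the first fragment of a path
        path, seq, node = [start], fragments[start], start
        while node in succ:
            nxt = succ[node]
            seq += fragments[nxt][weight[(node, nxt)]:]
            path.append(nxt)
            node = nxt
        contigs.append((path, seq))
    return contigs


fragments = {
    "W": "AGTATTGGCAATC", "Z": "AATCGATG", "U": "ATGCAAACCT",
    "X": "CCTTTTGG", "Y": "TTGGCAATCA", "S": "AATCAGG",
}
for path, seq in greedy_layout(fragments):
    print(" -> ".join(path), seq)
```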

21 Example [Figure: overlap graph on the nodes w, z, x, u, s, y]
W = AGTATTGGCAATC, Z = AATCGATG, U = ATGCAAACCT, X = CCTTTTGG, Y = TTGGCAATCA, S = AATCAGG
W -> Y -> S:
AGTATTGGCAATC
----TTGGCAATCA
---------AATCAGG
AGTATTGGCAATCAGG
Z -> U -> X:
AATCGATG
-----ATGCAAACCT
------------CCTTTTGG
AATCGATGCAAACCTTTTGG

22 The greedy algorithm is neither optimal nor complete, and it will introduce gaps. Example: GCC, ATGC, TGCAT. The simple model also cannot correctly describe the assembly problem, because of complications in real problem instances.
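The three fragments on this slide show the problem directly. The sketch below compares a greedy merge (always merge the pair with the largest overlap) with an exhaustive search over all orderings. Both functions are hypothetical illustrations that assume exact overlaps; they are not the implementation of any particular assembler.

```python
from itertools import permutations


def overlap_len(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a, b):
    return a + b[overlap_len(a, b):]


def greedy_superstring(fragments):
    """Repeatedly merge the ordered pair with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (-1, 0, 1)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j and overlap_len(a, b) > best[0]:
                    best = (overlap_len(a, b), i, j)
        _, i, j = best
        merged = merge(frags[i], frags[j])
        frags = [f for k, f in enumerate(frags) if k not in (i, j)]
        frags.append(merged)
    return frags[0]


def optimal_superstring(fragments):
    """Exhaustive search over all orderings (tiny inputs only)."""
    def chain(order):
        result = order[0]
        for frag in order[1:]:
            result = merge(result, frag)
        return result
    return min((chain(p) for p in permutations(fragments)), key=len)


frags = ["GCC", "ATGC", "TGCAT"]
print(greedy_superstring(frags), optimal_superstring(frags))
```

Greedy first commits to the length-3 overlap between ATGC and TGCAT, after which GCC no longer overlaps anything; the ordering TGCAT, ATGC, GCC uses two length-2 overlaps and is shorter overall.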

23 Complications with Assembly
• Sequencing errors. Most sequencers have around 1% error in the best case.
• Unknown orientation. Either strand could have been sequenced.
• Bias in the reads. Not all regions of the sequence will be covered equally.
• Repeats. There is much repetitive sequence, especially in human and higher plants.

24 Sequencing Errors. Fragments contain three kinds of errors: insertions, deletions and substitutions. Frequency: substitutions occur at 0.5-2%; insertions and deletions occur roughly 10 times less frequently.

25 Problems with the simple model - Errors
x: ACCGT, y: CGTGC, z: TTAC, u: TACCGT
[Figure: overlap graphs on the nodes x, y, u, z; the edge labels are not recoverable.]
--ACCGT
----CGTGC
TTAC
-TACCGT
TTACCGTGC

26 Problems with the simple model - Errors
Solution: allow a bounded number of mismatches between overlapping fragments (approximate overlaps).
Criteria: a minimum overlap length (40 bp) and a maximum error rate (less than 6% mismatches).
How? Use semi-global alignment to find the best match between the suffix of one sequence and the prefix of another.
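The two criteria can be sketched as a simple filter. Note the simplification: the version below counts substitutions only (a Hamming-distance check on equal-length suffix and prefix), so it does not handle insertions or deletions, which need the semi-global alignment the slide mentions. The function name is hypothetical; the defaults of 40 bp and 6% come from the slide.

```python
def approx_overlap(a, b, min_len=40, max_error_rate=0.06):
    """Longest k such that the last k bases of a and the first k bases
    of b differ in at most max_error_rate * k positions (0 if none)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-k:], b[:k]))
        if mismatches <= max_error_rate * k:
            return k
    return 0


# Toy reads are far shorter than 40 bp, so the thresholds are relaxed here.
print(approx_overlap("AAAACCGTGG", "CCGAGGTTTT", min_len=5, max_error_rate=0.2))
```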

27 Semi-global alignment
Scoring system: 1 for a match, -1 for a mismatch, -2 for a gap.
Initialize the first row and the first column to zero, so that gaps at both extremities are ignored. Otherwise the algorithm is the same as for global comparison.
Search the last column for the highest score and obtain the alignment by tracing back to the starting point (overlap of x over y). The overlap of y over x corresponds to the maximum in the last row.
[Figure: alignment matrix with axes x and y]

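28 Semi-global alignment: example
x: ACCGT, y: CGATGC
[Figure: alignment matrix with CGATGC along one axis and ACCGT along the other]
Overlap x -> y:
ACCG-T--
--CGATGC
Overlap y -> x: CGATGC, ACCGT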
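The scoring scheme above (1 for a match, -1 for a mismatch, -2 for a gap, first row and column set to zero) can be sketched as a small dynamic program. Which of the last row or last column corresponds to which direction depends on which sequence indexes the rows; here x indexes the rows, which is an assumption because the matrix layout of the slide is not recoverable. The function returns scores only and does no traceback.

```python
def overlap_scores(x, y, match=1, mismatch=-1, gap=-2):
    """Semi-global alignment scores with free end gaps.
    Returns (best score with x fully consumed, best score with y fully consumed)."""
    rows, cols = len(x) + 1, len(y) + 1
    score = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + gap,   # gap in y
                              score[i][j - 1] + gap)   # gap in x
    x_over_y = max(score[-1])                  # last row: end of x reached
    y_over_x = max(row[-1] for row in score)   # last column: end of y reached
    return x_over_y, y_over_x


print(overlap_scores("ACCGT", "CGATGC"))
```

For x = ACCGT and y = CGATGC the best x-over-y alignment pairs CG-T with CGAT: three matches and one gap, for a score of 1.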

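29 Problems with the simple model - Errors
x: ACCGT, y: CGTGC, z: TTAC, u: TACCGT
[Figure: overlap graphs on the nodes x, y, u, z, before and after allowing approximate overlaps; the edge labels are not recoverable.]
Layout without errors:
--ACCGT
----CGTGC
TTAC
-TACCGT
TTACCGTGC
Layout with errors:
--ACCG-T
----CGATGC
TT-C
-TAGCGT
TTACCGTGC
Criteria: 1. score > -3; 2. mismatches < 2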

30 Problems with the simple model - Unknown orientation
Unknown orientation: fragments can be read from either of the two DNA strands.
Solution: try all possible combinations.
Fragments as read: CACGT, ACGT, ACTACG, GTACT
After reverse-complementing the last two: CACGT, ACGT, CGTAGT, AGTAC
CACGT
-ACGT
--CGTAGT
-----AGTAC
CACGTAGTACTGA
[Figure: fragments z, x, y and their reverse complements Z', X', Y']
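Handling unknown orientation starts with the reverse complement of each fragment. The sketch below is a minimal illustration with hypothetical function names; an assembler would consider both orientations of every fragment when searching for overlaps.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """The same fragment as read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def both_orientations(fragments):
    """Pair each fragment with its reverse complement."""
    return [(f, reverse_complement(f)) for f in fragments]


for forward, reverse in both_orientations(["CACGT", "ACGT", "ACTACG", "GTACT"]):
    print(forward, reverse)
```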

31 Problems with the simple model - Repeat
Repeats can be characterized by length, copy number and fidelity between copies:
– Human T-cell receptor: 5 copies of a 4 kb gene with ~3% variation
– ALUs: ~300 bp with 5-15% variation, clustered so that they make up 50-60% of many human sequence regions
– Microsatellites: 3-6 bp units with thousands of repeats in centromeric and telomeric regions, 1-2% variation
gepard.bioinformatik.uni-saarland.de/html/BioinformatikIIIWS0304-Dateien/ V3-Assembly.ppt

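32 Problems with the simple model - Repeat (2): Rearrangement
[Figure: the original sequence (DX1BCX3X2A) contains three copies X1, X2, X3 of a repeat among the unique regions A, B, C, D. After fragmentation, the assembler can place the repeat copies in a different order, giving the rearranged consensus DX2BCX1X3A.]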

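33 Problems with the simple model - Repeat (3): Over-collapsing
[Figure: the original sequence (X1BCX2A) contains two copies X1 and X2 of a repeat. The assembly collapses them into a single copy X, so the target is reported as contig 1 and contig 2 with a gap between them.]
Over-collapsing! The shortest string is not always the best!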
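A tiny example shows why the shortest string can be wrong. In the sketch below, a hypothetical genome made of three tandem copies of ACGT is cut into every read of length 6; a string with fewer copies of the repeat still contains every read, so a shortest-superstring assembler has no reason to prefer the true genome.

```python
def reads_of(genome, read_len):
    """Every distinct substring of the given length (idealized, error-free reads)."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


genome = "ACGTACGTACGT"   # hypothetical genome: three tandem copies of ACGT
reads = reads_of(genome, 6)
collapsed = "ACGTACGTA"   # shorter string with the repeat partly collapsed

print(sorted(reads))
print(all(read in collapsed for read in reads), len(collapsed), len(genome))
```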

34 Problems with the simple model - Lack of coverage
Not all regions of the sequence will be covered equally. [Figure: target DNA with an uncovered area]
Solution: do more sampling to increase the coverage level, and use scaffolding technology.
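How much coverage helps can be estimated with a standard Poisson approximation: if reads start uniformly at random, a given base is missed by every read with probability about e^(-c) at coverage c. The sketch below compares that estimate with a small simulation; the genome length, read length and read count are hypothetical values chosen for illustration.

```python
import math
import random


def expected_uncovered(c):
    """Poisson approximation of the fraction of bases hit by no read."""
    return math.exp(-c)


def simulated_uncovered(genome_len, read_len, num_reads, seed=0):
    """Drop reads at uniformly random positions and count uncovered bases."""
    rng = random.Random(seed)
    diff = [0] * (genome_len + 1)  # difference array of read depth
    for _ in range(num_reads):
        start = rng.randint(0, genome_len - read_len)
        diff[start] += 1
        diff[start + read_len] -= 1
    uncovered = depth = 0
    for i in range(genome_len):
        depth += diff[i]
        if depth == 0:
            uncovered += 1
    return uncovered / genome_len


# Coverage c = 4000 * 500 / 1_000_000 = 2
print(expected_uncovered(2), simulated_uncovered(1_000_000, 500, 4000))
print(expected_uncovered(8))  # the 8X oversampling mentioned earlier
```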

35 1. Break 2. Sequence 3. Assemble 4. Scaffolder 5. Conclusion

36 4. Scaffolder
• Scaffold: given a set of non-overlapping contigs, order and orient them to reconstruct the original DNA.
• How? Is there any relationship that can be established between different contigs?
[Figure: contigs X, B, C, A and their reoriented versions X, B', C', A]

37 4. Scaffolder - Mate Pairs
Mate pairs:
• The sequenced ends face towards each other.
• The distance between the two reads is approximately known (the insert size minus the lengths of the two reads).
• Mate pairs are extremely valuable during the scaffolding step.
[Figure: a mate pair]

38 4. Scaffolder - Method. A scaffolder retrieves the original mate pairs whose two reads fall in different contigs. It uses the link information of these pairs (distance, orientation) to orient the contigs and to estimate the gap sizes; this is called a "walk".
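The gap estimate from mate-pair links can be sketched with simple arithmetic. The coordinate convention below is an assumption made for illustration: read 1 lies on contig A on the forward strand and starts at `start_on_a`, read 2 lies on contig B on the reverse strand with its far end at `end_on_b`, and the insert size is known exactly. Real inserts vary in size, which is why several links are averaged.

```python
from statistics import mean


def gap_from_mate_pair(insert_size, len_a, start_on_a, end_on_b):
    """Gap between contig A and contig B implied by one mate pair."""
    tail_of_a = len_a - start_on_a  # part of the insert inside contig A
    head_of_b = end_on_b            # part of the insert inside contig B
    return insert_size - tail_of_a - head_of_b


def estimate_gap(links):
    """Average the gap implied by several mate pairs linking two contigs."""
    return mean(gap_from_mate_pair(*link) for link in links)


links = [(2000, 1000, 800, 300), (2000, 1000, 700, 400)]
print(estimate_gap(links))
```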

39 4. Scaffolder - Example [Figure: contig 1 and contig 2 separated by a gap]

40 4. Scaffolder - Graph Representation
• Nodes: contigs
• Directed edges: constraints on the relative placement of contigs (relative order and relative orientation)
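Once the order constraints are written as directed edges, one simple way to obtain a consistent contig order is a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). It handles relative order only; orientation constraints and conflicting links are ignored, and the contig names are hypothetical.

```python
from graphlib import TopologicalSorter


def order_contigs(constraints):
    """constraints: iterable of (before, after) pairs taken from the
    directed edges. Returns one contig order that satisfies all of them;
    graphlib raises CycleError if the constraints contradict each other."""
    sorter = TopologicalSorter()
    for before, after in constraints:
        sorter.add(after, before)  # 'after' must come later than 'before'
    return list(sorter.static_order())


print(order_contigs([("contig1", "contig2"), ("contig2", "contig3")]))
```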

41 1. Break 2. Sequence 3. Assemble 4. Scaffolder 5. Conclusion

42 5. Conclusion
• The whole genome sequencing process: Break -> Sequence -> Assemble -> Scaffold
• A simple model: use an overlap graph to construct the shortest common superstring. However, it cannot correctly model the assembly problem.

43 Conclusion - Repeat
Repeat detection
– pre-assembly: find fragments that belong to repeats, either statistically (most existing assemblers) or with a repeat database (RepeatMasker)
– during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
– post-assembly: find repetitive regions and potential mis-assemblies (REPuter, RepeatMasker)
Repeat resolution
– find the DNA fragments belonging to the repeat
– determine the correct tiling across the repeat