CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics.

Slides:



Advertisements
Similar presentations
CS 336 March 19, 2012 Tandy Warnow.
Advertisements

Genome Assembly: a brief introduction
WGS Assembly and Reads Clustering Zemin Ning Production Software Group Informatics Division.
3. Lecture WS 2003/04Bioinformatics III1 Whole Genome Shotgun Assembly Two strategies for sequencing: clone-by-clone approach whole-genome shotgun approach.
Lecture 14 Genome sequencing projects
GNANA SUNDAR RAJENDIRAN JOYESH MISHRA RISHI MISHRA FALL 2008 BIOINFORMATICS Clustering Method for Repeat Analysis in DNA sequences.
CS273a Lecture 4, Autumn 08, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector.
DNA Sequencing Lecture 9, Tuesday April 29, 2003.
Genome Sequence Assembly: Algorithms and Issues Fiona Wong Jan. 22, 2003 ECS 289A.
DNA Sequencing – “Plus and Minus” Plus –Incubate with T4 DNA Polymerase and single dNTP –T4 Polymerase degrades 3’ ends in absence of dNTP –Fractionated.
Whole Genome Alignment using Multithreaded Parallel Implementation Hyma S Murthy CMSC 838 Presentation.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
Genome sequence assembly
CS262 Lecture 11, Win07, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
DNA Sequencing. The Walking Method 1.Build a very redundant library of BACs with sequenced clone- ends (cheap to build) 2.Sequence some “seed” clones.
DNA Sequencing Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the circular genome (host)
Gene Finding (DNA signals) Genome Sequencing and assembly
Assembly.
DNA Sequencing and Assembly
DNA Sequencing.
CSE182-L12 Gene Finding.
CS273a Lecture 4, Autumn 08, Batzoglou Fragment Assembly (in whole-genome shotgun sequencing) CS273a Lecture 5.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
Sequencing and Assembly Cont’d. CS273a Lecture 5, Aut08, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
DNA Sequencing. CS273a Lecture 3, Spring 07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA.
Human Genome Project. Basic Strategy How to determine the sequence of the roughly 3 billion base pairs of the human genome. Started in Various side.
Fa 06CSE182 CSE182-L11 Protein sequencing and Mass Spectrometry.
CS273a Lecture 2, Autumn 10, Batzoglou DNA Sequencing (cont.)
Whole Genome Assembly Microarray analysis. Mate Pairs Mate-pairs allow you to merge islands (contigs) into super-contigs.
Genome sequencing and assembling
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Assembling Genomes BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
De-novo Assembly Day 4.
Mouse Genome Sequencing
CS 394C March 19, 2012 Tandy Warnow.
Graphs and DNA sequencing CS 466 Saurabh Sinha. Three problems in graph theory.
Next generation sequence data and de novo assembly For human genetics By Jaap van der Heijden.
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Steps in a genome sequencing project Funding and sequencing strategy source of funding identified / community drive development of sequencing strategy.
Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake.
A Sequenciação em Análises Clínicas Polymerase Chain Reaction.
SIZE SELECT SHEAR Shotgun DNA Sequencing (Technology) DNA target sample LIGATE & CLONE Vector End Reads (Mates) SEQUENCE Primer.
Fragment assembly of DNA A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.
Lecture 12: Linkage Analysis V Date: 10/03/02  Least squares  An EM algorithm  Simulated distribution  Marker coverage and density.
Mojavensis: Issues of Polymorphisms Chris Shaffer GEP 2009 Washington University.
Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of.
CS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
1. Assembly by alignment Instead of overlap-layout-consensus we use alignment-consensus 2.
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
Genome Research 12:1 (2002), Assembly algorithm outline ● Input and trimming ● Overlap detection ● Error correction ● Evaluation of alignments.
Human Genome Project.
DNA Sequencing Project
Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es.
CAP5510 – Bioinformatics Sequence Assembly
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Jeong-Hyeon Choi, Sun Kim, Haixu Tang, Justen Andrews, Don G. Gilbert
Fragment Assembly (in whole-genome shotgun sequencing)
Genome sequence assembly
Removing Erroneous Connections
Whole Genome Assembly.
CSE182-L12 Gene Finding.
CS 598AGB Genome Assembly Tandy Warnow.
Assembling Genomes BCH339N Systems Biology / Bioinformatics – Spring 2016 Edward Marcotte, Univ of Texas at Austin.
Fragment Assembly 7/30/2019.
Presentation transcript:

CSE182-L10 LW statistics/Assembly

Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics & Repeats argue against the success of such an approach Alternative: build a roadmap of the genome, with physical clones mapped for each region. Sequence each of the clones, and put them together

Questions Algorithmic: How do you put the genome back together from the pieces? Statistical? How many pieces do you need to sequence, etc.? –The answer to the statistical questions had already been given in the context of mapping, by Lander and Waterman.

Lander Waterman Statistics G L The fragments are falling randomly on the genome Overlapping fragments form islands of contiguous sequence. Ideally, we want one island for each chromosome. How many fragments should we sequence?

Lander Waterman Statistics G L

LW statistics: questions As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island. Q1: What is the expected number of islands? Ans: N exp(-c  ) The number increases at first, and gradually decreases.

Analysis: Expected Number Islands Computing Expected # islands. Let X i =1 if an island ends at position i, X i =0 otherwise. Number of islands = ∑ i X i Expected # islands = E(∑ i X i ) = ∑ i E(X i )

Prob. of an island ending at i E(X i ) = Prob (Island ends at pos. i) = Prob(clone began at position i-L+1 AND no clone began in the next L-T positions) i L T

LW statistics Pr[Island contains exactly j clones]? Consider an island that has already begun. With probability e -c , it will never be continued. Therefore Pr[Island contains exactly j clones]= Expected # j-clone islands

Expected # of clones in an island Expected # of clones in an island = Q: How? Why do we care? Often, at the beginning of a genome project, we do not know the length of the genome. This equation helps us determine the length.

Expected length of an island

Whole Genome Sequencing & Assembly

Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics & Repeats argue against the success of such an approach

Assembly Basics Three main components: –Overlap –Layout –Consensus

Overlap Given a pair of fragments s 1 and s 2, do they belong together? Yes, if a prefix of s 2 matches a suffix of s 1 How would you compute such a match?

Overlap S[i,j] = optimum score of an alignment of s 1 [1..i] against a suffix of s 2 [1..j] i j The best prefix-suffix alignment is given by: Max i {S[i,n] }

Overlap Detection Compute the best prefix- suffix alignments between each pair of fragments. Keep the “high-scoring” ones as evidence of true overlap. What is the problem?

Overlap detection problem Consider the number of fragments. The LW statistics say that we need good coverage (c=8, 10) to get most of the base-pairs. –G = 3000Mb, L=500 –Coverage LN/G = 10 –N = 10*3*10 9 /500 = 6*10 7 –Number of comparisons needed = 3.6 * Not good! (Only a small fraction are true overlaps)

k-mer based overlap (Piegeonhole principle again) Consider a 25bp sequence. –Expected number of occurrences in the genome –3*10 9 *4 -25 = 2*10 -6 A 25-bp sequence appears is unique to the genome! Two overlapping sequences should share a 25-mer Two non-overlapping sequences should not! 25bp

Sorting k-mers Build a list of k-mers that appear in the sequences and their reverse complements Create a record with 4 entries: –K-mer –Sequence number –Position in the sequence –Reverse complementation flag Sort a vector of these according to k- mer How many records per k-mer are expected? If number of records exceeds threshold, discard (why?) K-mer S.id Pos.

Alignment module Coalesce k-mer hits into longer, gap-free partial alignments. These extended k-mer hits are saved. For each pair of sequences, form a directed graph. For each maximal path in the graph, construct an alignment. Refine alignment via banded DP

Problem2: Size Islands might simply be too small in length  = (1-T/L) = (1-50/500) = 0.9, c = 8. #Islands = N e -c  = 45K Size of an island = 54K Not enough to make it an acceptable assembly! PLUS, there is the problem of Repeats, Chimerism etc.

Solution 2: Clones can have mate- pairs Recall that we sequence about 1000bp of the end of a clone If we sequenced both ends, we get extra information, particularly if we know the length of the original clone.

Mate Pairs Mate-pairs allow you to merge islands (contigs) into super-contigs

Super-contigs are quite large Make clones of truly predictable length. EX: 3 sets can be used: 2Kb, 10Kb and 50Kb. The variance in these lengths should be small. Use the mate-pairs to order and orient the contigs, and make super-contigs.

Whole genome shotgun Input: –Shotgun sequence fragments (reads) –Mate pairs Output: –A single sequence created by consensus of overlapping reads First generation of assemblers did not include mate-pairs (Phrap, CAP..) Second generation: CA, Arachne, Euler We will discuss Arachne, a freely available sequence assembler (2nd generation)

Problem 3: Repeats

Repeats & Chimerisms 40-50% of the human genome is made up of repetitive elements. Repeats can cause great problems in the assembly! Chimerism causes a clone to be from two different parts of the genome. Can again give a completely wrong assembly

Repeat detection Lander Waterman strikes again! The expected number of clones in a Repeat containing island is MUCH larger than in a non-repeat containing island (contig). Thus, every contig can be marked as Unique, or non-unique. In the first step, throw away the non-unique islands. Repeat

Detecting Repeat Contigs 1: Read Density Compute the log-odds ratio of two hypotheses: H1: The contig is from a unique region of the genome. The contig is from a region that is repeated at least twice

Detecting Chimeric reads Chimeric reads: Reads that contain sequence from two genomic locations. Good overlaps: G(a,b) if a,b overlap with a high score Transitive overlap: T(a,c) if G(a,b), and G(b,c) Find a point x across which only transitive overlaps occur. X is a point of chimerism

Contig assembly Reads are merged into contigs upto repeat boundaries. (a,b) & (a,c) overlap, (b,c) should overlap as well. Also, –shift(a,c)=shift(a,b)+shift(b,c) Most of the contigs are unique pieces of the genome, and end at some Repeat boundary. Some contigs might be entirely within repeats. These must be detected

Creating Super Contigs

Supercontig assembly Supercontigs are built incrementally Initially, each contig is a supercontig. In each round, a pair of super-contigs is merged until no more can be performed. Create a Priority Queue with a score for every pair of ‘mergeable supercontigs’. –Score has two terms: A reward for multiple mate-pair links A penalty for distance between the links.

Supercontig merging Remove the top scoring pair (S 1,S 2 ) from the priority queue. Merge (S 1,S 2 ) to form contig T. Remove all pairs in Q containing S 1 or S 2 Find all supercontigs W that share mate- pair links with T and insert (T,W) into the priority queue. Detect Repeated Supercontigs and remove

Repeat Supercontigs If the distance between two super-contigs is not correct, they are marked as Repeated If transitivity is not maintained, then there is a Repeat

Filling gaps in Supercontigs

Consensus Derivation Consensus sequence is created by converting pairwise read alignments into multiple-read alignments

Summary Whole genome shotgun is now routine: –Human, Mouse, Rat, Dog, Chimpanzee.. –Many Prokaryotes (One can be sequenced in a day) –Plant genomes: Arabidopsis, Rice –Model organisms: Worm, Fly, Yeast A lot is not known about genome structure, organization and function. –Comparative genomics offers low hanging fruit