Genome Sequence Assembly: Algorithms and Issues Fiona Wong Jan. 22, 2003 ECS 289A.


Presentation overview
- Background
- Shotgun sequencing, whole-genome shotgun sequencing
- Assembly algorithms
- Repeat sequences
- Scaffolding techniques
- Assembler quality issues
- Conclusions
- References

Gene Sequencing
- Genome
  - A sequence of DNA base pairs that controls how cells function in an organism
- Genomics
  - Study of genomes
  - Decoding entire genomes
- Current research techniques can decode DNA accurately for only a limited number of nucleotides at a time.

Gene Sequencing
- Shotgun sequencing (Fred Sanger, 1982)
  1. Physically break the DNA into fragments.
  2. A DNA sequencer reads the fragments.
  3. An assembler reconstructs the original sequence.
- Assembly is challenging
  - The data contains errors.
  - DNA has repetitive sections called repeats.
  - Gaps

Gene Sequencing
- Finishing
  - Resolves errors left by the assembly process
  - Costly: requires extensive human intervention and special lab techniques

DNA Sequencing
Using heat, separate the DNA into single strands. The primer binds to the intended location, and the polymerase starts extending the primer.

DNA Sequencing
To find the fragment sizes, use gel electrophoresis:
- Positions and spacing of the bands show the relative fragment sizes.
- Each fragment is terminated by a specific, known nucleotide.

DNA Sequencing
In reality the gels look like this. Researchers read the sequence off the gel from bottom to top. An automated DNA sequencer does this for large-scale readings (3-4 meters long!).

DNA Sequencing Example output – Fragment of one file (usually spans nucleotides) Sequencer plots the fragments

Gene Sequencing
- Shotgun sequencing for large genomes
  - First, break the DNA into large fragments and clone them into bacterial artificial chromosomes (BACs).
  - Map the BACs to the genome and obtain a tiling path.
  - Apply the shotgun method to each BAC.

The National Institutes of Health and the National Science Foundation fund "libraries" of BAC clones. Each BAC carries a large piece of human genomic DNA, and the pieces overlap randomly. The BACs are replicated to produce millions of copies of the human DNA. Shotgun sequencing is then applied to each BAC, and researchers use the known overlaps to reconstruct the original sequence.

Gene Sequencing
- Whole-genome shotgun sequencing (WGSS)
  - Does not use BACs; works directly on fragments of the original genome.
  - Uses human genome fragments of 2-10 kb and sequences those.
  - Computationally expensive
- Eugene Myers and colleagues successfully applied WGSS
  - Built an assembler for large genomes.
  - Assembled the entire 135 Mbp genome of a fruit fly.
  - Assembled the human genome.

Gene Sequencing
- WGSS procedure
  - Clones and coverage
    1. Shatter the DNA.
    2. Insert the pieces of DNA into cloning vectors, producing clones.
    3. Escherichia coli multiplies the plasmid.
    4. Sequence both ends of each clone insert, which yields clone-pairing data.
    5. Aim to have more than 99% of the genome covered by reads.
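
To get a feel for what a coverage target such as 99% implies, here is a minimal sketch under an idealized assumption that is not stated on the slide: reads land uniformly at random, so the chance that a given base is missed by every read is roughly e^(-c), where c is the average coverage (reads × read length / genome length). The function names are illustrative only.

```python
import math


def expected_fraction_covered(num_reads: int, read_len: int, genome_len: int) -> float:
    """Expected fraction of bases hit by at least one read (idealized model)."""
    coverage = num_reads * read_len / genome_len  # average depth c
    return 1.0 - math.exp(-coverage)


def coverage_needed(target_fraction: float) -> float:
    """Average depth c needed so that the expected covered fraction reaches the target."""
    return -math.log(1.0 - target_fraction)


if __name__ == "__main__":
    print(round(coverage_needed(0.99), 2))
    print(round(expected_fraction_covered(8, 500, 1000), 4))
```

Real projects sequence deeper than this model suggests, since sampling is not uniform and overlaps must be long enough to detect.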

Gene Sequencing
- WGSS procedure continued
  - Assembly
    1. Combine all sequencing reads into contigs based on sequence similarity between reads.
    2. Idea: overlapping reads are presumed to come from the same area of the genome.

Gene Sequencing
- WGSS procedure continued
  - Assembly can be improved by knowing more about clone mates and their size distribution.
  - Finishing
    - Assemblers produce too many contigs in practice.
    - Finishing takes the contigs and yields a complete sequence.
    - A scaffolder orders contigs into scaffolds based on clone-mate pair information.

Gene Sequencing
- WGSS procedure continued
    - In each scaffold, the gaps are determined by the order of the contigs.
    - Sequence gaps: gaps between contigs in the same scaffold.
    - Physical gaps: gaps between scaffolds. These are difficult to fill and require complex lab techniques.

Gene Sequencing
- BAC-based shotgun sequencing
  - Advantage: less likely to make mistakes, because the location of each BAC is known and there are fewer pieces to assemble.
  - Disadvantage: it is computationally intensive.
- WGSS
  - Advantage: faster and less expensive.
  - Disadvantage: more prone to errors, because there are more fragments and they are more difficult to assemble correctly.

Gene Sequencing
- Assembly algorithms
  - Shotgun sequencing assembly problem
    - Find the shortest common superstring of a set of sequences: given strings {s1, s2, …}, find the shortest string T such that every si is a substring of T.
    - This problem is NP-hard.
    - An efficient approximation exists: the greedy algorithm.
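
The greedy approximation can be sketched as follows: repeatedly merge the two strings with the largest suffix-prefix overlap until one string remains. This is a minimal illustration assuming error-free reads on a single strand; the function names are made up for the example and do not come from any particular assembler.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads: list[str]) -> str:
    """Approximate a shortest common superstring by greedy merging."""
    # Drop duplicates and reads fully contained in another read.
    unique = list(dict.fromkeys(reads))
    frags = [r for r in unique if not any(r != s and r in s for s in unique)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Find the ordered pair (i, j) with the largest overlap.
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


if __name__ == "__main__":
    print(greedy_superstring(["ACTTA", "CTTAG", "TTAGC"]))
```

The result always contains every read as a substring, but it is not guaranteed to be the shortest such string, and repeats can mislead the merging order.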

Gene Sequencing
- Assembly algorithms
  - Shotgun sequencing assembly problem continued
    - Greedy algorithms were the first successful assembly algorithms implemented.
    - Used for organisms such as bacteria and single-celled eukaryotes.
    - Because of the greedy algorithm's limitations, two other approaches were developed.

Gene Sequencing
- Assembly algorithms
  - Overlap-layout-consensus
    - Algorithm based on graph theory
    - A graph is constructed:
      - nodes are reads
      - edges represent overlapping reads
    - A contig is a simple path in the graph.
    - A simple path contains each node at most once.

Gene Sequencing
- Assembly algorithms
  - Overlap-layout-consensus
    - An assembler builds the graph.
    - The output is a set of nonintersecting simple paths, each path being a contig.
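
A toy version of the overlap and layout steps might look like the sketch below. It assumes exact overlaps of at least `min_len` bases (a hypothetical parameter), picks non-intersecting simple paths greedily by overlap length, and simply concatenates reads along each path instead of computing a real consensus. It is a stand-in to show the graph idea, not how a production assembler works.

```python
def overlap_len(a: str, b: str, min_len: int) -> int:
    """Longest suffix of a matching a prefix of b, if at least min_len long."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads: list[str], min_len: int = 3) -> dict[int, dict[int, int]]:
    """Nodes are read indices; an edge i -> j stores the overlap length."""
    graph: dict[int, dict[int, int]] = {i: {} for i in range(len(reads))}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap_len(a, b, min_len)
                if k:
                    graph[i][j] = k
    return graph


def layout_contigs(reads: list[str], min_len: int = 3) -> list[str]:
    """Choose non-intersecting simple paths and spell out one contig per path."""
    graph = build_overlap_graph(reads, min_len)
    edges = sorted(
        ((k, i, j) for i, nbrs in graph.items() for j, k in nbrs.items()),
        key=lambda e: -e[0],
    )
    succ: dict[int, tuple[int, int]] = {}
    pred: dict[int, int] = {}
    for k, i, j in edges:
        if i in succ or j in pred:
            continue  # each read gets at most one successor and one predecessor
        node = j
        while node in succ:  # walk to the end of j's path
            node = succ[node][0]
        if node == i:
            continue  # adding i -> j would close a cycle
        succ[i] = (j, k)
        pred[j] = i
    contigs = []
    for start in range(len(reads)):
        if start in pred:
            continue  # not the first read of a path
        seq, node = reads[start], start
        while node in succ:
            node, k = succ[node]
            seq += reads[node][k:]
        contigs.append(seq)
    return contigs


if __name__ == "__main__":
    print(layout_contigs(["ACTTAG", "TAGGCA", "GCATTT", "CCCCGG", "CGGAAA"]))
```

Reads with no sufficiently long overlap come out as single-read contigs.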

Gene Sequencing
- Assembly algorithms
  - Eulerian path
    - Eulerian path: a path that uses every edge of a graph exactly once.
    - Reads are broken into overlapping n-mers.
    - Each n-mer is an edge: its source is the n-mer's prefix of length n-1 and its destination is its suffix of length n-1.
    - The basic problem is to find a path that uses all the edges.
    - In theory the Eulerian path approach is more efficient; in practice both approaches are equally fast.
    - Example: ACTTA and CTTAG represent ACTTAG.
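
The Eulerian formulation can be illustrated with a short sketch: build a graph with one edge per distinct n-mer, then find a path that uses every edge once with Hierholzer's algorithm (a standard choice here, not something the slides specify). The code assumes error-free reads and that an Eulerian path exists; the function names are illustrative.

```python
from collections import defaultdict


def de_bruijn_edges(reads: list[str], n: int) -> dict[str, list[str]]:
    """One edge per distinct n-mer, from its (n-1)-prefix to its (n-1)-suffix."""
    kmers = {r[i:i + n] for r in reads for i in range(len(r) - n + 1)}
    graph: dict[str, list[str]] = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph: dict[str, list[str]]) -> list[str]:
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg: dict[str, int] = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (u for u in sorted(nodes) if len(graph.get(u, [])) - in_deg[u] == 1),
        min(nodes),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads: list[str], n: int) -> str:
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn_edges(reads, n))
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    print(assemble(["ACTTA", "CTTAG"], 5))
```

With the slide's example, the two 5-mers ACTTA and CTTAG become the edges ACTT→CTTA and CTTA→TTAG, and the path spells ACTTAG.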

Gene Sequencing
- Repeats in the sequence
  - Assembly programs should detect repeats during the assembly process, not after.
    - Otherwise the genome is reconstructed incorrectly.
  - Assemblers should try to resolve as many repeats correctly as possible.
    - This avoids intensive human labor.

Gene Sequencing
- Detecting repeats
  - Statistical methods
    - Assemblers assume that reads are sampled uniformly at random.
    - Under this assumption, areas covered by an unusually large number of reads may indicate an over-collapsed repeat.
    - Problem: in practice, samples are not uniformly distributed.
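
As a rough illustration of the statistical idea, the sketch below flags contigs whose read density is well above the average density across all contigs. The input layout and the `ratio` threshold are assumptions made for this example; real assemblers use more careful statistics.

```python
def flag_possible_repeats(
    contigs: dict[str, tuple[int, int]], ratio: float = 1.8
) -> list[str]:
    """Return names of contigs with unusually high read density.

    contigs maps a contig name to (length in bases, number of reads).
    ratio is a hypothetical cutoff: flag a contig when its density
    exceeds ratio times the overall average density.
    """
    total_len = sum(length for length, _ in contigs.values())
    total_reads = sum(n_reads for _, n_reads in contigs.values())
    average_density = total_reads / total_len  # reads per base overall
    return [
        name
        for name, (length, n_reads) in contigs.items()
        if n_reads / length > ratio * average_density
    ]


if __name__ == "__main__":
    example = {"c1": (10000, 100), "c2": (8000, 80), "c3": (1000, 30)}
    print(flag_possible_repeats(example))
```

Because sampling is not truly uniform, a flagged contig is only a candidate for a collapsed repeat, not proof of one.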

Gene Sequencing
- Detecting repeats
  - Euler assembly program
    - Finds repeats as complex parts of the graph constructed during the assembly process.
    - Researchers examine these complex areas to try to resolve repeats.
    - Assemblers can use clone-mate information to find incorrect assemblies, by looking for clone-mate pairs that are too close to or too far from one another.
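
The clone-mate check can be sketched as a simple distance test. The example assumes each mate pair has already been placed in the assembly and that the insert-size mean and standard deviation of the clone library are known; the tuple layout and the `n_std` tolerance are hypothetical.

```python
def suspicious_mate_pairs(
    pairs: list[tuple[str, int, int]],
    mean_insert: float,
    std_insert: float,
    n_std: float = 3.0,
) -> list[str]:
    """Return names of mate pairs placed too close together or too far apart.

    pairs holds (name, position of first mate, position of second mate)
    in assembly coordinates.
    """
    low = mean_insert - n_std * std_insert
    high = mean_insert + n_std * std_insert
    return [name for name, a, b in pairs if not low <= abs(b - a) <= high]


if __name__ == "__main__":
    placed = [("p1", 100, 2100), ("p2", 500, 700), ("p3", 1000, 6000)]
    print(suspicious_mate_pairs(placed, mean_insert=2000, std_insert=200))
```

A cluster of flagged pairs in one region is a stronger hint of a misassembly than a single outlier.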

Gene Sequencing
- Detecting repeats
    - Assemblers can sometimes find differences between repeat copies that determine the correct sequence.
  - Techniques for repairing sequencing errors during repeat resolution
    - Find clusters of reads that share the same differences.
    - For example, if four reads contain an A at a position and four contain a B, it is likely that the first four reads come from one repeat copy and the last four from a different one.

Gene Sequencing
- Detecting repeats continued
    - Drawback: this fails if certain areas of the sequence have low coverage.
    - Differences between repeat copies are difficult to separate from true polymorphism.
  - Unresolved repeats
    - Directed sequencing experiments
    - TIGR Assembler

Gene Sequencing
- Scaffolding
  - Scaffolding groups contigs into subsets with known order and orientation.
  - Nodes are contigs.
  - A directed edge connects two nodes when mate pairs bridge the gap between them.
  - Mate pairs, if in different contigs, have a 1% chance of being neighbors.

Gene Sequencing
- Scaffolding continued
  - Three basic problems
    - Find all connected components.
    - Find a consistent orientation for all nodes in the graph.
      - Nodes have two types of edges: same orientation and different orientation (reversal).
      - A consistent orientation is possible only if every undirected cycle has an even number of reversal edges.
      - Optimization problem: find the smallest number of edges to remove so that no cycle has an odd number of reversal edges.
    - Fit the edges on a line so that the fewest constraints are violated (NP-complete).
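
The first two problems can be illustrated together with a breadth-first traversal: visit each connected component, propagate an orientation across its links, and report failure when a cycle with an odd number of reversal edges makes the links contradictory. This sketch only checks consistency; it does not attempt the harder optimization of choosing which edges to remove. The link format is an assumption for the example.

```python
from collections import defaultdict, deque
from typing import Optional


def orient_contigs(
    contigs: list[str], links: list[tuple[str, str, bool]]
) -> Optional[dict[str, int]]:
    """Assign +1 or -1 to each contig, or return None if links conflict.

    Each link is (contig_a, contig_b, reversal), where reversal=True
    means the two contigs must have different orientations.
    """
    adjacency: dict[str, list[tuple[str, bool]]] = defaultdict(list)
    for a, b, reversal in links:
        adjacency[a].append((b, reversal))
        adjacency[b].append((a, reversal))
    orientation: dict[str, int] = {}
    for start in contigs:  # one traversal per connected component
        if start in orientation:
            continue
        orientation[start] = 1
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, reversal in adjacency[u]:
                wanted = -orientation[u] if reversal else orientation[u]
                if v not in orientation:
                    orientation[v] = wanted
                    queue.append(v)
                elif orientation[v] != wanted:
                    return None  # a cycle with an odd number of reversal edges
    return orientation


if __name__ == "__main__":
    names = ["A", "B", "C"]
    print(orient_contigs(names, [("A", "B", True), ("B", "C", True), ("A", "C", False)]))
    print(orient_contigs(names, [("A", "B", True), ("B", "C", False), ("A", "C", False)]))
```

The first call has a cycle with two reversal edges and succeeds; the second has a cycle with one reversal edge and returns None.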

Gene Sequencing
- Scaffolding
  - Complex because of data errors.
  - The effect of errors can be reduced by simple heuristics, e.g. ignoring linking information in repeat areas.
  - Scaffolding orientation and order techniques:
    - Physical mapping
      - Uses markers along a DNA strand as independent information for scaffolding software.
      - Involves making large-scale maps of landmarks that lie along the chromosomal DNA.
      - Markers (tags) are known sequences of nucleotides.

Gene Sequencing
- Scaffolding continued
  - Tags are searched for in the contigs.
  - Good analogy:
    - Like taking copies of a map of a highway connecting Sydney and Melbourne, cutting them into many pieces, and then trying to reconstruct the original map from the fragments.
    - We find pieces that show cities, see where they overlap with pieces showing other cities, and from that information reconstruct the order.

Gene Sequencing
- Scaffolding continued
  - Sequences of closely related organisms are also used as scaffolding information.
  - Example: aligning scaffolds of a mouse genome to the human genome.
  - Issues with scaffolding techniques
    - Errors in the length of inserts (affecting distances between clone mates)
    - Physical mapping is error prone.
    - Bambus: a scaffolder that factors in the confidence of linking information

Gene Sequencing
- Scaffolding continued
    - Bambus first builds scaffolds from linking information with high confidence, then factors in linking information with lower confidence.
- Assessing assembly quality
    - Misassembly correction is expensive.
    - Some assemblers have a simple quality-control method that does not capture larger errors.
    - Assembly software can be tested on a known complete sequence (artificial or real).

Gene Sequencing
- Assessing assembly quality
  - Common measures of quality are:
    - number and sizes of contigs
    - Assumption: a few large contigs are better than many small contigs.
    - This holds in that the former has fewer gaps, but it does not account for the possibility of misassemblies.
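
The contig-based measures are easy to compute; a minimal sketch, with an illustrative function name and output layout, is shown below. As noted above, none of these numbers says anything about whether the contigs were assembled correctly.

```python
def contig_summary(contig_lengths: list[int]) -> dict[str, float]:
    """Summarize an assembly by the number and sizes of its contigs."""
    count = len(contig_lengths)
    total = sum(contig_lengths)
    return {
        "count": count,
        "total_bases": total,
        "largest": max(contig_lengths) if contig_lengths else 0,
        "mean_length": total / count if count else 0.0,
    }


if __name__ == "__main__":
    print(contig_summary([100, 300, 200]))
```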

Conclusion
- The goal is to complete the DNA sequence of an organism.
  - Assemblers can reduce human effort in the finishing phase.
  - Assemblers need better quality-control tools and measures.

References
- Mihai Pop, Steven L. Salzberg, Martin Shumway, "Genome Sequence Assembly: Algorithms and Issues," IEEE Computer, v35(7), 2002
- uencing.html
- d/shotgun.html
- enomeassembler.htm