Lecture 14 Genome sequencing projects


Bioinformatics, Lecture 14: Genome sequencing projects. Topics: hierarchical and shotgun approaches; genome assembly; TIGR Assembler; Ensembl.

Genome size. A mammalian genome is ~3 gigabases = 3 x 10^9 base pairs. How many books are needed to print the entire mammalian genome? 1,500 letters per page x 1,000 pages per book x 2,000 books. Assuming 5 cm per book, this shelf is ~100 meters long!
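The bookshelf estimate can be checked with a few lines of arithmetic. This is only a sketch that restates the slide's numbers; the constant names are made up for the illustration.

```python
# Back-of-the-envelope check of the bookshelf estimate.
# All figures come from the slide; the variable names are illustrative.
GENOME_BP = 3 * 10**9        # ~3 Gb mammalian genome
LETTERS_PER_PAGE = 1_500
PAGES_PER_BOOK = 1_000
BOOK_THICKNESS_CM = 5        # spine width assumed on the slide

letters_per_book = LETTERS_PER_PAGE * PAGES_PER_BOOK
books_needed = GENOME_BP // letters_per_book
shelf_length_m = books_needed * BOOK_THICKNESS_CM / 100

print(books_needed, "books,", shelf_length_m, "m of shelf")
```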

Genome sequencing: the problem. Sequencing read lengths vary with several parameters, but 600 to 800 nucleotides is a good estimate. To sequence much larger fragments, or even a whole genome, essentially two strategies have been designed.
a) The hierarchical approach. Depending on the vector used for cloning, BAC, YAC, cosmid and other libraries of cloned inserts are created. The size of an insert may vary from tens to hundreds of thousands of base pairs. Collections of sub-fragments obtained by restriction enzyme digestion are mapped to obtain a unique contig, from which a minimal set of sub-fragments can be selected and sequenced, thus limiting sequence redundancy.
b) The shotgun approach. This can be applied to a DNA sequence of any size, including a whole genome. DNA is randomly fragmented by sonication or shearing. Following fragmentation and enzymatic end repair, the DNA fragments are ligated into a plasmid vector and a bacterial host is transformed to produce a library. Clones taken at random from the library are then sequenced from both ends using two universal primers. At this stage a shotgun is characterised by its depth, i.e. the cumulative length of sequence determined divided by the length of the fragment or genome to be sequenced. For example, with an estimated size of 4 Mb, a 10X shotgun would correspond to the assembly of about 60,000 reads with a mean size of 650 nt. The resulting sequences are assembled into a unique contig representing the whole fragment by sequence comparison using appropriate bioinformatics programs. The final stage, or "polishing stage", corresponds to the elimination of gaps and other possible problems.
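The depth definition above (total sequenced bases divided by target length) can be turned into a small calculation. The sketch below reproduces the 4 Mb / 10X / 650 nt example; the function names are hypothetical, and the slide's "about 60,000 reads" is treated as a rounded figure.

```python
import math

def reads_for_depth(genome_bp, depth, mean_read_bp):
    """Reads needed so that total sequenced bases = depth * genome size."""
    return math.ceil(genome_bp * depth / mean_read_bp)

def depth_from_reads(n_reads, mean_read_bp, genome_bp):
    """Shotgun depth: cumulative read length divided by target length."""
    return n_reads * mean_read_bp / genome_bp

# 4 Mb target, 10X depth, 650 nt mean read length (numbers from the slide)
print(reads_for_depth(4_000_000, 10, 650))        # a little over 60,000 reads
print(depth_from_reads(60_000, 650, 4_000_000))   # depth reached by 60,000 reads
```

With exactly 60,000 reads the depth comes out slightly under 10X, which is consistent with the slide's rounded estimate.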

Shotgun approach

Genome assembly

Assembly of contiguous DNA sequences. Sequencing projects have rapidly moved to using the two approaches sequentially. For example, the construction of a BAC map covering an entire genome or chromosome is followed by a shotgun strategy to sequence a minimal set of BACs. The change introduced by C. Venter was in the size of the DNA fragment or genome that was shotgunned directly. Increasing the size of shotgun projects depended on the development of robots adapted to high-throughput work and of bioinformatics programs that solve two major problems. The first is quantitative: the capacity to store, compare and retrieve millions of reads corresponding to billions of nucleotides (the database problem). The second is the presence of numerous repeat sequences that are often longer than the mean read length, which complicates correct assembly (the assembly problem).

Fragment assembly problem. The Shortest Superstring Problem, while already a challenge, is a simplified abstraction: a real assembler must also deal with three other difficulties.
1. Sequence data are not perfect, and erroneous reads are possible.
2. Numerous repeats are present. There are about a million copies of the ~300 bp Alu repeat and many other repeats. Fortunately, some copies of a repeat differ slightly because of mutation.
3. As DNA is double-stranded, the orientation of each substring is unknown, and it is not known which strand should be used in the reconstruction.
Most fragment assembly algorithms include the following three steps:
Overlap. The problem is to find the best match between the suffix of one sequence and the prefix of another. The difficulties above force the use of a variant of the dynamic programming algorithm together with filtration methods.
Layout. This is the hardest step in DNA assembly, and it becomes even more computationally demanding as the number of fragments increases. The most difficult part is deciding whether two fragments with a good overlap really overlap or instead represent a repeat or something else.
Consensus. This step finds the most frequent character at each position of the layout constructed in the previous step. More sophisticated algorithms align substrings in small windows along the layout, or use a mosaic of the best (highest probabilistic score) segments from the layout.
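The overlap, layout and consensus steps can be sketched on a toy example. This is a deliberately simplified illustration, not the method of any particular assembler: overlaps are exact (no sequencing errors), the layout is a greedy shortest-superstring heuristic, all reads are assumed to come from the same strand, and the consensus input is assumed to be already aligned with '-' marking uncovered positions. All function names are hypothetical.

```python
from collections import Counter
from itertools import permutations

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Opposite-strand version of a read (needed when orientation is unknown)."""
    return seq.translate(COMPLEMENT)[::-1]

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Greedy layout: repeatedly merge the pair with the largest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, j in permutations(range(len(frags)), 2):
            k = overlap(frags[i], frags[j], min_len)
            if k > best_k:
                best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags

def consensus(aligned_reads):
    """Majority vote per column; columns covered by no read become 'N'."""
    out = []
    for column in zip(*aligned_reads):
        bases = [c for c in column if c != "-"]
        out.append(Counter(bases).most_common(1)[0][0] if bases else "N")
    return "".join(out)

reads = ["ACGATTACA", "TTACAATAG", "CAATAGGTT"]
print(greedy_assemble(reads))
print(consensus(["ACGT-", "ACCTA", "-CGTA"]))
```

A repeat longer than the reads would give the greedy step several equally good overlaps to choose from, which is exactly the layout difficulty described above.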

Genome assembly from smaller sequence fragments

TIGR Assembler. TIGR Assembler is open-source software. It is a sequence fragment assembly program that builds contigs from small sequence reads. It is versatile, offering a wide variety of options for tuning the assembly process and analyzing sequence data. The current assembly engine uses a greedy algorithm and heuristics to build contigs, find repeat regions, and target alignment regions. Sequence overlaps are detected and scored using a 32-mer hash. Sequence alignment and merging are done using a Smith-Waterman dynamic programming algorithm. Gap penalties and score values corresponding to the bases and their quality values are predefined and hard-coded into the program.
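The combination of a k-mer hash for finding candidate overlaps and Smith-Waterman for scoring them can be illustrated as follows. This is not TIGR Assembler's code: the function names, the small k used in the example (the slide mentions 32-mers) and the match, mismatch and gap values are all assumptions chosen for the sketch, and base quality values are ignored.

```python
from collections import defaultdict

def kmer_index(reads, k):
    """Map each k-mer to the set of read ids that contain it."""
    index = defaultdict(set)
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)
    return index

def candidate_pairs(reads, k):
    """Pairs of reads sharing at least one k-mer; only these get aligned."""
    pairs = set()
    for rids in kmer_index(reads, k).values():
        ordered = sorted(rids)
        for a in range(len(ordered)):
            for b in range(a + 1, len(ordered)):
                pairs.add((ordered[a], ordered[b]))
    return pairs

def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (assumed values)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

reads = ["ACGATTACA", "TTACAATAG", "GGGGCCCC"]
for i, j in sorted(candidate_pairs(reads, k=4)):
    print(i, j, smith_waterman_score(reads[i], reads[j]))
```

The hash step acts as a filter: the quadratic-time alignment is run only on read pairs that share a k-mer, rather than on every pair.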

Genome assembly: contig and supercontig alignment. It is very difficult to produce a finished continuous sequence given the level of redundancy typical of many higher eukaryotes. Instead, a draft sequence of about 150,000 contigs will be generated that could be combined to give a few thousand supercontigs. The production, in parallel, of a dense radiation hybrid (RH) map will not only facilitate the assembly of the contigs into supercontigs, but will also make it possible to order the supercontigs, a necessary step for understanding genome rearrangements and synteny.

Figure: integrated maps of chromosome CFA5 (99 Mb): cytogenetic map, RH map (650.2 cR5000), meiotic map (85 cM) and FISH-mapped markers, with the corresponding human (HSA) regions 11q23, 1p32, 16q24 and 11q22.

Mouse Genome: sequencing and assembly. The mouse genome is about 14% smaller than the human genome (2.5 Gb compared with 2.9 Gb), probably due to a higher rate of deletions. Over 90% of the mouse and human genomes can be partitioned into corresponding regions of conserved synteny. The sequencing strategy included four approaches: 1) construction of a BAC-based physical map by fingerprinting the clones and sequencing their ends, 2) whole-genome shotgun (WGS) sequencing to ~7-fold coverage and assembly to generate an initial draft, 3) hierarchical shotgun sequencing of BAC clones combined with WGS to create a hybrid WGS-BAC assembly, 4) production of finished sequence by using the BAC clones as templates for directed finishing. About 41 million reads were generated by the project participants, of which 33.6 million passed quality checks and 29.7 million were paired (opposite ends of the same clone). Clone inserts provide ~47-fold physical coverage of the genome. Genome assembly was achieved using two newly developed programs, Arachne and Phusion. The assembly contains 224,713 contigs, connected into 7,418 supercontigs. The 200 largest supercontigs span more than 98% of the assembled sequence, of which 3% is within sequence gaps.

Ensembl: An Open-Source Tool. Ensembl consists of two main parts: 1) The analysis pipeline, which regularly adds new data and analyses to the core database. The database contains DNA sequences, predicted features on the sequences and a complete body of evidence supporting these predictions. Ensembl known genes are therefore those predicted genes that have high similarity to genes confirmed by experimental evidence. 2) The API (application programming interface), which gives structured access to the data. The ease of retrieving information in a meaningful form makes the API an extremely powerful tool. The initial implementation of the API is in Perl, built upon a layer of BioPerl objects. Implementations in other languages, such as Java and Python, are also in use. Ensembl is based around two ideas: a golden path (the pathway through the data containing nonredundant sequence) and a virtual contig (a contig determined by the user, i.e. an arbitrary region of a chromosome). The NCBI and UCSC web sites contain systems similar to Ensembl.
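The relationship between a golden path and a virtual contig can be illustrated with a small data-structure sketch. This is not the Ensembl API: the `PathSegment` class, the `virtual_contig` function and the coordinates are hypothetical, and the golden path is assumed to be a list of non-overlapping contig pieces placed on a chromosome, with gaps reported as 'N'.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PathSegment:
    """One piece of a golden path: contig sequence placed on a chromosome."""
    contig_id: str
    chr_start: int   # 1-based, inclusive chromosome coordinate
    chr_end: int
    sequence: str    # contig bases used for this stretch of the path

def virtual_contig(golden_path, start, end):
    """Sequence of chromosome region [start, end] (1-based, inclusive).

    The region is stitched together from whichever golden-path segments
    it touches; stretches covered by no segment are filled with 'N'.
    """
    pieces = []
    cursor = start
    for seg in sorted(golden_path, key=lambda s: s.chr_start):
        if seg.chr_end < cursor or seg.chr_start > end:
            continue  # segment lies entirely outside the remaining region
        if seg.chr_start > cursor:
            pieces.append("N" * (seg.chr_start - cursor))  # gap in the path
            cursor = seg.chr_start
        stop = min(seg.chr_end, end)
        offset = cursor - seg.chr_start
        pieces.append(seg.sequence[offset:offset + (stop - cursor + 1)])
        cursor = stop + 1
    if cursor <= end:
        pieces.append("N" * (end - cursor + 1))  # region runs past the path
    return "".join(pieces)

path = [
    PathSegment("contig_1", 1, 5, "ACGTA"),
    PathSegment("contig_2", 8, 12, "GGCCT"),
]
print(virtual_contig(path, 3, 10))
```

The point of the abstraction is that the user asks for a chromosome region and never needs to know which underlying contigs supply the bases.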