Lecture 14 Genome sequencing projects

1 Lecture 14 Genome sequencing projects
Bioinformatics, Lecture 14: Genome sequencing projects. Topics: hierarchical and shotgun approaches; genome assembly; TIGR Assembler; Ensembl.

2 Genome size
A mammalian genome is ~3 gigabases = 3 × 10^9 base pairs. How many books are needed to print the entire mammalian genome? 1,500 letters per page × 1,000 pages per book × 2,000 books. Assuming 5 cm per book, the shelf is ~100 meters long!
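The book estimate can be checked with a few lines of arithmetic. This is only a sketch using the slide's own figures (3 × 10^9 base pairs, 1,500 letters per page, 1,000 pages per book, 5 cm per book); the variable names are illustrative.

```python
# Figures taken from the slide; one printed letter per base pair.
GENOME_BP = 3 * 10**9
LETTERS_PER_PAGE = 1_500
PAGES_PER_BOOK = 1_000
BOOK_THICKNESS_CM = 5

letters_per_book = LETTERS_PER_PAGE * PAGES_PER_BOOK   # 1.5 million letters
books_needed = GENOME_BP // letters_per_book            # whole books
shelf_length_m = books_needed * BOOK_THICKNESS_CM / 100  # cm -> m

print(books_needed, "books,", shelf_length_m, "m of shelf")
```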

3 Genome sequencing: the problem
Sequencing read lengths vary with several parameters, but 600 to 800 nucleotides is a good estimate. To sequence much larger fragments, or even a whole genome, essentially two strategies have been designed.
a) The hierarchical approach. Depending on the vector used for cloning, BAC, YAC, cosmid and other libraries of cloned contigs are usually created. The size of the insert/contig may vary from tens to hundreds of thousands of base pairs. Collections of sub-fragments obtained by restriction digestion are mapped to build a unique contig, from which a minimal set of sub-fragments can be selected and sequenced, thus limiting sequence redundancy.
b) The shotgun approach. This can be applied to a DNA sequence of any size, including a whole genome. DNA is randomly fragmented by sonication or shearing. Following fragmentation and enzymatic end repair, the DNA fragments are ligated into a plasmid vector and a bacterial host is transformed to produce a library. Clones taken at random from the library are then sequenced from both ends using two universal primers. At this stage a shotgun is characterised by its depth, i.e. the cumulative length of sequence determined divided by the length of the fragment or genome to be sequenced. For example, for an estimated size of 4 Mb, a 10X shotgun corresponds to the assembly of about 60,000 reads with a mean size of 650 nt. The resulting sequences are assembled by sequence comparison, using appropriate bioinformatic programs, into a unique contig representing the whole fragment. The final stage, or "polishing stage", corresponds to the elimination of gaps and other possible problems.
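The depth definition above (cumulative sequence length divided by target length) can be turned into a small calculator. The sketch below uses hypothetical function names and reproduces the slide's example of a 4 Mb target at 10X with 650 nt reads.

```python
import math

def reads_needed(target_bp, depth, mean_read_bp):
    """Number of reads required to reach a given shotgun depth."""
    return math.ceil(target_bp * depth / mean_read_bp)

def shotgun_depth(n_reads, mean_read_bp, target_bp):
    """Depth = cumulative length of sequence determined / target length."""
    return n_reads * mean_read_bp / target_bp

# Slide example: 4 Mb target, 10X depth, 650 nt mean read length.
n = reads_needed(4_000_000, 10, 650)
d = shotgun_depth(60_000, 650, 4_000_000)
print(n, d)
```

The exact count is slightly above the rounded 60,000 quoted on the slide, and 60,000 reads of 650 nt give a depth just under 10X.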

4 Shotgun approach

5 Genome assembly

6 Assembly of contiguous DNA sequences
Sequencing projects have rapidly moved to using the two approaches sequentially. For example, the construction of a BAC map covering an entire genome or chromosome is followed by a shotgun strategy to sequence a minimal set of BACs. The change introduced by J. Craig Venter was the size of the DNA fragment or genome that was shotgunned directly. Increasing the size of shotgun projects depended on the development of robots adapted to high-throughput projects and of bioinformatic programs that solve two major problems. The first is quantitative: the capacity to store, compare and retrieve millions of reads corresponding to billions of nucleotides (the database problem). The second is the presence of numerous repeat sequences that are often longer than the mean read length, which complicates correct assembly (the assembly problem).

7 Fragment assembly problem
The Shortest Superstring Problem, while challenging, is a simplified abstraction of fragment assembly, which must also take into account three other difficulties:
1. Sequence data are not perfect, and erroneous reads are possible.
2. Numerous repeats are present. There are about a million copies of the ~300 bp Alu repeat, and many other repeats. Fortunately, some repeat copies differ slightly because of mutation.
3. As DNA is double-stranded, the orientation of the substrings is unknown: it is not known which strand should be used in the reconstruction.
Most fragment assembly algorithms include the following three steps:
Overlap. The problem is to find the best match between the suffix of one sequence and the prefix of another. The difficulties above force the use of variations of the dynamic programming algorithm together with filtration methods.
Layout. This is the hardest step in DNA assembly, and it becomes even more computationally demanding as the number of fragments increases. The most difficult part is deciding whether two fragments with a good overlap really overlap or instead represent a repeat or something else.
Consensus. This step finds the most frequent character at each position of the substring layout constructed in the layout step. More sophisticated algorithms align substrings in small windows along the layout, or use a mosaic of the best (highest probabilistic score) segments from the layout.
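The three steps can be illustrated with a deliberately simplified toy. It assumes error-free reads, exact suffix-prefix matches, no repeats and a single known strand, which are exactly the difficulties a real assembler has to handle. All function names are illustrative, and the greedy merge stands in for the layout step only as a sketch.

```python
from collections import Counter
from itertools import permutations

def reverse_complement(seq):
    """Other strand of a read; a real assembler would try both orientations."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def overlap(a, b, min_len=3):
    """Overlap step: longest suffix of a that exactly equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Layout step (toy): repeatedly merge the pair with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, j in permutations(range(len(reads)), 2):
            k = overlap(reads[i], reads[j], min_len)
            if k > best_k:
                best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: several contigs are left
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

def consensus(rows):
    """Consensus step: most frequent base per column; '-' marks no coverage.

    Assumes equal-length rows and at least one base in every column.
    """
    out = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != "-")
        out.append(counts.most_common(1)[0][0])
    return "".join(out)

contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
layout = ["ATGGCGT---",
          "--GGCGTACC",
          "--GACGTAC-"]  # third read carries one sequencing error
print(contigs, consensus(layout))
```

In the layout example, the single erroneous base in the third read is outvoted by the two reads that agree, which is the point of the majority-vote consensus.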

8 Genome assembly from smaller sequence fragments

9 TIGR Assembler
TIGR Assembler is open-source software. It is a sequence fragment assembly program that builds contigs from small sequence reads. It is versatile, offering a wide variety of options for tuning the assembly process and analyzing sequence data. The current assembly engine uses a greedy algorithm and heuristics to build contigs, find repeat regions, and target alignment regions. Sequence overlaps are detected and scored using a 32-mer hash. Sequence alignment and merging are done using a Smith-Waterman dynamic programming algorithm. Gap penalties and score values corresponding to the bases and their quality values are predefined and hard-coded into the program.
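The two ideas named above, a k-mer hash to find candidate overlaps and Smith-Waterman to score them, can be sketched as follows. This is not TIGR Assembler's code. The function names are hypothetical, the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration, and k is set to 4 rather than 32 so that the toy reads are long enough to share k-mers.

```python
from collections import defaultdict

def kmer_index(reads, k):
    """Hash every k-mer to the set of read ids that contain it."""
    index = defaultdict(set)
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)
    return index

def candidate_pairs(reads, k):
    """Count distinct shared k-mers per read pair (a cheap overlap filter)."""
    pairs = defaultdict(int)
    for rids in kmer_index(reads, k).values():
        ordered = sorted(rids)
        for a in range(len(ordered)):
            for b in range(a + 1, len(ordered)):
                pairs[(ordered[a], ordered[b])] += 1
    return dict(pairs)

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, computed row by row (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

reads = ["ATGGCGTACC", "GCGTACCTGA", "TTTTTTTTTT"]
pairs = candidate_pairs(reads, k=4)
scores = {p: smith_waterman_score(reads[p[0]], reads[p[1]]) for p in pairs}
print(pairs, scores)
```

Only pairs that share at least one k-mer reach the quadratic-time alignment, which is why a hash filter of this kind matters when there are millions of reads.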

12 Genome assembly: contig and supercontig alignment
It is very difficult to produce a finished continuous sequence when the genome has the level of redundancy typical of many higher eukaryotes. Instead, a draft sequence of about 150,000 contigs will be generated, which could be combined to give a few thousand supercontigs. The parallel production of a dense RH map will not only facilitate the assembly of the contigs into supercontigs, but will also make it possible to order the supercontigs, a necessary step for understanding genome rearrangements and synteny.

13 CFA5 (99 Mb)
[Figure: integrated map of canine chromosome 5 (CFA5, 99 Mb) combining a cytogenetic/FISH map, an RH map (650.2 cR5000) and a meiotic map (85 cM). Markers shown include AHTH68Ren, AHTH201Ren, AHTH248, AHTK315, THY1, CD3E, SLC2A4, DIO1 and MSHR, with corresponding human (HSA) regions 11q23, 11q22, 1p32 and 16q24.]

14 Mouse Genome: sequencing and assembly
The mouse genome is about 14% smaller than the human genome (2.5 Gb compared with 2.9 Gb), probably due to a higher rate of deletions. Over 90% of the mouse and human genomes can be partitioned into corresponding regions of conserved synteny.
The sequencing strategy included four approaches:
1) construction of a BAC-based physical map by fingerprinting the clones and sequencing their ends;
2) whole-genome shotgun (WGS) sequencing to ~7-fold coverage and assembly to generate an initial draft;
3) hierarchical shotgun sequencing of BAC clones, combined with the WGS data to create a hybrid WGS-BAC assembly;
4) production of finished sequence using the BAC clones as templates for direct finishing.
About 41 million reads were generated by the project participants, of which 33.6 million passed quality checks and 29.7 million were paired (opposite ends of the same clone). Clone inserts provide ~47-fold physical coverage of the genome. Genome assembly was achieved using two newly developed programs, Arachne and Phusion. The assembly contains 224,713 contigs connected into 7,418 supercontigs. The 200 largest supercontigs span more than 98% of the assembled sequence, of which 3% is within sequence gaps.
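A few ratios follow directly from the figures quoted on this slide. The sketch below only re-derives them from those figures; the variable names are illustrative and nothing here comes from the assembly itself.

```python
# All inputs are the numbers quoted on the slide.
MOUSE_GB, HUMAN_GB = 2.5, 2.9
READS_TOTAL_M, READS_PASSED_M, READS_PAIRED_M = 41.0, 33.6, 29.7
CONTIGS, SUPERCONTIGS = 224_713, 7_418

size_reduction_pct = (1 - MOUSE_GB / HUMAN_GB) * 100       # mouse vs human
pass_rate_pct = READS_PASSED_M / READS_TOTAL_M * 100        # quality filter
paired_pct = READS_PAIRED_M / READS_PASSED_M * 100          # of passing reads
contigs_per_supercontig = CONTIGS / SUPERCONTIGS            # mean linkage

print(round(size_reduction_pct, 1), round(pass_rate_pct, 1),
      round(paired_pct, 1), round(contigs_per_supercontig, 1))
```

The first ratio confirms the "about 14% smaller" statement, and the last shows that a supercontig links roughly 30 contigs on average.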

15 Ensembl: An Open-Source Tool
Ensembl consists of two main parts:
1) The analysis pipeline, which regularly adds new data and analyses to the core database. The database contains DNA sequences, predicted features on the sequences and a complete body of evidence supporting these predictions. Ensembl known genes are therefore those predicted genes that have high similarity to genes confirmed by experimental evidence.
2) The API (application programming interface), which gives structured access to the data. The ease of retrieving information in a meaningful form makes the API an extremely powerful tool. The initial implementation of the API is in Perl, built upon a layer of BioPerl objects. Implementations in other languages, such as Java and Python, are also in use.
Ensembl is based around two ideas: a golden path (the path through the data containing nonredundant sequence) and a virtual contig (a contig determined by the user, an arbitrary region of a chromosome). The NCBI and UCSC web sites contain systems similar to Ensembl.
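The golden path and virtual contig ideas can be pictured with a toy data structure. This is not the Ensembl API: the class, the function and the coordinate conventions below are hypothetical, and the sketch assumes the golden path pieces do not overlap on the chromosome. It stitches an arbitrary chromosome region from the pieces and fills uncovered positions with N.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenPathPiece:
    chrom_start: int    # 1-based, inclusive position on the chromosome
    chrom_end: int      # 1-based, inclusive
    contig_name: str    # which raw contig supplies the sequence
    contig_offset: int  # 0-based index in the contig where the piece begins

def virtual_contig(path, contigs, start, end):
    """Sequence of chromosome region start..end (1-based, inclusive)."""
    out = []
    pos = start
    for piece in sorted(path, key=lambda p: p.chrom_start):
        if piece.chrom_end < pos or piece.chrom_start > end:
            continue  # piece lies entirely outside the requested region
        if piece.chrom_start > pos:
            out.append("N" * (piece.chrom_start - pos))  # gap in the path
            pos = piece.chrom_start
        stop = min(piece.chrom_end, end)
        lo = piece.contig_offset + (pos - piece.chrom_start)
        hi = piece.contig_offset + (stop - piece.chrom_start) + 1
        out.append(contigs[piece.contig_name][lo:hi])
        pos = stop + 1
    if pos <= end:
        out.append("N" * (end - pos + 1))  # region runs past the last piece
    return "".join(out)

contigs = {"c1": "AAAACCCC", "c2": "GGGGTTTT"}
path = [GoldenPathPiece(1, 8, "c1", 0), GoldenPathPiece(11, 18, "c2", 0)]
region = virtual_contig(path, contigs, 5, 14)
print(region)
```

The caller asks for chromosome coordinates only and never needs to know which underlying contigs supply the sequence, which is the convenience the virtual contig idea is meant to provide.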
