Genome sequencing
Vocabulary
BAC (bacterial artificial chromosome): cloning vector for E. coli
PAC, cosmid, fosmid, plasmid: cloning vectors for E. coli
Library: collection of fragments of a genome in cloning vectors
Draft: crude first-generation sequence assembly
Scaffold: sequences that are anchored to a genetic map
Vocabulary 2
Minimal tiling path: minimal set of overlapping clones that together provide complete coverage across a genomic region
Coverage: the number of times a genomic region is represented in a collection of clones or sequence reads
Contig: alignment of overlapping reads
N50 length: the largest length L such that 50% of all nucleotides are contained in contigs of size at least L
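The coverage definition above can be made concrete with a small sketch. It assumes that reads (or clones) have already been placed on a region as half-open intervals `(start, end)` with `0 <= start <= end <= region_length`; the function names `mean_coverage` and `depth_per_position` are illustrative and do not come from any particular tool.

```python
def mean_coverage(intervals, region_length):
    """Average number of times each base of the region is represented.

    intervals: list of (start, end) half-open positions of reads or clones.
    """
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    total_bases = sum(end - start for start, end in intervals)
    return total_bases / region_length


def depth_per_position(intervals, region_length):
    """Coverage at every single position, using a difference array."""
    diff = [0] * (region_length + 1)
    for start, end in intervals:
        diff[start] += 1   # an interval begins covering here
        diff[end] -= 1     # and stops covering here (end is exclusive)
    depth = []
    running = 0
    for change in diff[:-1]:
        running += change
        depth.append(running)
    return depth


if __name__ == "__main__":
    reads = [(0, 4), (2, 6)]
    print(mean_coverage(reads, 6))
    print(depth_per_position(reads, 6))
```

Positions with a depth of 0 in the second function correspond to gaps, which is why mean coverage alone does not guarantee that a region is completely represented.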
BAC by BAC versus whole genome shotgun
BAC-by-BAC sequencing (slow)
Minimal tiling path
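Selecting a minimal tiling path, as defined in the vocabulary, can be sketched as a greedy interval-cover problem. The sketch assumes that each clone's position on the region is already known (in practice it would come from a physical map) and is given as a hypothetical tuple `(name, start, end)` with a half-open interval; the function name `minimal_tiling_path` is illustrative only. Clones that merely touch end to start are accepted here; a real project would typically require a minimum overlap.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Pick the fewest clones that together cover [region_start, region_end).

    clones: list of (name, start, end) tuples, end exclusive.
    Raises ValueError if the clones leave a gap in the region.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])
    path = []
    covered = region_start   # everything left of this position is covered
    i = 0
    while covered < region_end:
        best = None
        # Among clones starting at or before the covered frontier,
        # keep the one that reaches furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best)
        covered = best[2]
    return path


if __name__ == "__main__":
    library = [
        ("A", 0, 100),
        ("B", 50, 120),
        ("C", 90, 200),
        ("D", 150, 260),
        ("E", 190, 300),
    ]
    print([name for name, _, _ in minimal_tiling_path(library, 0, 300)])
```

The greedy choice (always extend as far as possible) yields a cover with the smallest number of clones, which is what keeps the amount of redundant sequencing low in a clone-by-clone strategy.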
Whole genome shotgun sequencing and assembly (WGSA)
Hybrid shotgun sequencing
N50
Order contigs according to size, from largest to smallest
Compute cumulative size
N50 = contig size (sequence length) which marks 50% of genome content
[Figure: cumulative contig content (% of genome) versus contig size (kb)]
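The N50 procedure above (sort, accumulate, stop at 50%) can be sketched in a few lines. The function name `n50` and the optional `total` parameter are assumptions made for illustration: by default the 50% mark is taken relative to the summed contig length, and passing an estimated genome size as `total` instead measures it against the genome.

```python
def n50(contig_lengths, total=None):
    """Return the contig length at which the cumulative size reaches 50%.

    contig_lengths: iterable of contig sizes (any consistent unit, e.g. kb).
    total: size to take 50% of; defaults to the sum of all contig lengths.
    Returns 0 if the contigs never reach half of `total`.
    """
    lengths = sorted(contig_lengths, reverse=True)  # largest contig first
    if total is None:
        total = sum(lengths)
    running = 0
    for length in lengths:
        running += length            # cumulative size so far
        if 2 * running >= total:     # reached 50% of the total
            return length
    return 0


if __name__ == "__main__":
    contigs_kb = [80, 70, 50, 40, 30, 20, 10]
    print(n50(contigs_kb))
```

In the usage example the contigs sum to 300 kb; the two largest (80 + 70 = 150 kb) already hold half of that, so the function returns 70.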
Human genome
2001: 2 draft sequences published
Public BAC-by-BAC sequence
Celera's WGSA
– 90% of euchromatic sequence
– gaps
– N50: 81 kb
– Error rate: 1:
Finished public sequence
– 99% of euchromatic sequence
– 341 gaps
– N50: kb
– Error rate: 1:
The problem with complex genomes
– Gaps
– Orientation of contigs not known
– Near-identical repeats hard to resolve
Finishing the sequence
[Figure labels: gap, draft sequence]
Resolving repeats
Detecting and resolving repeats in WGSA
Clone orientation
Segmental duplications / gaps
Blue: duplications of size > 10 kb
Red: gaps of size > 300 kb