Introduction to Sequencing (BI420 – Introduction to Bioinformatics)
The nuclear genome (chromosomes)
The genome sequence is the primary template on which to outline the functional features of our genetic code (genes, regulatory elements, secondary structure, tertiary structure, etc.).
Completed genomes: human and mouse, ~3,000 Mb each; D. melanogaster, ~120 Mb; C. elegans, ~100 Mb; E. coli, ~5 Mb.
Main genome sequencing strategies: (1) whole-genome shotgun sequencing (Celera Genomics, Inc.); (2) clone-based shotgun sequencing (Human Genome Project).
Hierarchical genome sequencing (Lander et al., Nature 2001): (1) BAC (bacterial artificial chromosome) library construction; (2) clone mapping; (3) shotgun subclone library construction; (4) sequencing and read processing; (5) sequence reconstruction (sequence assembly).
BACs: The genome is broken down into chunks 100-200 kb long, and each chunk is inserted into a BAC (bacterial artificial chromosome). The BACs are then grown in bacterial cells to produce a large number of copies of each 100-200 kb chunk. The original location of each BAC insert in the human genome is “fingerprinted” based on the lengths of the fragments generated when restriction enzymes are applied.
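Clone mapping (“sequence-ready” map): Restriction enzymes (e.g., A, B, C, D, E) are applied to the chunk, generating a set of fragments of particular lengths. These lengths provide a “fingerprint” for where the chunk came from in the genome. Clones that share many fragment lengths are likely to overlap, which allows the clones to be ordered along the genome.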
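A minimal sketch of the fingerprinting idea, assuming a linear toy sequence and a single enzyme. The function name `fragment_lengths` and the toy clone are made up for illustration; the only real-world fact used is that EcoRI recognizes GAATTC and cuts between the G and the first A. Real clone mapping uses several enzymes and gel-estimated lengths rather than exact ones.

```python
def fragment_lengths(sequence: str, site: str, cut_offset: int) -> list[int]:
    """Return the sorted fragment lengths from a complete digest of a linear sequence.

    `cut_offset` is the number of bases into the recognition site where the cut falls.
    """
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    # Fragment boundaries are the sequence ends plus every cut position.
    boundaries = [0] + cut_positions + [len(sequence)]
    return sorted(right - left for left, right in zip(boundaries, boundaries[1:]))


# Hypothetical toy clone containing two EcoRI sites.
clone = "AAAA" + "GAATTC" + "CCCCCCC" + "GAATTC" + "TT"
fingerprint = fragment_lengths(clone, "GAATTC", 1)
print(fingerprint)  # [5, 7, 13]
```

The sorted list of lengths plays the role of the fingerprint: two clones whose lists share many values are candidates for overlapping regions.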
Shotgun subclone library construction: The BAC insert is fragmented into short pieces, each piece is subcloned into a sequencing vector, and the subclones are sequenced to produce reads. [Figure labels: cloning vector, BAC primary clone, subclone insert, sequencing vector]
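The random nature of shotgun fragmentation can be sketched as sampling fixed-length substrings at random positions. This is only an illustration: the function `shotgun_reads`, the toy insert, and the fixed read length are assumptions, and real fragments vary in length and carry sequencing errors.

```python
import random


def shotgun_reads(sequence: str, read_length: int, n_reads: int, seed: int = 0) -> list[str]:
    """Sample `n_reads` substrings of `read_length` from random start positions."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    last_start = len(sequence) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [sequence[start:start + read_length] for start in starts]


# Hypothetical toy BAC insert (real inserts are 100-200 kb).
bac_insert = "ACGTTGCAAGGCTTAACCGGTTACGATCGATCGGCTA"
reads = shotgun_reads(bac_insert, read_length=10, n_reads=8)
for read in reads:
    print(read)
```

Because start positions are random, some regions are sampled several times and others not at all, which is why many more reads than the minimum are sequenced.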
Traditional gel-based sequencing: Polymerize off a single-stranded template in the presence of some radioactive dideoxy nucleotides, which cap (terminate) the growing DNA strand and leave a signal on film due to radioactivity. Then run the products on a gel, which separates the segments by length. Repeat four times, once each with dideoxy A, C, G, and T.
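How the four reactions combine into one sequence can be sketched as follows. This is an idealized model, assuming every possible termination product is present and perfectly resolved; the function names and the toy strand are hypothetical.

```python
def termination_lengths(new_strand: str) -> dict[str, list[int]]:
    """For each dideoxy base, list the lengths of strands that terminate at that base."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read the sequence by visiting the bands of all four lanes from shortest to longest."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in bands)


lanes = termination_lengths("GATTACA")
print(lanes["A"])       # [2, 5, 7]
print(read_gel(lanes))  # GATTACA
```

Each lane only says where one base occurs; ordering the bands from all four lanes by length recovers the full sequence.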
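Sequencing using fluorescence: Use dideoxy nucleotides that fluoresce under UV light with a different color for each of A, C, G, and T. Because the colors distinguish the bases, all four can be combined in a single reaction. Instead of a gel, perform electrophoresis in a capillary tube.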
Robotic automation Lander et al. Nature 2001
Base calling (PHRED): Software such as PHRED is used to interpret the chromatogram and generate base calls. Example from the figure: base = A, Q = 40.
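The quality value Q attached to each base call is conventionally defined as Q = -10 log10(P), where P is the estimated probability that the call is wrong. A short sketch of the conversion in both directions (the function names are made up for illustration):

```python
import math


def error_probability(quality: float) -> float:
    """Convert a Phred-style quality value to an error probability."""
    return 10 ** (-quality / 10)


def phred_quality(p_error: float) -> float:
    """Convert an error probability to a Phred-style quality value."""
    return -10 * math.log10(p_error)


for q in (10, 20, 30, 40):
    print(f"Q = {q}: about 1 error in {round(1 / error_probability(q))} calls")
```

Under this definition, Q = 40 corresponds to an error probability of about 1 in 10,000.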
Vector clipping: Because each sequenced insert was cloned in a vector, the ends of a read may contain flanking vector sequence (pink in the figure). This sequence should be removed.
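A minimal sketch of clipping, assuming the vector sequence flanking the insert is known and appears in the read without errors. The function `clip_vector` and the two flank sequences are hypothetical; real tools match the vector by alignment so that sequencing errors are tolerated.

```python
def clip_vector(read: str, vector_left: str, vector_right: str) -> str:
    """Return the part of `read` between the left and right vector flanks.

    A flank that is not found is ignored, so a read without vector is returned unchanged.
    """
    start, end = 0, len(read)
    left_hit = read.find(vector_left)
    if left_hit != -1:
        start = left_hit + len(vector_left)
    right_hit = read.rfind(vector_right)
    if right_hit != -1 and right_hit >= start:
        end = right_hit
    return read[start:end]


# Hypothetical flanks, chosen only for illustration.
raw_read = "GGATCC" + "ACGTACGT" + "CTGCAG"
print(clip_vector(raw_read, "GGATCC", "CTGCAG"))  # ACGTACGT
```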
Sequence assembly (PHRAP): Software such as PHRAP is used to find reads that overlap one another and merge them into longer contiguous sequences (contigs).
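The overlap step can be illustrated with exact suffix-prefix matching. This is a simplified stand-in, not how PHRAP itself works: real assemblers use error-tolerant alignment and base quality. The function names and the `min_overlap` parameter are assumptions for this sketch.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right` (0 if too short)."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two overlapping reads into one sequence, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    return left + right[size:] if size else None


print(overlap_length("ACGTTGCA", "TGCAAGGC"))  # 4
print(merge("ACGTTGCA", "TGCAAGGC"))           # ACGTTGCAAGGC
```

Repeating the merge over many reads, longest overlaps first, is one simple way to grow contigs.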
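Repetitive DNA may confuse assembly: reads from different copies of a repeat look alike, so they can appear to overlap even though they come from different places in the genome.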
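A toy illustration of the ambiguity, assuming exact matching and made-up sequences: a read that ends inside a repeat overlaps equally well with reads that start in two different copies of that repeat, so the overlap alone cannot tell which continuation is correct.

```python
def suffix_prefix_overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


repeat = "GATTACAGATC"             # hypothetical repeat present at two genomic locations
read_into_repeat = "GGTT" + repeat  # read that ends inside a repeat copy
read_from_copy_1 = repeat + "AAGG"  # read leaving copy 1
read_from_copy_2 = repeat + "TTCC"  # read leaving copy 2

overlap_1 = suffix_prefix_overlap(read_into_repeat, read_from_copy_1)
overlap_2 = suffix_prefix_overlap(read_into_repeat, read_from_copy_2)
print(overlap_1, overlap_2)  # 11 11
```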
Sequence completion (finishing) (CONSED, AUTOFINISH): Software such as CONSED produces an assembled genome, taking into account base quality. [Figure labels: region of low sequence coverage and/or quality; gap]
Whole-genome shotgun (WGS) sequencing characteristics
PROS:
- WGS is less labor-intensive than clone-based sequencing
- Faster
- Very effective in re-sequencing projects where the scaffold of the genome is known, e.g. human genome population studies
CONS:
- WGS has greater uncertainty in the mapping of read positions
- Difficult to use WGS for de novo assembly of genomes that have not previously been sequenced or whose structure is not known (seawater sampling, soil sampling, or other metagenomic studies)
Current usage: Whole-genome shotgun sequencing is used for re-sequencing studies, i.e. those where at least one individual of the species has already been fully sequenced. This is because individuals of a species have very similar genomes, making it easy to assemble the reads from any new individual against the existing scaffold.
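Placing a new individual's reads on an existing scaffold can be sketched as a brute-force search for the position with the fewest mismatches. This is an illustration only: the function `best_placement`, the `max_mismatches` threshold, and the toy sequences are assumptions, and real read mappers use indexed, gap-aware alignment.

```python
from typing import Optional


def best_placement(reference: str, read: str, max_mismatches: int = 1) -> Optional[tuple]:
    """Return (start, mismatches) for the placement of `read` with the fewest mismatches.

    Each mismatch is (reference_position, reference_base, read_base).
    Returns None if no placement has at most `max_mismatches` mismatches.
    """
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = [
            (start + offset, ref_base, read_base)
            for offset, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        if len(mismatches) <= max_mismatches and (best is None or len(mismatches) < len(best[1])):
            best = (start, mismatches)
    return best


reference = "ACGTTGCAAGGCTTAACCGG"  # hypothetical existing scaffold
read = "GCAAGTCTTA"                 # read from a new individual, one base differs
print(best_placement(reference, read))  # (5, [(10, 'G', 'T')])
```

The read is placed by its similarity to the scaffold, and the remaining difference is a candidate variant in the new individual.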
DNA Sequencing: Instructor Demo
Platform: UNIX (bioclass.bc.edu)
Instructions: http://bioinformatics.bc.edu/~marth/BI820/pages/sequencingComputerSession.html
Data: bioclass:~marth/CLONE.tar.gz or http://bioinformatics.bc.edu/~marth/Data/CLONE.tar.gz