Human Genome Sequence and Variability Gabor T. Marth, D.Sc. Department of Biology, Boston College Medical Genomics Course – Debrecen, Hungary, May 2006
Lecture overview 1. Genome sequencing strategies, sequencing informatics 2. Genome annotation, functional and structural features in the human genome 3. Genome variability, DNA nucleotide, structural, and epigenetic variations
1. The Human genome sequence
The nuclear genome (chromosomes)
The genome sequence: the primary template on which to outline the functional features of our genetic code (genes, regulatory elements, secondary structure, tertiary structure, etc.)
Completed genomes ~1 Mb ~100 Mb >100 Mb ~3,000 Mb
Main genome sequencing strategies: clone-based shotgun sequencing (Human Genome Project) and whole-genome shotgun sequencing (Celera Genomics, Inc.)
Hierarchical genome sequencing: BAC library construction, clone mapping, shotgun subclone library construction, sequencing, sequence reconstruction (sequence assembly). Lander et al. Nature 2001
Clone mapping – “sequence ready” map
Hierarchical genome sequencing: BAC library construction, clone mapping, shotgun subclone library construction, sequencing/read processing, sequence reconstruction (sequence assembly). Lander et al. Nature 2001
Shotgun subclone library construction: BAC primary clone, cloning vector, sequencing vector, subclone insert
Sequencing
Robotic automation Lander et al. Nature 2001
Base calling PHRED base = A Q = 40
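The quality value on this slide (Q = 40) follows the standard Phred definition, Q = -10 · log10(P), where P is the estimated probability that the base call is wrong. A minimal sketch of the conversion; the function names are illustrative and are not part of the PHRED program itself:

```python
import math

def phred_to_error_prob(q):
    """Estimated probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality score for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Q = 40 corresponds to roughly 1 expected error in 10,000 base calls.
print(phred_to_error_prob(40))
print(error_prob_to_phred(0.001))
```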
Vector clipping
Sequence assembly PHRAP
Repetitive DNA may confuse assembly
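Why repeats confuse assembly can be illustrated with a toy suffix/prefix overlap test. This is a simplified sketch with made-up reads, not the algorithm PHRAP uses: when a read ends inside a repeat, reads from different genomic copies of that repeat overlap it equally well, so the assembler cannot tell which one follows.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

# Hypothetical reads: the first ends in a CA-repeat; two other reads begin
# with the same repeat but continue into different flanking sequence.
read = "TTGACACACA"
candidates = {"copy_A": "CACACAGGT", "copy_B": "CACACATTC"}

for name, cand in candidates.items():
    print(name, overlap(read, cand))  # both candidates overlap equally well
```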
Sequence completion (finishing): CONSED, AUTOFINISH. Targets: gaps and regions of low sequence coverage and/or quality
2. Human genome annotation
Genome annotation – Goals: protein-coding genes, RNA genes, repetitive elements, GC content
The starting material AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT AGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGT GCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGT AGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTGCTTGAG TCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTG GGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGCT CGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTAT ATCTCTTTCTCTGTCGTGCTGCTTGAGATCGTTCGTTTTTTTATGCT GATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTCT AGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGA AGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCT
Coding genes – ab initio predictions ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA Open Reading Frame = ORF Stop codon Start codon PolyA signal
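The open reading frame idea on this slide can be sketched as a scan for a start codon (ATG) followed in the same frame by a stop codon (TAA, TAG, or TGA). This is only the basic ORF concept on the forward strand; the function name and the `min_codons` parameter are illustrative assumptions, and real ab initio gene finders use much richer probabilistic models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) intervals of forward-strand ORFs.

    start is the 0-based index of the ATG; end is exclusive and
    includes the stop codon. All three reading frames are scanned.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

# The example sequence from the slide.
slide_seq = "ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA"
for start, end in find_orfs(slide_seq):
    print(start, end, slide_seq[start:end])
```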
Ab initio predictions Gene structure
Ab initio predictions …AGAATAGGGCGCGTACCTTCCAACGAAGACTGGG… splice donor site splice acceptor site
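Most introns begin with GT (splice donor) and end with AG (splice acceptor), so a first-pass scan can list every candidate site. The sketch below, with illustrative function names, does only that; dinucleotide matching alone produces many false positives, which is why gene finders score the surrounding sequence context as well. The input is the slide's example fragment without its leading and trailing ellipses.

```python
def find_dinucleotide(seq, dinuc):
    """Return all 0-based positions where dinuc occurs in seq."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc]

def candidate_splice_sites(seq):
    """Candidate donor (GT) and acceptor (AG) positions on the forward strand."""
    return {
        "donor_GT": find_dinucleotide(seq, "GT"),
        "acceptor_AG": find_dinucleotide(seq, "AG"),
    }

fragment = "AGAATAGGGCGCGTACCTTCCAACGAAGACTGGG"
print(candidate_splice_sites(fragment))
```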
Ab initio predictions: Genscan, Grail, Genie, GeneFinder, Glimmer, etc. Spliced alignment of expressed sequences to the genome: EST_genome, Sim4, Spidey, EXALIN
Homology based predictions ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA ACGGAAGTCT known coding sequence from another organism GGACTATAAA expressed sequence genes predicted by homology Genomescan Twinscan etc…
Consolidation – gene prediction systems Otto Ensembl FgenesH Genscan Grail Genewise Sim4 dbEst
ncRNA genes: prediction based on structure (e.g. tRNAs); for other novel ncRNAs, only homology-based predictions have been successful
Repeat annotations are based on sequence similarity to known repetitive elements in a repeat sequence library
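Library-based repeat annotation can be sketched as follows. This is a toy illustration that uses exact k-mer matching as a stand-in for the sensitive sequence alignment that real repeat annotation relies on; the library entry and genome below are made-up sequences, not real repeat consensus sequences.

```python
def kmers(seq, k):
    """Set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def annotate_repeats(genome, library, k=8):
    """Return (start, end, name) intervals where genome k-mers match a library element.

    Overlapping or adjacent hits to the same element are merged; end is exclusive.
    """
    index = {}
    for name, repeat_seq in library.items():
        for km in kmers(repeat_seq, k):
            index.setdefault(km, name)
    hits = []
    for i in range(len(genome) - k + 1):
        name = index.get(genome[i:i + k])
        if name is None:
            continue
        if hits and hits[-1][2] == name and i <= hits[-1][1]:
            hits[-1] = (hits[-1][0], i + k, name)  # extend the previous hit
        else:
            hits.append((i, i + k, name))
    return hits

# Hypothetical one-entry repeat library and a genome containing one copy.
toy_library = {"toy_repeat": "ACGTTGCAACGGT"}
toy_genome = "TTTTT" + "ACGTTGCAACGGT" + "GGGGG"
print(annotate_repeats(toy_genome, toy_library))
```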
The landscape of the human genome
Gene annotations – # of coding genes Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Gene annotations – gene length Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Gene annotations – gene function Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
GC content and coding potential Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
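GC content is the fraction of bases that are G or C, usually reported per window along a chromosome. A minimal sketch, assuming non-overlapping windows by default; the function names and window parameters are illustrative choices, not values from the lecture:

```python
def gc_content(seq):
    """Fraction of G and C among the A/C/G/T bases in seq (0.0 if none)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0

def windowed_gc(seq, window, step=None):
    """Return (window_start, gc_fraction) for each full window along seq."""
    step = step or window
    return [
        (i, gc_content(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]

print(windowed_gc("GGGGAAAAGCGCATAT", window=4))
```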
ncRNAs Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Segmental duplications Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Repeat elements Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Genes and repeats
Physical vs. genetic map (Mb/cM). Genetic distances: 0.4 cM, 1.3 cM, 0.7 cM. Physical distances: 0.4 Mb, 0.7 Mb, 0.3 Mb
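The relationship between the two maps can be expressed as a ratio per interval. The sketch below assumes that the genetic and physical distances on the slide are listed in the same interval order (the original figure is not available to confirm the pairing); variable and function names are illustrative.

```python
# Distances as listed on the slide; pairing by position is an assumption.
genetic_cm = [0.4, 1.3, 0.7]
physical_mb = [0.4, 0.7, 0.3]

def mb_per_cm(physical, genetic):
    """Physical distance covered per unit of genetic distance, per interval."""
    return [mb / cm for mb, cm in zip(physical, genetic)]

for cm, mb, ratio in zip(genetic_cm, physical_mb, mb_per_cm(physical_mb, genetic_cm)):
    # A lower Mb/cM ratio means more recombination per unit of physical distance.
    print(f"{cm} cM over {mb} Mb -> {ratio:.2f} Mb/cM")
```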
3. Human genome variability
DNA sequence variations: the reference human genome sequence is 99.9% common to each human being; the remaining sequence variations make our genetic makeup unique. The most abundant human variations are single-nucleotide polymorphisms (SNPs); 10 million SNPs are currently known
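Two quick illustrations of these numbers. First, 0.1% of a genome of roughly 3,000 Mb is about 3 million differing positions. Second, single-nucleotide differences between two already-aligned sequences can be listed by direct comparison. This is a toy sketch with made-up sequences and illustrative function names; real SNP discovery also has to account for alignment and base quality.

```python
GENOME_SIZE_BP = 3_000_000_000   # approximate human genome size (~3,000 Mb)
DIFFERING_FRACTION = 0.001       # 100% - 99.9%

expected_differences = round(GENOME_SIZE_BP * DIFFERING_FRACTION)
print(expected_differences)      # about 3 million positions

def find_single_base_differences(reference, sample):
    """List (position, ref_base, sample_base) mismatches between aligned sequences.

    Both sequences must have the same length; gap characters ('-') are skipped,
    since those positions are insertions/deletions rather than substitutions.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference.upper(), sample.upper()))
        if r != s and "-" not in (r, s)
    ]

print(find_single_base_differences("ACGTACGT", "ACGTTCGT"))
```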
DNA sequence variations insertion-deletion (INDEL) polymorphisms
Structural variations Speicher & Carter, NRG 2005
Structural variations Feuk et al. Nature Reviews Genetics 7, 85–97 (February 2006) | doi: /nrg1767
Detection of structural variants Feuk et al. Nature Reviews Genetics 7, 85–97 (February 2006) | doi: /nrg1767
Epigenetic changes: chromatin structure Sproul, NRG 2005
Epigenetic changes: DNA methylation Laird, NRC 2003