Genome Biology and Biotechnology


Presentation on theme: "Genome Biology and Biotechnology"— Presentation transcript:

1 Genome Biology and Biotechnology
Genoom Biologie (Genome Biology), Prof. M. Zabeau. Genome Biology and Biotechnology, 3. The genome structures of vertebrates. Prof. M. Zabeau, Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology (VIB), University of Gent. International course 2005. Academiejaar (academic year)

2 The Genome Sequences of vertebrates
Fish genomes: “compact” vertebrate genomes
Fugu rubripes (2002); Tetraodon nigroviridis (2004)
Bird genome: interesting evolutionary intermediate
Chicken, Gallus gallus (2004)
Rodent genomes: the model organisms for the human
Mouse, Mus musculus (2002); Rat, Rattus norvegicus (2004)
Primate genomes: our closest relatives
Chimpanzee
Human genome: draft genome sequence (2001); finished genome sequence (2004)

3 Vertebrate evolution
Divergence times labelled on the figure: 310 MY and 450 MY.
Figure 1 Basal vertebrate evolution showing extant species whose genomes have been sequenced. The horizontal axis represents estimated relative species diversity. The Archosauria include the Aves, their Mesozoic dinosaur predecessors, and Crocodilia; the Lepidosauria (lizards, snakes and tuataras) are not indicated. Archaeopteryx (indicated by an asterisk) is considered to be the first known bird and lived approximately 150 Myr ago. Reprinted from: ICGSC, Nature 432, (2004)

4 Whole-Genome Shotgun Assembly and Analysis of the Genome of Fugu rubripes
Aparicio et al., Science, 297, 1301-1310 (2002)
The paper presents a low-quality draft genome sequence of Fugu rubripes.
The sequence provided a valuable reference for annotating the human and mouse genomes.
Small genome (350 Mb versus 3000 Mb) with a low repetitive DNA content.

5 The Fugu Genome Sequence
The draft sequence covers a total of 332.5 Mb Highly fragmented sequence (~30 Mb unassembled sequences) The total genome size is estimated at ~365 Mb The number of predicted genes: 31,059 similar to the number of human genes predicted from the draft sequence Repetitive sequences Density of <15% far below the 35 to 45% observed in mammals Transposable elements are still very active Reprinted from: Aparicio et al., Science, 297, (2002)

6 Protein-coding genes
The gene-containing fraction is ~108 Mb (30%).
The average gene density is one gene per 10.9 kb.
The Fugu genome is compact because introns are shorter than in the human genome: it contains ~500 large introns (>10 kb), compared with >12,000 large introns in human.
Genes are scaled in proportion to the compact genome size.
The number of introns is roughly the same as in human; both gain and loss of introns are observed in the Fugu lineage.
The compactness of the Fugu genome is accounted for by the low abundance of repeated sequences and the small size of introns and intergenic regions.
Reprinted from: Aparicio et al., Science, 297, (2002)

7 Comparison of Fugu and Human Proteomes
75% of predicted human proteins have a strong match to Fugu Reprinted from: Aparicio et al., Science, 297, (2002)

8 Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype
Jaillon et al., Nature 431, 946-957 (2004)
The paper presents a high-quality draft genome sequence of Tetraodon nigroviridis, with long-range linkage and chromosome anchoring.
Tetraodon nigroviridis is a freshwater puffer fish with the smallest known vertebrate genome.

9 The Tetraodon genome sequence
The draft genome sequence (8.3× coverage) spans 342 Mb. The largest scaffolds were mapped onto the chromosomes, and the draft is much less fragmented than that of Fugu.
Genome landscape: transposable elements are very rare (<4,000 copies), fewer than in Fugu (15% of the genome).
An estimated 20,000–25,000 protein-coding genes: very similar to the recent (2004) human gene count and much lower than reported for Fugu (the current Fugu estimate is also lower).
Gene ontology (GO) classifications show only subtle differences between fish and mammals.
The improved fish gene catalogue aids human gene predictions.
Reprinted from: Jaillon et al., Nature 431, (2004)

10 Evidence For Whole-genome Duplication
Duplicated genes cluster on paralogous chromosomes: the paralogous chromosomes arising from whole-genome duplication each contain one member of the duplicated gene pairs, in the same order. Reprinted from: Jaillon et al., Nature 431, (2004)

11 Evidence For Whole-genome Duplication
Blocks of doubly conserved synteny: the synteny map typically associates two regions in Tetraodon with one region in human. Figure labels: Tni, Tetraodon; Hsa, human. Reprinted from: Jaillon et al., Nature 431, (2004)
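The idea of doubly conserved synteny can be sketched in a few lines of Python. This is only a toy illustration, not the method used by Jaillon et al.: the gene names, chromosome labels, and the `window` and `min_each` parameters are all invented here. The function scans human genes in chromosomal order and reports windows whose Tetraodon orthologues fall on exactly two chromosomes, each represented several times.

```python
from collections import Counter

def dcs_blocks(human_order, tni_chrom, window=6, min_each=2):
    """Return (start, end, chromosome_pair) for each window of consecutive
    human genes whose Tetraodon orthologues lie on exactly two chromosomes,
    each seen at least `min_each` times (hypothetical thresholds)."""
    blocks = []
    for start in range(len(human_order) - window + 1):
        genes = human_order[start:start + window]
        # Count the Tetraodon chromosome of every gene that has an orthologue.
        counts = Counter(tni_chrom[g] for g in genes if g in tni_chrom)
        if len(counts) == 2 and min(counts.values()) >= min_each:
            blocks.append((start, start + window - 1, tuple(sorted(counts))))
    return blocks

# Invented example: eight human genes in order along one chromosome.
human_order = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
tni_chrom = {
    "g1": "Tni2", "g2": "Tni3", "g3": "Tni2", "g4": "Tni3",
    "g5": "Tni2", "g6": "Tni3", "g7": "Tni9", "g8": "Tni12",
}
blocks = dcs_blocks(human_order, tni_chrom)
print(blocks)
```

In the toy data the first six human genes alternate between two Tetraodon chromosomes, which is the interleaved pattern expected after a whole-genome duplication followed by loss of one copy of most gene pairs.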

12 Ancestral genome of bony vertebrates
The patterns of doubly conserved synteny are consistent with 12 ancestral chromosomes, which have rearranged to form the present-day chromosomes of human and fish. Figure panels: Human, Fish. Reprinted from: Jaillon et al., Nature 431, (2004)

13 Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution
International Chicken Genome Sequencing Consortium, Nature 432, (2004)
The paper presents a draft genome sequence of the red jungle fowl, Gallus gallus.
The first genome of a non-mammalian amniote provides a new perspective on vertebrate genome evolution.
The evolutionary distance between chicken and human (310 MY since the divergence of birds and mammals) provides an excellent signal-to-noise ratio for detecting functional elements.

14 The chicken genome sequence
The draft genome sequence (6.6× coverage) spans Mb
The draft represents ~96% of the euchromatic part of the genome
23,212 chicken mRNAs and 485,000 ESTs
The chicken genome is 3x smaller than mammalian genomes, reflecting substantially fewer interspersed repeats: transposable elements make up <9% of the genome, markedly lower than the 40–50% observed in mammalian genomes
Pseudogenes: 51 retrotransposed genes vs. >15,000 in mammalian genomes
Segmental duplications: limited to very small (<10 kb) intrachromosomal duplications
Reprinted from: ICGSC, Nature 432, (2004)

15 Gene content of the chicken genome
Protein-coding genes: a predicted 20,000 to 23,000 protein-coding genes, which matches the current (2004) estimate for mammalian genomes
Non-coding RNA genes: 571 ncRNA genes from >20 gene families; fewer than in human, where many ncRNA genes are pseudogenes
Syntenic relationships for non-coding RNA genes differ from those of protein-coding genes, which implies a novel mode of evolution for some ncRNA genes
Only certain ncRNA genes are in regions of conserved synteny: microRNAs (miRNAs) and small nucleolar RNAs (snoRNAs) found in introns of protein-coding genes
Reprinted from: ICGSC, Nature 432, (2004)

16 Evolutionary conservation of gene components
Sequence conservation of chicken and human orthologs is highest in protein-coding exons, minimal in introns, and significant in the 5' and 3' flanking and untranslated regions.
Figure 2 An idealized protein-coding gene structure showing average percentage alignment and average percentage identity (including gaps and unaligned regions) over 10,000 orthologous gene structures in either human–chicken, human–mouse or mouse–rat alignments (as aligned by BLASTZ). The reference structure was taken from human or mouse, and only those with cDNA-based definitions of the structure were used. The central figure shows an idealized gene structure, with the grey exons representing coding sequence and white boxes representing 3' and 5' untranslated regions. Figure labels: 5' UTR, exon, 3' UTR.
Reprinted from: ICGSC, Nature 432, (2004)

17 Conservation of vertebrate protein content
60% of chicken genes have a single human orthologue; many also have a single orthologue in the Fugu genome (43% of chicken genes are in 1:1:1 relationships, see the caption). These represent a conserved core present in most vertebrates.
Figure 5 Chicken genes classified according to their predicted evolutionary relationships with genes of two other model vertebrates (Fugu and human). Forty-three per cent of the chicken genes are present in 1:1:1 orthology relationships for the three species. Also present in three species are n:n:n (many:many:many) orthologues; putative gene duplication events have resulted in multiple genes in at least one of the species. Pairwise orthologues are assigned when orthology is not detectable in the third species. Between Fugu and chicken, pairwise orthologues are rare (as expected), and might be indicative of gene loss in the lineage leading to humans. For a substantial number of genes, clear orthology relations cannot be described at all, but some similarity to genes in the other species remains detectable ('Homology', E-value cutoff is 10^-6 in Smith–Waterman searches at the protein level). See Methods for details of orthology assignment.
Reprinted from: ICGSC, Nature 432, (2004)

18 Conserved core orthologues in vertebrates
Core orthologues conserved in vertebrates have highly conserved protein sequences, indicating that they have been subject to purifying selection. Reprinted from: ICGSC, Nature 432, (2004)

19 Expansion of multigene families
Expansion and contraction of multigene families were major factors in the independent evolution of mammals and birds.
Figure 7 Loss, innovation, expansions and contractions of protein families: domain counts and orthologous relations. All at-least-twofold over- and under-representations (separated by a solidus) are shown for both members of domain families (a) and 'many to many' orthology relations (b). Ranking of families and groups has been done with respect to the human genome; Fugu data are also shown for comparison. An asterisk indicates that manual analyses refined what were otherwise automatic counts. Families not subjected to twofold variations are not shown.
Reprinted from: ICGSC, Nature 432, (2004)

20 Chromosomal dynamics in the vertebrates
Maps of conserved synteny (orthologous chromosomal segments with conserved gene order) show a slow rate of rearrangement in the human lineage and a 3-fold higher rate in the rodent lineage. The human genome is closer to the chicken in terms of synteny.
Figure 14 Rates of genome structure divergence. a, Phylogenetic tree of chicken, human and mouse showing rates of genome structure divergence, where the branch length is proportional to the number of estimated chromosomal rearrangements. Details are given in the supporting table (b). AA, amniote ancestor; MA, mammalian ancestor. The pie charts show the fraction of orthologous genes that have retained their genomic neighbourhood; for example, about 85% of chicken and human orthologous genes reside in orthologous chromosomal segments, and their sizes are proportional to the number of recognizable orthologous genes. b, The table highlights a very low rate of interchromosomal shuffling in early mammalian and bird evolution, and an elevated rate of interchromosomal shuffling in the rodent lineage. It provides a summary of the number of chromosomal rearrangements estimated by counting synteny breaks where ancestral state is supported by synteny to an outgroup species, and by reconstruction of ancestral genomes through a combinatorial search for the most parsimonious rearrangement pattern. The two methods agree reasonably and have been used to estimate the relative branch lengths of the genome structure divergence tree in a. Normalization to the length of the MA–human branch and to the time of independent evolution is presented for easy comparison.
Reprinted from: ICGSC, Nature 432, (2004)

21 Ancestral mammalian genome
Long blocks of conserved chicken–human synteny: entire chromosomes
Genome rearrangements: many intrachromosomal rearrangements, few translocations between chromosomes
Chicken has a number of micro-chromosomes
Figure 12 Putative mammalian ancestor recovered by GRIMM and MGR using the human, mouse, rat (not shown) and chicken genomes. a, Each genome is represented as an arrangement of 586 synteny blocks each drawn as one unit, regardless of its length in nucleotides. Each human chromosome is assigned a unique colour, and a diagonal line is drawn through the whole chromosome. In other genomes, this diagonal line indicates the relative order and orientation of the rearranged blocks. b, The recovered ancestral X chromosome is optimal and unique. Gene order of the ancestral X chromosome is identical to human. Numbers associated with the lines indicate the minimum number of rearrangements required to convert between two nodes.
Reprinted from: ICGSC, Nature 432, (2004)

22 Conserved sequences in chicken and human
High substitution rates between human and chicken can be used to detect functionally conserved sequences.
70 Mb (2.5%) of the human sequence aligns with chicken: 44% is in protein-coding regions (exons) and 56% is non-coding, split between intronic (25%) and intergenic (31%) sequence.
Conserved non-coding segments occur clustered and far from genes: 57 segments were identified, with an average length of 1.1 Mb. They are gene poor and G+C poor, and have no interspersed repeats. The functional significance of these sequences is completely unknown.
Reprinted from: ICGSC, Nature 432, (2004)

23 Conclusion
The chicken genome sequence is a key resource for comparative genomics: to distinguish derived from ancient features of mammalian biology, to study mammalian innovation and adaptation, and to study conserved non-coding sequences in particular.
It provides a framework for discovering the functional polymorphisms underlying interesting quantitative traits, to further exploit the genetic potential of the chicken.
Reprinted from: ICGSC, Nature 432, (2004)

24 Initial sequencing and comparative analysis of the mouse genome
Mouse Genome Sequencing Consortium, Nature 420, (2002)
The paper presents a draft genome sequence of the mouse and a comparative analysis of the mouse and human genomes.
75 MY have passed since the divergence of rodents and primates, and the two genome sequences diverge by nearly one substitution for every two nucleotides.
The paper describes the insights that can be gleaned from the two sequences.

25 The Mouse Genome Project
The laboratory mouse is an experimental model system for studying human disease and mammalian biology The Mouse Genome Project International collaboration of centres in the US and the UK Adopted mixed strategy for the draft genome sequencing a BAC-based physical map of the mouse genome The initial draft genome sequence was generated by WGS sequencing to ~7-fold coverage Hierarchical shotgun sequencing of BAC clones The finished sequence should be completed in 2005 using the BAC clones for directed finishing Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, (2002)

26 The draft mouse genome sequence
The euchromatic mouse genome is estimated ~2.5 Gb The draft genome sequence covers ~96% of the genome Generation of the draft genome sequence Sequencing 41.4 Mi paired-end sequence reads derived from various clone types Assembly represents ~7.7-fold sequence coverage 224,713 sequence contigs total of 7,418 supercontigs The 200 largest supercontigs span more than 98% of the assembled sequence Anchoring to chromosomes Anchored all supercontigs >500 kb with the mouse genetic map Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, (2002)
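The relation between read count and fold coverage is simple enough to check by hand: coverage is the number of reads times the mean read length, divided by the genome size. The sketch below inverts that relation for the slide's numbers. It assumes, purely for illustration, that all 41.4 million reads contributed equally to the assembly; the slide does not state the read length or how many reads passed quality filters, so the resulting length is only an implied ballpark, not a reported figure.

```python
def fold_coverage(n_reads, mean_read_len_bp, genome_size_bp):
    """Average number of times each base is covered by a read."""
    return n_reads * mean_read_len_bp / genome_size_bp

def implied_read_length(n_reads, coverage, genome_size_bp):
    """Mean read length that would give `coverage` with `n_reads` reads."""
    return coverage * genome_size_bp / n_reads

N_READS = 41.4e6   # paired-end reads quoted on the slide
GENOME = 2.5e9     # estimated euchromatic genome size in bp
COVERAGE = 7.7     # fold coverage quoted on the slide

# Assumption: every read is used; the real usable-read count may be lower.
implied = implied_read_length(N_READS, COVERAGE, GENOME)
print(f"implied mean usable read length: about {implied:.0f} bp")
```

If fewer reads were actually used in the assembly, the implied mean length would be correspondingly longer.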

27 The draft mouse genome sequence
The euchromatic mouse genome is estimated at ~2.5 Gb. The draft genome sequence covers ~96% of the genome.
Comparative analysis of human and mouse genomes: the mouse genome is about 14% smaller than the human genome.
High degree of synteny: >90% of the two genomes can be partitioned into corresponding regions of conserved synteny.
At the nucleotide level, approximately 40% of the human genome can be aligned to the mouse genome; these alignments represent orthologous sequences conserved from the common ancestor.
Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, (2002)

28 Synteny between mouse and human
Regions containing orthologous sequence pairs define syntenic segments: regions in which orthologous sequence pairs are in the same order on a chromosome in both species.
Figure 2 Conservation of synteny between human and mouse. We detected 558,000 highly conserved, reciprocally unique landmarks within the mouse and human genomes, which can be joined into conserved syntenic segments and blocks (defined in text). A typical 510-kb segment of mouse chromosome 12 that shares common ancestry with a 600-kb section of human chromosome 14 is shown. Blue lines connect the reciprocal unique matches in the two genomes. The cyan bars represent sequence coverage in each of the two genomes for the regions. In general, the landmarks in the mouse genome are more closely spaced, reflecting the 14% smaller overall genome size.
Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, (2002)
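The definition of a syntenic segment can be made concrete with a small Python sketch. It is a simplified illustration under stated assumptions, not the consortium's procedure: landmarks are hypothetical tuples of (mouse chromosome, mouse position, human chromosome, human position), and a segment is taken to be a run of consecutive landmarks on the same pair of chromosomes whose human positions keep moving in one direction.

```python
def syntenic_segments(landmarks):
    """Group landmarks into runs with conserved order.

    landmarks: list of (mouse_chr, mouse_pos, human_chr, human_pos).
    A run ends when the chromosome pair changes or the direction of
    travel along the human chromosome flips.
    """
    ordered = sorted(landmarks, key=lambda lm: (lm[0], lm[1]))
    segments, current, direction = [], [], 0  # direction: +1, -1 or 0 (unknown)
    for lm in ordered:
        if current:
            prev = current[-1]
            same_chroms = lm[0] == prev[0] and lm[2] == prev[2]
            step = (lm[3] > prev[3]) - (lm[3] < prev[3])
            if same_chroms and step != 0 and direction in (0, step):
                direction = step
                current.append(lm)
                continue
            segments.append(current)  # close the finished run
        current, direction = [lm], 0
    if current:
        segments.append(current)
    return segments

# Invented landmarks: a collinear run, an inverted run, then another chromosome.
landmarks = [
    ("chr12", 100, "chr14", 1000), ("chr12", 200, "chr14", 1100),
    ("chr12", 300, "chr14", 1250), ("chr12", 400, "chr14", 900),
    ("chr12", 500, "chr14", 800), ("chr12", 600, "chr7", 50),
]
segments = syntenic_segments(landmarks)
print([len(s) for s in segments])
```

Real pipelines also tolerate small local disruptions and require a minimum segment size; both refinements are omitted here for brevity.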

29 Synteny between mouse and human
Conservation of orthologous sequence pairs shows Each genome can be parsed into a total of 342 conserved syntenic segments. The segments vary greatly in length, from 303 kb to 64.9 Mb In total, about 90.2% of the human genome and 93.3% of the mouse genome reside within conserved syntenic segments The segments can be aggregated into a total of 217 conserved syntenic blocks The syntenic block and segment sizes are consistent with the random breakage model of genome evolution the minimal number of rearrangements needed to 'transform' one genome into the other is 295 rearrangements Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, (2002)

30 Blocks of conserved synteny in the human and mouse genomes
Figure 3 Segments and blocks >300 kb in size with conserved synteny in human are superimposed on the mouse genome. Each colour corresponds to a particular human chromosome. The 342 segments are separated from each other by thin, white lines within the 217 blocks of consistent colour. Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, (2002)

31 Repetitive sequences in human and mouse
The most prevalent feature of mammalian genomes is their high content of repetitive sequences, most of which are interspersed repeats representing 'fossils' of transposable elements.
The repetitive sequences in mouse and human differ: only 37.5% of the mouse genome, versus ~46% of the human genome, is transposon-derived.
Insertions of transposable elements occurred in the last 150–200 million years.
The most notable difference is the rate of transposition over time: in mouse the rate has remained fairly constant, whereas in human the rate increased to a peak at ~40 Myr and then plummeted.
Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, (2002)

32 Age distribution of interspersed repeats in the mouse and human genomes
Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, (2002)

33 Protein-coding genes in mouse and human
Human and mouse gene catalogues: the current human gene catalogue (Ensembl build 29) contains 22,808 predicted genes; the current mouse gene catalogue contains 22,011 predicted genes.
Comparative analysis of protein-coding genes shows that 80% of the mouse genes have orthologues in the human genome. The proportion of mouse/human genes without any homologue is <1%.
Many local gene family expansions have occurred in the mouse lineage; most seem to involve genes for reproduction, immunity and olfaction.
The rate of protein evolution: most proteins evolve at a fairly constant rate, but certain proteins evolve much more rapidly. Proteins implicated in reproduction, host defence and immune response seem to be under positive selection, which drives their rapid evolution.
Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, (2002)

34 Conclusions
The mouse genome provides a powerful resource to unravel the secrets of the human genome.
It demonstrates the power of comparative genomics in identifying relevant genetic elements.
These findings inspired additional animal genome sequencing projects to fully exploit the power of comparative genomics, as illustrated in: Thomas et al., Nature 423, (2003)
The sequence provides a comprehensive framework for functional genomics approaches to unravel gene functions in both human and mouse.

35 Genome sequence of the Brown Norway rat yields insights into mammalian evolution
RGSPC, Nature 428: (2004)
The paper presents a high-quality 'draft' sequence covering >90% of the genome, and a three-way comparison with the human and mouse genomes to study mammalian genome evolution.
Rat - mouse common ancestor: 12–24 Myr
Rodent - human common ancestor: 75 Myr

36 The rat genome sequence
The draft genome sequence covers 2.75 Gb.
A 'combined' sequencing strategy using WGS sequencing and light sequence coverage of BACs: sequential assembly of 'enriched BACs' (eBACs) joined into bactigs, superbactigs and ultrabactigs.
Figure 1 The new 'combined' sequence strategy and Atlas software. a, Formation of 'eBACs'. The RGSP strategy combined the advantages of both BAC and WGS sequence data. Modest sequence coverage (1.8-fold) from a BAC is used as 'bait' to 'catch' WGS reads from the same region of the genome. These reads, and their mate pairs, are assembled using Phrap to form an eBAC. This stringent local assembly retains 95% of the 'catch'. b, Creation of higher-order structures. Multiple eBACs are assembled into bactigs based on sequence overlaps. The bactigs are joined into superbactigs by large clone mate-pair information (at least two links), extended into ultrabactigs using additional information (single links, FPC contigs, synteny, markers), and ultimately aligned to genome mapping data (radiation hybrid and physical maps) to form the complete assembly.
Reprinted from: RGSPC, Nature 428: (2004)

37 Rat – mouse – human genome sequences
Sequence elements in the human, mouse and rat genomes:
40% align in all 3 species: an 'ancestral core' of 1 Gb containing 95% of the exons and regulatory regions
28% aligns only with mouse: rodent-specific repeats
29% does not align: rat-specific repeats
Figure 7 Aligning portions and origins of sequences in rat, mouse and human genomes. Each outlined ellipse is a genome, and the overlapping areas indicate the amount of sequence that aligns in all three species (rat, mouse and human) or in only two species. Non-overlapping regions represent sequence that does not align. Types of repeats classified by ancestry: those that predate the human–rodent divergence (grey), those that arose on the rodent lineage before the rat–mouse divergence (lavender), species-specific (orange for rat, green for mouse, blue for human) and simple (yellow), placed to illustrate the approximate amount of each type in each alignment category. Uncoloured areas are non-repetitive DNA; the bulk is assumed to be ancestral to the human–rodent divergence. Numbers of nucleotides (in Mb) are given for each sector (type of sequence and alignment category). Detailed results are tabulated (Supplementary Table SI-1).
Reprinted from: RGSPC, Nature 428: (2004)

38 Evolution of genes
An estimated 90% of rat genes possess strict orthologues in both mouse and human. Intronic structures are well conserved.
Most of the non-orthologous genes arose by expansions of gene families in the different lineages.
Rapidly evolving genes: rat-specific genes comprise novel genes for “life style” functions: pheromones, immunity, chemosensation, detoxification, proteolysis.
Reprinted from: RGSPC, Nature 428: (2004)

39 Rat – mouse – human synteny
Orthologous chromosome segments: 105 mouse–rat segments, 278 human–rat segments, 280 human–mouse segments.
Figure 4 Map of conserved synteny between the human, mouse and rat genomes. For each species, each chromosome (x axis) is a two-column boxed pane (p arm at the bottom) coloured according to conserved synteny to chromosomes of the other two species. The same chromosome colour code is used for all species (indicated below). For example, the first 30 Mb of mouse chromosome 15 is shown to be similar to part of human chromosome 5 (by the red in left column) and part of rat chromosome 2 (by the olive in right column). An interactive version is accessible.
Reprinted from: RGSPC, Nature 428: (2004)

40 Rat – mouse – human genome rearrangements
Reconstruction of the ancestral mammalian genome identified a total of 353 rearrangements: 247 between the murid ancestor and human, 50 from the murid ancestor to mouse, and 56 from the murid ancestor to rat.
The rearrangement rate is much higher (3x) in the rodent than in the human lineage.
Figure 5 Substitutions and microindels (1–10 bp) in the evolution of the human, mouse and rat genomes. a, The lengths of the labelled branches in the tree are proportional to the number of substitutions per site inferred using the REV model from all sites with aligned bases in all three genomes. b, The table shows the midpoint and variation in these branch-length estimates when estimated from different sequence alignment programs and different neutral sites, including sites from ancestral repeats, fourfold degenerate sites in codons, and rodent-specific sites ('in neutral sites only' row; Supplementary Information). Other rows give midpoints and variation for micro-indels on each branch of the tree in a.
Reprinted from: RGSPC, Nature 428: (2004)

41 The Human Genome
The human genome project was launched in 1990.
Phase I: generation of genetic and physical maps ( ). Demonstration that large-scale sequencing is feasible: yeast, worm.
Phase II: large-scale sequencing ( )
Pilot phase: finished sequence, with 99.99% accuracy and no gaps, of human chromosomes 21 and 22 (published in 2000 and 1999, respectively)
Draft phase: draft sequence covering >90% of the genome, completed in June 2000 (published in ); took ~1 year
Finishing phase: “finished” sequence covering 99% of the genome, completed in spring 2004; took ~3 years
Aftermath: no end point projected; closing the last couple of hundred gaps and sequencing the centromeres
Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)

42 The Human Genome Sequences
Draft genome sequences (2001)
International Human Genome Sequencing Consortium (collaboration of 20 public sequencing centers in 6 countries): used a hierarchical shotgun sequencing strategy. Sequence published in: Nature 409, 860 (2001)
Celera Genomics (private initiative): used a whole-genome shotgun approach; the assembly combined their whole-genome shotgun data with the public genome sequence data. Sequence published in: Venter et al., Science, 291, 1304 (2001)
Finished genome sequence (2004)
International Human Genome Sequencing Consortium. Sequence published in: Nature 431, (2004)

43 Finished Human Genome Sequence
Finishing process: a complex iterative process of resolving problematic sequences, from single nucleotide errors and gaps to the integrity of whole chromosomes.
The finishing process involved two distinct components: producing finished maps, consisting of continuous and accurate paths of overlapping large-insert clones; and producing finished clone sequences, consisting of continuous and accurate nucleotide sequences for each clone.
Generated shotgun sequence of ~ BACs, comprising a total (redundant) sequence length of 5.8 Gb.
Assembled sequences of ~ BACs.
Reprinted from: International Human Genome Sequencing Consortium Nature 431, (2004)

44 Finished Human Genome Sequence
Finished genome sequence: Build 35 comprises Mbp, interrupted by “only” 341 gaps.
308 gaps in the euchromatic sequence: totalling ~28 Mb
33 heterochromatic gaps (including 24 centromeres): totalling ~198 Mb
The total human genome size is estimated at ~3,080 Mb.
Comparison with the draft sequence: substantially fewer gaps (341 versus 147,821); a more accurate and complete sequence, with an error rate of ~1 per 10^5; confirmed local order and orientation of the sequences; corrected artefactual duplications resulting from mixups.
Most of the sequence was verified with BAC cloned overlap sequence, paired-end sequence reads from fosmids, and the draft chimpanzee genome sequence.
Reprinted from: International Human Genome Sequencing Consortium Nature 431, (2004)

45 Finished Human Genome Sequence
Importance of a completely finished genome sequence
Accurate reference for identifying genetic variation in the human population: the error rate of 10^-5 is much lower than the SNP frequency of 10^-3.
Identification of segmental duplications: estimated to cover >5% of the genome sequence; located primarily in the pericentromeric and subtelomeric regions; much higher than in mouse and rat.
Segmental duplications are of great medical interest because they predispose to deletion or rearrangement: Williams syndrome region (7q), Charcot–Marie–Tooth region (17p), DiGeorge syndrome region (22q).
Many remaining gaps involve unresolved segmental duplications.
Correct identification of all protein-coding gene structures: ~60% of the gene models were corrected compared to the draft.
Reprinted from: International Human Genome Sequencing Consortium Nature 431, (2004)
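Why a two-orders-of-magnitude gap between error rate and SNP frequency matters can be shown with a short calculation. The sketch below uses only the two rates quoted on the slide; the per-megabase framing and the variable names are illustrative choices, and it assumes, simplistically, that every sequencing error shows up as an apparent difference from a second genome.

```python
BASES_PER_MB = 1_000_000
ERROR_ONE_IN = 100_000   # finished sequence: ~1 error per 10^5 bases
SNP_ONE_IN = 1_000       # ~1 SNP per 10^3 bases

# Expected counts in one megabase of compared sequence.
errors_per_mb = BASES_PER_MB // ERROR_ONE_IN
snps_per_mb = BASES_PER_MB // SNP_ONE_IN

# Share of observed differences that would be sequencing errors, not SNPs.
false_fraction = errors_per_mb / (errors_per_mb + snps_per_mb)

print(f"{errors_per_mb} errors vs {snps_per_mb} SNPs per Mb")
print(f"about {false_fraction:.1%} of apparent differences are errors")
```

Under these assumptions roughly one apparent variant in a hundred would be an artefact of the reference, which is why a finished sequence is a usable baseline for variation studies.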

46 Estimates of the Number of Human Genes
Reassociation kinetics (60s and 70s): early estimates put the mRNA complexity of typical vertebrate tissues at 10,000–20,000, and were extrapolated to suggest around 40,000 genes for the entire genome.
Estimates from approximate gene and genome sizes: a calculation based on the size of a typical gene (3 × 10^4 bp) and the size of the genome (3 × 10^9 bp) yielded 100,000 genes (W. Gilbert, pers. comm.).
Number of CpG islands associated with known genes: an estimate of 70,000–80,000 genes was made.
Estimates based on ESTs varied widely, from 35,000 to 120,000 genes; the discrepancy results from contaminating genomic sequences and multiple ESTs from single genes.
Whole-genome shotgun sequence from the pufferfish suggested around 30,000 human genes.
Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)
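The size-based estimate is a one-line division, and reworking it shows why it overshoots. The sketch below first reproduces the 100,000 figure, then asks what fraction of the genome genes would occupy using two numbers quoted elsewhere in this deck (22,287 gene loci and an average genomic extent of about 30 kb). Combining those two numbers this way is a rough back-of-the-envelope assumption, not a result from the cited papers.

```python
GENOME_BP = 3_000_000_000    # 3 x 10^9 bp
TYPICAL_GENE_BP = 30_000     # 3 x 10^4 bp

def naive_gene_count(genome_bp, gene_bp):
    """Gene count if genes tiled the whole genome with no space between them."""
    return genome_bp // gene_bp

def genic_fraction(n_genes, gene_bp, genome_bp):
    """Fraction of the genome spanned by n_genes of average extent gene_bp."""
    return n_genes * gene_bp / genome_bp

estimate = naive_gene_count(GENOME_BP, TYPICAL_GENE_BP)

# 22,287 loci and ~30 kb average extent are quoted on later slides.
fraction = genic_fraction(22_287, TYPICAL_GENE_BP, GENOME_BP)

print(f"naive estimate: {estimate:,} genes")
print(f"genes would span roughly {fraction:.0%} of the genome")
```

The naive figure implicitly assumes the genome is wall-to-wall genes; with the later gene count, genes span only a minority of the sequence, which accounts for most of the overestimate.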

47 Identification of Protein Coding Genes
Draft genome sequence. Initial estimate: – genes
Finished genome sequence. The human gene catalogue contains 22,287 gene loci, consisting of 19,438 known genes and 2,188 predicted genes.
Current estimate: – genes is an upper limit; the actual number may be ~23,000.
Consistent with gene counts in other vertebrates: fish and chicken.
Reprinted from: International Human Genome Sequencing Consortium Nature 431, (2004)

48 Basic Characteristics of Gene Structures
Mean and median values of gene structures Based on the draft sequence In particular, the UTRs in the RefSeq database are incomplete Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)

49 Protein Coding Genes General features of human genes
Average coding length of about 1,400 bp: similar to all eukaryotic organisms.
Average genomic extent of about 30 kb: much larger than in lower eukaryotes.
Variation in gene and intron size: GC-rich regions are gene-dense with many compact genes; AT-rich regions are gene-poor, with genes containing large introns.
Known and predicted exons: ~1.2% of the human genome. Average of 10.4 exons per gene.
Pseudogenes. Current estimates: processed and unprocessed pseudogenes. The total number of pseudogenes is thus likely to exceed the total number of functional genes. Only those of recent origin can be identified with confidence.
Reprinted from: International Human Genome Sequencing Consortium Nature 431, (2004)

50 Basic Characteristics of Gene Structures
High variation in overall intron size distribution has very long tails Many genes are over 100 kb long Largest gene: dystrophin gene (DMD) 2.4 Mb longest known coding sequence: titin gene 80,780 bp, 178 exons Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)

51 Comparison with fly, worm and yeast
Apparent homologues of human proteins: 40% to 60% of the yeast, worm and fly proteomes.
Human genes differ from those in worm and fly: they are spread out over much larger regions of genomic DNA; they have a substantially larger number of exons (4.5 to 5 in fly and worm compared to 10.4 in human); and they are used to construct more alternative transcripts.
Larger number of proteins in human than in the worm or fly: increased complexity of the proteome.
The complexity of the human proteome is a consequence of large-scale protein innovation: multi-domain proteins with multiple functions, and domain architectures.
Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)

52 Protein Coding Gene Evolution in Human
Gene birth in the human lineage gene duplications that arose after divergence from the mouse Identified 1,183 gene clusters containing 3,300 recently duplicated genes ( with a peak 3–4 million years ago ) enriched in genes with immune function olfactory function reproductive functions Duplicated genes are the raw material for adaptive evolution: extra copies are able to undergo functional divergence in response to positive selection Gene death in the human lineage Recently inactivated genes include genes in olfactory function Reprinted from: International Human Genome Sequencing Consortium Nature 431, (2004)

53 Future Perspectives
Vertebrate genome sequencing projects ongoing or planned (currently totaling 25)
Fish: zebrafish, salmon, tilapia, stickleback and Japanese medaka
Amphibians: Xenopus laevis and X. tropicalis
Birds: turkey
Mammals: ~15 additional species: cow, pig, cat, dog, horse, rabbit, guinea pig, elephant, kangaroo, shrew, ...
Primates: chimp, orangutan, baboon and rhesus monkey
Source: GOLD™ Genomes OnLine Database

54 Recommended reading Genome sequences
The sequencing of the human genome International Human Genome Sequencing Consortium Nature 409, 860 (2001) International Human Genome Sequencing Consortium Nature 431, (2004) The sequencing of the mouse genome Mouse Genome Sequencing Consortium, Nature 420, (2002) Chicken genome sequence International Chicken Genome Sequencing Consortium, Nature 432, (2004)

55 Further reading Vertebrate genome sequences Fish genome sequences
Aparicio et al., Science, 297, (2002)
Jaillon et al., Nature 431, (2004)
Rat genome sequence: RGSPC, Nature 428: (2004)
Human genome sequence, Celera Genomics (private initiative): Venter et al., Science, 291, 1304 (2001)

