Completed Genomes: Viruses and Bacteria Monday, October 20, 2003 Introduction to Bioinformatics ME:440.714 J. Pevsner

1 Completed Genomes: Viruses and Bacteria Monday, October 20, 2003 Introduction to Bioinformatics ME:440.714 J. Pevsner pevsner@jhmi.edu

2 Many of the images in this powerpoint presentation are from Bioinformatics and Functional Genomics by J Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by Wiley. These images and materials may not be used without permission from the publisher. Visit http://www.bioinfbook.org Copyright notice

3 We are now beginning the last third of the course: Today: completed genomes (Chapters 12-14) Wednesday: Fungi. Exam #2 is due at the start of class. Next Monday: Functional genomics (Jef Boeke) Next Wednesday: Pathways (Joel Bader) Monday Nov. 3: Eukaryotic genomes Wednesday Nov. 5: Human genome Monday Nov. 10: Human disease Wednesday Nov. 12: Final exam (in class) Announcements

4 Genome projects (Chapter 12) chronological overview major issues and themes Introduction to viruses (Chapter 13) classification bioinformatics challenges and resources Introduction to bacteria and archaea (Chapter 14) classification bioinformatics challenges and resources Outline of today’s lecture

5 A genome is the collection of DNA that comprises an organism. Today we have assembled the sequence of hundreds of genomes. We will begin by introducing the “tree of life” in an effort to make a comprehensive survey of life forms. Introduction to genomes Page 397

6 Ernst Haeckel (1834-1919), a supporter of Darwin, published a tree of life (1879) including Moner (formless clumps, later named bacteria). Chatton (1937) distinguished prokaryotes (bacteria that lack nuclei) from eukaryotes (having nuclei). Whittaker and others described the five-kingdom system: animals, plants, protists, fungi, and monera. In the 1970s and 1980s, Carl Woese and colleagues described the archaea, thus forming a tree of life with three main branches. Introduction: Systematics Page 399

7 plants animals monera fungi protists protozoa invertebrates vertebrates mammals Five kingdom system (Haeckel, 1879) Page 396

8 Fig. 12.1 Page 400 Pace (2001) described a tree of life based on small subunit rRNA sequences. This tree shows the main three branches described by Woese and colleagues.

9 Historically, trees were generated primarily using characters provided by morphological data. Molecular sequence data are now commonly used, including highly conserved sequences such as small-subunit rRNAs. Visit the European Small Subunit Ribosomal RNA database for 20,000 SSU rRNA sequences. Molecular sequences as basis of trees Page 401
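
To make the idea of sequence-based trees concrete, here is a minimal sketch of the first step most distance methods share: turning an alignment into pairwise distances. It computes the p-distance (fraction of differing, ungapped sites). The three short "rRNA" sequences and the taxon names are invented for illustration, and a real analysis would start from a curated SSU rRNA alignment and apply a substitution model.

```python
from __future__ import annotations
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip alignment gaps
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differing / compared


def distance_matrix(aligned: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise p-distances for every pair of taxa in the alignment."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }


# Hypothetical toy alignment (not real rRNA data).
toy_alignment = {
    "taxon_A": "ACGUACGUAC",
    "taxon_B": "ACGUACGUAA",
    "taxon_C": "ACGAAC-UGA",
}

for pair, d in distance_matrix(toy_alignment).items():
    print(pair, round(d, 3))
```

A matrix like this is the input to distance-based tree-building methods; the smaller the distance, the closer two taxa are placed.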

10 Genomes that span the tree of life are being sequenced at a rapid rate. There are several web-based resources that document the progress, including:
-- GNN (Genome News Network) http://www.genomenewsnetwork.org/main.shtml
-- GOLD (Genomes Online Database) http://wit.integratedgenomics.com/GOLD/
-- PEDANT (Protein Extraction, Description & Analysis Tool) http://pedant.gsf.de/
Genome sequencing projects Page 405

11 There are three main resources for genomes:
-- EBI (European Bioinformatics Institute) http://www.ebi.ac.uk/genomes/
-- NCBI (National Center for Biotechnology Information) http://www.ncbi.nlm.nih.gov
-- TIGR (The Institute for Genomic Research) http://www.tigr.org
Genome sequencing projects Page 405

12 archaea bacteria eukaryota http://www.ncbi.nlm.nih.gov/Entrez/

13 Overview of viral complete genomes

14 Overview of archaea complete genomes

15 Overview of eukaryota genomes in NCBI’s Entrez division

16 Overview of eukaryota genomes in NCBI’s Entrez division

17 We will next summarize the major achievements in genome sequencing projects from a chronological perspective. Chronology of genome sequencing projects Page 404

18 1977: first viral genome. Sanger et al. sequence bacteriophage φX174. This virus is 5386 base pairs (encoding 11 genes). See accession J02482. 1981: human mitochondrial genome, 16,500 base pairs (encodes 13 proteins, 2 rRNAs, 22 tRNAs). Today, over 400 mitochondrial genomes have been sequenced. 1986: chloroplast genome, 156,000 base pairs (most are 120 kb to 200 kb). Chronology of genome sequencing projects Page 406
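
As a hedged sketch of how one might retrieve the record for accession J02482 programmatically, the snippet below builds a request URL for NCBI's E-utilities `efetch` service. The slides describe browsing Entrez by hand, so the use of E-utilities is an assumption about tooling, not part of the lecture. The code only constructs the URL and does not contact the network.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str) -> str:
    """Build an efetch URL that requests a nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",      # nucleotide database
        "id": accession,      # accession number, e.g. J02482
        "rettype": "fasta",   # return the sequence in FASTA format
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


url = efetch_fasta_url("J02482")
print(url)
```

Opening the printed URL in a browser, or passing it to `urllib.request.urlopen`, should return the sequence as plain text.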

19 Fig. 12.6 Page 407 Entrez nucleotide record for bacteriophage φX174 (graphics display)

20 mitochondrion chloroplast Lack mitochondria (?)

21 1995: first genome of a free-living organism, the bacterium Haemophilus influenzae Chronology of genome sequencing projects Page 409

22 1995: genome of the bacterium Haemophilus influenzae is sequenced Fig. 12.9 Page 411

23

24 Overview of bacterial complete genomes

25 Fig. 12.9 Page 411 You can find functional annotation through the COGs database (Clusters of Orthologous Genes)

26 Fig. 12.9 Page 411 Click the circle to access the genome sequence

27 Fig. 12.10 Page 412 Click the circle to access the genome sequence Genes are color-coded according to the COGs scheme

28 1996: first eukaryotic genome The complete genome sequence of the budding yeast Saccharomyces cerevisiae was reported. We will describe this genome on Wednesday. Also in 1996, TIGR reported the sequence of the first archaeal genome, Methanococcus jannaschii. Chronology of genome sequencing projects Page 413

29 1996: a yeast genome is sequenced

30 To place the sequencing of the yeast genome in context, these are the eukaryotes…

31 Eukaryotes (Baldauf et al. 2000) Fungi

32 1997: More bacteria and archaea Escherichia coli 4.6 megabases, 4200 proteins (38% of unknown function) 1998: first multicellular organism Nematode Caenorhabditis elegans 97 Mb; 19,000 genes. 1999: first human chromosome Chromosome 22 (49 Mb, 673 genes) Chronology of genome sequencing projects Page 413

33

34 1999: Human chromosome 22 sequenced

35 49 Mb, 673 genes

36 2000: Fruitfly Drosophila melanogaster (13,000 genes) Plant Arabidopsis thaliana Human chromosome 21 2001: draft sequence of the human genome (public consortium and Celera Genomics) Chronology of genome sequencing projects Page 415

37

38

39 2000

40 Completed genome projects (current)
Eukaryotes: 10 (Anopheles gambiae, Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Encephalitozoon cuniculi, Guillardia theta nucleomorph, Mus musculus, Plasmodium falciparum, Saccharomyces cerevisiae (yeast), Schizosaccharomyces pombe)
In progress (partial): Danio rerio (zebrafish), Glycine max (soybean), Hordeum vulgare (barley), Leishmania major, Rattus norvegicus
Viruses: 1419
Bacteria: 139
Archaea: 36
Page 417

41 eukaryotes

42 [1] Selection of genomes for sequencing [2] Sequence one individual genome, or several? [3] How big are genomes? [4] Genome sequencing centers [5] Sequencing genomes: strategies [6] When has a genome been fully sequenced? [7] Repository for genome sequence data [8] Genome annotation Overview of genome analysis Page 418

43 Fig. 12.11 Page 418

44 [1] Selection of genomes for sequencing is based on criteria such as: genome size (some plant genomes are far larger than the human genome); cost; relevance to human disease (or other disease); relevance to basic biological questions; relevance to agriculture. Overview of genome analysis Page 419

45 [1] Selection of genomes for sequencing is based on criteria such as: genome size (some plant genomes are far larger than the human genome); cost; relevance to human disease (or other disease); relevance to basic biological questions; relevance to agriculture. Ongoing projects: Chicken, Chimpanzee, Cow, Dog (recent publication), Fungi (many), Honey bee, Sea urchin, Rhesus macaque. Overview of genome analysis Page 419

46 [2] Sequence one individual genome, or several? Try one… --Each genome center may study one chromosome from an organism --It is necessary to measure polymorphisms (e.g. SNPs) in large populations (November 5) For viruses, thousands of isolates may be sequenced. For the human genome, cost is the impediment. Overview of genome analysis Page 419

47 [3] How big are genomes? Viral genomes: 1 kb to 350 kb (Mimivirus: 800 kb). Bacterial genomes: 0.5 Mb to 13 Mb. Eukaryotic genomes: 8 Mb to 686,000 Mb (discussed further on Monday, November 3). Overview of genome analysis Page 420

48 Genome sizes in nucleotide base pairs (figure; log scale from 10^4 to 10^11) for viruses, plasmids, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals. The size of the human genome is ~3 × 10^9 bp; almost all of its complexity is in single-copy DNA. The human genome is thought to contain ~30,000-40,000 genes. http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt

49 [4] 20 Genome sequencing centers contributed to the public sequencing of the human genome. Many of these are listed at the Entrez genomes site. (See Table 17.6, page 625.) Overview of genome analysis Page 421

50 [5] There are two main strategies for sequencing genomes. Whole Genome Shotgun (from the NCBI website): An approach used to decode an organism's genome by shredding it into smaller fragments of DNA which can be sequenced individually. The sequences of these fragments are then ordered, based on overlaps in the genetic code, and finally reassembled into the complete sequence. The 'whole genome shotgun' (WGS) method is applied to the entire genome all at once, while the 'hierarchical shotgun' method is applied to large, overlapping DNA fragments of known location in the genome. Overview of genome analysis Page 421
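
The "order by overlaps, then reassemble" step can be sketched with a toy greedy assembler. This is an illustration only, under the assumptions that reads are error-free, come from one strand, and the sequence has no repeats; it is not how production assemblers work. The 18-base "genome" and its three reads are invented.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the result stays as several contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical toy genome, "shredded" into overlapping reads in shuffled order.
genome = "ATGCGTACCTGAAGTCCA"
reads = ["GAAGTCCA", "ATGCGTAC", "TACCTGAA"]
contigs = greedy_assemble(reads)
print(contigs)
```

When the loop ends with more than one fragment left, those pieces correspond to separate contigs with gaps between them, which is the situation a draft sequence describes.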

51 Hierarchical shotgun method Assemble contigs from various chromosomes, then sequence and assemble them. A contig is a set of overlapping clones or sequences from which a sequence can be obtained. The sequence may be draft or finished. A contig is thus a chromosome map showing the locations of those regions of a chromosome where contiguous DNA segments overlap. Contig maps are important because they provide the ability to study a complete, and often large segment of the genome by examining a series of overlapping clones which then provide an unbroken succession of information about that region. Overview of genome analysis Page 421

52 [6] When has a genome been fully sequenced? A typical goal is to obtain five to ten-fold coverage. Finished sequence: a clone insert is contiguously sequenced with high quality standard of error rate 0.01%. There are usually no gaps in the sequence. Draft sequence: clone sequences may contain several regions separated by gaps. The true order and orientation of the pieces may not be known. Overview of genome analysis Page 422
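
The coverage and error-rate figures above lend themselves to quick arithmetic. The sketch below assumes the usual definition of fold coverage (total bases sequenced divided by genome size) and the Poisson approximation associated with Lander and Waterman for the fraction of bases left uncovered; neither formula is stated on the slide. The 4.6 Mb genome size is the E. coli figure from slide 32, and the 500 bp read length is an assumed value.

```python
import math


def fold_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target fold coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


def expected_uncovered_fraction(coverage: float) -> float:
    """Poisson approximation: probability that a given base is never sequenced."""
    return math.exp(-coverage)


def expected_errors(length: int, error_rate: float = 1e-4) -> float:
    """Expected number of wrong bases at a per-base error rate (0.01% = 1e-4)."""
    return length * error_rate


GENOME_SIZE = 4_600_000  # E. coli, 4.6 Mb (slide 32)
READ_LENGTH = 500        # assumed read length in bp

n = reads_needed(10, READ_LENGTH, GENOME_SIZE)
print(n, fold_coverage(n, READ_LENGTH, GENOME_SIZE))
print(expected_uncovered_fraction(5), expected_uncovered_fraction(10))
print(expected_errors(GENOME_SIZE))
```

The uncovered fraction falls off exponentially with coverage, which is one way to see why the goal is set at five- to ten-fold rather than one- or two-fold.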

53 [7] Repository for genome sequence data Raw data from many genome sequencing projects are stored at the trace archive at NCBI or EBI (main NCBI page, bottom right) Overview of genome analysis Page 425

54 Fig. 12.14 Page 426

55 Fig. 12.14 Page 426

56 [8] Genome annotation Information content in genomic DNA includes: -- repetitive DNA elements -- nucleotide composition (GC content) -- protein-coding genes, other genes These topics will be discussed in detail on November 3 (eukaryotic genomes) Overview of genome analysis Page 425

57 Fig. 12.16 Page 428 GC content varies across genomes. (Figure: number of species in each GC class, plotted against GC content from 20% to 80%, for vertebrates, invertebrates, plants, and bacteria.)
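
GC content is simple to compute directly. The following is a minimal sketch with hypothetical function names that reports the GC percentage of a sequence and of fixed-size windows along it; ambiguous bases such as N are ignored. The example sequences are invented.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    at = sum(seq.count(base) for base in "ATU")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, T or U")
    return 100.0 * gc / total


def windowed_gc(seq: str, window: int, step: int):
    """GC percentage in successive windows, as (start, percent) pairs."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


print(gc_content("GGCCAATT"))
print(windowed_gc("GGGGAAAA", window=4, step=4))
```

Whole-genome values computed this way are what a histogram of species per GC class summarizes; the windowed version shows how composition varies within one genome.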

58

59 Viruses are small, infectious, obligate intracellular parasites. They depend on host cells to replicate. Because they lack the resources for independent existence, they exist on the borderline of the definition of life. The virion (virus particle) consists of a nucleic acid genome surrounded by coat proteins (capsid) that may be enveloped in a host-derived lipid bilayer. Viral genomes consist of either RNA or DNA. They may be single-stranded, double-stranded, or partially double-stranded. The genomes may be circular, linear, or segmented. Introduction to viruses Page 437

60 Viruses have been classified by several criteria: -- based on morphology (e.g. by electron microscopy) -- by type of nucleic acid in the genome -- by size (rubella is about 2 kb; HIV-1 about 9 kb; poxviruses are several hundred kb). Mimivirus (for Mimicking microbe) has a double-stranded circular genome of 800 kb. -- based on human disease Page 438 Introduction to viruses

61 Fig. 13.1 Page 439

62 Fig. 13.2 Page 440 The International Committee on Taxonomy of Viruses (ICTV) offers a website, accessible via NCBI’s Entrez site http://www.ncbi.nlm.nih.gov/ICTVdb/

63 Vaccine-preventable viral diseases include: Hepatitis A Hepatitis B Influenza Measles Mumps Poliomyelitis Rubella Smallpox Page 441 Introduction to viruses

64 Some of the outstanding problems in virology include: -- Why does a virus such as HIV-1 infect one species (human) selectively? -- Why do some viruses change their natural host? In 1997 a chicken influenza virus killed six people. -- Why are some viral strains particularly deadly? -- What are the mechanisms of viral evasion of the host immune system? -- Where did viruses originate? Bioinformatic approaches to viruses Page 439-441

65 The unique nature of viruses presents special challenges to studies of their evolution: viruses tend not to survive in historical samples; viral polymerases of RNA genomes typically lack proofreading activity; viruses undergo an extremely high rate of replication; many viral genomes are segmented, so shuffling may occur; viruses may be subjected to intense selective pressures (host immune responses, antiviral therapy); viruses invade diverse species; the diversity of viral genomes precludes us from making comprehensive phylogenetic trees of viruses. Diversity and evolution of viruses Page 441

66 Herpesviruses are double-stranded DNA viruses that include herpes simplex, cytomegalovirus, and Epstein-Barr. Phylogenetic analysis suggests three major groups that originated about 180-220 MYA. Bioinformatic approaches to herpesvirus Page 442

67 Fig. 13.3 Page 443

68 Consider human herpesvirus 8 (HHV-8). Its genome is about 140,000 base pairs and encodes about 80 proteins. We can explore this virus at the NCBI website. Try NCBI → Entrez → Genomes → viruses → dsDNA. Bioinformatic approaches to herpesvirus Page 442

69 Fig. 13.4 Page 444

70 Fig. 13.5 Page 445

71 Fig. 13.10 Page 449

72 Consider human herpesvirus 8 (HHV-8). Its genome is about 140,000 base pairs and encodes about 80 proteins. Microarrays have been used to define changes in viral gene expression at different stages of infection (Paulose-Murphy et al., 2001). Conversely, gene expression changes have been measured in human cells following viral infection. Bioinformatic approaches to herpesvirus Page 442

73 Fig. 13.11 Page 450 Paulose-Murphy et al. (2001) described HHV-8 viral genes that are expressed at different times post infection

74 Human Immunodeficiency Virus (HIV) is the cause of AIDS. At the end of the year 2002, 42 million people were infected. HIV-1 and HIV-2 are primate lentiviruses. The HIV-1 genome is 9181 bases in length. Note that there are almost 100,000 Entrez nucleotide records for this genome (but only one RefSeq entry). Phylogenetic analyses suggest that HIV-2 arose by cross-species transmission of a simian virus, SIVsm (sooty mangabey). Similarly, HIV-1 arose from the simian immunodeficiency virus of the chimpanzee (SIVcpz). Bioinformatic approaches to HIV Page 446

75 Fig. 13.6 Page 446

76 Two major resources are NCBI and the Los Alamos National Laboratory (LANL) databases. See http://hiv-web.lanl.gov/ LANL offers -- an HIV BLAST server -- Synonymous/non-synonymous analysis program -- a multiple alignment program -- a PCA-like tool -- a geography tool Bioinformatic approaches to HIV Page 453
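
To show what a synonymous/non-synonymous analysis looks at, here is a simplified sketch that compares two aligned coding sequences codon by codon and counts changes that keep or alter the encoded amino acid. It is not the LANL program and does not compute normalized dS/dN rates; it only classifies observed codon differences using the standard genetic code. The two short sequences are invented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def count_substitutions(ref: str, alt: str):
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    if len(ref) != len(alt) or len(ref) % 3 != 0:
        raise ValueError("sequences must be in-frame and of equal length")
    syn = nonsyn = 0
    for i in range(0, len(ref), 3):
        c1, c2 = ref[i:i + 3].upper(), alt[i:i + 3].upper()
        if c1 == c2:
            continue
        if c1 not in CODON_TABLE or c2 not in CODON_TABLE:
            continue  # skip codons with gaps or ambiguous bases
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# Hypothetical example: Met-Ala-Lys versus Met-Ala-Arg.
print(count_substitutions("ATGGCTAAA", "ATGGCCAGA"))
```

An excess of non-synonymous changes in a gene is one signature of the intense selective pressure on viruses mentioned earlier in the lecture.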

77 Fig. 13.13 Page 452

78 Fig. 13.6 Page 446

79 Bacteria and archaea constitute two of the three main branches of life. Together they are the prokaryotes. We can classify prokaryotes based on six criteria: [1] morphology [2] genome size [3] lifestyle [4] relevance to human disease [5] molecular phylogeny (rRNA) [6] molecular phylogeny (other molecules) Bacteria and archaea: genome analysis Page 466

80 Fig. 14.1 Page 468

81 Fig. 14.2 Page 470 M. genitalium has one of the smallest bacterial genome sizes. View its genome at www.tigr.org

82 We may distinguish six prokaryotic lifestyles: [1] Extracellular (e.g. E. coli) [2] Facultatively intracellular (e.g. Mycobacterium tuberculosis) [3] Extremophilic (e.g. M. jannaschii) [4] Epicellular (e.g. Mycoplasma pneumoniae) [5] Obligate intracellular and symbiotic (e.g. B. aphidicola) [6] Obligate intracellular and parasitic (e.g. Rickettsia) Bacteria and archaea: lifestyles Page 472

83 Fig. 14.4 Page 477

84 Fig. 14.5 Page 478 Revised figure

85 Fig. 14.6 Page 479

86 DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae Nature 406, 477- 483 (2000)

87

88

89

90

91

92

93 Four main features of genomic DNA are useful: [1] Open reading frame length [2] Consensus for ribosome binding (Shine-Dalgarno) [3] Pattern of codon usage [4] Homology of putative gene to other genes Bacteria and archaea: finding genes Page 480
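
Feature [1], open reading frame length, is the easiest of the four to demonstrate. The sketch below scans all six reading frames for stretches that begin with ATG and end at a stop codon, keeping those above a minimum length. It is a deliberately simple illustration and not the method used by GLIMMER; it ignores alternative start codons, ribosome binding sites, and codon usage. The test sequence is invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3):
    """Find ATG...stop ORFs in all six frames.

    Returns (strand, start, end, orf_sequence) tuples. Coordinates are
    0-based and refer to the strand that was scanned; min_codons counts
    codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


print(find_orfs("CCATGAAACCCGGGTAGCC"))
```

Because stop codons occur often by chance in random sequence, long ORFs are unlikely to arise by accident, which is why ORF length is a useful first filter in bacterial and archaeal genomes.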

94 Fig. 14.7 Page 482 GLIMMER for gene-finding in bacteria (www.tigr.org)

95 Fig. 14.8 Page 484 Lateral gene transfer occurs in stages

96 COGs database: organisms and tools

97 COGs database: functional annotation

98

99 COGs database: distribution of COGs by number of species COGs database: distribution of COGs by number of clades...

100

101 How can whole genomes be compared? -- molecular phylogeny -- You can BLAST (or PSI-BLAST) all the DNA and/or protein in one genome against another -- TaxPlot and COG for bacterial (and for some eukaryotic) genomes -- PipMaker, MUMmer and other programs align large stretches of genomic DNA from multiple species
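
One simple way to picture large-scale genome alignment is a dot plot of exact shared words. The sketch below indexes every k-mer in one sequence and reports the positions where the same k-mer occurs in a second sequence. It is a toy illustration and not the algorithm used by PipMaker or MUMmer; the sequences and the choice of k = 4 are invented for the example.

```python
def kmer_positions(seq: str, k: int):
    """Map each k-mer in seq to the list of positions where it starts."""
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index


def shared_kmer_matches(a: str, b: str, k: int):
    """Return (position_in_a, position_in_b) for every exact shared k-mer."""
    index = kmer_positions(a, k)
    return [
        (i, j)
        for j in range(len(b) - k + 1)
        for i in index.get(b[j:j + k], [])
    ]


matches = shared_kmer_matches("ACGTACGGA", "TTACGGAT", k=4)
print(matches)
```

Runs of matches where both coordinates increase together form a diagonal, which marks a conserved, collinear region; breaks or reversed diagonals would indicate rearrangements between the two genomes.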

102 Fig. 14.16 Page 493

103 Fig. 14.16 Page 493

104 Fig. 14.17 Page 494

105 Fig. 14.18 Page 495

