Mathematics and the Genome Winfried Just Department of Mathematics and Quantitative Biology Institute Ohio University.

This talk is dedicated to the memory of Dr. Pawel Zbierski, one of the great teachers in my life.

Biology’s dilemma: There is too much to know about living things
Roughly 1.5 million species of organisms have been described and given scientific names to date. Some biologists estimate that the total number of all living species may be several times higher. It is impossible to learn everything about all these organisms. Biologists solve the dilemma by focusing on some species, so-called model organisms, and trying to find out as much as they can about these model organisms.

Some important model organisms
- Mammals: Homo sapiens, chimpanzee, mouse, rat
- Fish: zebrafish, pufferfish
- Insects: fruitfly (Drosophila melanogaster)
- Roundworms: Caenorhabditis elegans
- Protista: malaria parasite (Plasmodium falciparum)
- Fungi: yeasts (Saccharomyces cerevisiae, Schizosaccharomyces pombe)
- Plants: thale cress (Arabidopsis thaliana), corn, rice
- Bacteria: Escherichia coli, Salmonella
- Archaea: Methanococcus jannaschii

Let’s find out everything about some species
What would it mean to learn everything about a given species? All available evidence indicates that the complete blueprint for making an organism is encoded in the organism’s genome. Chemically, the genome consists of one or several DNA molecules. These are long strings composed of pairs of nucleotides. There are only four different nucleotides, denoted by A, C, G, T. The information about how to make the organism is encoded by the order in which the nucleotides appear.

Some genome sizes
- HIV-2 virus: 9671 bp
- Mycoplasma genitalium: 5.8 · 10^5 bp
- Haemophilus influenzae: 1.83 · 10^6 bp
- Saccharomyces cerevisiae: 1.21 · 10^7 bp
- Caenorhabditis elegans: 10^8 bp
- Drosophila melanogaster: 1.65 · 10^8 bp
- Homo sapiens: 3.14 · 10^9 bp
- Some amphibians: 8 · 10^[…] bp
- Amoeba dubia: 6.7 · 10^[…] bp

Sequencing Genomes
Contemporary technology makes it possible to completely sequence entire genomes, that is, to determine the sequence of A’s, C’s, G’s, and T’s in the organism’s genome. The first viral genomes were sequenced in the 1970s, the first bacterium (Haemophilus influenzae) in 1995, and the first multicellular organism (Caenorhabditis elegans) in 1998. The rough draft of the human genome was announced in June 2000.

How rough is the draft of the human genome?
- Announced June 2000
- Covers about 95% of the genome
- Contains more than 100,000 gaps
- Public version: started 1990; based on the genome of one person
- Celera version: started 1998; based on the genomes of five persons

Where to store all these data?
Some of the sequence data are stored in proprietary databases, but most of them are stored in the public database GenBank and can be accessed via the World Wide Web. In fact, most relevant journals require proof of submission to GenBank before they will publish an article discussing sequence data. A notable exception was the publication of Celera’s announcement in Science.

What’s in the databases?
As of February 20, 2000, GenBank contained 5,861,088,510 bp of information. There were about 600 completely sequenced viruses, 19 completely sequenced bacteria, 6 completely sequenced archaea, and 3 complete genomes of eukaryotes: S. cerevisiae (baker’s yeast), C. elegans (a roundworm), and Drosophila melanogaster (fruitfly).

What’s in the databases?
As of November 23, 2000, GenBank contained 10,853,673,034 bp of information. There were about 600 completely sequenced viruses, 29 completely sequenced bacteria, 8 completely sequenced archaea, and 3 complete genomes of eukaryotes: S. cerevisiae (baker’s yeast), C. elegans (a roundworm), and Drosophila melanogaster (fruitfly). The genome of Arabidopsis (thale cress) was near completion, and a first draft of the human genome had been completed.

What’s in the databases?
As of March 18, 2002, GenBank contained 20,197,497,568 bp of information. There were about 700 completely sequenced viruses, 63 completely sequenced bacteria, 13 completely sequenced archaea, and 5 complete genomes of eukaryotes: S. cerevisiae and S. pombe (two yeasts), C. elegans (a roundworm), Drosophila melanogaster (fruitfly), and Arabidopsis thaliana (thale cress), as well as a draft of the human genome.

First mathematical challenge: Sequencing large genomes
Currently, much of the sequencing process is automated. However, contemporary sequencing machines can only sequence stretches of DNA that are a few hundred base pairs long at a time. The process of assembling these stretches of sequence into a whole genome poses some interesting mathematical problems.

First mathematical challenge: Sequencing large genomes
For example, the publicly financed Human Genome Project (HGP) used an approach called genome mapping to facilitate sequence assembly. Most of the time the HGP took was in fact spent on constructing the scaffold of this map. In contrast, Celera Genomics allegedly used an approach called shotgun sequencing, which works by randomly cutting up the genome into small stretches, sequencing them, and then using a clever algorithm to assemble the whole genome. There was much debate over the feasibility of the latter approach, but it apparently worked.
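To give a feel for the assembly problem, here is a minimal sketch of one textbook idea: repeatedly merge the two fragments with the longest suffix-prefix overlap. It is an illustration only and is not claimed to be the algorithm Celera used. The function names and the `min_len` parameter are made up for this example, and the sketch assumes error-free reads from a single strand with no repeated regions, which real genomes do not satisfy.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Overlaps shorter than `min_len` are treated as chance matches and ignored.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Greedy assembly sketch: merge the best-overlapping pair until none is left.

    Returns the list of remaining contigs (a single string if everything merged).
    """
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no usable overlaps remain; leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping fragments of the toy "genome" ACGTTGCAAGGCTA
print(greedy_assemble(["ACGTTGCA", "TTGCAAGG", "AAGGCTA"]))
```

The quadratic pairwise comparison is fine for a handful of toy reads, but it hints at why assembling millions of fragments, many of them from repetitive regions, is a serious algorithmic problem.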

You have sequenced your genome - what do you do with it?
The next step is known as genome analysis or sequence analysis. At present, most of bioinformatics is concerned with sequence analysis. Here are some of the questions studied in sequence analysis:
- gene finding
- protein 3D structure prediction
- gene function prediction
- prediction of important sites in proteins
- reconstruction of phylogenetic trees

How the genome controls the organism
The genome controls the making and workings of an organism by telling the cell which proteins to manufacture under which conditions. Proteins are the workhorses of biochemistry and play a variety of roles. Most notably, many proteins are enzymes that catalyze specific chemical reactions. Almost all biochemical reactions in the cell are catalyzed by enzymes. A gene is a stretch of DNA that codes for a given protein.

Where are the genes?
The objective of gene finding is to identify the regions of DNA that are genes. Ideally, we want to make statements like: “Positions 28,354 through 29,536 of this genome code for a protein.” Once we have identified a gene, it is easy to translate the DNA code into the sequence of amino acids that make up the corresponding protein. The mathematical challenge here is to identify patterns in DNA that reliably indicate where a gene starts and ends, especially in eukaryotes.

Hidden Markov Models for gene finding (caricature)
Most current gene finding programs are based on Hidden Markov Models. These work as follows: assume (wrongly) that the DNA sequence has been generated randomly by a Markov model that can be in one of two states: “gene” or “intergenic region.” Each state has a characteristic probability of “emitting” a given nucleotide, and has a characteristic (low) probability of switching to the other state. The observer sees the sequence of emissions (nucleotides), but which state emitted a given nucleotide is hidden from the observer.

Hidden Markov Models for gene finding (caricature)
Now the observer wants to infer the actual sequence of states of the Markov model that caused the observed emissions. This sequence is called the path through the hidden Markov model. The (posterior) probability of any given path is easy to calculate, and it is computationally inexpensive to infer the most likely path for a given sequence of emissions (using the so-called Viterbi algorithm). This path gives some hypothesis for the location of the genes. It is also easy to calculate probabilities that a predicted gene is actually a gene (under the assumptions of the model).
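The caricature above can be written down in a few lines. The sketch below runs the Viterbi algorithm on a two-state model with the states “gene” and “intergenic”. All start, transition, and emission probabilities are invented for illustration (a GC-rich “gene” state and an AT-rich “intergenic” state); they are not taken from any real gene finder.

```python
import math

STATES = ("gene", "intergenic")

# Hypothetical parameters, chosen only to make the toy example readable.
START = {"gene": 0.5, "intergenic": 0.5}
TRANS = {
    "gene": {"gene": 0.9, "intergenic": 0.1},        # low switching probability
    "intergenic": {"gene": 0.1, "intergenic": 0.9},
}
EMIT = {
    "gene": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "intergenic": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq: str):
    """Return the most likely state path for `seq` under the toy model."""
    # score[s]: log-probability of the best path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for nucleotide in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # best predecessor state for ending in s at this position
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][nucleotide]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi("ATATATGCGCGCATATAT")
print("".join("G" if s == "gene" else "-" for s in path))
```

Working with logarithms avoids numerical underflow on long sequences, and the running time grows only linearly with the sequence length, which is why the most likely path is cheap to compute.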

Hidden Markov Models for gene finding - the real picture
In reality, the situation is much more complicated. Coding regions of genes are not characterized by frequencies of single nucleotides, but of triplets and hexamers of nucleotides. Additional information is also used, such as signals that indicate the beginning or end of a gene or a splicing site. Additional difficulties arise because of:
- existence of six possible reading frames
- existence of introns in eukaryotes
- variable codon usage frequencies in different species

A big mathematical challenge
The underlying assumption of Hidden Markov Models that DNA sequences are emitted by a Markov Model is obviously far removed from biological reality. So the question is: How can we construct gene finding tools that are based on biologically more meaningful assumptions? This has practical consequences for evaluating the probability that a predicted gene is actually a gene, or estimating the fraction of actual genes that have been identified as such by a given gene-finding algorithm.

What did the Hidden Markov Models find?
- Mycoplasma genitalium (bacterium): 500 genes
- Escherichia coli (bacterium): 4,500 genes
- Saccharomyces cerevisiae (yeast): 6,000 genes
- Caenorhabditis elegans (worm): 19,000 genes
- Drosophila melanogaster (fruitfly): 13,500 genes
- Arabidopsis thaliana (thale cress): 25,500 genes
- Homo sapiens (human): 24,000-40,000 genes
- Oryza sativa japonica (rice): 32,000-50,000 genes
- Oryza sativa indica (rice): 45,000-56,000 genes

So we know the genes - do we know everything?
Far from it. The next two questions are:
- Given a single gene, how does it function in the biology of an organism?
- How do various genes interact?

From genes to proteins
From the chemical point of view, proteins are long chains of chemicals called amino acids. There are 20 amino acids used to make most proteins in most organisms. Amino acids are coded by triplets of nucleotides, which are also called codons.
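As a small illustration of the codon idea, the sketch below translates a DNA coding sequence into one-letter amino acid codes using the standard genetic code. The function name and the choice to stop at the first stop codon are conventions of this example; the sketch also assumes the input is already in the correct reading frame and contains no introns.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order;
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate `dna` codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTGGTAA"))  # Met-Ala-Trp, then a stop codon
```

Since 4 · 4 · 4 = 64 codons encode only 20 amino acids plus a stop signal, several codons map to the same amino acid; this redundancy is what makes codon usage frequencies differ between species.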

Protein structure prediction
When a protein is manufactured in the cell, it assumes a characteristic 3D structure or fold. It is very costly to determine the 3D structure of a protein experimentally (by NMR or X-ray crystallography). It would be much cheaper if we could predict the 3D structure of a protein directly from its sequence of amino acids that is coded in the genome. This is known as the protein folding problem. Many approaches have been proposed to develop algorithms for solving this problem; so far results are mixed.

Protein structure prediction
In theory, it is possible to predict a protein fold ab initio, that is, from first principles. However, the task is beyond the capabilities of current supercomputers. Recently IBM announced plans to develop a new petaflop supercomputer (10^15 floating point operations per second) called “Blue Gene” that will be designed specifically with the task of ab initio protein fold prediction in mind. It should be just powerful enough for the task if no unexpected complications arise.

Prediction of gene function
Suppose you have identified a gene. What is its role in the biochemistry of its organism? Sequence databases can help us in formulating reasonable hypotheses.
- Search the database for genes with similar nucleotide sequences in other organisms.
- If the functions of the most similar genes are known and if they tend to be the same function (e.g., “codes for an enzyme involved in glucose metabolism”), then it is reasonable to conjecture that your gene also codes for an enzyme involved in glucose metabolism.

Prediction of gene function: homology searches
Given a nucleotide or amino acid sequence, searching the database(s) for similar sequences is known as a “homology search”. The most popular software tool for performing these searches is called BLAST; therefore biologists often speak of “BLAST searches”. There are two interesting problems here:
- How to measure “similarity” of two sequences.
- How much similarity constitutes evidence of biologically meaningful homology as opposed to random chance?
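One classical answer to the first problem is a local alignment score. The sketch below computes it with the Smith-Waterman dynamic program; this is named here as an illustrative choice, and it is not how BLAST works internally (BLAST uses faster heuristics to approximate such scores). The match, mismatch, and gap values are arbitrary assumptions of this example.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best score of any local alignment between `a` and `b` (Smith-Waterman).

    Only the score is returned, so two rows of the table suffice.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # 0 lets an alignment start fresh anywhere; the other options
            # extend an alignment with a (mis)match or with a gap
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(local_alignment_score("GGACGTGG", "TTACGTTT"))  # shared core "ACGT"
```

The second problem, deciding whether a given score is larger than chance would produce, needs a statistical model of scores for unrelated sequences and is not addressed by this sketch.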

Prediction of important sites in proteins
Not all parts of a protein are equally important; the function of most of its amino acids is often just to maintain an appropriate 3D structure, and mutations of those less crucial amino acids often don't have much effect. However, most proteins have crucial parts such as binding sites. Any mutations occurring at binding sites tend to be lethal and will be weeded out by evolution.

How to predict binding sites from sequence data:
- Get a collection of proteins of similar amino acid sequences and analogous biochemical function from your database.
- Align these sequences amino acid by amino acid.
- Check which regions of the protein are highly conserved in the course of evolution.
- The binding site should be in one of the highly conserved regions.
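The last two steps can be sketched as follows, assuming the alignment has already been computed and is given as equal-length strings with “-” for gaps. The conservation measure (fraction of sequences sharing the most common residue in a column), the threshold, and the minimum region length are simple assumptions of this example; the three sequences are made up.

```python
from collections import Counter


def column_conservation(aligned):
    """Per-column fraction of sequences that carry the most common residue."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]  # gaps never count as conserved
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def conserved_regions(scores, threshold=1.0, min_len=3):
    """Half-open (start, end) index ranges where scores stay >= threshold."""
    regions, start = [], None
    for i, s in enumerate(list(scores) + [0.0]):  # sentinel closes a trailing run
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions


alignment = ["MKTHCGAL", "MRSHCGVL", "MQAHCGIL"]  # hypothetical aligned proteins
print(conserved_regions(column_conservation(alignment)))
```

In this toy alignment the stretch “HCG” is identical in all three sequences, so it would be flagged as a candidate for a functionally important site.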

Using genomic data for reconstruction of phylogenies
A phylogenetic tree depicts the branching pattern in the evolution of contemporary species from their common ancestor. Given the sequences of homologous genes (i.e., genes derived from a common ancestral gene), one can try to reconstruct the phylogenetic tree for these species by looking at the amount of evolutionary change that has occurred at the molecular level and estimating the times at which any two of these species diverged.

Methods for reconstruction of phylogenies
- Distance methods
- Maximum parsimony
- Maximum likelihood
- Bayesian analysis
A big mathematical challenge is to devise fast and reliable algorithms for phylogenetic reconstruction for large sets of species, especially using maximum likelihood or Bayesian analysis.
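Distance methods start from an estimate of the evolutionary distance between each pair of aligned sequences. As one possible illustration, the sketch below computes the observed fraction of differing sites p and applies the Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3), which accounts for repeated substitutions at the same site. The Jukes-Cantor model is an assumption of this example (it treats all substitutions as equally likely); the talk does not name a specific model.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p: float) -> float:
    """Estimated substitutions per site under the Jukes-Cantor model."""
    if p >= 0.75:
        return math.inf  # saturation: the sequences look unrelated under this model
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGTAC", "ACGTACGTTT")
print(p, jukes_cantor(p))
```

The corrected distance always exceeds p for p > 0, and the gap widens as sequences diverge. A tree-building step (for example clustering on the resulting distance matrix) would follow, which this sketch leaves out.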

Reconstruction of phylogenies: A success story
There are two basic kinds of free-living organisms: prokaryotes, which do not have a cell nucleus, and eukaryotes, which do. Prokaryotes fall into two major groups: Eubacteria and Archaea. Phenotypically, eubacteria and archaea are very similar to each other. However, it has been demonstrated by using molecular data that archaea are more closely related to eukaryotes than to eubacteria, and thus it appears that the evolutionary branching between archaea and eubacteria occurred before the branching of archaea and eukaryotes.

Gene interactions: Collecting gene expression data
All cells of a multicellular organism have the same set of genes. What accounts for the differences between cell types and their functions is which of the genes are being expressed (switched on) at a given time in a given cell. A relatively new technology, called gene chips or microarrays, makes it possible to monitor, for tens of thousands of genes simultaneously, the differences in gene expression levels between two different experimental conditions.

Gene interactions: Interpreting gene expression data
Once gene expression data have been collected, it is possible to identify clusters of genes that have similar expression profiles, that is, are up- or downregulated under the same experimental conditions. One then conjectures that genes with similar expression profiles have similar functions, for example, are involved in the same biochemical pathways. Such conjectures can serve as powerful guides for setting up experiments to confirm the biochemical role of groups of genes.

Interpreting gene expression data: A mathematical challenge
Gene expression data sets are peculiar in the sense that we typically have very few experiments (5-10 perhaps) and a large number (tens of thousands) of monitored genes. It seems inevitable that some genes will show similar expression profiles just by random accident. Question: How can we tell spurious clusters of genes from biologically meaningful ones?
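A small simulation makes the worry concrete. The sketch below generates purely random expression profiles (independent Gaussian noise, so no gene is truly related to any other) and counts how many pairs of genes nevertheless have a Pearson correlation above a threshold. The use of Pearson correlation as the similarity measure, the Gaussian null model, and all numbers are assumptions of this example.

```python
import math
import random


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_pairs(n_genes: int, n_experiments: int,
                         threshold: float, seed: int = 0) -> int:
    """Count gene pairs whose random profiles correlate at least `threshold`."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    profiles = [[rng.gauss(0.0, 1.0) for _ in range(n_experiments)]
                for _ in range(n_genes)]
    count = 0
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            if pearson(profiles[i], profiles[j]) >= threshold:
                count += 1
    return count


# Same number of genes, few versus many experiments
print(count_spurious_pairs(200, 5, 0.9), count_spurious_pairs(200, 50, 0.9))
```

With only five experiments per gene, many unrelated pairs look strongly correlated; with fifty experiments such accidental agreement essentially disappears. Any method for judging clusters has to account for this effect.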

Gene expression profiles: A success story
Cancer patients with the same clinical picture often respond to very different types of treatment. Gene expression profiles of groups of cancer patients have revealed that what looks to the clinician like the same disease can sometimes be one of several diseases at the biochemical level. These can be distinguished by characteristic expression profiles of certain groups of genes. Once the biochemical nature of the disease has been established, treatment can be tailored to the type of disease a patient actually has.

The databases of the future
Genomic databases like GenBank are just the beginning. In the near future we will see:
- Gene expression data banks
- SNP (single nucleotide polymorphism) data banks
- Proteomics data banks
- Data banks of biochemical pathways
- …
Setting up these data banks and making intelligent use of them will require new mathematical tools, to be developed by the next generation of mathematicians.