
Tentative definition of bioinformatics Bioinformatics, often also called genomics, computational genomics, or computational biology, is a new interdisciplinary field at the intersection of biology, computer science, statistics, and mathematics. Its subject matter is the extraction of biologically useful information from large sets of molecular data, such as DNA or protein sequence data or gene expression data. The term “bioinformatics” is currently used mainly to refer to the extraction of information from sequence data, while the creation and analysis of gene expression data is called functional genomics.

Biology’s dilemma: There is too much to know about living things Roughly 1.5 million species of organisms have been described and given scientific names to date. Some biologists estimate that the total number of all living species may be several times higher. It is impossible to learn everything about all these organisms. Biologists solve the dilemma by focusing on some species, so-called model organisms, and trying to find out as much as they can about these model organisms.

Some important model organisms
- Mammals: human, chimpanzee, mouse, rat
- Fish: zebrafish, pufferfish
- Insects: fruit fly (Drosophila melanogaster)
- Roundworms: Caenorhabditis elegans
- Protista: malaria parasite (Plasmodium falciparum)
- Fungi: baker’s yeast (Saccharomyces cerevisiae)
- Plants: thale cress (Arabidopsis thaliana), corn, rice
- Bacteria: Escherichia coli, Mycoplasma genitalium
- Archaea: Methanococcus jannaschii

Let’s find out everything about some species What would it mean to learn everything about a given species? All available evidence indicates that the complete blueprint for making an organism is encoded in the organism’s genome. Chemically, the genome consists of one or several DNA molecules. These are long strings composed of pairs of nucleotides. There are only four different nucleotides, denoted by A, C, G, T. The information about how to make the organism is encoded by the order in which the nucleotides appear.

Some genome sizes
- HIV2 virus: 9671 bp
- Mycoplasma genitalium: 5.8 · 10^5 bp
- Haemophilus influenzae: 1.83 · 10^6 bp
- Saccharomyces cerevisiae: 1.21 · 10^7 bp
- Caenorhabditis elegans: 10^8 bp
- Drosophila melanogaster: 1.65 · 10^8 bp
- Homo sapiens: 3.14 · 10^9 bp
- Some amphibians: 8 · 10^? bp
- Amoeba dubia: 6.7 · 10^11 bp

Sequencing Genomes Contemporary technology makes it possible to completely sequence entire genomes, that is, to determine the sequence of A’s, C’s, G’s, and T’s in the organism’s genome. The first virus was sequenced in the 1970s, the first bacterium (Haemophilus influenzae) in 1995, and the first multicellular organism (Caenorhabditis elegans) in 1998. A draft of the human genome was announced in 2000.

Where to store all these data? In databases, of course. Some of the sequence data are stored in proprietary databases, but most are stored in the public database GenBank and can be accessed via the World Wide Web. In fact, most relevant journals require proof of submission to GenBank before an article discussing sequence data will be published. The URL for GenBank is:

What’s in the databases? GenBank contained:
- 1981: fewer than 500,000 bp
- 1986: 9,615,371 bp
- 1991: 71,947,426 bp
- 1996: 651,972,984 bp
- 2001: 15,849,921,438 bp
- 2004: 37,893,844,733 bp
- 2009: 106,533,156,756 bp
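To get a feeling for how fast the database grew, one can turn the figures above into doubling times. The following is only a rough sketch: it assumes growth was exponential between each pair of years, and it leaves out the 1981 entry because only an upper bound is given for it. The function name `doubling_time` is made up for this illustration.

```python
import math

# Base-pair counts quoted on the slide (year -> bp in GenBank).
genbank_bp = {
    1986: 9_615_371,
    1991: 71_947_426,
    1996: 651_972_984,
    2001: 15_849_921_438,
    2004: 37_893_844_733,
    2009: 106_533_156_756,
}

def doubling_time(year_a, year_b, sizes=genbank_bp):
    """Average doubling time in years, assuming exponential growth between the two years."""
    growth = sizes[year_b] / sizes[year_a]
    return (year_b - year_a) * math.log(2) / math.log(growth)

years = sorted(genbank_bp)
for a, b in zip(years, years[1:]):
    print(f"{a}-{b}: doubling roughly every {doubling_time(a, b):.2f} years")
```

Under this assumption the doubling time comes out somewhere between roughly one and three and a half years, depending on the period.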

What’s in the databases? On March 18, 2005 there were 1791 completely sequenced viruses, 204 completely sequenced bacteria, 21 completely sequenced archaea, and 9 complete genomes of eukaryotes, among them two yeasts, the roundworm C. elegans, the fruit fly Drosophila melanogaster, the mosquito A. gambiae, the malaria parasite P. falciparum, and the plant Arabidopsis thaliana (thale cress). There were also drafts of 11 other eukaryotic genomes, most notably the human genome.

What’s in the databases? On December 17, 2010 there were 3518 completely sequenced viruses, 952 completely sequenced bacteria, 68 completely sequenced archaea, and 73 complete genomes of eukaryotes, among them cow, wolf, horse, human, a monkey, pig, and chimpanzee.

First challenge: Sequencing large genomes Currently, much of the sequencing process is automated. However, contemporary sequencing machines can only sequence stretches of DNA that are a few hundred base pairs long at a time. The process of assembling these stretches of sequence into a whole genome poses some interesting mathematical problems.
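To make the assembly problem concrete, here is a toy sketch of one simple strategy: repeatedly merge the two reads with the longest suffix-prefix overlap. This is a greedy heuristic chosen for illustration, not the method used by any actual sequencing project. The function names and the three short reads are invented, and real assemblers must also cope with sequencing errors, repeats, both strands, and reads contained inside other reads.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left; the remaining pieces stay separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Hypothetical reads covering a 12-letter "genome".
example_reads = ["ATGCGT", "CGTACC", "ACCTGA"]
print(greedy_assemble(example_reads))
```

When a genome contains long repeated stretches, the largest overlap is not always the correct one, which is one reason the mathematics of assembly is interesting.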

First challenge: Sequencing large genomes For example, the publicly financed Human Genome Project used an approach called genome mapping to facilitate sequence assembly. Celera Genomics, a private enterprise, announced that it would be able to complete the sequencing of the entire human genome much faster by using an approach called shotgun sequencing. There was much debate over the feasibility of the latter approach, but it apparently worked. At its core, this was a debate over the mathematics of sequence assembly.

You have sequenced your genome - what do you do with it? This is known as genome analysis or sequence analysis. At present, most of bioinformatics is concerned with sequence analysis. Here are some of the questions studied in sequence analysis:
- gene finding
- protein 3D structure prediction
- gene function prediction
- prediction of important sites in proteins
- reconstruction of phylogenies

Genes and proteins The genome controls the making and workings of an organism by telling the cell which proteins to manufacture under which conditions. Proteins are the workhorses of biochemistry and play a variety of roles. A gene is a stretch of DNA that codes for a given protein.

Where are the genes? The objective of gene finding is to identify the regions of DNA that are genes. Ideally, we want to make statements like: “Positions 28,354 through 29,536 of this genome code for a protein.” The mathematical challenge here is to identify patterns in DNA that reliably indicate where a gene starts and ends, especially in eukaryotes.
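The simplest pattern one could look for is an open reading frame (ORF): a start codon followed, in the same reading frame, by a stop codon. The sketch below scans the three forward reading frames for such stretches and reports their positions in the "positions X through Y" style used above. It is only a first approximation under stated assumptions: it uses the standard start codon ATG and stop codons TAA, TAG, TGA, ignores the reverse strand, and knows nothing about introns, which is exactly why eukaryotes are harder. The function name and test sequence are invented.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) positions, 1-based and inclusive, of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon and must
    contain at least min_codons codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)

# Hypothetical 19-letter sequence with one short ORF.
for begin, end in find_orfs("CCATGAAACCCGGGTAGTT"):
    print(f"Positions {begin} through {end} may code for a protein.")
```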

Protein structure prediction When a protein is manufactured in the cell, it assumes a characteristic 3D structure or fold. It is very costly to determine the 3D structure of a protein experimentally (by NMR or X-ray crystallography). It would be much cheaper if we could predict the 3D structure of a protein directly from its primary structure, i.e., from the sequence of its amino acids. This is known as the protein folding problem. Many approaches have been proposed to develop algorithms for solving this problem; so far results are mixed.

Prediction of protein function Suppose you have identified a gene. What is its role in the biochemistry of its organism? Sequence databases can help us in formulating reasonable hypotheses.
- Search the database for proteins with similar amino acid sequences in other organisms.
- If the functions of the most similar proteins are known and if they tend to be the same function (e.g., “enzyme involved in glucose metabolism”), then it is reasonable to conjecture that your gene also codes for an enzyme involved in glucose metabolism.

Prediction of protein function: homology searches Given a DNA or protein sequence, searching the database(s) for similar sequences is known as a “homology search”. The most popular software tool for performing these searches is called BLAST; therefore biologists often speak of “BLAST searches”. There are two interesting problems here:
- How should the “similarity” of two sequences be measured?
- How much similarity constitutes evidence of biologically meaningful homology as opposed to random chance?
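Both questions can be illustrated with a small sketch. For the first, one classical similarity measure is the best local alignment score (the Smith-Waterman recurrence); the scoring values below (match +2, mismatch -1, gap -2) are arbitrary choices for illustration. For the second, a crude way to judge chance is to shuffle one sequence many times and see how often a shuffled version scores at least as well. This is not how BLAST works internally: BLAST uses fast heuristics and analytic score statistics rather than shuffling. All names here are invented.

```python
import random

def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def shuffle_p_value(query, subject, trials=200, seed=0):
    """Estimate how often a shuffled subject scores at least as well as the real one."""
    rng = random.Random(seed)
    observed = local_alignment_score(query, subject)
    letters = list(subject)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if local_alignment_score(query, "".join(letters)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one estimate, never exactly zero

print(local_alignment_score("ACGTTT", "GGACGT"))
print(shuffle_p_value("ACGTACGTTAGCCGATAGCTAGGCTA", "ACGTACGTTAGCCGATAGCTAGGCTA"))
```

Shuffling keeps the letter composition of the sequence fixed, so a small estimate suggests that the order of the letters, not just their frequencies, is responsible for the match.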

Prediction of important sites in proteins Not all parts of a protein are equally important; the function of most of its amino acids is often just to maintain an appropriate 3D structure, and mutations of those less crucial amino acids often don't have much effect. However, most proteins have crucial parts such as binding sites. Mutations occurring at binding sites tend to be lethal and will be weeded out by evolution.

How to predict binding sites from sequence data:
- Get a collection of proteins of similar amino acid sequences and analogous biochemical function from your database.
- Align these sequences amino acid by amino acid.
- Check which regions of the protein are highly conserved in the course of evolution.
- The binding site should be in one of the highly conserved regions.
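The last two steps of this recipe can be sketched in a few lines, assuming the alignment has already been made. Conservation is measured here in the simplest possible way, as the fraction of sequences that share the most common residue in a column; more refined measures exist. The four short protein fragments are invented for illustration, as are the function names.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column (gaps count against)."""
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n)
    return scores

def conserved_regions(alignment, threshold=1.0, min_len=3):
    """Return (start, end) column ranges, 1-based and inclusive, with conservation >= threshold."""
    scores = column_conservation(alignment)
    regions, start = [], None
    for i, s in enumerate(scores + [-1.0]):  # the sentinel closes a run at the end
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start + 1, i))
            start = None
    return regions

# Hypothetical aligned protein fragments (equal length, "-" marks a gap).
example_alignment = [
    "MKTAHGDSLV",
    "MRTAHGDTLV",
    "MKSAHGD-LI",
    "MKTAHGDSMV",
]
print(conserved_regions(example_alignment))
```

In this made-up example, columns 4 through 7 are identical in all four sequences, so they would be the first place to look for a binding site.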

The importance of being aligned DNA and protein molecules evolve mostly by three processes: point mutations (exchange of a single letter for another), insertions, and deletions. If a group of homologous proteins from different organisms has been identified, it is assumed that these proteins have evolved from a common ancestor. The process of multiple sequence alignment aims at identifying loci in the individual molecules that are derived from a common ancestral locus. These form the columns of the alignment.

Example of a multiple alignment
A T G - - T T C G G A C T
A C G A A T C C A G - C T
- C G A A T C C T A A C C
- T G A G C A C T A A C C

Reconstruction of phylogenetic trees A phylogenetic tree depicts the evolutionary history of a group of species. By observing similarities and differences between species, we may be able to reconstruct their phylogeny. Classically, the degree of similarity between two species has been assessed from morphological characters. By comparing genomic sequence data, we actually can quantify the degree of similarity between any two species, and use these degrees of similarity as a basis for reconstructing phylogenetic trees.
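As a small illustration of what "quantifying the degree of similarity" can mean, the sketch below computes the fraction of differing positions (sometimes called the p-distance) for every pair of rows in the example alignment shown earlier. Columns in which either sequence has a gap are simply skipped, which is one possible convention among several. The labels seq1 to seq4 are made up, since the slide does not name the sequences.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing positions among columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("sequences share no ungapped columns")
    return sum(x != y for x, y in pairs) / len(pairs)

# Rows of the example multiple alignment, with hypothetical labels.
aligned = {
    "seq1": "ATG--TTCGGACT",
    "seq2": "ACGAATCCAG-CT",
    "seq3": "-CGAATCCTAACC",
    "seq4": "-TGAGCACTAACC",
}

for (name_a, a), (name_b, b) in combinations(aligned.items(), 2):
    print(f"{name_a} vs {name_b}: {p_distance(a, b):.2f}")
```

A table of such pairwise distances is the typical input for distance-based tree reconstruction.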

Reconstruction of phylogenetic trees The most common approach to using genomic data for reconstruction of phylogenetic trees is to look at genes with analogous function, and thus supposedly common ancestry, and see how far the genes taken from the extant organisms have diverged. The observed differences in the amino acid (or nucleotide) sequences are then used to reconstruct the phylogeny. The current partition of organisms into eubacteria, archaea, and eukarya was discovered in this way by analyzing rRNA.
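One of the simplest ways to turn a table of pairwise divergences into a tree is average-linkage clustering (UPGMA): repeatedly join the two closest groups and average their distances to everything else. The slides do not say which method was used for the rRNA analysis, so this is merely an illustration of the general idea; UPGMA implicitly assumes that all lineages evolve at the same rate, which is often not true. The species names and distance values below are invented.

```python
def upgma(names, matrix):
    """Average-linkage clustering; returns the tree as a nested tuple of names."""
    clusters = [(name, 1) for name in names]   # (subtree, number of leaves)
    d = [row[:] for row in matrix]             # working copy of the distances
    while len(clusters) > 1:
        n = len(clusters)
        # Closest pair of clusters (i < j).
        i, j = min(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: d[ij[0]][ij[1]])
        (tree_i, size_i), (tree_j, size_j) = clusters[i], clusters[j]
        keep = [k for k in range(n) if k not in (i, j)]
        # Size-weighted average distance from the merged cluster to each other cluster.
        new_row = [(size_i * d[i][k] + size_j * d[j][k]) / (size_i + size_j) for k in keep]
        d = [[d[r][c] for c in keep] for r in keep]
        for row, value in zip(d, new_row):
            row.append(value)
        d.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep] + [((tree_i, tree_j), size_i + size_j)]
    return clusters[0][0]

# Hypothetical pairwise distances (fraction of differing positions).
species = ["human", "chimp", "mouse", "yeast"]
distances = [
    [0.00, 0.02, 0.15, 0.60],
    [0.02, 0.00, 0.16, 0.61],
    [0.15, 0.16, 0.00, 0.58],
    [0.60, 0.61, 0.58, 0.00],
]
print(upgma(species, distances))
```

With these made-up numbers, human and chimp are joined first, then mouse, and yeast last.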

The new frontier: Functional genomics It is fashionable nowadays to talk about functional genomics. Many people use this term as if it were a new discipline separate from bioinformatics, but I think it is more appropriate to consider it a new subfield of bioinformatics. The ultimate aim of functional genomics is to understand what genes do, when they do it, and how they do it. Ideally, we would like to understand the cell, or organism, as a giant network of chemical pathways that regulate each other.

Microarrays (gene chips) Microarrays, or gene chips, make it possible to monitor the activity levels of all the genes represented on the chip simultaneously under a variety of environmental conditions, in various organs, and at various stages of development. There are two types of challenges here: to determine when a change in activity level detected by the chip is statistically significant, and to use the data so obtained to make inferences about gene regulation.
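For the first challenge, here is a minimal sketch of one way to ask whether a change in activity could be due to chance: an exact permutation test on the difference of mean expression between two conditions. This is just one possible test, chosen because it needs no distributional assumptions. The measurements are invented, and the function name is hypothetical.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    def mean_diff(idx_a):
        a = [pooled[i] for i in idx_a]
        b = [pooled[i] for i in range(len(pooled)) if i not in idx_a]
        return sum(a) / len(a) - sum(b) / len(b)

    observed = abs(mean_diff(tuple(range(n_a))))
    # Try every way of relabelling the measurements into two groups of the same sizes.
    diffs = [abs(mean_diff(idx)) for idx in combinations(range(len(pooled)), n_a)]
    extreme = sum(d >= observed - 1e-12 for d in diffs)
    return extreme / len(diffs)

# Hypothetical expression levels of one gene, four chips per condition.
control = [1.0, 1.2, 0.9, 1.1]
treated = [2.0, 2.3, 1.9, 2.2]
print(permutation_p_value(control, treated))
```

Because a chip measures thousands of genes at once, a threshold that looks strict for a single gene will still let many false positives through, so some correction for multiple testing is needed in practice.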

What do we do with all these data? The bread-and-butter method of microarray data analysis is clustering. For a sequence of experiments on the same set of genes under various conditions, clustering makes it possible to identify groups of genes that are up- or down-regulated simultaneously. It is believed that genes acting in the same chemical pathway would normally belong to the same cluster. Some algorithms for clustering will be discussed in this course.
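As a taste of what clustering means here, the sketch below groups genes whose expression profiles rise and fall together. It links two genes whenever the Pearson correlation of their profiles reaches a threshold and then reports the connected groups (a simple single-linkage scheme). This is only one of many possible approaches and not necessarily one of the algorithms covered later; the gene names, expression values, and the 0.9 threshold are all invented.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_by_correlation(profiles, threshold=0.9):
    """Group genes that are linked, directly or indirectly, by correlation >= threshold."""
    genes = list(profiles)
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the representative of g's group.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(profiles[g], profiles[h]) >= threshold:
                parent[find(h)] = find(g)

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical expression levels of five genes across four experiments.
expression = {
    "geneA": [0.1, 1.0, 2.1, 3.0],
    "geneB": [0.0, 0.9, 2.0, 3.2],
    "geneC": [3.0, 2.0, 1.1, 0.0],
    "geneD": [2.9, 2.2, 0.9, 0.1],
    "geneE": [1.0, -1.0, 1.0, -1.0],
}
print(cluster_by_correlation(expression))
```

In this toy data set, geneA and geneB go up together, geneC and geneD go down together, and geneE follows neither pattern.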