What is Comparative Genomics? Insights gained through comparison of genomes from different species.

Presentation transcript:

What is Comparative Genomics? Insights gained through comparison of genomes from different species

How did it all start?
● We needed some genomes to start comparing
  – Many Bacteria sequenced first
  – Model organisms: yeast, worm, fruit fly, thale cress
  – Finally, human
● Comparative genomics did not just happen
  – Enough data had to be accumulated
  – New computational methods had to be developed to meet the challenges of processing large amounts of data
  – “Informatics” techniques from applied math, computer science and statistics were adapted for biological sequences

Comparing sequenced genomes
Comparison of genomic sequences from different species can help identify the following:
● Gene structure
● Gene function
● Interaction between gene products
● Non-coding RNAs
● Regulatory sequences

Evolution and sequence conservation
● Genome comparisons are based on a simple premise: conservation = functional importance
● If there are no constraints on a DNA sequence, random mutations will accumulate
● Over long evolutionary times (millions of years), these random mutations make two related sequences different
● Sequences from different genomes will be conserved if:
  – They code for proteins
  – They are important for regulation (protein binding)
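One simple way to look for conserved regions, assuming the two sequences have already been aligned to equal length, is to score identity in a sliding window. The sketch below is illustrative only: the function name, the window size and the toy sequences are made up, and a real comparison would start from a proper alignment.

```python
def window_identity(seq_a, seq_b, window=10):
    """Fraction of identical positions in each window of two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for start in range(len(seq_a) - window + 1):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        # Gap characters ("-") never count as matches.
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        scores.append(matches / window)
    return scores


# Toy alignment (hypothetical sequences): identical first half, diverged second half.
species_1 = "ATGGCCATTGTAACGTTAGC"
species_2 = "ATGGCCATTGTTCAGATCGA"
scores = window_identity(species_1, species_2, window=10)
print(scores[0], scores[-1])
```

Windows with high identity are candidates for constrained (coding or regulatory) sequence; windows with low identity behave like unconstrained sequence.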

Non-hypothesis-driven approach
● Hypothesis-driven approaches
  – Develop goals based on an available hypothesis
  – Design initial experiments (and backups if those fail)
  – When it yields results, go to NIH, NSF, DOE, ONR for funding
● Non-hypothesis-driven approaches
  – Start with a general knowledge of the biological system
  – Collect a large amount of data (usually by high-throughput methods) and try extracting and/or amplifying signal from noisy data
  – Sometimes it works for reasons that are obvious
  – Sometimes it works for reasons that are NOT obvious
  – Sometimes it doesn’t work because the data is too noisy
  – Funding agencies are not likely to fund this kind of research

Finding DNA regulatory motifs (protein binding sites)
● Experimental approaches
  – Promoter trapping
  – DNA footprinting
  – In vitro binding site selection (SELEX)
● Computational approaches
  – Searching databases of known sites
  – Finding over-represented motifs in a group of sequences (Gibbs sampling, Expectation Maximization)
      In promoters of homologous genes
      In promoters of functionally linked genes
      In promoters of interacting proteins
  – Ab initio methods
      Positional conservation of (pseudo)palindromic DNA motifs

Finding motifs in promoters of homologous genes
● Perform an all-versus-all BLAST search of the proteomes
● Pool together promoters of related genes
● Find conserved motifs (Gibbs sampling, Expectation Maximization)
● Only DNA motifs in related genes can be identified
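The pooling step can be illustrated with a much simpler stand-in than Gibbs sampling or Expectation Maximization: listing the exact k-mers that occur in every pooled promoter. This is a deliberately naive sketch, not the method from the slides; the function name, the value of k and the toy promoter sequences are all hypothetical.

```python
from collections import Counter


def shared_kmers(promoters, k=6, min_fraction=1.0):
    """Return the k-mers present in at least min_fraction of the promoters."""
    counts = Counter()
    for seq in promoters:
        # A set, so that each promoter contributes at most once per k-mer.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    needed = min_fraction * len(promoters)
    return sorted(kmer for kmer, n in counts.items() if n >= needed)


# Toy promoters of three "related genes" (made-up sequences).
promoters = [
    "ACGTTTGACATCGA",
    "GGTTGACATTTCCA",
    "CATTGACAAGGCTA",
]
print(shared_kmers(promoters, k=6))
```

Exact k-mer matching misses degenerate sites, which is one reason probabilistic methods such as Gibbs sampling and Expectation Maximization are used in practice.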

Finding DNA motifs by positional conservation of palindromes
● The approach targets sites for dimeric proteins and is particularly suited for helix-turn-helix (HTH) proteins of Bacteria and Archaea
● HTH proteins bind as dimers, usually with variable sequence spacing
● Binding sites are palindromic with a poorly conserved middle:
    GGATTnnnAATCC
    GGATTnnAATCC
    GGATTnnAAGCC
● Starting from a complete set of promoter sequences, we find imperfect palindromes of variable length
● Remove sequence bias (A/T or G/C content > 80%)
● Search all-versus-all and identify similar motifs
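A minimal sketch of the palindrome scan, under simplifying assumptions: the half-site (arm) length is fixed, only a few spacer lengths are tried, and a small number of mismatches between the two arms is tolerated. All function names, parameter defaults and test sequences are hypothetical; the 80% composition filter follows the threshold given above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def biased(seq, limit=0.8):
    """True if A/T or G/C content of seq exceeds the limit."""
    at = sum(seq.count(base) for base in "AT") / len(seq)
    return at > limit or (1 - at) > limit


def find_palindromes(seq, arm=5, spacers=(2, 3), max_mismatch=1):
    """Return (start, spacer, mismatches, site) for imperfect spaced palindromes."""
    hits = []
    for start in range(len(seq) - arm + 1):
        # The right arm of a palindromic site should match the
        # reverse complement of the left arm.
        target = reverse_complement(seq[start:start + arm])
        for gap in spacers:
            right_start = start + arm + gap
            right = seq[right_start:right_start + arm]
            if len(right) < arm:
                continue
            mismatches = sum(a != b for a, b in zip(target, right))
            site = seq[start:right_start + arm]
            if mismatches <= max_mismatch and not biased(site):
                hits.append((start, gap, mismatches, site))
    return hits


# Toy promoter (made up) containing GGATT, a 3-base spacer, then AATCC.
hits = find_palindromes("TTCAGGATTCGTAATCCGA")
```

The hits from all promoters would then be compared all-versus-all to group similar motifs, which this sketch does not attempt.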

Many potential binding sites are found...
The role of found motifs is difficult to predict
● RNA Pol K
● Ribosomal proteins
● Transposons
● GTP-binding ATPase
● Sulfate metabolism
● Short hypothetical proteins

Finding DNA motifs - the summary
● In promoters of homologous genes
  – Easy to perform and interpret results
  – Works only for proteins with sequence homology
● In promoters of interacting proteins
  – General approach, works even in the absence of sequence homology
  – Needs better coverage of interactions; high-throughput studies of species other than yeast will enable comparative analysis
● Ab initio methods
  – General approach, requires no prior knowledge
  – Complementary approaches (experimental or computational) are needed to link the found sites to their DNA-binding proteins

Evolution and sequence conservation
● Genome comparisons are based on a simple premise: conservation = functional importance
● If there are no constraints on a DNA sequence, random mutations will accumulate
● Over long evolutionary times (millions of years), these random mutations make two related sequences different
● Sequences from different genomes will be conserved if:
  – They code for proteins
  – They are important for regulation (protein binding)
● Comparative genomics is needed to identify conservation

Comparative genomics helps genome annotations
● In prokaryotes, finding genes is relatively easy based on open reading frames (ORFs)
● In eukaryotes, we have to look for ORFs, exons, introns, splice sites, polyA sites
● Bad news: Predicted exons sometimes do not exist
● More bad news: Pseudogenes
● Bad news keeps coming: Alternative splicing
● Good news: In different species, the genes normally have similar exon-intron structure
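To show why ORF-based gene finding in prokaryotes is comparatively simple, here is a minimal sketch that scans the three forward reading frames for ATG...stop stretches. It is an illustration under stated assumptions (forward strand only, ATG as the only start codon, a hypothetical minimum length), not a real gene finder; the function name and the toy sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    end is exclusive and includes the stop codon; min_codons counts the
    codons from ATG up to, but not including, the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Toy sequence (made up): ATG AAA CCC GGG TAA embedded in frame 2.
toy = "CCATGAAACCCGGGTAACC"
print(find_orfs(toy))
```

None of this carries over directly to eukaryotes, where introns interrupt the reading frame; that is where comparing exon-intron structure across species helps.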

Case 1: Cellular concentration of metabolite is too low to occupy the riboswitch binding site.
● Transcription and intramolecular RNA folding continue
● Translation is initiated
● Typically the new mRNA codes for a biosynthetic or transport protein that raises the intracellular level of the metabolite
● Gene regulation (next case) is accomplished by variations in the interactions of the regions highlighted in orange
(Figure: RNA polymerase, ribosome, UUUUU, AUG.) Courtesy of R. Breaker, Yale U.

Case 2: Cellular concentration of metabolite (X) is high.
● Intramolecular folding can lead to an alternate conformation
● RNA polymerase produces the long untranslated leader region
● The alternate riboswitch conformation is stable when metabolite is bound
● Transcription continues
● Now, RNA folding leads to formation of an intrinsic terminator
● The transcript is never completed and the metabolite biosynthetic or transport protein is not produced
(Figure: nascent RNA, DNA template, RNA polymerase, bound metabolite X, UUUUU.) Courtesy of R. Breaker, Yale U.

What does this ncRNA bind?

Can we predict functions without a strict measure of significance (no sequence or structural similarity)? This is done by a machine-trained (objective), jury-like system using inference

Comparative genomics predicts protein interactions (Rosetta Stone)
● In yeast, topoisomerase II has two domains that correspond to gyrases A and B
● Sequence comparisons show that these two domains are individual proteins in E. coli
● The implication is that these two proteins interact, and that their fusion was favored during evolution
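The Rosetta Stone logic can be sketched as a search for two different proteins from one genome that match separate, non-overlapping regions of a single protein in another genome. The input format, the function name, the overlap tolerance and all coordinates below are hypothetical; in practice the hits would come from a sequence similarity search.

```python
from itertools import combinations


def rosetta_stone_pairs(hits, max_overlap=10):
    """Predict interacting pairs from domain-fusion evidence.

    hits: (query, target, target_start, target_end) tuples, where query is a
    protein in genome A and target is a (possibly fused) protein in genome B.
    """
    by_target = {}
    for query, target, start, end in hits:
        by_target.setdefault(target, []).append((query, start, end))

    pairs = set()
    for target, regions in by_target.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            if q1 == q2:
                continue
            # Negative overlap means the two matched regions are disjoint.
            overlap = min(e1, e2) - max(s1, s2)
            if overlap <= max_overlap:
                pairs.add((tuple(sorted((q1, q2))), target))
    return pairs


# Toy hits with made-up coordinates: two E. coli proteins matching different
# parts of one yeast protein, plus two proteins matching the same region.
hits = [
    ("GyrB", "TopoII", 1, 650),
    ("GyrA", "TopoII", 700, 1400),
    ("GyrA", "OtherProt", 1, 300),
    ("GyrX", "OtherProt", 50, 280),
]
predicted = rosetta_stone_pairs(hits)
```

Two queries that match the same region of the target are more likely paralogs than interaction partners, which is why overlapping hits are excluded.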

Predicting protein function by genome context

What does gene colinearity mean?
(Figure labels: Krr1/Rrp20, Rio1/Rio2, Tif11, Spo11.)

Not much, unless supported by phylogeny and function

The case of Fibrillarin/Nop56 colinearity

Fibrillarin and Nop56 DO interact

Functional clues for hypothetical proteins based on genomic context analysis

High-throughput approaches
● Had to be developed quickly to match the speed of genome sequencing
● As a general rule, most experimental approaches can be adapted for high-throughput
  – Protein interactions (two-hybrid, TAP)
  – Protein localizations
  – Gene regulation (microarray)
  – Structure determination (more recent, still gaining speed)

What is a high-throughput experiment?
● Usually done at the level of the whole organism (whole genome) under different conditions
● HT experiments are aided by:
  – Equipment miniaturization
  – Robotics
  – Other automated procedures
● In almost all instances, heavy data analysis and processing is required

General properties of HT experiments
● Collect large amounts of data under many different conditions
  – Err on the side of collecting too much data; disk storage is cheap
● Process raw data (computers)
● Analyze data (computers)
● Integrate data from various sources (computers)
● Identify patterns and cluster the results based on similarity (computers)

Integrating heterogeneous data to predict protein interactions

Analysis of different data types is usually based on Bayesian inference
Example: protein interactions
● Proteins are more likely to interact if they are co-expressed
● Proteins are more likely to interact if they are co-localized in the cell
● Proteins are more likely to interact if they are co-localized in the genome
● Proteins are more likely to interact if they are parts of the same cellular process
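One common way to combine such evidence is a naive Bayes update, in which each data type contributes a likelihood ratio and the ratios are multiplied into the prior odds. The sketch below assumes the evidence types are conditionally independent, which real data only approximate; the function name, the prior and the likelihood ratios are all made-up illustrative numbers.

```python
def posterior_probability(prior, likelihood_ratios):
    """Combine a prior probability with independent likelihood ratios.

    Each ratio is P(evidence | interact) / P(evidence | do not interact).
    """
    odds = prior / (1 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1 + odds)


# Hypothetical numbers for one protein pair.
prior = 0.01               # assumed fraction of random pairs that interact
evidence = {
    "co-expressed": 4.0,
    "co-localized in cell": 3.0,
    "neighbors in genome": 2.0,
    "same cellular process": 5.0,
}
posterior = posterior_probability(prior, evidence.values())
print(round(posterior, 3))
```

No single line of evidence is decisive here, but together they raise a 1% prior to better than even odds, which is the point of integrating heterogeneous data.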

Predicting large protein complexes from individual parts

Beware of erroneous annotations