What is Comparative Genomics? Insights gained through comparison of genomes from different species.


1 What is Comparative Genomics? Insights gained through comparison of genomes from different species

2 How did it all start?
● We needed some genomes to start comparing
– Many Bacteria sequenced first
– Model organisms: yeast, worm, fruit fly, thale cress
– Finally, human
● Comparative genomics did not just happen
– Enough data had to be accumulated
– New computational methods were developed to meet the challenges of processing large amounts of data
– “Informatics” techniques from applied math, computer science and statistics were adapted for biological sequences

3 Comparing sequenced genomes
Comparison of genomic sequences from different species can help identify the following:
● Gene structure
● Gene function
● Interactions between gene products
● Non-coding RNAs
● Regulatory sequences

4 Evolution and sequence conservation
● Genome comparisons are based on a simple premise: conservation = functional importance
● If there are no constraints on a DNA sequence, random mutations will accumulate
● Over long evolutionary times (millions of years), these random mutations make two related sequences different
● Sequences from different genomes will be conserved if:
– They code for proteins
– They are important for regulation (protein binding)
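
The premise above can be made concrete with a small sketch that scores each column of an alignment by how often the most common base occurs. This is only an illustration: the function name, the scoring rule and the toy alignment are assumptions, not part of the original slides.

```python
from collections import Counter

def column_conservation(aligned_seqs):
    """For each alignment column, return the fraction of sequences
    that carry the most common residue (a gap never counts as a match)."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for col in zip(*aligned_seqs):
        counts = Counter(c for c in col if c != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(col))
    return scores

# Hypothetical toy alignment of one region from three species.
alignment = [
    "ATGGCCTTAGAC",
    "ATGGCGTTTGAC",
    "ATGGCATTCGAC",
]
scores = column_conservation(alignment)
# Columns where every species has the same base.
conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(conserved)
```

Columns that stay identical across distant species are candidates for constrained (functional) positions; with only three short sequences, as here, identity can easily arise by chance.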

5 Non-hypothesis-driven approach
● Hypothesis-driven approaches
– Develop goals based on available hypotheses
– Design initial experiments (and backups if those fail)
– When the experiments yield results, go to NIH, NSF, DOE, ONR for funding
● Non-hypothesis-driven approaches
– Start with a general knowledge of the biological system
– Collect a large amount of data (usually with high-throughput methods) and try to extract and/or amplify signal from noisy data
– Sometimes it works for reasons that are obvious
– Sometimes it works for reasons that are NOT obvious
– Sometimes it doesn’t work because the data are too noisy
– Funding agencies are not likely to fund this kind of research

6 Finding DNA regulatory motifs (protein binding sites)
● Experimental approaches
– Promoter trapping
– DNA footprinting
– In vitro binding site selection (SELEX)
● Computational approaches
– Searching databases of known sites
– Finding over-represented motifs in a group of sequences (Gibbs sampling, Expectation Maximization): in promoters of homologous genes, in promoters of functionally linked genes, in promoters of interacting proteins
– Ab initio methods: positional conservation of (pseudo)palindromic DNA motifs

7 Finding motifs in promoters of homologous genes
● Perform an all-versus-all BLAST search of the proteomes
● Pool together the promoters of related genes
● Find conserved motifs (Gibbs sampling, Expectation Maximization)
● Limitation: only DNA motifs in related genes can be identified
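
The motif-finding step can be illustrated with a much simpler stand-in than Gibbs sampling or Expectation Maximization: counting which exact k-mers occur in every pooled promoter. The sketch below does only that; the function name, the parameters `k` and `min_fraction`, and the promoter sequences are all hypothetical.

```python
from collections import Counter

def kmers_shared_by_promoters(promoters, k=6, min_fraction=1.0):
    """Return the k-mers present in at least min_fraction of the promoters,
    mapped to the number of promoters that contain them."""
    seen_in = Counter()
    for seq in promoters:
        seq = seq.upper()
        # A set is used so that each promoter votes once per k-mer.
        seen_in.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    needed = min_fraction * len(promoters)
    return {kmer: n for kmer, n in seen_in.items() if n >= needed}

# Hypothetical promoters of three homologous genes (invented sequences).
promoters = [
    "CCGTCTTGACAGGCTA",
    "AGTTGACATCCGGAGT",
    "GGCATCGATTGACACT",
]
shared = kmers_shared_by_promoters(promoters, k=6)
print(shared)
```

Exact k-mer counting misses degenerate sites, which is the reason probabilistic methods such as Gibbs sampling are used in practice.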

8 Finding DNA motifs by positional conservation of palindromes
● The approach targets sites for dimeric proteins and is particularly suited for helix-turn-helix (HTH) proteins of Bacteria and Archaea
● HTH proteins bind as dimers, usually with variable sequence spacing
● Binding sites are palindromic with a poorly conserved middle, for example:
– GGATTnnnAATCC
– GGATTnnAATCC
– GGATTnnAAGCC
● Starting from a complete set of promoter sequences, we find imperfect palindromes of variable length
● Remove sequences with compositional bias (A/T or G/C content > 80%)
● Search all-versus-all and identify similar motifs
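
A minimal sketch of the palindrome search described above: it looks for a half-site followed, after a variable spacer, by something close to its reverse complement, and drops sites with more than 80% A/T or G/C. The half-site length, spacer range and mismatch limit are assumptions chosen for illustration; the slides do not state the values actually used.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def is_biased(site, threshold=0.8):
    """True if the A/T or the G/C content of the site exceeds the threshold."""
    at = (site.count("A") + site.count("T")) / len(site)
    gc = (site.count("G") + site.count("C")) / len(site)
    return at > threshold or gc > threshold

def find_palindromes(seq, half=5, spacers=range(0, 5), max_mismatch=1):
    """Return (start, site, spacer_length, mismatches) for each imperfect
    palindrome: a half-site, a spacer, and a near reverse complement."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2 * half + 1):
        left = seq[start:start + half]
        expected = reverse_complement(left)
        for gap in spacers:
            r0 = start + half + gap
            right = seq[r0:r0 + half]
            if len(right) < half:
                break  # longer spacers would also run off the end
            mismatches = sum(a != b for a, b in zip(right, expected))
            site = seq[start:r0 + half]
            if mismatches <= max_mismatch and not is_biased(site):
                hits.append((start, site, gap, mismatches))
    return hits

# The slide's example half-sites, with invented spacer and flanking bases.
perfect = find_palindromes("CAGGGATTCGAAATCCTGA")
imperfect = find_palindromes("GGATTCAAAGCC")
```

Here `GGATT` and `AATCC` are exact reverse complements, while `AAGCC` differs from `AATCC` at one position, which is why a mismatch allowance is needed to recover the third example on the slide.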

9 Many potential binding sites are found... The role of found motifs is difficult to predict RNA Pol K Ribosomal proteins Transposons GTP-binding ATPase Sulfate metabolism Short hypothetical proteins

10 Finding DNA motifs - the summary
● In promoters of homologous genes
– Easy to perform and to interpret the results
– Works only for proteins with sequence homology
● In promoters of interacting proteins
– General approach, works even in the absence of sequence homology
– Needs better coverage of interactions; high-throughput studies of species other than yeast will enable comparative analysis
● Ab initio methods
– General approach, requires no prior knowledge
– Complementary approaches (experimental or computational) are needed to link the found sites to their DNA-binding proteins

11 Evolution and sequence conservation
● Genome comparisons are based on a simple premise: conservation = functional importance
● If there are no constraints on a DNA sequence, random mutations will accumulate
● Over long evolutionary times (millions of years), these random mutations make two related sequences different
● Sequences from different genomes will be conserved if:
– They code for proteins
– They are important for regulation (protein binding)
● Comparative genomics is needed to identify conservation

12 Comparative genomics helps genome annotations
● In prokaryotes, finding genes is relatively easy based on open reading frames (ORFs)
● In eukaryotes, we have to look for ORFs, exons, introns, splice sites, polyA sites
● Bad news: Predicted exons sometimes do not exist
● More bad news: Pseudogenes
● Bad news keeps coming: Alternative splicing
● Good news: In different species, the genes normally have similar exon-intron structure
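
To show why ORF-based gene finding is comparatively simple in prokaryotes, here is a minimal sketch that scans the three forward reading frames for an ATG followed in frame by a stop codon. It is an illustration under stated assumptions, not a gene finder: it ignores the reverse strand and alternative start codons, and `min_codons` is a hypothetical cutoff.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs that run from
    an ATG to the first in-frame stop codon. `end` is exclusive and
    includes the stop codon; min_codons counts codons before the stop."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Invented sequence with one short ORF: ATG AAA CCC GGG TAA.
orfs = find_orfs("CCATGAAACCCGGGTAACC")
print(orfs)
```

In eukaryotes the same scan fails because introns interrupt the reading frame, which is where comparison of exon-intron structure across species helps.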

13 Case 1: Cellular concentration of metabolite is too low to occupy the riboswitch binding site. Transcription and … [Figure labels: RNA polymerase; numbered regions 1 to 4] Courtesy of R. Breaker, Yale U.

14 Case 1: Cellular concentration of metabolite is too low to occupy the riboswitch binding site. Transcription and intramolecular RNA folding continue. [Figure labels: RNA polymerase; UUUUU; AUG; numbered regions 1 to 4] Courtesy of R. Breaker, Yale U.

15 Case 1: Cellular concentration of metabolite is too low to occupy the riboswitch binding site. Transcription and intramolecular RNA folding continue. Translation is initiated. Typically the new mRNA codes for a biosynthetic or transport protein that raises the intracellular level of the metabolite. Gene regulation (next case) is accomplished by variations in the interactions of the regions highlighted in orange. [Figure labels: ribosome; UUUUU; AUG; numbered regions 1 to 4] Courtesy of R. Breaker, Yale U.

16 Case 2: Cellular concentration of metabolite (X) is high. Intramolecular folding can lead to an alternate conformation. RNA polymerase produces the long untranslated leader region. The alternate riboswitch conformation is stable when metabolite is bound. [Figure labels: RNA polymerase; nascent RNA; DNA template; metabolite X] Courtesy of R. Breaker, Yale U.

17 Case 2: Cellular concentration of metabolite (X) is high. Intramolecular folding can lead to an alternate conformation. RNA polymerase produces the long untranslated leader region. The alternate riboswitch conformation is stable when metabolite is bound. Transcription continues. [Figure labels: RNA polymerase; UUUUU; metabolite X; numbered regions 1 to 4] Courtesy of R. Breaker, Yale U.

18 Case 2: Cellular concentration of metabolite (X) is high. Transcription continues. Now, RNA folding leads to formation of an intrinsic terminator. [Figure labels: RNA polymerase; UUUUU; metabolite X; numbered regions 1 to 4] Courtesy of R. Breaker, Yale U.

19 Case 2: Cellular concentration of metabolite (X) is high. Transcription continues. Now, RNA folding leads to formation of an intrinsic terminator. The transcript is never completed and the metabolite biosynthetic or transport protein is not produced. [Figure labels: RNA polymerase; UUUUU; metabolite X; numbered regions 1 to 4] Courtesy of R. Breaker, Yale U.

20 What does this ncRNA bind?

21 Can we predict functions without a strict measure of significance (no sequence or structural similarity)? This is done by a machine-trained (objective), jury-like system using inference

22 Comparative genomics predicts protein interactions (Rosetta Stone)
● In yeast, topoisomerase II has two domains that correspond to gyrase subunits A and B
● Sequence comparisons show that these two domains are individual proteins in E. coli
● The implication is that these two proteins interact, and that their fusion was favored during evolution
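
The Rosetta Stone idea can be sketched as a search over sequence-similarity hits: if two different proteins from one genome match non-overlapping parts of a single protein in another genome, they are predicted to interact. The hit table, its coordinates, the protein identifiers and the `max_overlap` parameter below are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

def rosetta_stone_pairs(hits, max_overlap=10):
    """hits: iterable of (fused_protein, q_start, q_end, other_protein),
    with 1-based inclusive coordinates on the fused protein.
    Return a set of (protein_a, protein_b, fused_protein) for pairs of
    distinct proteins that match non-overlapping regions of the same
    fused protein."""
    by_fused = defaultdict(list)
    for fused, q_start, q_end, other in hits:
        by_fused[fused].append((q_start, q_end, other))
    predictions = set()
    for fused, regions in by_fused.items():
        for (s1, e1, p1), (s2, e2, p2) in combinations(regions, 2):
            if p1 == p2:
                continue
            # Negative or small values mean the two matches sit side by side.
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                a, b = sorted((p1, p2))
                predictions.add((a, b, fused))
    return predictions

# Hypothetical hit table; the coordinates are NOT real alignment results.
hits = [
    ("yeast_TOP2", 1, 650, "ecoli_gyrB"),
    ("yeast_TOP2", 700, 1200, "ecoli_gyrA"),
    ("yeast_X", 1, 300, "ecoli_P"),
    ("yeast_X", 20, 310, "ecoli_Q"),
]
predictions = rosetta_stone_pairs(hits)
print(predictions)
```

The second pair is rejected because both proteins match the same region of `yeast_X`, which suggests paralogs rather than a fusion of two distinct domains.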

23 Predicting protein function by genome context

24 Krr1/Rrp20 Rio1/Rio2 Tif11 Spo11 What does gene colinearity mean?

25 Not much, unless supported by phylogeny and function

26 The case of Fibrillarin/Nop56 colinearity

27 Fibrillarin and Nop56 DO interact

28 Functional clues for hypothetical proteins based on genomic context analysis

29 High-throughput approaches
● Had to be developed quickly to match the speed of genome sequencing
● As a general rule, most experimental approaches can be adapted for high-throughput
– Protein interactions (two-hybrid, TAP)
– Protein localization
– Gene regulation (microarray)
– Structure determination (more recent, still gaining speed)

30 What is a high-throughput experiment?
● Usually done at the level of the whole organism (whole genome) under different conditions
● HT experiments are aided by:
– Equipment miniaturization
– Robotics
– Other automated procedures
● In almost all instances, heavy data analysis and processing is required

31 General properties of HT experiments
● Collect large amounts of data under many different conditions
– Err on the side of collecting too much data; disk storage is cheap
● Process raw data (computers)
● Analyze data (computers)
● Integrate data from various sources (computers)
● Identify patterns and cluster the results based on similarity (computers)

32 Integrating heterogeneous data to predict protein interactions

33 Analysis of different data types is usually based on Bayesian inference
Example: protein interactions
● Proteins are more likely to interact if they are co-expressed
● Proteins are more likely to interact if they are co-localized in the cell
● Proteins are more likely to interact if their genes are co-localized in the genome
● Proteins are more likely to interact if they are parts of the same cellular process
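
One common way to combine such evidence is a naive Bayes update, in which each data type contributes a likelihood ratio that multiplies the prior odds of interaction. The sketch below assumes the evidence types are conditionally independent, and every number in it (the prior and the likelihood ratios) is invented for illustration.

```python
from math import prod

def posterior_interaction_probability(prior, likelihood_ratios):
    """Combine a prior probability of interaction with independent
    likelihood ratios, each P(evidence | interact) / P(evidence | not),
    and return the posterior probability."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1 + posterior_odds)

# Hypothetical likelihood ratios for one protein pair.
evidence = {
    "co_expressed": 4.0,
    "co_localized_in_cell": 3.0,
    "adjacent_in_genome": 5.0,
    "same_cellular_process": 2.0,
}
p = posterior_interaction_probability(0.01, evidence.values())
print(round(p, 3))
```

No single line of evidence is decisive here, but together they raise a 1% prior to a posterior above 50%, which is the motivation for integrating heterogeneous data.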

34 Predicting large protein complexes from individual parts

35 Beware of erroneous annotations

