Significance Tests for Max-Gap Gene Clusters Rose Hoberman joint work with Dannie Durand and David Sankoff.

Slides:



Advertisements
Similar presentations
A Simpler 1.5-Approximation Algorithm for Sorting by Transpositions Tzvika Hartman Weizmann Institute.
Advertisements

Gene an d genome duplication Nadia El-Mabrouk Université de Montréal Canada.
GENE TREES Abhita Chugh. Phylogenetic tree Evolutionary tree showing the relationship among various entities that are believed to have a common ancestor.
Phylogenetic reconstruction
History, protohistory and prehistory of the Arabidopsis thaliana chromosome complement Henry Yves et al 2006, in press.
Molecular Evolution Revised 29/12/06
Evaluating the Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand.
Bioinformatics Chromosome rearrangements Chromosome and genome comparison versus gene comparison Permutations and breakpoint graphs Transforming Men into.
Non-Linear Problems General approach. Non-linear Optimization Many objective functions, tend to be non-linear. Design problems for which the objective.
The Cobweb of life revealed by Genome-Scale estimates of Horizontal Gene Transfer Fan Ge, Li-San Wang, Junhyong Kim Mourya Vardhan.
Current Approaches to Whole Genome Phylogenetic Analysis Hongli Li.
. Class 1: Introduction. The Tree of Life Source: Alberts et al.
The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand.
Bioinformatics and Phylogenetic Analysis
Association Mapping of Complex Diseases with Ancestral Recombination Graphs: Models and Efficient Algorithms Yufeng Wu UC Davis RECOMB 2007.
The dynamics of nuclear gene order in the eukaryotes.
Of Mice and Men Learning from genome reversal findings Genome Rearrangements in Mammalian Evolution: Lessons From Human and Mouse Genomes and Transforming.
Bas E. Dutilh Phylogenomics Using complete genomes to determine the phylogeny of species.
Genomic Rearrangements CS 374 – Algorithms in Biology Fall 2006 Nandhini N S.
Similar Sequence Similar Function Charles Yan Spring 2006.
Fast identification and statistical evaluation of segmental homologies in comparative maps Peter Calabrese 1, Sugata Chakravarty 2 and Todd Vision 3 1.
RECOMB Satellite Workshop, 2007 Algorithms for Association Mapping of Complex Diseases With Ancestral Recombination Graphs Yufeng Wu UC Davis.
Inferring Phylogeny using Permutation Patterns on Genomic Data 1 Md Enamul Karim 2 Laxmi Parida 1 Arun Lakhotia 1 University of Louisiana at Lafayette.
8/22/03 CS RA fair Comparative genome mapping Todd Vision Department of Biology University of North Carolina at Chapel Hill.
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Combinatorial and Statistical Approaches in Gene Rearrangement Analysis Jijun Tang Computer Science and Engineering University of South Carolina
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
Terminology of phylogenetic trees
Haplotype Blocks An Overview A. Polanski Department of Statistics Rice University.
Molecular evidence for endosymbiosis Perform blastp to investigate sequence similarity among domains of life Found yeast nuclear genes exhibit more sequence.
The Incompatible Desiderata of Gene Cluster Properties Rose Hoberman Carnegie Mellon University joint work with Dannie Durand.
Chapter 26: Phylogeny and the Tree of Life Objectives 1.Identify how phylogenies show evolutionary relationships. 2.Phylogenies are inferred based homologies.
Genome Organization and Evolution. Assignment For 2/24/04 Read: Lesk, Chapter 2 Exercises 2.1, 2.5, 2.7, p 110 Problem 2.2, p 112 Weblems 2.4, 2.7, pp.
Lecture 25 - Phylogeny Based on Chapter 23 - Molecular Evolution Copyright © 2010 Pearson Education Inc.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Yeast genome sequencing: the power of comparative genomics MEDG 505, 03/02/04, Han Hao Molecular Microbiology (2004)53(2), 381 – 389.
Ch. 21 Genomes and their Evolution. New approaches have accelerated the pace of genome sequencing The human genome project began in 1990, using a three-stage.
whole-genome duplications and large segmental duplications… …seem to be a common feature in eukaryotic genome evolution …play a crucial role in the evolution.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Identifying conserved segments in rearranged and divergent genomes Bob Mau, Aaron Darling, Nicole T. Perna Presented by Aaron Darling.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Genome Analysis II Comparative Genomics Jiangbo Miao Apr. 25, 2002 CISC889-02S: Bioinformatics.
Basic Local Alignment Search Tool BLAST Why Use BLAST?
Extracting binary signals from microarray time-course data Debashis Sahoo 1, David L. Dill 2, Rob Tibshirani 3 and Sylvia K. Plevritis 4 1 Department of.
Phylogeny Ch. 7 & 8.
Comparative genomics of Gossypium and Arabidopsis: Unraveling the consequences of both ancient and recent polyploidy Junkang Rong, John E. Bowers, Stefan.
Subtree Prune Regraft & Horizontal Gene Transfer or Recombination.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Genome Rearrangement By Ghada Badr Part I.
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Identifying conserved spatial patterns in genomes Rose Hoberman Dannie Durand Depts. of Biological Sciences and Computer Science, CMU David Sankoff Dept.
Gene Family Size Distributions Brought to You By Your Neighorhood Durand Lab Narayanan Raghupathy Nan Song Rose Hoberman.
Statistical Tests We propose a novel test that takes into account both the genes conserved in all three regions ( x 123 ) and in only pairs of regions.
Evolutionary Genome Biology Gabor T. Marth, D.Sc. Department of Biology, Boston College
PHYOGENY & THE Tree of life Represent traits that are either derived or lost due to evolution.
HW7: Evolutionarily conserved segments ENCODE region 009 (beta-globin locus) Multiple alignment of human, dog, and mouse 2 states: neutral (fast-evolving),
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
A Statistical Framework for Spatial Comparative Genomics Rose Hoberman Carnegie Mellon University, March 2007 Thesis Committee Dannie Durand (chair) Andrew.
Tests for Gene Clustering
Lecture 3: Genome Rearrangements and Duplications
Identifying conserved spatial patterns in genomes
By Chunfang Zheng and David Sankoff, 2014
Mattew Mazowita, Lani Haque, and David Sankoff
Identifying conserved spatial patterns in genomes
A Statistical Framework for Spatial Comparative Genomics Thesis Proposal Rose Hoberman Carnegie Mellon University, August 2005 Thesis Committee.
Multiple Genome Rearrangement
How to Define a Cluster? constructive or algorithmic definition
Tests of Significance Section 10.2.
Rearrangement Phylogeny of Genomes in Contig form
Presentation transcript:

Significance Tests for Max-Gap Gene Clusters Rose Hoberman joint work with Dannie Durand and David Sankoff

Identification of homologous chromosomal segments is a key task in comparative genomics Genome evolution Reconstruct history of chromosomal rearrangements Infer ancestral genetic map Phylogeny reconstruction … Pevzner, Tesler. Genome Research 2003

… Genome self-comparisons evidence for ancient whole-genome duplications … McLysaght, Hokamp, Wolfe. Nature Genetics, Identification of homologous chromosomal segments is a key task in comparative genomics

… Understand gene function and regulation in bacteria Predict operons Identify horizontal transfers Infer functional associations Snel, Bork, Huynen. PNAS 2002 Identification of homologous chromosomal segments is a key task in comparative genomics

What do such homologous segments look like? Why is identifying them a difficult problem?

Gene content and order are highly conserved rearrangement, mutation Similarity in gene content large scale duplication or speciation event original genome gene clusters Neither gene content nor order is strictly preserved

Whole Genome Comparison of Human with Human McLysaght, Hokamp, Wolfe. Nature Genetics, Could this pattern have occurred by chance?

Approach Genome as a sequence of genes (or markers) a single chromosome genes are unique each gene has at most one match in the other genome Hypothesis testing Alternate hypothesis: common ancestry Null hypothesis: random gene order

Gene Clusters Similar gene content Neither gene content nor order is strictly preserved

Max-Gap Clusters The gap between genes is the number of intervening genes A set of genes form a max-gap cluster if the gap between adjacent genes is never greater than g on either genome g  3

Max-Gap Clusters are Commonly Used in Genomic Analyses Blanc et al 2003, recent polyploidy in Arabidopsis Venter et al 2001, sequence of the human genome Overbeek et al 1999, inferring functional coupling of genes in bacteria Vandepoele et al 2002, duplications in Arabidopsis through comparison with rice Vision et al 2000, duplications in Eukaryotes Lawrence and Roth 1996, identification of horizontal transfers Tamames 2001, evolution of gene order conservation in prokaryotes Wolfe and Shields 1997, ancient yeast duplication McLysaght02, genomic duplication during early chordate evolution Coghlan and Wolfe 2002, comparing rates of rearrangements Seoighe and Wolfe 1998, genome rearrangements after duplication in yeast Chen et al 2004, operon prediction in newly sequenced bacteria Blanchette et al 1999, breakpoints as phylogenetic features... no analytical statistical model for max-gap clusters statistical significance assessed through randomization

Statistics for max-gap gene clusters 1. Reference set: 2. Whole Genome Comparison Inputs 1. a genome: G = 1, …, n of unique genes 2. a set of m special genes

Test statistic: the maximum gap observed between adjacent blue genes P-value: the probability of observing a maximum gap ≤ g, under the null hypothesis Significance of a complete cluster g = 2 m = 7

Compute probabilities by counting Set of all permutations Permutations where the maximum gap ≤ g The problem is how to count this

number of ways to start a cluster, e.g. ways to place the first gene and still have w-1 slots left w = (m-1)g + m

ways to place the remaining m-1 blue genes, so that no gap exceeds g g number of ways to start a cluster, e.g. ways to place the first gene and still have w-1 slots left

edge effects w = (m-1)g + m number of ways to start a cluster, e.g. ways to place the first gene and still have w-1 slots left ways to place the remaining m-1 blue genes, so that no gap exceeds g

I used this equation to calculate probabilities for various parameter values  Adding edge effects… Hoberman, Sankoff, Durand. JCB 2005.

Probability of Observing a Complete Cluster g m n = 500

Statistics for max-gap gene clusters Reference set Whole Genome Comparison Inputs 1. two genomes of n genes 2. m homologous genes pairs 3. a maximum gap size g

Whole Genome Comparison What is the probability of observing a maximal max-gap cluster of size exactly h, if both genomes are randomly ordered? g  3

Compute probabilities by counting Configurations that contain a cluster of exactly size h ?? All configurations of two genomes

Constructive Approach number of ways to place h genes so they form a cluster in both genomes number of ways to place m-h remaining genes so they do not extend the cluster Number of configurations that contain a cluster of exactly size h

Switching Representations

m=5 h=3 g=1

m=5 h=3 g=1 X

m=5 h=3 g=1

m=5 h=3 g=1 X

There are no other homologs within g of this cluster on both genomes, yet this cluster is not maximal Greedy agglomerative algorithm doesn’t find all max-gap clusters There is an efficient divide-and-conquer algorithm to find maximal max-gap clusters (Bergeron, Corteel, Raffinot 2002) Why is counting hard in this case? g = 1 h = 3

Bounding the Cluster Probabilities  Upper bound: Erroneously enumerates this configuration as a maximal cluster of size three  Lower bound: Fails to enumerate this permutation as containing a maximal cluster of size three

Probability of observing a maximal max-gap cluster of size h by chance Whole-genome comparison n=1000, m=250, g=10 Cluster size

Probability of observing a maximal max-gap cluster of size h by chance …is no longer strictly decreasing! Whole-genome comparison Cluster size n=1000, m=250, g=20

Conclusions Presented statistical tests for max-gap clusters  Evaluate the significance of observed clusters –Choose parameters effectively –Understand trends

Conclusions Presented statistical tests for max-gap clusters –Evaluate the significance of observed clusters –Choose parameters effectively –Understand trends Significant Parameter Values (α = 0.001)

Conclusions Presented statistical tests for max-gap clusters –Evaluate the significance of clusters of a pre- specified set of genes –Choose parameters effectively  Understand trends

Thank You