Fast identification and statistical evaluation of segmental homologies in comparative maps
Peter Calabrese (1), Sugata Chakravarty (2) and Todd Vision (3)
(1) Department of Mathematics, University of Southern California; Departments of (2) Operations Research and (3) Biology, University of North Carolina at Chapel Hill

comparative maps
Spaghetti diagram: Livingstone et al. (1999) Genetics 152:1183
Crop circle: Gale & Devos (1998) PNAS 95:1972

some terms
Feature: a gene or some other marker
Segment: a string or substring of features
Homology: descent from a common ancestor
Block: a pair of segments that are putatively homologous. These are what we seek!

local genome alignment
Consider each chromosome to be a string of features
Assign common letters to homologous features
Identify segments sharing multiple pairs of common letters
Differences from DNA/protein alignment
–high frequency of gaps relative to matches
–inversions may occur within the alignment

homology matrix
gene  1  2  3  4  5  6  7  8
1     -  0  0  0  1  0  0  0
2     0  -  0  0  0  1  0  0
3     0  0  -  0  0  0  1  0
4     0  0  0  -  0  0  0  1
5     1  0  0  0  -  0  0  0
6     0  1  0  0  0  -  0  0
7     0  0  1  0  0  0  -  0
8     0  0  0  1  0  0  0  -
1234  5678
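The matrix on this slide can be rebuilt from a string of "common letters", as described on the local genome alignment slide. The following is a minimal sketch in Python; the function name `homology_matrix` and the letters `a` to `d` are hypothetical choices for illustration, not part of the talk.

```python
def homology_matrix(families):
    """Return an n x n matrix: 1 where two distinct features share a letter,
    0 otherwise, and None on the diagonal (a feature compared with itself)."""
    n = len(families)
    return [
        [None if i == j else int(families[i] == families[j]) for j in range(n)]
        for i in range(n)
    ]

# Assumed labelling: gene k and gene k+4 carry the same letter,
# which reproduces the eight-gene matrix shown on the slide.
families = ["a", "b", "c", "d", "a", "b", "c", "d"]
matrix = homology_matrix(families)

# Dots of the homology matrix, reported with 1-based gene numbers.
dots = [(i + 1, j + 1) for i in range(8) for j in range(8) if matrix[i][j] == 1]
```

Each dot appears twice because a within-genome matrix is symmetric about the diagonal.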

duplication and multiplication
there is not necessarily a one-to-one alignment

genome rearrangements
inversion
reciprocal translocation
homologous segments may be small

Bancroft (2001) TIG 17, 89, after Ku et al. (2000) PNAS 97, 9121
Non-homologous features may be abundant within homologous segments

We must allow some non-colinearity in marker order between segmental homologs

homology matrix for Arabidopsis

going beyond eyeballing
LineUp – Hampson et al. (2003)
–Designed for genetic maps with error
ADHoRe – Vandepoele et al. (2002)
–Designed for unambiguous marker order data
Both provide automatic detection of blocks
For statistics, both employ permutation tests
–Computationally intensive
–p-values are approximate

FISH: Fast Identification of Segmental Homology
Block identification
–Dynamic programming provides speed and optimality guarantee
–Can be generalized to multiple alignments
Statistical assessment
–Null model of duplication and transposition
–Closed-form equation for calculating p-values (i.e. no permutation testing)

from homology matrix to graph
nodes
–represent dots in the homology matrix
edges
–connect nodes with nearest neighbors
–are unidirectional
–have an associated distance
–must be shorter than some threshold
paths
–traverse shortest available edges
–can be efficiently computed
–can be considered candidate blocks
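The edge rules listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the FISH implementation: the name `build_edges`, the parameter `d_max`, and the choice of a "down-right" neighborhood (both coordinates strictly increasing) are hypothetical, and the sketch does not attempt to handle inversions.

```python
def build_edges(nodes, d_max):
    """Return {i: [(distance, j), ...]} with directed edges i -> j.

    An edge is kept only if node j lies strictly down-right of node i
    (assumed neighborhood shape) and the Manhattan distance dx + dy
    does not exceed d_max. Candidates are sorted nearest first.
    """
    edges = {}
    for i, (xi, yi) in enumerate(nodes):
        candidates = []
        for j, (xj, yj) in enumerate(nodes):
            dx, dy = xj - xi, yj - yi
            if dx > 0 and dy > 0 and dx + dy <= d_max:
                candidates.append((dx + dy, j))
        edges[i] = sorted(candidates)
    return edges

# Four colinear dots plus one isolated dot (made-up coordinates).
nodes = [(1, 5), (2, 6), (3, 7), (4, 8), (7, 2)]
edges = build_edges(nodes, d_max=3)
```

With `d_max=3`, each colinear dot links only to its immediate successor, and the isolated dot has no edges, so the only path through the graph is the chain of four dots.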

null model
Within a genome: homologies are due to the duplication of individual features followed by insertion into a (uniformly) random position
Between genomes: homologies are due to the above process plus the transposition of randomly chosen features into randomly chosen positions
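One way to simulate this null model is sketched below in Python. The function names and parameters are hypothetical, and the sketch assumes that the feature to be copied or moved is drawn uniformly from the current genome (including earlier copies), which the slide does not specify.

```python
import random

def simulate_duplications(num_features, num_duplications, seed=0):
    """Duplicate randomly chosen features and insert each copy
    at a uniformly random position (within-genome null model)."""
    rng = random.Random(seed)
    genome = list(range(num_features))
    for _ in range(num_duplications):
        copy = rng.choice(genome)  # assumption: uniform over current features
        genome.insert(rng.randrange(len(genome) + 1), copy)
    return genome

def simulate_transpositions(genome, num_transpositions, seed=0):
    """Move randomly chosen features to uniformly random positions
    (the extra step in the between-genome null model)."""
    rng = random.Random(seed)
    genome = list(genome)
    for _ in range(num_transpositions):
        feature = genome.pop(rng.randrange(len(genome)))
        genome.insert(rng.randrange(len(genome) + 1), feature)
    return genome

genome_a = simulate_duplications(num_features=20, num_duplications=5)
genome_b = simulate_transpositions(genome_a, num_transpositions=3, seed=1)
```

Features that appear more than once in `genome_a`, or in both `genome_a` and `genome_b`, give the dots of a homology matrix under the null model.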

computing neighborhood size
$h$ = # nodes / # cells in matrix
$n$ = # cells in neighborhood
Prob(neighborhood has $\geq 1$ node): $p = 1 - (1-h)^n$
Threshold distance for $p = T$ under Manhattan distance ($\Delta x + \Delta y$): $d_T = 0.5 + \sqrt{\dfrac{\log(1-p)}{\log(1-h)} + 0.25}$
T is analogous to a gap parameter
–small T: few false positive edges, short blocks
–large T: more false positive edges, longer blocks
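The two formulas on this slide can be evaluated with a few lines of Python. This is a minimal sketch with hypothetical function names; the values of `h` and `T` are made up. The relation n = d(d - 1) used in the consistency check is what follows algebraically from the slide's expression for the threshold distance; it is not stated in the talk.

```python
import math

def neighborhood_probability(h, n):
    """Probability that a neighborhood of n cells holds at least one node."""
    return 1.0 - (1.0 - h) ** n

def threshold_distance(h, T):
    """Manhattan threshold distance at which the neighborhood
    probability equals T, for node density h."""
    n = math.log(1.0 - T) / math.log(1.0 - h)
    return 0.5 + math.sqrt(n + 0.25)

h = 0.001   # made-up density: one dot per thousand cells
T = 0.05    # made-up threshold probability
d_T = threshold_distance(h, T)

# Consistency check: solving d = 0.5 + sqrt(n + 0.25) for n gives n = d(d - 1).
n_cells = d_T * (d_T - 1.0)
p_check = neighborhood_probability(h, n_cells)
```

Here `p_check` recovers `T` up to floating-point error, which confirms that the two expressions are inverses of each other.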

neighborhood geometry

blocks of nearest neighbors

block statistics
Chen-Stein Theorem: number of blocks with k nodes is approximately Poisson
Expected number of blocks = $c\,p_u$
Conservative matrix-wide p-value: $\mathrm{Prob}(X \geq k) < 1 - e^{-c\,p_u}$
where $c$ is the # of cells in the matrix and $p_u = h(nh)^{k-1}$
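As a worked example of the bound on this slide, the sketch below computes the expected number of blocks and the conservative p-value in Python. The function name and the input values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def block_p_value_bound(c, h, n, k):
    """Return (expected number of blocks with k nodes, upper bound on the
    matrix-wide p-value) using p_u = h * (n * h) ** (k - 1)."""
    p_u = h * (n * h) ** (k - 1)
    expected = c * p_u
    # 1 - exp(-x), computed with expm1 for accuracy when x is tiny.
    bound = -math.expm1(-expected)
    return expected, bound

# Made-up inputs: a 1000 x 1000 matrix, density 0.001,
# 50-cell neighborhoods, blocks of 4 nodes.
expected, bound = block_p_value_bound(c=10**6, h=0.001, n=50, k=4)
```

With these inputs, `p_u` is 0.001 × 0.05³ = 1.25e-7, so about 0.125 blocks of four nodes are expected by chance. Because 1 − e^(−x) is always less than x, the bound stays below the expected count.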

identifying blocks
Let edge from $i$ to $j$ have weight $w_{ij} = 1$
Initialize: score of block terminating at $i$, $S_i = 0$
Recursion for block scores: $S_j = \max(S_i + w_{ij})$ for all $i$ such that $j \in T_i$
Dynamic programming can be used to find all maximally extended blocks
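The recursion on this slide can be sketched in Python as below. This is an illustration rather than the FISH code: the function names are hypothetical, every edge has weight 1 as on the slide, and the sketch assumes nodes are indexed so that every edge runs from a lower to a higher index (for example, nodes sorted by position).

```python
def block_scores(num_nodes, edges):
    """Dynamic programming over directed edges (i, j) with i < j.

    score[j] is the largest number of edges on any path ending at j;
    back[j] is the predecessor on that path (None if j starts a block).
    """
    score = [0] * num_nodes
    back = [None] * num_nodes
    # Visiting edges by increasing target guarantees score[i] is final
    # before it is used, because all edges into i have a smaller target.
    for i, j in sorted(edges, key=lambda e: e[1]):
        if score[i] + 1 > score[j]:
            score[j] = score[i] + 1
            back[j] = i
    return score, back

def trace_block(end, back):
    """Follow back-pointers from an end node to recover the whole block."""
    path = [end]
    while back[path[-1]] is not None:
        path.append(back[path[-1]])
    return path[::-1]

# Made-up graph: a chain of four nodes plus one isolated node.
score, back = block_scores(5, [(0, 1), (1, 2), (2, 3)])
block = trace_block(3, back)
```

A block with k nodes ends with score k − 1, so the chain of four nodes finishes with a score of 3 while the isolated node keeps its initial score of 0.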

simulation experiment
k  obs     stderr  upbound  lowbound
2  45.8    0.06    47.6     40.1
3  2.28    0.02    2.39     1.78
4  0.113   0.003   0.120    0.079
5  0.006   0.001   0.006    0.004
6  0.0003  0.0002  0.0003   0.0002
How often are blocks of size k observed under the null model compared with expectation?

FISH v.1.0
http://www.bio.unc.edu/faculty/vision/lab/FISH
–source code
–compiled executables
–documentation
–sample data
Adjustable parameters (e.g. T)
Reports statistics on blocks
Is fast and memory-efficient

applications
Automated pairwise alignment of genome maps as part of Phytome project
Prediction of gene content in regions of unsequenced genomes
Studies of genome evolution, especially duplication and gene order rearrangement

future work
Biologically motivated neighborhood geometries (ADHoRe)
Non-discrete marker positions (LineUp)
–Genetic versus physical maps
–Map uncertainty
Robustness to deviation from null model (permutation tests)
Extension to homologies among 3 or more segments

Thanks!
Sugata Chakravarty
Peter Calabrese
U.S. National Science Foundation

http://www.bio.unc.edu/vision/faculty/lab/FISH

Ghosts
Simillion, Vandepoele, Van Montagu, Zabeau, Van de Peer (2002) PNAS 99, 13627
