Fast identification and statistical evaluation of segmental homologies in comparative maps. Peter Calabrese, Sugata Chakravarty and Todd Vision.

Presentation transcript:

Fast identification and statistical evaluation of segmental homologies in comparative maps
Peter Calabrese (1), Sugata Chakravarty (2) and Todd Vision (3)
(1) Department of Mathematics, University of Southern California; Departments of (2) Operations Research and (3) Biology, University of North Carolina at Chapel Hill

comparative maps
Figures: spaghetti diagram; crop circle
Sources: Livingstone et al 1999 Genetics 152:1183; Gale & Devos 1998 PNAS 95:1972

some terms
Feature: a gene or some other marker
Segment: a string or substring of features
Homology: descent from a common ancestor
Block: a pair of segments that are putatively homologous. These are what we seek!

local genome alignment
Consider each chromosome to be a string of features
Assign common letters to homologous features
Identify segments sharing multiple pairs of common letters
Differences from DNA/protein alignment
– high frequency of gaps relative to matches
– inversions may occur within the alignment
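The first two steps on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each chromosome is given as a string in which homologous features already share a letter, and the function name `homology_dots` is hypothetical, not part of FISH.

```python
from collections import defaultdict

def homology_dots(chrom_a, chrom_b):
    """Return sorted (x, y) pairs where chrom_a[x] and chrom_b[y] carry the same letter.

    Each pair is one dot in the homology matrix. A duplicated feature yields
    several dots, so the result is not necessarily one-to-one.
    """
    # Index every position of each letter on the second chromosome.
    positions_b = defaultdict(list)
    for y, letter in enumerate(chrom_b):
        positions_b[letter].append(y)

    dots = []
    for x, letter in enumerate(chrom_a):
        for y in positions_b.get(letter, []):
            dots.append((x, y))
    return sorted(dots)

# Two toy chromosomes that differ by one unshared feature each ("x" and "y").
dots = homology_dots("abcxde", "abycde")
print(dots)
```

The unshared features show up as gaps between consecutive dots, which is why the later steps need a distance threshold rather than strict adjacency.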

homology matrix (figure; label: gene)

duplication and multiplication: there is not necessarily a one-to-one alignment

genome rearrangements
inversion; reciprocal translocation
homologous segments may be small

Non-homologous features may be abundant within homologous segments
Figure: Bancroft (2001) TIG 17, 89 after Ku et al (2000) PNAS 97, 9121

We must allow some non-colinearity in marker order between segmental homologs

homology matrix for Arabidopsis

going beyond eyeballing
LineUp (Hampson et al 2003)
– Designed for genetic maps with error
ADHoRe (Vandepoele et al 2002)
– Designed for unambiguous marker order data
Both provide automatic detection of blocks
For statistics, both employ permutation tests
– Computationally intensive
– p-values are approximate

FISH: Fast Identification of Segmental Homology
Block identification
– Dynamic programming provides speed and optimality guarantee
– Can be generalized to multiple alignments
Statistical assessment
– Null model of duplication and transposition
– Closed-form equation for calculating p-values (i.e. no permutation testing)

from homology matrix to graph
nodes
– represent dots in the homology matrix
edges
– connect nodes with nearest neighbors
– are unidirectional
– have an associated distance
– must be shorter than some threshold
paths
– traverse shortest available edges
– can be efficiently computed
– can be considered candidate blocks
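As a rough sketch of the graph construction on this slide, the Python below links each node to its nearest neighbor within a threshold. Several choices here are assumptions made for illustration and are not stated on the slide: edges point toward larger x and larger y only (a single orientation, so inverted segments are not handled), distance is Manhattan, ties keep the first candidate, and the name `nearest_neighbor_edges` is hypothetical.

```python
def nearest_neighbor_edges(nodes, d_max):
    """Map each node index i to (j, distance) for its nearest neighbor j.

    nodes: list of (x, y) dots from the homology matrix.
    An edge i -> j is kept only if j lies at larger x and larger y
    (assumed direction convention) and within Manhattan distance d_max.
    """
    edges = {}
    for i, (xi, yi) in enumerate(nodes):
        best = None
        for j, (xj, yj) in enumerate(nodes):
            if xj > xi and yj > yi:
                dist = (xj - xi) + (yj - yi)  # Manhattan distance
                if dist <= d_max and (best is None or dist < best[1]):
                    best = (j, dist)
        if best is not None:
            edges[i] = best
    return edges

nodes = [(0, 0), (1, 1), (2, 3), (4, 4), (5, 5), (9, 0)]
print(nearest_neighbor_edges(nodes, d_max=3))
```

This brute-force version compares all pairs of nodes, which is fine for a toy example; it makes no claim about how FISH achieves its speed.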

null model
Within a genome: homologies are due to the duplication of individual features followed by insertion into a (uniformly) random position
Between genomes: homologies are due to the above process plus the transposition of randomly chosen features into randomly chosen positions.
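The within-genome part of the null model can be simulated with a short Python sketch. It assumes, beyond what the slide states, that the feature to duplicate is drawn uniformly from the current copies (so features with more copies are chosen more often) and that the genome starts as a list of distinct integers. The function name `simulate_duplications` is hypothetical.

```python
import random

def simulate_duplications(n_features, n_duplications, seed=0):
    """Simulate the within-genome null model on a toy genome.

    Start with n_features distinct features, then repeatedly copy one
    feature and insert the copy at a uniformly random position.
    """
    rng = random.Random(seed)
    genome = list(range(n_features))
    for _ in range(n_duplications):
        feature = rng.choice(genome)              # assumed: uniform over current copies
        position = rng.randrange(len(genome) + 1)  # uniform over all insertion slots
        genome.insert(position, feature)
    return genome

genome = simulate_duplications(n_features=20, n_duplications=5, seed=1)
print(len(genome))
```

Because each copy lands independently of its source position, any apparent run of neighboring homologous pairs in such a genome arises by chance, which is the situation the block statistics are meant to guard against.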

computing neighborhood size
$h$ = # nodes / # cells in matrix
$n$ = # cells in neighborhood
Prob(neighborhood has $\geq 1$ node): $p = 1 - (1-h)^n$
Threshold distance for $p = T$ under Manhattan distance ($\Delta x + \Delta y$): $d_T = \sqrt{\frac{\log(1-p)}{\log(1-h)} + 0.25}$
T is analogous to a gap parameter
– small T: few false positive edges, short blocks
– large T: more false positive edges, longer blocks
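The quantities on this slide are easy to check numerically. The Python sketch below transcribes the slide's expressions as written; the function names are hypothetical, and `cells_for_prob` simply inverts p = 1 - (1-h)^n for n, which is the ratio of logarithms that appears under the square root.

```python
import math

def neighborhood_prob(h, n):
    """p = 1 - (1 - h)^n: chance that a neighborhood of n cells holds at least one node."""
    return 1.0 - (1.0 - h) ** n

def cells_for_prob(h, T):
    """Neighborhood size n at which neighborhood_prob(h, n) equals T."""
    return math.log(1.0 - T) / math.log(1.0 - h)

def threshold_distance(h, T):
    """d_T as given on the slide: sqrt(log(1 - T) / log(1 - h) + 0.25)."""
    return math.sqrt(cells_for_prob(h, T) + 0.25)

h = 0.01   # density of dots in the matrix (illustrative value)
T = 0.05   # tolerated chance of a spurious neighbor (illustrative value)
print(cells_for_prob(h, T), threshold_distance(h, T))
```

Raising T enlarges the neighborhood and therefore the threshold distance, matching the slide's remark that T behaves like a gap parameter.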

neighborhood geometry

blocks of nearest neighbors

block statistics
Chen-Stein Theorem: number of blocks with k nodes is approximately Poisson
Expected number of blocks = $c\,p_u$
Conservative matrix-wide p-value: $\mathrm{Prob}(X \geq k) < 1 - e^{-c\,p_u}$
where $c$ is the # of cells in the matrix and $p_u = h(nh)^{k-1}$
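A small Python sketch of the bound on this slide, using the slide's symbols c, h, n and k. The function name `block_pvalue_bound` and the example values are illustrative assumptions; the formulas themselves are taken from the slide.

```python
import math

def block_pvalue_bound(c, h, n, k):
    """Return (expected number of blocks, upper bound on the matrix-wide p-value).

    c: number of cells in the matrix
    h: density of nodes (# nodes / # cells)
    n: number of cells in a neighborhood
    k: number of nodes in the block
    """
    p_u = h * (n * h) ** (k - 1)
    expected = c * p_u
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    bound = -math.expm1(-expected)
    return expected, bound

for k in (2, 3, 4, 5):
    expected, bound = block_pvalue_bound(c=10_000, h=0.01, n=5, k=k)
    print(k, expected, bound)
```

When n*h is below 1, each extra node multiplies the expected count by n*h, so the bound shrinks geometrically with block size and no permutation test is needed to evaluate it.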

identifying blocks
Let edge from i to j have weight $w_{ij} = 1$
Initialize: score of block terminating at i, $S_i = 0$
Recursion for block scores: $S_j = \max(S_i + w_{ij})$ $\forall i$ such that $j \in T_i$
Dynamic programming can be used to find all maximally extended blocks
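The recursion on this slide can be sketched as a dynamic program over the edge graph. The Python below is an illustration, not the FISH source: it assumes T_i is the set of nodes reachable from i by one edge, that every edge points toward larger x (so sorting nodes by position gives a valid processing order), and that a block is maximally extended when its last node has no outgoing edge. The names `block_scores` and `maximal_blocks` are hypothetical.

```python
def block_scores(nodes, successors):
    """Compute S_j = max(S_i + 1) over all i with an edge i -> j.

    nodes: list of (x, y) dots; successors: dict i -> list of j (the assumed T_i).
    Returns (score, back), where back[j] is the predecessor that gave the best score.
    """
    order = sorted(range(len(nodes)), key=lambda i: nodes[i])
    score = [0] * len(nodes)          # S_i = 0 initially
    back = [None] * len(nodes)
    for i in order:                   # predecessors are finalized before successors
        for j in successors.get(i, ()):
            if score[i] + 1 > score[j]:   # w_ij = 1
                score[j] = score[i] + 1
                back[j] = i
    return score, back

def maximal_blocks(nodes, successors):
    """Trace back from every node that ends a path and cannot be extended."""
    score, back = block_scores(nodes, successors)
    blocks = []
    for j in range(len(nodes)):
        if score[j] > 0 and not successors.get(j):
            path = [j]
            while back[path[-1]] is not None:
                path.append(back[path[-1]])
            blocks.append(path[::-1])
    return blocks

nodes = [(0, 0), (1, 1), (2, 3), (4, 4), (5, 5), (9, 0)]
successors = {0: [1], 1: [2], 2: [3], 3: [4]}
print(maximal_blocks(nodes, successors))
```

With unit weights, a block of k nodes ends with score k - 1, so the score can be fed directly into the block statistics.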

simulation experiment
How often are blocks of size k observed under the null model compared with expectation?
(Table columns: k, obs, stderr, upbound, lowbound)

FISH v
– source code
– compiled executables
– documentation
– sample data
Adjustable parameters (e.g. T)
Reports statistics on blocks
Is fast and memory-efficient

applications
Automated pairwise alignment of genome maps as part of Phytome project
Prediction of gene content in regions of unsequenced genomes
Studies of genome evolution, especially duplication and gene order rearrangement

future work
Biologically motivated neighborhood geometries (ADHoRe)
Non-discrete marker positions (LineUp)
– Genetic versus physical maps
– Map uncertainty
Robustness to deviation from null model (permutation tests)
Extension to homologies among 3 or more segments

Thanks!
Sugata Chakravarty, Peter Calabrese
U.S. National Science Foundation

Ghosts Simillion, Vandepoele, Van Montagu, Zabeau, Van de Peer (2002) PNAS 99, 13627