RECOMBINOMICS: Myth or Reality? Laxmi Parida IBM Watson Research New York, USA.

1 RECOMBINOMICS: Myth or Reality? Laxmi Parida IBM Watson Research New York, USA

2 RoadMap (IBM Computational Biology Center)
1. Motivation
2. Reconstructability (Random Graphs Framework)
3. Reconstruction Algorithm (DSR Algorithm)
4. Conclusion

4 www.nationalgeographic.com/genographic

5 www.ibm.com/genographic

6 Phylogeographic question
- Five-year study, launched in April 2005, to address anthropological questions on a global scale using genetics as a tool.
- Although fossil records fix human origins in Africa, little is known about the great journey that took Homo sapiens to the far reaches of the earth. How did we, each of us, end up where we are?
- Samples from all around the world are being collected, and the mtDNA and Y-chromosome are being sequenced and analyzed.

7 DNA material in use under unilinear transmission (figure labels: 58 million bp; 0.38%; 16,000 bp)

8 Missing information in unilinear transmissions (figure; time axis from past to present)

9 Table Mountain, Cape Town, South Africa

10 Paradigm Shift in Locus & Analysis: using recombining DNA sequences
Why?
- Nonrecombining DNA gives a partial story:
  1. represents only a small part of the genome
  2. behaves as a single locus
  3. unilinear (exclusively male or female) transmission
- Recombining DNA moves towards more complete information
Challenges
- Computationally very complex
- How to comprehend complex reticulations?

11 RoadMap
1. Motivation
2. Reconstructability (Random Graphs Framework)
3. Reconstruction Algorithm (DSR Algorithm)
4. Conclusion
Reference: L Parida, Pedigree History: A Reconstructability Perspective using Random-Graphs Framework, under preparation.

12 The Random Graphs Framework
GRAPH DEF:
1. Infinite number of vertices arranged in finite-sized rows
2. Edges introduced via a random process across immediate rows
PROPERTIES: address some topological questions
1. First, identify a probability space
2. Then, pose and address specific questions (such as the expected depth of the LCA)

13 The Random Graphs Framework
1. Infinite number of vertices with a specific organization
2. Edges introduced via a random process satisfying specific rules
3. Address some topological questions
   1. Define a probability space
   2. Pose and answer specific questions (such as the expected depth of the LCA)
Wright-Fisher Model
1. Constant population
2. Non-overlapping generations
3. Panmictic

14 The Random Graphs Framework

15 Properties of this Pedigree Graph
1. DAG (Directed Acyclic Graph)
2. |E| = O(|V|) for any finite fragment; sparse graph
   ... Vertex-centric view ...
3. Focus on the flow of genetic material: relevant pedigree graph

16 Pedigree Graph: G_PG(K,N)
- K: number of extant units
- 2N: population size per generation
- Can the model ignore the color of a vertex?

17 Pedigree Graph: G_PG(K,N)
- Forbidden Structure
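As a rough illustration of the graph defined on these slides, the sketch below samples a finite fragment of the relevant pedigree graph for K extant units with 2N vertices per row. The function name `sample_pedigree_fragment`, the encoding of color by index (first N vertices "red", last N "blue"), and the uniform choice of one parent of each color are assumptions made for illustration; the slides do not spell out these details.

```python
import random

def sample_pedigree_fragment(K, N, depth, seed=None):
    """Sample the relevant pedigree graph of K extant units for `depth` rows.

    Assumptions (not from the slides): each row holds 2N vertices,
    indices 0..N-1 are "red" and N..2N-1 are "blue", and every vertex
    draws one red and one blue parent uniformly from the row above.
    Returns a list of rows; row d maps each relevant vertex at depth d
    to its (red_parent, blue_parent) pair at depth d + 1.
    """
    if not 1 <= K <= 2 * N:
        raise ValueError("need 1 <= K <= 2N")
    rng = random.Random(seed)
    current = set(rng.sample(range(2 * N), K))  # the K extant units
    rows = []
    for _ in range(depth):
        edges = {}
        parents = set()
        for v in sorted(current):
            red = rng.randrange(0, N)        # parent of one color
            blue = rng.randrange(N, 2 * N)   # parent of the other color
            edges[v] = (red, blue)
            parents.update((red, blue))
        rows.append(edges)
        current = parents  # only ancestors of extant units stay relevant
    return rows

rows = sample_pedigree_fragment(K=4, N=8, depth=5, seed=1)
for d, edges in enumerate(rows):
    print(f"depth {d}: {len(edges)} relevant vertices, {2 * len(edges)} edges")
```

Every relevant vertex contributes exactly two edges in this sketch, which is one way to see why |E| = O(|V|) holds for any finite fragment.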

18 Probability Space
- Space is non-enumerable
- Uniform probability measure? (WF population)
- Probability of some event F(h) for a fixed depth, h, & take limit:

19 Topological Property of G_PG(K,N)
Least Common Ancestor (LCA) of ALL (K) extant vertices (TMRCA or GMRCA)
- How many LCAs?
- Expected depth of the shallowest LCA

20 Infinite number of LCAs in a G_PG(4,3) instance ... In fact, there exist infinitely many such instances!

21 Topological Property of G_PG(K,N)
Least Common Ancestor (LCA) (TMRCA or GMRCA)
- How many LCAs?
- Expected depth of the shallowest "LCA": a MEASURE OF RECONSTRUCTABILITY
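The "expected depth of the shallowest LCA" can be explored numerically. The sketch below is a Monte Carlo estimate under simplifying assumptions that are not taken from the slides: vertices are colorless, and each vertex draws its parents (two for the pedigree graph, one for a unilinear tree) uniformly and without repetition from the 2N vertices of the row above. The names `shallowest_lca_depth` and `mean_depth` are hypothetical.

```python
import random

def shallowest_lca_depth(K, N, parents_per_vertex, rng, max_depth=100_000):
    """Depth of the first row containing a common ancestor of all K extant units.

    Assumption (not from the slides): vertices are colorless and every
    vertex draws `parents_per_vertex` distinct parents uniformly from the
    2N vertices of the row above.
    """
    if K == 1:
        return 0
    full = (1 << K) - 1
    # masks[v] = bitmask of the extant units that descend from vertex v
    masks = {v: 1 << i for i, v in enumerate(rng.sample(range(2 * N), K))}
    for depth in range(1, max_depth + 1):
        above = {}
        for v, mask in masks.items():
            for p in rng.sample(range(2 * N), parents_per_vertex):
                above[p] = above.get(p, 0) | mask
        if full in above.values():
            return depth
        masks = above
    raise RuntimeError("no common ancestor found within max_depth")

def mean_depth(K, N, parents_per_vertex, trials=200, seed=0):
    """Average shallowest-LCA depth over independent random instances."""
    rng = random.Random(seed)
    total = sum(shallowest_lca_depth(K, N, parents_per_vertex, rng)
                for _ in range(trials))
    return total / trials

print("two parents (pedigree graph):", mean_depth(K=4, N=8, parents_per_vertex=2))
print("one parent (unilinear tree): ", mean_depth(K=4, N=8, parents_per_vertex=1))
```

The bitmask per vertex records which extant units it is an ancestor of, so a common ancestor of all K units is simply a vertex whose mask has all K bits set.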

22 Sexual Reproduction (Genetic Exchange) vs Graph Model: ancestor without ancestry

23 Graph Theory vis-à-vis Population Genetics
1. Graph Theoretic (topological):
   - CA: common ancestor
   - LCA: Least CA or Shallowest CA
   - MRCA: Most Recent CA
   - TMRCA: The MRCA
2. Graph Theoretic + Biology (Genetic Exchange):
   - CAA: common ancestor-&-ancestry
   - LCAA: Least CAA
   - GMRCA: Grand MRCA
Unilinear Transmission

24 Different Models as Subgraphs
mtDNA Tree; NRY Tree; Genetic Exchange Model (ARG)
Pedigree Graph G_PG(K,N): each vertex has 2 parents
1. Red Subgraph G_PTX(K,N), Blue Subgraph G_PTY(K,N): each vertex has 1 parent
2. Mixed Subgraph G_PGE(K,N,M):
   - number of vertices per row is no more than KM
   - each vertex has 1 OR 2 parents
   - M is the number of completely linked segments in each extant unit

25 Different Models: G_PG(4,8); G_PTY(4,8); G_PGE(4,8,2)

26 Different Models as Subgraphs
LCA g GMRCA; LCA h TMRCA; LCA g GMRCA
Pedigree Graph G_PG(K,N)
1. Red Subgraph G_PTX(K,N), Blue Subgraph G_PTY(K,N)
2. Mixed Subgraph G_PGE(K,N,M)

27 G_PGE(K,N,M) h ARG
- Ancestral Recombinations Graph (Griffiths & Marjoram '97)
- Embellish G_PGE(K,N,M) with Genetic Exchanges (GE)
- Each extant unit has M segments
- No vertex with zero ancestral segments (to extant units)

28 Mixed Subgraph G_PGE(K,N,M)
1. Plausible GE assignment?
2. Can G_PGE(K,N,M) go colorless?
   - Yes, through algorithmic subsampling

29 Algorithm: Embellish G_PGE(K,N,M)
1. Assign a sequence, s, to an instance, e.g. s = K, (2K), (2K-7), (2K-15), ...
2. Construct M sequences s_i
   - Each s_i is monotonically decreasing
   - s_i[j] is no bigger than s[j]
3. Associate each s_i with a segment, and each element s_i[j] = k with k randomly selected vertices at depth j
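Steps 2 and 3 of this algorithm can be sketched in a few lines. The code below is only one possible reading: it treats s[j] as the number of vertices at depth j, reads "monotonically decreasing" as non-increasing, starts every s_i at s[0] = K, and draws each s_i[j] uniformly from the allowed range. All of these choices, the function names, and the example sequence `s` are assumptions for illustration; the sketch also does not enforce the condition that no vertex is left with zero ancestral segments.

```python
import random

def construct_segment_sequences(s, M, rng):
    """Build M sequences s_i with s_i[0] = s[0] and s_i[j] <= s[j].

    Assumptions (not from the slides): "monotonically decreasing" is read
    as non-increasing, and each s_i[j] is drawn uniformly from
    1..min(s_i[j-1], s[j]).
    """
    sequences = []
    for _ in range(M):
        seq = [s[0]]  # every extant unit carries every segment
        for j in range(1, len(s)):
            seq.append(rng.randint(1, min(seq[-1], s[j])))
        sequences.append(seq)
    return sequences

def assign_vertices(s, sequences, rng):
    """For each segment i and depth j, pick s_i[j] of the s[j] vertices at random."""
    return [[sorted(rng.sample(range(s[j]), k)) for j, k in enumerate(seq)]
            for seq in sequences]

K, M = 4, 2
s = [K, 2 * K, 2 * K - 1, 2 * K - 3, 3, 1]  # hypothetical row sizes of one instance
rng = random.Random(0)
seqs = construct_segment_sequences(s, M, rng)
carriers = assign_vertices(s, seqs, rng)
for i, seq in enumerate(seqs):
    print(f"segment {i}: s_{i} = {seq}")
```

Here `carriers[i][j]` lists the vertices at depth j that are taken to carry segment i.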

30 Algorithm: Constructing sequences ...

31 "Topological" Definition of LCAA in G_PGE(K,N,M)
- Input: G_PGE(K,N,M) with GE embellishment
- LCAA:
  1. CA in all M subgraphs (trees)
  2. Least such CA
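The two-part definition above translates directly into a small computation: intersect the common ancestors over all M segment trees, then take the shallowest survivor. The sketch below assumes each segment tree is given as a child-to-parent dictionary over shared vertex names, with a separate `depth` map; this representation and the toy example are illustrative assumptions, not the data structures used in the talk.

```python
def ancestors(parent, v):
    """Vertices on the path from v up to the root of one tree (v included)."""
    path = [v]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return set(path)

def lcaa(trees, extant, depth):
    """Least (shallowest) vertex that is a common ancestor in all M trees.

    trees  : list of M dicts mapping child -> parent, one per segment
    extant : the K extant vertices
    depth  : dict mapping vertex -> row (0 for extant units)
    Returns None if no vertex is a common ancestor in every tree.
    """
    common = None
    for parent in trees:
        ca = set.intersection(*(ancestors(parent, v) for v in extant))
        common = ca if common is None else common & ca
    if not common:
        return None
    return min(common, key=lambda v: (depth[v], str(v)))

# Toy instance with K = 3 extant units and M = 2 segment trees.
tree1 = {"a": "x", "b": "x", "c": "y", "x": "r", "y": "r", "r": "g"}
tree2 = {"a": "x", "b": "y", "c": "y", "x": "q", "y": "q", "q": "g"}
depth = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1, "r": 2, "q": 2, "g": 3}

print(lcaa([tree1, tree2], ["a", "b", "c"], depth))  # prints g
```

In this toy instance each tree has its own LCA at depth 2 (`r` and `q`), but the only vertex that is a common ancestor in both trees is `g` at depth 3, so the LCAA sits deeper than either per-segment LCA.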

32 Different Models as Subgraphs
LCAA h GMRCA; LCA h TMRCA; LCAA h GMRCA
Pedigree Graph G_PG(K,N)
1. Red Subgraph G_PTX(K,N), Blue Subgraph G_PTY(K,N)
2. Mixed Subgraph G_PGE(K,N,M)

33 Probability of Instances with Unique LCA/LCAA
Pedigree Graph G_PG(K,N)
1. Red Subgraph G_PTX(K,N), Blue Subgraph G_PTY(K,N)
2. Mixed Subgraph G_PGE(K,N,M)

34 "Topological" Definitions of LCAA
GMRCA h LCAA l LCA & lone pair; TMRCA h LCA; GMRCA h LCAA l LCA & lone node
Pedigree Graph G_PG(K,N)
1. Red Subgraph G_PTX(K,N), Blue Subgraph G_PTY(K,N)
2. Mixed Subgraph G_PGE(K,N,M)

35 Expected Depth E(D) of LCA/LCAA
O(N 2 ); O(K); O(KM)
Pedigree Graph G_PG(K,N)
1. Red Subgraph G_PTX(K,N), Blue Subgraph G_PTY(K,N)
2. Mixed Subgraph G_PGE(K,N,M)

36 RECONSTRUCTABILITY
O(N 2 ); O(K); O(KM)
Pedigree Graph G_PG(K,N)
1. Red Subgraph G_PTX(K,N), Blue Subgraph G_PTY(K,N)
2. Mixed Subgraph G_PGE(K,N,M)

37 Summary: History Reconstruction?
1. Mixed Subgraph models recombinations
   - Only fragments of the chromosome
2. In reality, only a minimal structure (HUD) of the G_PGE(K,N,M) or ARG can be estimated
   - Forbidden structures ...

38 RoadMap
1. Motivation
2. Reconstructability (Random Graphs Framework)
3. Reconstruction Algorithm (DSR Algorithm)
4. Conclusion
References:
- L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008
- L Parida, A Javed, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Minimizing Recombinations in Consensus Networks for Phylogeographic Studies, BMC Bioinformatics 2009

39 INPUT: Chromosomes (haplotypes)
OUTPUT: Recombinational Landscape (Recotypes)

40 Our Approach (flowchart labels: Granularity g; Acceptable p-value? YES / NO; IRiS; Analyze Results; statistical; combinatorial)
Reference: M Mele, A Javed, F Calafell, L Parida, J Bertranpetit and Genographic Consortium, Recombination-based genomics: a genetic variation analysis in human populations, under submission.

41 Preprocess: Dimension reduction via Clustering (figure: clustering diagram with leaves labeled 0 to 24)

42 Analysis Flow (flowchart labels: Granularity g; Acceptable p-value? YES / NO; IRiS; Analyze Results; statistical; combinatorial)

43 p-value Estimation

44 Comparison of the Randomization Schemes

45 SNP Blocks (granularity g=3)

46 Analysis Flow (same flowchart as slide 42)

47 IRiS (Identifying Recombinations in Sequences)
1. Stage haplotypes: use SNP block patterns
2. Segment along the length: infer trees
3. Infer network (ARG)
Outcomes: biological insights; computational insights
Reference: L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008

48 Segmentation
Column index (last digit): 12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345
Segment label per column:  11111111111111111111111111111111111111112222222222222222222222222222222222233333333344444444455555555555555----

49 Segmentation

50 Consensus of Trees

51 Algorithm Design
1. Ensure compatibility of component trees
2. Parsimony model: minimize the number of recombinations

52 Algorithm Design
1. Ensure compatibility of component trees
2. Parsimony model: minimize the number of recombinations
Theorem: The problem is NP-hard.
In practice: no polynomial-time algorithm that guarantees optimality is known, and none exists unless P = NP.

53 DSR Scheme (Dominant-Subdominant-Recombinant)

54 DSR Scheme: Level 1

55-56 DSR Assignment Rules
1. Each row and each column has at most one D; if it has no D, it has at most one S
2. A non-R can have other non-Rs either in its row or in its column, but NOT both
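The two assignment rules can be checked mechanically on a candidate matrix. The sketch below follows one literal reading of the rules as worded on the slides; the matrix encoding (rows of "D", "S", "R", with "." for an empty cell) and the function name `satisfies_dsr_rules` are assumptions for illustration and may differ from how the published algorithm represents its matrix.

```python
NON_R = {"D", "S"}

def satisfies_dsr_rules(matrix):
    """Check a D/S/R assignment against one literal reading of the two rules.

    Assumption (not from the slides): `matrix` is a list of equal-length
    rows whose cells are "D", "S", "R", or "." for an empty cell.
    """
    rows = [list(r) for r in matrix]
    cols = [list(c) for c in zip(*rows)]
    # Rule 1: at most one D per row/column; with no D, at most one S.
    for line in rows + cols:
        if line.count("D") > 1:
            return False
        if line.count("D") == 0 and line.count("S") > 1:
            return False
    # Rule 2: a non-R may share its row or its column with other non-Rs,
    # but not both.
    for row in rows:
        for j, cell in enumerate(row):
            if cell not in NON_R:
                continue
            in_row = sum(c in NON_R for c in row) - 1
            in_col = sum(c in NON_R for c in cols[j]) - 1
            if in_row > 0 and in_col > 0:
                return False
    return True

print(satisfies_dsr_rules(["DRR", "RDS", "RRR"]))  # True
print(satisfies_dsr_rules(["DD", "RR"]))           # False: two Ds in one row
print(satisfies_dsr_rules(["DS", "SR"]))           # False: the D has non-Rs in its row and column
```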

57 DSR Scheme: Level 1

58-59 DSR Scheme: Level 2

60-61 DSR Scheme: Level 3

62 DSR Scheme: Level 4

63 DSR Scheme: Level 5

64 Mathematical Analysis: Approximation Factor
- Greedy DSR Scheme
- Z and Y are computable functions of the input
Reference: L Parida, A Javed, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Minimizing Recombinations in Consensus Networks for Phylogeographic Studies, BMC Bioinformatics 2009

65 Analysis Flow (flowchart labels: Granularity g; Acceptable p-value? YES / NO; IRiS; Analyze Results; statistical; combinatorial)

66 IRiS Output: RECOTYPE (recombination vectors)
    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 R14 ...
s1   1  0  0  0  1  1  1  1  0   0   0   0   1   0 ...
s2   0  1  0  1  1  1  0  1  0   0   1   0   0   0 ...
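A recotype is a binary vector over recombination events, so two samples can be compared position by position. The sketch below uses only the first 14 entries of s1 and s2 shown on the slide (the remainder is elided there). Hamming distance is an illustrative choice of comparison, not necessarily the measure used in the talk, and the helper names are hypothetical.

```python
def hamming(u, v):
    """Number of recombination events present in exactly one of the two samples."""
    if len(u) != len(v):
        raise ValueError("recotypes must have the same length")
    return sum(a != b for a, b in zip(u, v))

def shared_events(u, v):
    """Labels (R1, R2, ...) of the events present in both samples."""
    return [f"R{k + 1}" for k, (a, b) in enumerate(zip(u, v)) if a == 1 and b == 1]

# First 14 entries of the two recotypes shown on the slide.
s1 = [1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
s2 = [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

print(hamming(s1, s2))        # 6
print(shared_events(s1, s2))  # ['R5', 'R6', 'R8']
```

A pairwise distance of this kind is the sort of input a distance-based clustering over recotypes would need.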

67 Quick Sanity Check: Ultrametric Network on RECOTYPES

68 IRiS (Identifying Recombinations in Sequences)
1. Stage haplotypes: use SNP block patterns
2. Segment along the length: infer trees
3. Infer network (ARG)
Outcomes: biological insights; computational insights
Reference: L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008
IRiS software will be released by the end of summer '09 (Asif Javed)

69 What's in a name?
1. Allele-frequency variations between populations are also reflected in the purely recombination-based variations
2. Detects subcontinental divide from short segments
   - based on population-level analysis
3. Detects populations from short segments
   - based on recombination-events analysis
RECOMBIN-OMICS (Jaume Bertranpetit)
RECOMBIN-OMETRICS (Robert Elston)

70 Are we ready for the OMICS / OMETRICS?
1. Allele-frequency variations between populations are also reflected in the purely recombination-based variations
2. Detects subcontinental divide from short segments
   - based on population-level analysis
3. Detects populations from short segments
   - based on recombination-events analysis
Open questions:
- population-specific signals?
- other critical signals?
- anything we didn't already know?

71 Thank you!!
