
1 Approaching the Long-Range Phasing Problem using Variable Memory Markov Chains
Samuel Angelo Crisanto
2015 Undergraduate Research Symposium, Brown University

2 The Issue at Hand
Modern genome sequencing fragments human DNA into pieces short enough to be read by current technology.
The reads are then algorithmically reassembled using overlaps between the fragments.
Humans are diploid: they carry two copies of each chromosome, one inherited from the mother and one from the father.
The assembly process destroys information about which fragment came from which chromosome copy.

3 Relevance
Accurate haplotype phasing is an important next step in genetics:
Autism correlates strongly with the age of the father, but not with the age of the mother.
The study of plant genetics is made more difficult because many plants are polyploid.
Some diseases arise only when multiple SNPs occur on the same strand; without phasing, this is indistinguishable from some mutations coming from the mother and some from the father.

4 Formulating the Problem (1)
Human genomes are differentiated only by a vector of SNPs (single nucleotide polymorphisms), which is substantially smaller than the entire genome.
Infinite sites assumption: the genome is so large that the likelihood of any site being triallelic or more is vanishingly small. Therefore, any SNP comes in only two versions.

5 Formulating the Problem (2)
We can take a person's vector of SNPs and map the more common allele to 0 and the less common allele to 1 (major and minor allele frequency).

Alleles:     Mapping:
ACACTTGCT    010100010
ACAGGTGAT    010010000
Genotype:    010220020

(In the genotype row, 0 = homozygous major, 1 = homozygous minor, 2 = heterozygous.)

6 Formulating the Problem (3)
The input to the algorithm is an m × n matrix of 0s, 1s, and 2s (one genotype row per person).
The output of the algorithm is a 2m × n matrix of haplotypes (two rows per person), as in the sketch below.

Input:  0010120
Output: 0010110
        0010100
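
To make the encoding concrete, here is a minimal Python sketch of how two haplotype rows collapse into one genotype row. It assumes only the 0/1/2 convention above; the function name encode_genotype is ours, not part of any phasing library.

# 0 = homozygous major allele, 1 = homozygous minor allele,
# 2 = heterozygous (phase unknown).
def encode_genotype(hap1: str, hap2: str) -> str:
    """Collapse two 0/1 haplotype strings into one 0/1/2 genotype string."""
    out = []
    for a, b in zip(hap1, hap2):
        out.append(a if a == b else "2")  # equal -> homozygous; else phase is lost
    return "".join(out)

assert encode_genotype("0010110", "0010100") == "0010120"  # the example above

Phasing is the inverse problem: choosing how to resolve each 2 is exactly the ambiguity the algorithm must settle.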

7 Inferring a "Correct" Phasing
The goal of the algorithm is to produce the biologically accurate phasing of the input.
Several heuristics exist, but some assumptions must be made:
Parsimony: phasings with fewer distinct haplotypes tend to be more accurate than phasings with more, because haplotypes are inherited.
Haplotype "blocking": due to linkage disequilibrium, certain sequences of SNPs tend to occur together and are therefore more likely.

8 An Angle of Attack
We can algorithmically and biologically determine these haplotype blocks and use the probability of a sequence occurring in a particular block to impute an ambiguous SNP (a maximum likelihood approach; sketched below).
Any haplotype that is never empirically observed is unlikely to occur, and any haplotype observed only a handful of times is likely due to sequencing error (a way to reduce the state space).
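
A minimal sketch of the maximum-likelihood idea in Python, assuming block boundaries are already known; the name impute and the rarity threshold of one observation are illustrative, not the presentation's exact algorithm.

from collections import Counter

def impute(block_samples, partial):
    """Return the empirically most frequent haplotype block consistent
    with `partial`, where '?' marks an ambiguous SNP."""
    counts = Counter(block_samples)
    # Drop blocks seen only once: likely sequencing error (state-space reduction).
    candidates = {h: c for h, c in counts.items() if c > 1}
    consistent = {h: c for h, c in candidates.items()
                  if all(p in ("?", x) for p, x in zip(partial, h))}
    return max(consistent, key=consistent.get) if consistent else None

# impute(["0101", "0101", "0111", "1100"], "01?1") -> "0101"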

9 Variable Memory
Some inferences can be made with local information:
"The man used a leash to walk the ___"
"…leash to walk the ___"
"…walk the ___"
Some inferences improve when you look further back:
"The Wright Brothers invented the ___"
"…Brothers invented the ___"
"…invented the ___"

10 Formalizing the Problem
We use probability theory to quantify our beliefs about what will happen next:
What is the most likely "next thing" to happen?
How likely is a particular chain of events?
Predictive algorithms: we can observe a sample and calculate empirical probabilities, as sketched below.
How far back should we look in order to make good predictions?
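
As a sketch of what "empirical probabilities" means here, the following Python estimates P(next symbol | preceding k symbols) by counting over a training string; empirical_probs is an illustrative name and the string is invented.

from collections import Counter, defaultdict

def empirical_probs(text: str, k: int):
    """Estimate P(next char | preceding k chars) by counting."""
    counts = defaultdict(Counter)
    for i in range(k, len(text)):
        counts[text[i-k:i]][text[i]] += 1
    return {ctx: {c: n / sum(nxt.values()) for c, n in nxt.items()}
            for ctx, nxt in counts.items()}

probs = empirical_probs("0010110010", k=2)
# probs["01"] is the estimated next-symbol distribution after "01".
# Larger k captures longer-range structure but makes counts sparser,
# which is exactly the trade-off variable memory is meant to manage.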

11 Solving the Phasing Problem
Variable-memory algorithms are well-suited to the problem of long-range haplotype phasing:
They give a natural way to capture blocks of LD, which are of variable length.
Given a sequence with missing characters, what are the most probable missing characters? → What is the most probable phasing of an ambiguous genotype?

12 Representing Variable Memory
We can use a Probabilistic Finite Automaton M = (Q, Σ, τ, γ, π), sketched in code below:
Q : a finite set of states
Σ : a finite alphabet
τ : Q × Σ → Q (transition function)
γ : Q × Σ → [0,1] (next-symbol probability)
π : Q → [0,1] (initial state probability)
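
The tuple translates almost directly into code. A minimal Python sketch, with dictionaries standing in for the three functions (the class and method names are ours):

class PFA:
    def __init__(self, states, alphabet, tau, gamma, pi):
        self.states = states      # Q: finite set of states
        self.alphabet = alphabet  # Σ: finite alphabet
        self.tau = tau            # τ: dict mapping (q, σ) -> next state
        self.gamma = gamma        # γ: dict mapping (q, σ) -> probability of σ at q
        self.pi = pi              # π: dict mapping q -> initial-state probability

    def string_probability(self, q0, s):
        """π(q0) times the product of next-symbol probabilities along s."""
        p, q = self.pi[q0], q0
        for sym in s:
            p *= self.gamma[(q, sym)]
            q = self.tau[(q, sym)]
        return p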

13 Variable Memory Data Structures
The following can all represent a variable memory model in equivalent ways:
L-Order Markov Chains
Probabilistic Suffix Automata
Predictive Suffix Trees

14 Formalizing Long-Range Phasing
Q = {0,1}^n, n = 1…L
Σ = {0,1}
τ : Q × Σ → Q, i.e. {0,1}^n × {0,1} → {0,1}^n
γ : Q × Σ → [0,1], i.e. {0,1}^n × {0,1} → [0,1]
π : Q → [0,1], i.e. {0,1}^n → [0,1]

15 2-Order Markov Chain

16 Probabilistic Suffix Automata

17 Predictive Suffix Trees

18 Applications
We can use these models to do three things (all sketched in code below):
Generate strings: given the suffix of a string, use the transition function and a random number generator to choose a character to append.
Calculate the likelihood of a string: at every position, find the longest relevant suffix and multiply in the probability that the observed next character follows it.
Predict the next character of a string: given the suffix of a string, find the most likely character to follow.
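
A minimal Python sketch of all three uses, assuming the variable-memory model is stored as a map from suffix (context) to next-symbol distribution, i.e. a flattened predictive suffix tree; the model contents are invented for illustration.

import random

model = {"":   {"0": 0.5, "1": 0.5},
         "1":  {"0": 0.8, "1": 0.2},
         "01": {"0": 0.1, "1": 0.9}}

def longest_suffix(s):
    """Longest suffix of s that has a node in the model ('' always does)."""
    for i in range(len(s)):
        if s[i:] in model:
            return s[i:]
    return ""

def predict(s):  # most likely next character
    dist = model[longest_suffix(s)]
    return max(dist, key=dist.get)

def likelihood(s):  # probability of the whole string under the model
    p = 1.0
    for i in range(len(s)):
        p *= model[longest_suffix(s[:i])].get(s[i], 0.0)
    return p

def generate(s, n):  # append n sampled characters to s
    for _ in range(n):
        dist = model[longest_suffix(s)]
        s += random.choices(list(dist), weights=list(dist.values()))[0]
    return s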

19 Remarks
L-order Markov chains are impractical because the number of states explodes for all but small L: over a binary alphabet an L-order chain needs 2^L states, so L = 30 already requires about a billion.
Probabilistic suffix automata are the most compact way to represent variable-order transition probabilities.
Predictive suffix trees are easy to "learn" with an algorithm.

20 Further Remarks
We can use techniques from the analysis of Markov chains to prove properties of PSAs in general.
If a PST "learns" the distribution of some PSA, this equivalence implies that any property proved for the PSA also holds for the PST.
Rigorous proofs are carried out on the underlying PSA, while applications rely on learning the equivalent PST.

23 Some Loose Ends
"Sufficiently similar" implies a notion of distance: Kullback-Leibler divergence (see the sketch below).
"Statistically significant difference" implies a hypothesis test: Fisher's exact test, Pearson's chi-squared test.
Top-down vs. bottom-up construction: top-down implementations are potentially as inefficient as constructing an L-order Markov chain, but more space-efficient in the long run. This can be avoided by only considering strings that occur sufficiently often, with nodes populated by a cursory read of the input string.
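
A sketch of the first loose end: the Kullback-Leibler divergence between a longer context's next-symbol distribution and its parent's, one way to decide whether the two are "sufficiently similar" for the longer context to be pruned. The example distributions and the idea of thresholding are assumptions for illustration, not details from the talk.

import math

def kl_divergence(p, q):
    """D(P || Q) over a shared alphabet; assumes q[c] > 0 wherever p[c] > 0."""
    return sum(pc * math.log(pc / q[c]) for c, pc in p.items() if pc > 0)

parent = {"0": 0.5, "1": 0.5}   # distribution at a short context
child  = {"0": 0.9, "1": 0.1}   # distribution at a longer context
# Keep the longer context only if it changes the prediction enough,
# e.g. kl_divergence(child, parent) > some chosen threshold.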

24 Future Goals
A working implementation of a variant of this algorithm that learns a PST which is an epsilon-close approximation of the PSA generating a long-range haplotype phasing.
A more thorough exploration of the consequences of the "merging nodes" step found in Browning and Browning.
A comparison of the results of a PST with pruning against those of a PST with merging, as in the Browning & Browning long-range phasing algorithm.
An extension of the algorithm to include other phasing desiderata, such as parsimony and identity by descent (IBD).

25 Citations
Ron, Dana, Yoram Singer, and Naftali Tishby. "The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length." Machine Learning 25.2-3 (1996): 117-149.
Browning, Brian L., and Sharon R. Browning. "Efficient Multilocus Association Testing for Whole Genome Association Studies Using Localized Haplotype Clustering." Genetic Epidemiology 31.5 (2007): 365-375.

