Xgram and phylo-grammars A brief intro. What is a phylo-grammar? Combination of: –Phylogenetic likelihood model Tree with branch lengths, t Rate matrix,

1 xgram and phylo-grammars: a brief intro

2 What is a phylo-grammar?
Combination of:
– Phylogenetic likelihood model
    Tree with branch lengths, t
    Rate matrix, R (continuous-time Markov chain)
    Edge probabilities: exp(R*t)
– Stochastic grammar
    Grammar symbols (nonterminals and terminals)
    Production rules (with probabilities)
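The edge probabilities exp(R*t) can be illustrated with a small sketch. This is not xgram code: the Jukes-Cantor rate matrix, the function names, and the truncated-Taylor matrix exponential are all assumptions chosen to keep the example self-contained.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, terms=20, squarings=10):
    """Approximate exp(m) with a truncated Taylor series plus scaling and squaring."""
    n = len(m)
    scale = 2.0 ** squarings
    small = [[x / scale for x in row] for row in m]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes small^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        # undo the scaling: exp(m) = exp(m / 2^s) ^ (2^s)
        result = mat_mul(result, result)
    return result

def jukes_cantor_rates(mu=1.0):
    """Hypothetical 4-state rate matrix: every off-diagonal rate is mu, rows sum to zero."""
    return [[-3.0 * mu if i == j else mu for j in range(4)] for i in range(4)]

def edge_probabilities(rate_matrix, t):
    """Substitution probabilities along a branch of length t: exp(R*t)."""
    return mat_exp([[r * t for r in row] for row in rate_matrix])

P = edge_probabilities(jukes_cantor_rates(1.0), 0.1)
```

Each row of P is a probability distribution over the state at the child end of the branch, given the state at the parent end.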

3 Grammars and dependencies
Nested dependencies (context-free; Chomsky): TIR
Cross-serial dependencies (“mildly” context-sensitive; Joshi): LTR
Adjacent dependencies (regular/HMMs): microsatellite, e.g. ATATATATATATATATATTAT
Fuzzy duck, fuzzy duck, duckie fuzz, duckie fuzz, duckie fuzz…

4 PFOLD: Knudsen & Hein, 1999
    S -> LS | L
    F -> dFd | LS
    L -> s | dFd
dd: s: [slide figures not preserved]

5 Codon evolution and exon models Goldman & Kosiol, MS in prep

6 EvoGene (Pedersen & Hein)
cf. Exoniphy (Siepel & Haussler), whose “null model” is considerably more sophisticated (context-dependent substitutions are explicitly modeled, yielding higher-order dependence between noncoding bases)

7 PHASTCONS phylo-HMM

8 xgram command line usage

9 xgram examples
Load alignment from file “align.stk”; load grammar from file “grammar.eg”; estimate tree by neighbor-joining (if there isn’t a tree annotated to the alignment already); do CYK algorithm (or Viterbi, as appropriate); annotate alignment; print to standard output:
    xgram align.stk -g grammar.eg
Load alignment from file “align2.stk”; load grammar from file “grammar2.eg”; optimize branch lengths of tree; do EM by iterating Inside-Outside algorithm (or Forward-Backward); save grammar to file “trained.eg”; print log messages to level 5:
    xgram align2.stk -g grammar2.eg -b -t trained.eg --noannotate -log 5

10 xgram grammar elements
Alphabet
– Tokens
– Complementarity
– Degeneracies
Grammar
– Chains
    Pseudoterminals
    Initial probabilities
    Mutation rates
    Update policy
– Production rules (& nonterminals)
    Emissions (paired emit-null states)
      – Annotation labels (optional)
      – Gap model (optional)
    Transitions (aka “null” states)
    Bifurcations
– Parameters (optional)
    Rate parameters
    Probability parameters

    (alphabet
     (name RNA)
     (token (a c g u))
     (complement (u g c a))
     (extend (to n) (from a) (from c) (from g) (from u))
     (extend (to x) (from a) (from c) (from g) (from u))
     (extend (to t) (from u))
     (extend (to r) (from a) (from g))
     (extend (to y) (from c) (from u))
     (extend (to m) (from a) (from c))
     (extend (to k) (from g) (from u))
     (extend (to s) (from c) (from g))
     (extend (to w) (from a) (from u))
     (extend (to h) (from a) (from c) (from u))
     (extend (to b) (from c) (from g) (from u))
     (extend (to v) (from a) (from c) (from g))
     (extend (to d) (from a) (from g) (from u))
     (wildcard *)
    ) ;; end alphabet RNA

    (grammar
     (name pfold)
     (update-rates 1)
     (update-rules 1)
     (chain
      (update-policy rev)
      (terminal (LNUC RNUC))
      ;; initial probability distribution
      (initial (state (a a)) (prob 0.001167))
      (initial (state (c a)) (prob 0.001806))
      (initial (state (g a)) (prob 0.001058))
      (initial (state (u a)) (prob 0.177977))
      (initial (state (a c)) (prob 0.001806))
      (initial (state (c c)) (prob 0.000391))
      (initial (state (g c)) (prob 0.266974))
      (initial (state (u c)) (prob 0.000763))
      (initial (state (a g)) (prob 0.001058))
      …….

11 Types of production rule
    ;; state pfoldS
    (transform (from (pfoldS)) (to (pfoldL)) (prob 0.131488))
    (transform (from (pfoldS)) (to (pfoldB)) (prob 0.868742))
    ;; state pfoldF
    (transform (from (pfoldF)) (to (LNUC pfoldF' RNUC))
     (gaps-ok)
     (annotate (row PFOLD) (column LNUC) (label <))
     (annotate (row PFOLD) (column RNUC) (label >)))
    (transform (from (pfoldF')) (to (pfoldF)) (prob 0.787854))
    (transform (from (pfoldF')) (to (pfoldB)) (prob 0.212421))
    ;; state pfoldL
    (transform (from (pfoldL)) (to (pfoldF)) (prob 0.105404))
    (transform (from (pfoldL)) (to (pfoldU)) (prob 0.895025))
    ;; state pfoldB
    (transform (from (pfoldB)) (to (pfoldL pfoldS)))
    ;; state pfoldU
    (transform (from (pfoldU)) (to (NUC pfoldU')) (gaps-ok))
    (transform (from (pfoldU')) (to ()) (prob 1))
Rule types labeled on the slide: Emit, Null, Null (end), Bifurcate

12 Types of rate matrix
update-policy for a chain can be
– rind, i.e. R(i,j) = R*pi(j)
    Could also implement this with parametric (see below)
– rev, i.e. reversible: pi(i) * R(i,j) = pi(j) * R(j,i)
– irrev, i.e. irreversible (more general)
– parametric, i.e. R(i,j) = f(a,b,c,d,e….)
    (a,b,c,d,e….) are independent parameters
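The relationship between the “rind” and “rev” policies can be checked numerically. The sketch below is an illustration only, not xgram code: the function names are hypothetical, and it assumes that the diagonal entries are set so each row sums to zero. It builds a matrix with R(i,j) = R*pi(j) and tests the reversibility condition pi(i) * R(i,j) = pi(j) * R(j,i).

```python
def rind_rate_matrix(pi, rate=1.0):
    """Build R(i,j) = rate * pi(j) for i != j; diagonal set so each row sums to zero (assumption)."""
    n = len(pi)
    m = [[rate * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        m[i][i] = -sum(m[i])  # diagonal is still 0.0 here, so this sums the off-diagonals
    return m

def is_reversible(pi, m, tol=1e-12):
    """Check detailed balance: pi(i) * R(i,j) == pi(j) * R(j,i) for all i, j."""
    n = len(pi)
    return all(abs(pi[i] * m[i][j] - pi[j] * m[j][i]) <= tol
               for i in range(n) for j in range(n))

# A "rind"-style matrix satisfies detailed balance for any pi.
pi = [0.1, 0.2, 0.3, 0.4]
rind = rind_rate_matrix(pi, rate=2.0)

# A hypothetical irreversible chain: states cycle 0 -> 1 -> 2 -> 0 only.
cyclic = [[-1.0, 1.0, 0.0],
          [0.0, -1.0, 1.0],
          [1.0, 0.0, -1.0]]
uniform = [1.0 / 3] * 3

print(is_reversible(pi, rind), is_reversible(uniform, cyclic))
```

Under these assumptions every “rind” matrix is also a “rev” matrix, and every “rev” matrix is a special case of “irrev”, which matches the increasing generality of the policies listed on the slide.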

13 Annotation & supervised learning
Use the “annotate” element to add annotation lines to the alignment
Add these annotation lines yourself prior to training, to force a particular parse (supervised learning)
Period characters are treated as wildcards (partially supervised learning)
    (transform (from (pfoldF)) (to (LNUC pfoldF' RNUC))
     (gaps-ok)
     (annotate (row PFOLD) (column LNUC) (label <))
     (annotate (row PFOLD) (column RNUC) (label >)))
    #=GC PFOLD................... >>>>.. >>. >>>>>>..>>>>>>.................

14 Context-dependent substitutions
Let P(n) be the probability of column n of an alignment
Context-independent model says that
    P(1..N) = P(1)P(2)P(3)…P(N)
More generally (context-dependence):
    P(1..N) = P(1)P(2|1)P(3|1,2)P(4|1,2,3)…
Siepel & Haussler approximated this by
    P(1..N) ≈ P(1)P(2|1)P(3|2)…P(n|n-1)…
Here P(n|n-1) is obtained from a dinucleotide model for {n-1,n}, using Bayes’ theorem:
    P(n|n-1) = P(n-1,n) / P(n-1)
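The first-order approximation above can be sketched on a toy example. This is a simplification under stated assumptions: a single sequence stands in for alignment columns, the joint table and the two-letter alphabet are invented for illustration, and the function names are hypothetical. The point is only the step P(n|n-1) = P(n-1,n) / P(n-1) followed by the chain product.

```python
def conditional_from_joint(joint):
    """Given joint[(prev, cur)] = P(prev, cur), return (P(prev), P(cur | prev))."""
    marginal = {}
    for (prev, _cur), p in joint.items():
        marginal[prev] = marginal.get(prev, 0.0) + p
    cond = {(prev, cur): p / marginal[prev]
            for (prev, cur), p in joint.items() if marginal[prev] > 0}
    return marginal, cond

def chain_probability(seq, joint):
    """First-order approximation: P(1) * product over n of P(n | n-1)."""
    marginal, cond = conditional_from_joint(joint)
    p = marginal[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= cond[(prev, cur)]
    return p

# Toy dinucleotide model over a two-letter alphabet (values are made up).
joint = {("c", "c"): 0.3, ("c", "g"): 0.1,
         ("g", "c"): 0.2, ("g", "g"): 0.4}

print(chain_probability("cgc", joint))
```

For a sequence of length two, the chain product reduces to the joint probability itself, which is a quick sanity check on the conditioning step.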

15 Context-dependent emit rules
    (transform (from (PREVNUC S)) (to (PREVNUC EMITNUC S')))
Here {PREVNUC,EMITNUC} are the pseudoterminals for a dinucleotide chain
NB: context-dependent substitution models are generally irreversible (e.g. CpG -> TpG)

16 Parametric models Any rate or probability in a grammar can be replaced by a parametric function This is useful to constrain models e.g. PHASTCONS phylo-HMM

17 How the “-length” argument works
DP is an iteration over subsequences
In long (e.g. genomic) alignments, you can save time by not considering all subseqs
– e.g. you probably don’t expect bases over 1 Mb apart to be paired
The -length command-line argument allows you to limit the maximum length of subseqs that will be iterated over
– All suffix subseqs are always included, however, so there is always a valid global parse tree
Care must be taken with the design of grammars, particularly if you want to find “local” features
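The effect of a length limit on the DP iteration can be sketched as follows. This is an illustration of the idea described on the slide, not xgram's actual iteration order or data structures; the function name and the half-open (start, end) convention are assumptions.

```python
def subsequences(seq_len, max_len=None):
    """List the (start, end) half-open intervals a length-limited DP would visit.

    Intervals longer than max_len are skipped, except suffixes (end == seq_len),
    which are always kept so that a parse of the whole sequence remains reachable.
    """
    visited = []
    for start in range(seq_len, -1, -1):        # inside-style order: later starts first
        for end in range(start, seq_len + 1):
            length = end - start
            if max_len is None or length <= max_len or end == seq_len:
                visited.append((start, end))
    return visited

full = subsequences(10)
limited = subsequences(10, max_len=3)
print(len(full), len(limited))
```

Without a limit the number of subsequences grows quadratically with alignment length; with a limit it grows roughly linearly, which is where the time saving comes from.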

18 Speed tips
Use “-length”
– Also “minlen” and “maxlen” in grammar
Replace emit loops with bifurcations
– E.g. instead of “S -> x S” use “S -> X S; X -> x” and limit “maxlen” for X to 1
– NB this goes against the “standard” SCFG dogma of minimizing bifurcations; the reason is that with big trees, emissions become more expensive
Turn off logging once the model is debugged

19 DART logging and dartlog.pl

20 Model debugging tips
Minimal test cases to reproduce errors
Use logging during model development
– Examine source code for log messages
– E.g. “-log CYK_MATRIX”
Use Makefiles for reproducibility
To protect against EM getting stuck in local optima, try training a low-dimensional model first (e.g. using “rind”), then move to models with more degrees of freedom (rev -> irrev)

21 Perl xgram modules
In dart/perl
– Stockholm.pm
    Stockholm alignment class. Pretty basic
– DartSexpr.pm
    Class for working with S-expressions
– PhyloGram.pm
    Subclass of DartSexpr for phylo-grammars
    Has accessors/helpers for various common tasks, e.g. creating & populating new chains, emit rules, etc.
    Subclasses: DNA.pm and Protein.pm
    Related class: Chain.pm
– Not much documentation (but see first few lines of each file for examples (or bug me))

22 Rudimentary indel models
Highly experimental
Attempt to deal with gaps more intelligently than just ignoring them
Falls somewhat short of a full “statistical alignment” treatment
    (transform (from (S)) (to (X S'))
     (gap-model (extend-prob 0.5) (insert-rate 0.01) (delete-rate 0.01)))

