Tree Reconciliation: Notung Reconciliations
Notes on how to map Notung format files to a reconciliation map that can be imported into the TR database

Notung
Notung is a software framework for tree reconciliation using duplication-loss parsimony
Notung can resolve polytomies
Input parameters
  – Cost of duplication
  – Cost of loss
  – Conditional duplication cost
  – Edge weight threshold
Input files
  – Gene tree (NH or NHX)
  – Species tree (NH or NHX)
Output
  – Reconciled gene tree (Newick, NHX, or Notung)
Availability
  –

Notung: File Format
The Notung format is a modified NHX file format
  – All nodes in the species tree and the gene tree are named
  – Species tree node names are used to map gene tree nodes to the species tree
  – Losses are encoded in the leaf node name using species tree node ids: Taxon*LOST
    Murinae*LOST[&&NHX:S=Murinae]
  – NHX tags map gene tree nodes to the species tree
  – The reconciled tree is on a single line terminated by a semicolon
  – The file includes the species tree
    [&&NOTUNG-SPECIES-TREE(((human,pan)Homo/Pan/Gorillagroup,mac)Catarrhini,(mouse,rat)Murinae)Euarchontoglires]
  – The file includes metadata about program parameters
    [&&NOTUNG-PARAMETERS:T=90.0:VERSION=2.6:CL=1.0:CD=1.5:CCD=0.0]

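The two trailing blocks described above can be pulled out with regular expressions. The following is a minimal Python sketch, not part of the original notes; the function name parse_notung_metadata is hypothetical, and it assumes each block is closed by the first "]" after its keyword, as in the example shown on this slide. Parameter values are kept as strings.

```python
import re


def parse_notung_metadata(text):
    """Return (species_tree, parameters) from the trailing Notung blocks."""
    species_tree = None
    parameters = {}

    # [&&NOTUNG-SPECIES-TREE<newick>]: the Newick string follows the keyword.
    match = re.search(r"\[&&NOTUNG-SPECIES-TREE(.*?)\]", text, re.S)
    if match:
        species_tree = match.group(1).strip()

    # [&&NOTUNG-PARAMETERS:KEY=VALUE:KEY=VALUE...]
    match = re.search(r"\[&&NOTUNG-PARAMETERS:(.*?)\]", text, re.S)
    if match:
        for field in match.group(1).split(":"):
            key, _, value = field.partition("=")
            parameters[key.strip()] = value.strip()
    return species_tree, parameters


example = (
    "[&&NOTUNG-SPECIES-TREE(((human,pan)Homo/Pan/Gorillagroup,mac)Catarrhini,"
    "(mouse,rat)Murinae)Euarchontoglires]\n"
    "[&&NOTUNG-PARAMETERS:T=90.0:VERSION=2.6:CL=1.0:CD=1.5:CCD=0.0]\n"
)
species_tree, parameters = parse_notung_metadata(example)
print(species_tree)
print(parameters)
```
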
Notung: Example Reconciliation
[Figure: reconciliation as drawn by the Notung GUI.]

Notung: Example Reconciliation
[Figure: Notung reconciliation drawn in “fat-tree” format. The gene tree (leaves such as human-gene-BA1, pan-gene-A, mac-gene-AA, mouse-gene-A; internal nodes labeled n* and r*) is drawn inside the species tree (human, pan, mac, mouse, rat; Homo/Pan/Gorillagroup, Catarrhini, Murinae, Euarchontoglires). Losses appear as leaves such as rat*LOST, pan*LOST, Murinae*LOST, and Homo/Pan/Gorillagroup*LOST.]

Notung: Example Reconciliation
[Figure: the same fat-tree drawing. Bootstrap values are available for the n* labeled nodes.]

Notung: File Format
(((mouse-gene-A[&&NHX:S=mouse],rat-gene-A[&&NHX:S=rat])n2:56.0[&&NHX:S=Murinae:D=N:B=56.0],((human-gene-A[&&NHX:S=human],pan-gene-A[&&NHX:S=pan])r507[&&NHX:S=Homo/Pan/Gorillagroup:D=N],mac-gene-A[&&NHX:S=mac])n7:70.0[&&NHX:S=Catarrhini:D=N:B=70.0])n8:100.0[&&NHX:S=Euarchontoglires:D=N:B=100.0],(((((((((human-gene-BA4[&&NHX:S=human],human-gene-BA5[&&NHX:S=human])n14:86.0[&&NHX:S=human:D=Y:B=86.0],human-gene-BA6[&&NHX:S=human])n16:78.0[&&NHX:S=human:D=Y:B=78.0],((human-gene-BA1[&&NHX:S=human],human-gene-BA2[&&NHX:S=human])r523[&&NHX:S=human:D=Y],human-gene-BA3[&&NHX:S=human])n21:76.0[&&NHX:S=human:D=Y:B=76.0])r521[&&NHX:S=human:D=Y],(pan-gene-BA1[&&NHX:S=pan],pan-gene-BA2[&&NHX:S=pan])n11:97.0[&&NHX:S=pan:D=Y:B=97.0])r520[&&NHX:S=Homo/Pan/Gorillagroup:D=N],mac-gene-BA[&&NHX:S=mac])n25:73.0[&&NHX:S=Catarrhini:D=N:B=73.0],((human-gene-BB1[&&NHX:S=human],pan*LOST[&&NHX:S=pan])r706[&&NHX:S=Homo/Pan/Gorillagroup],mac-gene-BB2[&&NHX:S=mac])r511[&&NHX:S=Catarrhini:D=N])r510[&&NHX:S=Catarrhini:D=Y],((human-gene-BB2[&&NHX:S=human],pan*LOST[&&NHX:S=pan])r707[&&NHX:S=Homo/Pan/Gorillagroup],mac-gene-BB1[&&NHX:S=mac])r514[&&NHX:S=Catarrhini:D=N])r509[&&NHX:S=Catarrhini:D=Y],(mouse-gene-B[&&NHX:S=mouse],rat*LOST[&&NHX:S=rat])r708[&&NHX:S=Murinae])n35:98.0[&&NHX:S=Euarchontoglires:D=N:B=98.0],((mac-gene-AA[&&NHX:S=mac],Homo/Pan/Gorillagroup*LOST[&&NHX:S=Homo/Pan/Gorillagroup])r709[&&NHX:S=Catarrhini],Murinae*LOST[&&NHX:S=Murinae])r710[&&NHX:S=Euarchontoglires])n37:100.0[&&NHX:S=Euarchontoglires:D=Y:B=100.0])n38:94.0[&&NHX:S=Euarchontoglires:D=Y:B=94.0];
[&&NOTUNG-SPECIES-TREE(((human,pan)Homo/Pan/Gorillagroup,mac)Catarrhini,(mouse,rat)Murinae)Euarchontoglires]
[&&NOTUNG-PARAMETERS:T=90.0:VERSION=2.6:CL=1.0:CD=1.5:CCD=0.0]

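As a rough illustration of how the cost parameters relate to a file like the one above, the sketch below counts duplication nodes (tagged D=Y) and loss leaves (names ending in *LOST) in a fragment of the tree and weights them by CD and CL. It is written in Python and is not part of the original notes. It assumes the score is the plain weighted sum of event counts, which is the usual definition of a duplication-loss parsimony score; conditional duplications (CCD) are ignored, and the function name dl_score is hypothetical. The fragment is cut from the end of the example tree and is not a complete Newick string.

```python
def dl_score(tree_text, cost_dup, cost_loss):
    """Count D=Y nodes and *LOST leaves and return (dups, losses, score)."""
    duplications = tree_text.count(":D=Y")
    losses = tree_text.count("*LOST[")
    score = cost_dup * duplications + cost_loss * losses
    return duplications, losses, score


# Fragment taken from the end of the example tree (node n37 and its r710 child).
fragment = (
    "((mac-gene-AA[&&NHX:S=mac],"
    "Homo/Pan/Gorillagroup*LOST[&&NHX:S=Homo/Pan/Gorillagroup])"
    "r709[&&NHX:S=Catarrhini],Murinae*LOST[&&NHX:S=Murinae])"
    "r710[&&NHX:S=Euarchontoglires])"
    "n37:100.0[&&NHX:S=Euarchontoglires:D=Y:B=100.0]"
)

# CD=1.5 and CL=1.0 are the values in the example's NOTUNG-PARAMETERS block.
dups, losses, score = dl_score(fragment, cost_dup=1.5, cost_loss=1.0)
print(dups, losses, score)
```
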
Notung: NHX Tags
Gene tree node tags
  S = Node in the species tree that the gene tree node is mapped to
    – This matches a name used in &&NOTUNG-SPECIES-TREE
  D = Boolean
    – Y = a duplication node: the gene tree node maps to the edge leading up to the species tree node identified by S
    – N = a speciation node: the gene tree node maps onto the node in the species tree identified by S
  B = Bootstrap value
    – Floating-point number (ranges from 0.0 to 100.0)
[&&NOTUNG-PARAMETERS … tags
  T = Edge weight threshold
  VERSION = Notung version used
  CL = Cost of loss
  CD = Cost of duplication
  CCD = Cost of conditional duplications
The file also
  – Includes the species tree
    [&&NOTUNG-SPECIES-TREE(((human,pan)Homo/Pan/Gorillagroup,mac)Catarrhini,(mouse,rat)Murinae)Euarchontoglires]
  – Includes metadata about program parameters
    [&&NOTUNG-PARAMETERS:T=90.0:VERSION=2.6:CL=1.0:CD=1.5:CCD=0.0]

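The tag definitions above suggest one way to turn an annotated tree into rows of a reconciliation map. The following Python sketch is an illustration only, not the TR database import code: the function name reconciliation_rows and the row layout are hypothetical. It scans the text with a regular expression rather than building a tree, treats a name that directly follows ")" as an internal node, and leaves the event empty for nodes without a D tag, since the notes do not say how those should be classified.

```python
import re

# One annotated node: an optional ")" (present for internal nodes), the node
# name, an optional ":branch_length", then the [&&NHX:...] comment.
NODE = re.compile(r"(\)?)([^(),:\[\]]+)(?::([0-9.]+))?\[&&NHX:([^\]]*)\]")


def reconciliation_rows(tree_text):
    """Return one dict per annotated node, in the order nodes appear."""
    rows = []
    for match in NODE.finditer(tree_text):
        closes, name, _length, tag_text = match.groups()
        tags = {}
        for field in tag_text.split(":"):
            key, _, value = field.partition("=")
            tags[key] = value
        if name.endswith("*LOST"):
            event = "loss"
        elif tags.get("D") == "Y":
            event = "duplication"
        elif tags.get("D") == "N":
            event = "speciation"
        else:
            event = None  # no D tag: left unclassified in this sketch
        rows.append({
            "node": name,
            "species": tags.get("S"),
            "event": event,
            "bootstrap": float(tags["B"]) if "B" in tags else None,
            "internal": closes == ")",
        })
    return rows


# Subtree r511 from the example reconciliation.
fragment = (
    "((human-gene-BB1[&&NHX:S=human],pan*LOST[&&NHX:S=pan])"
    "r706[&&NHX:S=Homo/Pan/Gorillagroup],mac-gene-BB2[&&NHX:S=mac])"
    "r511[&&NHX:S=Catarrhini:D=N]"
)
rows = reconciliation_rows(fragment)
for row in rows:
    print(row["node"], row["species"], row["event"])
```
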
Notung: Questions …
Can multiple gene trees be included in a single Notung format file?
How does Notung treat multiple trees with the same parsimony score in its output?
How are unique names preserved for multiple loss events on a single edge?
  – These are leaf nodes, so this may not really matter
How does Notung handle internal nodes not named in the input species tree?

Notung & NHX Parsing
Attempting to parse the NHX tags with the existing BioPerl NHX parser throws an error after parsing the reconciled gene tree:
j-macbook01:scripts jestill$ ./tr_test_species_tree.pl -i sandbox/notung/exercise5_genetree_reconciled_resolvedPolytomies --format nhx
EXCEPTION: Bio::Root::Exception
MSG: Unrecognized, non &&NHX string: >>Euarchontoglires<<; lastevent is )
STACK: Error::throw
STACK: Bio::Root::Root::throw /opt/local/lib/perl5/site_perl/5.8.9//Bio/Root/Root.pm:368
STACK: Bio::TreeIO::nhx::next_tree /opt/local/lib/perl5/site_perl/5.8.9//Bio/TreeIO/nhx.pm:246
STACK: ./tr_test_species_tree.pl:

Notung & NHX Parsing
Trimming the non-standard lines from the bottom of the NHX file does allow the program to parse the output without error:
j-macbook01:scripts jestill$ ./tr_test_species_tree.pl -i sandbox/notung/exercise5_genetree_reconciled_resolvedPolytomiesTrimmed --format nhx
NUM TAXA:25
Taxon Ids: mouse-gene-A rat-gene-A … …. r710 r709 mac-gene-AA Homo/Pan/Gorillagroup*LOST Murinae*LOST
NODES WITH IDS:49
NODES WITH BOOTSTRAP VALUES:11
NODES WITH BRANCH LENGTH VALUES:11

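The trimming step can also be done programmatically before handing the text to a standard NHX parser. Below is a minimal Python sketch of that idea, not the script used above; the function name trim_notung is hypothetical. It assumes the only non-standard content is the bracketed blocks that start with [&&NOTUNG-, and it removes them wherever they occur, so it does not matter whether they sit on their own lines.

```python
import re

# Blocks such as [&&NOTUNG-SPECIES-TREE...] and [&&NOTUNG-PARAMETERS:...].
# Ordinary node annotations start with [&&NHX: and are left untouched.
NOTUNG_BLOCK = re.compile(r"\[&&NOTUNG-[^\]]*\]")


def trim_notung(text):
    """Drop [&&NOTUNG-...] blocks, keeping the NHX tree line(s)."""
    without_blocks = NOTUNG_BLOCK.sub("", text)
    lines = [line.strip() for line in without_blocks.splitlines()]
    return "\n".join(line for line in lines if line) + "\n"


notung_text = (
    "(mouse-gene-B[&&NHX:S=mouse],rat*LOST[&&NHX:S=rat])r708[&&NHX:S=Murinae];\n"
    "[&&NOTUNG-SPECIES-TREE(((human,pan)Homo/Pan/Gorillagroup,mac)Catarrhini,"
    "(mouse,rat)Murinae)Euarchontoglires]\n"
    "[&&NOTUNG-PARAMETERS:T=90.0:VERSION=2.6:CL=1.0:CD=1.5:CCD=0.0]\n"
)
trimmed = trim_notung(notung_text)
print(trimmed, end="")
```

The tree line in this example is a single subtree cut from the reconciliation shown earlier, used only to keep the example short.
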
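Notung & NHX Parsing
A trick to parsing these files is to read the input file line by line and echo each line terminated by a semicolon to a file handle passed to the TreeIO object:
    use FileHandle;
    use Bio::TreeIO;

    # $infile and $format are set earlier in the script (-i and --format).
    # The input handle name was lost from the slide; INFILE is a stand-in.
    open(INFILE, '<', $infile) || die "Can not open $infile";
    my $tree_num = 0;
    while (<INFILE>) {
        chomp;
        # Only the reconciled tree line contains a semicolon
        if (m/(.*)\;/) {
            $tree_num++;
            # Using a pipe to create a new filehandle by echoing the
            # single line of interest... seems sloppy but this
            # works. There is probably a more elegant way to do this
            my $tree_handle = new FileHandle("echo \'$_\' |")
                || die "Can not echo filehandle";
            my $die_msg = "Can not open $format format tree file:\n$infile";
            my $tree_in = new Bio::TreeIO(-fh     => $tree_handle,
                                          -format => $format)
                || die $die_msg;
            while ( my $tree = $tree_in->next_tree ) {
                # Do stuff with the tree
            }
        }
    }
    close(INFILE);
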
Notung and TR Database
Notung produces a format that is different from the PRIME format used by PrimeGSR and TreeBeST.
Is the Notung file format compatible with the existing NHX module in BioPerl, or will a new module be needed?
  – The NHX module of BioPerl CAN be used with some creative parsing of the Notung format