NeXML: A future data exchange standard for phylogenetics
Rutger Vos, University of British Columbia
Introduction (1/7) The problem
Increased automation in evolutionary informatics is hampered by poorly defined “standards”.
Outline:
o Introduction: The problem, EvoInfo interests, This subproject, Nexus issues, Parsing, Extensibility, XML goodies
o Design: Principles, Re-use, Patterns, Inheritance, References
o Implementation: Approach, ERD, Inheritance, Anatomy, Characters, Trees
o Current status: Schema blocks, Parsers & writers, Experiments, To do
o Resources
Introduction (2/7) EvoInfo interests
Addressing interoperability problems by coding our way out of them:
o Syntax: NeXML
o Semantics: CDAO
o Transport: PhyloWS
Introduction (3/7) This subproject’s mission
To create a file format like nexus*, but:
o Fix (some) problems with nexus
o Give access to data at a higher level
o Be extensible
o Expose data to xml goodies
*Maddison, Swofford and Maddison, NEXUS: An Extensible File Format for Systematic Information. Syst. Biol. 46(4):
#NEXUS
BEGIN TAXA;
    DIMENSIONS NTAX=3;
    TAXLABELS taxon_1 taxon_2 taxon_3;
END;
BEGIN CHARACTERS;
    DIMENSIONS NCHAR=2;
    FORMAT DATATYPE=STANDARD GAP=- MISSING=? SYMBOLS="0 1 2";
    MATRIX
        taxon_1 00
        taxon_2 11
        taxon_3 22;
END;
BEGIN TREES;
    TRANSLATE 1 taxon_1, 2 taxon_2, 3 taxon_3;
    TREE Tree1 = ((1:0.12,2:0.12):9.88,3:10.0);
END;
Introduction (4/7) Nexus issues
o No explicit versions
o Nothing ever deprecated
o No public extensions
o Leads to hacks such as ‘mixed’ data, ‘hot comments’
o Phylogenetics post-’80s in private blocks
o Hard/impossible to validate
Introduction (5/7) Parsing plain text versus parsing XML
o Processing nexus data involves lexing + parsing + processing
o XML allows choosing a parser library; the data can then be processed as a structure that hides tokenization issues
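
As a rough illustration of the XML side of this comparison, the sketch below reads a small fragment with Python's standard xml.etree.ElementTree. The element and attribute names (otus, otu, id, label) are hypothetical stand-ins chosen for the example and are not taken from the NeXML schema. The point is only that the parser library takes care of tokenization and hands back a tree structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment; element and attribute names are illustrative only.
DOC = """
<otus id="t1">
  <otu id="o1" label="taxon_1"/>
  <otu id="o2" label="taxon_2"/>
  <otu id="o3" label="taxon_3"/>
</otus>
"""


def taxon_labels(xml_text):
    """Return the label of every <otu> child, in document order."""
    root = ET.fromstring(xml_text)  # the library does the lexing and parsing
    return [otu.get("label") for otu in root.findall("otu")]


print(taxon_labels(DOC))
```

The calling code never inspects quotes, whitespace or delimiters; it only walks elements and attributes.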
Introduction (6/7) Extensibility
An extensible file format should provide the ability to:
o Define new data types that implement described ‘interfaces’
o Attach typed data structures to core types
o Attach custom XML
Introduction (7/7) XML goodies
Large stack of off-the-shelf tools:
o XML parser libraries
o Web service toolkits
o Native XML databases
o Editors / IDEs
o Serialization / data binding tools
Design (1/5) Design principles
o Re-use of prior art
o Follow design patterns
o Referencing
o Verbose and compact representations
Design (2/5) Re-use of prior art
o Generic key/value attachments following Apple’s plist semantics (e.g. key “prior”, value 0.78)
o Trees and networks following graphml
o General file structure following nexus concepts, i.e. blocks that reference each other
Avoid tag soup! Will return to this later…
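
To make the plist idea concrete, here is a minimal sketch using Python's standard plistlib module. It serializes the key “prior” with the value 0.78 as an Apple XML property list and reads it back. The output is Apple's own plist format, shown only to illustrate typed key/value semantics; it should not be read as the syntax NeXML uses for its attachments.

```python
import plistlib

# The key "prior" and value 0.78 come from the slide; the XML produced below
# is Apple's plist format, not NeXML's own attachment syntax.
annotation = {"prior": 0.78}

xml_bytes = plistlib.dumps(annotation)   # serialize as an XML property list
round_trip = plistlib.loads(xml_bytes)   # parse it back; the value stays a float

print(xml_bytes.decode("utf-8"))
print(type(round_trip["prior"]).__name__, round_trip["prior"])
```

The value comes back as a float rather than a string, which is what typed key/value attachments are meant to provide.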
Design (3/5) XML design patterns
o “Declare before use”
o “Metadata first”
o “Venetian blinds”
o Abstract inheritance through extension, concrete inheritance through restriction
Design (4/5) Inheritance
[Diagram: type hierarchy, with arrows labelled “extends” and “restricts”]
o IDTagged (required id attribute)
o Labelled (optional label attribute)
o Annotated (optional dict elements)
o Base (optional base/lang/href attributes)
o AbstractElement (in root schema)
o ConcreteElement (in instance document)
Design (5/5) Referencing
o Elements sometimes refer to other elements, much like in nexus
o In nexml, an element refers to the id of another element through an attribute named after the referenced element; for example, otus="t1" on a characters element points to the otus element whose id is t1
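
One way to picture this referencing convention is a small checker that looks for references pointing at ids that do not exist. The sketch below is an assumption-laden illustration, not part of any NeXML library: the root element name (doc), the trees element and the function name dangling_references are all hypothetical, and namespaces are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only the id/otus attributes on <characters> appear in
# the slides; the root element name and the other elements are assumptions.
DOC = """
<doc>
  <otus id="t1"/>
  <characters id="c1" otus="t1"/>
  <trees id="tr1" otus="t9"/>
</doc>
"""


def dangling_references(xml_text):
    """Return (referring id, attribute name, missing id) for unresolved refs."""
    root = ET.fromstring(xml_text)
    # Map each element name to the set of ids declared under that name.
    ids_by_name = {}
    for elem in root.iter():
        if "id" in elem.attrib:
            ids_by_name.setdefault(elem.tag, set()).add(elem.get("id"))
    problems = []
    for elem in root.iter():
        for attr, value in elem.attrib.items():
            # An attribute named after another element is treated as a reference.
            if attr != "id" and attr in ids_by_name and value not in ids_by_name[attr]:
                problems.append((elem.get("id"), attr, value))
    return problems


print(dangling_references(DOC))
```

Here the trees element refers to an otus id that is never declared, so it is the only entry reported. Attributes whose name matches no id-bearing element are ignored by this sketch.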
Implementation (1/6) Approach
o Schema design
o Community feedback through wiki, , telecon, projects (evoinfo, ppod, MIAPA) etc.
o Development of processors (perl, java, python, c++, VB, JavaScript) in parallel
o Experiments with xml tools (ws, db, data binding tools)
Implementation (2/6) Entity relationships
Implementation (3/6) Inheritance tree for elements
Implementation (4/6) Anatomy of a “block”
<characters id="c1" xsi:type="nex:DnaSeqs" otus="t1">
    desc description …
    Contents…
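
A minimal sketch of how a processor might read the attributes on such a block, using Python's standard xml.etree.ElementTree. The opening tag is the one from the slide, written here as an empty element so that it parses on its own. The xsi namespace is the standard XML Schema instance namespace; the nex namespace URI is a placeholder assumption, and describe_block is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Standard XML Schema instance namespace, needed to look up xsi:type.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Opening tag as on the slide, self-closed so it parses in isolation.
# The nex namespace URI is a placeholder assumption.
BLOCK = (
    '<characters xmlns:xsi="' + XSI + '" '
    'xmlns:nex="http://example.org/nexml-placeholder" '
    'id="c1" xsi:type="nex:DnaSeqs" otus="t1"/>'
)


def describe_block(xml_text):
    """Return the block's element name, id, concrete type and otus reference."""
    elem = ET.fromstring(xml_text)
    xsi_type = elem.get("{%s}type" % XSI)       # e.g. "nex:DnaSeqs"
    local_name = xsi_type.rpartition(":")[2]    # drop the namespace prefix
    return {
        "element": elem.tag,
        "id": elem.get("id"),
        "type": local_name,
        "otus": elem.get("otus"),
    }


print(describe_block(BLOCK))
```

The generic element name (characters) stays the same for every kind of matrix, while xsi:type selects the concrete subtype.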
Implementation (5/6) Character Classes
Data type     Cells             Sequence
DNA           DnaCells          DnaSeqs
RNA           RnaCells          RnaSeqs
Protein       ProteinCells      ProteinSeqs
Standard      StandardCells     StandardSeqs
Continuous    ContinuousCells   ContinuousSeqs
Restriction   RestrictionCells  RestrictionSeqs
Implementation (6/6) Tree Classes
Structure   Int          Float
Tree        IntTree      FloatTree
Network     IntNetwork   FloatNetwork
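
As an illustration of a graphml-style tree, the sketch below converts the Newick string from the nexus example into flat lists of nodes and edges, with edge lengths stored as floats. The tuple layout, the generated node ids (n1, n2, ...) and the function name newick_to_graph are assumptions made for this example, not the NeXML schema. That Int and Float refer to the type of the edge lengths is likewise an assumption.

```python
import itertools


def newick_to_graph(newick):
    """Convert a Newick string into (root_id, nodes, edges).

    nodes: list of (node_id, label or None)
    edges: list of (source_id, target_id, length as float or None)
    Well-formed input is assumed; this is a sketch, not a validating parser.
    """
    text = newick.strip().rstrip(";")
    counter = itertools.count(1)
    nodes, edges = [], []
    pos = 0

    def parse_subtree():
        nonlocal pos
        node_id = "n%d" % next(counter)
        children = []
        if text[pos] == "(":
            pos += 1                          # consume "("
            while True:
                children.append(parse_subtree())
                if text[pos] == ",":
                    pos += 1                  # another sibling follows
                    continue
                pos += 1                      # consume ")"
                break
        start = pos                           # optional node label
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label = text[start:pos] or None
        length = None
        if pos < len(text) and text[pos] == ":":
            start = pos = pos + 1             # optional branch length
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        nodes.append((node_id, label))
        for child_id, child_length in children:
            edges.append((node_id, child_id, child_length))
        return node_id, length

    root_id, _ = parse_subtree()
    return root_id, nodes, edges


root, nodes, edges = newick_to_graph("((1:0.12,2:0.12):9.88,3:10.0);")
print(root)
print(nodes)
print(edges)
```

Because every edge names its source and target explicitly, the same node/edge layout can also describe a network, where a node may have more than one incoming edge.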
Current status (1/4) Schema blocks
Done:
o OTUs
o characters: dna, rna, nucleotide, protein, categorical, continuous, restriction (compact and verbose)
o trees: graphml trees and networks, various edge formats and rootings
Current status (2/4) Parsers and writers
Nexml parsers and writers:
o Mesquite (java NeXML class libraries)
o Bio::Phylo (BioPerl compatible)
o pyNexml (python)
o DAMBE (Visual Basic)
o NCL (C++)
o JavaScript
Current status (3/4) Experiments
o Semantic annotation (CDAO) using SAWSDL
o Scalability:
    o Indexed files in dbxml
    o Created large files from tolweb, rbcl
    o XInclude with tinyseq xml
o REST Web services:
    o ToL service
    o validation service
    o nexml2json, nexus2xml
o Schema inclusion in wsdl
Current status (4/4) To do
o Publish standard
o More restricted vocabulary attachments (e.g. Darwin core, CDAO-mediated terms)
o Substitution model descriptions
o Sets (in progress, using class identifiers)
o Distances
o Splits
Resources
o NeXML Base URL:
o Wiki: /wiki
o Mailing list: /mail
o Issue tracker: /tracker
o SVN repository: /code
o EvoInfo:
o CDAO:
Acknowledgements
o Contributions: Jason Caravas, Mark Holder, Peter Midford, Jeet Sukumaran, Xuhua Xia
o Feedback: wg-evoinfo, pPOD, Wayne Maddison, David Maddison
o Additional funding, support: NESCent, GSoC