Superset Me—Not: Why the JPTS Is Sufficient if You Use Appropriate Layer Validation Alexander (“Sasha”) Schwarzman American Geophysical Union (AGU) JATS-Con.

Presentation transcript:

Superset Me—Not: Why the JPTS Is Sufficient if You Use Appropriate Layer Validation Alexander (“Sasha”) Schwarzman American Geophysical Union (AGU) JATS-Con November 2, 2010

Summary
We built a superset of the NLM Journal Publishing Tag Set (JPTS) in order to enforce business rules, data types, and house style. Having done that, we realized that a JPTS subset could have been sufficient to meet AGU's needs if it were used in conjunction with an appropriate layer validation technology, such as Schematron.
Alexander (“Sasha”) Schwarzman, Superset Me—Not, JATS-Con, Nov 2, 2010

Contents
Why we built a JPTS superset
DTD vs. Schematron
– Attribute values
– Number of element occurrences
– Element position & sequence
– References
Lessons learned

Why we built a JPTS superset
– No generic book model
– Lack of familiarity with Schematron
– Lack of mature tool support (running SVRL was not a viable option in the Production environment)
– Lack of expertise in integrating Schematron with validation against a relational DB
– JATS v2.3: no Compound Keywords; not all content models parameterized

DTD vs. Schematron: Attribute values
Requirement: Article type is required and must be one of three types: a regular article (rga), a correction (cor), or an editorial (edt).
Strict DTD: [declaration not preserved in this transcript]
JPTS: [declaration not preserved in this transcript]

DTD vs. Schematron: Attribute values (cont’d)
XML instance (contains a non-allowed article type): [markup not preserved in this transcript]
Schematron: '…' not allowed, must be 'rga', 'cor', or 'edt'
Schematron message: 'xxx' not allowed, must be 'rga', 'cor', or 'edt'
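
The slide's Schematron markup did not survive in this transcript, so the following is only a minimal Python sketch of the same kind of rule-based check, not the author's schema. It assumes the value lives in an `article-type` attribute on the root element; the function name `check_article_type` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The three article types named in the requirement.
ALLOWED_ARTICLE_TYPES = {"rga", "cor", "edt"}


def check_article_type(xml_text):
    """Return a list of error messages for the root's article-type attribute."""
    root = ET.fromstring(xml_text)
    value = root.get("article-type")  # assumed attribute name
    if value is None:
        return ["'article-type' is required"]
    if value not in ALLOWED_ARTICLE_TYPES:
        return [f"'{value}' not allowed, must be 'rga', 'cor', or 'edt'"]
    return []


print(check_article_type('<article article-type="xxx"/>'))
```

A permissive DTD would accept this instance; the check sits in a separate layer and rejects it with the same wording as the message on the slide.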

DTD vs. Schematron: Number of element occurrences
Requirement: Acknowledgments, if present, must contain exactly one paragraph, except in two journals (journal codes ‘ja’ and ‘rg’), where Acknowledgments must contain two paragraphs.
Strict DTD: [declaration not preserved in this transcript]
JPTS: [declaration not preserved in this transcript]

DTD vs. Schematron: Number of occurrences (cont’d)
XML instance (wrong number of paragraphs; markup not preserved in this transcript): ... jb ... Blah Blah-blah

DTD vs. Schematron: Number of occurrences (cont’d)
Schematron (rule markup not preserved in this transcript):
'…' in '…' must contain exactly two paragraphs
'…' in '…' must contain only one paragraph

DTD vs. Schematron: Number of occurrences (cont’d)
Schematron message: 'ack' in 'jb' must contain only one paragraph
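
As a rough illustration of a context-dependent occurrence rule, here is a Python sketch rather than the Schematron shown on the slide. The element names `ack`, `p`, and `journal-id` and the sample instance are assumptions made for the example; the function name `check_ack` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Journals whose Acknowledgments must contain two paragraphs (from the requirement).
TWO_PARAGRAPH_JOURNALS = {"ja", "rg"}


def check_ack(xml_text):
    """Check the number of <p> children in each <ack>, depending on the journal code."""
    root = ET.fromstring(xml_text)
    journal = root.findtext(".//journal-id", default="").strip()  # assumed location of the code
    errors = []
    for ack in root.iter("ack"):
        count = len(ack.findall("p"))
        if journal in TWO_PARAGRAPH_JOURNALS:
            if count != 2:
                errors.append(f"'ack' in '{journal}' must contain exactly two paragraphs")
        elif count != 1:
            errors.append(f"'ack' in '{journal}' must contain only one paragraph")
    return errors


SAMPLE = """<article>
  <front><journal-meta><journal-id>jb</journal-id></journal-meta></front>
  <back><ack><p>Blah</p><p>Blah-blah</p></ack></back>
</article>"""

print(check_ack(SAMPLE))
```

A DTD content model cannot make the allowed count depend on a value elsewhere in the document, which is why this kind of rule needs a separate validation layer.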

DTD vs. Schematron: Element position & sequence
Requirement: If a journal has subject grouping (ToC category, subset) and the article belongs to a special collection (special section, theme), then the subject grouping information must precede the special collection information.
Strict DTD: <!ELEMENT article-categories (subject-group*, special-collection?) >
JPTS: <!ELEMENT article-categories (subj-group*) >

DTD vs. Schematron: Element position & sequence (cont’d)
XML instance (wrong sequence of subject groups; markup not preserved in this transcript). Subject values in document order: "New Methods and Applications of Earthquake Early Warning", then "Solid Earth"

DTD vs. Schematron: Element position & sequence (cont’d)
Schematron (fragments; the attribute predicates were not preserved in this transcript):
<rule context="article-categories/…
<assert test="not(following-sibling::…
'…' must appear after a ToC Category or a Subset when either is present
Schematron message: … must appear after a ToC Category or a Subset when either is present
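
The rule above is truncated, so this Python sketch only approximates its apparent logic: a special-collection group must not have a subject-grouping group after it. The `subj-group-type` values used here (`toc-category`, `subset`, `special-section`, `theme`) are hypothetical stand-ins, since the actual values were lost, and `check_group_order` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical subj-group-type values; the slide's real values were not preserved.
SUBJECT_GROUPING = {"toc-category", "subset"}
SPECIAL_COLLECTION = {"special-section", "theme"}


def check_group_order(xml_text):
    """Flag special-collection groups that are followed by a subject-grouping group."""
    root = ET.fromstring(xml_text)
    errors = []
    for categories in root.iter("article-categories"):
        types = [g.get("subj-group-type") for g in categories.findall("subj-group")]
        for i, group_type in enumerate(types):
            later = types[i + 1:]  # plays the role of following-sibling::subj-group
            if group_type in SPECIAL_COLLECTION and SUBJECT_GROUPING.intersection(later):
                errors.append(
                    f"'{group_type}' must appear after a ToC Category or a Subset "
                    "when either is present"
                )
    return errors


WRONG = """<article-categories>
  <subj-group subj-group-type="special-section">
    <subject>New Methods and Applications of Earthquake Early Warning</subject>
  </subj-group>
  <subj-group subj-group-type="toc-category"><subject>Solid Earth</subject></subj-group>
</article-categories>"""

print(check_group_order(WRONG))
```

Because JPTS models both kinds of information with the same `subj-group` element, the ordering can only be enforced by inspecting attribute values, which a DTD cannot do.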

DTD vs. Schematron: References
Validating references is a challenge: variety vs. the need to enforce editorial style
Strict DTD:
– Fixed element order, no mixed content
– Punctuation, spacing, face markup added on output
JPTS:
– Lots of elements, any order, mixed content
– Punctuation, spacing, face markup included

DTD vs. Schematron: References (cont’d)
Strict DTD
<!ELEMENT book-standalone-citation ((person-group | string-name), year, source, edition?, (person-group | string-name)?, size?, elocation-id?, publisher-name, publisher-loc) >
<!ATTLIST book-standalone-citation id ID #REQUIRED >

DTD vs. Schematron: References (cont’d)
JPTS
<!ELEMENT mixed-citation (#PCDATA | person-group | string-name | year | source | edition | size | elocation-id | publisher-name | publisher-loc |... |...)* >
<!ATTLIST mixed-citation id ID #IMPLIED publication-type CDATA #IMPLIED >

DTD vs. Schematron: References (cont’d)
Example: Mood, A. M., and F. A. Graybill (1963), Introduction to the Theory of Statistics, 2nd ed., 295 pp., McGraw-Hill, New York.

DTD vs. Schematron: References (cont’d)
XML instance (strict DTD; markup not preserved in this transcript, element content only): Mood A. M. Graybill F. A Introduction to the Theory of Statistics 2nd 295 pp McGraw-Hill New York

DTD vs. Schematron: References (cont’d)
XML instance (JPTS; markup not preserved in this transcript): Mood, A. M., and F. A. Graybill (1963), Introduction to the Theory of Statistics, 2nd ed., 295 pp., McGraw-Hill, New York.

DTD vs. Schematron: References (cont’d)
Schematron can check that all required elements are present and are in the correct sequence (note the required elements, and that edition, if present, follows source):
<!ELEMENT book-standalone-citation ((person-group | string-name), year, source, edition?, (person-group | string-name)?, size?, elocation-id?, publisher-name, publisher-loc) >

DTD vs. Schematron: References (cont’d)
Schematron can check that all required elements are present:
<assert test="(person-group | string-name) and year and source and publisher-name and publisher-loc"> required element missing </assert>
and that the elements are in the correct sequence.
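
The presence test in that assertion can be sketched in Python as follows. This is an illustration rather than the author's implementation: it looks only at direct children, as the XPath child steps do, and the function name `missing_required` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Elements the assertion requires unconditionally.
REQUIRED = ("year", "source", "publisher-name", "publisher-loc")


def missing_required(citation_xml):
    """Return the names of required child elements absent from a citation."""
    citation = ET.fromstring(citation_xml)
    present = {child.tag for child in citation}
    missing = []
    # (person-group | string-name): at least one of the two must be present.
    if not present & {"person-group", "string-name"}:
        missing.append("person-group or string-name")
    missing.extend(name for name in REQUIRED if name not in present)
    return missing


INCOMPLETE = (
    '<mixed-citation publication-type="book-standalone">'
    "<person-group>Mood, A. M., and F. A. Graybill</person-group> "
    "(<year>1963</year>), <source>Introduction to the Theory of Statistics</source>, "
    "<publisher-name>McGraw-Hill</publisher-name>.</mixed-citation>"
)

print(missing_required(INCOMPLETE))
```

Unlike the single "required element missing" message on the slide, this variant names the missing element, which is a design choice made for the sketch.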

DTD vs. Schematron: References (cont’d)
XML instance (JPTS; edition is in the wrong place; markup not preserved in this transcript): Mood, A. M., and F. A. Graybill (1963), 2nd ed., Introduction to the Theory …, 295 pp., McGraw-Hill, New York.

DTD vs. Schematron: References (cont’d)
This Schematron uses the positional predicate [1] to check that year is immediately followed by source (rule shown in fragments; the rest was not preserved in this transcript):
<rule … 'book-standalone']/year">
'…' must be followed by 'source', not by '…'
Schematron message: 'year' must be immediately followed by 'source', not by 'edition'
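
The "immediately followed by" test corresponds to looking at the first following sibling element. Below is a small Python sketch of that idea, assuming a `mixed-citation` whose direct children are the citation elements; `check_year_then_source` is a hypothetical name, and the code is not a reconstruction of the lost rule.

```python
import xml.etree.ElementTree as ET


def check_year_then_source(citation_xml):
    """Check that each <year> child is immediately followed by a <source> sibling."""
    citation = ET.fromstring(citation_xml)
    children = list(citation)  # element children only; mixed-content text is ignored
    errors = []
    for i, child in enumerate(children):
        if child.tag != "year":
            continue
        if i + 1 >= len(children):
            errors.append("'year' must be immediately followed by 'source'")
        elif children[i + 1].tag != "source":
            errors.append(
                "'year' must be immediately followed by 'source', "
                f"not by '{children[i + 1].tag}'"
            )
    return errors


MISPLACED_EDITION = (
    '<mixed-citation publication-type="book-standalone">'
    "<person-group>Mood, A. M., and F. A. Graybill</person-group> "
    "(<year>1963</year>), <edition>2nd</edition> ed., "
    "<source>Introduction to the Theory of Statistics</source>, "
    "<publisher-name>McGraw-Hill</publisher-name>, "
    "<publisher-loc>New York</publisher-loc>.</mixed-citation>"
)

print(check_year_then_source(MISPLACED_EDITION))
```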

DTD vs. Schematron: References (cont’d)
But how to check the sequence of required elements when there might be optional elements interspersed between them? This Schematron checks that the required publisher-name is preceded by the required source, regardless of any optional elements that may occur in between (rule shown in fragments; the rest was not preserved in this transcript):
<rule … 'book-standalone']/publisher-name">
'…' must be preceded by 'source'
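
A "preceded by, at any distance" test amounts to asking whether a given sibling has already been seen. This is a Python sketch of that logic under the same assumptions as before (direct children of a citation element); `check_publisher_after_source` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def check_publisher_after_source(citation_xml):
    """Check that every <publisher-name> has a <source> somewhere before it."""
    citation = ET.fromstring(citation_xml)
    seen_source = False
    errors = []
    for child in citation:
        if child.tag == "source":
            seen_source = True
        elif child.tag == "publisher-name" and not seen_source:
            errors.append("'publisher-name' must be preceded by 'source'")
    return errors


# Optional elements (edition, size) between source and publisher-name are fine.
OK_WITH_OPTIONALS = "<c><source/><edition/><size/><publisher-name/></c>"
WRONG_ORDER = "<c><publisher-name/><source/></c>"

print(check_publisher_after_source(OK_WITH_OPTIONALS))
print(check_publisher_after_source(WRONG_ORDER))
```

Writing one such rule per pair of required elements works but grows quickly, which motivates the regular-expression approach on the next slide.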

DTD vs. Schematron: References (cont’d)
Rick Jelliffe’s approach combines the flexibility of JPTS with the benefits of a DTD-like fixed element order:
– Each element's content is rewritten as a string of its child element names
– The content model is represented as a regular expression
– Schematron checks the string of names against the regex
– Schematron generates an error message if the content does not match the model

DTD vs. Schematron: References (cont’d)
An XML file, e.g., citation-models.xml, specifies structured citation models:
... ((string-name | person-group), year, source, edition, (string-name | person-group)?, size?, elocation-id?, publisher-name, publisher-loc) ...
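
The names-as-string idea can be tried outside Schematron. The Python sketch below is an illustration under stated assumptions, not the talk's XSLT 2.0 implementation: it uses the book-standalone content model from the strict DTD declaration quoted earlier (with `edition?` optional), and the helper names `model_to_regex`, `child_names`, and `check_citation` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content model copied from the strict-DTD declaration quoted in the talk.
BOOK_STANDALONE_MODEL = (
    "((person-group | string-name), year, source, edition?, "
    "(person-group | string-name)?, size?, elocation-id?, "
    "publisher-name, publisher-loc)"
)

# Element names, or one of the content-model operators.
TOKEN = re.compile(r"[A-Za-z][\w.-]*|[()|?*+,]")


def model_to_regex(model):
    """Translate a DTD-style content model into a regex over 'name ' tokens."""
    parts = []
    for token in TOKEN.findall(model):
        if token == ",":
            continue  # a sequence is plain concatenation in a regex
        if token == "(":
            parts.append("(?:")
        elif token in ")|?*+":
            parts.append(token)
        else:
            # Each element name is followed by a space that acts as a delimiter.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def child_names(citation):
    """Rewrite an element's content as a string of its child element names."""
    return "".join(child.tag + " " for child in citation)


def check_citation(citation_xml, model=BOOK_STANDALONE_MODEL):
    """Return an error message if the children do not match the content model."""
    citation = ET.fromstring(citation_xml)
    names = child_names(citation)
    if model_to_regex(model).fullmatch(names):
        return []
    return [f"children '{names.strip()}' do not match model {model}"]


GOOD = (
    '<mixed-citation publication-type="book-standalone">'
    "<person-group>Mood, A. M., and F. A. Graybill</person-group> "
    "(<year>1963</year>), <source>Introduction to the Theory of Statistics</source>, "
    "<edition>2nd</edition> ed., <size>295 pp.</size>, "
    "<publisher-name>McGraw-Hill</publisher-name>, "
    "<publisher-loc>New York</publisher-loc>.</mixed-citation>"
)
BAD = GOOD.replace(
    "<source>Introduction to the Theory of Statistics</source>, <edition>2nd</edition> ed.",
    "<edition>2nd</edition> ed., <source>Introduction to the Theory of Statistics</source>",
)

print(check_citation(GOOD))
print(check_citation(BAD))
```

The mixed content (punctuation and spacing between elements) is ignored by the check, so the instance keeps the JPTS flexibility while its element order is still held to the stricter model.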

DTD vs. Schematron: References (cont’d)
Advantages:
– Document is still DTD-valid
– Mixed content is permitted
– Type-sensitive handling of references is possible
Caveat: requires XSLT 2.0!

Lessons learned
AGU Tag Set + Schematron (200+ checks)
– Ensures data quality
– Ensures markup integrity
– Provides control over production processes
AGU Tag Set is a superset of JPTS
– Based on JPTS
– Uses the same modularization principles
– Can be easily mapped to JPTS
Were we to do this again, we would develop a JPTS subset and a Schematron schema

Lessons learned (cont’d)
Appropriate layer validation
– Even the most “Prussian” DTD can’t enforce all business rules, data types, and house style
– Rules-based checking is needed anyway
– May as well use the “Californian” JPTS (de facto industry standard), adopted by publishers, conversion & composition vendors, archives, etc.
Paradigm shift: the crux of validation shifts from the XML parser to the Schematron engine

Lessons learned (cont’d)
This shift is not without costs:
– Content may be valid to JPTS but make no sense
– Dependency on Schematron for semantic integrity
– Constraints on business partners: they must be Schematron-capable and have tools
– Schematron does not “fix” problems; people do. Processes and procedures must be well-defined

Lessons learned (cont’d)
Writing a simple Schematron is easy; building a complex and efficient one is not:
– Elicit, document, convey, and clarify the requirements
– Ensure Schematron fits into your workflow
– Modularize Schematron
– Ensure that individual Schematron rules aren’t in conflict
– Optimize Schematron performance
– Employ XSLT 2.0
– Test, test, test
– Cultivate Schematron & XSLT 2.0 expertise in-house

Conclusion
What about content that is not like a journal article, e.g., generic (non-NCBI) books and their parts/chapters?
When this deficiency is addressed, the NLM Archiving and Interchange Tag Suite could truly say: “Superset Me—Not!”