Copyright 2010 Inera Incorporated. All Rights Reserved
NLM DTD Flexibility: How and Why Applications of the NLM DTD Vary
Presented by Bruce D. Rosenblum, CEO, Inera Incorporated
Journal Article Tag Suite Conference, 1 November 2010

Remember When…

Scholarly DTDs, Circa 2001
ISO, Elsevier, Elsevier, Elsevier, Elsevier 4.1, Blackwell 2.2, Blackwell 3.0, Blackwell 4.0, Keton, Cadmus, Capital City, Charlesworth, Alden, HighWire, PMC 1.0, AIP, UCP, Wiley, IEEE, Nature, BioOne, U Chicago Press, Cambridge University Press, American Geophysical, American Medical, New England Journal, American Chemical, National Research Canada, Academic Press, Oxford University Press, Academic Press, Springer, Kluwer Academic

Scholarly DTDs 2010
- NLM DTD
- Elsevier DTD
- Springer DTD
- Wiley-Blackwell DTD
- And a few others…
No longer a grand mess, but…
- NLM DTD Suite applications vary
- Specific tagging practices meet publisher-specific requirements

Data and Methodology
- Data from 25 eXtyles and refXpress implementations since 2003
- Not a scientific survey
- However, useful to show NLM DTD usage variations
- Supplier requirements differ from publishers'
  - Suppliers serve multiple publishers who deliver to different platforms

NLM DTD Adoption By Year

| Organization | DTD | Year | Version | Prior XML |
|---|---|---|---|---|
| Publisher 1 | Archive * † | | | No |
| Publisher 2 | Archive | | | No |
| Publisher 3 | Archive | | | No |
| Publisher 4 | Archive | | | No |
| Publisher 5 | Archive | | | Yes |
| Publisher 6 | Publish | | | Yes |
| Publisher 7 | Publish & book | | | No |
| Publisher 8 | Book * | | | No |
| Publisher 9 | Publish | | | No |
| Publisher 10 | Archive | | | No |
| Publisher 11 | Publish | | | No |
| Publisher 12 | Publish | | | No |
| Publisher 13 | Publish | | | Yes |
| Publisher 14 | Publish & book | | | No |
| Publisher 15 | Publish | | | No |
| Publisher 16 | Book | | | No |
| Publisher 17 | Publish | | | No |
| Publisher 18 | Publish | | | No |
| Publisher 19 | Publish * | | | No |
| Publisher 20 | Archive | | | No |
| Publisher 21 | Publish * | | | Yes |
| JATS-con | Authoring | | | Yes |
| Supplier 1 | Publish | | | Yes |
| Supplier 2 | Publish | | | No |
| Supplier 3 | Book | | | Yes |

* Customized version of DTD beyond OASIS-CALS addition
† Upgraded from 1.0 to 3.0 in 2010

Year of DTD Adoption
- Few implementations prior to 2006
  - Mostly related to PMC deposit
- Adoption rate grows in 2006 and later
  - Maturity of version 2.0 in August 2004
  - Greater public awareness by 2006
  - Freely available and modifiable
  - Flexible
  - Not just for life science content
  - More off-the-shelf tool support from NCBI and others
- 3.0 upgrade not automatic; not fully backwards compatible

Prior Markup Experience
- Most had not used full-text XML or SGML
- Driven to NLM DTD for:
  - More modern XML-based workflow
  - Desire for full-text to drive HTML and archive needs
  - PMC deposit
- Those with SGML experience
  - SGML to XML conversion choice
    - Convert existing DTD to XML
    - Adopt NLM DTD

DTD Selection
- Most adopters use Journal Publishing (blue) DTD
- Early adopters chose Archive and Interchange (green) DTD
  - Blue was too restrictive prior to 2.0
  - ISSN optional in green; hosts non-serial publications without modification
- Book DTD use growing in recent years
  - Not as mature as journals, but useful
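
Implementation Characteristics

| Organization | Char Encoding | Math | Tables | List Labels | Ref PCDATA |
|---|---|---|---|---|---|
| Publisher 1 | ISO | MathML | HTML | DROP | DROP |
| Publisher 2 | ISO | Graphic | HTML | DROP | DROP |
| Publisher 3 | Unicode | Graphic | HTML | DROP | KEEP |
| Publisher 4 | ISO | MathML | CALS | DROP | KEEP |
| Publisher 5 | ISO | MathML | HTML | DROP | KEEP |
| Publisher 6 | ISO | TeX | CALS | DROP | KEEP |
| Publisher 7 | Unicode | Graphic | HTML | DROP | DROP |
| Publisher 8 | ISO | Graphic | CALS | KEEP | KEEP |
| Publisher 9 | ISO | MathML | HTML | DROP | DROP |
| Publisher 10 | Unicode | Graphic | HTML | DROP | DROP |
| Publisher 11 | Unicode | MathML | HTML | DROP | DROP |
| Publisher 12 | Unicode | MathML | HTML | KEEP | KEEP |
| Publisher 13 | Unicode | Graphic | CALS | KEEP | KEEP |
| Publisher 14 | Unicode | Graphic | CALS | KEEP | KEEP |
| Publisher 15 | Unicode | Graphic | HTML | DROP | DROP |
| Publisher 16 | Unicode | Graphic | HTML | KEEP | KEEP |
| Publisher 17 | Unicode | Graphic | HTML | DROP | KEEP |
| Publisher 18 | Unicode | Graphic | HTML | DROP | KEEP |
| Publisher 19 | Unicode | Graphic | CALS | KEEP | NA |
| Publisher 20 | Unicode | NA | NA | KEEP | KEEP |
| Publisher 21 | Unicode | MathML | CALS | KEEP | KEEP |
| JATS-con | Unspecified | MathML | HTML | DROP | KEEP |
| Supplier 1 | Unicode | MathML | HTML | KEEP | KEEP |
| Supplier 2 | ISO | TeX | CALS | KEEP | KEEP |
| Supplier 3 | Unicode | MathML+graphic | CALS | KEEP | KEEP |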

Character Encoding
- Most implementations use Unicode entities (e.g., &#x03B2;)
  - Quasi-human readable (unlike UTF-8)
- Some use ISO entities (e.g., &beta;)
  - Most human-readable
  - But a transform is required for HTML
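
The transform mentioned on this slide can be sketched in Python. This is only an illustration, not something shown in the talk: the function name `iso_to_numeric` is hypothetical, and the lookup uses Python's `html.entities.name2codepoint` table, which covers the HTML 4 entity names rather than the full ISO entity sets, so any name outside that table is left unchanged.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def iso_to_numeric(text):
    """Replace named entities such as &beta; with numeric references such as &#x03B2;."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave unknown or XML-predefined names untouched
        return "&#x{:04X};".format(name2codepoint[name])
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)

print(iso_to_numeric("&beta;-carotene &amp; salt"))
```

A production conversion would more likely declare the ISO entity sets in the DTD and let the XML parser resolve them; the regular expression here only makes the two notations easy to compare.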

Generated and Boilerplate Text
- Generated Text: Inconsequential, formulaic, or stereotypical text, punctuation, and formatting omitted from an XML file, which is applied to content by a style sheet when an XML file is rendered
- Boilerplate Text: Inconsequential, formulaic, or stereotypical text, punctuation, and formatting that could have been omitted but which the publisher has chosen to keep in the XML file rather than to generate with a style sheet

NLM DTD Structure
- NLM DTD is flexible
  - Permits generated or boilerplate text
  - Degree varies by tag set
- Green DTD allows greatest degree of boilerplate text
  - Includes the <x> element
- Hypothesis: Flexibility of generated versus boilerplate text increased NLM DTD adoption
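
The generated versus boilerplate distinction can be made concrete with a small Python sketch. Everything in it is an assumption made for illustration: the keyword-group fragments are hand-written and not validated against any version of the tag set, the `<x>` element is assumed to be the Archiving tag set's container for text that could have been generated, and the rendering functions stand in for a real style sheet.

```python
import xml.etree.ElementTree as ET

# Boilerplate form: the label and separators are stored in the XML itself.
BOILERPLATE = ET.fromstring(
    "<kwd-group><title>Keywords</title><x>: </x>"
    "<kwd>XML</kwd><x>; </x><kwd>DTD</kwd></kwd-group>"
)

# Generated form: only the keywords are stored.
GENERATED = ET.fromstring("<kwd-group><kwd>XML</kwd><kwd>DTD</kwd></kwd-group>")

def render_boilerplate(group):
    """Output the text exactly as stored; no style sheet logic is needed."""
    return "".join(group.itertext())

def render_generated(group, label="Keywords", separator="; "):
    """Supply the label and separators at render time, as a style sheet would."""
    keywords = [kwd.text or "" for kwd in group.findall("kwd")]
    return label + ": " + separator.join(keywords)

print(render_boilerplate(BOILERPLATE))
print(render_generated(GENERATED))
```

Both calls produce the same display string. The trade-off is where the label and punctuation live: in the file for the boilerplate form, in the rendering code for the generated form.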

List Labels
- list-type attribute carries format information
- Most publishers don’t keep list label
  - Possibly because HTML excludes list label
  - Books are an exception
- List label useful for discontinuous lists (e.g., items 1 to 4, intervening text, then items 5 to 8)
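
A short Python sketch shows why kept labels help with a discontinuous list. The helper `build_list` and its parameters are hypothetical, and the output is a simplified fragment rather than validated markup.

```python
import xml.etree.ElementTree as ET

def build_list(items, start=1, keep_labels=True):
    """Build a <list> fragment; with keep_labels the numbering is stored in the XML."""
    lst = ET.Element("list", {"list-type": "order"})
    for number, text in enumerate(items, start=start):
        item = ET.SubElement(lst, "list-item")
        if keep_labels:
            ET.SubElement(item, "label").text = f"{number}."
        ET.SubElement(item, "p").text = text
    return lst

# Second half of a list that was interrupted by running text after item 4.
second_half = build_list(["fifth", "sixth", "seventh", "eighth"], start=5)
print(ET.tostring(second_half, encoding="unicode"))
```

With `keep_labels=False` the fragment carries only the `list-type` value, so a renderer that restarts numbering for each list would show items 1 to 4 again.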

Early Reference Models
- Versions 1.0 through 2.3 had the <citation> and <nlm-citation> elements
  - <citation> allowed PCDATA and any element order
  - <nlm-citation> allowed only elements in prescribed order
- No way to restrict PCDATA without enforcing element order
- Problematic when mixing parsed and unparsed references (e.g., gray literature)

Reference Tagging 3.0
- <mixed-citation> and <element-citation>
  - Former allows PCDATA
  - Latter allows only semantic elements
  - Neither prescribes order

Reference Tagging
- Most publishers keep PCDATA
- All suppliers keep PCDATA
- Reasons
  - Less style sheet setup (PDF, HTML, etc.)
  - PCDATA can easily be dropped
  - Suppliers: multiple publisher styles require less setup
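
The point that PCDATA can easily be dropped can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the sample reference and the function name `to_element_citation` are invented for the example, and only text sitting directly inside the citation is removed, so text inside child elements is kept as is.

```python
import copy
import xml.etree.ElementTree as ET

SAMPLE = ET.fromstring(
    '<mixed-citation publication-type="journal">'
    "<name><surname>Smith</surname><given-names>J</given-names></name>. "
    "<article-title>An example title</article-title>. "
    "<source>J Ex</source> <year>2010</year>;<volume>1</volume>:<fpage>15</fpage>."
    "</mixed-citation>"
)

def to_element_citation(mixed):
    """Copy a mixed-citation into an element-citation without its top-level PCDATA."""
    result = ET.Element("element-citation", dict(mixed.attrib))
    for child in mixed:
        clone = copy.deepcopy(child)
        clone.tail = None  # the tail holds the punctuation that follows the element
        result.append(clone)
    return result

print(ET.tostring(to_element_citation(SAMPLE), encoding="unicode"))
```

Going the other way, from element-only markup to punctuated text, needs style rules for every publication type, which fits the slide's point that keeping PCDATA means less style sheet setup.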

PCDATA Correlations
- All element-citation users drop list labels
- Some mixed-citation users drop list labels
- Publishers decide on boilerplate text on a per-element basis, not global all or nothing

Math & Tables by Comp Application

| Organization | Composition Application | Math | Tables |
|---|---|---|---|
| Publisher 8 | 3B2 | Graphic | CALS |
| Publisher 21 | 3B2 | MathML | CALS |
| Publisher 6 | 3B2 | TeX | CALS |
| Supplier 2 | 3B2 | TeX | CALS |
| Publisher 1 | 3B2 | MathML | HTML |
| Supplier 1 | 3B2 | MathML | HTML |
| Publisher 5 | 3B2 & InDesign | MathML | HTML |
| Publisher 11 | Antenna House | MathML | HTML |
| Publisher 4 | Frame | MathML | CALS |
| Publisher 19 | InDesign | Graphic | CALS |
| Publisher 2 | InDesign | Graphic | HTML |
| Publisher 3 | InDesign | Graphic | HTML |
| Publisher 15 | InDesign | Graphic | HTML |
| Publisher 16 | InDesign | Graphic | HTML |
| Publisher 18 | InDesign | Graphic | HTML |
| Publisher 13 | InDesign/Typefi | Graphic | CALS |
| Publisher 14 | InDesign/Typefi | Graphic | CALS |
| Supplier 3 | InDesign/Typefi | MathML+graphic | CALS |
| Publisher 7 | InDesign/Typefi | Graphic | HTML |
| JATS-con | NA | MathML | HTML |
| Publisher 20 | NA | NA | NA |
| Publisher 17 | PDF from Word | Graphic | HTML |
| Publisher 9 | PDF from Word | MathML | HTML |
| Publisher 12 | PDF from Word | MathML | HTML |
| Publisher 10 | Ventura | Graphic | HTML |

Table Markup
- XHTML is default NLM DTD model
- CALS requires DTD modification
  - CALS has cell borders and table groups
  - InDesign & Frame support CALS, but not XHTML tables
  - 3B2 users seem to prefer CALS tables
  - Must be converted to XHTML for online delivery
- Theory: publishers adopt CALS when more appropriate for PDF/print composition systems
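
The CALS to XHTML conversion needed for online delivery can be sketched as follows. This is a deliberately minimal illustration with several assumptions: the function name `cals_to_xhtml` is hypothetical, the input uses un-namespaced CALS element names with a single `tgroup`, and spans, column specifications, borders, and inline markup inside cells are all ignored.

```python
import xml.etree.ElementTree as ET

CALS = ET.fromstring(
    '<table><tgroup cols="2">'
    '<colspec colname="c1"/><colspec colname="c2"/>'
    "<thead><row><entry>Math</entry><entry>Tables</entry></row></thead>"
    "<tbody><row><entry>TeX</entry><entry>CALS</entry></row></tbody>"
    "</tgroup></table>"
)

def cals_to_xhtml(cals_table):
    """Map tgroup/thead/tbody/row/entry onto table/thead/tbody/tr/th or td."""
    html = ET.Element("table")
    tgroup = cals_table.find("tgroup")
    for section in ("thead", "tbody"):
        source = tgroup.find(section)
        if source is None:
            continue
        target = ET.SubElement(html, section)
        cell_tag = "th" if section == "thead" else "td"
        for row in source.findall("row"):
            tr = ET.SubElement(target, "tr")
            for entry in row.findall("entry"):
                ET.SubElement(tr, cell_tag).text = entry.text  # plain text only
    return html

print(ET.tostring(cals_to_xhtml(CALS), encoding="unicode"))
```

A real conversion also has to translate CALS column spans (`namest`/`nameend`), row spans (`morerows`), and border attributes, which is where most of the effort goes.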

Math Markup
- NLM DTD permits MathML, TeX, pointers to graphic files
- MathML is native XML markup, but…
- MathML has limited browser support
  - Firefox is good; Safari is OK; IE has no MathML support
  - Most publishers deliver online math as images
- MathML has limited composition support
  - InDesign does not have native MathML rendering
  - 3B2 native rendering is TeX
- Math model driven by PDF creation requirements

Composition and Hosting

| Organization | Comp Application | Comp Location | Online | PMC |
|---|---|---|---|---|
| Publisher 1 | 3B2 | Outsource | Self-hosted | No |
| Publisher 2 | InDesign | In-House | Self-hosted | Yes |
| Publisher 3 | InDesign | In-House | Self-hosted | Yes |
| Publisher 4 | Frame | In-House | Self-hosted | No |
| Publisher 5 | 3B2 & InDesign | Outsource | HighWire | Yes |
| Publisher 6 | 3B2 | Outsource | HighWire | Yes |
| Publisher 7 | InDesign/Typefi | In-House | Self-hosted | Yes |
| Publisher 8 | 3B2 | Outsource | Self-hosted | No |
| Publisher 9 | PDF from Word | In-House | Self-hosted | Yes |
| Publisher 10 | Ventura | In-House | Self-hosted | No |
| Publisher 11 | Antenna House | In-House | Self-hosted | Yes |
| Publisher 12 | PDF from Word | In-House | Self-hosted | Yes |
| Publisher 13 | InDesign/Typefi | In-House | HighWire | Yes |
| Publisher 14 | InDesign/Typefi | In-House | Self-hosted | No |
| Publisher 15 | InDesign | In-House | Self-hosted | No |
| Publisher 16 | InDesign | In-House | Self-hosted | No |
| Publisher 17 | PDF from Word | In-House | Self-hosted | Yes |
| Publisher 18 | InDesign | In-House | Self-hosted | Yes |
| Publisher 19 | InDesign | In-House | Self-hosted | No |
| Publisher 20 | NA | NA | Self-hosted | No |
| Publisher 21 | 3B2 | In-House | Self-hosted | Some |
| JATS-con | NA | NA | Self-hosted | No |
| Supplier 1 | 3B2 | Supplier | Various | Some |
| Supplier 2 | 3B2 | Supplier | Various | No |
| Supplier 3 | InDesign/Typefi | Supplier | Various | No |

Composition and Online Hosting
- Majority of users
  - Typeset in-house
  - Self-host online version
- PMC delivery requirement for half of users
- However… this correlation may be significant only among organizations that have chosen to create XML in-house
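
The proportions on this slide can be checked with a quick tally of the Composition and Hosting table. The sketch below transcribes three columns of that table by hand; the tuple layout and the function name `tally` are choices made for the example, and the composition location for Publisher 20 and JATS-con is entered as "NA" on the assumption that the single NA in those rows covers both composition columns.

```python
from collections import Counter

# (organization, composition location, online hosting, PMC delivery)
ROWS = [
    ("Publisher 1", "Outsource", "Self-hosted", "No"),
    ("Publisher 2", "In-House", "Self-hosted", "Yes"),
    ("Publisher 3", "In-House", "Self-hosted", "Yes"),
    ("Publisher 4", "In-House", "Self-hosted", "No"),
    ("Publisher 5", "Outsource", "HighWire", "Yes"),
    ("Publisher 6", "Outsource", "HighWire", "Yes"),
    ("Publisher 7", "In-House", "Self-hosted", "Yes"),
    ("Publisher 8", "Outsource", "Self-hosted", "No"),
    ("Publisher 9", "In-House", "Self-hosted", "Yes"),
    ("Publisher 10", "In-House", "Self-hosted", "No"),
    ("Publisher 11", "In-House", "Self-hosted", "Yes"),
    ("Publisher 12", "In-House", "Self-hosted", "Yes"),
    ("Publisher 13", "In-House", "HighWire", "Yes"),
    ("Publisher 14", "In-House", "Self-hosted", "No"),
    ("Publisher 15", "In-House", "Self-hosted", "No"),
    ("Publisher 16", "In-House", "Self-hosted", "No"),
    ("Publisher 17", "In-House", "Self-hosted", "Yes"),
    ("Publisher 18", "In-House", "Self-hosted", "Yes"),
    ("Publisher 19", "In-House", "Self-hosted", "No"),
    ("Publisher 20", "NA", "Self-hosted", "No"),
    ("Publisher 21", "In-House", "Self-hosted", "Some"),
    ("JATS-con", "NA", "Self-hosted", "No"),
    ("Supplier 1", "Supplier", "Various", "Some"),
    ("Supplier 2", "Supplier", "Various", "No"),
    ("Supplier 3", "Supplier", "Various", "No"),
]

def tally(rows):
    """Count the values in each of the three categorical columns."""
    location = Counter(row[1] for row in rows)
    hosting = Counter(row[2] for row in rows)
    pmc = Counter(row[3] for row in rows)
    return location, hosting, pmc

location, hosting, pmc = tally(ROWS)
print(len(ROWS), location["In-House"], hosting["Self-hosted"], pmc["Yes"] + pmc["Some"])
```

Counting "Yes" and "Some" together for PMC delivery is one possible reading of "half of users"; counting only "Yes" gives a slightly lower share.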

Conclusions
- NLM DTD flexibility led to broader adoption
  - Application of DTD can be adjusted to meet needs of specific publishing requirements or tools
- NLM DTD standard facilitates in-house XML implementation
  - Eliminates R&D requirement to create a DTD
  - Customizable off-the-shelf tools available
  - Cost-effective solution for small and medium-size publishers

Questions?
Bruce Rosenblum
Inera Incorporated
+1 (617)