Digital recordkeeping and preservation II Introduction to XML ARK2200 Digital recordkeeping and preservation II 2017 Thomas Sødring thomas.sodring@hioa.no P48-R407 67238287
This session We will develop four different XML-files name.xml person.xml personList.xml book.xml Everyone should have these four files in a directory by the end of the week
The extraction process Records management (variable, 10 years) → Extraction (approx. 5-10 hours) → Long-term preservation (forever)
How could we extract data? Data is stored in tables in the database Needs to be extracted and stored in a 'neutral' format How can we store data taken out of the database Fixed-width file Comma separated values Markup languages
Fixed-width file
10002__Pål_____Solberg___Storgata_______40182Oslo
10002__Thomas__Hansen____Bakken_______1081406Ski
10003__Eli_____Rørvik____Saturnringen__471808Askim
10004__Børre___Andersen__Bekkefaret_____50348Oslo
1-8:ID 9-17:Firstname 18-28:Surname 29-41:Address 42-44:House number 45-48:Zip code 49-58:Town
Fixed-width problems If the data is updated so that the width boundaries don't match any more, then programs reading fixed-width files will have a problem
Fixed-width file with problem
10002__Pål_____Solberg___Storgata_______40182Oslo
10002__Thomas__Hansen____Bakken_______108b1406Ski
10003__Eli_____Rørvik____Saturnringen__471808Askim
10004__Børre___Andersen__Bekkefaret_____50348Oslo
1-8:ID 9-17:Firstname 18-28:Surname 29-41:Address 42-44:House number 45-48:Zip code 49-58:Town
Fixed-width problems There is a dependency between the data, the data boundaries and the program reading the data This dependency is what causes the problems It could perhaps be dealt with by a prologue at the top of the file describing the widths, or by a version number for the structure
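The dependency between data, boundaries and the reading program can be made concrete with a small sketch. The layout below is hypothetical (it is simpler than the one on the slides), and `format_fixed_width` and `parse_fixed_width` are illustrative names, not part of any standard tool. The writer pads values but does not truncate them, which is an assumption about how such a mismatch might arise.

```python
# Hypothetical layout: (field name, width in characters). Not the slide layout.
LAYOUT = [("id", 6), ("firstname", 8), ("surname", 10), ("house_number", 3), ("town", 8)]


def format_fixed_width(record):
    """Write one record as a fixed-width line; over-long values are NOT truncated."""
    return "".join(str(record[name]).ljust(width) for name, width in LAYOUT)


def parse_fixed_width(line):
    """Cut a line at the agreed boundaries and strip the padding."""
    fields, start = {}, 0
    for name, width in LAYOUT:
        fields[name] = line[start:start + width].strip()
        start += width
    return fields


ok = format_fixed_width({"id": "10003", "firstname": "Eli", "surname": "Rørvik",
                         "house_number": "47", "town": "Askim"})
broken = format_fixed_width({"id": "10002", "firstname": "Thomas", "surname": "Hansen",
                             "house_number": "108b", "town": "Ski"})

print(parse_fixed_width(ok))      # every value lands in its own field
print(parse_fixed_width(broken))  # "108b" is one character too wide, so later fields shift
```

The reader has no way of noticing the shift: it still returns a value for every field, only the wrong one.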
Comma separated values Comma separated values, CSV, is a way of delimiting data Normally uses a comma as a delimiter Each row from the database corresponds to a row in the CSV file Each field corresponds to a given column 10002, Pål, Solberg, Storgata, 4, 0182, Oslo
CSV Each line represents a person
– ID, firstname, surname, address, house number, zip code, town
– start
10002, Pål, Solberg, Storgata, 4, 0182, Oslo
10002, Thomas, Hansen, Bakken, 8b, 1406, Ski
10003, Eli, Rørvik, Saturnringen, 47, 1808, Askim
10004, Børre, Andersen, Bekkefaret, 5, 0348, Oslo
– stop
CSV limitations Very dependent on the ordering of fields What happens if a comma appears in data? Typically there is both a field delimiter and a data delimiter: , between fields and “” around the data within a field A comma that appears inside the data delimiters is then treated as data and not as a field separator “10002”, “Pål”, “Solberg”, “Storgata”, “4”, “0182”, “Oslo”
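As a minimal sketch of the field delimiter / data delimiter idea, the snippet below reads one quoted record with Python's standard `csv` module and compares it with a naive split on commas. The address value "Storgata, øvre" is made up for the example so that the data contains a comma, and straight quotes are used because that is what CSV readers expect.

```python
import csv
import io

# One quoted CSV record; the comma inside "Storgata, øvre" is made up for the example.
raw = '"10002","Pål","Solberg","Storgata, øvre","4","0182","Oslo"\n'

naive = raw.strip().split(",")               # splits on every comma, also inside the data
parsed = next(csv.reader(io.StringIO(raw)))  # honours the quotes around each field

print(len(naive), len(parsed))
print(parsed[3])
```

The naive split produces one field too many, while the quote-aware reader keeps the address together.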
Comma separated with errors It may be difficult to detect mistakes and missing information in the file, especially when large files are processed in bulk What happens if a field is missing? Each line represents a person
– ID, firstname, surname, address, house number, zip code, town
– start
10002, Solberg, Pål, Storgata, 4, 0182, Oslo
10002, Thomas, Bakken, 8b, 1406, Ski
10003, Eli, Rørvik, Saturnringen, 47, 1808, Askim
10004, Børre, Andersen, 5, 0348, Oslo
– stop
Row 1: first and last name mixed Row 2: surname missing, so the address becomes the surname Row 4: address missing, so the house number becomes the address
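One simple check an import program could make is to count the fields in each row. The sketch below does this for the four rows above; `rows_with_wrong_field_count` is an illustrative name, and the expected count of seven is taken from the field list on the slide.

```python
import csv
import io

EXPECTED_FIELDS = 7  # ID, firstname, surname, address, house number, zip code, town

data = """10002, Solberg, Pål, Storgata, 4, 0182, Oslo
10002, Thomas, Bakken, 8b, 1406, Ski
10003, Eli, Rørvik, Saturnringen, 47, 1808, Askim
10004, Børre, Andersen, 5, 0348, Oslo
"""


def rows_with_wrong_field_count(text):
    """Return the 1-based numbers of rows that do not have EXPECTED_FIELDS fields."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [number for number, row in enumerate(reader, start=1)
            if len(row) != EXPECTED_FIELDS]


print(rows_with_wrong_field_count(data))
```

The count finds the two rows with a missing field, but it cannot say which field is missing, and the first row (names swapped) passes unnoticed because the number of fields is correct.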
It's about interoperability The two examples with errors are weak, although they are valid Fixed width and CSV work perfectly well in a controlled environment When you need to exchange information between systems it can be difficult to control the quality of the files CSV and fixed-width quickly fail Difficult to handle versions of files
XML If only there were a way to tag the data so that we knew what each field meant XML
Like this? <id> 10002 </id> <surname> Solberg </surname> <address> Storgata </address> <firstname> Pål </firstname> <houseNumber> 4 </houseNumber> <zipCode> 0182 </zipCode> <town> Oslo </town>
XML Suddenly ordering is irrelevant and it's easy to discover if we are missing any fields Additional fields can be added, and the import program can then either ignore them or report them as errors
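A sketch of what this could look like in an import program, using Python's standard `xml.etree.ElementTree`. The `<person>` wrapper element, the extra `<country>` field and the function name `read_record` are assumptions made for the example; the field names are the ones from the previous slide.

```python
import xml.etree.ElementTree as ET

EXPECTED = {"id", "firstname", "surname", "address", "houseNumber", "zipCode", "town"}

# <person> is a hypothetical wrapper; <town> is left out and <country> added on purpose.
xml_text = """<person>
  <surname>Solberg</surname>
  <id>10002</id>
  <firstname>Pål</firstname>
  <address>Storgata</address>
  <houseNumber>4</houseNumber>
  <zipCode>0182</zipCode>
  <country>Norway</country>
</person>"""


def read_record(text):
    """Map element names to values, then report missing and unexpected fields."""
    record = {child.tag: (child.text or "").strip() for child in ET.fromstring(text)}
    missing = EXPECTED - set(record)
    unexpected = set(record) - EXPECTED
    return record, missing, unexpected


record, missing, unexpected = read_record(xml_text)
print(record["firstname"], record["surname"])  # found by name, not by position
print(missing, unexpected)
```

Because each value is looked up by its tag name, the order of the elements does not matter, and both the missing and the additional field can be named in an error report.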
XML as an extraction format XML is a markup language and can be used as an extraction format for transferring information to an archive, as a format for long-term storage, and as an interoperability format A markup language combines text and extra information (metadata) about the text
Why XML for extractions Self descriptive Interoperability Easy to further develop Non proprietary
Markup language The term "markup language" comes from the process of marking a manuscript, where symbolic instructions are added to a manuscript and interpreted when printing GenCode from 1960s Scribe 1980 GML SGML HTML XML xhtml
XML Markup language eXtensible Markup Language Simplifies some aspects of SGML Much more flexible and adaptable than HTML Published by the World Wide Web Consortium (W3C) http://www.w3.org/
A 'suite' of technologies XML Markup language XSD/DTD Defines structure and can be used to validate XSLT Used to present and/or change data XPath / XQuery For searching There are more, but they are out of scope here
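To give a flavour of searching, the sketch below uses the limited XPath subset supported by Python's standard `xml.etree.ElementTree` (it is not a full XPath or XQuery engine). The small `<personList>` document and its values are made up for the example.

```python
import xml.etree.ElementTree as ET

# A made-up document, used only to have something to search in.
people = ET.fromstring("""<personList>
  <person><firstname>Hans</firstname><surname>Hansen</surname><gender>male</gender></person>
  <person><firstname>Eli</firstname><surname>Rørvik</surname><gender>female</gender></person>
</personList>""")

# XPath-style search: the surname of every <person> whose <gender> is "female".
surnames = [element.text
            for element in people.findall("./person[gender='female']/surname")]
print(surnames)
```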
Data / Structure / Presentation [Figure: one book seen three ways] Data: "The XML book", Hans Hansen, "Introduction", "This is a book about XML.", "XML elements and attributes", ..... Structure and validation: a book consists of title, author, chapters and paragraphs Presentation: physical book, eBook, audio book
From an archive perspective [Figure: the same book without the presentation part] Data: "The XML book", Hans Hansen, "Introduction", "This is a book about XML.", "XML elements and attributes", ..... Structure and validation: a book consists of title, author, chapters and paragraphs
XML An XML document is a document that contains data that is marked up in a particular way XML is a meta language for creating different text markup languages used to describe any text It is a tool Used for Noark 4/5 transfer and use of electronic archive packages (OAIS/DIAS) It is an important standard during records management and long term preservation phases Interoperability, extractions, long-term preservation
An XML document Sensible element names make reading and understanding an XML file intuitive An XML document consists of a prologue and a root element (with all the XML content inside it) The root element is also called the document element It is the first element in the XML document Any element or text after the root element's end tag is deemed trash and is reported as an error; only comments, processing instructions and white space may follow it <root> </root>
An XML document An XML document consists of Prologue Document element
An XML Prologue The prologue consists of XML declaration Comments Blank lines Structure validation Processing instructions e.g. style information <?xml version="1.0" encoding="UTF-8"?> <!-- This is an example comment--> <!DOCTYPE arkiv SYSTEM "http://www.kdrs.no/dtd/fonds.dtd"> <?xml-stylesheet href="fonds.css" type="text/css"?>
Document element The document element is also called the root element and all content has this root element as a parent <documentElement> Document element content </documentElement>
A quick example <fonds> <series> <file> </file> </series> </fonds> Document element: <fonds> Document element content: <series> and everything inside it
First practical We will now create our first XML document with a declaration and comment but not the document element <?xml version="1.0" encoding="UTF-8"?> <!-- This is a comment -->
xmlcopyeditor We use a program called xmlcopyeditor to work with XML Can be downloaded from http://xml-copy-editor.sourceforge.net/ Sufficient for an introductory course If you work with XML in an archive institution you will use proprietary XML editing software, which has more scalability and usability We want to learn some basic principles and xmlcopyeditor is fine for that Other XML editors: http://alekdavis.blogspot.com/2009_06_01_archive.html
Tags and elements Understanding the concept of tags is paramount to understanding how to build content in XML documents A tag is delimited by a < and a > Tags normally occur in pairs <name> is a start tag </name> is an end tag The / denotes that the tag closes an already defined start tag <name>Hans</name> is an element
Elements Elements are the foundation for marking up data in XML files and consist of a start tag, content and an end tag <author>Hans Hansen</author> Start tag: <author> Content: Hans Hansen End tag: </author>
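The three parts of an element can be inspected with a parser. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring("<author>Hans Hansen</author>")

print(element.tag)   # the name shared by the start tag and the end tag
print(element.text)  # the content between the two tags

# Putting the three parts back together gives the original element.
start_tag = f"<{element.tag}>"
end_tag = f"</{element.tag}>"
print(start_tag + element.text + end_tag)
```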
Elements The name you choose for the start and end tag should describe the content That is why we say that XML is self-descriptive When we surround data with start and end tags we are 'marking up' data This is one of the reasons why XML is the right format when it comes to preserving data Or why mark-up languages are useful for preservation Suggestions for an element name that can be used to describe a person? <person></person> <name></name> <age></age>
name.xml Where is the prologue? What is the root element? What happens if we add anything after </name>? <?xml version="1.0" encoding="UTF-8"?> <!-- This is a comment --> <name> <firstname>Hans</firstname> <surname>Hansen</surname> </name>
White spaces in data* 'White space' is a common term for space, tab and return The XML parser does nothing with white space in data, rather it leaves it up to the application to interpret it Whitespace handling can be configured in XSD <author> Hans Hansen </author> * handled differently in attributes
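A short sketch of the division of labour between parser and application, using Python's standard `xml.etree.ElementTree`. The extra spaces and the line break in the sample are made up for the example, and collapsing the white space is just one possible choice an application could make.

```python
import xml.etree.ElementTree as ET

author = ET.fromstring("<author>  Hans   Hansen\n</author>")

as_parsed = author.text                  # the parser hands over the white space untouched
collapsed = " ".join(as_parsed.split())  # the application decides to normalise it

print(repr(as_parsed))
print(repr(collapsed))
```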
XML to describe a person We need to record the following information for a person firstname, middlename, surname, social security number and gender How do we describe this in XML?
person.xml <?xml version="1.0" encoding="UTF-8"?> <person> <firstname>Hans</firstname> <middlename>John</middlename> <surname>Hansen</surname> <socialSecurityNumber>01108298649</socialSecurityNumber> <gender>male</gender> </person>
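During an extraction, a document like this would normally be produced by a program rather than typed by hand. The sketch below builds the same `<person>` element from a dictionary that stands in for one database row; `row_to_person` is an illustrative name and the approach is an assumption, not a prescribed extraction method.

```python
import xml.etree.ElementTree as ET

# Stand-in for one row fetched from a database table.
row = {"firstname": "Hans", "middlename": "John", "surname": "Hansen",
       "socialSecurityNumber": "01108298649", "gender": "male"}


def row_to_person(record):
    """Create a <person> element with one child element per column."""
    person = ET.Element("person")
    for name, value in record.items():
        ET.SubElement(person, name).text = value
    return person


xml_text = ET.tostring(row_to_person(row), encoding="unicode")
print(xml_text)
```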
compared to database ... How does what we have just looked at compare to information in a database or to ER-modelling? The root element <person> is an entity, so it corresponds to a table name The element names <firstname> etc. correspond to attributes/columns of a table The content of the elements corresponds to a tuple of data from a table/relation Important to begin to see this parallel with data in a database
But what I wanted was a list of people What happens if we try to add another <person></person> in the same file? Junk after the root element! <personList></personList> This is a classic problem beginners face when learning XML You really have to plan your XML Important to understand in assignment and exam But planning is something we do when we work with XSD
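The problem can be reproduced with any XML parser. In this sketch, using Python's standard `xml.etree.ElementTree`, two `<person>` elements side by side are rejected, while the same two elements inside a `<personList>` root are accepted. The helper name `is_well_formed` and the shortened person records are made up for the example.

```python
import xml.etree.ElementTree as ET

two_roots = ("<person><firstname>Hans</firstname></person>"
             "<person><firstname>Eli</firstname></person>")
wrapped = "<personList>" + two_roots + "</personList>"


def is_well_formed(text):
    """Return True if the parser accepts the text as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(two_roots))  # second <person> comes after the root element
print(is_well_formed(wrapped))    # one root element containing both persons
```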
personList.xml <?xml version="1.0" encoding="UTF-8"?> <personList> <person> <firstname>Hans</firstname> <middlename>John</middlename> <surname>Hansen</surname> <socialSecurityNumber>01108298649</socialSecurityNumber> <gender>male</gender> </person> </personList>
compared to database ... Slightly different now, the root element no longer corresponds to the entity but is an indication of a plural of the entity But there is no requirement in element naming The <person> element still corresponds to an instance of an entity, but now acts as a delimiter of multiple instances of the entity A table row delimiter if you like ...
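The parallel can be sketched by loading a person list into a database table, so that each `<person>` element becomes one row. The example uses Python's standard `sqlite3` and `xml.etree.ElementTree`; the table definition, the reduced set of two columns, the second person and the function name `load_people` are assumptions made to keep the example short.

```python
import sqlite3
import xml.etree.ElementTree as ET

COLUMNS = ["firstname", "surname"]  # reduced set of columns for the example

person_list = """<personList>
  <person><firstname>Hans</firstname><surname>Hansen</surname></person>
  <person><firstname>Eli</firstname><surname>Rørvik</surname></person>
</personList>"""


def load_people(xml_text):
    """Insert one table row per <person> element and return all rows."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE person (firstname TEXT, surname TEXT)")
    for person in ET.fromstring(xml_text).findall("person"):
        values = [person.findtext(column) for column in COLUMNS]
        connection.execute("INSERT INTO person VALUES (?, ?)", values)
    return connection.execute(
        "SELECT firstname, surname FROM person ORDER BY rowid").fetchall()


rows = load_people(person_list)
print(rows)
```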
XML to describe a book Next we want to describe the concept of a book Some basic elements we would expect in a book Book name, authors, chapters A chapter has some elements Chapter title, paragraphs of text We now begin to see that there is a requirement to be able to define structure
book.xml <?xml version="1.0" encoding="UTF-8"?> <book> <author>Hans Hansen</author> <bookTitle>The book about XML</bookTitle> <chapter> <chapterTitle>Introduction</chapterTitle> <paragraph></paragraph> </chapter> <chapterTitle>XML Root element</chapterTitle> </book>
Empty elements Elements that have no data are considered empty <paragraph></paragraph> And can be written as <paragraph/> Empty elements should never be in a Noark 5 extraction Requirement 5.12.5 Metadata items that do not have a value must be excluded from the extraction
Empty elements <?xml version="1.0" encoding="UTF-8"?> <!-- This is a comment --> <book> <author>Hans Hansen</author> <bookTitle>The book about XML</bookTitle> <chapter> <chapterTitle>Introduction</chapterTitle> <paragraph></paragraph> <paragraph/> </chapter> <chapterTitle>XML Root element</chapterTitle> </book>
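A sketch of how an extraction program might drop empty elements before writing the file. This is one interpretation of the requirement, not an official Noark tool, and `remove_empty_elements` is an illustrative name. Note that the function works bottom-up, so a parent that is left with nothing after its empty children are removed is dropped as well.

```python
import xml.etree.ElementTree as ET


def remove_empty_elements(parent):
    """Recursively drop child elements with no text, no children and no attributes."""
    for child in list(parent):
        remove_empty_elements(child)
        has_text = bool((child.text or "").strip())
        if not has_text and len(child) == 0 and not child.attrib:
            parent.remove(child)


book = ET.fromstring("""<book>
  <bookTitle>The book about XML</bookTitle>
  <chapter>
    <chapterTitle>Introduction</chapterTitle>
    <paragraph></paragraph>
    <paragraph/>
  </chapter>
</book>""")

remove_empty_elements(book)
print(len(book.findall(".//paragraph")))   # both forms of the empty element are gone
print(book.findtext("chapter/chapterTitle"))
```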
Element types In XML we have a clear distinction between an element that is a simpleType and one that is a complexType A simpleType contains only data A complexType can contain child elements <name> <firstname>Hans</firstname> <surname>Hansen</surname> </name>
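The distinction can be approximated in code by checking whether an element has child elements. This is a simplified sketch: the terms come from XSD, where an element that carries attributes also counts as a complexType, which the hypothetical helper `element_kind` ignores.

```python
import xml.etree.ElementTree as ET


def element_kind(element):
    """Classify by child elements only (attributes are ignored in this sketch)."""
    return "complexType" if len(element) > 0 else "simpleType"


name = ET.fromstring(
    "<name><firstname>Hans</firstname><surname>Hansen</surname></name>")

print(name.tag, element_kind(name))
for child in name:
    print(child.tag, element_kind(child))
```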
Recap We have introduced XML and looked at XML document Structure and document element Basic rules Elements Definition and types Identified that there is a need for structure, but have not yet defined any particular structure
An XML Prologue The prologue consists of XML declaration Comments Blank lines Structure validation Processing instructions e.g style information <?xml version="1.0" encoding="UTF-8"?> <!-- This is an example comment--> <!DOCTYPE arkiv SYSTEM "http://www.kdrs.no/dtd/fonds.dtd"> <?xml-stylesheet href="fonds.css" type="text/css"?>
Processing instructions Processing instructions are not part of the data in an XML document They provide instructions on how an (external) application should process the XML document, e.g. convert it to another format A processing instruction starts with <? and ends with ?> An example <?xml-stylesheet href="fonds.css" type="text/css"?> The XML file will be formatted according to the CSS formatting instructions specified in the file fonds.css
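A parser passes processing instructions on to the application separately from the element data. The sketch below lists them using Python's standard `xml.dom.minidom`; the empty `<fonds/>` root element is a stand-in for a real document.

```python
from xml.dom import minidom

# A stand-in document: prologue with one processing instruction and an empty root.
source = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="fonds.css" type="text/css"?>
<fonds/>"""

document = minidom.parseString(source)

# Each processing instruction has a target (who it is for) and data (what to do).
instructions = [(node.target, node.data) for node in document.childNodes
                if node.nodeType == node.PROCESSING_INSTRUCTION_NODE]

print(instructions)
print(document.documentElement.tagName)
```

The XML declaration on the first line looks similar but is not reported as a processing instruction.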
Processing instructions XSLT specifies how an XML document will be processed to create a new document An XSLT processor is needed (e.g. Firefox) XML + XSLT → XSLT processor → new document
Process fonds.xml Download the files fonds.xml, fonds.xslt, fonds.css from http://edu.hioa.no/ark2200/current/aids/ CSS and XSLT are not part of this course, this is just to show you what they do
Processing instructions Currently the RM/Archive profession is not that concerned about this, but they will be at some stage But when an archive uses XML, is it to process it or to preserve it? My impression is that the field is very dependent on RDBMS for processing data
XML Criticisms Unnecessary! Too much markup data The file size is larger than necessary Could be a problem where bandwidth is limited or Internet access is expensive
<person> <firstname>Hans</firstname> <surname>Hansen</surname> <alder>45</alder> </person> (95 characters)
(person firstname(Hans) surname(Hansen) alder(45) ) (61 characters)
Advantages Markup is plain text and human readable Data is separated from presentation Independent of system, software and hardware Non-proprietary (Relatively) easy to implement solutions on top of XML Easy to import or export data to a database using XML
XML as an archive format XML used in preservation Transfer to an archive institution for Noark 5 The DIAS standard (national archive package format) is based on XML XML used during recordkeeping Export / Import data to a Noark system BEST-standard for exchange of case documents