Digital recordkeeping and preservation II

Introduction to XML
ARK2200 Digital recordkeeping and preservation II, 2017
Thomas Sødring
P48-R407

This session
We will develop four different XML files:
– name.xml
– person.xml
– personList.xml
– book.xml
Everyone should have these four files in a directory by the end of the week.

The extraction process
– Records management: variable, 10 years
– Extraction: approx. 5-10 hours
– Long-term preservation: forever

How could we extract data?
Data is stored in tables in the database.
It needs to be extracted and stored in a 'neutral' format.
How can we store data taken out of the database?
– Fixed-width file
– Comma-separated values
– Markup languages

Fixed-width file
10002__Pål_____Solberg___Storgata_______40182Oslo
10002__Thomas__Hansen____Bakken_______ Ski
10003__Eli_____Rørvik____Saturnringen__471808Askim
10004__Børre___Andersen__Bekkefaret_____50348Oslo
Column positions: 1-8:ID 9-17:Firstname 18-28:Surname 29-41:Address 42-44:House number 45-48:Zip code 49-58:Town

Fixed-width problems
If the data is updated so that the width boundaries no longer match, programs reading fixed-width files will have a problem.

Fixed-width file with problem
10002__Pål_____Solberg___Storgata_______40182Oslo
10002__Thomas__Hansen____Bakken_______108b1406Ski
10003__Eli_____Rørvik____Saturnringen__471808Askim
10004__Børre___Andersen__Bekkefaret_____50348Oslo
Column positions: 1-8:ID 9-17:Firstname 18-28:Surname 29-41:Address 42-44:House number 45-48:Zip code 49-58:Town

Fixed-width problems
There is a dependency between the data, the data boundaries and the program reading the data.
This could perhaps be dealt with by a prologue at the top of the file describing the widths, or by a version number for the structure.
It is this dependency that causes the problems.
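The dependency between boundaries and reader can be sketched in a few lines of Python. The layout, the writer and the values below are hypothetical and much simpler than the slide's column table; the point is only that a reader with hard-coded boundaries silently misreads a row once one value is too wide.

```python
# Hypothetical layout: (field name, start, end) as 0-based half-open slices.
LAYOUT = [
    ("id", 0, 6),
    ("firstname", 6, 14),
    ("surname", 14, 24),
    ("house_number", 24, 27),
    ("town", 27, 35),
]


def format_row(values):
    """Naive writer: pads each value to its width but never truncates."""
    return "".join(v.ljust(end - start) for v, (_, start, end) in zip(values, LAYOUT))


def parse_row(line):
    """Reader that trusts the hard-coded boundaries in LAYOUT."""
    return {name: line[start:end].strip() for name, start, end in LAYOUT}


good = parse_row(format_row(["10003", "Eli", "Rørvik", "47", "Askim"]))
bad = parse_row(format_row(["10002", "Thomas", "Hansen", "108b", "Ski"]))

print(good["house_number"], good["town"])
# The 4-character house number spills over into the town field.
print(bad["house_number"], bad["town"])
```

No error is raised for the second row; the reader just returns wrong values, which is what makes this kind of fault hard to spot in bulk processing.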

Comma-separated values
Comma-separated values (CSV) is a way of delimiting data.
It normally uses a comma as the delimiter.
Each row from the database corresponds to a row in the CSV file.
Each field corresponds to a given column.
10002, Pål, Solberg, Storgata, 4, 0182, Oslo

CSV
Each line represents a person
– ID, firstname, surname, address, house number, zip code, town
– start
10002, Pål, Solberg, Storgata, 4, 0182, Oslo
10002, Thomas, Hansen, Bakken, 8b, 1406, Ski
10003, Eli, Rørvik, Saturnringen, 47, 1808, Askim
10004, Børre, Andersen, Bekkefaret, 5, 0348, Oslo
– stop

CSV limitations
Very dependent on the ordering of fields.
What happens if a comma appears in the data?
Files typically have both a field delimiter and a data delimiter: , between fields and "" around the data within a field.
A comma that appears inside the quotes is then treated as data rather than as a field separator.
"10002", "Pål", "Solberg", "Storgata", "4", "0182", "Oslo"
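A minimal sketch of the field delimiter versus data delimiter idea, using Python's standard csv module. The address value "Storgata, Øvre" is made up for the illustration so that a comma appears inside the data.

```python
import csv
import io

# Hypothetical row: the address "Storgata, Øvre" contains a comma.
line = '"10002","Pål","Solberg","Storgata, Øvre","4","0182","Oslo"'

naive_fields = line.split(",")                        # ignores the data delimiter
parsed_fields = next(csv.reader(io.StringIO(line)))   # honours the quotes

print(len(naive_fields), len(parsed_fields))
print(parsed_fields[3])
```

Splitting on every comma yields one field too many, while a reader that respects the quotes keeps the address in one piece.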

Comma-separated values with errors
It may be difficult to detect mistakes and missing information in the file, especially when large files are processed in bulk.
What happens if a field is missing?
Each line represents a person
– ID, firstname, surname, address, house number, zip code, town
– start
10002, Solberg, Pål, Storgata, 4, 0182, Oslo
10002, Thomas, Bakken, 8b, 1406, Ski
10003, Eli, Rørvik, Saturnringen, 47, 1808, Askim
10004, Børre, Andersen, 5, 0348, Oslo
– stop
Row 1: first and last name are mixed up.
Row 2: the surname is missing, so the address is read as the surname.
Row 4: the address is missing, so the house number is read as the address.
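One check an import program could make is to count the fields on every line. The sketch below assumes seven fields per person, as in the header line of the slide, and uses the four faulty rows as input; the function name is our own.

```python
import csv
import io

EXPECTED_FIELDS = 7  # ID, firstname, surname, address, house number, zip code, town

DATA = """10002, Solberg, Pål, Storgata, 4, 0182, Oslo
10002, Thomas, Bakken, 8b, 1406, Ski
10003, Eli, Rørvik, Saturnringen, 47, 1808, Askim
10004, Børre, Andersen, 5, 0348, Oslo
"""


def rows_with_wrong_field_count(text):
    """Return the 1-based line numbers whose field count differs from EXPECTED_FIELDS."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [n for n, row in enumerate(reader, start=1) if len(row) != EXPECTED_FIELDS]


suspect_lines = rows_with_wrong_field_count(DATA)
print(suspect_lines)
```

The count catches the two rows with a missing field, but not the first row, where firstname and surname are swapped: that line has the right number of fields, and nothing in the file says which value is which.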

It's about interoperability
The two examples with errors are contrived, but the point they make is valid.
Fixed-width and CSV work perfectly well in a controlled environment.
When you need to exchange information between systems, it can be difficult to control the quality of the files.
CSV and fixed-width quickly fail.
It is difficult to handle versions of files.

XML
If only there were a way to tag the data so that we knew what each field meant ...
XML

Like this?
<id> 10002 </id>
<surname> Solberg </surname>
<address> Storgata </address>
<firstname> Pål </firstname>
<houseNumber> 4 </houseNumber>
<zipCode> 0182 </zipCode>
<town> Oslo </town>

XML
Suddenly ordering is irrelevant, and it is easy to discover whether we are missing any fields.
Additional fields can be added, and the import program can easily ignore them or report them as errors.
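A small sketch of what this means for an import program, using Python's standard xml.etree.ElementTree. The wrapping <person> element, the extra <country> field and the function name are assumptions made for the illustration; the other element names come from the slide.

```python
import xml.etree.ElementTree as ET

EXPECTED = {"id", "firstname", "surname", "address", "houseNumber", "zipCode", "town"}


def read_person(xml_text):
    """Return (fields, missing, unexpected) for one <person> record."""
    root = ET.fromstring(xml_text)
    fields = {child.tag: (child.text or "").strip() for child in root}
    missing = sorted(EXPECTED - set(fields))
    unexpected = sorted(set(fields) - EXPECTED)
    return fields, missing, unexpected


# Surname comes before firstname, <town> is missing, <country> is extra.
record = """<person>
  <id>10002</id>
  <surname>Solberg</surname>
  <firstname>Pål</firstname>
  <address>Storgata</address>
  <houseNumber>4</houseNumber>
  <zipCode>0182</zipCode>
  <country>Norway</country>
</person>"""

fields, missing, unexpected = read_person(record)
print(fields["firstname"], fields["surname"])
print(missing, unexpected)
```

Because every value is looked up by its tag, the swapped order does no harm, and both the missing and the unexpected field can be reported by name.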

XML as an extraction format
XML is a markup language and can be used as
– an extraction format for transferring information to an archive
– a format for long-term storage
– an interoperability format
A markup language combines text and extra information (metadata) about the text.

Why XML for extractions
– Self-descriptive
– Interoperability
– Easy to develop further
– Non-proprietary

Markup language
The term "markup language" comes from the process of marking up a manuscript, where symbolic instructions are added to the manuscript and interpreted when printing.
– GenCode, from the 1960s
– Scribe, 1980
– GML
– SGML
– HTML
– XML
– XHTML

XML
Markup language: eXtensible Markup Language
Simplifies some aspects of SGML.
Much more flexible and adaptable than HTML.
Published by the World Wide Web Consortium (W3C).

A 'suite' of technologies
– XML: markup language
– XSD/DTD: defines structure and can be used to validate
– XSLT: used to present and/or change data
– XPath / XQuery: for searching
There are more, but they are out of scope here.
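To give a feel for what searching with XPath looks like, here is a minimal sketch using Python's xml.etree.ElementTree, which supports only a limited subset of XPath. The document content is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for this illustration.
doc = ET.fromstring("""<personList>
  <person><firstname>Hans</firstname><surname>Hansen</surname><gender>male</gender></person>
  <person><firstname>Eli</firstname><surname>Rørvik</surname><gender>female</gender></person>
</personList>""")

# All <surname> elements anywhere in the document.
surnames = [e.text for e in doc.findall(".//surname")]

# Persons whose <gender> child has the text 'female'.
women = [p.findtext("firstname") for p in doc.findall("./person[gender='female']")]

print(surnames, women)
```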

Data / Structure / Presentation
[Figure: the same book seen as data, structure and presentation]
Data: The XML book; Hans Hansen; Introduction; This is a book about XML.; XML elements and attributes; .....
Structure and validation: a book consists of title, author, chapters and paragraphs
Presentation: physical book, eBook, audio book

From an archive perspective
[Figure: the same book, with data and structure only]
Data: The XML book; Hans Hansen; Introduction; This is a book about XML.; XML elements and attributes; .....
Structure and validation: a book consists of title, author, chapters and paragraphs

XML
An XML document is a document that contains data marked up in a particular way.
XML is a meta-language for creating different text markup languages that can be used to describe any text.
It is a tool.
It is used for Noark 4/5 transfers and for electronic archive packages (OAIS/DIAS).
It is an important standard during the records management and long-term preservation phases.
Interoperability, extractions, long-term preservation.

An XML document
Sensible element names make reading and understanding an XML file intuitive.
An XML document consists of a prologue and a root element (which contains all the XML content).
The root element is also called the document element.
It is the first element in the XML document.
Any further elements or text after the root element's end tag are treated as junk.
<root> </root>

An XML document
An XML document consists of
– Prologue
– Document element

An XML Prologue
The prologue consists of
– XML declaration
– Comments
– Blank lines
– Structure validation
– Processing instructions, e.g. style information
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE arkiv SYSTEM "
<?xml-stylesheet href="fonds.css" type="text/css"?>

Document element
The document element is also called the root element, and all content has this root element as a parent.
<documentElement>
  Document element content
</documentElement>

A quick example
Document element: <fonds>
Document element content: <series> and <file>
<fonds>
  <series>
    <file> </file>
  </series>
</fonds>

First practical
We will now create our first XML document, with a declaration and a comment but without the document element.
<?xml version="1.0" encoding="UTF-8"?>

xmlcopyeditor
We use a program called xmlcopyeditor to work with XML.
Can be downloaded from
It is sufficient for an introductory course.
If you work with XML in an archive institution you will use proprietary XML editing software, which offers better scalability and usability.
We want to learn some basic principles, and xmlcopyeditor is fine for that.
Other XML editors:

xmlcopyeditor

Tags and elements
Understanding the concept of tags is paramount to understanding how to build content in XML documents.
A tag is delimited by a < and a >.
Tags normally occur in pairs.
<name> is a start tag.
</name> is an end tag.
The / denotes that it closes an already opened start tag.
<name>Hans</name> is an element.

Elements
Elements are the foundation for marking up data in XML files and consist of
– Start tag
– Content
– End tag
<author>Hans Hansen</author>
Start tag: <author>. Content: Hans Hansen. End tag: </author>.

Elements
The name you choose for the start and end tag should describe the content.
That is why we say that XML is self-descriptive.
When we surround data with start and end tags we are 'marking up' data.
This is one of the reasons why XML is the right format when it comes to preserving data, or why markup languages are useful for preservation.
Suggestions for an element name that can be used to describe a person?
<person></person>
<name></name>
<age></age>

name.xml
Where is the prologue?
What is the root element?
What happens if we add anything after </name>?
<?xml version="1.0" encoding="UTF-8"?>
<name>
  <firstname>Hans</firstname>
  <surname>Hansen</surname>
</name>
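The last question can be tried out with any XML parser. Below is a minimal sketch with Python's xml.etree.ElementTree; the helper name is_well_formed and the extra element appended after </name> are our own.

```python
import xml.etree.ElementTree as ET

valid = """<?xml version="1.0" encoding="UTF-8"?>
<name>
  <firstname>Hans</firstname>
  <surname>Hansen</surname>
</name>"""

# The same document with a second element after the root element's end tag.
with_junk = valid + "\n<name>Another root</name>"


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as one well-formed document."""
    try:
        ET.fromstring(xml_text.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


print(is_well_formed(valid), is_well_formed(with_junk))
```

The parser rejects the second version because a document may have only one document element.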

White spaces in data*
'White space' is a common term for space, tab and return (line break).
The XML parser does nothing with white space in data; it leaves it up to the application to interpret it.
Whitespace handling can be configured in XSD.
<author> Hans Hansen </author>
* handled differently in attributes
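A short sketch with Python's xml.etree.ElementTree shows the division of labour: the parser hands over the text exactly as written, and trimming it is a choice the application makes.

```python
import xml.etree.ElementTree as ET

author = ET.fromstring("<author>  Hans Hansen  </author>")

raw = author.text        # exactly what the parser delivered
cleaned = raw.strip()    # an application-level decision, not done by the parser

print(repr(raw), repr(cleaned))
```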

XML to describe a person
We need to record the following information for a person: firstname, middlename, surname, social security number and gender.
How do we describe this in XML?

person.xml
<?xml version="1.0" encoding="UTF-8"?>
<person>
  <firstname>Hans</firstname>
  <middlename>John</middlename>
  <surname>Hansen</surname>
  <socialSecurityNumber> </socialSecurityNumber>
  <gender>male</gender>
</person>

compared to database ...
How does what we have just looked at compare to information in a database or to ER modelling?
The root element <person> is an entity, so it corresponds to a table name.
The element names (<firstname> etc.) correspond to the attributes/columns of a table.
The content of the elements corresponds to a tuple of data from a table/relation.
It is important to begin to see this parallel with data in a database.
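The parallel can be made concrete with a small sketch that loads a <person> element into an in-memory SQLite table, using only Python's standard library. Typing every column as TEXT and leaving out the social security number (which has no value on the slide) are simplifications made for the illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

person = ET.fromstring("""<person>
  <firstname>Hans</firstname>
  <middlename>John</middlename>
  <surname>Hansen</surname>
  <gender>male</gender>
</person>""")

table = person.tag                           # root element    -> table name
columns = [child.tag for child in person]    # element names   -> columns
values = [child.text for child in person]    # element content -> one tuple

conn = sqlite3.connect(":memory:")
# The identifiers come from our own sample; never build SQL like this from untrusted input.
conn.execute(f"CREATE TABLE {table} ({', '.join(c + ' TEXT' for c in columns)})")
conn.execute(f"INSERT INTO {table} VALUES ({', '.join('?' for _ in columns)})", values)

row = conn.execute(f"SELECT firstname, surname FROM {table}").fetchone()
print(row)
```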

But what I wanted was a list of people
What happens if we try to add another <person></person> in the same file?
Junk after the root element!
The solution is a new root element: <personList></personList>
This is a classic problem beginners face when learning XML.
You really have to plan your XML.
This is important to understand for the assignment and the exam.
But planning is something we do when we work with XSD.

personList.xml
<?xml version="1.0" encoding="UTF-8"?>
<personList>
  <person>
    <firstname>Hans</firstname>
    <middlename>John</middlename>
    <surname>Hansen</surname>
    <socialSecurityNumber> </socialSecurityNumber>
    <gender>male</gender>
  </person>
</personList>
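With <personList> as the document element, a program can read any number of people from one file. A minimal sketch with Python's xml.etree.ElementTree follows; the second person is made up for the illustration.

```python
import xml.etree.ElementTree as ET

person_list = ET.fromstring("""<personList>
  <person>
    <firstname>Hans</firstname>
    <surname>Hansen</surname>
  </person>
  <person>
    <firstname>Eli</firstname>
    <surname>Rørvik</surname>
  </person>
</personList>""")

# Each <person> element delimits one instance, much like a row in a table.
people = [
    {field.tag: field.text for field in person}
    for person in person_list.findall("person")
]
print(len(people), people[1]["surname"])
```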

compared to database ...
Slightly different now: the root element no longer corresponds to the entity but indicates a plural of the entity.
But there is no requirement on element naming.
The <person> element still corresponds to an instance of an entity, but now also acts as a delimiter between multiple instances of the entity.
A table row delimiter, if you like ...

XML to describe a book
Next we want to describe the concept of a book.
Some basic elements we would expect in a book:
– Book name, authors, chapters
A chapter has some elements:
– Chapter title, paragraphs of text
We now begin to see that there is a requirement to be able to define structure.

book.xml
<?xml version="1.0" encoding="UTF-8"?>
<book>
  <author>Hans Hansen</author>
  <bookTitle>The book about XML</bookTitle>
  <chapter>
    <chapterTitle>Introduction</chapterTitle>
    <paragraph></paragraph>
  </chapter>
  <chapter>
    <chapterTitle>XML Root element</chapterTitle>
  </chapter>
</book>

Empty elements
Elements that have no data are considered empty.
<paragraph></paragraph>
can also be written as
<paragraph/>
Empty elements should never be in a Noark 5 extraction.
Requirement: metadata items that do not have a value must be excluded from the extraction.

Empty elements
<?xml version="1.0" encoding="UTF-8"?>
<book>
  <author>Hans Hansen</author>
  <bookTitle>The book about XML</bookTitle>
  <chapter>
    <chapterTitle>Introduction</chapterTitle>
    <paragraph></paragraph>
    <paragraph/>
  </chapter>
  <chapter>
    <chapterTitle>XML Root element</chapterTitle>
  </chapter>
</book>
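An extraction program could strip empty elements before writing the file. The sketch below is one possible approach, not a Noark 5 tool: it treats an element as empty when it has no text, no child elements and no attributes, which is our own working definition.

```python
import xml.etree.ElementTree as ET


def remove_empty_elements(element):
    """Recursively drop children that have no text, no children and no attributes."""
    for child in list(element):
        remove_empty_elements(child)
        has_text = bool((child.text or "").strip())
        if not has_text and len(child) == 0 and not child.attrib:
            element.remove(child)


book = ET.fromstring("""<book>
  <author>Hans Hansen</author>
  <chapter>
    <chapterTitle>Introduction</chapterTitle>
    <paragraph></paragraph>
    <paragraph/>
  </chapter>
</book>""")

remove_empty_elements(book)
remaining = [e.tag for e in book.iter()]
print(remaining)
```

Children are cleaned before their parent is checked, so a parent that ends up with nothing left inside it is removed as well.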

Element types
In XML we have a clear distinction between an element that is a simpleType and one that is a complexType.
– simpleType contains only data
– complexType can contain child elements
<name>
  <firstname>Hans</firstname>
  <surname>Hansen</surname>
</name>
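As a rough illustration, the distinction can be read straight off a parsed document: an element with child elements would need a complexType, one with only data a simpleType. The function below is our own approximation; in XSD an element that carries attributes is also a complexType, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

name = ET.fromstring("""<name>
  <firstname>Hans</firstname>
  <surname>Hansen</surname>
</name>""")


def kind(element):
    """Rough classification: elements with child elements are treated as complex."""
    return "complexType" if len(element) > 0 else "simpleType"


kinds = {e.tag: kind(e) for e in name.iter()}
print(kinds)
```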

Recap
We have introduced XML and looked at
– the XML document: structure and document element, basic rules
– elements: definition and types
We have identified that there is a need for structure, but have not yet defined any particular need.

An XML Prologue
The prologue consists of
– XML declaration
– Comments
– Blank lines
– Structure validation
– Processing instructions, e.g. style information
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE arkiv SYSTEM "
<?xml-stylesheet href="fonds.css" type="text/css"?>

Processing instructions
Processing instructions are not part of the data in an XML document.
They provide instructions on how an (external) application should process the XML document, e.g. convert it to another format.
A processing instruction starts with <? and ends with ?>
An example:
<?xml-stylesheet href="fonds.css" type="text/css"?>
The XML file will be formatted according to the CSS formatting instructions specified in the file fonds.css.
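A parser passes processing instructions on to the application without acting on them. The sketch below uses Python's xml.dom.minidom to list them; the empty <fonds/> document element is an assumption made to keep the example short.

```python
from xml.dom import Node, minidom

source = """<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="fonds.css" type="text/css"?>
<fonds/>"""

document = minidom.parseString(source.encode("utf-8"))

# Collect (target, data) for every processing instruction in the prologue.
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
print(document.documentElement.tagName)
```

The XML declaration on the first line looks similar but is not reported as a processing instruction; only the xml-stylesheet line is.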

XSLT specifies how an XML document will be processed to create a new document.
This needs an XSLT processor (e.g. Firefox).
xml + xslt -> XSLT processor -> new document

Process fonds.xml
Download the files fonds.xml, fonds.xslt, fonds.css from
CSS and XSLT are not part of this course; this is just to show you what they do.

Currently the RM/archive profession is not that concerned about this, but it will be at some stage.
But when an archive uses XML, is it to process it or to preserve it?
My impression is that the field is very dependent on RDBMS for processing data.

XML Criticisms
Unnecessary! Too much markup data.
The file size is larger than necessary.
This could be a problem where bandwidth is limited or Internet access is expensive.
<person> <firstname>Hans</firstname> <surname>Hansen</surname> <alder>45</alder> </person>
95 characters
(person firstname(Hans) surname(Hansen) alder(45) )
61 characters

Advantages
– Markup is plain text and human readable
– Data is separated from presentation
– Independent of system, software and hardware
– Non-proprietary
– (Relatively) easy to implement solutions on top of XML
– Easy to import or export data to a database using XML

XML as an archive format
XML used in preservation
– Transfer to an archive institution for Noark 5
– The DIAS standard (national archive package format) is based on XML
XML used during recordkeeping
– Export / import data to a Noark system
– BEST standard for exchange of case documents
