1
Digital recordkeeping and preservation II
Introduction to XML ARK2200 Digital recordkeeping and preservation II 2017 Thomas Sødring P48-R407
2
This session
We will develop four different XML files
name.xml
person.xml
personList.xml
book.xml
Everyone should have these four files in a directory by the end of the week
3
The extraction process
Records management: variable, 10 years
Extraction: approx. 5-10 hours
Long-term preservation: forever
4
How could we extract data?
Data is stored in tables in the database
It needs to be extracted and stored in a 'neutral' format
How can we store data taken out of the database?
Fixed-width file
Comma separated values
Markup languages
5
Fixed-width file
1-8:ID 9-17:Firstname 18-28:Surname 29-41:Address 42-44:House number 45-48:Zip code 49-58:Town
10002__Pål_____Solberg___Storgata_______40182Oslo
10002__Thomas__Hansen____Bakken_______ Ski
10003__Eli_____Rørvik____Saturnringen__471808Askim
10004__Børre___Andersen__Bekkefaret_____50348Oslo
6
Fixed-width problems
If the data is updated so that the width boundaries don't match any more, then programs reading fixed-width files will have a problem
7
Fixed-width file with problem
1-8:ID 9-17:Firstname 18-28:Surname 29-41:Address 42-44:House number 45-48:Zip code 49-58:Town
10002__Pål_____Solberg___Storgata_______40182Oslo
10002__Thomas__Hansen____Bakken_______108b1406Ski
10003__Eli_____Rørvik____Saturnringen__471808Askim
10004__Børre___Andersen__Bekkefaret_____50348Oslo
8
Fixed-width problems
There is a dependency between the data, the data boundaries and the program reading the data
This could perhaps be dealt with by a prolog at the top of the file describing the widths, or perhaps by a version number for the structure
The dependency is what causes the problems
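The dependency between data, boundaries and reading program can be sketched in a few lines of Python. This is only an illustration: the layout below is hypothetical (it is not the column layout from the slide), and the record with the long first name is made up to show what happens when a value outgrows its field.

```python
# Hypothetical layout: (field name, start, end) as 0-based slice offsets.
LAYOUT = [("id", 0, 7), ("firstname", 7, 15), ("surname", 15, 25), ("town", 25, 35)]


def format_fixed_width(record, layout=LAYOUT):
    """Pad every value to its field width. Values are NOT truncated,
    so an over-long value pushes all following fields to the right."""
    return "".join(str(record[name]).ljust(end - start) for name, start, end in layout)


def parse_fixed_width(line, layout=LAYOUT):
    """Cut the line at the agreed boundaries and strip the padding."""
    return {name: line[start:end].strip() for name, start, end in layout}


# A record that fits the layout survives the round trip.
ok = {"id": "10002", "firstname": "Thomas", "surname": "Hansen", "town": "Ski"}
ok_parsed = parse_fixed_width(format_fixed_width(ok))

# A first name longer than 8 characters silently corrupts the next field.
too_long = {"id": "10003", "firstname": "Maximilian", "surname": "Hansen", "town": "Ski"}
bad_parsed = parse_fixed_width(format_fixed_width(too_long))
```

The reader never raises an error here. It just returns wrong values, which is why this kind of problem is hard to spot in bulk processing.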
9
Comma separated values
Comma separated values, CSV, is a way of delimiting data
Normally uses a comma as the delimiter
Each row from the database corresponds to a row in the CSV file
Each field corresponds to a given column
10002, Pål, Solberg, Storgata, 4, 0182, Oslo
10
CSV
Each line represents a person
– ID, firstname, surname, address, house number, zip code, town
– start
10002, Pål, Solberg, Storgata, 4, 0182, Oslo
10002, Thomas, Hansen, Bakken, 8b, 1406, Ski
10003, Eli, Rørvik, Saturnringen, 47, 1808, Askim
10004, Børre, Andersen, Bekkefaret, 5, 0348, Oslo
– stop
11
CSV limitations
Very dependent on the ordering of the fields
What happens if a comma appears in the data?
Typically there is both a field delimiter and a data delimiter: , between fields and “” around the data within a field
A comma that appears inside the quoted data is then not treated as a field delimiter
“10002”, “Pål”, “Solberg”, “Storgata”, “4”, “0182”, “Oslo”
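A small sketch of the field delimiter / data delimiter idea, using Python's standard csv module. The row is made up for illustration (the address with a comma in it is an assumption, not data from the slides), and it uses plain double quotes with no space after the commas.

```python
import csv
import io

# Hypothetical row where the address field itself contains a comma.
raw = '"10002","Pål","Solberg","Storgata 4, 2nd floor","0182","Oslo"\n'

# Splitting on every comma ignores the quotes and cuts the address in two.
naive = raw.strip().split(",")

# csv.reader treats the double quotes as data delimiters, so the comma
# inside the address is kept as part of the field.
parsed = next(csv.reader(io.StringIO(raw)))
```

Here naive ends up with seven pieces while parsed has the expected six fields.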
12
Comma separated with errors
It may be difficult to detect mistakes and missing information in the file
Especially when large files are processed in bulk
What happens if a field is missing?
Each line represents a person
– ID, firstname, surname, address, house number, zip code, town
– start
10002, Solberg, Pål, Storgata, 4, 0182, Oslo
10002, Thomas, Bakken, 8b, 1406, Ski
10003, Eli, Rørvik, Saturnringen, 47, 1808, Askim
10004, Børre, Andersen, 5, 0348, Oslo
– stop
First and last name mixed
Address becomes surname
Address becomes house number
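As a rough sketch of how little an import program can check, the Python below counts the fields on each line of the error example. The function name find_short_rows and the list of field names are chosen here for illustration.

```python
import csv
import io

FIELDS = ["id", "firstname", "surname", "address", "houseNumber", "zipCode", "town"]

data = """10002, Solberg, Pål, Storgata, 4, 0182, Oslo
10002, Thomas, Bakken, 8b, 1406, Ski
10003, Eli, Rørvik, Saturnringen, 47, 1808, Askim
10004, Børre, Andersen, 5, 0348, Oslo
"""


def find_short_rows(text, expected=len(FIELDS)):
    """Return the 1-based line numbers whose field count differs from expected."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [line_no for line_no, row in enumerate(reader, start=1) if len(row) != expected]


suspect_lines = find_short_rows(data)
```

Counting fields flags the two lines with a missing field, but it cannot say which field is missing, and it does not notice the first line at all, where first name and surname have only swapped places.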
13
It's about interoperability
The two examples with errors are weak, although they are valid Fixed width and CSV work perfectly well in a controlled environment When you need to exchange information between systems it can be difficult to control the quality of the files CSV and fixed-width quickly fail Difficult to handle versions of files
14
XML
If only there were a way to tag the data so that we knew what each field meant
XML
15
Like this?
<id> 10002 </id>
<surname> Solberg </surname>
<address> Storgata </address>
<firstname> Pål </firstname>
<houseNumber> 4 </houseNumber>
<zipCode> 0182 </zipCode>
<town> Oslo </town>
16
XML
Suddenly ordering is irrelevant and it's easy to discover if we are missing any fields
Additional fields can be added, and the import program can easily ignore them or report them as errors
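To make this concrete, here is a minimal sketch with Python's standard xml.etree.ElementTree. The wrapping person element is an assumption added so that the fragment has a single root; the slide shows the fields on their own. REQUIRED and read_person are names made up for this example.

```python
import xml.etree.ElementTree as ET

REQUIRED = ["id", "firstname", "surname", "address", "houseNumber", "zipCode", "town"]

# Same fields as on the slide, in the same "wrong" order.
doc = """<person>
  <id>10002</id>
  <surname>Solberg</surname>
  <address>Storgata</address>
  <firstname>Pål</firstname>
  <houseNumber>4</houseNumber>
  <zipCode>0182</zipCode>
  <town>Oslo</town>
</person>"""


def read_person(xml_text):
    """Look fields up by tag name and report which required ones are absent."""
    root = ET.fromstring(xml_text)
    record = {child.tag: (child.text or "").strip() for child in root}
    missing = [name for name in REQUIRED if name not in record]
    return record, missing


record, missing = read_person(doc)
record_no_town, missing_no_town = read_person(doc.replace("<town>Oslo</town>", ""))
```

Because every value is looked up by its tag, the order in the file does not matter, and a missing field shows up by name rather than as a shifted column.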
17
XML as an extraction format
XML is a markup language and can be used as
an extraction format for transferring information to an archive
a format for long-term storage
an interoperability format
A markup language combines text and extra information (metadata) about the text
18
Why XML for extractions
Self-descriptive
Interoperability
Easy to further develop
Non-proprietary
19
Markup language
The term "markup language" comes from the process of marking up a manuscript, where symbolic instructions are added to the manuscript and interpreted when printing
GenCode from the 1960s
Scribe 1980
GML
SGML
HTML
XML
XHTML
20
XML
Markup language: eXtensible Markup Language
Simplifies some aspects of SGML
Much more flexible and adaptable than HTML
Published by the W3C (World Wide Web Consortium)
21
A 'suite' of technologies
XML: markup language
XSD/DTD: defines structure and can be used to validate
XSLT: used to present and/or change data
XPath / XQuery: for searching
There are more, but they are out of scope here
22
Data / Structure / Presentation
Data: The XML book, Hans Hansen, Introduction, This is a book about XML., XML elements and attributes .....
Structure and validation: A book consists of title, author, chapters and paragraphs
Presentation: physical book, eBook, audio book
23
From an archive perspective
Data: The XML book, Hans Hansen, Introduction, This is a book about XML., XML elements and attributes .....
Structure and validation: A book consists of title, author, chapters and paragraphs
24
XML
An XML document is a document that contains data that is marked up in a particular way
XML is a meta language for creating different text markup languages that can be used to describe any text
It is a tool
Used for Noark 4/5 transfers and for electronic archive packages (OAIS/DIAS)
It is an important standard during the records management and long-term preservation phases
Interoperability, extractions, long-term preservation
25
An XML document
Sensible element names make reading and understanding an XML file intuitive
An XML document consists of a prologue and a root element (with all the XML inside it)
The root element is also called the document element
It is the first element in the XML document
Any element or text after the root element's end tag is deemed junk (only comments, processing instructions and white space may follow it)
<root> </root>
26
An XML document
An XML document consists of
Prologue
Document element
27
An XML Prologue
The prologue consists of
XML declaration
Comments
Blank lines
Structure validation
Processing instructions, e.g. style information
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is an example comment-->
<!DOCTYPE arkiv SYSTEM "
<?xml-stylesheet href="fonds.css" type="text/css"?>
28
Document element
The document element is also called the root element, and all content has this root element as a parent
<documentElement>
  Document element content
</documentElement>
29
A quick example
<fonds>
  <series>
    <file> </file>
  </series>
</fonds>
Document element content: everything between <fonds> and </fonds>
30
First practical
We will now create our first XML document with a declaration and a comment, but without the document element
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment -->
31
xmlcopyeditor
We use a program called xmlcopyeditor to work with XML
Can be downloaded from
Sufficient for an introductory course
If you work with XML in an archive institution you will use proprietary XML editing software
These are more scalable and usable
We want to learn some basic principles and xmlcopyeditor is fine for that
Other XML editors :
32
xmlcopyeditor
33
Tags and elements
Understanding the concept of tags is paramount to understanding how to build content in XML documents
A tag is delimited by a < and a >
Tags normally occur in pairs
<name> is a start tag
</name> is an end tag
The / denotes that it closes an already defined start tag
<name>Hans</name> is an element
34
Elements
Elements are the foundation for marking up data in XML files and consist of
Start tag
Content
End tag
<author>Hans Hansen</author>
Start tag: <author>
Content: Hans Hansen
End tag: </author>
35
Elements
The name you choose for the start and end tag should describe the content
That is why we say that XML is self-descriptive
When we surround data with start and end tags we are 'marking up' data
This is one of the reasons why XML is the right format when it comes to preserving data
Or why markup languages are useful for preservation
Suggestions for an element name that can be used to describe a person?
<person></person>
<name></name>
<age></age>
36
name.xml
Where is the prologue?
What is the root element?
What happens if we add anything after </name>?
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment -->
<name>
  <firstname>Hans</firstname>
  <surname>Hansen</surname>
</name>
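One way to try out the last question without an editor is to hand the file content to a parser. The sketch below uses Python's standard xml.etree.ElementTree; is_well_formed is a helper name made up here, and the extra element appended at the end is just an example of "anything".

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as one XML document."""
    try:
        ET.fromstring(xml_text.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


name_xml = """<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment -->
<name>
  <firstname>Hans</firstname>
  <surname>Hansen</surname>
</name>"""

accepted = is_well_formed(name_xml)
# Anything element-like after </name> is rejected by the parser.
rejected = is_well_formed(name_xml + "\n<extra>oops</extra>")
```

The first call succeeds and the second fails, because a document can only have one root element.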
37
White spaces in data*
'White space' is a common term for space, tab and return
The XML parser does nothing with white space in data; rather, it leaves it up to the application to interpret it
Whitespace handling can be configured in XSD
<author> Hans Hansen </author>
* handled differently in attributes
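A short sketch of what "leaves it up to the application" means in practice, again with Python's xml.etree.ElementTree. The extra spaces and the line break in the example value are made up, and collapsing them into single spaces is just one possible choice an application could make.

```python
import xml.etree.ElementTree as ET

author = ET.fromstring("<author>  Hans   Hansen\n</author>")

# The parser hands over the content exactly as written in the file.
raw_text = author.text

# It is the application that decides what to do with it, for example
# collapsing every run of white space into a single space.
normalised = " ".join(raw_text.split())
```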
38
XML to describe a person
We need to record the following information for a person: firstname, middlename, surname, social security number and gender
How do we describe this in XML?
39
person.xml
<?xml version="1.0" encoding="UTF-8"?>
<person>
  <firstname>Hans</firstname>
  <middlename>John</middlename>
  <surname>Hansen</surname>
  <socialSecurityNumber> </socialSecurityNumber>
  <gender>male</gender>
</person>
40
compared to database ...
How does what we have just looked at compare to information in a database or to ER modelling?
The root element <person> is an entity, so it corresponds to a table name
The element names <firstname> etc. correspond to the attributes/columns of a table
The content of the elements corresponds to a tuple of data from a table/relation
It is important to begin to see this parallel with data in a database
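The parallel can be tried out directly. The sketch below reads a person element and stores it as one row in an in-memory SQLite table, using only Python's standard library. It is an illustration of the mapping, not a recommended import routine: socialSecurityNumber is left out because it has no value in the example, and building SQL from element names like this is only acceptable for trusted input.

```python
import sqlite3
import xml.etree.ElementTree as ET

person_xml = """<person>
  <firstname>Hans</firstname>
  <middlename>John</middlename>
  <surname>Hansen</surname>
  <gender>male</gender>
</person>"""

root = ET.fromstring(person_xml)
table = root.tag                          # root element   -> table name
columns = [child.tag for child in root]   # element names  -> columns
values = [child.text for child in root]   # element content -> one tuple

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
placeholders = ", ".join("?" for _ in columns)
conn.execute(f"INSERT INTO {table} VALUES ({placeholders})", values)

row = conn.execute(f"SELECT firstname, surname FROM {table}").fetchone()
```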
41
But what I wanted was a list of people
What happens if we try to add another <person></person> in the same file?
Junk after the root element!
<personList></personList>
This is a classic problem beginners face when learning XML
You really have to plan your XML
Important to understand for the assignment and the exam
But planning is something we do when we work with XSD
42
personList.xml
<?xml version="1.0" encoding="UTF-8"?>
<personList>
  <person>
    <firstname>Hans</firstname>
    <middlename>John</middlename>
    <surname>Hansen</surname>
    <socialSecurityNumber> </socialSecurityNumber>
    <gender>male</gender>
  </person>
</personList>
43
compared to database ...
Slightly different now: the root element no longer corresponds to the entity but indicates a plural of the entity
But there is no requirement on element naming
The <person> element still corresponds to an instance of an entity, but now acts as a delimiter between multiple instances of the entity
A table row delimiter, if you like ...
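Seen from a program, the "row delimiter" idea looks roughly like this. The sketch uses Python's xml.etree.ElementTree on a shortened list; the second person is made up for the example (the name is borrowed from the CSV slides) and only two fields per person are included.

```python
import xml.etree.ElementTree as ET

person_list_xml = """<personList>
  <person><firstname>Hans</firstname><surname>Hansen</surname></person>
  <person><firstname>Eli</firstname><surname>Rørvik</surname></person>
</personList>"""

root = ET.fromstring(person_list_xml)

# Each <person> element delimits one row; its children are the columns.
rows = [
    {field.tag: field.text for field in person}
    for person in root.findall("person")
]
```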
44
XML to describe a book
Next we want to describe the concept of a book
Some basic elements we would expect in a book
Book name, authors, chapters
A chapter has some elements
Chapter title, paragraphs of text
We now begin to see that there is a requirement to be able to define structure
45
book.xml
<?xml version="1.0" encoding="UTF-8"?>
<book>
  <author>Hans Hansen</author>
  <bookTitle>The book about XML</bookTitle>
  <chapter>
    <chapterTitle>Introduction</chapterTitle>
    <paragraph></paragraph>
  </chapter>
  <chapterTitle>XML Root element</chapterTitle>
</book>
46
Empty elements
Elements that have no data are considered empty
<paragraph></paragraph>
And can be written as
<paragraph/>
Empty elements should never be in a Noark 5 extraction
Requirement: metadata items that do not have a value must be excluded from the extraction
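As a rough sketch of how an extraction program could honour that requirement, the Python below removes empty elements before the XML is written out. prune_empty is a made-up helper, and two choices in it are assumptions of this example rather than anything Noark 5 prescribes here: elements containing only white space are treated as empty, and elements that carry attributes are kept.

```python
import xml.etree.ElementTree as ET


def prune_empty(element):
    """Remove descendants that have no text, no children and no attributes."""
    for child in list(element):
        prune_empty(child)  # clean up the child's own subtree first
        has_text = child.text is not None and child.text.strip() != ""
        if not has_text and len(child) == 0 and not child.attrib:
            element.remove(child)


book = ET.fromstring(
    "<book><bookTitle>The book about XML</bookTitle>"
    "<chapter><chapterTitle>Introduction</chapterTitle>"
    "<paragraph></paragraph><paragraph/></chapter></book>"
)
prune_empty(book)
result = ET.tostring(book, encoding="unicode")
```

Both spellings of the empty paragraph are removed, since the parser does not distinguish between them.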
47
Empty elements
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment -->
<book>
  <author>Hans Hansen</author>
  <bookTitle>The book about XML</bookTitle>
  <chapter>
    <chapterTitle>Introduction</chapterTitle>
    <paragraph></paragraph>
    <paragraph/>
  </chapter>
  <chapterTitle>XML Root element</chapterTitle>
</book>
48
Element types
In XML we have a clear distinction between an element that is a simpleType and one that is a complexType
simpleType contains only data
complexType can contain child elements
<name>
  <firstname>Hans</firstname>
  <surname>Hansen</surname>
</name>
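The distinction can be illustrated by looking at which elements have children. This is a simplified sketch with Python's xml.etree.ElementTree: classify is a made-up helper that only checks for child elements, whereas in XSD an element that carries attributes is also a complexType.

```python
import xml.etree.ElementTree as ET


def classify(element):
    """Simplified rule: child elements mean complexType, otherwise simpleType."""
    return "complexType" if len(element) > 0 else "simpleType"


name = ET.fromstring(
    "<name><firstname>Hans</firstname><surname>Hansen</surname></name>"
)

# iter() visits the element itself and then all of its descendants.
kinds = {element.tag: classify(element) for element in name.iter()}
```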
49
Recap
We have introduced XML and looked at
XML document: structure and document element, basic rules
Elements: definition and types
We have identified that there is a need for structure, but have not yet defined any particular need
50
An XML Prologue
The prologue consists of
XML declaration
Comments
Blank lines
Structure validation
Processing instructions, e.g. style information
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is an example comment-->
<!DOCTYPE arkiv SYSTEM "
<?xml-stylesheet href="fonds.css" type="text/css"?>
51
Processing instructions
Processing instructions are not part of the data in an XML document
They provide instructions on how an (external) application should process the XML document, e.g. convert it to another format
Starts with <? and ends with ?>
An example
<?xml-stylesheet href="fonds.css" type="text/css"?>
The XML file will be formatted according to the CSS formatting instructions specified in the file fonds.css
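How a program gets hold of a processing instruction can be sketched with Python's standard xml.dom.minidom. The document here is a minimal stand-in with an empty fonds element, not the real fonds.xml from the course.

```python
from xml.dom import minidom

doc = minidom.parseString(
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<?xml-stylesheet href="fonds.css" type="text/css"?>'
    b'<fonds/>'
)

# Processing instructions in the prologue are children of the document
# node, next to the root element. Each has a target and a data string.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

root_name = doc.documentElement.tagName
```

The parser only passes the instruction on. Whether anything is done with fonds.css is entirely up to the application that reads the file.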
52
Processing instructions
XSLT specifies how an XML document will be processed to create a new document
Needs an XSLT processor (e.g. Firefox)
xml + xslt → XSLT processor → new document
53
Process fonds.xml
Download the files fonds.xml, fonds.xslt, fonds.css from
CSS and XSLT are not part of this course; this is just to show you what they do
54
Processing instructions
Currently the RM/archive profession is not that concerned about this, but it will be at some stage
But when an archive uses XML, is it to process it or to preserve it?
My impression is that the field is very dependent on RDBMS for processing data
55
XML Criticisms
Unnecessary!
Too much markup data
The file size is larger than necessary
Could be a problem where bandwidth is limited or Internet access is expensive
<person> <firstname>Hans</firstname> <surname>Hansen</surname> <alder>45</alder> </person>
95 characters
(person firstname(Hans) surname(Hansen) alder(45) )
61 characters
56
Advantages
Markup is plain text and human readable
Data is separated from presentation
Independent of system, software and hardware
Non-proprietary
(Relatively) easy to implement solutions on top of XML
Easy to import or export data to a database using XML
57
XML as an archive format
XML used in preservation
Transfer to an archive institution for Noark 5
The DIAS standard (national archive package format) is based on XML
XML used during recordkeeping
Export / import of data to a Noark system
BEST standard for exchange of case documents