Digital recordkeeping and preservation II

Digital recordkeeping and preservation II Introduction to XML ARK2200 Digital recordkeeping and preservation II 2017 Thomas Sødring thomas.sodring@hioa.no P48-R407 67238287

This session
We will develop four different XML files:
name.xml
person.xml
personList.xml
book.xml
Everyone should have these four files in a directory by the end of the week

The extraction process
Records management (variable, 10 years) → Extraction (approx. 5-10 hours) → Long-term preservation (forever)

How could we extract data?
Data is stored in tables in the database
It needs to be extracted and stored in a 'neutral' format
How can we store data taken out of the database?
Fixed-width file
Comma separated values
Markup languages

Fixed-width file
1-8:ID 9-17:Firstname 18-28:Surname 29-41:Address 42-44:House number 45-48:Zip code 49-58:Town
10002__Pål_____Solberg___Storgata_______40182Oslo
10002__Thomas__Hansen____Bakken_______1081406Ski
10003__Eli_____Rørvik____Saturnringen__471808Askim
10004__Børre___Andersen__Bekkefaret_____50348Oslo

Fixed-width problems
If the data is updated so that the width boundaries don't match any more, then programs reading fixed-width files will have a problem

Fixed-width file with problem
1-8:ID 9-17:Firstname 18-28:Surname 29-41:Address 42-44:House number 45-48:Zip code 49-58:Town
10002__Pål_____Solberg___Storgata_______40182Oslo
10002__Thomas__Hansen____Bakken_______108b1406Ski
10003__Eli_____Rørvik____Saturnringen__471808Askim
10004__Børre___Andersen__Bekkefaret_____50348Oslo
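The shift caused by the house number 108b can be reproduced in a few lines. This is a minimal Python sketch, assuming the column positions listed on the slide; the helper names format_record and parse_record are made up for illustration, and the records are padded with spaces rather than underscores.

```python
# Column layout from the slide as (name, start, end), 1-based and inclusive.
LAYOUT = [
    ("id", 1, 8),
    ("firstname", 9, 17),
    ("surname", 18, 28),
    ("address", 29, 41),
    ("house_number", 42, 44),
    ("zip_code", 45, 48),
    ("town", 49, 58),
]


def format_record(values):
    """Pad each value to its column width; an over-long value is NOT truncated."""
    parts = []
    for (_name, start, end), value in zip(LAYOUT, values):
        parts.append(value.ljust(end - start + 1))
    return "".join(parts)


def parse_record(line):
    """Cut the line at the fixed positions, exactly as a naive reader would."""
    return {name: line[start - 1:end].strip() for name, start, end in LAYOUT}


good = format_record(["10002", "Thomas", "Hansen", "Bakken", "108", "1406", "Ski"])
bad = format_record(["10002", "Thomas", "Hansen", "Bakken", "108b", "1406", "Ski"])

print(parse_record(good))  # zip code and town are read correctly
print(parse_record(bad))   # the extra 'b' pushes zip code and town one position right
```

The reader never notices that anything is wrong: it still returns a value for every field, which is why this kind of error is hard to detect.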

Fixed-width problems
There is a dependency between data, data boundaries and the program reading the data
This could perhaps be dealt with by a prologue at the top of the file describing the widths, or perhaps by a version number for the structure
The dependency is causing problems

Comma separated values
Comma separated values, CSV, is a way of delimiting data
Normally uses a comma as a delimiter
Each row from the database corresponds to a row in the CSV file
Each field corresponds to a given column
10002, Pål, Solberg, Storgata, 4, 0182, Oslo

CSV
Each line represents a person
– ID, firstname, surname, address, house number, zip code, town
– start
10002, Pål, Solberg, Storgata, 4, 0182, Oslo
10002, Thomas, Hansen, Bakken, 8b, 1406, Ski
10003, Eli, Rørvik, Saturnringen, 47, 1808, Askim
10004, Børre, Andersen, Bekkefaret, 5, 0348, Oslo
– stop

CSV limitations
Very dependent on the ordering of fields
What happens if a comma appears in data?
Typically we have both a field delimiter and a data delimiter: , between fields and "" around the data within a field
A comma that appears inside the quoted data is then not treated as a field delimiter
"10002", "Pål", "Solberg", "Storgata", "4", "0182", "Oslo"
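The effect of the data delimiter can be tried with Python's standard csv module. This is a small sketch; the address value containing a comma is invented for the illustration and is not part of the original data, and the two function names are hypothetical.

```python
import csv
import io

# Hypothetical record: the address contains a comma (not in the original data set).
LINE = '"10002", "Pål", "Solberg", "Storgata, oppgang B", "4", "0182", "Oslo"\n'


def split_naively(line):
    """Split on every comma, ignoring the quotes."""
    return [part.strip() for part in line.strip().split(",")]


def split_with_csv(line):
    """Let the csv module honour the quotes around each field."""
    # skipinitialspace=True ignores the blank after each comma,
    # so the quote that follows is recognised as the start of a quoted field.
    reader = csv.reader(io.StringIO(line), skipinitialspace=True)
    return next(reader)


print(len(split_naively(LINE)))   # one field too many
print(split_with_csv(LINE))       # seven fields, address kept in one piece
```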

Comma separated with errors
It may be difficult to detect mistakes and missing information in the file
Especially when large files are processed in bulk
What happens if a field is missing?
Each line represents a person
– ID, firstname, surname, address, house number, zip code, town
– start
10002, Solberg, Pål, Storgata, 4, 0182, Oslo (first and last name mixed up)
10002, Thomas, Bakken, 8b, 1406, Ski (surname missing, so the address is read as the surname)
10003, Eli, Rørvik, Saturnringen, 47, 1808, Askim
10004, Børre, Andersen, 5, 0348, Oslo (address missing, so the house number is read as the address)
– stop
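A simple check shows which of these mistakes a program can catch. The sketch below, using Python's csv module, only counts the fields on each line; the function name and the HEADER list are assumptions made for the example.

```python
import csv
import io

HEADER = ["id", "firstname", "surname", "address", "house_number", "zip_code", "town"]

# The four data lines from the slide, errors included.
DATA = """10002, Solberg, Pål, Storgata, 4, 0182, Oslo
10002, Thomas, Bakken, 8b, 1406, Ski
10003, Eli, Rørvik, Saturnringen, 47, 1808, Askim
10004, Børre, Andersen, 5, 0348, Oslo
"""


def lines_with_wrong_field_count(text):
    """Return the 1-based numbers of lines whose field count differs from the header."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [number for number, row in enumerate(reader, start=1)
            if len(row) != len(HEADER)]


print(lines_with_wrong_field_count(DATA))
```

Lines 2 and 4 are flagged because a field is missing, but the check cannot say which field. Line 1, where first name and surname are swapped, passes unnoticed because the number of fields is correct.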

It's about interoperability
The two examples with errors are simple, but the point they make is valid
Fixed width and CSV work perfectly well in a controlled environment
When you need to exchange information between systems it can be difficult to control the quality of the files
CSV and fixed-width quickly fail
It is difficult to handle versions of files

XML If only there were a way to tag the data so that we knew what each field meant XML

Like this?
<id> 10002 </id>
<surname> Solberg </surname>
<address> Storgata </address>
<firstname> Pål </firstname>
<houseNumber> 4 </houseNumber>
<zipCode> 0182 </zipCode>
<town> Oslo </town>

XML
Suddenly ordering is irrelevant and it's easy to discover if we are missing any fields
Additional fields can be added, and the import program can either ignore them or report them as errors
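Both claims can be checked with a short Python sketch using the standard xml.etree.ElementTree module. The wrapping <person> element, the EXPECTED set and the function name read_person are assumptions added for the example; the slide shows only the tagged fields.

```python
import xml.etree.ElementTree as ET

# Assumed wrapper: a <person> root element is added so the fields form one document.
RECORD = """<person>
  <id>10002</id>
  <surname>Solberg</surname>
  <address>Storgata</address>
  <firstname>Pål</firstname>
  <houseNumber>4</houseNumber>
  <zipCode>0182</zipCode>
  <town>Oslo</town>
</person>"""

EXPECTED = {"id", "firstname", "surname", "address", "houseNumber", "zipCode", "town"}


def read_person(xml_text):
    """Return (fields, missing, unexpected) for one tagged record."""
    root = ET.fromstring(xml_text)
    # Fields are looked up by tag name, so their order in the file does not matter.
    fields = {child.tag: (child.text or "").strip() for child in root}
    missing = EXPECTED - set(fields)
    unexpected = set(fields) - EXPECTED
    return fields, missing, unexpected


fields, missing, unexpected = read_person(RECORD)
print(fields["firstname"], fields["surname"], missing, unexpected)
```

Unlike the CSV check, a missing field is reported by name, and an extra field can be ignored or reported without disturbing the others.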

XML as an extraction format
XML is a markup language and can be used as
an extraction format for transferring information to an archive
a format for long-term storage
an interoperability format
A markup language combines text and extra information (metadata) about the text

Why XML for extractions Self descriptive Interoperability Easy to further develop Non proprietary

Markup language
The term "markup language" comes from the process of marking up a manuscript, where symbolic instructions are added to a manuscript and interpreted when printing
GenCode (1960s), Scribe (1980), GML, SGML, HTML, XML, XHTML

XML
Markup language
eXtensible Markup Language
Simplifies some aspects of SGML
Much more flexible and adaptable than HTML
Published by the World Wide Web Consortium (W3C), http://www.w3.org/

A 'suite' of technologies
XML: markup language
XSD/DTD: defines structure and can be used to validate
XSLT: used to present and/or change data
XPath / XQuery: for searching
There are more, but they are out of scope here

Data / Structure / Presentation
Data: The XML book, Hans Hansen, Introduction, This is a book about XML. XML elements and attributes .....
Structure and validation: A book consists of title, author, chapters and paragraphs
Presentation: Physical book, eBook, Audio book

From an archive perspective
Data: The XML book, Hans Hansen, Introduction, This is a book about XML. XML elements and attributes .....
Structure and validation: A book consists of title, author, chapters and paragraphs

XML
An XML document is a document that contains data that is marked up in a particular way
XML is a meta language for creating different text markup languages used to describe any text
It is a tool
Used for Noark 4/5 transfer and for electronic archive packages (OAIS/DIAS)
It is an important standard during the records management and long-term preservation phases
Interoperability, extractions, long-term preservation

An XML document
Sensible element names make reading and understanding an XML file intuitive
An XML document consists of a prologue and a root element (which contains all the XML content)
The root element is also called the document element
It is the first element in the XML document
Any element or text after the root element's end tag is deemed trash and is reported as an error by the parser
<root> </root>

An XML document An XML document consists of Prologue Document element

An XML Prologue
The prologue consists of
XML declaration
Comments
Blank lines
Structure validation (document type declaration)
Processing instructions, e.g. style information
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is an example comment-->
<!DOCTYPE arkiv SYSTEM "http://www.kdrs.no/dtd/fonds.dtd">
<?xml-stylesheet href="fonds.css" type="text/css"?>

Document element
The document element is also called the root element, and all content has this root element as a parent
<documentElement>
  Document element content
</documentElement>

A quick example
<fonds>
  <series>
    <file> </file>
  </series>
</fonds>
Here <series> and <file> are the document element content of <fonds>

First practical
We will now create our first XML document with a declaration and a comment, but not the document element
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment -->

xmlcopyeditor
We use a program called xmlcopyeditor to work with XML
Can be downloaded from http://xml-copy-editor.sourceforge.net/
Sufficient for an introductory course
If you work with XML in an archive institution you will use proprietary XML editing software
It has more scalability and usability
We want to learn some basic principles and xmlcopyeditor is fine for that
Other XML editors: http://alekdavis.blogspot.com/2009_06_01_archive.html

Tags and elements
Understanding the concept of tags is paramount to understanding how to build content in XML documents
A tag is delimited by a < and a >
Tags normally occur in pairs
<name> is a start tag
</name> is an end tag
The / denotes that it closes an already defined start tag
<name>Hans</name> is an element

Elements
Elements are the foundation for marking up data in XML files and consist of
Start tag
Content
End tag
In <author>Hans Hansen</author>, <author> is the start tag, Hans Hansen is the content and </author> is the end tag

Elements
The name you choose for the start and end tag should describe the content
That is why we say that XML is self-descriptive
When we surround data with start and end tags we are 'marking up' data
This is one of the reasons why XML is the right format when it comes to preserving data
Or why markup languages are useful for preservation
Suggestions for element names that can be used to describe a person?
<person></person>
<name></name>
<age></age>

name.xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment -->
<name>
  <firstname>Hans</firstname>
  <surname>Hansen</surname>
</name>
Where is the prologue?
What is the root element?
What happens if we add anything after </name>?
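The last two questions can also be explored outside the editor. This is a minimal Python sketch with the standard xml.etree.ElementTree parser; the function names root_tag and is_well_formed are made up for the example, and xmlcopyeditor may word its error messages differently.

```python
import xml.etree.ElementTree as ET

NAME_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment -->
<name>
  <firstname>Hans</firstname>
  <surname>Hansen</surname>
</name>
"""


def root_tag(document):
    """Parse the document and return the name of its root (document) element."""
    return ET.fromstring(document).tag


def is_well_formed(document):
    """Return True if the parser accepts the document, False if it reports an error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print(root_tag(NAME_XML))
print(is_well_formed(NAME_XML))
# A second element after </name> is junk after the document element.
print(is_well_formed(NAME_XML + b"<name>Extra</name>"))
# A comment after </name> is still accepted by the parser.
print(is_well_formed(NAME_XML + b"<!-- trailing comment -->"))
```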

White space in data*
'White space' is a common term for space, tab and return
The XML parser does nothing with white space in data; rather, it leaves it up to the application to interpret it
Whitespace handling can be configured in XSD
<author> Hans Hansen </author>
* handled differently in attributes
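A short Python sketch shows the parser handing the white space over untouched, with the application deciding what to do with it. The normalisation step is just one possible choice by the application, not something XML prescribes.

```python
import xml.etree.ElementTree as ET

author = ET.fromstring("<author>  Hans   Hansen </author>")

# The parser returns the element content exactly as written, spaces included.
raw = author.text

# The application chooses its own interpretation, here collapsing runs of white space.
normalised = " ".join(raw.split())

print(repr(raw))
print(repr(normalised))
```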

XML to describe a person
We need to record the following information for a person:
firstname, middlename, surname, social security number and gender
How do we describe this in XML?

person.xml
<?xml version="1.0" encoding="UTF-8"?>
<person>
  <firstname>Hans</firstname>
  <middlename>John</middlename>
  <surname>Hansen</surname>
  <socialSecurityNumber>01108298649</socialSecurityNumber>
  <gender>male</gender>
</person>

compared to database ...
How does what we have just looked at compare to information in a database or ER modelling?
The root element <person> is an entity, so it corresponds to a table name
The element names <firstname> etc. correspond to attributes/columns of a table
The content of the elements corresponds to a tuple of data from a table/relation
It is important to begin to see this parallel with data in a database
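The parallel can be made concrete by reading person.xml into a table name, a list of columns and a tuple of values. This is only an illustrative Python sketch; the function name to_row is hypothetical and no real database is involved.

```python
import xml.etree.ElementTree as ET

PERSON_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<person>
  <firstname>Hans</firstname>
  <middlename>John</middlename>
  <surname>Hansen</surname>
  <socialSecurityNumber>01108298649</socialSecurityNumber>
  <gender>male</gender>
</person>
"""


def to_row(document):
    """Map root element -> table name, child names -> columns, child content -> tuple."""
    root = ET.fromstring(document)
    columns = [child.tag for child in root]
    values = tuple(child.text for child in root)
    return root.tag, columns, values


table, columns, values = to_row(PERSON_XML)
print(table)
print(columns)
print(values)
```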

But what I wanted was a list of people
What happens if we try to add another <person></person> in the same file?
Junk after the root element!
The people need a common root element: <personList></personList>
This is a classic problem beginners face when learning XML
You really have to plan your XML
Important to understand for the assignment and the exam
But planning is something we do when we work with XSD

personList.xml
<?xml version="1.0" encoding="UTF-8"?>
<personList>
  <person>
    <firstname>Hans</firstname>
    <middlename>John</middlename>
    <surname>Hansen</surname>
    <socialSecurityNumber>01108298649</socialSecurityNumber>
    <gender>male</gender>
  </person>
</personList>

compared to database ...
Slightly different now: the root element no longer corresponds to the entity but indicates a plural of the entity
But there is no requirement for this in element naming
The <person> element still corresponds to an instance of an entity, but now acts as a delimiter between multiple instances of the entity
A table row delimiter if you like ...
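Seen from a program, each <person> element becomes one row. The Python sketch below assumes a shortened personList with two people; the second person and the function name to_rows are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical list: the second <person> is not in the original file.
PERSON_LIST = """<personList>
  <person>
    <firstname>Hans</firstname>
    <surname>Hansen</surname>
  </person>
  <person>
    <firstname>Eli</firstname>
    <surname>Rørvik</surname>
  </person>
</personList>"""


def to_rows(xml_text):
    """Return one dictionary per <person>, like one row per entity instance."""
    root = ET.fromstring(xml_text)
    return [{field.tag: field.text for field in person}
            for person in root.findall("person")]


rows = to_rows(PERSON_LIST)
for row in rows:
    print(row)
```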

XML to describe a book
Next we want to describe the concept of a book
Some basic elements we would expect in a book:
book name, authors, chapters
A chapter has some elements:
chapter title, paragraphs of text
We now begin to see that there is a requirement to be able to define structure

book.xml
<?xml version="1.0" encoding="UTF-8"?>
<book>
  <author>Hans Hansen</author>
  <bookTitle>The book about XML</bookTitle>
  <chapter>
    <chapterTitle>Introduction</chapterTitle>
    <paragraph></paragraph>
  </chapter>
  <chapter>
    <chapterTitle>XML Root element</chapterTitle>
  </chapter>
</book>

Empty elements
Elements that have no data are considered empty
<paragraph></paragraph>
And can be written as
<paragraph/>
Empty elements should never be in a Noark 5 extraction
Requirement 5.12.5: Metadata items that do not have a value must be excluded from the extraction
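One way an export program could honour such a requirement is to strip empty elements before the file is written. The following Python sketch illustrates the idea only; it is not how any particular Noark 5 system does it, and the function name remove_empty_elements is made up.

```python
import xml.etree.ElementTree as ET


def remove_empty_elements(element):
    """Recursively drop children that have no text, no attributes and no children."""
    for child in list(element):
        # Clean the child first, so a parent left with nothing is removed as well.
        remove_empty_elements(child)
        has_text = bool((child.text or "").strip())
        if not has_text and len(child) == 0 and not child.attrib:
            element.remove(child)
    return element


book = ET.fromstring(
    "<book>"
    "<bookTitle>The book about XML</bookTitle>"
    "<chapter>"
    "<chapterTitle>Introduction</chapterTitle>"
    "<paragraph></paragraph>"
    "<paragraph/>"
    "</chapter>"
    "</book>"
)
remove_empty_elements(book)
print(ET.tostring(book, encoding="unicode"))
```

Both spellings of the empty element, <paragraph></paragraph> and <paragraph/>, look identical to the parser, so both are removed.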

Empty elements
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment -->
<book>
  <author>Hans Hansen</author>
  <bookTitle>The book about XML</bookTitle>
  <chapter>
    <chapterTitle>Introduction</chapterTitle>
    <paragraph></paragraph>
    <paragraph/>
  </chapter>
  <chapter>
    <chapterTitle>XML Root element</chapterTitle>
  </chapter>
</book>

Element types
In XML we have a clear distinction between an element that is a simpleType and one that is a complexType
A simpleType contains only data, e.g. <firstname> below
A complexType can contain child elements, e.g. <name> below
<name>
  <firstname>Hans</firstname>
  <surname>Hansen</surname>
</name>

Recap
We have introduced XML and looked at
XML documents: structure and document element
Basic rules
Elements: definition and types
We identified that there is a need for structure but have not yet defined any particular need

An XML Prologue
The prologue consists of
XML declaration
Comments
Blank lines
Structure validation (document type declaration)
Processing instructions, e.g. style information
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is an example comment-->
<!DOCTYPE arkiv SYSTEM "http://www.kdrs.no/dtd/fonds.dtd">
<?xml-stylesheet href="fonds.css" type="text/css"?>

Processing instructions
Processing instructions are not part of the data in an XML document
They provide instructions on how an (external) application should process the XML document, e.g. convert it to another format
A processing instruction starts with <? and ends with ?>
An example
<?xml-stylesheet href="fonds.css" type="text/css"?>
The XML file will be formatted according to the CSS formatting instructions specified in the file fonds.css

Processing instructions
XSLT specifies how an XML document will be processed to create a new document
Needs an XSLT processor (e.g. Firefox)
xml + xslt → XSLT processor → new document

Process fonds.xml
Download the files fonds.xml, fonds.xslt and fonds.css from http://edu.hioa.no/ark2200/current/aids/
CSS and XSLT are not part of this course; this is just to show you what they do

Processing instructions
Currently the RM/archive profession is not that concerned about this, but it will be at some stage
But when an archive uses XML, is it to process it or to preserve it?
My impression is that the field is very dependent on RDBMS for processing data

XML Criticisms
Unnecessary!
Too much markup data
The file size is larger than necessary
Could be a problem where bandwidth is limited or Internet access is expensive
<person> <firstname>Hans</firstname> <surname>Hansen</surname> <alder>45</alder> </person>
95 characters
(person firstname(Hans) surname(Hansen) alder(45) )
61 characters

Advantages
Markup is plain text and human readable
Data is separated from presentation
Independent of system, software and hardware
Non-proprietary
(Relatively) easy to implement solutions on top of XML
Easy to import or export data to a database using XML

XML as an archive format
XML used in preservation
Transfer to an archive institution for Noark 5
The DIAS standard (national archive package format) is based on XML
XML used during recordkeeping
Export / import of data to a Noark system
BEST standard for exchange of case documents