Database to XML extractions


Database to XML extractions ARK2200 Digital recordkeeping and preservation II 2017 Thomas Sødring thomas.sodring@hioa.no P48-R407 67238287

What's the problem?
A government database contains information that will likely be subject to long-term preservation. A DBMS will probably have a lifetime of 10-20 years (this figure is a guess), and a less popular DBMS will have a shorter one. DBMS products also evolve over time. We need to make the data in the database independent of the underlying DBMS.

How do we convert?
We could create a database dump, or a backup, of our data, but that might result in a binary file whose contents we may be unable to read in a few years. In a way the data is 'locked in' at the physical layer; we need to work with the data at the logical layer. We could store the data in text files as fixed-width, CSV or XML. All are possible approaches, but we have seen how marking up data in XML makes sense.
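To make the three text formats concrete, here is a minimal Python sketch that writes one row of the Car table (used in the following slides) as fixed-width text, CSV and XML. The column width of 12 characters is an assumption chosen only for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# One row of the Car table used in the following slides.
columns = ["registrationNr", "chassisNr", "colour", "manufacturer", "model"]
row = ["LH12984", "10946534", "Red", "Volkswagen", "Golf"]

# Fixed-width: every value is padded to an assumed width of 12 characters.
fixed_width = "".join(value.ljust(12) for value in row)

# CSV: values separated by commas; the csv module handles any quoting.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(row)
csv_line = buffer.getvalue().strip()

# XML: every value is wrapped in an element named after its column.
car = ET.Element("car")
for name, value in zip(columns, row):
    ET.SubElement(car, name).text = value
xml_text = ET.tostring(car, encoding="unicode")

print(fixed_width)
print(csv_line)
print(xml_text)
```

Only the XML form carries the column names along with every value, which is the sense in which marking up the data helps a future reader.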

Converting to XML
Migrating a database table to XML is fairly straightforward. The table name becomes the root node, probably in plural form, and can also serve as the filename with a .xml extension: a table named Car results in a <cars> root element. The column (attribute) names become the element names, so registrationNr, chassisNr etc. become <registrationNr>, <chassisNr> and so on.

Convert the table Car to XML
registrationNr  chassisNr  colour  manufacturer  model
LH12984         10946534   Red     Volkswagen    Golf
DK23491         9648573    Blue    Toyota        Yaris
BP12349         5523840    Green   Skoda         Fabia
ZT97495         2643923    White   Seat          Leon

Step 1: Create an XML file
Table: Car (same rows as above). File: Cars.xml
    <?xml version="1.0" encoding="UTF-8"?>

Step 2: Create the root node from the table name
Table: Car (same rows as above). File: Cars.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <cars>
    </cars>

Step 3: Row delimiter
We need to identify a row delimiter. The plural form of the entity is the root node; the singular form is the row delimiter and represents one actual instance of the entity.
Table: Car (same rows as above). File: Cars.xml
    <car>
      <registrationNr>LH12984</registrationNr>
      <chassisNr>10946534</chassisNr>
      <colour>Red</colour>
      <manufacturer>Volkswagen</manufacturer>
      <model>Golf</model>
    </car>

Step 4: Copy each row to XML
Table: Car (same rows as above). File: Cars.xml
Each row is wrapped in its own <car> element; the remaining two rows follow the same pattern.
    <?xml version="1.0" encoding="UTF-8"?>
    <cars>
      <car>
        <registrationNr>LH12984</registrationNr>
        <chassisNr>10946534</chassisNr>
        <colour>Red</colour>
        <manufacturer>Volkswagen</manufacturer>
        <model>Golf</model>
      </car>
      <car>
        <registrationNr>DK23491</registrationNr>
        <chassisNr>9648573</chassisNr>
        <colour>Blue</colour>
        <manufacturer>Toyota</manufacturer>
        <model>Yaris</model>
      </car>
    </cars>
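Steps 1 to 4 can be sketched in Python with the standard sqlite3 and xml.etree.ElementTree modules. This is an illustration under assumptions, not the method used in the course: SQLite stands in for the source DBMS, the function name table_to_xml is hypothetical, and a NULL value is written as an empty element.

```python
import sqlite3
import xml.etree.ElementTree as ET

def table_to_xml(conn, table, root_name, row_name):
    """Copy every row of `table` into a root element (hypothetical helper)."""
    # The table name is interpolated into the SQL, so pass trusted names only.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    root = ET.Element(root_name)                    # step 2: root node
    for row in cursor:
        row_el = ET.SubElement(root, row_name)      # step 3: row delimiter
        for name, value in zip(columns, row):       # step 4: copy the values
            # Assumption: NULL becomes an empty element.
            ET.SubElement(row_el, name).text = None if value is None else str(value)
    return root

# Build the Car table from the slides in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car (registrationNr TEXT PRIMARY KEY, chassisNr TEXT, "
    "colour TEXT, manufacturer TEXT, model TEXT)"
)
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?, ?)",
    [
        ("LH12984", "10946534", "Red", "Volkswagen", "Golf"),
        ("DK23491", "9648573", "Blue", "Toyota", "Yaris"),
        ("BP12349", "5523840", "Green", "Skoda", "Fabia"),
        ("ZT97495", "2643923", "White", "Seat", "Leon"),
    ],
)

cars = table_to_xml(conn, "Car", "cars", "car")
ET.indent(cars)  # pretty-printing; requires Python 3.9 or newer
# Step 1: the XML declaration is added when serialising.
xml_bytes = ET.tostring(cars, encoding="UTF-8", xml_declaration=True)
print(xml_bytes.decode("UTF-8"))
```

The bytes in xml_bytes could be written to Cars.xml as they are. Passing the root and row names explicitly avoids having to guess plural forms from the table name.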

So far
We have only copied the data; we have not recorded the table name, primary keys, foreign key relationships etc. The next step is either to create an additional file containing this information, or to include it just after the root node.

Describe DDL information
    <?xml version="1.0" encoding="UTF-8"?>
    <database>
      <tables>
        <table>
          <tablename>Cars</tablename>
          <primaryKey>registrationNr</primaryKey>
          <foreignKey/>
          <attributes>
            <attribute>
              <attributeName>registrationNr</attributeName>
              <attributeDataType>string</attributeDataType>
              <attributeNotNull>true</attributeNotNull>
            </attribute>
          </attributes>
        </table>
      </tables>
    </database>
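A description like this can be generated from the database catalogue rather than typed by hand. The sketch below assumes SQLite as the source and reads its PRAGMA table_info output; the function name describe_table is hypothetical. It writes the declared SQLite type (for example TEXT) rather than the generic "string" shown on the slide, and it does not extract foreign keys, so the <foreignKey/> element is always left empty.

```python
import sqlite3
import xml.etree.ElementTree as ET

def describe_table(conn, table):
    """Build a <table> element describing `table` (hypothetical helper)."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    table_el = ET.Element("table")
    ET.SubElement(table_el, "tablename").text = table
    for _, name, _, _, _, pk in info:
        if pk:  # pk is non-zero for primary key columns
            ET.SubElement(table_el, "primaryKey").text = name
    ET.SubElement(table_el, "foreignKey")  # foreign keys not handled here
    attributes = ET.SubElement(table_el, "attributes")
    for _, name, declared_type, notnull, _, _ in info:
        attribute = ET.SubElement(attributes, "attribute")
        ET.SubElement(attribute, "attributeName").text = name
        ET.SubElement(attribute, "attributeDataType").text = declared_type
        ET.SubElement(attribute, "attributeNotNull").text = "true" if notnull else "false"
    return table_el

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car (registrationNr TEXT NOT NULL PRIMARY KEY, "
    "chassisNr TEXT, colour TEXT, manufacturer TEXT, model TEXT)"
)

database = ET.Element("database")
tables = ET.SubElement(database, "tables")
car_description = describe_table(conn, "Car")
tables.append(car_description)
ET.indent(database)  # requires Python 3.9 or newer
print(ET.tostring(database, encoding="unicode"))
```

Other DBMSs expose the same facts through their own catalogues, so only the query inside describe_table would need to change.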

So far
The approach so far is very much a mapping of a single table to a single XML file. The weakness of this approach is that it is left to whoever processes the files in the future to figure out how information across tables is related. The relational model and the use of normalisation may make analysis more difficult than it needs to be. We might want to put related information into the same table, which may result in a 'de-normalisation'.

Handling related information
Table Student:
StudentNr  Firstname  Surname
12345      Jan        Karlson
23456      Pål        Solberg
34567      Mette      Johansen
45678      Ingrid     Aleksandersen
Table StudentTelephoneNr:
StudentNr  TelephoneNr
12345      76543829
12345      90783298
34567      99456543
34567      45990234

Handling related information
Sometimes you will naturally see related information among the tables; this is a result of the modelling and normalisation processes. Extracting the data will then be done on the basis of a join, which raises the question of which type to use: inner, left, right or full. You may or may not want to aggregate this information. Your preservation strategy will be the deciding factor.
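The choice of join type has a visible effect on what ends up in the extraction. The following sketch, again assuming SQLite as a stand-in DBMS, loads the Student and StudentTelephoneNr tables and compares an inner join with a left join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student (studentNr INTEGER PRIMARY KEY, firstname TEXT, surname TEXT)"
)
conn.execute(
    "CREATE TABLE StudentTelephoneNr ("
    "studentNr INTEGER REFERENCES Student(studentNr), telephoneNr TEXT)"
)
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [
        (12345, "Jan", "Karlson"),
        (23456, "Pål", "Solberg"),
        (34567, "Mette", "Johansen"),
        (45678, "Ingrid", "Aleksandersen"),
    ],
)
conn.executemany(
    "INSERT INTO StudentTelephoneNr VALUES (?, ?)",
    [
        (12345, "76543829"),
        (12345, "90783298"),
        (34567, "99456543"),
        (34567, "45990234"),
    ],
)

query = (
    "SELECT s.studentNr, t.telephoneNr FROM Student s "
    "{join} JOIN StudentTelephoneNr t ON t.studentNr = s.studentNr "
    "ORDER BY s.studentNr, t.telephoneNr"
)
inner_rows = conn.execute(query.format(join="INNER")).fetchall()
left_rows = conn.execute(query.format(join="LEFT")).fetchall()

# An inner join keeps only students that have at least one telephone number.
inner_students = sorted({student_nr for student_nr, _ in inner_rows})
# A left join keeps every student; missing numbers come back as None.
left_students = sorted({student_nr for student_nr, _ in left_rows})

print(inner_students)  # [12345, 34567]
print(left_students)   # [12345, 23456, 34567, 45678]
```

With an inner join, the students without a telephone number would silently disappear from the preserved data, so a left join from Student is the safer default if every student must survive the extraction.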

Handling related information
Table Student:
studentNr  firstname  surname
12345      Jan        Karlson
23456      Pål        Solberg
34567      Mette      Johansen
45678      Ingrid     Aleksandersen
Table StudentTelephoneNr:
telephoneNr
76543829
90783298
99456543
45990234
File: students.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <students>
      <student>
        <studentNr>12345</studentNr>
        <firstname>Jan</firstname>
        <surname>Karlson</surname>
        <studentTelephoneNr>
          <telephoneNr>76543829</telephoneNr>
          <telephoneNr>90783298</telephoneNr>
        </studentTelephoneNr>
      </student>
      <student>
        <studentNr>34567</studentNr>
        <firstname>Mette</firstname>
        <surname>Johansen</surname>
        <studentTelephoneNr>
          <telephoneNr>99456543</telephoneNr>
          <telephoneNr>45990234</telephoneNr>
        </studentTelephoneNr>
      </student>
    </students>
Where is studentNr? The StudentTelephoneNr rows no longer carry it; the nesting inside <student> expresses the relationship instead.
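An aggregated file of this kind can be produced by nesting each student's telephone numbers inside the <student> element. The sketch below is one possible approach under assumptions: SQLite stands in for the source DBMS, the function name students_to_xml is hypothetical, and a student without telephone numbers simply gets no <studentTelephoneNr> element.

```python
import sqlite3
import xml.etree.ElementTree as ET

def students_to_xml(conn):
    """Aggregate Student and StudentTelephoneNr into one tree (hypothetical helper)."""
    root = ET.Element("students")
    students = conn.execute(
        "SELECT studentNr, firstname, surname FROM Student ORDER BY studentNr"
    ).fetchall()
    for student_nr, firstname, surname in students:
        student = ET.SubElement(root, "student")
        ET.SubElement(student, "studentNr").text = str(student_nr)
        ET.SubElement(student, "firstname").text = firstname
        ET.SubElement(student, "surname").text = surname
        numbers = conn.execute(
            "SELECT telephoneNr FROM StudentTelephoneNr "
            "WHERE studentNr = ? ORDER BY rowid",
            (student_nr,),
        ).fetchall()
        if numbers:
            # The foreign key is not repeated; nesting carries the relationship.
            group = ET.SubElement(student, "studentTelephoneNr")
            for (number,) in numbers:
                ET.SubElement(group, "telephoneNr").text = number
    return root

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student (studentNr INTEGER PRIMARY KEY, firstname TEXT, surname TEXT)"
)
conn.execute("CREATE TABLE StudentTelephoneNr (studentNr INTEGER, telephoneNr TEXT)")
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [
        (12345, "Jan", "Karlson"),
        (23456, "Pål", "Solberg"),
        (34567, "Mette", "Johansen"),
        (45678, "Ingrid", "Aleksandersen"),
    ],
)
conn.executemany(
    "INSERT INTO StudentTelephoneNr VALUES (?, ?)",
    [
        (12345, "76543829"),
        (12345, "90783298"),
        (34567, "99456543"),
        (34567, "45990234"),
    ],
)

students = students_to_xml(conn)
ET.indent(students)  # requires Python 3.9 or newer
print(ET.tostring(students, encoding="unicode"))
```

Running one query per student keeps the sketch short; for a large table a single left join grouped by studentNr would avoid the repeated queries.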

No clear answer
There is no clear answer about which tables should be aggregated and which should not, nor about whether we should preserve each table only as its own XML file or attempt aggregations. Aggregation undoes normalisation and can make the data slightly more difficult to load back into a database. But the anomalies that normalisation solves (insertion, deletion and update) are not relevant in a long-term preservation perspective.