Database to XML extractions

ARK2200 Digital recordkeeping and preservation II 2017 Thomas Sødring P48-R407

What's the problem?
A government database contains information that will likely be subject to long-term preservation. DBMSs will probably have a year life-time*, but a less popular DBMS will have a shorter life-time. DBMS products evolve over time. We need to make the data in the database independent of the underlying DBMS.
*This figure is a guess.

How do we convert?
We could create a database dump, or a backup, of our data, but that might result in a binary file whose contents we might be unable to read in a few years. In a way the data is 'locked in' at the physical layer. We need to work with data at the logical layer. We could store data in text files as fixed-width, CSV or XML. All are possible approaches, but we have seen how marking up data in XML makes sense.

Converting to XML
Migrating a database table to XML is pretty straightforward. The table name becomes the root node and can also be used as the filename, with a .xml extension. It is probably best to make it plural: a table named Car results in a <cars> root element. The column (attribute) names become the element names: registrationNr, chassisNr etc. become <registrationNr>, <chassisNr>.

Convert the table Car to XML
registrationNr | chassisNr | colour | manufacturer | model
LH12984 | | Red | Volkswagen | Golf
DK23491 | | Blue | Toyota | Yaris
BP12349 | | Green | Skoda | Fabia
ZT97495 | | White | Seat | Leon

Step 1: Create an XML-file
Cars.xml:
<?xml version="1.0" encoding="UTF-8"?>

Step 2: Create root node from table name
Cars.xml:
<?xml version="1.0" encoding="UTF-8"?>
<cars>
</cars>

Step 3: Row delimiter
We need to identify a row delimiter. The plural form of the entity will be the root node. The singular form of the entity will be the row delimiter, wrapping an actual instance of the entity.
Cars.xml:
<car>
  <registrationNr>LH12984</registrationNr>
  <chassisNr> </chassisNr>
  <colour>Red</colour>
  <manufacturer>Volkswagen</manufacturer>
  <model>Golf</model>
</car>

Step 4: Copy each row to XML
Cars.xml:
<?xml version="1.0" encoding="UTF-8"?>
<cars>
  <car>
    <registrationNr>LH12984</registrationNr>
    <chassisNr> </chassisNr>
    <colour>Red</colour>
    <manufacturer>Volkswagen</manufacturer>
    <model>Golf</model>
  </car>
  <car>
    <registrationNr>DK23491</registrationNr>
    <chassisNr> </chassisNr>
    <colour>Blue</colour>
    <manufacturer>Toyota</manufacturer>
    <model>Yaris</model>
  </car>
</cars>
The remaining rows (BP12349 and ZT97495) are copied in the same way, each inside its own <car> element.
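Steps 1 to 4 can also be automated. The following is a minimal sketch in Python, assuming the Car table lives in an SQLite database; the function name table_to_xml and the in-memory database are illustrative assumptions, and the chassisNr values are left empty as in the table above.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn, table, root_name, row_name):
    """Return an XML tree with one <row_name> element per row of `table`."""
    # A table name cannot be bound as a SQL parameter, so `table` must come
    # from a trusted source (here it is hard-coded by the caller).
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    root = ET.Element(root_name)
    for row in cursor:
        row_element = ET.SubElement(root, row_name)
        for column, value in zip(columns, row):
            # A NULL value becomes an empty element in this sketch.
            ET.SubElement(row_element, column).text = (
                "" if value is None else str(value)
            )
    return root


# Hypothetical in-memory copy of the Car table from the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car (registrationNr TEXT PRIMARY KEY, chassisNr TEXT, "
    "colour TEXT, manufacturer TEXT, model TEXT)"
)
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?, ?)",
    [
        ("LH12984", None, "Red", "Volkswagen", "Golf"),
        ("DK23491", None, "Blue", "Toyota", "Yaris"),
        ("BP12349", None, "Green", "Skoda", "Fabia"),
        ("ZT97495", None, "White", "Seat", "Leon"),
    ],
)

# Plural root node, singular row delimiter, column names as element names.
cars = table_to_xml(conn, "Car", "cars", "car")
ET.indent(cars)  # pretty-printing, available from Python 3.9
xml_bytes = ET.tostring(cars, encoding="UTF-8", xml_declaration=True)
print(xml_bytes.decode("UTF-8"))
```

The root and row names are passed in explicitly because forming a plural automatically is language dependent and not something the column metadata can tell us.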

So far
We have only copied the data; we have not identified the table name, primary keys, foreign key relationships etc. The next step is either to create an additional file containing this information or to include this information just after the root node.

Describe DDL information
<?xml version="1.0" encoding="UTF-8"?>
<database>
  <tables>
    <table>
      <tablename>Cars</tablename>
      <primaryKey>registrationNr</primaryKey>
      <foreignKey/>
      <attributes>
        <attribute>
          <attributeName>registrationNr</attributeName>
          <attributeDataType>string</attributeDataType>
          <attributeNotNull>true</attributeNotNull>
        </attribute>
      </attributes>
    </table>
  </tables>
</database>
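A description like this can usually be generated from the DBMS catalogue rather than written by hand. The sketch below assumes SQLite, where PRAGMA table_info and PRAGMA foreign_key_list expose the schema; the function name describe_table and the references attribute on <foreignKey> are assumptions of this example, and the data type is reported as SQLite declares it (TEXT) rather than as "string".

```python
import sqlite3
import xml.etree.ElementTree as ET


def describe_table(conn, table):
    """Return a <table> element describing the schema of `table`."""
    element = ET.Element("table")
    ET.SubElement(element, "tablename").text = table
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    for _cid, name, _type, _notnull, _default, pk in columns:
        if pk:
            ET.SubElement(element, "primaryKey").text = name
    # PRAGMA foreign_key_list yields (id, seq, table, from, to, ...).
    foreign_keys = conn.execute(
        f'PRAGMA foreign_key_list("{table}")'
    ).fetchall()
    if not foreign_keys:
        ET.SubElement(element, "foreignKey")  # empty element, as on the slide
    for foreign_key in foreign_keys:
        # The `references` attribute is an assumption of this sketch.
        ET.SubElement(
            element, "foreignKey", references=foreign_key[2]
        ).text = foreign_key[3]
    attributes = ET.SubElement(element, "attributes")
    for _cid, name, data_type, notnull, _default, _pk in columns:
        attribute = ET.SubElement(attributes, "attribute")
        ET.SubElement(attribute, "attributeName").text = name
        ET.SubElement(attribute, "attributeDataType").text = data_type
        ET.SubElement(attribute, "attributeNotNull").text = (
            "true" if notnull else "false"
        )
    return element


# Hypothetical schema for the Car table; only the structure matters here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car (registrationNr TEXT PRIMARY KEY NOT NULL, "
    "chassisNr TEXT, colour TEXT, manufacturer TEXT, model TEXT)"
)

database = ET.Element("database")
tables = ET.SubElement(database, "tables")
tables.append(describe_table(conn, "Car"))
ET.indent(database)  # pretty-printing, available from Python 3.9
print(ET.tostring(database, encoding="unicode"))
```

Other DBMSs expose the same information through different catalogue views, so the two PRAGMA queries are the part that would need replacing.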

So far
The approach so far is very much a mapping of a single table to a single XML-file. The weakness of this approach is that it is up to the person processing the files in the future to figure out how information across tables is related. The relational model and the use of normalisation may make analysis more difficult than it needs to be. We might want to put related information into the same file, which may result in a 'de-normalisation'.

Handling related information
Student
StudentNr | Firstname | Surname
12345 | Jan | Karlson
23456 | Pål | Solberg
34567 | Mette | Johansen
45678 | Ingrid | Aleksandersen
StudentTelephoneNr
StudentNr | TelephoneNr
12345 |
12345 |
34567 |
34567 |

Sometimes you will naturally see related information among the tables. This is a result of the modelling and normalisation processes. Extracting the data will then be done on the basis of a join, which raises the question of which type to use: inner, left, right or full? You may want to aggregate this information; you may also not want to. Your preservation strategy will be the deciding factor.
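The choice of join type decides which rows reach the extraction. The sketch below, assuming SQLite and the Student and StudentTelephoneNr tables, compares an inner join with a left join; the telephone numbers are hypothetical placeholders because the source tables do not show them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Student (
        studentNr INTEGER PRIMARY KEY, firstname TEXT, surname TEXT
    );
    CREATE TABLE StudentTelephoneNr (
        studentNr INTEGER REFERENCES Student(studentNr), telephoneNr TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [
        (12345, "Jan", "Karlson"),
        (23456, "Pål", "Solberg"),
        (34567, "Mette", "Johansen"),
        (45678, "Ingrid", "Aleksandersen"),
    ],
)
# Placeholder telephone numbers (hypothetical values).
conn.executemany(
    "INSERT INTO StudentTelephoneNr VALUES (?, ?)",
    [
        (12345, "00000001"),
        (12345, "00000002"),
        (34567, "00000003"),
        (34567, "00000004"),
    ],
)

query = """
    SELECT s.studentNr, s.surname, t.telephoneNr
    FROM Student s {join} StudentTelephoneNr t ON t.studentNr = s.studentNr
    ORDER BY s.studentNr, t.telephoneNr
"""
# Inner join: only students that have at least one telephone number.
inner_rows = conn.execute(query.format(join="JOIN")).fetchall()
# Left join: every student, with NULL where no telephone number exists.
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()

print(len(inner_rows), "rows from the inner join")
print(len(left_rows), "rows from the left join")
```

With an inner join the students without a telephone number drop out of the result, so for preservation purposes a left join from Student is probably the safer starting point.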

Student rows become <student> elements, and the related StudentTelephoneNr rows are nested inside them.
students.xml:
<?xml version="1.0" encoding="UTF-8"?>
<students>
  <student>
    <studentNr>12345</studentNr>
    <firstname>Jan</firstname>
    <surname>Karlson</surname>
    <studentTelephoneNr>
      <telephoneNr> </telephoneNr>
      <telephoneNr> </telephoneNr>
    </studentTelephoneNr>
  </student>
  <student>
    <studentNr>23456</studentNr>
    <firstname>Pål</firstname>
    <surname>Solberg</surname>
    <studentTelephoneNr>
      <telephoneNr> </telephoneNr>
      <telephoneNr> </telephoneNr>
    </studentTelephoneNr>
  </student>
</students>
Where is the studentNr column of StudentTelephoneNr?
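An aggregated file of this kind can be produced by nesting a second query inside the loop over students. This is a minimal sketch assuming SQLite; the telephone numbers are hypothetical placeholders, and leaving out <studentTelephoneNr> for students without numbers is a choice made for this example only.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Student (
        studentNr INTEGER PRIMARY KEY, firstname TEXT, surname TEXT
    );
    CREATE TABLE StudentTelephoneNr (studentNr INTEGER, telephoneNr TEXT);
    """
)
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [
        (12345, "Jan", "Karlson"),
        (23456, "Pål", "Solberg"),
        (34567, "Mette", "Johansen"),
        (45678, "Ingrid", "Aleksandersen"),
    ],
)
# Placeholder telephone numbers (hypothetical values).
conn.executemany(
    "INSERT INTO StudentTelephoneNr VALUES (?, ?)",
    [
        (12345, "00000001"),
        (12345, "00000002"),
        (34567, "00000003"),
        (34567, "00000004"),
    ],
)

students = ET.Element("students")
student_rows = conn.execute(
    "SELECT studentNr, firstname, surname FROM Student ORDER BY studentNr"
).fetchall()
for student_nr, firstname, surname in student_rows:
    student = ET.SubElement(students, "student")
    ET.SubElement(student, "studentNr").text = str(student_nr)
    ET.SubElement(student, "firstname").text = firstname
    ET.SubElement(student, "surname").text = surname
    numbers = conn.execute(
        "SELECT telephoneNr FROM StudentTelephoneNr "
        "WHERE studentNr = ? ORDER BY telephoneNr",
        (student_nr,),
    ).fetchall()
    if numbers:
        # The foreign key is not repeated: nesting inside <student> implies it.
        wrapper = ET.SubElement(student, "studentTelephoneNr")
        for (telephone_nr,) in numbers:
            ET.SubElement(wrapper, "telephoneNr").text = telephone_nr

ET.indent(students)  # pretty-printing, available from Python 3.9
print(ET.tostring(students, encoding="unicode"))
```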

No clear answer
There is no clear answer as to which tables should be aggregated and which should not. Nor is there a clear answer as to whether we should only preserve each table as an XML-file or attempt aggregations. Aggregating data undoes normalisation and can make the data slightly more difficult to load back into a database. But the anomalies that normalisation solves (insertion, deletion, update) are not relevant in a long-term preservation perspective.
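To illustrate the extra handling an aggregated file needs, the sketch below reads a nested students file back into two tables, assuming SQLite. The XML content and its telephone numbers are hypothetical placeholders; note how the foreign key has to be recovered from the enclosing <student> element.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical aggregated file content; telephone numbers are placeholders.
AGGREGATED = """<students>
  <student>
    <studentNr>12345</studentNr>
    <firstname>Jan</firstname>
    <surname>Karlson</surname>
    <studentTelephoneNr>
      <telephoneNr>00000001</telephoneNr>
      <telephoneNr>00000002</telephoneNr>
    </studentTelephoneNr>
  </student>
  <student>
    <studentNr>23456</studentNr>
    <firstname>Pål</firstname>
    <surname>Solberg</surname>
  </student>
</students>"""

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Student (
        studentNr INTEGER PRIMARY KEY, firstname TEXT, surname TEXT
    );
    CREATE TABLE StudentTelephoneNr (studentNr INTEGER, telephoneNr TEXT);
    """
)

for student in ET.fromstring(AGGREGATED).findall("student"):
    student_nr = int(student.findtext("studentNr"))
    conn.execute(
        "INSERT INTO Student VALUES (?, ?, ?)",
        (student_nr, student.findtext("firstname"), student.findtext("surname")),
    )
    # Re-normalise: each nested number becomes its own row, and the
    # foreign key is taken from the enclosing <student> element.
    for telephone in student.findall("studentTelephoneNr/telephoneNr"):
        conn.execute(
            "INSERT INTO StudentTelephoneNr VALUES (?, ?)",
            (student_nr, telephone.text),
        )

student_rows = conn.execute(
    "SELECT studentNr, surname FROM Student ORDER BY studentNr"
).fetchall()
telephone_rows = conn.execute(
    "SELECT studentNr, telephoneNr FROM StudentTelephoneNr "
    "ORDER BY telephoneNr"
).fetchall()
print(student_rows)
print(telephone_rows)
```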
