Database to XML extractions ARK2200 Digital recordkeeping and preservation II 2017 Thomas Sødring thomas.sodring@hioa.no P48-R407 67238287

What's the problem?
A government database contains information that is likely to be subject to long-term preservation. A DBMS will probably have a lifetime of 10-20 years*, and a less popular DBMS will have a shorter one. DBMS products also evolve over time. We need to make the data in the database independent of the underlying DBMS.
*This figure is a guess.

How do we convert?
We could create a database dump, or a backup, of our data, but that might result in a binary file whose contents we may be unable to read in a few years. In a way, the data is 'locked in' at the physical layer, so we need to work with the data at the logical layer. We could store the data in text files as fixed-width, CSV or XML. All are possible approaches, but we have seen how marking up data in XML makes sense.

Converting to XML
Migrating a database table to XML is fairly straightforward. The table name becomes the root node, probably in the plural, and can also serve as the filename with a .xml extension: a table named Car results in a <cars> root element. The column (attribute) names become the element names: registrationNr, chassisNr etc. become <registrationNr>, <chassisNr> etc.

Convert the table Car to XML

Car
registrationNr  chassisNr  colour  manufacturer  model
LH12984         10946534   Red     Volkswagen    Golf
DK23491         9648573    Blue    Toyota        Yaris
BP12349         5523840    Green   Skoda         Fabia
ZT97495         2643923    White   Seat          Leon

Step 1: Create an XML file
Create the file Cars.xml for the table Car, starting with the XML declaration.

Cars.xml
<?xml version="1.0" encoding="UTF-8"?>

Step 2: Create the root node from the table name
The table Car gives the root element <cars>.

Cars.xml
<?xml version="1.0" encoding="UTF-8"?>
<cars>
</cars>

Step 3: Row delimiter
We need to identify a row delimiter. The plural form of the entity name is the root node. The singular form of the entity name is the row delimiter, wrapping one actual instance of the entity.

Cars.xml (the first row of Car)
<car>
  <registrationNr>LH12984</registrationNr>
  <chassisNr>10946534</chassisNr>
  <colour>Red</colour>
  <manufacturer>Volkswagen</manufacturer>
  <model>Golf</model>
</car>

Step 4: Copy each row to XML
Every row of Car becomes its own <car> element inside <cars>. The first two rows are shown here; the remaining rows follow the same pattern.

Cars.xml
<?xml version="1.0" encoding="UTF-8"?>
<cars>
  <car>
    <registrationNr>LH12984</registrationNr>
    <chassisNr>10946534</chassisNr>
    <colour>Red</colour>
    <manufacturer>Volkswagen</manufacturer>
    <model>Golf</model>
  </car>
  <car>
    <registrationNr>DK23491</registrationNr>
    <chassisNr>9648573</chassisNr>
    <colour>Blue</colour>
    <manufacturer>Toyota</manufacturer>
    <model>Yaris</model>
  </car>
</cars>
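
Steps 1 to 4 can also be scripted. The following is a minimal Python sketch under stated assumptions, not a definitive implementation: it assumes the rows of Car have already been read from the database into tuples, and the helper name table_to_xml is hypothetical. It uses only the standard library module xml.etree.ElementTree; ET.indent requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def table_to_xml(root_name, row_name, columns, rows):
    """Build one root element with one child element per table row."""
    root = ET.Element(root_name)              # Step 2: plural table name
    for row in rows:
        row_el = ET.SubElement(root, row_name)  # Step 3: row delimiter
        for column, value in zip(columns, row):
            # Step 4: each column name becomes an element name
            ET.SubElement(row_el, column).text = str(value)
    return root


columns = ["registrationNr", "chassisNr", "colour", "manufacturer", "model"]
rows = [
    ("LH12984", 10946534, "Red", "Volkswagen", "Golf"),
    ("DK23491", 9648573, "Blue", "Toyota", "Yaris"),
    ("BP12349", 5523840, "Green", "Skoda", "Fabia"),
    ("ZT97495", 2643923, "White", "Seat", "Leon"),
]

cars = table_to_xml("cars", "car", columns, rows)
ET.indent(cars)  # pretty-print in place (Python 3.9+)

# Step 1: serialise with an XML declaration; this returns bytes
xml_bytes = ET.tostring(cars, encoding="UTF-8", xml_declaration=True)
print(xml_bytes.decode("utf-8"))
```

Writing xml_bytes to a file named Cars.xml would complete the extraction. Building the tree with a library rather than by string concatenation also means that characters such as & and < in the data are escaped when the tree is serialised.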

So far
We have only copied the data; we have not recorded the table name, primary keys, foreign key relationships etc. The next step is either to create an additional file containing this information or to include it just after the root node.

Describe DDL information

<?xml version="1.0" encoding="UTF-8"?>
<database>
  <tables>
    <table>
      <tablename>Cars</tablename>
      <primaryKey>registrationNr</primaryKey>
      <foreignKey/>
      <attributes>
        <attribute>
          <attributeName>registrationNr</attributeName>
          <attributeDataType>string</attributeDataType>
          <attributeNotNull>true</attributeNotNull>
        </attribute>
      </attributes>
    </table>
  </tables>
</database>
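
A description like this can often be generated from the database catalogue instead of being typed by hand. The sketch below is one possible approach and rests on assumptions: the slides do not name a DBMS, so SQLite is used purely for illustration, the column types in the CREATE TABLE statement are guesses, and the helper name describe_table is hypothetical. The element names follow the example above. Note that SQLite reports its own type names (such as TEXT), not the generic "string" used above.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumption: an illustrative SQLite version of the Car table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car ("
    " registrationNr TEXT NOT NULL PRIMARY KEY,"
    " chassisNr INTEGER NOT NULL,"
    " colour TEXT, manufacturer TEXT, model TEXT)"
)


def describe_table(conn, table_name):
    """Return a <table> element describing one table (hypothetical helper)."""
    table = ET.Element("table")
    ET.SubElement(table, "tablename").text = table_name

    # table_name is trusted here; identifiers cannot be bound as parameters.
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    info = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    for _cid, name, _type, _notnull, _default, pk in info:
        if pk:
            ET.SubElement(table, "primaryKey").text = name

    # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
    fks = conn.execute(f"PRAGMA foreign_key_list({table_name})").fetchall()
    if fks:
        for fk in fks:
            ET.SubElement(table, "foreignKey").text = fk[3]
    else:
        ET.SubElement(table, "foreignKey")  # empty element, as in the example

    attributes = ET.SubElement(table, "attributes")
    for _cid, name, col_type, notnull, _default, _pk in info:
        attribute = ET.SubElement(attributes, "attribute")
        ET.SubElement(attribute, "attributeName").text = name
        ET.SubElement(attribute, "attributeDataType").text = col_type
        ET.SubElement(attribute, "attributeNotNull").text = (
            "true" if notnull else "false"
        )
    return table


ddl = ET.Element("database")
tables = ET.SubElement(ddl, "tables")
tables.append(describe_table(conn, "Car"))
ET.indent(ddl)  # Python 3.9+
print(ET.tostring(ddl, encoding="unicode"))
```

Other DBMSs expose the same information through their own catalogue views, so only the two PRAGMA queries would need to change.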

So far
The approach so far is very much a mapping of a single table to a single XML file. Its weakness is that it is left to whoever processes the files in the future to work out how information across tables is related. The relational model and the use of normalisation may make analysis more difficult than it needs to be. We might want to put related information into the same table, which may result in a 'de-normalisation'.

Handling related information

Student
StudentNr  Firstname  Surname
12345      Jan        Karlson
23456      Pål        Solberg
34567      Mette      Johansen
45678      Ingrid     Aleksandersen

StudentTelephoneNr
StudentNr  TelephoneNr
12345      76543829
12345      90783298
34567      99456543
34567      45990234

Handling related information
Sometimes you will naturally see related information spread across the tables. This is a result of the modelling and normalisation processes. Extracting such data is done on the basis of a join, which raises the question of which type of join to use: inner, left, right or full? You may want to aggregate this information, or you may not. Your preservation strategy will be the deciding factor.
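
The choice of join type decides which rows end up in the extraction. The following sketch illustrates this with the Student and StudentTelephoneNr tables. It is only an illustration under assumptions: SQLite is used because it ships with Python, not because the slides prescribe it, and the column types are guesses. Only inner and left joins are shown, since right and full joins are not available in older SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (
        studentNr INTEGER PRIMARY KEY,
        firstname TEXT,
        surname   TEXT
    );
    CREATE TABLE StudentTelephoneNr (
        studentNr   INTEGER REFERENCES Student(studentNr),
        telephoneNr TEXT
    );
    INSERT INTO Student VALUES
        (12345, 'Jan', 'Karlson'),
        (23456, 'Pål', 'Solberg'),
        (34567, 'Mette', 'Johansen'),
        (45678, 'Ingrid', 'Aleksandersen');
    INSERT INTO StudentTelephoneNr VALUES
        (12345, '76543829'),
        (12345, '90783298'),
        (34567, '99456543'),
        (34567, '45990234');
""")

query = """
    SELECT s.studentNr, s.firstname, s.surname, t.telephoneNr
    FROM Student AS s
    {join} StudentTelephoneNr AS t ON t.studentNr = s.studentNr
    ORDER BY s.studentNr, t.telephoneNr
"""

# Inner join: only students that have at least one telephone number.
inner_rows = conn.execute(query.format(join="INNER JOIN")).fetchall()
# Left join: every student; telephoneNr is NULL (None) where none exists.
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()

inner_students = {row[0] for row in inner_rows}
left_students = {row[0] for row in left_rows}
print(len(inner_students), "students appear in the inner join")
print(len(left_students), "students appear in the left join")
```

With an inner join, students without a telephone number disappear from the extraction altogether, which is rarely what a preservation strategy intends. A left join from Student keeps them.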

Handling related information

Student
studentNr  firstname  surname
12345      Jan        Karlson
23456      Pål        Solberg
34567      Mette      Johansen
45678      Ingrid     Aleksandersen

StudentTelephoneNr
telephoneNr
76543829
90783298
99456543
45990234

students.xml
<?xml version="1.0" encoding="UTF-8"?>
<students>
  <student>
    <studentNr>12345</studentNr>
    <firstname>Jan</firstname>
    <surname>Karlson</surname>
    <studentTelephoneNr>
      <telephoneNr>76543829</telephoneNr>
      <telephoneNr>90783298</telephoneNr>
    </studentTelephoneNr>
  </student>
  <student>
    <studentNr>34567</studentNr>
    <firstname>Mette</firstname>
    <surname>Johansen</surname>
    <studentTelephoneNr>
      <telephoneNr>99456543</telephoneNr>
      <telephoneNr>45990234</telephoneNr>
    </studentTelephoneNr>
  </student>
</students>

Where is studentNr in StudentTelephoneNr? It is no longer stored with each telephone number; the relationship is expressed by nesting the numbers inside the <student> element they belong to.
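
An aggregated file like students.xml can be built from the result of a left join. The sketch below is a minimal illustration, not a definitive implementation: it assumes the joined rows are already available as tuples (None marks a student without a telephone number), and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Rows as a left join of Student and StudentTelephoneNr would return them.
joined_rows = [
    (12345, "Jan", "Karlson", "76543829"),
    (12345, "Jan", "Karlson", "90783298"),
    (23456, "Pål", "Solberg", None),
    (34567, "Mette", "Johansen", "99456543"),
    (34567, "Mette", "Johansen", "45990234"),
    (45678, "Ingrid", "Aleksandersen", None),
]

students = ET.Element("students")
student_by_nr = {}  # studentNr -> the <student> element already created

for student_nr, firstname, surname, telephone_nr in joined_rows:
    student = student_by_nr.get(student_nr)
    if student is None:
        # First row for this student: create the <student> element once.
        student = ET.SubElement(students, "student")
        ET.SubElement(student, "studentNr").text = str(student_nr)
        ET.SubElement(student, "firstname").text = firstname
        ET.SubElement(student, "surname").text = surname
        student_by_nr[student_nr] = student

    if telephone_nr is not None:
        # Compare with None explicitly: an Element without children is falsy.
        phones = student.find("studentTelephoneNr")
        if phones is None:
            phones = ET.SubElement(student, "studentTelephoneNr")
        ET.SubElement(phones, "telephoneNr").text = telephone_nr

ET.indent(students)  # Python 3.9+
print(ET.tostring(students, encoding="unicode"))
```

In this sketch a student without telephone numbers gets no <studentTelephoneNr> element at all. Emitting an empty element instead would be an equally valid design choice, as long as it is documented alongside the extraction.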

No clear answer
There is no clear answer as to which tables should be aggregated and which should not, or whether we should preserve each table only as its own XML file or attempt aggregations. Aggregating data undoes normalisation and can make the data slightly more difficult to load back into a database. However, the anomalies that normalisation solves (insertion, deletion and update anomalies) are not relevant from a long-term preservation perspective.