Published by Clarence Harrison. Modified over 9 years ago.
1
Get your hands dirty cleaning data. 2008 European EMu Users Meeting, 3rd June. - Elizabeth Bruton, Museum of the History of Science, Oxford elizabeth.bruton@mhs.ox.ac.uk
2
Outline ► Data Migration ► Problem -> Solution approach ► Tools ► Manual Data Cleaning ► Examples ► Current and Future Practices (Documentation, Policing, Review)
3
Data Migration ► First step towards better, cleaner data ► Steps: prepare and analyse legacy system; data mapping; KE EMu system design; data migration
4
Legacy System Analysis ► Prepare and analyse previous (legacy) system. Data: structure and relationships (tables and fields): primary, secondary, cross-reference. Documentation and usage. Redundant data.
5
Legacy Data analysis
6
Data Mapping
7
KE EMu system design ► Default and additional fields across different modules ► Field titles ► Screen Designer, e.g. Summary tab for ecatalogue module ► Finally, data migration
8
Data cleaning overview ► Problem -> solution approach: input data, operations, output data ► Manual or automated operations, or both? ► Which tools to use for automated operations? KE EMu tools: many powerful built-in tools within EMu. Non-KE EMu tools: scripts run on data exported from EMu, with the results reimported back into EMu. Both.
9
KE EMu Tools: Texql queries ► KE Texpress Texql queries; similar syntax to MySQL or SQL ► Uses: analysing data and data structure; analysing search queries; advanced search queries
10
KE EMu Tools: Global Replace ► Very useful, powerful but also potentially ‘dangerous’ tool ► Can use in combination with search query or list options within EMu ► Can use regular expressions and/or wildcard searches ► Powerful tool for single field or Field A->Field B operations
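Because a regular-expression replace across many records is hard to undo, it can help to preview the changes on exported data first. The following is a minimal Python sketch of that idea, not EMu's Global Replace itself; the function name, the record layout and the `place` field are assumptions made for illustration.

```python
import re

def preview_replace(records, field, pattern, replacement):
    """Return (record_id, old, new) for each record the replacement would change.

    Nothing is modified; this is a dry run over exported data.
    """
    regex = re.compile(pattern)
    changes = []
    for rec in records:
        old = rec.get(field, "")
        new = regex.sub(replacement, old)
        if new != old:
            changes.append((rec["id"], old, new))
    return changes

# Hypothetical exported records (field names are illustrative only).
records = [
    {"id": 1, "place": "Oxford ,  England"},
    {"id": 2, "place": "Paris, France"},
    {"id": 3, "place": "London ,England"},
]

# Normalise the spacing around commas.
changes = preview_replace(records, "place", r"\s*,\s*", ", ")
for rec_id, old, new in changes:
    print(f"{rec_id}: {old!r} -> {new!r}")
```

Reviewing the printed list before running the real operation shows which records would be touched and whether the pattern matches more than intended.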
11
KE EMu Tools: Record Merge ► Does what it says on the tin ► Merges one or more duplicate records into a single record ► Only ‘attachments’ to other modules are merged into the record, not data ► Ditto tool can be used to copy data easily from one record to another ► Attachments to the original duplicate record(s) are removed, so those records can be deleted
12
KE EMu Tools: Reports ► Tool to present information in assorted ways ► Can be used to produce reports but can also be used as data export tool ► Microsoft Excel or CSV format appropriate for more advanced data operations
13
Non-KE EMu Tools: Scripting ► Personally use PHP and MySQL ► Perl is also a useful scripting tool; used by KE ► Have written a CSV to MySQL file checker and converter in PHP ► Then run more advanced operations on data using PHP scripts ► phpMyAdmin can export data in many formats including CSV
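The speaker's checker was written in PHP and is not shown in the slides. As an illustration of what a CSV file check might involve, here is a minimal Python sketch; the column names (`irn`, `maker`, `place`) and the specific checks are assumptions, not a description of the original tool.

```python
import csv
import io

def check_csv(text, expected_columns):
    """Return a list of (line_number, message) problems found in CSV text."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != expected_columns:
        # Without the expected header the remaining checks are meaningless.
        problems.append((1, f"unexpected header: {header}"))
        return problems
    for row in reader:
        line_no = reader.line_num
        if len(row) != len(expected_columns):
            problems.append(
                (line_no, f"expected {len(expected_columns)} fields, got {len(row)}")
            )
        elif not row[0].strip():
            problems.append((line_no, "empty identifier"))
    return problems

# Hypothetical export with one short row and one row missing its identifier.
sample = (
    "irn,maker,place\n"
    "1,Smith & Jones,London\n"
    "2,Elliott Brothers\n"
    ",Unknown,Paris\n"
)
problems = check_csv(sample, ["irn", "maker", "place"])
for line_no, message in problems:
    print(f"line {line_no}: {message}")
```

Only once the problem list is empty would the rows be loaded into a database table for further operations.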
14
Non-KE EMu Tools: Scripting ► Systematic approach: keep copy of original data; produce data mapping or data cleaning document; perform operations using PHP file on MySQL table; check data produced (manually or automatically) and output logs; validate data in EMu and then import
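The steps above can be sketched roughly as follows. This is a Python illustration rather than the PHP/MySQL workflow used in the talk; the function names, record layout and `maker` field are assumptions. The original rows are left untouched, every change is logged, and a simple automatic check confirms that no records were lost.

```python
import copy

def clean_with_log(rows, field, operation):
    """Apply `operation` to `field` on a copy of `rows`; return (cleaned, log)."""
    cleaned = copy.deepcopy(rows)  # the original rows are never modified
    log = []
    for row in cleaned:
        before = row[field]
        after = operation(before)
        if after != before:
            row[field] = after
            log.append(
                {"id": row["id"], "field": field, "before": before, "after": after}
            )
    return cleaned, log

def same_ids(original, cleaned):
    """Automatic check: the cleaned data has the same records in the same order."""
    return [r["id"] for r in original] == [r["id"] for r in cleaned]

# Hypothetical rows; one maker name has stray whitespace.
original = [
    {"id": 1, "maker": "  Smith & Jones "},
    {"id": 2, "maker": "Elliott Brothers"},
]
cleaned, log = clean_with_log(original, "maker", str.strip)

for entry in log:
    print(entry)
print("record ids preserved:", same_ids(original, cleaned))
```

The log doubles as a record of what was done, which can feed the data cleaning document mentioned in the list.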
15
Manual Data Cleaning ► Some problems cannot be solved automatically, either partially or entirely ► They need to be ‘eyeballed’ by a person, preferably someone familiar with the museum’s collections
16
Example: Parties Records ► Legacy system used two systems of noting object ‘makers’: a free-text ‘Maker’ field with no centralised system (1:1 ratio), used for applicable records; and assigned makers with a centralised system, only used for the first 3,000 or so records ► Free-text data imported into EMu resulted in approximately 5,500 Parties records
17
Example: Parties Records ► Good example of mapping freetext field to more structured data field with 1:Many ratio ► KE ran script which ‘detected’ maker type and formatted accordingly, i.e. Maker Type etc ► But still much cleaning up to be done ► Two approaches: automatic then manual
18
Example: Parties Records ► Problem: Creation-related data within legacy system were all free-text fields ► The museum wanted to keep this data in some format as it contained valuable information, such as ambiguities or uncertainties ► e.g. Italy or France, Attributed to Smith & Jones, possibly last quarter of 19th century etc
19
Example: Parties Records ► This data did not fit neatly into defined, structured fields such as Parties, Places or Creation Date ► Also wanted to clean Parties records ► Solution: Automatic batch process then manual cleaning
20
Example: Parties Records – Automatic Approach ► Exported Creation data (Parties, Place, Creation Date) from EMu ► Ran script which checked for and removed duplicates in Parties and Place (note: this operation deleted rather than manipulated data, but was still an integral part of the data cleaning operation) ► Copied cleaned Parties, Place, Creation Data into single free-text field: Creation Notes ► Re-imported data into EMu using Import Tool
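The script used for this step is not shown in the slides. A minimal Python sketch of the same idea (remove duplicate values, then join what remains into one free-text note) might look like the following; the matching rule (ignore case and surrounding whitespace) and the layout of the resulting note are assumptions.

```python
def dedupe(values):
    """Drop repeated values, ignoring case and surrounding whitespace.

    The first spelling seen is kept, in its original order.
    """
    seen = set()
    result = []
    for value in values:
        key = value.strip().casefold()
        if key and key not in seen:
            seen.add(key)
            result.append(value.strip())
    return result

def creation_notes(parties, places, date):
    """Join the de-duplicated creation data into one free-text note."""
    parts = []
    for values in (parties, places, [date]):
        unique = dedupe(values)
        if unique:
            parts.append("; ".join(unique))
    return ". ".join(parts)

# Values taken from the kinds of examples quoted earlier in the talk.
note = creation_notes(
    ["Attributed to Smith & Jones", "attributed to Smith & Jones "],
    ["Italy or France", "Italy or France"],
    "possibly last quarter of 19th century",
)
print(note)
```

Keeping the qualifiers ("Attributed to", "possibly") in the note preserves the uncertainty that the structured fields cannot hold.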
21
Example: Parties Records – Automatic Approach ► Began data cleaning by running Global Replace operation within EMu eparties module, removing 'Signed by', 'Attributed to', or 'Made by' from the relevant Parties records ► Next: Manual Approach
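The replacement itself was done with EMu's Global Replace. For readers who want to test such a pattern on exported data first, here is a hedged Python equivalent; the exact regular expression is an assumption, as the slides only name the three phrases that were removed.

```python
import re

# Leading qualifiers named in the talk; matching is case-insensitive (an assumption).
PREFIXES = re.compile(r"^\s*(?:signed by|attributed to|made by)\s+", re.IGNORECASE)

def strip_attribution(name):
    """Remove a leading 'Signed by', 'Attributed to' or 'Made by' from a name."""
    return PREFIXES.sub("", name).strip()

names = [
    "Attributed to Smith & Jones",
    "Signed by J. Bird",
    "made by  Adams",
    "Elliott Brothers",
]
stripped = [strip_attribution(n) for n in names]
for before, after in zip(names, stripped):
    print(f"{before!r} -> {after!r}")
```

Anchoring the pattern at the start of the string avoids touching names that merely contain one of the phrases further along.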
22
Example: Parties Records – Manual Approach ► Cleaned records: checked Parties Type (Person or Organisation) and edited records (Surname, Forename, Organisation etc) ► Merged and deleted duplicate records ► Checked and deleted unattached Parties records
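The merging and deleting described here were done by hand, but a script can shorten the list a person has to review. The Python sketch below, which assumes a hypothetical export of Parties records and a set of record ids known to be attached to catalogue records, groups likely duplicates and lists unattached records for manual checking.

```python
from collections import defaultdict

def duplicate_groups(parties):
    """Group record ids whose names match after lowercasing and collapsing spaces."""
    groups = defaultdict(list)
    for party in parties:
        key = " ".join(party["name"].casefold().split())
        groups[key].append(party["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

def unattached(parties, attached_ids):
    """Return ids of Parties records that no other record refers to."""
    return [p["id"] for p in parties if p["id"] not in attached_ids]

# Hypothetical data: ids and names are illustrative only.
parties = [
    {"id": 10, "name": "Smith & Jones"},
    {"id": 11, "name": "smith &  jones"},
    {"id": 12, "name": "Elliott Brothers"},
    {"id": 13, "name": "J. Bird"},
]
attached_ids = {10, 12}

print("possible duplicates:", duplicate_groups(parties))
print("unattached:", unattached(parties, attached_ids))
```

The output is only a list of candidates; deciding whether two records really describe the same maker still needs someone familiar with the collections.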
23
Example: Parties Records – End Result ► Currently have 3,300 cleaner Parties records
24
Current and Future Practices ► Current: systematic approach to data cleaning, incorporated into monthly museum EMu Users' Meeting; Review ► In progress: Documentation ► Future: Policing
25
Conclusion ► Data cleaning and policing is an ongoing process for an institution of any size ► Data standards must be set and adhered to ► Needs to be approached and done in a systematic way ► Any questions?