Published by Randolph Simon. Modified over 8 years ago.
1
3b. Data standards and capture methods www.dmtpsych.york.ac.uk
2
Some background
“The purpose of preserving things is to enable access to them at some unspecified date in the future, very probably for purposes not anticipated by the creators. Digital information technology is barely 60 years old, and all of the software from the earliest machines is already lost. It is now clear that this was material of the greatest historical significance, but history was far from the minds of those of us keen to develop the future. We should plan that our digital information will still be safe and accessible in 100 years”
David Holdsworth (2007) – member of the team trying to rescue data from the BBC Domesday project. More on that project later.
3
Data standards
Consistent data formats and the use of appropriate coding systems are key to effective electronic preservation
It’s no use preserving data if the file format it is preserved in will be unreadable in 20 years (or less)
Proprietary file formats produced by esoteric software might not be the way forward
A standard today might not be a standard tomorrow!
You may also be using custom-written software in which you define the file format yourself
Also, don’t forget that you could hold understandable data in a great format but store it on an unsuitable medium, e.g. CD-R, memory stick… remember floppy discs!
You need to put yourself in the mindset of someone who would want to access your data in 10 or 20 years’ time. This could even be you!
4
The BBC Domesday project
In 1986, to commemorate the 900th anniversary of the Domesday Book, the BBC ran a project to collect a picture of Britain in 1986, to do so using modern technology, and to preserve the information so as to withstand the ravages of time. This was done using a [BBC] micro computer coupled to a Philips LaserVision player, with the data stored on two 12" video disks. Software was included with the package, some on ROM and some held on the disks, which then gave an interactive interface to this data. The disks themselves are robust enough to last a long time, but the device to read them is much more fragile, and has long since been superseded as a commercial product. Here is a clear example where the preservation decisions placed (mis-placed) faith in the media technology of the day, and more crucially in the survival of the information technology practices of the time.
5
The BBC Domesday project
In 2002, there were great fears that the discs would become unreadable, as computers capable of reading the format had become rare and drives capable of accessing the discs even rarer. Aside from the difficulty of emulating the original code, a major issue was that the still images had been stored on the laserdisc as single-frame analogue video, which were overlaid by the computer system's graphical interface. The project had begun years before JPEG image compression and before truecolour computer video cards had become widely available.
Still not fully accessible, despite millions of pounds being thrown at the problem over the past 25 years!
A classic example of what can and does go wrong, even on projects that cost millions to produce in the first place and whose goal was long-term preservation.
Unfortunately, Adrian Pearce, the key developer recovering the data, died in 2008, which pretty much means all hope of recovery is lost.
This project also had input from professionals in data preservation and archiving at every stage!
The original 1086 Domesday Book is now largely online – go figure!
6
I’m scared! What standards should I follow?
In short, more than one, and the more basic and common the better, e.g. raw-text CSV is better than E-Prime E-DataAid files
Keep the original data along with any versions translated into newly emerging formats, e.g. MS Word 2 -> Word 4 -> Word 6 going forward...
Update to new storage media as they become available, e.g. floppy disk -> CD-R -> DVD-R… plus keep the originals
Bear in mind that companies go bust and take their software and file formats with them, e.g. WordStar, WordPerfect, Lotus 1-2-3… and companies are taken over and change direction, e.g. IBM SPSS!
Be sure to describe your data properly using metadata (data about data) so you or someone else can understand it!
You can fall under the proverbial bus, and so can all your data, so describe it fully
Printed copies aren’t all bad, but remember these can go in the bin if space is short
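The advice to move data onto new storage media while keeping the originals can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sha256_of` and `migrate` and the directory layout are hypothetical, and the checksum comparison is one common way to confirm that a copy matches its original, not something the slides prescribe.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(original: Path, new_media_dir: Path) -> Path:
    """Copy a file to a new location and verify the copy.

    The original is left untouched, in line with "keep the originals".
    """
    new_media_dir.mkdir(parents=True, exist_ok=True)
    copy = new_media_dir / original.name
    shutil.copy2(original, copy)  # copy2 also preserves timestamps
    if sha256_of(original) != sha256_of(copy):
        raise IOError(f"Copy of {original.name} does not match the original")
    return copy


if __name__ == "__main__":
    # Hypothetical demonstration using a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "results.csv"
        source.write_text("participant,ResT\n1,512\n")
        copied = migrate(source, Path(tmp) / "new_media")
        print("Verified copy at", copied)
```

Recording the checksum alongside the data would also let a future reader check that a file has not been silently corrupted.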
7
A worked example for psychologists
Data collected using E-Prime stored in E-DataAid format files; also save as:
–CSV files with variables described via metadata
–XLS MS Excel file
–SAV SPSS format file
Data stored on hard drive; also copy to:
–CD-R
–External hard drive
–Institutional or commercial repository, e.g. DSpace
Original E-Prime scripts and materials, e.g. WAV sound files; also convert to:
–MP3 (but be aware of possible quality loss)
–WMA (Windows Media Audio format)
–Ogg Vorbis
By doing this you are instantly making your data and materials more portable and eminently more sharable. It also increases your impact and the chances of your data staying around for much longer
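As a rough sketch of the "more than one basic format" idea in the worked example, the Python below writes the same trial records as both CSV and JSON text using only the standard library. The records, the column names other than `ResT`, and the function names are invented for illustration; exporting from E-DataAid itself is not shown, and JSON is simply an assumed second plain-text format.

```python
import csv
import io
import json

# Hypothetical trial records standing in for data exported from an experiment.
trials = [
    {"participant": 1, "trial": 1, "ResT": 512},
    {"participant": 1, "trial": 2, "ResT": 478},
    {"participant": 2, "trial": 1, "ResT": 601},
]


def to_csv(rows: list) -> str:
    """Serialise the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows: list) -> str:
    """Serialise the same records as indented JSON text."""
    return json.dumps(rows, indent=2)


if __name__ == "__main__":
    print(to_csv(trials))
    print(to_json(trials))
```

Note that CSV stores every value as text, so numeric types have to be restored when the file is read back; this is one more reason to describe the variables in accompanying metadata.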
8
Data capture - Metadata: data about data
Metadata helps researchers explain how their data is structured, e.g. variables/fields (structural metadata), who created it (descriptive metadata), and what the sharing and other rights are (administrative metadata)
All this is in addition to the actual data
A good example is the SPSS Data Dictionary, which can hold metadata for variables, e.g. ResT = “Response Time for time to press answer key in response to a visual stimulus image”
–If you left your data with only the variable name ResT, would you know what this meant in 10 years’ time?
The textual label is in essence the metadata, which describes both the variable and the data itself.
Which makes more sense?
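The variable-label idea can be sketched outside SPSS as a plain data dictionary. In the Python below, only the `ResT` label comes from the slide; the `Acc` column, the dictionary name and the helper `undocumented` are hypothetical, and this is not how SPSS stores its own dictionary.

```python
# A minimal data dictionary: variable name -> human-readable label.
data_dictionary = {
    "ResT": (
        "Response Time for time to press answer key "
        "in response to a visual stimulus image"
    ),
}


def undocumented(columns: list, dictionary: dict) -> list:
    """Return the column names that have no (non-empty) label."""
    return [
        name
        for name in columns
        if not dictionary.get(name, "").strip()
    ]


if __name__ == "__main__":
    # "Acc" is a hypothetical column with no label yet.
    columns_in_file = ["ResT", "Acc"]
    missing = undocumented(columns_in_file, data_dictionary)
    print("Columns still needing a label:", missing)
```

Running a check like this before archiving a data set is one way to make sure no cryptic variable name is left unexplained.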
9
Lab notebooks as metadata
A lab notebook is a primary record of research. Researchers use a lab notebook to document their hypotheses, experiments and initial analysis or interpretation of these experiments. The notebook serves as an organizational tool and a memory aid, and can also have a role in protecting any intellectual property that comes from the research.
The guidelines for lab notebooks vary widely between institutions and individual labs, but some guidelines are fairly common. The lab notebook is typically permanently bound and its pages are numbered. Entries are dated as a rule. All entries are made in permanent ink, e.g. with a ballpoint pen. The lab notebook is usually written as the experiments progress, rather than at a later date.
In many laboratories, it is the original place of record of data as well as any observations or insights. For data recorded by other means (e.g., on a computer), the lab notebook will record that the data was obtained and give the identification of the data set.
Many adhere to the concept that a lab notebook should be thought of as a diary of activities described in sufficient detail to allow another scientist to replicate the steps.
10
Pages 40-1 of Alexander Graham Bell's unpublished laboratory notebook (1875-76), describing first successful experiment with the telephone
11
Metadata example: simple Dublin Core
The Simple Dublin Core Metadata Element Set (DCMES) consists of 15 metadata elements (example of descriptive and administrative metadata in XML):
1. Title
2. Creator
3. Subject
4. Description
5. Publisher
6. Contributor
7. Date
8. Type
9. Format
10. Identifier
11. Source
12. Language
13. Relation
14. Coverage
15. Rights
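To show what such a record might look like in XML, here is a minimal Python sketch that builds one with the standard library. The wrapper element name `metadata`, the function name and all field values are assumptions made for illustration; only the 15 element names and the Dublin Core element namespace come from the standard, and real repositories typically generate this for you.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields: dict) -> str:
    """Build a simple Dublin Core record as an XML string.

    Rejects any field that is not one of the 15 DCMES elements.
    """
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"Not a simple Dublin Core element: {name}")
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # All values below are invented placeholders.
    record = dublin_core_record({
        "title": "Example reaction time study",
        "creator": "A. Researcher",
        "date": "2011",
        "format": "text/csv",
        "rights": "Contact the creator before reuse",
    })
    print(record)
```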
12
Metadata for the actual data
If you use good software and a recognised repository, creating metadata should be fairly straightforward. You shouldn’t be editing raw XML files!
13
Summary
Even the experts screw up! The BBC Domesday project, for example
Learn from their mistakes
Storage media change more rapidly than you think – replicate original media when new media supersede the old (during the crossover period)
File formats come and go as well – store your core data in 3 alternative formats where possible
Data is of little use without metadata to describe it
Metadata can be implicit (lab notebooks) or explicit (structured XML files)
There are different kinds of metadata:
–Structural
–Descriptive
–Administrative
Software helps take the pain out of metadata
14
Time to think about data
Small group exercises
What types of media are you storing your data on?
Are there other media that you could also store your work on?
What file formats are you using and can you think of 3 alternatives for each one?
Do you currently create metadata and in what form?
Could you understand your variable names in your data in 10 years? If not, why not?
15
Graphics acknowledgements
Slide 2, Old Hard Drive... circa 1982 (1) - flickr.com photo by: knowprose
Slide 3, Apollo Data Tape - gsfc
Slide 4, OSHUG #7 Domesday Project - flickr.com photo by: 37996583811@N01
Slide 5, OSHUG: BBC Micro 0 - flickr.com photo by: 37996583811@N01
Slide 6, Gorgeous tape! - flickr.com photo by: wildwoman
Slide 6, IBM System/360 Mainframe - flickr.com photo by: epitti
Slide 6, 8 inch floppy disk - flickr.com photo by: wlef70
Slide 6, Free Photos – Floppy Disk 1.44 Mb - FDD - flickr.com photo by: free-stock
Slide 6, Dead Media Society: Zip Disk - flickr.com photo by: thefrankfurtschool
Slide 6, CD Spindle - flickr.com photo by: eyebee
Slide 6, USB memory. - flickr.com photo by: mujitra
Slide 6, How small is MicroSD? - flickr.com photo by: privatenobby
Slide 7, lead by example - flickr.com photo by: monkeyc
Slide 8, metadata - flickr.com photo by: mmahaffie
Slide 9, lab notebook - flickr.com photo by: proteinbiochemist