SAIL: Documenting data content and quality, letting the computer take the strain

Caroline Brooks
  Senior Research Analyst, College of Medicine, Swansea University

Ann Wrightson
  Lead Technical Design Architect, NHS Wales Informatics Service
  Hon. Research Associate, College of Medicine, Swansea University

Swansea Health Informatics Research & NWIS
Partners in establishing and sustaining SAIL
Wider collaboration in usability testing and innovation
  Sharing skills & thinking around secondary uses of data

Ideas and facts
General approaches in data research:
  People have ideas and test them using the available facts
  Ideas come from the available facts
But facts are not so easy to see in the data!
Researchers need help...
  Which data resources contain the facts I need?
  What do I need to know about this data to use it well?

What’s in this repository, anyway?
Dataset level: catalogue
  What/from where/from whom/how collected/rights to use
Record level: dataset entry description
  Data model (entity-relationship model)
Item level: field/attribute description
  Data types/ranges/controlled terms

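The three levels of description on this slide could be held in a small nested structure. The following is a minimal sketch only: the class names, attribute names, and the example dataset are invented for illustration and are not taken from the SAIL catalogue.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FieldDescription:
    """Item level: one field/attribute (type, range, controlled terms)."""
    name: str
    data_type: str
    allowed_range: Optional[Tuple[int, int]] = None
    controlled_terms: List[str] = field(default_factory=list)


@dataclass
class RecordDescription:
    """Record level: one entity in the entity-relationship model."""
    entity: str
    fields: List[FieldDescription]
    related_entities: List[str] = field(default_factory=list)


@dataclass
class DatasetEntry:
    """Dataset level: what, from where, how collected, rights to use."""
    title: str
    source: str
    collection_method: str
    usage_rights: str
    records: List[RecordDescription] = field(default_factory=list)

    def find_field(self, name):
        """Return (entity, field description) pairs for a given field name."""
        return [(r.entity, f) for r in self.records
                for f in r.fields if f.name == name]


# Hypothetical catalogue entry, invented for this example.
example = DatasetEntry(
    title="Example admissions dataset",
    source="Example provider",
    collection_method="Extract from an administrative system",
    usage_rights="Approved research projects only",
    records=[
        RecordDescription(
            entity="admission",
            fields=[
                FieldDescription("age", "integer", allowed_range=(0, 120)),
                FieldDescription("sex", "code", controlled_terms=["F", "M"]),
            ],
            related_entities=["patient"],
        )
    ],
)

matches = example.find_field("sex")
print(matches[0][0], matches[0][1].controlled_terms)
```

A researcher asking "which data resources contain the facts I need?" would search structures like this across all catalogue entries rather than a single one.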
How good is this data? What can it do for me?
Item
  Population of this field/attribute: Why present? Why absent?
  Significance of this field/attribute: What does it mean for me?
Record
  Evidential value of presence &/or absence of a particular record
Dataset
  What work has already been done with this data?

Work already done
SAIL databank website includes human readable dataset catalogue
  Description, source, related publications, data model
Data Quality report (developed by SAIL team in 2013)
  Standardized informative documentation for each dataset
  Produced by automated analysis of data, published as PDF
Working with Canadian colleagues (MCHP and Pop Data BC)
Technology refresh of SAIL platform (CIPHER project – )

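As an illustration of what "produced by automated analysis of data" can mean at the field level, here is a minimal sketch of a completeness profile. It is not the SAIL Data Quality report itself: the function names, the treatment of `None` and empty strings as missing, and the sample rows are all assumptions made for this example.

```python
from collections import Counter


def is_missing(value):
    """Assumption: None and the empty string both count as 'absent'."""
    return value is None or value == ""


def profile_field(rows, field_name):
    """Summarize how well one field is populated across all rows."""
    values = [row.get(field_name) for row in rows]
    populated = [v for v in values if not is_missing(v)]
    total = len(values)
    return {
        "field": field_name,
        "total": total,
        "populated": len(populated),
        "missing": total - len(populated),
        "completeness": len(populated) / total if total else 0.0,
        "distinct": len(set(populated)),
        "most_common": Counter(populated).most_common(3),
    }


def profile_dataset(rows):
    """Profile every field name that appears in at least one row."""
    field_names = sorted({name for row in rows for name in row})
    return [profile_field(rows, name) for name in field_names]


# Invented sample rows; a real run would read from the dataset itself.
rows = [
    {"id": 1, "sex": "F", "gp_code": "W001"},
    {"id": 2, "sex": "", "gp_code": "W001"},
    {"id": 3, "sex": "M", "gp_code": None},
    {"id": 4, "sex": "F"},
]

report = {p["field"]: p for p in profile_dataset(rows)}
for name, p in report.items():
    print(name, p["populated"], p["missing"], p["completeness"])
```

A profile like this answers "how much is present?" but not "why present, why absent?"; that interpretation still needs knowledge of how the data was collected.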
Work in progress
Machine-readable format for catalogue and data quality information
  Data Documentation Initiative (DDI) format
  Initial target: publish on website as download link in catalogue
Making outcomes of in-depth data quality work available for reuse
  Algorithms that instantiate clinical & social research concepts
  Evaluation of data coverage across populations of individuals
Knowledge sharing with NWIS data warehouse team

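To show the general idea of a machine-readable catalogue download, the sketch below serializes a catalogue entry and its data quality figures to XML. The element and attribute names (`dataset`, `source`, `fields`, `field`, `completeness`) are placeholders invented for this example and are not DDI vocabulary; a real export would follow the DDI schema.

```python
import xml.etree.ElementTree as ET


def catalogue_to_xml(entry):
    """Serialize one catalogue entry to an XML string.

    Element names are placeholders for illustration, not DDI elements.
    """
    root = ET.Element("dataset", attrib={"name": entry["name"]})
    ET.SubElement(root, "source").text = entry["source"]
    fields_el = ET.SubElement(root, "fields")
    for f in entry["fields"]:
        field_el = ET.SubElement(
            fields_el, "field",
            attrib={"name": f["name"], "type": f["type"]},
        )
        # Completeness is stored as a fraction with two decimal places.
        ET.SubElement(field_el, "completeness").text = f"{f['completeness']:.2f}"
    return ET.tostring(root, encoding="unicode")


# Invented catalogue entry; the figures are not real data quality results.
entry = {
    "name": "example_admissions",
    "source": "Example provider",
    "fields": [
        {"name": "sex", "type": "code", "completeness": 0.75},
        {"name": "gp_code", "type": "string", "completeness": 0.5},
    ],
}

xml_text = catalogue_to_xml(entry)
print(xml_text)

# A consumer can read the same information back without scraping a PDF.
parsed = ET.fromstring(xml_text)
sex_completeness = parsed.find("fields/field[@name='sex']/completeness").text
```

The point of the design is the round trip: whatever tool produces the human-readable report can emit the same facts in a form other software can parse.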
Future directions
Further work on characterizing concepts in data: reproducible, reusable
How to make good use of SNOMED CT in source data
  New knowledge & skills needed, also issues with old/new data
  NWIS also working on this, another good area for collaboration
More general use of knowledge models alongside data
  Comprehensive & integrated metadata reference architecture
  Data annotation, e.g. using biomedical science ontologies

Thank you for your attention