1
Making your data lovely
Prioritising, cleaning, extraction, transformation, automation
2
Tips for ensuring your data is usable
Adopt an approach of “data user and developer empathy”
Build capacity by building it into BAU (business as usual), with an initial focus that supports you
Consume your own data & APIs (apps, datavis, BI, etc)
Automate wherever possible!
Ensure you consider:
  Discoverability – is it hosted or linked on data.govt.nz?
  Quality – no one can use bad data, but perfect is the enemy of good
  Currency – is it up to date? How often is it updated?
  Machine readable or APIs – is it programmatically available?
  Publishing – have you provided supporting materials (taxonomies)?
  Reusability – have you tested it with data users?
  Licensing – Creative Commons By Attribution is a good default
  Streamlining access – are logins really necessary?
3
Data on the inside
Do you know what data you have internally?
Are you considering all data types?
How embedded is data-driven decision making?
How can you upskill the whole organisation?
  Data for management, policy, design, tech, etc
Do you know what your external data needs are?
How are you measuring and monitoring success?
Data infrastructure to support your organisation should be extendable to support sharing/publishing
Data by design!
4
Rub a dub data
If a machine can’t read it, a machine can’t make an API
Some data has specialised data formats, some commonalities
  Tabular, spatial, real time, unstructured, etc
Most data comes from somewhere, use the source Luke!
Machines and humans have different needs
5
What you need is clean sheets
Don’t merge cells. Sorting and other manipulations people may want to apply to your data assume that each cell belongs to one row and column.
Don’t mix data and metadata (e.g. date of release, name of author) in the same sheet.
The first row of a data sheet should contain column headers. None of these headers should be duplicates or blank.
The column header should clearly indicate which units are used in that column, where this makes sense.
The remaining rows should contain data, one datum per row.
Don’t include aggregate statistics such as TOTAL or AVERAGE. You can put aggregate statistics in a separate sheet, if they are important.
Numbers in cells should just be numbers. Don’t put commas in them, or stars after them, or anything else. If you need to add an annotation to some rows, use a separate column.
Use standard identifiers: e.g. identify countries using ISO 3166 codes rather than names.
Don’t use only colour or other stylistic cues to encode information. If you want to colour cells according to their value, use conditional formatting.
Leave the cell blank if a value is not available.
If you provide pivot tables, make sure the underlying data is available separately too.
If you also want to create a human-friendly presentation of the data, do so by creating another sheet in the same workbook and referencing the appropriate cells in the canonical data sheet.
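Several of the clean-sheet rules above can be checked automatically before publishing. The following is a minimal sketch in Python, assuming the sheet has been exported to CSV. The function name check_sheet, the list of aggregate labels and the "decorated number" pattern are illustrative assumptions, not part of any published standard, and the check covers only headers, row width, aggregate rows and number formatting.

```python
import csv
import io
import re

# Assumed labels that usually mark an aggregate row; extend for your own data.
AGGREGATE_LABELS = {"total", "subtotal", "average", "mean", "sum"}

# Numbers with thousands separators ("1,234") or trailing stars ("56*").
DECORATED_NUMBER = re.compile(r"^\d{1,3}(,\d{3})+(\.\d+)?\*?$|^\d+(\.\d+)?\*+$")


def check_sheet(csv_text):
    """Return a list of human-readable problems found in a CSV export."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["sheet is empty"]

    problems = []
    headers = [h.strip() for h in rows[0]]
    if any(h == "" for h in headers):
        problems.append("blank column header")
    if len(set(headers)) != len(headers):
        problems.append("duplicate column header")

    # Data rows start on line 2 of the file.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(headers):
            problems.append(
                f"row {line_no}: expected {len(headers)} cells, found {len(row)}"
            )
        if any(cell.strip().lower() in AGGREGATE_LABELS for cell in row):
            problems.append(f"row {line_no}: looks like an aggregate row")
        for cell in row:
            if DECORATED_NUMBER.match(cell.strip()):
                problems.append(f"row {line_no}: decorated number {cell.strip()!r}")
    return problems
```

Running a check like this in the publishing pipeline catches the most common problems early; rules about colour, merged cells and pivot tables need to be checked in the spreadsheet itself.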
6
Automate your reporting
7
Automating updates
Automation involves system-to-system updates to save you time & money. Three broad approaches:
  Write scripts to push or pull data updates using an API directly from the source. Usually doesn’t require much data manipulation.
  Adopt a tool like Taverna, FME or Splunk to extract, clean/manipulate, and then push data to the data.gov.au (CKAN/geoserver) API directly.
  Use data.govt.nz (CKAN) to schedule pull updates from your data. Most agencies don’t do this, as they prefer to push updates.
Get at least one geek in your data team so you can experiment with code and tools to best meet your needs.
“With much help and encouragement from the support team at data.gov.au, we dipped our toes into the CKAN API waters. As a DotNet shop we were keen to limit the technology landscape and sought to automate the upload using DotNet. The CKAN API is refreshingly lightweight with a simple authentication process and messaging.” -- ABN Lookup Team
Code at
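As an illustration of the first approach (a script that pushes updates through the API), here is a minimal sketch in Python using only the standard library. It assumes the target portal exposes the CKAN action API with the DataStore extension enabled. The portal URL, API key, resource id and record fields are placeholders, and the helper names build_datastore_upsert and push are hypothetical.

```python
import json
import urllib.request


def build_datastore_upsert(ckan_url, api_key, resource_id, records, method="upsert"):
    """Build (but do not send) a POST request for CKAN's datastore_upsert action.

    method="upsert" assumes the DataStore resource was created with a primary key;
    use method="insert" to append rows instead.
    """
    payload = {"resource_id": resource_id, "records": records, "method": method}
    return urllib.request.Request(
        url=f"{ckan_url.rstrip('/')}/api/3/action/datastore_upsert",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


def push(request):
    """Send the request and return the 'result' part of CKAN's JSON reply."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    if not body.get("success"):
        raise RuntimeError(body.get("error"))
    return body["result"]
```

A scheduled job (cron, a CI pipeline or similar) could call it like this, with all values below being placeholders:

```python
request = build_datastore_upsert(
    "https://ckan.example.org",       # placeholder portal URL
    "YOUR-API-KEY",                   # placeholder credential
    "RESOURCE-ID",                    # placeholder DataStore resource id
    [{"id": 1, "value": 42}],         # placeholder rows from your source system
)
# push(request)  # uncomment once the placeholders are replaced
```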
8
Quality – improve over time
Sir Tim Berners-Lee's 5 Star Data Quality standard is on beta.data.govt.nz for testing
Feedback welcome on basic technical quality framework
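To make the star levels concrete, the sketch below estimates a rating in Python from a file's format and licence. It follows the published 5-star levels (1: on the web under an open licence, 2: structured data, 3: non-proprietary format, 4: uses URIs to identify things, 5: linked to other data), but the format table and the star_rating function are illustrative assumptions, not how beta.data.govt.nz calculates its rating.

```python
# Assumed mapping from file format to the highest level the format alone supports.
FORMAT_STARS = {
    "pdf": 1, "doc": 1, "docx": 1, "jpg": 1, "png": 1,   # on the web, not structured
    "xls": 2, "xlsx": 2,                                  # structured, proprietary
    "csv": 3, "json": 3, "xml": 3, "geojson": 3,          # structured, open format
    "rdf": 4, "ttl": 4, "jsonld": 4,                      # uses URIs to identify things
}


def star_rating(file_format, openly_licensed, links_to_other_data=False):
    """Estimate a 0-5 star rating; 0 means the data is not openly licensed."""
    if not openly_licensed:
        return 0
    # Unknown formats get 1 star: published openly, but structure is unverified.
    stars = FORMAT_STARS.get(file_format.lower().lstrip("."), 1)
    # The fifth star requires linking to other datasets, which format cannot show.
    if stars == 4 and links_to_other_data:
        stars = 5
    return stars
```

A format-only heuristic like this says nothing about accuracy or completeness, so it is best read as a rating of technical openness rather than of data quality overall.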
9
Sensitive data integration & aggregation
Most of your data is not sensitive.
Integrating sensitive data is challenging but has great potential for improved policy/services.
Unit record sharing is complex, with privacy concerns for personal data. See great work by Stats (IDI) and upcoming SIU work.
Personal unit record data is mostly useful to researchers. Access to such data needs appropriate mechanisms with legal, technical and ethical constraints.
Data aggregated by common spatial boundaries is comparable across datasets and over time. Unfortunately, data owners traditionally aggregate to boundaries that constantly change (electorates, postcodes, etc).
Anonymisation-on-the-fly APIs also provide a mechanism for appropriate public/agency access to unit record level data.
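The aggregation idea can be sketched in a few lines of Python: count unit records per area using a stable boundary code, and withhold any count that is too small to publish safely. The field name area_code, the function name aggregate_by_area and the threshold of 5 are assumptions for illustration only; real disclosure-control rules should come from your agency's privacy and statistical guidance.

```python
from collections import Counter


def aggregate_by_area(records, area_key="area_code", min_count=5):
    """Count records per area; counts below min_count are suppressed as None."""
    counts = Counter(record[area_key] for record in records)
    return {
        area: (count if count >= min_count else None)
        for area, count in sorted(counts.items())
    }
```

An API that applies this kind of step at request time returns only the aggregated, suppressed view, while the unit records stay inside the agency.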
10
DEMO of beta.data.govt.nz
Consistent catalogue
Tabular data hosting
Automated APIs
Analytics & Reporting
Basic Charts & Mapping
Search
Metadata/DCAT Harvesting
Metadata mapping
Data.govt.nz as your catalog
Local cache of data
Publish data
Basic graphs and maps
API access to machine readable data
Stats and analytics
Broken links
“Openness” rating
Basic technical quality
Survey