Making your data lovely


Making your data lovely!
Prioritising, cleaning, extraction, transformation, automation

Tips for ensuring your data is usable
Adopt an approach of “data user and developer empathy”
Build capacity by building into BAU, with an initial focus that supports you
Consume your own data & APIs (apps, datavis, BI, etc)
Automation wherever possible!
Ensure you consider:
  Discoverability – is it hosted or linked on data.govt.nz?
  Quality – no one can use bad data, but perfect is the enemy of good
  Currency – is it up to date? How often is it updated?
  Machine readable or APIs – is it programmatically available?
  Publishing – have you provided supporting materials (taxonomies)?
  Reusability – have you tested it with data users?
  Licensing – Creative Commons By Attribution is a good default
  Streamlining access – are logins really necessary?

Data on the inside
Do you know what data you have internally? Are you considering all data types?
How embedded is data driven decision making?
How can you upskill the whole organisation? Data for management, policy, design, tech, etc
Do you know what your external data needs are?
How are you measuring and monitoring success?
Data infrastructure to support your organisation should be extendable to support sharing/publishing
Data by design!

Rub a dub data
If a machine can’t read it, a machine can’t make an API
Some data has specialised data formats, some commonalities
Tabular, spatial, real time, unstructured, etc
Most data comes from somewhere, use the source Luke!
Machines and humans have different needs

What you need is clean sheets
Don’t merge cells. Sorting and other manipulations people may want to apply to your data assume that each cell belongs to one row and column.
Don’t mix data and metadata (e.g. date of release, name of author) in the same sheet.
The first row of a data sheet should contain column headers. None of these headers should be duplicates or blank. The column header should clearly indicate which units are used in that column, where this makes sense.
The remaining rows should contain data, one datum per row.
Don’t include aggregate statistics such as TOTAL or AVERAGE. You can put aggregate statistics in a separate sheet, if they are important.
Numbers in cells should just be numbers. Don’t put commas in them, or stars after them, or anything else. If you need to add an annotation to some rows, use a separate column.
Use standard identifiers: e.g. identify countries using ISO 3166 codes rather than names.
Don’t use only colour or other stylistic cues to encode information. If you want to colour cells according to their value, use conditional formatting.
Leave the cell blank if a value is not available.
If you provide pivot tables, make sure the underlying data is available separately too.
If you also want to create a human-friendly presentation of the data, do so by creating another sheet in the same workbook and referencing the appropriate cells in the canonical data sheet.
http://www.clean-sheet.org/
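Some of the clean-sheet rules above can be checked by a machine before publishing. The following is a minimal sketch, not an official clean-sheet.org tool: the function name check_sheet, the list of aggregate labels, the pattern for "decorated" numbers and the sample data are all assumptions made for illustration. It only covers blank or duplicate headers, aggregate rows, and numbers with commas or trailing stars.

```python
import csv
import io
import re

# Hypothetical labels for aggregate rows; extend to suit your own data.
AGGREGATE_LABELS = {"total", "average", "sum", "mean"}
# A number with thousands separators or trailing stars, e.g. "1,234" or "42*".
DECORATED_NUMBER = re.compile(
    r"^\d{1,3}(,\d{3})+(\.\d+)?\*?$|^\d+(\.\d+)?\*+$"
)


def check_sheet(text):
    """Return a list of human-readable problems found in CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["sheet is empty"]
    problems = []
    header = [h.strip() for h in rows[0]]
    if any(h == "" for h in header):
        problems.append("blank column header")
    if len(set(header)) != len(header):
        problems.append("duplicate column header")
    # Data rows start on line 2 of the file.
    for line_no, row in enumerate(rows[1:], start=2):
        if row and row[0].strip().lower() in AGGREGATE_LABELS:
            problems.append(f"row {line_no}: aggregate row '{row[0].strip()}'")
        for value in row:
            if DECORATED_NUMBER.match(value.strip()):
                problems.append(
                    f"row {line_no}: decorated number '{value.strip()}'"
                )
    return problems


# Made-up sample data, for illustration only.
sample = 'country_code,visits_count\nNZ,"1,200"\nAU,3400\nTOTAL,4600\n'
for problem in check_sheet(sample):
    print(problem)
```

A check like this could run as part of a publishing script, so that a sheet with problems is flagged before it reaches the catalogue. Rules such as merged cells or colour-only encoding need the original workbook rather than a CSV export, so they are out of scope here.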

Automate your reporting
http://ckan.org/2015/09/18/pyramids-pipelines-and-a-can-of-sweave-ckan-asia-pacific-meetup/

Automating updates
Automation involves system to system updates to save you time & money. Three broad approaches:
1. Write scripts to push or pull data updates using an API directly from the source. Usually doesn’t require much data manipulation.
2. Adopt a tool like Taverna, FME or Splunk to extract, clean/manipulate, and then push data to the data.gov.au (CKAN/geoserver) API directly.
3. Use the data.govt.nz (CKAN) to schedule pull updates from your data, but most agencies don’t do that as they prefer to push updates.
Get at least one geek in your data team so you can experiment with code and tools to best meet your needs.
“With much help and encouragement from the support team at data.gov.au, we dipped our toes into the CKAN API waters. As a DotNet shop we were keen to limit the technology landscape and sought to automate the upload using DotNet. The CKAN API is refreshingly lightweight with a simple authentication process and messaging.” -- ABN Lookup Team
Code at https://github.com/datagovau/ckan-api-examples
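The first approach, a script that pushes updates through the API, might look roughly like the sketch below. It is written in Python rather than the DotNet used by the ABN Lookup Team, and it is not taken from the linked repository. It builds a request for CKAN's datastore_upsert action without sending it. The base URL, API key, resource id and record are placeholders, the helper name build_upsert_request is hypothetical, and whether a given portal has the DataStore enabled is an assumption to check with the portal's support team.

```python
import json
import urllib.request


def build_upsert_request(base_url, api_key, resource_id, records):
    """Build (but do not send) a CKAN datastore_upsert request."""
    payload = {
        "resource_id": resource_id,
        "records": records,
        # "insert" appends rows; "upsert" needs a unique key on the table.
        "method": "insert",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/3/action/datastore_upsert",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


# All values below are placeholders, not real endpoints or credentials.
request = build_upsert_request(
    base_url="https://ckan.example.org",
    api_key="YOUR-API-KEY",
    resource_id="00000000-0000-0000-0000-000000000000",
    records=[{"date": "2017-06-01", "value": 42}],
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then check the "success"
# field in the JSON reply.
```

Keeping request building separate from sending makes the script easy to test without touching a live portal. The same function can then be called from a scheduled job once the placeholders are replaced with real values.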

Quality – improve over time
Sir Tim Berners-Lee's 5 Star Data Quality standard is on beta.data.govt.nz for testing
Feedback welcome on basic technical quality framework
http://5stardata.info/en/
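As a rough illustration of how a 5 star rating could be estimated automatically, the sketch below maps a file format to a star level. It is not how beta.data.govt.nz computes its rating. The format table, the function name star_rating and the rule that a missing open licence scores zero are assumptions based on the published 5 star scheme (open licence, structured, non-proprietary, URIs, linked).

```python
# Assumed mapping from file format to the highest star level the format
# alone can justify. Unknown formats fall back to one star.
FORMAT_STARS = {
    "pdf": 1, "doc": 1, "jpg": 1,                   # on the web, any format
    "xls": 2, "xlsx": 2,                            # structured but proprietary
    "csv": 3, "json": 3, "xml": 3, "geojson": 3,    # structured and open
    "rdf": 4, "ttl": 4, "jsonld": 4,                # uses URIs to denote things
}


def star_rating(file_format, open_licence, links_to_other_data=False):
    """Estimate a 5 star open data rating from simple metadata."""
    if not open_licence:
        return 0  # every star level requires an open licence
    key = file_format.strip().lower().lstrip(".")
    stars = FORMAT_STARS.get(key, 1)
    # The fifth star depends on linking, which the format cannot show.
    if stars == 4 and links_to_other_data:
        stars = 5
    return stars


for fmt in ("PDF", "xlsx", "csv", "ttl"):
    print(fmt, star_rating(fmt, open_licence=True))
```

A format-based estimate like this is only a starting point for improving over time: it says nothing about whether the contents of the file are accurate or complete.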

Sensitive data integration & aggregation
Most of your data is not sensitive.
Challenging but great potential for improved policy/services.
Unit record sharing is complex, with privacy concerns for personal data. See great work by Stats (IDI) and upcoming SIU work.
Personal unit record data is mostly useful to researchers; access to such data needs appropriate mechanisms with legal, technical and ethical constraints.
Data aggregated by common spatial boundaries is comparable across datasets and over time. Unfortunately, data owners traditionally aggregate to boundaries that constantly change (electorates, postcodes, etc).
Anonymisation-on-the-fly APIs also provide a mechanism for appropriate public/agency access to unit record level data.
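To make the aggregation idea concrete, here is a minimal sketch that counts unit records per spatial area and hides small counts. It is an illustration only, not the method used by Stats NZ or any agency: the field name area_code, the threshold of 5 and the sample records are assumptions, and real confidentiality rules are set by the data custodian.

```python
from collections import Counter

# Hypothetical minimum cell size; real thresholds are a policy decision.
MIN_CELL_SIZE = 5


def aggregate_by_area(records, area_key="area_code",
                      min_cell_size=MIN_CELL_SIZE):
    """Count records per area, replacing small counts with None."""
    counts = Counter(record[area_key] for record in records)
    return {
        area: (count if count >= min_cell_size else None)
        for area, count in sorted(counts.items())
    }


# Made-up unit records: six in area A01, two in area A02.
records = (
    [{"area_code": "A01", "person_id": i} for i in range(6)]
    + [{"area_code": "A02", "person_id": i} for i in range(6, 8)]
)
print(aggregate_by_area(records))
```

An API built this way returns only the aggregated table, so callers never see the unit records themselves. Using a stable area code as the key is what keeps the output comparable across datasets.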

DEMO of beta.data.govt.nz
Consistent catalogue
Tabular data hosting
Automated APIs
Analytics & Reporting
Basic Charts & Mapping
Search
Metadata/DCAT Harvesting
Metadata mapping
Data.govt.nz as your catalog
Local cache of data
Publish data
Basic graphs and maps
API access to machine readable data
Basic graphs and maps
Stats and analytics
Broken links
“Openness” rating
Basic technical quality
Survey