Better with Data: A Case Study in Sourcing Linked Data into a Business Intelligence Analysis
Amin Chowdhury, Charles Boisvert, Matthew Love, Ian Ibbotson
Sourcing Linked Data into a Business Intelligence analysis
Can students apply more than one technology at a time?
- Early barriers prevent access to later work
- Limited time
- Need to measure performance
- Cocktail effect
We need carefully worked case studies
We use open data to look into the relationship between weather conditions and levels of air pollution.
This is a case using a range of practices:
- Finding and accessing open data
- Exploring Linked Data
- Sections of the Extract-Transform-Load (ETL) processes of data warehousing
- Building an analytic cube
- Applying data mining tools
Links are provided for the data sources and tools.
Our case study: air pollution kills
- An estimated 29,000 early deaths each year in the UK (Public Health England).
- Government targets exist for reducing the quantities and/or frequencies of the main pollutants.
- Local authorities monitor and publish pollution levels in their areas.
Sheffield City Council monitoring devices:
- Diffusion tubes
- Fully automated processing units
Measuring pollution: nitrogen dioxide diffusion tubes
Around 160 diffusion tube devices. Diffusion tubes:
- are spread throughout the city area
- have to be sent in for analysis
- yield data every six to eight weeks per tube
- are published as an aggregated annual level
Measuring pollution: 6 automated stations
- A.k.a. "Groundhogs"
- Fixed spots (sort of)
- Measure a variety of pollutants, plus temperature and air pressure (from "Groundhog 1")
- Frequent readings (several per hour), when it works
- Log is publicly available: a 15-year archive, with gaps
- Some post-editing: deletions, correction of outliers
Data is available
Sheffield City Council web sites:
- Air Quality: https://www.sheffield.gov.uk/environment/air-quality/monitoring.html
- Air Pollution Monitoring: http://sheffieldairquality.gen2training.co.uk/sheffield/index.html
Good things:
- Automated station results: we can select a range, choose a format (PostScript, CSV, Excel), and download.
- Data is human-readable (ish).
Is it open?
Open data: "the idea that certain data should be freely available to everyone to use and republish as they wish" (wikipedia.org/Open_data).
Like so much data sourced from the Internet, this is textual description, with no obvious way of automatically deriving further information.
E.g. Groundhog 1 is at "Orphanage Road, Firhill" – where is that? What is it like?
Is it open?
- Navigation is not designed for automation: the URL does not reflect the name of the Groundhog.
- On Sir Tim Berners-Lee's 5-star scale, this is 3/5: available, downloadable, and in an open format, but with no API and no automatic discovery.
- We want automated discovery by data-harvesting tools.
- Plus: how flexibly can users contribute to the data? What about the metadata (licensing, quality…)?
(Image: 5stardata.info)
Wanted: automated discovery and consumption
Linked Data: store everything as triples, e.g. (C Boisvert, office, 9327), (C Boisvert, tel, 1234), (C Boisvert, position, Senior Lecturer).
Rather than primary keys, use URIs: primary keys are unique in one table of one system, whereas URIs are unique world-wide.
Form "chains" from point to point through the graph database.
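The triple store and "chains" idea above can be sketched in a few lines of Python. This is an illustration only: the triples match the slide's example, but the URIs are hypothetical, not real identifiers from any actual system.

```python
# Illustrative sketch: Linked Data stored as (subject, predicate, object)
# triples, with URIs in place of primary keys. All URIs are invented.
PERSON = "http://example.org/staff/c-boisvert"
OFFICE = "http://example.org/rooms/9327"

triples = [
    (PERSON, "office", OFFICE),
    (PERSON, "tel", "1234"),
    (PERSON, "position", "Senior Lecturer"),
    # The office resource links on to further data: a 'chain' through the graph.
    (OFFICE, "locatedIn", "http://example.org/buildings/cantor"),
]

def objects(subject, predicate, graph):
    """Return all objects matching a partial (subject, predicate, ?) triple."""
    return [o for s, p, o in graph if s == subject and p == predicate]

# Follow a chain: person -> office -> building.
office = objects(PERSON, "office", triples)[0]
building = objects(office, "locatedIn", triples)[0]
```

Because every node is a URI, the chain could in principle continue into someone else's dataset, which is the point of Linked Data over local primary keys.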
Air Quality+: Linked Data for Sheffield pollution
https://github.com/BetterWithDataSociety
A database of Sheffield pollution measurements as linked data:
- Groundhogs have their own URIs.
- Diverse measures, e.g. NO2, SO2, micro-particles (e.g. diesel fumes), air pressure, air temperature.
- Measurements are archived in the database as triples.
- The ontology allows all but literal values to be further investigated, for instance to find out more about the NO2 compound.
- Allows machine discovery to add context to data, e.g. the type of neighbourhood of each of the Groundhog sites.
AQ+ linked data
SPARQL
To query the Subject / Predicate / Object triples in the database, we use the SPARQL query language.
- Specify a partial triple to return all records that fit that context.
- Filter, e.g. to return values within a selected date range.
- Discover programmatically what Groundhogs there are, what pollutants each monitors, and the readings of those pollutants.
The AQ+ endpoint offers multiple result formats, e.g. CSV, JSON, XML.
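A query of the kind described above, returning readings within a date range, might look like the following sketch. The `aq:` prefix and predicate names are hypothetical stand-ins, since the real AQ+ ontology terms are not shown here.

```python
# Sketch of building a date-filtered SPARQL query. The aq: namespace and
# predicates (hasReading, pollutant, value, time) are invented examples,
# not the actual AQ+ vocabulary.
def readings_query(start, end):
    return f"""
PREFIX aq: <http://example.org/airquality#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?groundhog ?pollutant ?value ?time WHERE {{
  ?groundhog aq:hasReading ?reading .
  ?reading aq:pollutant ?pollutant ;
           aq:value ?value ;
           aq:time ?time .
  FILTER (?time >= "{start}"^^xsd:dateTime && ?time <= "{end}"^^xsd:dateTime)
}}
ORDER BY ?time
"""

query = readings_query("2015-01-01T00:00:00", "2015-01-31T23:59:59")
```

Leaving `?groundhog` and `?pollutant` unbound is what makes the partial-triple pattern work: the endpoint returns every station and pollutant that fits, which is exactly the programmatic discovery the slide describes.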
SPARQL editor: boisvert.me.uk/opendata/sparql_aq+.html
- Hourly readings from all available Groundhogs between selected dates
- Editing with SPARQL syntax highlighting
- Queries interpreted on the AQ+ endpoint
Further data sources
A lucky find: a local enthusiast's weather station, with readings at five-minute intervals – in PDF format, 200 pages per month! Converted with Bytescout PDF-to-CSV.
This gives added context to facts, through dimension descriptors added from other sources:
- From Groundhog 1: temperature and air pressure.
- But no data on other factors – wind strength and direction, humidity. Surely these influence pollution formation and/or dispersal?
- We need detailed historic weather data; it is not cheap, and licensing rights to this data have not been decided in general.
- We asked permission to use the data for study purposes (any commercial use of the data could cause the site to be closed).
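Joining five-minute weather readings to hourly Groundhog facts is a nearest-timestamp match. The sketch below shows the idea with invented column names and sample values; the real converted CSV's layout is not shown in the slides.

```python
import csv
import io
from datetime import datetime

# Sketch: attach weather context to a pollution reading by finding the
# weather-station row closest in time. Column names and values are
# invented for illustration.
weather_csv = """time,wind_speed,humidity
2015-06-01 11:55,3.2,61
2015-06-01 12:00,3.5,60
2015-06-01 12:05,3.1,62
"""

weather = [
    (datetime.strptime(row["time"], "%Y-%m-%d %H:%M"), row)
    for row in csv.DictReader(io.StringIO(weather_csv))
]

def nearest_weather(reading_time):
    """Return the weather row closest in time to a pollution reading."""
    return min(weather, key=lambda w: abs(w[0] - reading_time))[1]

# A Groundhog reading at 12:01 picks up the 12:00 weather row as context.
ctx = nearest_weather(datetime(2015, 6, 1, 12, 1))
```

In the warehouse, these matched values would populate a weather dimension alongside each fact row, which is the "dimension descriptors added from other sources" step above.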
Integration of Further Data Sources
- Microsoft SQL Server data warehouse
- ETL processes
- Data cube from the star schema
- Business Intelligence with MS Analysis Services
- Data mining
Data Warehouse
Creation of the data cube from the star schema
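The cube-building step can be illustrated in miniature: aggregate a star-schema-style fact table over its dimension keys. This is a toy sketch with invented sample rows, not the actual Analysis Services cube.

```python
from collections import defaultdict

# Minimal sketch of cube aggregation: average pollutant value grouped
# by (station, pollutant, month). Fact rows are invented sample data.
facts = [
    ("groundhog1", "NO2", "2015-06", 41.0),
    ("groundhog1", "NO2", "2015-06", 39.0),
    ("groundhog1", "SO2", "2015-06", 5.0),
]

sums = defaultdict(lambda: [0.0, 0])
for station, pollutant, month, value in facts:
    cell = sums[(station, pollutant, month)]
    cell[0] += value
    cell[1] += 1

# Each cube cell holds the mean reading for its dimension combination.
cube = {key: total / count for key, (total, count) in sums.items()}
```

A real cube pre-computes such aggregates across many dimension combinations so that slicing (e.g. NO2 by month by station) is fast at query time.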
Analysis and PowerPivot Exporting
Self Service Data Exploration
Data Mining
Ranked by probability
Comparison of properties of cluster 9
Cluster Data Mining Tool
Decision Trees
Teaching resources: http://aces.shu.ac.uk/teaching.cmsrml/AirQuality
Questions?