How to Handle and Analyse Large Datasets BENVGEE7 'Methods of Environmental Analysis' Ed Sharp 21st February 2012.


1 How to Handle and Analyse Large Datasets BENVGEE7 'Methods of Environmental Analysis' Ed Sharp 21st February 2012

2 Introduction Me: BSc Geography; worked at SABSCO Ltd, a niche power station construction contractor; MSc GIS; MRes Energy Demand Studies; PhD: The spatiotemporal patterns of energy demand and supply in the UK. Recent interest and research into large datasets, including a major piece of research into the effects of disparate, inaccurate datasets on energy demand forecast models. Email: ucesres@ucl.ac.uk Web: LinkedIn: http://www.linkedin.com/pub/ed-sharp/43/2b4/b1b UCL: http://www.bartlett.ucl.ac.uk/energy/people/students/ed-sharp LoLo: http://www.lolo.ac.uk/profilepreview/view/id/102

3 Today's Lecture Three distinct sections 1. Theory: Describe how to handle and analyse large datasets 2. Practice: Run an exercise outlining some pervasive issues 3. Case Study: Demonstrate these within the context of some existing research Slides are available on Moodle with web and literature references in full; colour denotes section.

4 Part 1: What is a large dataset? Two types: Large volumes of data –Millions of entries –Many Terabytes –Computationally intensive –Past 10 years x 1m Varied sources of data –Same variables –Different sources –Separate set of issues causing problems with handling and analysis Some issues are common to both types; others are specific to one

5 Examples Volumes –Census (http://census.ac.uk/) –Home Energy Efficiency Database (HEED, http://www.energysavingtrust.org.uk/Professional-resources/Existing-Housing/Homes-Energy-Efficiency-Database) –Time series datasets e.g. energy production/consumption –Remotely sensed data –Geographic datasets –Climate reanalyses Sources –Population –Economic variables (GDP, GVA etc.) –Socio-demographic variables (Population, Employment etc.)

6 Sources including repositories and search engines: Data.gov: www.data.gov.uk; GoGeo: www.gogeo.ac.uk; ShareGeo: www.sharegeo.ac.uk; Eurostat: http://epp.eurostat.ec.europa.eu/portal/page/portal/eurostat/home/; IEA: www.iea.org; National Statistics: www.statistics.gov.uk; Odyssee: http://www.odyssee-indicators.org/; OECD: www.oecd.org; UNECE: www.unece.org; World Bank: www.worldbank.org; ADS, Archaeology Data Service: archaeologydataservice.ac.uk; BADC, British Atmospheric Data Centre: badc.nerc.ac.uk; BODC (Oceanographic): www.bodc.ac.uk; CDS, Chemical Database Service: cds.dl.ac.uk; EBI, European Bioinformatics Institute: www.ebi.ac.uk; ESDS, Economic and Social Data Service: www.esds.ac.uk; NCDR, National Cancer Data Repository: www.ncin.org; NGDC, National Geo-science Data Centre: www.ngdc.noaa.gov; UKSSDC, UK Solar System Data Centre: www.ukssdc.ac.uk; Office for National Statistics: www.ons.gov.uk; UK Data Archive (UKDA): www.data-archive.ac.uk; Casweb (census): casweb.mimas.ac.uk; DFT: www.dft.gov.uk; EEA: www.eea.europe.eu; World Energy Council: www.worldenergy.org; Florida Solar Energy Centre: www.fsec.ucf.edu/; EDINA: edina.ac.uk; Mapcruzin: www.mapcruzin.com; Guardian datastore: www.guardian.co.uk/data; London Air Quality Network: www.londonair.org.uk; OpenStreetMap: www.openstreetmap.org; UK Borders: edina.ac.uk/ukborders; Met Office: www.metoffice.gov.uk; DECC: www.decc.gov.uk; etc. Highlighted examples should be the most relevant to EDE

7 Has anyone used “large datasets” before? 1.Yes 2.No

8 Does anyone think they will use it in the future? 1.Yes 2.No 3.Don’t know

9 Likely encounters Access is predominantly through the web Some may require sign in through university Fees sometimes waived for academic use (always worth asking) Verify Copyright and Licensing Used in –Research –Modelling –Pervasive in the environmental domain –Property –Finance Volume and complexity are increasing (e.g. Facebook, Flickr) McKinsey concluded that the analysis of this kind of dataset will become increasingly important in influencing business decisions, so skills in this area will be valuable McKinsey: “Big data: The next frontier for innovation, competition, and productivity” Available from: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation

10 Storage: Very large datasets require their own servers, especially those which require security e.g. HEED and OpenStreetMap Parallel storage allows download simultaneously with simulation, visualisation and analysis Hardware development means all but the very biggest can be stored and transported on portable hard drives Most can be downloaded via the internet or in special cases requested on a CD (e.g. Ordnance Survey Mastermap) Effective backup is necessary especially once analysis begins Bespoke data architecture exists (e.g. financial databases) This requires knowledge of primarily SQL Most data that you encounter will be accessible through some sort of graphical interface –Example on next slide

11 Graphical interface SQL script
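The original slide showed a graphical interface alongside a SQL script. As a rough stand-in for the script side, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and figures are hypothetical and are not taken from any dataset mentioned in the lecture.

```python
import sqlite3

# Hypothetical table and figures, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demand (region TEXT, year INTEGER, gwh REAL)")
conn.executemany(
    "INSERT INTO demand VALUES (?, ?, ?)",
    [
        ("London", 2010, 41000.0),
        ("London", 2011, 42000.0),
        ("Wales", 2010, 16000.0),
        ("Wales", 2011, 15500.0),
    ],
)

# Total demand per region: the kind of query a graphical interface
# would typically build behind the scenes.
totals = conn.execute(
    "SELECT region, SUM(gwh) FROM demand GROUP BY region ORDER BY region"
).fetchall()
print(totals)
conn.close()
```

A real database server would need a different connection call, but the SELECT statement itself would look much the same.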

12 Software and data format Use whatever you are comfortable with Excel OK for majority of operations, good graphically –Limited to 1 million rows and 16384 columns (beware when importing data) For larger datasets or more sophisticated operations consider a statistical package –SAS very good for large datasets but requires programming skill –SPSS almost as powerful with a better interface Works well in conjunction with Field (2009) Microsoft Access allows handling of large complicated databases All of these available through cluster machines or for home use from http://www.ucl.ac.uk/isd/common/software Alternatives include: R, Mathematica, Statistica and Rapidminer Formats Excel (.xls, .xlsx) Access (.mdb, .dbf) SAS and SPSS have proprietary formats but can be exported to Excel A common format used for exchange is comma separated (.csv, .txt) Others include: XML (machine readable), CDF (NASA), NeXus, OpenMath, PDS, SAIF, SDTS, VICAR etc. (these require some kind of specialist knowledge) Field, A. P. 2009. Discovering statistics using SPSS, SAGE publications Ltd.
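One way to heed the warning about importing into Excel is to count rows and columns before opening the file. The sketch below streams a comma separated file once and compares it with the .xlsx worksheet limits (1,048,576 rows by 16,384 columns). The function name and the sample data are hypothetical.

```python
import csv
import io

# Worksheet limits for the .xlsx format (Excel 2007 and later).
EXCEL_MAX_ROWS = 1_048_576
EXCEL_MAX_COLS = 16_384


def fits_in_excel(csv_file):
    """Stream a CSV once and return (rows, widest_row, fits)."""
    rows = 0
    widest = 0
    for record in csv.reader(csv_file):
        rows += 1
        widest = max(widest, len(record))
    fits = rows <= EXCEL_MAX_ROWS and widest <= EXCEL_MAX_COLS
    return rows, widest, fits


# Hypothetical in-memory sample; a real file would be opened with
# open(path, newline="") instead.
sample = io.StringIO("region,year,demand_gwh\nLondon,2010,41000\nWales,2010,16000\n")
print(fits_in_excel(sample))
```

Because the file is read one record at a time, this works even when the file is far too large to load into a spreadsheet.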

13 Data Handling: First steps 1.Metadata –Data about data –Attached in different ways –Varies in form and content –Should follow standards e.g. INSPIRE http://inspire.jrc.ec.europa.eu/ 2.Identify methods of collection –Are these uniform across data sources? –May require reading supporting documentation 3.Identify contributors –Are they reliable? 4.Identify alternative sources –Case study will show that divergence is possible

14 Data Handling: Second steps 5.Identify data gaps –First do this visually –Genuine gaps should not skew subsequent analysis –If a gap has been replaced by, for example, NULL or 0.0, it may cause problems and should be investigated –If several datasets are used the treatment of gaps should be harmonised –Follow a convention that is obvious to you and acceptable to the software 6.Identify duplicates –More than one value for a data point –Possibly valid –E.g. shortened labels falsely group values
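Steps 5 and 6 can be partly automated. The following is a minimal sketch, assuming the data has been read into a list of dictionaries. The list of gap markers, the field names and the records are all hypothetical.

```python
from collections import Counter

# Strings that often stand in for a missing value (an assumed list;
# adjust it to the conventions of your own data).
GAP_MARKERS = {"", "NULL", "NA", "N/A", "-"}


def is_gap(value):
    """True if the value is absent or a known placeholder."""
    return value is None or str(value).strip().upper() in GAP_MARKERS


def find_gaps(records, field):
    """Return the indexes of records whose field is a gap."""
    return [i for i, rec in enumerate(records) if is_gap(rec.get(field))]


def find_duplicates(records, key_fields):
    """Return the keys that identify more than one record."""
    counts = Counter(tuple(rec[k] for k in key_fields) for rec in records)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical records. Note that "0.0" is deliberately NOT treated as a
# gap automatically: it may be a real value and needs a manual look.
records = [
    {"region": "London", "year": 2010, "gwh": "41000"},
    {"region": "London", "year": 2010, "gwh": "41250"},
    {"region": "Wales", "year": 2010, "gwh": "NULL"},
    {"region": "Wales", "year": 2011, "gwh": "0.0"},
]
print(find_gaps(records, "gwh"))
print(find_duplicates(records, ["region", "year"]))
```

Flagged rows should be inspected rather than deleted outright, since duplicates are possibly valid.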

15 Data Handling: Second steps continued… 7.Note precision –Data should be stored at a reasonable precision –For example: Beware of the dataset that tries to depict population to the nearest person –Harmonise between datasets –Can affect comparability to other data 8.Identify spurious data –Many rows and columns may not be needed –Discard to make analysis simple –Note changes –Keep copies of original 9.Harmonise headings –Ensure that they make sense to you and the software
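Steps 7 and 9 lend themselves to small helper functions. This is a sketch only: the heading convention (lower case with underscores) and the rounding step are assumptions, not a standard the lecture prescribes.

```python
import re


def harmonise_heading(raw):
    """Lower-case a heading and replace runs of other characters with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", raw.strip().lower()).strip("_")


def round_to(value, step):
    """Round a value to the nearest multiple of step."""
    return round(value / step) * step


# Hypothetical headings from two sources describing the same variable.
print(harmonise_heading(" Floor Area (m2) "))
print(harmonise_heading("floor-area m2"))

# A population figure quoted to the nearest person, stored to the
# nearest thousand instead (hypothetical figure).
print(round_to(8173941, 1000))
```

Applying the same heading function to every source makes columns from different datasets line up without manual renaming.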

16 Graphical representation and statistical analysis The above steps can be carried out by looking through the data However techniques exist to automate them and therefore reduce time The first step in any analysis should be to create graphs These can reveal patterns alongside highlighting duplicates, gaps and errors After this is done it may be useful to clean your data again Excel is fine but more complex and repeatable operations are available with other software and some programming

17 Some examples….. A simple graph Tufte (1983) and McCandless (2009)

18 Something more complex

19 Some better looking examples

20

21 Statistical tests Another automated analysis technique is statistical testing Simple metrics such as mean, median, mode and standard deviation are useful, as is looking at the distribution These can be combined in a box plot, conveying statistics graphically The t test is also useful More sophisticated analysis is available through e.g. SPSS, GIS
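The simple metrics on this slide are available in Python's standard statistics module. The sketch below also computes a t statistic. It uses Welch's form (which does not assume equal variances), which is one variant of the t test and an assumption here, not necessarily the one intended in the lecture. The sample values are hypothetical.

```python
import math
import statistics


def summarise(values):
    """Return the basic descriptive statistics for a sample."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Hypothetical samples.
print(summarise([2, 4, 4, 6]))
print(welch_t([2, 4, 6], [1, 2, 3]))
```

Turning the t statistic into a p-value needs the t distribution, which a package such as SPSS or R provides directly.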

22 Advanced analysis, simulation and visualisation These methods vary based on purpose and available data If you have purely statistical intentions then something like SPSS or SAS is ideal, especially in conjunction with Field (2009) A multitude of tests exist which will suit your needs; beware that these depend on data type, collection etc. The internet along with books and lecturers are a good source for deciding which to choose GIS is a good option for visualisation, provided that you have spatially related data Some examples of output that I have produced are on the next slide; again there is an abundance of web and literature resources

23 GIS

24 Part 2: Exercise Attempt to calculate the floor area of central house (this building) in pairs Stay in the room but use whatever techniques you have at your disposal No use of the internet (it will be obvious) Write your answer down on a piece of paper 10 minutes Be prepared to answer some questions using the poll system We will declare a floor area champion at the end

25 What units did you use? 1.Acres 2.Hectares 3.Square Mile 4.Square Kilometre 5.Square Metre 6.Square foot

26 Why? Although the standard is m2 you should not assume that data you are given uses this standard Always check the metadata to ensure that it has been done correctly Remember that Americans often do not use the metric system, and a large volume of data originates from the US Other units could well be correct but ensure that you use the data properly
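Once the metadata has told you which unit a source uses, converting everything to square metres is mechanical. A minimal sketch follows. The unit labels are hypothetical names chosen for this example, while the factors follow from the international foot (0.3048 m) and mile.

```python
# Square metres per unit. The imperial factors are exact by definition.
M2_PER_UNIT = {
    "m2": 1.0,
    "km2": 1_000_000.0,
    "hectare": 10_000.0,
    "sqft": 0.09290304,         # 0.3048 m squared
    "acre": 4046.8564224,       # 43,560 square feet
    "mile2": 2_589_988.110336,  # 640 acres
}


def to_m2(value, unit):
    """Convert an area to square metres; an unknown unit raises KeyError."""
    return value * M2_PER_UNIT[unit]


# Hypothetical inputs in mixed units.
for value, unit in [(1, "acre"), (2.5, "hectare"), (10_000, "sqft")]:
    print(value, unit, "=", round(to_m2(value, unit), 2), "m2")
```

Letting an unrecognised unit raise an error is deliberate: a silent default would hide exactly the kind of assumption this slide warns about.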

27 Did you include the basement in your calculations? 1.Yes 2.No

28 Why? Floor area can be defined as usable area; in this case the basement is used, but someone creating a larger database would not have this information This can cause divergence between real data and that which you are provided with Check the metadata, and if necessary check at source

29 Did you attempt to subtract the floor area of interior walls? 1.Yes 2.No

30 Why? Alongside different ways of defining floor area (semantics) there are different ways of calculating it It is possible a dataset may have been formed from an Ordnance Survey outline, which would include interior walls, or from a building survey, which would not Neither is wrong but transparency is essential

31 How many floors did you allow for? 1.3 2.4 3.5 4.6 5.7 6.8 7.9 8.More

32 Why? The correct number is eight but this may not be clear from plans Is the basement included in this?

33 Did you allow for the light well in the centre of the building? 1.Yes 2.No

34 Why? One method of calculating this would be to work out the area of the bottom floor and multiply it by the number of floors If you were unaware of the gap this may skew the result This type of error is common not only in floor area calculation but in others that you may come across It is important to investigate and understand these sources of error
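The effect of missing the light well can be shown with a small calculation. The number of floors (eight) comes from the earlier slide. The footprint and light well areas are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def gross_floor_area(footprint_m2, floors, void_m2=0.0):
    """Floor area as (footprint minus any full-height void) times floors."""
    return (footprint_m2 - void_m2) * floors


FLOORS = 8            # from the exercise
FOOTPRINT_M2 = 500.0  # hypothetical
LIGHT_WELL_M2 = 40.0  # hypothetical

naive = gross_floor_area(FOOTPRINT_M2, FLOORS)
adjusted = gross_floor_area(FOOTPRINT_M2, FLOORS, LIGHT_WELL_M2)
overestimate = (naive - adjusted) / adjusted

print(naive, adjusted, f"{overestimate:.1%}")
```

With these assumed figures, ignoring the light well inflates the result by just under 9%, and the same proportional error would carry through to anything scaled from it.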

35 What was your final answer in metres squared? 1.0 – 750 2.750 – 1500 3.1500 – 2250 4.2250 – 3000 5.3000 – 3500 6.3500 – 4000 7.4000 – 4500 8. 4500 – 5000 9.More

36 Conclusion: The “Real” answer was 3,658m2 –39,376 sqft, –0.003658km2, –0.903949 Acre, –0.365815 hectare, –0.001412 mile2 Interestingly there is no DEC here so the figure is off the internet Different ways of defining the floor area have been used here as is the case for real datasets The reality is that the data you have created is probably as good an estimation of the floor area as is available publicly Errors would be multiplied if applied to, for example, the whole country, which is “a large dataset”

37 Data Sources (UK only) Part 3: Research Case study: Assessing the availability and quality of data for tertiary sector energy demand forecast models Large number of separate datasets Divergence responsible for error of up to 100%
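To make "divergence" concrete, one simple measure is the spread between the largest and smallest reported value, relative to the smallest. This definition is an assumption made for illustration, since the slide does not say how the error figure was calculated, and the values below are hypothetical.

```python
def relative_divergence(values):
    """Spread of the values as a fraction of the smallest one."""
    lowest, highest = min(values), max(values)
    return (highest - lowest) / lowest


# Hypothetical floor space estimates for the same sector from three
# sources, in the same unit.
estimates = [100.0, 140.0, 200.0]
print(f"{relative_divergence(estimates):.0%}")
```

Under this measure, one source reporting double another's figure gives a divergence of 100%.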

38 Results – Classification schemes NACE (Tertiary) ISIC (Commercial) Wholesale & Retail Trade; repair of motor vehicles and motorcycles Wholesale and Retail Trade; Repair of Motor Vehicles, Motorcycles and Personal and Household Goods Accommodation and food service activities Hotels and Restaurants Financial, insurance and real estate activities Real Estate, Renting and Business Activities Administrative and support service activities Post and telecommunication, Financial Intermediation Education Human health and social work activities Health Other NACE activities Miscellaneous Public administration and defence Agriculture, Forestry and Fishery (as separate sub sectors) NACE: Nomenclature statistique des Activités économiques dans la Communauté Européenne (Eurostat, 2008) ISIC: United Nations International Standard Industrial Classifications (UNIDO, 2010)

39 Results - Floor space in the sector Entire Non- domestic stock “Tertiary sector” Questionable Difference “Tertiary sector” All Commercial and Public buildings

40 Results - Energy consumption in the sector Values from the ISIC scheme Values from the NACE scheme Declining Range

41 Results - Population

42 Results - Employee numbers in the sector Values from the ISIC scheme Values from the NACE scheme Declining Range Same patterns as seen with the energy consumption data

43 Results - Gross Domestic Product Clearly wrong (would this be obvious in isolation)

44 Results - Gross value added Values from the ISIC scheme Values from the NACE scheme

45 Conclusions Research case study conclusions: Majority of error caused by lack of standard classification methodology Semantic differences exist but can be resolved Artefacts of harmonisation require care to eradicate Lack of transparency is pervasive Precision inextricably varies Variables with associated established methodology can be relied upon Many issues could be resolved through the setting up of a centralised repository Data is dangerous Theory conclusions: Data exists in many and varied forms Handling and analysis skills will become increasingly important There is a set of standard steps which should be followed in an initial exploration of any dataset Foremost in your mind should be viewing a dataset critically Visualisation is key to understanding Graphs etc. are generally the best way of communicating information

46 References: –Field, A. P. 2009. Discovering statistics using SPSS, SAGE publications Ltd. –Witten, I. H. & Frank, E. 2005. Data Mining: Practical machine learning tools and techniques, Morgan Kaufmann. –McCandless, D. 2009. Information is beautiful, Collins. –Tufte, E. R. & Howard, G. 1983. The visual display of quantitative information, Graphics press Cheshire, CT. –McKinsey. 2011. Big data: The next frontier for innovation, competition, and productivity. Available from: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation –Infrastructures, D. S. D. 2000. The SDI Cookbook. GSDI/Nebert. (for those interested in data infrastructure) –See also slide detailing data sources

