Published by Aapo Hakola. Modified over 5 years ago.
1
European Examples of the Use of Big Data for Producing Statistics
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Eurostat
2
Topics
Netherlands: traffic statistics based on road sensors (traffic loops)
Estonia: tourism statistics based on mobile phone data
UK: use of data from Twitter
3
Traffic Statistics Using Road Sensors in the Netherlands
First example in Europe of published official statistics based exclusively on a Big Data source
Index of traffic intensity, based on data collected by road sensors (traffic loops)
Vehicle counts on main Dutch highways
230 million records per day
Complex cleaning phase using special techniques
Details in Piet Daas's presentation
4
Mobile Positioning Data as Source for Statistics: Estonian Experiences
The movement of mobile phones is one of the easiest sources from which to record border crossings and traffic flows
Today, mobile positioning is technically possible in most mobile networks; it is a rapidly developing method that can be used successfully in different applications
In Estonia, statistical datasets derived from mobile networks are provided to government organisations through a collaboration between a university and a spin-off company: the Department of Geography of the University of Tartu and Positium LBS
5
Computed Statistics
Inbound and outbound tourism statistics
Travel item of the balance of payments
Transportation flows and origin-destination (OD) matrices
Everyday mobility and commuting
Personal and common anchor points
All this work dates back to !
6
Source Data
Event data (metadata) of any kind that covers subscriber activities and is included in the data stream of the mobile operator
Internal events: inbound roaming, domestic
External events: outbound roaming
Geographical cellular (network) referencing data
Attribute data for subscribers (e.g. demographic information taken from the customer database)
7
Call Detail Records
Inbound and outbound roaming and domestic datasets
Easily accessible by the operators
Represent the active usage of mobile devices, covering incoming and outgoing calls and SMS text messaging
Problem with CDRs: the frequency and regularity of the records depend on the usage pattern of the subscriber
For tourists, the average is approximately four CDR events per subscriber per day, that is, about four location facts per phone per day
This is sufficient for some areas, but it limits use in domains that require better temporal accuracy
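The "events per subscriber per day" figure above can be illustrated with a minimal sketch. The record layout (subscriber id, ISO timestamp, cell id) and the sample values are assumptions made for illustration, not an operator's real schema, and the average here is taken only over subscriber-days that have at least one event.

```python
from collections import Counter
from datetime import datetime

# Hypothetical CDR rows: (subscriber_id, ISO timestamp, cell_id).
cdrs = [
    ("A", "2012-07-01T09:15:00", "cell-17"),
    ("A", "2012-07-01T18:40:00", "cell-03"),
    ("A", "2012-07-02T11:05:00", "cell-03"),
    ("B", "2012-07-01T12:00:00", "cell-17"),
]


def events_per_subscriber_day(records):
    """Count CDR events for each (subscriber, calendar day) pair."""
    counts = Counter()
    for subscriber, timestamp, _cell in records:
        day = datetime.fromisoformat(timestamp).date()
        counts[(subscriber, day)] += 1
    return counts


def mean_events_per_active_day(records):
    """Average events over subscriber-days that contain at least one event."""
    counts = events_per_subscriber_day(records)
    return sum(counts.values()) / len(counts) if counts else 0.0


average = mean_events_per_active_day(cdrs)
```

Days on which a phone produces no record do not appear at all, which is one reason why usage-driven CDRs limit temporal accuracy.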
8
Geographical Location Derived from Antenna Location
9
Processing CDRs
The size of the data block, the number of records it covers, and the processing complexity require a sophisticated data processing system
Two options:
1. Data is extracted and processed within the operators, and the resulting statistical indicators are transmitted to the NSI, where they are combined to create the final statistical indicators
2. Data is extracted by the MNOs and transmitted to the NSI, where the processing is carried out in order to produce the final indicators
10
Data Processing Steps
11
Quality
Several inherent data limitations affect the quality of the results: there is no information on expenditure, the purpose of the trip, the method of transportation or the overall qualitative aspects of the tourism activity
The quality of the final outcome relies heavily on the availability of external information, e.g. accommodation statistics, transport statistics, information about the market share of mobile operators and their subscribers, and other information used for estimation
12
Quality
Accuracy is the most problematic quality aspect for this type of data, especially in terms of coverage
Many components contribute to the issue, and it is complex to assess all of them
Examples:
No information on the relationship between owning and using a mobile phone and a person’s travelling habits
Possible breaks in continuity due to changes in the underlying legislation and/or technology
13
Coherence
Mobile phone-based statistics from Estonia were compared against the official tourism statistics made available by Eurostat and against other indicators related to tourism statistics
Tourism statistics from mobile positioning data were provided by Positium, based upon datasets from two Estonian mobile network operators
The data covers Estonian inbound, outbound and domestic tourism between 2008 and 2012
The reference statistics were provided mainly by Statistics Finland and Statistics Estonia
Reference data, available on the Eurostat website, on the outbound trips of Estonians to EU Member States was also used
14
Coherence
15
The Estonian Experience: Conclusions
Estonia’s experience in generating movement statistics with the help of mobile positioning has developed gradually since and is now established and well trusted
Despite this positive experience, mobile positioning data involves several problems; for instance, there are many unanswered questions related to the sample and the quality of the data
Protection of data and privacy is also an important issue that is continually at the centre of attention
The example of Estonia can serve as a positive basis for gradually introducing similar surveys in other countries
16
The Use of Twitter Data in UK
ONS has carried out extensive experimentation on the use of social data from Twitter to compute statistics
A large number of geo-located tweets was collected over different periods from April to October 2014
106 million tweets collected, 86 million used
17
Data Collection
Data from Twitter can be collected mainly in two ways:
Using the public streaming API
Buying data from a company authorized by Twitter (GNIP)
The first method is free, but the public streaming download is limited to 1% of tweets at the global level
ONS tested both methods in distinct periods:
April 11th to August 14th 2014 (API)
April 1st to April 10th and August 15th to October 31st (GNIP)
18
Data Collection
The collection of data through the Twitter API must follow the "Twitter Developer Rules"
These rules emphasize principles of courtesy and “being a good partner” rather than enforcement and sanctions
The rules contain loosely defined terms such as "reasonable request volume" and "excessive or abusive usage"
ONS considers it impossible to base official statistics on such a fragile legal basis; any large-scale use of Twitter data would require commercial arrangements to acquire the data
Based on the experience of the pilot, the approximate cost of purchasing all geo-located tweets within Great Britain over 12 months would be in the region of £25,000
How does it compare to the cost of a survey?
19
Data Collection Considerations
During the data collection phase, a new version of iOS was released
Location service permissions changed, giving users the ability to set more detailed permissions
Users were explicitly prompted to specify or confirm their location permissions
Many users withdrew location permissions that they had previously given, explicitly or implicitly
ONS observed at that time a big drop in geo-located tweets coming from iOS users
Globally, geo-located tweets dropped by about 25%, and the drop was entirely due to iOS users
This event shows one of the important and well-known limits of Big Data obtained from mobile phones and, more generally, from external providers that rely on a particular technology
The stability of these sources is often critical: simple technical and/or social changes can dramatically change the quantity and the quality of the data collected (regarding quality, keep in mind that iOS users are different from Android users)
20
Data Cleaning
Removal of Twitter robots: automated Twitter accounts that post high volumes of tweets but do not represent the activities of a real person
Removal of non-GB tweets from the Twitter API data, mainly those from the Republic of Ireland
Removal of geolocated GB tweets without GPS-precision location, e.g. sent from a desktop computer
Removal of a very small number of GB-labelled tweets with precision coordinates that could not be assigned British Map Grid coordinates (probably a wrong GB code assigned by Twitter)
Removal of duplicate tweets from the time periods on 10 April and 15 August when there were overlaps between the Twitter API and GNIP data
Removal of all tweets from the Twitter API relating to users that were not in the GNIP data where these two sources overlapped (probably users that modified their configuration)
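Several of the cleaning rules above are simple filters, and a minimal sketch may make their order of application concrete. The field names (`id`, `user`, `country`, `gps`) are hypothetical, and the set of robot accounts is assumed to be given, since the slides do not say how ONS identified robots.

```python
# Hypothetical tweet records; field names are illustrative assumptions.
tweets = [
    {"id": 1, "user": "u1", "country": "GB", "gps": True},
    {"id": 1, "user": "u1", "country": "GB", "gps": True},    # duplicate from an API/GNIP overlap
    {"id": 2, "user": "bot9", "country": "GB", "gps": True},  # automated account
    {"id": 3, "user": "u2", "country": "IE", "gps": True},    # non-GB tweet
    {"id": 4, "user": "u3", "country": "GB", "gps": False},   # no GPS-precision location
    {"id": 5, "user": "u2", "country": "GB", "gps": True},
]


def clean_tweets(records, bot_users):
    """Apply the robot, non-GB, non-GPS and duplicate filters in turn."""
    seen_ids = set()
    kept = []
    for tweet in records:
        if tweet["user"] in bot_users:
            continue  # robot account (list assumed to be supplied)
        if tweet["country"] != "GB":
            continue  # outside Great Britain
        if not tweet["gps"]:
            continue  # geolocated, but without GPS precision
        if tweet["id"] in seen_ids:
            continue  # already seen in the other source
        seen_ids.add(tweet["id"])
        kept.append(tweet)
    return kept


cleaned = clean_tweets(tweets, bot_users={"bot9"})
```

The grid-coordinate check and the API-versus-GNIP user comparison are left out because they depend on data not described in the slides.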
21
Data Analysis – Tweets by user
A characteristic of Twitter is that some users are much more active than others.
In the ONS data set, over 17% of users had only one geolocated tweet over the seven-month period.
At the other extreme, 90 Twitter users generated more than 10,000 geolocated tweets.
This means that most Twitter data is generated by a small proportion of users.
More than half of all geolocated tweets were sent from just 4% of Twitter accounts, while the median number of geolocated tweets per account was just 10.
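The two summary measures quoted here (share of tweets sent by the most active accounts, and median tweets per account) can be computed as in the sketch below. The per-account counts are invented toy values, not ONS data, and `top_share` is a hypothetical helper name.

```python
import math
from statistics import median

# Toy number of geolocated tweets per account (invented values).
tweet_counts = [100, 20, 10, 5, 5, 3, 2, 2, 1, 1]


def top_share(counts, fraction):
    """Share of all tweets sent by the most active `fraction` of accounts."""
    ordered = sorted(counts, reverse=True)
    n_top = max(1, math.ceil(len(ordered) * fraction))
    return sum(ordered[:n_top]) / sum(ordered)


share_top_10_percent = top_share(tweet_counts, 0.10)
median_per_account = median(tweet_counts)
```

A large gap between the top share and the median is the signature of the skew described on the slide: a few accounts dominate the volume while the typical account contributes little.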
22
Data Analysis – User Persistence
A simple measure of persistence is the time span in days between a user’s first and last tweet.
The ONS study shows that a high proportion of users tweet for only a few days.
The decline then becomes more gradual and levels off at about 60 days.
The pattern remains fairly flat until about 150 days, at which point the gradual decline resumes.
The median level of persistence is 47 days.
These patterns suggest that many users go through a phase of sending geolocated tweets and then stop after a certain amount of time.
This could be for any number of reasons, from changing attitudes towards privacy to changing technology or simply waning enthusiasm for Twitter.
Thus, only a subset of users will generate enough geolocated activity to enable patterns to be detected over longer periods of time.
This suggests that Twitter may be useful for tracking longitudinal mobility patterns over periods of up to a couple of months, but is not suitable for longer time periods (e.g. over a year).
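The persistence measure defined above (days between a user's first and last tweet) is straightforward to compute. The sketch below uses invented dates and hypothetical user ids purely for illustration.

```python
from datetime import date
from statistics import median

# Toy tweet dates per user (invented values).
tweet_dates = {
    "u1": [date(2014, 4, 1), date(2014, 4, 3)],
    "u2": [date(2014, 5, 10)],
    "u3": [date(2014, 4, 1), date(2014, 5, 15), date(2014, 6, 30)],
}


def persistence_days(dates_by_user):
    """Days between each user's first and last tweet (0 for a single tweet)."""
    return {
        user: (max(dates) - min(dates)).days
        for user, dates in dates_by_user.items()
        if dates
    }


persistence = persistence_days(tweet_dates)
median_persistence = median(persistence.values())
```

Note that a user who tweets once has a persistence of 0 days, so the measure says nothing about how many tweets were sent inside the span.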
23
Data Analysis – Location of Residence
ONS used the DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise) to cluster geolocated activity traces from Twitter and infer a user’s location of residence.
The residential cluster with the highest number of tweets (referred to as the dominant residential cluster) has been assumed to be the location of usual residence.
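To make the clustering step concrete, here is a minimal pure-Python sketch of DBSCAN followed by selection of the largest cluster. It is not the ONS implementation: the `eps` and `min_pts` values, the planar coordinates in metres and the toy points are all assumptions, and the step that classifies a cluster as residential is omitted because the slides do not describe it.

```python
import math
from collections import Counter

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: one cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


def dominant_cluster(labels):
    """Label of the cluster with the most points, or None if all are noise."""
    counts = Counter(label for label in labels if label != NOISE)
    return counts.most_common(1)[0][0] if counts else None


# Toy tweet locations for one user, in metres on a planar grid (invented).
points = [(0, 0), (10, 5), (5, 10), (8, 8),
          (1000, 1000), (1005, 1002), (1003, 1008),
          (5000, 5000)]
labels = dbscan(points, eps=50, min_pts=3)
home = dominant_cluster(labels)
```

In this toy data the first four points form the largest cluster, the next three a smaller one, and the isolated point is labelled as noise.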
24
Data Analysis – Monthly Net Flows
The activity traces for each user have been broken down into months, and dominant residential clusters have then been identified for each month.
When the dominant cluster moves to a different local authority from one month to the next, this can be interpreted as a mobility flow between local authorities.
When the net flows for each local authority are compared with the proportion of students in the population (based on Census data), there is a strong signal that follows the cycle of the academic year.
For example, in June there is a net flow out of student areas, coinciding with the end of studies.
In contrast, in September and October there is a net inflow back into these areas.
Thus, these data can be used as an indicator of student mobility that cannot be detected from existing sources.
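The flow inference described above can be sketched as follows, assuming each user's dominant residential cluster has already been mapped to a local authority for each month. User ids, month keys and authority names are invented, and the sketch compares consecutive observed months, so a user with a missing month is treated as moving across the gap.

```python
from collections import Counter

# Hypothetical input: local authority of the dominant cluster, per user and month.
residence_by_month = {
    "u1": {"2014-05": "Authority A", "2014-06": "Authority B"},
    "u2": {"2014-05": "Authority A", "2014-06": "Authority A"},
    "u3": {"2014-05": "Authority A", "2014-06": "Authority C"},
}


def net_flows(residences):
    """Net inflow per (month, authority): arrivals minus departures."""
    flows = Counter()
    for months in residences.values():
        ordered = sorted(months)  # "YYYY-MM" keys sort chronologically
        for previous, current in zip(ordered, ordered[1:]):
            origin, destination = months[previous], months[current]
            if origin != destination:
                flows[(current, destination)] += 1
                flows[(current, origin)] -= 1
    return flows


flows = net_flows(residence_by_month)
```

Here Authority A loses two users in June, the kind of net outflow the slide associates with student areas at the end of the academic year.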
25
The UK Experience: Conclusions
ONS tested two different ways to collect data from Twitter, one of the best-known social networks, and discussed the pros and cons of each.
Many interesting observations were also made about the influence of technological changes on the continuity of the source and about the quality of the data collected.
Moving from research into operations would probably require procurement of Twitter data.
The collection phase has proved to be the most critical in the use of such data; as far as statistical processing is concerned, the potential of these data is not in dispute.