Description of national ongoing/intended data processing
Roberta Radini – Istat
1st Internal Meeting of WP5 Mobile Phone Data
Madrid, 7 June
Outline
Description of ongoing data processing:
- Classifying urban population
- Tool used: Sociometer
The next steps
Classifying municipality population (Users)
- A person is a Resident in an area A when his/her home is inside A. His/her mobility therefore tends to be from and towards home.
- A person is a Commuter between an area B and an area A if his/her home is in B while the work/school place is in A. The daily mobility of this person is therefore mainly between B and A.
- A person is a Dynamic Resident between an area A and an area B if his/her home is in A while the work/school place is in B. Seen from area A, a Dynamic Resident is a sort of "opposite" of the Commuter.
- A person is a Visitor in an area A if his/her home and work/school places are both outside A, and his/her presence inside the area is limited to a period of time long enough to perform some activities in A.
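The four definitions above can be read as a decision rule over a user's home and work/school areas. The following is only a minimal sketch of that reading: the function name `classify`, its parameters, and the treatment of a missing work/school place (`None`) are assumptions for illustration, not part of the method presented here.

```python
def classify(home, work, area, present_in_area=True):
    """Return the category of a user with respect to `area`.

    home, work: area identifiers (work may be None if unknown; assumption).
    present_in_area: whether the user was observed inside `area` at all.
    """
    if home == area:
        # Home inside the area: Resident, unless work/school is elsewhere.
        if work is None or work == area:
            return "Resident"
        return "Dynamic Resident"
    if work == area:
        # Home outside, work/school inside the area.
        return "Commuter"
    if present_in_area:
        # Home and work/school both outside, but some presence in the area.
        return "Visitor"
    return None  # the user never appears in the area


print(classify(home="A", work="A", area="A"))  # Resident
print(classify(home="A", work="B", area="A"))  # Dynamic Resident
print(classify(home="B", work="A", area="A"))  # Commuter
print(classify(home="B", work="C", area="A"))  # Visitor
```

In practice home and work/school places are not observed directly in the CDRs, which is why the classification is derived from call profiles rather than from a rule like this one.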
A methodology to classify the users
Classifying the users requires a condensed representation of each user's activities, a profile of behavior called the Individual Call Profile (ICP).
For each SIM and calling place, the telephone data can be organized by:
- time of day: morning (00:00-08:00), daytime (08:00-19:00) and night (19:00-24:00);
- day of the week: weekdays and weekend.
From the set of CDRs, the frequency of each combination is counted to build the ICP.
t1 = [00:00-08:00)   t2 = [08:00-19:00)   t3 = [19:00-24:00)

Cod SIM   Municipality   Date         Hour
123643    PISA           06/02/2017   11:00
123643    PISA           06/02/2017   12:05
123643    PISA           07/02/2017   12:15
123643    PISA           08/02/2017   14:03
123643    PISA           08/02/2017   14:13
123643    CASCINA        09/02/2017   09:42
123643    CASCINA        09/02/2017   15:42
123643    PISA           11/02/2017   07:45
123643    PISA           12/02/2017   10:01
123643    PISA           12/02/2017   12:18
…
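As an illustration of how the sample CDRs above could be folded into a profile, here is a minimal sketch that counts records per SIM, municipality, day type and time slot. Counting single calls is an assumption (the exact aggregation is not spelled out here), and the names `time_slot` and `build_icp` are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Sample CDRs from the slide: (SIM code, municipality, date dd/mm/yyyy, hour)
CDRS = [
    ("123643", "PISA", "06/02/2017", "11:00"),
    ("123643", "PISA", "06/02/2017", "12:05"),
    ("123643", "PISA", "07/02/2017", "12:15"),
    ("123643", "PISA", "08/02/2017", "14:03"),
    ("123643", "PISA", "08/02/2017", "14:13"),
    ("123643", "CASCINA", "09/02/2017", "09:42"),
    ("123643", "CASCINA", "09/02/2017", "15:42"),
    ("123643", "PISA", "11/02/2017", "07:45"),
    ("123643", "PISA", "12/02/2017", "10:01"),
    ("123643", "PISA", "12/02/2017", "12:18"),
]


def time_slot(hour):
    """Map an hour (0-23) to t1 = [00-08), t2 = [08-19), t3 = [19-24)."""
    if hour < 8:
        return "t1"
    if hour < 19:
        return "t2"
    return "t3"


def build_icp(cdrs):
    """Count CDRs per (SIM, municipality, day type, time slot)."""
    icp = Counter()
    for sim, municipality, date, hour in cdrs:
        ts = datetime.strptime(f"{date} {hour}", "%d/%m/%Y %H:%M")
        # weekday() returns 5 for Saturday and 6 for Sunday.
        day_type = "weekend" if ts.weekday() >= 5 else "weekday"
        icp[(sim, municipality, day_type, time_slot(ts.hour))] += 1
    return icp


icp = build_icp(CDRS)
for key, count in sorted(icp.items()):
    print(key, count)
```

For this SIM the profile is dominated by weekday daytime activity in PISA, with a smaller weekday daytime presence in CASCINA and some weekend activity in PISA.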
Classifying the behavior
By applying the K-Means clustering algorithm to the Individual Call Profiles (ICPs), we extract clusters, called Prototypes, each of which is a group of homogeneous behaviors discovered in the population data. The corresponding k centroids, called Stereotypes, are the set of representative behaviors of the population.
Experts have specified the initial set of reference profiles, the Archetypes. These archetypes are a sort of "perfect example" of a behavioral profile and aim at synthesizing the typical behavior of each user category.
[Figure: Prototype and Archetype]
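To make the clustering step concrete, the following is a minimal pure-Python sketch of K-Means (Lloyd's algorithm) on toy profiles. It is not the Sociometer implementation: the 6-value ICP layout (weekday t1, t2, t3, then weekend t1, t2, t3), the toy values, and the choice of initial centroids are all assumptions for illustration.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm with caller-supplied initial centroids.

    Returns (centroids, assignment), where assignment[i] is the index
    of the cluster that points[i] belongs to.
    """
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Toy ICPs (hypothetical): [weekday t1, t2, t3, weekend t1, t2, t3]
icps = [
    [5, 5, 5, 2, 2, 2],  # present at all times
    [4, 5, 5, 2, 2, 1],
    [0, 5, 0, 0, 0, 0],  # present only on weekday daytime
    [0, 4, 1, 0, 0, 0],
]
stereotypes, labels = kmeans(icps, centroids=[icps[0], icps[2]])
print(labels)       # cluster index of each ICP
print(stereotypes)  # the k centroids
```

Here the first two profiles end up in one cluster and the last two in another; the returned centroids play the role of the Stereotypes described above.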
The Sociometer
The core of the analysis is a tool called the Sociometer. This tool, implemented by the University of Pisa and the NRC, classifies the behavior of each user in the CDRs.
Since 2015, ISTAT, the NRC and the University of Pisa have been collaborating with the aim of using the Sociometer to classify people's individual and collective behavior in order to compute statistical models.
The Sociometer: Individual call profile and Data Mining
From the ICPs, a set of clusters is extracted by using the K-Means algorithm.
[Figure: the ICPs for areas A and B are fed to the classification algorithm, which outputs the categories Resident, Dynamic Resident, Commuter and Visitor]
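One way to go from clusters to the four categories is to give each cluster centroid the label of its closest archetype and then propagate that label to every user in the cluster. The sketch below assumes nearest Euclidean distance as the matching rule; the archetype vectors, the 6-value profile layout (weekday t1, t2, t3, then weekend t1, t2, t3) and all function names are hypothetical, not taken from the Sociometer.

```python
import math

# Hypothetical archetypes for an area A: expected presence in A per slot.
ARCHETYPES = {
    "Resident": [5, 5, 5, 2, 2, 2],
    "Dynamic Resident": [5, 0, 5, 2, 2, 2],
    "Commuter": [0, 5, 0, 0, 0, 0],
    "Visitor": [0, 1, 0, 0, 0, 0],
}


def label_centroid(centroid, archetypes):
    """Return the name of the archetype closest to the centroid."""
    return min(archetypes, key=lambda name: math.dist(centroid, archetypes[name]))


def classify_users(assignment, centroids, archetypes):
    """Propagate each centroid's archetype label to the users in its cluster."""
    centroid_labels = [label_centroid(c, archetypes) for c in centroids]
    return [centroid_labels[a] for a in assignment]


# Example: two cluster centroids and the cluster index of four users.
centroids = [
    [4.5, 5.0, 5.0, 2.0, 2.0, 1.5],
    [0.0, 4.5, 0.5, 0.0, 0.0, 0.0],
]
assignment = [0, 0, 1, 1]
categories = classify_users(assignment, centroids, ARCHETYPES)
print(categories)
```

Labelling centroids rather than individual users keeps the expert-defined archetypes as a reference while letting the data decide where the cluster boundaries fall.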
The next steps
- We are currently installing and configuring the Sociometer on our IT Big Data platform, Cloudera.
- We expect to have the first results soon.
- We will then move on to the integration phase with the administrative data on the resident and commuting population.
- For tourism estimates we need a set of CDRs at least at a regional level. We are currently requesting permission from the Guarantor to obtain data for the whole of Tuscany, as well as for Lazio and Campania.
- In the new request, ISTAT asks for wider and more detailed data sets, because information on antennas and their location is indispensable for accurate processing of the CDRs and for a better quantification of data quality.
Thanks