PROFILING USERS BY ESTIMATING COMPOSITE AND MULTI-VALUED ATTRIBUTES FROM BIG DATA SOURCES FOR SOCIAL STATISTICS PURPOSES NTTS 2017, Brussels, March.


1 PROFILING USERS BY ESTIMATING COMPOSITE AND MULTI-VALUED ATTRIBUTES FROM BIG DATA SOURCES FOR SOCIAL STATISTICS PURPOSES
NTTS 2017, Brussels, March 14-16, 2017
Jacek Maślankowski
CENTRAL STATISTICAL OFFICE, STATISTICAL OFFICE IN GDAŃSK, POLAND
UNIVERSITY OF GDAŃSK, POLAND

2 AGENDA
Prerequisites
Framework
Results of analysis
Conclusions

3 OVERVIEW AND THE GOAL OF THE STUDY
Show the methodology for extracting data from big data sources for social statistics purposes
Provide information on the population, with detailed attributes that can be extracted from the data or at least estimated
Attributes can be composite (e.g., address, name) as well as multi-valued (e.g., phone numbers)
H1: The usability of the data can be increased by estimating values for specific entities
H2: The representativeness of web data does not allow it to be applied directly for social statistics purposes
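The distinction between composite and multi-valued attributes can be made concrete with a small data structure. The following is only a minimal sketch: the class names `Address` and `UserProfile`, their fields, and the sample values are hypothetical and are not taken from the presentation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Address:
    """Composite attribute: a single value made up of several parts."""
    city: Optional[str] = None
    region: Optional[str] = None
    country: Optional[str] = None


@dataclass
class UserProfile:
    """Hypothetical profile record for one entity (user)."""
    screen_name: str
    # Composite attribute: stored as a nested record.
    address: Address = field(default_factory=Address)
    # Multi-valued attribute: one user may have zero, one, or many values.
    phone_numbers: List[str] = field(default_factory=list)
    # Estimated attribute: stays None until a method assigns a value.
    gender: Optional[str] = None


# Illustrative values only.
profile = UserProfile(screen_name="example_user")
profile.address.city = "Gdańsk"
profile.phone_numbers.append("+48 000 000 000")
profile.phone_numbers.append("+48 111 111 111")
```

Keeping estimated attributes optional makes it explicit which values were observed in the source and which still have to be inferred.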

4 METHODOLOGY AT A GLANCE
A set of combined methods is used to extract user profiles from both social media and web pages
Attributes available for analysis:
  Screen name
  Full name
  Geographic location
  URL
  Description
Estimation of attributes:
  Gender: different forms of verbs, language markers
  Phone/address: regular expressions
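Finding phone numbers and addresses with regular expressions could look roughly like the sketch below. The patterns are assumptions made for illustration: the slides do not give the actual expressions, the e-mail pattern is deliberately simplified, and the phone pattern assumes nine-digit numbers with an optional +48 prefix. The function name `extract_contacts` and the sample text are hypothetical.

```python
import re

# Simplified e-mail pattern (an assumption, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumed phone format: nine digits in groups of three, optional "+48" prefix,
# groups separated by a space, a hyphen, or nothing.
PHONE_RE = re.compile(r"(?<!\d)(?:\+48[ -]?)?\d{3}[ -]?\d{3}[ -]?\d{3}(?!\d)")


def extract_contacts(text):
    """Return all e-mail addresses and phone numbers found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


# Made-up profile description used only to exercise the patterns.
sample = "Contact: jan.kowalski@example.com, tel. +48 123 456 789 or 987-654-321"
contacts = extract_contacts(sample)
```

Because both attributes are multi-valued, the function returns lists rather than a single match.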

5 METHODOLOGY IN DETAIL
METHODS:
  (1) Machine learning tools
  (2) Text mining methods
STEPS:
  Analyse the readiness of the data source
  Social Big Data – Social Mining – Web Mining: profiling users
  Identification of demographic attributes
POPULATION:
  The group of social media users and Internet users who comment on selected web portals

6 USE CASES
WHO
  An entity is a person who is active on social media, as well as a person who comments on various events in the country
WHAT
  Three different cases were used in the analysis to enhance social statistics:
    intentions to vote (mostly covered in statistics from OECD)
    media education: how much people trust the media
    social confidence

7 RESULTS – ENTITY CLASSIFICATION (IDENTIFY GENDER)
BASED ON THE VERB FORM
  Based on the suffix of the verb form: the F1-score varies from 0.25 to 0.57 across specific datasets, with precision from 0.33 to 0.5
  Cannot be used as the primary method of identification
BASED ON THE VERB FORM AND TRAINING DATASET
  MultinomialNB and Linear SVM
  From
  Supported by different methods
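A suffix-based gender heuristic and its evaluation might be sketched as follows. This is an illustration under stated assumptions, not the method used in the study: it assumes the texts are in Polish, where first-person past-tense verbs end in "-łam" for women and "-łem" for men; the function names and the four sample comments are made up, and the resulting scores say nothing about the figures reported on the slide.

```python
import re

FEMININE_SUFFIX = "łam"   # e.g. "byłam" (I was, female speaker)
MASCULINE_SUFFIX = "łem"  # e.g. "byłem" (I was, male speaker)


def guess_gender(text):
    """Guess 'F' or 'M' from verb suffixes; return None when undecided."""
    words = re.findall(r"\w+", text.lower())
    feminine = sum(w.endswith(FEMININE_SUFFIX) for w in words)
    masculine = sum(w.endswith(MASCULINE_SUFFIX) for w in words)
    if feminine > masculine:
        return "F"
    if masculine > feminine:
        return "M"
    return None


def precision_recall_f1(gold, predicted, label):
    """Compute precision, recall and F1 for one label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == label and g == label)
    fp = sum(1 for g, p in pairs if p == label and g != label)
    fn = sum(1 for g, p in pairs if p != label and g == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up comments with hand-assigned labels, for illustration only.
comments = [
    "Byłam wczoraj na wyborach",        # feminine past-tense verb
    "Głosowałem na innego kandydata",   # masculine past-tense verb
    "Nie ufam mediom",                  # present tense: no gender marker
    "Widziałam to w telewizji",         # feminine past-tense verb
]
gold = ["F", "M", "F", "F"]
predicted = [guess_gender(c) for c in comments]
precision, recall, f1 = precision_recall_f1(gold, predicted, "F")
```

The third comment shows why such a heuristic cannot serve as the primary method: texts without a past-tense first-person verb carry no marker, which lowers recall.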

8 CONCLUSIONS (1/2)
Several useful and reliable attributes can be extracted to enhance social statistics surveys
They can enrich social surveys, e.g., on social confidence and intention to vote
Results are presented using geographic and demographic attributes of the entities

9 CONCLUSIONS (2/2)
Hypothesis H1 has been confirmed by comparing the results of the analysis with data from official statistics
Hypothesis H2 has also been confirmed: some differences in the results are to be expected
Each data source must be treated individually

10 THANK YOU! Jacek Maślankowski STATISTICAL OFFICE IN GDAŃSK POLAND

