Use of web scraping and text mining techniques in the Istat survey on “Information and Communication Technology in enterprises”
Giulio Barcaroli (*), Alessandra Nurra (*), Marco Scarnò (**), Donato Summa (*)
(*) Italian National Institute of Statistics (Istat)
(**) Cineca
Quality 2014, Wien, June

The “ICT in enterprises” survey
• In Italy, the survey covers a universe of 211,851 enterprises with at least 10 employees, by means of a sample survey involving 19,186 of them (2011).
• In the 2013 round of the survey, 8,687 enterprises indicated their website (45% of the responding sample units).
• Accessing the indicated websites to gather information directly from them offers several opportunities.

The “ICT in enterprises” survey
No. | Action | Target
1 | Replace the traditional questionnaire-based collection technique with a new one based on the Internet as Data Source (IaD), for all suitable questions | Reduction of respondent burden
2 | Integrate the information collected via questionnaire with the information collected via IaD | Increase in the accuracy of estimates
3 | Collect additional information | Increase in the supply of statistical information

Predictive approach vs Content Analysis
We assume that our target is to increase the accuracy of estimates by using data originating from the Internet as auxiliary data. This particular case is based on the use of textual data as auxiliary data. Texts are a “perfect” example of unstructured data, which is one of the characteristics of most Big Data.
First, the usual model-based approach will be followed, which requires the prediction of values at unit level: under this approach, the target is to maximise the correctness of classification for each unit in the reference population.
Next, a different approach will be illustrated, in which the prediction of values at unit level is no longer required and the target becomes to maximise directly the accuracy at the aggregate level (accuracy of estimates).

Predictive approach
In a predictive approach, the subset of data related to the sampled respondent units can be considered as the labelled data, and supervised learning methods can be applied. In other words, the subset of 8,687 enterprises that indicated that they have a website or a home page, and that also responded to questions [B8a : B8g], can be used as the training and test set on which different models are estimated, in order to predict the answers to questions [B8a : B8g] for the whole reference population.
[Diagram: texts (website content) and survey microdata → text and data mining → model]
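The setup described above can be sketched as follows. This is only an illustration under stated assumptions: the toy records, the `tokenize` and `bag_of_words` helpers, and the 70/30 holdout split are hypothetical and are not taken from the presentation.

```python
import random
import re
from collections import Counter

def tokenize(text):
    """Lower-case a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Represent a website text as a word -> count mapping."""
    return Counter(tokenize(text))

def train_test_split(records, test_share=0.3, seed=0):
    """Shuffle labelled records and hold out a share of them for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled units: (scraped website text, survey answer to B8a).
labelled = [
    ("Add to cart and pay online", True),
    ("Book your room online now", True),
    ("Order online, free delivery", True),
    ("Our company history and mission", False),
    ("Contact us by phone or fax", False),
    ("Certified quality since 1980", False),
    ("Reserve a table online", True),
    ("Where to find our offices", False),
    ("Shop online with secure checkout", True),
    ("Press releases and news", False),
]

train, test = train_test_split(labelled)
# Features used to estimate a model; the test set is kept aside for evaluation.
train_features = [(bag_of_words(text), label) for text, label in train]
```

Keeping a test set apart from the training set is what allows the error matrix of each learner to be computed on units the model has not seen.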

Predictive approach
In our case, we can apply any of the following supervised learning methods:
• Classification Trees;
• “ensembles” (Bootstrap Aggregating, Adaptive Boosting, Random Forests);
• Supervised Latent Dirichlet Allocation for classification (SLDA);
• Neural Networks;
• Logistic Regression;
• Support Vector Machines;
• Naïve Bayes.
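As an illustration of one learner from this list, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing. The class name, the tokenizer, and the toy training texts are assumptions made for the example; the presentation does not say which Naïve Bayes variant or software was used.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def log_score(self, text, c):
        """Log prior of class c plus the log likelihood of the known words."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[c] / total_docs)
        total_words = sum(self.word_counts[c].values())
        for word in tokenize(text):
            if word in self.vocab:  # words never seen in training are ignored
                count = self.word_counts[c][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def predict(self, text):
        """Return the class with the highest log score."""
        return max(self.classes, key=lambda c: self.log_score(text, c))

# Hypothetical training data: website text and answer to "web sales (Yes/No)".
texts = [
    "add to cart and pay online",
    "book your room online",
    "company history and mission",
    "contact us by phone",
]
labels = [True, True, False, False]
model = NaiveBayes().fit(texts, labels)
```

With this toy model, `model.predict("pay online")` favours the "yes" class, because both words occur only in texts labelled `True`.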

Evaluation of predictive models
From the error matrix it is possible to compute the following indicators:
Indicator | Expression | Meaning
Accuracy (precision) | (TP + TN) / Total | Rate of correctly classified cases
Sensitivity (true positive rate) | TP / (TP + FN) | Rate of positive cases correctly classified
Specificity (true negative rate) | TN / (FP + TN) | Rate of negative cases correctly classified
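A minimal sketch of how these three indicators follow from the error matrix, assuming observed and predicted answers are given as lists of booleans; the function names and the example data are illustrative only.

```python
def error_matrix(observed, predicted):
    """Count true/false positives and negatives for yes/no answers."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)
    tn = sum(1 for o, p in pairs if not o and not p)
    fp = sum(1 for o, p in pairs if not o and p)
    fn = sum(1 for o, p in pairs if o and not p)
    return tp, tn, fp, fn

def indicators(tp, tn, fp, fn):
    """Compute accuracy, sensitivity and specificity from the four counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }

# Hypothetical observed and predicted answers for eight enterprises.
observed = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]

tp, tn, fp, fn = error_matrix(observed, predicted)
result = indicators(tp, tn, fp, fn)
```

In this example the counts are TP = 2, TN = 4, FP = 1, FN = 1, so accuracy is 6/8, sensitivity is 2/3 and specificity is 4/5.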

Evaluation of predictive models
Application of different learners to predict question B8a “Online ordering or reservation or booking (Yes/No)”

Evaluation of predictive models
In general, when the two kinds of misclassification (false positives and false negatives) are not balanced in absolute terms, the distribution of predicted values can differ significantly from the distribution of observed cases.
From these results, the Naïve Bayes predictor can be considered the most suitable: although its accuracy (78%) is the lowest, its sensitivity is the highest, its specificity is good, and the observed and predicted proportions are perfectly aligned.
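The link between unbalanced misclassifications and the gap between observed and predicted proportions can be written out explicitly. The symbols $p$, $\hat{p}$ and $N$ are notation introduced here for illustration, with $N$ equal to the total of the error matrix.

```latex
\[
p = \frac{TP + FN}{N}, \qquad
\hat{p} = \frac{TP + FP}{N}, \qquad
\hat{p} - p = \frac{FP - FN}{N}.
\]
```

The predicted proportion therefore matches the observed one exactly when the number of false positives equals the number of false negatives, whatever the accuracy of the classifier.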

Evaluation of predictive models
Application of Naïve Bayes to predict all questions in section B8

Content analysis

Content analysis performance …
In order to verify the robustness of the Content Analysis, we repeated the selection of a training set from the survey data 40 times (each time producing an estimate of the proportion of web sales functionality), for different training-set shares of the total (from 10% to 90%).
The results show that the method is correct until a training rate of 30%, but that the estimates vary widely at every rate.
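The design of this robustness exercise can be sketched as follows. The sketch only reproduces the resampling scheme (40 random training sets per training rate, rates from 10% to 90%); the estimator `share_of_positives` is a deliberately simple placeholder and is not the Content Analysis method, and the answers are simulated, not survey data.

```python
import random
import statistics

def training_share_experiment(labels, estimator, rates, repetitions=40, seed=0):
    """For each training rate, repeat the random selection of a training set
    and summarise the estimates by their mean and standard deviation."""
    rng = random.Random(seed)
    summary = {}
    for rate in rates:
        n_train = max(1, round(len(labels) * rate))
        estimates = []
        for _ in range(repetitions):
            train = rng.sample(labels, n_train)
            estimates.append(estimator(train))
        summary[rate] = (statistics.mean(estimates), statistics.stdev(estimates))
    return summary

def share_of_positives(train):
    """Placeholder estimator: share of 'yes' answers in the training set."""
    return sum(train) / len(train)

# Hypothetical survey answers: 200 'yes' out of 1,000 units.
answers = [True] * 200 + [False] * 800
rates = [r / 10 for r in range(1, 10)]  # 10% to 90%
results = training_share_experiment(answers, share_of_positives, rates)
```

For each rate, the mean of the 40 estimates indicates bias with respect to the known proportion, and the standard deviation indicates variability, which are the two properties compared in the slides.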

… compared to Naïve Bayes
The same exercise has been carried out for Naïve Bayes. The results show a small bias (on the order of one or two percentage points), but a much lower variability.

Future work
The approach tested here will be improved and extended in two directions:
1. With reference to the population of interest: we will consider the URLs of all the units belonging to the Business Register and perform a mass scraping of the related websites (which will also allow us to address more fully the high-volume problems related to Big Data), using the whole sample subset of websites as a training set, so as to obtain a model that can be applied to the whole population. The aim is to produce estimates under a full predictive approach, reducing sampling errors at the cost of introducing additional bias (both components of the MSE should be evaluated).
2. With reference to the content of the questionnaire: the results obtained with the variables in section “B8” of the questionnaire will also be evaluated for the other suitable variables in the questionnaire (e-recruitment, e-procurement, use of social networks, etc.).
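The two components of the MSE mentioned in the first point are the variance and the squared bias of the estimator. In the standard decomposition below, $\hat{\theta}$ denotes an estimator of the population quantity $\theta$; this notation is introduced here for illustration.

```latex
\[
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
  = \mathrm{Var}(\hat{\theta}) + \big(\mathbb{E}[\hat{\theta}] - \theta\big)^2 .
\]
```

A full predictive approach is worthwhile when the reduction in the variance term outweighs the squared bias introduced by the model.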

Contacts
Thank you for your attention