Sharing of previous experiences on scraping: Istat’s experience

Sharing of previous experiences on scraping: Istat’s experience
ESSnet on Big Data – WP2 “Webscraping / Enterprise Characteristics” – Kickoff meeting, Rome, 23-24 March

We can distinguish two kinds of web scraping:
- specific web scraping, when both the structure and the content of the websites to be scraped are perfectly known, and crawlers just have to replicate the behaviour of a human being visiting the website and collecting the information of interest. Typical area of application: data collection for consumer price indices (ONS, CBS, Istat);
- generic web scraping, when no a priori knowledge of the content is available, and the whole website is scraped and subsequently processed in order to infer the information of interest. This is the case of the “ICT usage in enterprises” pilot.
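
As a rough illustration of the generic case, where the whole website is collected before any processing, the sketch below crawls every page reachable on one host, breadth-first. It is only a minimal sketch and not the crawler used in the pilot: the `crawl_site` function, the `fetch` callback and the in-memory `FAKE_SITE` are hypothetical names introduced here so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_site(start_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    `fetch` is a caller-supplied function url -> HTML string (or None).
    Returns a dict {url: html} with every page that was reached.
    """
    host = urlparse(start_url).netloc
    pages = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop fragments.
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# Hypothetical in-memory website, used instead of real HTTP requests.
FAKE_SITE = {
    "http://example.test/": '<a href="/shop">Shop</a> <a href="http://other.test/">Ext</a>',
    "http://example.test/shop": '<a href="/">Home</a> <a href="/cart">Cart</a>',
    "http://example.test/cart": "<p>Your cart is empty</p>",
}

pages = crawl_site("http://example.test/", FAKE_SITE.get)
print(sorted(pages))
```

A specific scraper, by contrast, would skip the link discovery and go straight to known pages and known page elements.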

So far, Istat has carried out a number of web scraping experiences, involving:
- prices related to a set of goods and services;
- agritourism farms’ portals;
- enterprises’ websites in the Enterprises ICT survey.
In the first two cases specific scraping was employed, while generic scraping was used for the enterprises’ websites.

Web scraping data acquisition for the Harmonized Index of Consumer Prices (European project “Multipurpose price statistics”)
It is currently performed on two groups of products:
- consumer electronics;
- airfares.
These prices are collected by simulating online purchases. The current applications make use of iMacros. Next step: use of open source software such as “rvest” (an R package for web scraping).

Use of web scraping for agritourism data collection
In Italy there are about 20,000 agritourism farms (AF). An annual survey is carried out to collect information on them. The same information, and more, can be obtained by directly accessing and scraping their sites on the web. Instead of accessing every single website, a limited set of «hubs» would be scraped, each one containing information on many AFs. In this case, «specific» scraping would be used.
A relevant problem here is the treatment of duplicated and inconsistent information, as the same AF can be present in more than one hub. A critical step is therefore «record linkage», as the information collected in different hubs must be correctly attributed to a given AF.
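
The duplication problem can be sketched as follows. This is only an illustration under strong assumptions: records are linked by an exact match on a normalised (name, municipality) key, whereas a production record linkage would normally use fuzzy or probabilistic matching. The field names and the sample records are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def normalise(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def link_records(records):
    """Group hub records that share the same (name, municipality) key."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalise(rec["name"]), normalise(rec["municipality"]))
        groups[key].append(rec)
    return groups


def conflicts(group, field):
    """Return the distinct non-empty values of `field` within one group."""
    return sorted({rec[field] for rec in group if rec.get(field)})


# Hypothetical records scraped from two hubs.
records = [
    {"hub": "A", "name": "Agriturismo La Quercia", "municipality": "Siena", "beds": 12},
    {"hub": "B", "name": "agriturismo  la quercia", "municipality": "SIENA", "beds": 14},
    {"hub": "B", "name": "Podere Il Sole", "municipality": "Lucca", "beds": 8},
]

groups = link_records(records)
for key, group in groups.items():
    # More than one value in the last list signals inconsistent information.
    print(key, [r["hub"] for r in group], conflicts(group, "beds"))
```

Groups with more than one record are duplicates across hubs; groups where `conflicts` returns more than one value carry inconsistent information that has to be resolved.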

ICT usage in enterprises
The web questionnaire is used to collect information on the characteristics of the websites owned or used by the enterprises. In a first phase, the aim of the experiment was to predict the values of questions B8a to B8g using machine learning techniques applied to texts scraped from the websites (text mining). Particular effort was dedicated to question B8a (“Web sales facilities”, i.e. “e-commerce”).

ICT usage: web scraping
Different solutions for web scraping will be investigated. For instance, in Istat experiments we have already tested:
- the Apache suite Nutch/Solr (https://nutch.apache.org) for crawling, content extraction, indexing and searching of results; Nutch is a highly extensible and scalable open source web crawler;
- HTTrack (http://www.httrack.com/), a free and open source software tool that makes it possible to “mirror” a website locally by downloading each page that composes its structure. In technical terms it is a web crawler and an offline browser;
- JSOUP (http://jsoup.org), which makes it possible to parse an HTML document and extract its structure. It has been integrated in a specific step of the ADaMSoft system (http://adamsoft.sourceforge.net); the latter was selected because it already includes facilities for handling huge data sets and textual information.
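
The parsing step, in which the text of a downloaded page is separated from its markup, can be sketched with Python's standard `html.parser` module. This is a stand-in chosen for illustration only: it is not Jsoup (a Java library) and does not reproduce the ADaMSoft integration. The `TextExtractor` class and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = """
<html><head><title>Demo shop</title><style>p {color: red}</style></head>
<body><h1>Online store</h1><script>var x = 1;</script>
<p>Add to <b>cart</b> and pay by credit card.</p></body></html>
"""
text = extract_text(sample)
print(text)
```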

ICT usage: web scraping
These techniques will be evaluated by taking into account:
- efficiency: number of websites actually scraped out of the total, and execution performance;
- effectiveness: completeness and richness of the collected text, which can influence the quality of the predictions.

| Solution | Websites reached | Average number of webpages per site | Time spent | Type of storage | Storage size |
|---|---|---|---|---|---|
| Nutch | 7020 / 8550 = 82.1% | 15.2 | 32.5 hours | Binary files on HDFS | 2.3 GB (data), 5.6 GB (index) |
| HTTrack | 7710 / 8550 = 90.2% | 43.5 | 6.7 days | HTML files on file system | 16.1 GB |
| JSOUP | 7835 / 8550 = 91.6% | 68 | 11 hours | HTML ADaMSoft compressed binary files | 500 MB |

Under testing: an ad hoc solution based on Jsoup and Jcrawler.
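
The efficiency criterion can be recomputed from the figures reported above for the three solutions. In the sketch below the coverage percentage follows the source directly, while the "sites per hour" figure is a derived measure added here for illustration only; it ignores the very different number of pages collected per site, so it should not be read as a ranking.

```python
TOTAL_SITES = 8550

# solution -> (websites reached, time spent in hours), as reported above.
# HTTrack's 6.7 days are converted to hours.
RESULTS = {
    "Nutch": (7020, 32.5),
    "HTTrack": (7710, 6.7 * 24),
    "JSOUP": (7835, 11.0),
}


def coverage(reached, total=TOTAL_SITES):
    """Percentage of websites actually reached out of the total."""
    return 100.0 * reached / total


for name, (reached, hours) in RESULTS.items():
    print(f"{name:8s} coverage={coverage(reached):.1f}%  sites/hour={reached / hours:.0f}")
```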

ICT usage: text mining
Both the 2013 and 2014 rounds of the survey have been used in the experiment. For all respondents declaring that they own a website, the website has been scraped and the collected texts submitted to classical text mining procedures in order to build a “term-document matrix”. Different learners have been applied in order to predict the values of target variables (for instance, “e-commerce (yes/no)”) on the basis of relevant terms identified in the websites. The relevance of the terms (and the consequent selection of 1,200 out of 50,000) has been based on the importance of each term as measured in a correspondence analysis. 2013 data have been used as the “train” dataset, while 2014 data have been used as the “test” dataset.
The performance of each learner has been evaluated by means of the usual quality indicators:
- accuracy: rate of correctly classified cases out of the total;
- sensitivity: rate of correctly classified positive cases out of total positive cases;
- specificity: rate of correctly classified negative cases out of total negative cases.
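
A term-document matrix of the kind mentioned above can be sketched in a few lines. The sketch stores documents as rows and terms as columns, and selects terms with a simple document-frequency threshold; that threshold is a stand-in introduced here, not the correspondence analysis used in the experiment. The sample texts are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(docs, min_docs=1):
    """Return (terms, matrix) with matrix[i][j] = count of terms[j] in docs[i].

    Only terms occurring in at least `min_docs` documents are kept.
    """
    counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)
    terms = sorted(t for t, df in doc_freq.items() if df >= min_docs)
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


docs = [
    "Add to cart and pay online",
    "Contact us for a quote",
    "Cart checkout: pay by card",
]
terms, matrix = term_document_matrix(docs, min_docs=2)
print(terms)   # ['cart', 'pay']
print(matrix)  # [[1, 1], [0, 0], [1, 1]]
```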

ICT usage: e-commerce prediction
Quality indicators for e-commerce, by learner. Table columns: Learner, Accuracy, Sensitivity, Specificity, Proportion of e-commerce (observed), Proportion of e-commerce (predicted). The values below are listed in transcript order; several cells are missing from the transcript, so the values cannot all be assigned to columns with certainty.
GLM (Logistic): 0.69, 0.68, 0.19, 0.22
Random Forest: 0.79, 0.63, 0.83, 0.25
Neural Network: 0.70, 0.62, 0.72, 0.20
Boosting: 0.67, 0.66
Bagging: 0.82, 0.38, 0.92
Naïve Bayes: 0.75, 0.55, 0.21
LDA: 0.71, 0.65, 0.28
RPART (Tree): 0.95, 0.16
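
For reference, accuracy, sensitivity and specificity, together with the observed and predicted proportions, can all be derived from the same comparison of observed and predicted values. The sketch below uses a small hypothetical sample, not the survey data, and the function and key names are chosen here for illustration.

```python
def quality_indicators(observed, predicted):
    """Compute the indicators from two equal-length lists of 0/1 values."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)  # true positives
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)  # true negatives
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)  # false positives
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)  # false negatives
    total = len(pairs)

    def ratio(num, den):
        # Avoid a ZeroDivisionError when a class is absent.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, total),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "proportion_observed": ratio(tp + fn, total),
        "proportion_predicted": ratio(tp + fp, total),
    }


# Hypothetical sample: 1 = e-commerce, 0 = no e-commerce.
observed = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
indicators = quality_indicators(observed, predicted)
for name, value in indicators.items():
    print(f"{name}: {value:.2f}")
```

A learner can score high on accuracy and specificity while missing most positive cases, which is why the three indicators are reported together.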

[Slide: ICT usage: prediction of other variables; slide content not captured in the transcript]

So far, the pilot has explored the possibility of replicating the information collected by the questionnaire using the scraped content of the websites and applying the best predictor (reducing respondent burden). A more relevant possibility is to combine survey data and Big Data in order to improve the quality of the estimates.

The aim is to adopt a full predictive approach with a combined use of data:
1. all the websites owned by the whole population of enterprises are identified and their content is collected by web scraping (= Big Data);
2. survey data (the “ground truth”) are combined with Big Data in order to establish relations (models) between the values of the target variables and the terms collected in the corresponding scraped websites;
3. the models estimated in step 2 are applied to the whole set of texts obtained in step 1 in order to produce estimates of the target variables.
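
The three steps can be sketched end to end on toy data. The sketch assumes a small multinomial Naive Bayes classifier written from scratch; it stands in for whichever learner is eventually chosen and is not the pilot's implementation. The enterprise identifiers, texts and survey answers are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.term_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.term_counts[label].update(tokenize(text))
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / n_docs)  # log prior
            for term in tokenize(text):
                if term in self.vocab:  # terms unseen in training are ignored
                    score += math.log((counts[term] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Step 1: texts scraped for the whole population (hypothetical toy data).
scraped = {
    "e1": "add to cart checkout pay online",
    "e2": "company history and contacts",
    "e3": "shop online cart secure payment",
    "e4": "our team contacts and offices",
    "e5": "buy now cart checkout",
    "e6": "history of the company offices",
}

# Step 2: survey answers (ground truth) exist only for sampled enterprises.
survey = {"e1": 1, "e2": 0, "e3": 1, "e4": 0}
model = NaiveBayes().fit([scraped[e] for e in survey], list(survey.values()))

# Step 3: apply the model to every scraped website and estimate the share.
predictions = {e: model.predict(text) for e, text in scraped.items()}
share = sum(predictions.values()) / len(predictions)
print(predictions, share)
```

The estimate in step 3 inherits the classification errors of the model, so in practice it would be assessed with the quality indicators before being used.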

Thank you for your attention