Measuring Sustainability Reporting using Web Scraping and Natural Language Processing — Alessandra Sozzi, alessandra.sozzi@ons.gov.uk

Background In September 2015 the United Nations adopted the Sustainable Development Goals (SDGs), a blueprint for development priorities to be achieved by 2030. The SDGs comprise 17 goals with 169 targets covering a broad range of sustainable development issues, from ending poverty and hunger to improving health and education, reducing inequality, and combating climate change. The Office for National Statistics (ONS) is responsible for reporting the UK's progress towards the SDGs.

Background Target 12.6: Encourage companies, especially large and transnational companies, to adopt sustainable practices and to integrate sustainability information into their reporting cycle. Proposed indicator: number of companies publishing a sustainability report. Project: web scrape companies' websites (a sample of the 100 largest UK private companies), extract sustainability-related content, and validate the extracted content using NLP.

Collecting data from the Web Keyword stems used to identify sustainability-related pages: csr, environment, sustainab, responsib, footprint.
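The keyword stems above can be matched as substrings, so a single stem like "sustainab" catches "sustainable", "sustainability", and so on. A minimal sketch of such a filter (the function name and example URLs are illustrative, not from the project):

```python
# Flag a page as sustainability-related if its URL or text
# contains any of the keyword stems used in the project.
KEYWORDS = ("csr", "environment", "sustainab", "responsib", "footprint")

def is_sustainability_page(url, text=""):
    """Return True if any keyword stem appears in the URL or page text."""
    haystack = (url + " " + text).lower()
    return any(stem in haystack for stem in KEYWORDS)

print(is_sustainability_page("https://example.com/about/sustainability"))  # True
print(is_sustainability_page("https://example.com/investors"))             # False
```

Substring matching keeps the keyword list short, at the cost of occasional false positives when a stem happens to appear inside an unrelated word.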

Extracting just the text Web pages are often cluttered with features surrounding the main content, such as navigation panels, pop-up ads and comments. These noisy parts tend to degrade the performance of NLP tasks.

Extracting just the text To decide which content extraction algorithm works best for our needs, we tested four methods on a sample of 30 URLs:
- Dragnet: uses an ensemble of machine learning models to extract the main article content.
- Readability: originally conceived as a browser add-on that turns any web page into a clean view, giving users a better reading experience. It scores each part of the HTML page using deterministic rules based on features such as the number of commas, class names and link density.
- <p> tags: simply extracts all text enclosed within <p> … </p> tags, the HTML element conventionally used to mark up textual paragraphs.
- BeautifulSoup get_text(): BeautifulSoup is a Python library for parsing HTML. Its get_text() method extracts all text without distinguishing between main content and noise.
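The <p>-tag method above is simple enough to sketch with the Python standard library alone (the project used BeautifulSoup; this stdlib version and its sample HTML are illustrative):

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collects the text that appears inside <p> ... </p> tags."""
    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while inside a <p> element
        self.paragraphs = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1
            if self.depth == 0 and self._buf:
                self.paragraphs.append("".join(self._buf).strip())
                self._buf = []

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)

html = """
<html><body>
  <nav>Home | About | Contact</nav>
  <p>Our sustainability strategy focuses on reducing emissions.</p>
  <div class="ad">Buy now!</div>
  <p>We report our carbon footprint annually.</p>
</body></html>
"""

parser = ParagraphExtractor()
parser.feed(html)
print(parser.paragraphs)
```

Note how the navigation bar and the advertisement are dropped for free, simply because they are not wrapped in paragraph tags; the flip side is that any main content outside <p> tags is lost too.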

Extracting just the text Precision is the percentage of true tokens among all tokens retrieved by the method; it can be seen as a measure of exactness or quality. Recall is the percentage of true tokens retrieved by the method among all true tokens; it can be seen as a measure of completeness or quantity.
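In token terms, both measures reduce to counting the overlap between the extracted tokens and the gold-standard tokens. A small sketch (function name and toy tokens are illustrative):

```python
from collections import Counter

def token_precision_recall(extracted, gold):
    """Token-level precision and recall, treating token lists as multisets."""
    ec, gc = Counter(extracted), Counter(gold)
    true_positives = sum((ec & gc).values())  # multiset intersection
    precision = true_positives / sum(ec.values()) if extracted else 0.0
    recall = true_positives / sum(gc.values()) if gold else 0.0
    return precision, recall

gold = "we report our carbon footprint annually".split()
# The extractor also picked up two tokens of advertising noise:
extracted = "we report our carbon footprint annually buy now".split()
p, r = token_precision_recall(extracted, gold)
print(p, r)  # 0.75 1.0 — noisy extraction hurts precision but not recall
```

This illustrates the trade-off in the method comparison: an aggressive extractor like get_text() tends towards high recall and low precision, while a conservative one behaves the other way round.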

Collecting data from the Web Results: 563 sustainability-related web pages extracted from 59 companies; 35 companies with no sustainability content; 6 problematic cases. Observations: around 80% of companies in the first half of the sample published sustainability content, against 45% in the second half, and differences also emerged between manufacturers and service companies.

Topic Modelling and LDA Topic models are a family of algorithms that extract topics from texts, where a topic is a list of words that co-occur in statistically meaningful ways. Topic modelling algorithms require no prior annotation or labelling of the documents: the topics emerge from the analysis of the original texts. Latent Dirichlet allocation (LDA) produces two results:
- Topics as distributions over words: words that are more meaningful for a topic have a higher probability.
- Documents as distributions over topics: documents exhibit multiple topics (but typically not many). This is a distinguishing characteristic of LDA, in contrast with typical mixture models, which assume each document exhibits a single topic.
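To make the two outputs concrete, here is a minimal collapsed Gibbs sampler for LDA in plain Python. This is a didactic sketch, not the implementation used in the project (which would more plausibly use a library such as gensim); the toy corpus and all names are illustrative.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenised documents."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]  # topic assignments
    ndk = [[0] * n_topics for _ in docs]                      # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]         # topic-word counts
    nk = [0] * n_topics                                       # topic totals
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            k = z[di][wi]
            ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                k = z[di][wi]
                ndk[di][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Resample the topic of this token from its conditional.
                weights = [(ndk[di][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                r, acc = rng.random() * sum(weights), 0.0
                for t, wt in enumerate(weights):
                    acc += wt
                    if r <= acc:
                        k = t
                        break
                z[di][wi] = k
                ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # Documents as distributions over topics (the second LDA output).
    theta = [[(ndk[di][t] + alpha) / (len(d) + n_topics * alpha)
              for t in range(n_topics)] for di, d in enumerate(docs)]
    return theta, nkw

docs = [["emission", "carbon", "footprint"] * 5,
        ["price", "tax", "market"] * 5]
theta, topic_words = lda_gibbs(docs, n_topics=2)
print(theta)  # one topic distribution per document, each summing to 1
```

The returned `theta` rows are exactly the per-document topic distributions described above, and `topic_words` holds the (unnormalised) topic-word counts behind the per-topic word distributions.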

Topic Modelling and LDA Topics

Companies-topic distribution by industry (figure: Fashion Retailers vs Transport industry)

Conclusions Results show it is possible to discern the number of companies publishing sustainability information by scraping their websites. Sustainability means different things to different companies. The effects of industry type and company size are worth exploring further, especially once a much larger number of companies is considered.