Data Scraping
Summer School 2018
Mohit Kumar Trivedi, Center for Political Data

This module:
- Provides some tips for data management
- Introduces the basics of data scraping from websites and PDF files
- Builds basic familiarity with R and RStudio

What is Data Scraping?
A technique for extracting data in large amounts from various sources into local files that are readily available for analysis.
- Data sources: online websites, PDFs, …
- Local data: data with variables and observations, stored in local files or a database in a structured format
We will focus next on getting tabular data in comma-separated value (CSV) format from a website.

What is Web Scraping?
Web scraping (also called web harvesting or web data extraction) is data scraping used to extract data from websites.
Basic idea: an automatically programmed bot uses a browser (or pretends to be one) to fetch a page from a website, extracts the content from the page, and stores the relevant parts in the form we need (e.g. a CSV file).
The bot can also perform actions such as clicking, filling in forms, and selecting from dropdowns on behalf of the user to reach the desired page.
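The fetch → extract → store sequence above can be sketched with the rvest package (assumed installed). The inline HTML snippet and its party/vote numbers are made up for illustration; in a real scrape, `read_html()` would be given a URL instead:

```r
library(rvest)

# In a real scrape this would be read_html("http://..."); an inline
# snippet stands in here so the sketch runs without a network connection.
page <- read_html("<html><body><table>
  <tr><th>Party</th><th>Votes</th></tr>
  <tr><td>INC</td><td>100</td></tr>
  <tr><td>BJP</td><td>120</td></tr>
</table></body></html>")

df <- html_table(page)[[1]]                      # extract the <table> as a data frame
write.csv(df, "results.csv", row.names = FALSE)  # store in the form we need
```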

Applications of Web Scraping
- Adding live and online data to a local database, e.g. election results scraping, parliamentary discussions, assembly member details
- Automating multiple fetches
- Getting around pagination
- Converting a table embedded in a web page to a spreadsheet
- Scraping text data for making NLP-based systems
- Downloading a set of files
- …

Web Scraping Challenges
Some frequently encountered problems:
- Access to the website for bots: rules are governed by robots.txt (if disallowed, try to find an alternate site)
- Login forms (if the data is behind a password)
- Session cookies/state: an exact sequence of page loads must be followed (the direct URL is not reachable)
- Getting past CAPTCHAs: no easy solution
- Structural changes: the scraper needs to be kept up to date
- Dynamic websites: content is not directly in the page
- IP blocking/throttling
Some iteration is usually needed to understand the problems posed by a website.
E-Data Projects <-> ELM 1
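The robots.txt rules in the first bullet can be checked before scraping. Below is a deliberately simplified base-R sketch (the rules shown are a made-up sample, and a real check should use the robotstxt package, which handles per-agent rules and the full spec):

```r
# A downloaded robots.txt, one line per rule (sample content, not a real site's)
robots <- c("User-agent: *",
            "Disallow: /private/",
            "Disallow: /admin/")

# Collect the disallowed path prefixes, then test a path against them
disallowed <- sub("^Disallow: *", "", grep("^Disallow:", robots, value = TRUE))
path_ok <- function(path) !any(startsWith(path, disallowed))

path_ok("/ConstituencywiseS1034.htm")  # TRUE: fine to fetch
path_ok("/private/data.csv")           # FALSE: respect the site's rules
```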

Web Scraping Challenges
Inconsistencies encountered in the scraped data:
- Misaligned fields
- Missing/not-available data
- Mismatched data types
- Data entry errors
Carefully check for problems and inconsistencies after scraping.

Ways to Scrape Data
- Human copy-paste
- Text pattern matching
- API interface
- DOM (Document Object Model) parsing
For this module we will use the DOM parsing approach.
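Of these, text pattern matching can be illustrated with base R alone: regular expressions pull values straight out of the raw HTML, without building a DOM. The snippet and its candidate/party/votes layout are made up for illustration:

```r
# Raw HTML with a repeating candidate/party/votes cell pattern
html <- "<td>RAMESH KUMAR</td><td>INC</td><td>54321</td>
         <td>SUNITA DEVI</td><td>BJP</td><td>48765</td>"

# Match every <td>...</td> cell, then strip the tags
cells  <- regmatches(html, gregexpr("<td>[^<]*</td>", html))[[1]]
values <- gsub("</?td>", "", cells)

# Every third value is a vote count
votes <- as.integer(values[seq(3, length(values), by = 3)])
```

Pattern matching like this is brittle (any markup change breaks the regex), which is one reason the module prefers DOM parsing.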

Prerequisites: Scripting Languages
Languages for data analysis: R / Python
Integrated development environment: RStudio (for R)
Basic data handling in R: https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
Hands-on tutorials: https://www.datacamp.com/courses/free-introduction-to-r
Moving from Excel to R:
- https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users
- https://trendct.org/2015/06/12/r-for-beginners-how-to-transition-from-excel-to-r/
Scraping package: rvest
> install.packages("rvest")
> library(rvest)

HTML and CSS

Document Object Model
<!DOCTYPE html>
<html>
  <head>
    <title>My title</title>
  </head>
  <body>
    <h1>My header</h1>
    <a href="">My link</a>
  </body>
</html>
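A document like the one above can be parsed as a DOM with rvest (assumed installed), reaching elements by tag name rather than by pattern matching. The snippet below mirrors the slide's example, with a made-up URL filling the empty href so there is an attribute to extract:

```r
library(rvest)

doc <- read_html("<html><head><title>My title</title></head><body>
  <h1>My header</h1>
  <a href='https://example.com'>My link</a>
</body></html>")

# Navigate the parsed tree by CSS selector / tag name
header <- doc %>% html_node("h1") %>% html_text()   # text inside <h1>
link   <- doc %>% html_node("a")  %>% html_attr("href")
```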

Browser Extension
SelectorGadget: https://selectorgadget.com/ (Chrome)
> url <- "https://www.imdb.com/chart/top?ref_=nv_mv_250_6"
> page <- read_html(url)
> titles <- html_nodes(page, "<cssSelector>") %>% html_text()

Scraping a Webpage
> url <- "http://eciresults.nic.in/ConstituencywiseS1034.htm?ac=34"
Packages: dplyr, rvest, readr, data.table
Reading the webpage:
> page <- read_html(url)
Extracting tabular information:
> tbl <- html_nodes(page, "tr:nth-of-type(n+4)")
Extracting result status:
> t2 <- html_nodes(page, "tr:nth-of-type(1)") %>%
    html_nodes("td:nth-of-type(1)") %>%
    html_text()
> print(paste(t2[10], t2[11]))

Scraping a Webpage
Getting the constituency name:
> const_name <- unlist(strsplit(t2[10], " - "))
> const_name <- const_name[2]
Getting electoral data:
> v <- tbl[2:(length(tbl) - 3)]
> Names <- v %>% html_nodes("td:nth-of-type(1)") %>% html_text()
> Parties <- v %>% html_nodes("td:nth-of-type(2)") %>% html_text()
> Votes <- v %>% html_nodes("td:nth-of-type(3)") %>% html_text()
> Votes <- as.integer(as.character(Votes))
Structuring the data as a table:
> dt <- data.frame(Names, Parties, Votes)
> names(dt) <- c("Candidate", "Party_Name", "Votes")
> dt$Constituency_No <- 34
> dt$Constituency_Name <- const_name

Scraping a Webpage
The result has now been scraped for one constituency. To scrape them all, run a for loop over the constituency numbers:
> url <- paste0(urlprefix, stateno, i, ".htm?ac=", i)
Check that the format is consistent across pages.
Challenges: dynamic web pages.
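The loop can be sketched as below. The state number and the count of 90 constituencies are illustrative assumptions, and the commented lines stand for the per-page scraping steps from the previous slides:

```r
urlprefix <- "http://eciresults.nic.in/ConstituencywiseS"
stateno   <- 10

pages <- vector("list", 90)
for (i in 1:90) {
  url <- paste0(urlprefix, stateno, i, ".htm?ac=", i)
  # pages[[i]] <- scrape_one(url)  # the read_html/html_nodes steps above
  # Sys.sleep(1)                   # be polite: throttle requests
  pages[[i]] <- url                # stand-in so the sketch runs offline
}
# result <- do.call(rbind, pages)  # one table, if every page's format matches
```

The `Sys.sleep()` throttle matters in practice: fetching dozens of pages back-to-back is a common trigger for the IP blocking mentioned earlier.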

Automating a Webpage
R package: RSelenium
> install.packages("RSelenium")
> library(RSelenium)
Start the Selenium server:
> checkForServer()
> startServer()
Connect to the server and open the page:
> remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L)
> remDr$open()
> remDr$navigate("http://haryanaassembly.gov.in/SearchMLAInformation.aspx")
> remDr$getStatus()

Scraping a PDF
R package: pdftools
> install.packages("pdftools")
> library(pdftools)
Read a PDF:
> text <- pdf_text("<pathtofile>")
Read scanned documents/images: tesseract
> install.packages("tesseract")
> library(tesseract)
> text <- ocr(pdf_convert("<pathtofile>", dpi = 600))
Online tools: https://www.pdftoexcel.com/