1 Data Scraping
Summer School 2018
Mohit Kumar Trivedi, Center for Political Data

2 This module
Provides some tips for data management
Introduces the basics of data scraping from websites and PDF files
Builds basic familiarity with R and RStudio

3 What is Data Scraping?
A technique for extracting data in large amounts from various sources into local files readily available for analysis.
Data sources: online websites, PDFs, ...
Local data: data with variables and observations stored in local files or a database in a structured format.
Next we will focus on getting tabular data from a website into comma-separated value (CSV) format.

4 What is Web Scraping?
Web scraping, web harvesting, or web data extraction is data scraping used to extract data from websites.
Basic idea: an automated bot uses a browser (or pretends to be a browser) to get a page from a website, extracts the content from the page, and stores the relevant parts in the form we need (e.g. a CSV file).
The bot can also perform actions such as clicking, filling forms, or selecting from dropdowns on behalf of the user to get to the desired page.
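A rough sketch of that idea in R with rvest (the URL and output filename below are placeholders, not from the slides):

library(rvest)

url  <- "https://example.com/results.htm"           # placeholder page containing an HTML table
page <- read_html(url)                              # fetch the page the way a browser would
tbl  <- html_table(html_node(page, "table"))        # extract the first table as a data frame
write.csv(tbl, "results.csv", row.names = FALSE)    # store the relevant part as a CSV file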

5 Applications of Web Scraping
Adding live and online data to a local database, e.g.:
  Election results scraping
  Parliamentary discussions
  Assembly member details
Automating multiple fetches
Getting around pagination
Converting a table embedded in a web page to a spreadsheet
Scraping text data for making NLP-based systems
Downloading a set of files (a small example follows below)
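For instance, downloading a set of files can be automated with base R alone; the URL pattern and filenames here are placeholders:

base_url <- "https://example.com/reports/report_"   # placeholder URL pattern
for (i in 1:5) {
  download.file(paste0(base_url, i, ".pdf"),
                destfile = paste0("report_", i, ".pdf"),
                mode = "wb")                         # binary mode so PDFs are not corrupted
}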

6 Web Scraping Challenges
Some frequently encountered problems:
Access to the website for bots – rules governed by robots.txt (if disallowed, try to find an alternate site; a quick robots.txt check is sketched after this list)
Login forms (if the data is behind a password)
Session cookies/state – the exact sequence of page loads must be followed (the direct URL is not reachable)
Getting past captchas – no easy solution
Structural changes – the scraper needs to be kept up to date
Dynamic websites – content is not directly in the page
IP blocking/throttling
Some iteration is needed to understand the problems posed by a website
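A quick way to inspect a site's robots.txt rules before scraping (example.com is a placeholder domain):

rules <- readLines("https://example.com/robots.txt")        # fetch the published crawling rules
rules[grepl("^(User-agent|Disallow|Allow)", rules)]         # the lines that matter for bots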

7 Web Scraping Challenges
Inconsistencies encountered in the scraped data:
Misaligned fields
Missing / not-available data
Mismatched data types
Data entry errors
Carefully check for problems and inconsistencies after scraping (a few quick checks are sketched below)
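A few quick sanity checks in R after scraping, assuming the scraped rows have already been assembled into a data frame called dt with Votes and Party_Name columns as in slide 15:

str(dt)                                            # check that each column has the expected type
colSums(is.na(dt))                                 # count missing values per column
sum(is.na(suppressWarnings(as.integer(dt$Votes)))) # rows where Votes could not be parsed as a number
table(dt$Party_Name)                               # spot data-entry variants of the same party name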

8 Ways to Scrape Data
Human copy-paste
Text pattern matching
API interface
DOM parsing (Document Object Model)
For this module we will use the DOM parsing approach (a small text-pattern-matching contrast is sketched below)
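For contrast, a tiny text-pattern-matching example in base R on a toy HTML fragment; DOM parsing, used in the rest of the module, is usually more robust:

html <- '<td>Votes: 1,234</td><td>Votes: 567</td>'        # toy HTML fragment
m <- regmatches(html, gregexpr("Votes: [0-9,]+", html))   # pull out matches with a regular expression
unlist(m)                                                 # "Votes: 1,234" "Votes: 567"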

9 (image-only slide)

10 Pre-requisites: Scripting Languages
Languages for data analysis: R / Python
Integrated development environment: RStudio (for R)
Basic data handling in R: hands-on tutorials; moving from Excel to R
Scraping package: rvest
> install.packages("rvest")
> library(rvest)

11 HTML and CSS

12 Document Object Model
<!DOCTYPE html>
<html>
<title>My title</title>
<body>
<h1>My header</h1>
<a href="">My link</a>
</body>
</html>
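The same snippet can be parsed in R to see the DOM idea in action (read_html also accepts a string of HTML):

library(rvest)

doc <- read_html('<html><title>My title</title><body>
                  <h1>My header</h1><a href="">My link</a></body></html>')
html_text(html_node(doc, "h1"))    # "My header"
html_text(html_node(doc, "a"))     # "My link"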

13 Browser Extension
SelectorGadget (Chrome)
> url <- "<pageURL>"
> page <- read_html(url)
> titles <- html_nodes(page, "<cssSelector>") %>% html_text()
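Continuing from the page object above, attribute values such as link URLs can be extracted with html_attr; the CSS selector is again a placeholder found with SelectorGadget:

links <- html_nodes(page, "<cssSelector>") %>%   # placeholder selector for the links of interest
  html_attr("href")                              # extract the href attribute instead of the text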

14 Scraping a webpage
> url <- "<resultsPageURL>"
Packages: dplyr, rvest, readr, data.table
Reading the webpage:
> page <- read_html(url)
Extracting tabular information:
> tbl <- html_nodes(page, "tr:nth-of-type(n+4)")
Extracting result status:
> t2 <- html_nodes(page, "tr:nth-of-type(1)") %>% html_nodes("td:nth-of-type(1)") %>% html_text()
> print(paste(t2[10], t2[11]))
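If the page exposes a well-formed <table>, rvest can also convert it to a data frame in one step, which is sometimes simpler than selecting rows and cells by hand (a sketch, not part of the original slides):

raw_tbl <- html_table(html_node(page, "table"))   # first <table> on the page as a data frame
head(raw_tbl)                                     # inspect before cleaning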

15 Scraping a webpage
Getting the constituency name:
> const_name <- unlist(strsplit(t2[10], " - "))
> const_name <- const_name[2]
Getting electoral data:
> v <- tbl[2:(length(tbl)-3)]
> Names <- v %>% html_nodes("td:nth-of-type(1)") %>% html_text()
> Parties <- v %>% html_nodes("td:nth-of-type(2)") %>% html_text()
> Votes <- v %>% html_nodes("td:nth-of-type(3)") %>% html_text()
> Votes <- as.integer(as.character(Votes))
Structuring the data as a table:
> dt <- data.frame(Names, Parties, Votes)
> names(dt) <- c("Candidate", "Party_Name", "Votes")
> dt$Constituency_No <- 34
> dt$Constituency_Name <- const_name
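The assembled data frame can then be written out, e.g. with readr (one of the packages listed on the previous slide); the filename is a placeholder:

library(readr)
write_csv(dt, "ac_34_results.csv")   # save the scraped constituency table as a CSV file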

16 Scraping a webpage
Result scraped for one constituency.
Run a for loop over all constituency numbers (see the sketch below):
> url <- paste0(urlprefix, stateno, i, ".htm?ac=", i)
Check format consistency.
Challenges: dynamic web pages.
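A sketch of that loop, assuming urlprefix and stateno are defined as on the slide; scrape_one_ac is a hypothetical helper wrapping the steps from slides 14-15, and the number of constituencies is a placeholder:

all_dt <- list()
for (i in 1:234) {                                   # placeholder number of constituencies
  url <- paste0(urlprefix, stateno, i, ".htm?ac=", i)
  all_dt[[i]] <- scrape_one_ac(url, i)               # hypothetical helper: slides 14-15 steps
  Sys.sleep(1)                                       # be polite; avoid IP blocking/throttling
}
results <- do.call(rbind, all_dt)                    # one table for the whole state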

17 Automating a webpage
R package: RSelenium
> install.packages("RSelenium")
> library(RSelenium)
Start the Selenium server:
> checkForServer()
> startServer()
Connect to the web server:
> remDr <- remoteDriver(remoteServerAddr = "<serverAddress>")
> remDr$open()
> remDr$getStatus()
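Once connected, the driver can navigate and interact with the page before handing the rendered HTML back to rvest; the URL and CSS selector below are placeholders:

remDr$navigate("<pageURL>")                                   # load the page in the browser
btn <- remDr$findElement(using = "css selector", "<button>")  # locate an element to interact with
btn$clickElement()                                            # perform the click
page <- read_html(remDr$getPageSource()[[1]])                 # parse the rendered page with rvest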

18 Scraping a PDF
R package: pdftools
> install.packages("pdftools")
> library(pdftools)
Read a PDF:
> text <- pdf_text("<pathtofile>")
Read scanned documents / images: tesseract
> install.packages("tesseract")
> library(tesseract)
> text <- ocr(pdf_convert("<pathtofile>", dpi = 600))
Online tools are also available.
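pdf_text returns one character string per page, so a common next step is splitting pages into lines before pattern matching; the path and search term are placeholders:

text  <- pdf_text("<pathtofile>")           # one element per page
lines <- strsplit(text[1], "\n")[[1]]       # split the first page into individual lines
grep("Total", lines, value = TRUE)          # e.g. pick out lines containing "Total"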

