590 Web Scraping – Test 2 Review


Topics
    Removing superfluous garbage
    Pruning common words
    Inconsistent data
Readings: Text – Chapter 7, revisited
April 4, 2017

Lecture Titles

Test 2 Sample Questions – Lec23
What do the statements do?
    input = re.sub('\n+', " ", input).lower()
Write a statement to:
    Remove footnote references like [23] from text
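
As a study aid, here is a minimal sketch of both operations using only the standard `re` module. The sample string `raw` and the footnote pattern `\[[0-9]+\]` are assumptions chosen for illustration; the lecture's intended answer may differ.

```python
import re

# Hypothetical sample text with repeated newlines and footnote markers.
raw = "Web scraping[23] is fun.\n\n\nIt Also Works[4]\nhere."

# Collapse each run of newlines into a single space, then lowercase the result.
flattened = re.sub(r"\n+", " ", raw).lower()

# Remove bracketed numeric footnote references such as [23].
cleaned = re.sub(r"\[[0-9]+\]", "", flattened)

print(cleaned)
```

The brackets are escaped because `[` and `]` otherwise delimit a character class in a regular expression.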

Test 2 Sample Questions – Lec22
What do the statements do? Explain in detail.
    x = bsObj.find("div", {"id":"y"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
    driver.implicitly_wait(5)
Write a statement to:
    Remove footnote references like [23] from text
Explain how to connect to /a/b/c/phantomjs
Give a parse function that just logs the site using a scrapy function
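
To see what the `href` pattern in the first statement accepts, the sketch below applies the same regular expression to a hand-written list of link targets. The `hrefs` list is an assumption made up for illustration, and the example uses plain `re` rather than BeautifulSoup so it runs on its own.

```python
import re

# Same pattern as in the question: starts with /wiki/ and contains no colon afterwards.
pattern = re.compile("^(/wiki/)((?!:).)*$")

# Hypothetical link targets of the kind found on a Wikipedia page.
hrefs = [
    "/wiki/Python_(programming_language)",
    "/wiki/Category:Programming",
    "/wiki/File:Logo.png",
    "https://example.org/wiki/Python",
    "/w/index.php?title=Python",
]

# Keep only the targets the pattern matches.
article_links = [h for h in hrefs if pattern.search(h)]
print(article_links)
```

The negative lookahead `(?!:)` rejects any path containing a colon, which filters out namespace pages such as `Category:` and `File:`.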

Test 2 Sample Questions – Lec21
What do the statements do? (Each statement is considered individually, not collectively.)
    grammar = r"NP: {<[CDJNP].*>+}"
    cp = nltk.RegexpParser(grammar)
    cp.evaluate(test_sents)
What is meant by IOB accuracy? Precision, Recall, F

Test 2 Sample Questions – Lec20
What are tag patterns matched against?
Give three things that match <DT>?<JJ>*<NN>, illustrating as much breadth in the matches as possible.
What is each of the following? (What features are used?)
    the unigramChunker
    the bigramChunker
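
The following is a simplified stand-in for tag-pattern matching, not NLTK's own implementation. It assumes a tag sequence can be written as one string such as `<DT><JJ><NN>` and tested with an ordinary regular expression; the helper `matches` and the example phrases are hypothetical.

```python
import re

# Plain-regex imitation of the tag pattern <DT>?<JJ>*<NN>.
TAG_PATTERN = re.compile(r"(<DT>)?(<JJ>)*(<NN>)")

def matches(tags):
    """Return True if the whole sequence of POS tags fits the pattern."""
    tag_string = "".join(f"<{t}>" for t in tags)
    return TAG_PATTERN.fullmatch(tag_string) is not None

# Hypothetical phrases with hand-assigned POS tags.
examples = {
    "dog": ["NN"],
    "the dog": ["DT", "NN"],
    "the big brown dog": ["DT", "JJ", "JJ", "NN"],
    "big dogs": ["JJ", "NNS"],
}
results = {phrase: matches(tags) for phrase, tags in examples.items()}
print(results)
```

The point of the sketch is that the pattern is applied to the part-of-speech tags, not to the words themselves, so "big dogs" fails only because its noun is tagged NNS rather than NN.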

Test 2 Sample Questions – Lec19
Is NLTK more like scrapy, Selenium, or BeautifulSoup, and why?
What is meant by the acronym POS?
Using the function ngrams(list, n) and FreqDist, write a section of code that will print the most frequent trigram in a string named "text".
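
One possible shape for the trigram question is sketched below. To keep it runnable without NLTK installed, it assumes a local `ngrams` helper and uses `collections.Counter` in place of `FreqDist` (which is a `Counter` subclass and also offers `most_common`). The sample `text` and the whitespace tokenization are assumptions.

```python
from collections import Counter

def ngrams(sequence, n):
    """Return successive n-item tuples from sequence (stand-in for nltk's ngrams)."""
    items = list(sequence)
    return zip(*(items[i:] for i in range(n)))

# Hypothetical input string.
text = "the cat sat on the mat the cat sat down"

# Naive whitespace tokenization; a real tokenizer would handle punctuation.
tokens = text.lower().split()

# Count every trigram and pick the most frequent one.
freq = Counter(ngrams(tokens, 3))
top_trigram, count = freq.most_common(1)[0]
print(top_trigram, count)
```

With NLTK available, replacing the helper with `nltk.util.ngrams` and `Counter` with `nltk.FreqDist` should give the same structure.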

Test 2 Sample Questions – Lec18
Google page rank is based on keywords and _____.
What is meant by inverse document frequency, and what is the logic behind this metric?
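
A small sketch of inverse document frequency follows, assuming the common definition idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number containing the term. Smoothed variants exist; the toy `documents` list and the choice to return 0.0 for an unseen term are assumptions for illustration.

```python
import math

# Hypothetical four-document collection.
documents = [
    "python web scraping with scrapy",
    "web crawling and indexing",
    "the web is large",
    "python regular expressions",
]

def idf(term, docs):
    """Inverse document frequency: log(N / number of docs containing term)."""
    containing = sum(1 for d in docs if term in d.split())
    # Returning 0.0 for unseen terms is an arbitrary choice made for this sketch.
    return math.log(len(docs) / containing) if containing else 0.0

for term in ("web", "python", "scrapy"):
    print(term, round(idf(term, documents), 3))
```

A term that appears in most documents ("web") scores low, while a rare term ("scrapy") scores high, which is the logic behind the metric: rare terms discriminate between documents better than common ones.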

Test 2 Sample Questions – Lec17
What do the statements do?
    xp = lambda x: response.xpath(x).extract_first()
Write xpaths to select:
Can you use BeautifulSoup with all of the other libraries that we have talked about?
Which library that we have discussed would be least likely to work with scrapy?
What are the major components of the Scrapy architecture?
Is it synchronous or asynchronous? (Answer: asynchronous)
What routines from spiders interact with the engine?

Test 2 Sample Questions – Lec16
In the scrapy shell, what is the main object returned from a request?
Why is a scraping project "fragile," in that it can work one day and then the same code might not work the next day?
What should you do to lessen this fragility? (Answer: write code to be as flexible and tolerant of webpage format changes as possible.)
What is the crawl template for genspider?
What does the parse routine on slide 18 of Lecture 16 do?

Test 2 Sample Questions – Lec15
What is a Named Entity?
What are some (or all discussed) of the categories of Named Entities? <slide 6>
What does IOB stand for?
What shape features can be used to help identify an NE?
What single POS is best for indicating an NE?
Slide 13 shows a general classifier working on a window of words before and after the current word. What features are used as input to the classifier?
Given a data set of IOB-tagged data, explain how to train and test a classifier.
Given a classifier and the results, give formulae in terms of tp, fp, tn, fn for:
    Recall
    Precision
    F
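
The standard definitions can be written out as a short sketch. It assumes the balanced F1 variant of the F measure, and the counts in the usage lines are invented for illustration. Note that tn does not appear in any of the three formulae.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct: tp / (tp + fp)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of actual positives that were found: tp / (tp + fn)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (F1): 2PR / (P + R)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts from a chunker evaluation.
tp, fp, fn = 8, 2, 4
print(precision(tp, fp), recall(tp, fn), f_measure(tp, fp, fn))
```

The zero-denominator guards are a convenience for this sketch; evaluation tools differ in how they report those cases.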

Test 2 Sample Questions – Lec14
What is scrapy? A library, a framework, or both?
What do the following statements do?
    q = response.css("div.quote")[0]
    t = q.css("div.tags a.tag::text").extract()
    t = q.css("span.text::text").extract_first()
Draw the project folder structure generated by "scrapy startproject x".
What are the two main functions in a spider?
What command could be used to generate a skeleton for a spider named "spidy"?
start_urls is used by scrapy to generate what function?

Test 2 Sample Questions – Lec13
How is Selenium different from a standard Python library such as Beautiful Soup?
Explain what driver_init does. The answer "it initializes the driver," while correct, is insufficient in detail.
In the statement
    link = driver.wait.until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, "Sign in to")))
what is the EC?
Explain how to find a button named "SubmitButton" and click it.
What does the code on slide 13 of Lec 13 do?

Test 2 Sample Questions – Lec12 No sample questions from this lecture.

Test 2 Sample Questions – Lec11
What is UTF-8?
    An encoding of characters in 8 bytes
    An encoding of characters in 8 bits
    None of the above
What is the relation of ASCII and UTF-8?
How can you convert from UTF-8 to byte streams that can be written with write? <ch 6 2-getUtf8Text.py>
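
The sketch below illustrates the encoding questions with the standard library only. It is not the textbook's `2-getUtf8Text.py`; the sample strings and the in-memory `io.BytesIO` stream (standing in for a file opened in binary mode) are assumptions.

```python
import io

# UTF-8 is variable width: each character takes between 1 and 4 bytes.
widths = {ch: len(ch.encode("utf-8")) for ch in "aé€😀"}
print(widths)

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
ascii_text = "scrapy"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# Encode a str to bytes, then write the bytes to a binary stream.
text = "naïve café €"
encoded = text.encode("utf-8")
buffer = io.BytesIO()  # stand-in for open(path, "wb")
buffer.write(encoded)
print(buffer.getvalue())
```

Decoding the written bytes with `.decode("utf-8")` recovers the original string.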