590 Web Scraping – Test 2 Review


1 590 Web Scraping – Test 2 Review
Topics: Removing superfluous garbage; Pruning common words; Inconsistent data
Readings: Text – chapter 7, revisited
April 4, 2017

2 Lecture Titles

3 Test 2 Sample Questions – Lec23
What does the statement do?
input = re.sub('\n+', " ", input).lower()
Write a statement to remove footnote references like [23] from text.
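
As a study aid for the two prompts above, here is a minimal sketch using only Python's standard `re` module. The helper names `normalize_whitespace` and `strip_footnote_refs` are hypothetical, and the footnote pattern assumes references are purely numeric and wrapped in square brackets.

```python
import re

def normalize_whitespace(text):
    # Replace every run of one or more newlines with a single space,
    # then lowercase the whole string (same as the slide's statement).
    return re.sub(r"\n+", " ", text).lower()

def strip_footnote_refs(text):
    # Assumption: a footnote reference is one or more digits in brackets, e.g. [23].
    return re.sub(r"\[\d+\]", "", text)

if __name__ == "__main__":
    print(normalize_whitespace("Hello\n\nWorld"))
    print(strip_footnote_refs("Python[23] is used[4]."))
```

Note that the first function only collapses newlines; runs of ordinary spaces are left untouched.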

4 Test 2 Sample Questions – Lec22
What do the statements do? Explain in detail.
x = bsObj.find("div", {"id":"y"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
driver.implicitly_wait(5)
Write a statement to remove footnote references like [23] from text.
Explain how to connect to /a/b/c/phantomjs.
Give a parse function that just logs the site using a scrapy function.
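
The regular expression in the first statement can be explored on its own, without BeautifulSoup. The sketch below applies the same pattern to a few made-up href strings; the function name `is_article_link` and the sample paths are illustrative assumptions, not part of the lecture code.

```python
import re

# Same pattern as on the slide: the href must start with /wiki/ and
# every following character must not be a colon.
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

def is_article_link(href):
    # True when the whole href matches the pattern.
    return ARTICLE_LINK.match(href) is not None

if __name__ == "__main__":
    for href in ["/wiki/Python", "/wiki/File:Logo.png", "/w/index.php"]:
        print(href, is_article_link(href))
```

The negative lookahead `(?!:)` is what rejects paths such as `/wiki/File:Logo.png`, since it fails as soon as the next character is a colon.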

5 Test 2 Sample Questions – Lec21
What do the statements do? (Each statement is considered individually, not collectively.)
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
cp.evaluate(test_sents)
What is meant by IOB accuracy? Precision, Recall, F?

6 Test 2 Sample Questions – Lec20
What are tag patterns matched against?
Give three things that match <DT>?<JJ>*<NN>, illustrating as much breadth in the matches as possible.
What features are used by the unigramChunker? By the bigramChunker?
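
To get a feel for which tag sequences match `<DT>?<JJ>*<NN>`, the sketch below imitates tag-pattern matching with the standard `re` module. It is a simplified stand-in, not NLTK's own implementation: it assumes a sequence of POS tags is written as a string such as `<DT><JJ><NN>` and that each `<TAG>` in the pattern is treated as one unit. The helper names are hypothetical.

```python
import re

def tag_pattern_to_regex(tag_pattern):
    # Wrap each <TAG> in a non-capturing group so that ?, * and +
    # apply to the whole tag rather than to the closing bracket.
    return re.compile(re.sub(r"(<[^>]+>)", r"(?:\1)", tag_pattern))

def matches(tag_pattern, tags):
    # Encode the tag sequence as one string and require a full match.
    tag_string = "".join("<" + tag + ">" for tag in tags)
    return tag_pattern_to_regex(tag_pattern).fullmatch(tag_string) is not None

if __name__ == "__main__":
    pattern = "<DT>?<JJ>*<NN>"
    for tags in (["NN"], ["DT", "NN"], ["DT", "JJ", "JJ", "NN"], ["DT", "NNS"]):
        print(tags, matches(pattern, tags))
```

Under these assumptions a bare noun, a determiner plus noun, and a determiner with several adjectives plus noun all match, while `NNS` does not, because the pattern names `NN` exactly.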

7 Test 2 Sample Questions – Lec19
Is NLTK more like scrapy, Selenium or BeautifulSoup, and why?
What is meant by the acronym POS?
Using the function ngrams(list, n) and FreqDist, write a section of code that will print the most frequent trigram in a string named "text".
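
One possible shape for the trigram answer is sketched below using only the standard library, so it runs without NLTK installed. The local `ngrams` function is a stand-in written for this example with the same (list, n) calling convention the question mentions, and `collections.Counter` is swapped in for `FreqDist` (which offers the same `most_common` method). Splitting on whitespace is an assumption about tokenization.

```python
from collections import Counter

def ngrams(items, n):
    # Stand-in for the ngrams(list, n) function named in the question:
    # returns every run of n consecutive items as a tuple.
    items = list(items)
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

def most_frequent_trigram(text):
    # Assumption: lowercase the text and split on whitespace to get tokens.
    tokens = text.lower().split()
    counts = Counter(ngrams(tokens, 3))  # Counter used in place of FreqDist
    if not counts:
        return None  # fewer than three tokens
    return counts.most_common(1)[0]  # (trigram, count)

if __name__ == "__main__":
    text = "the cat sat on the mat the cat sat down"
    print(most_frequent_trigram(text))
```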

8 Test 2 Sample Questions – Lec18
Google page rank is based on keywords and _____.
What is meant by inverse document frequency, and what is the logic behind this metric?
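
For the inverse document frequency question, a small worked sketch may help. It uses one common variant, idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing the term; other variants add smoothing. The function name, the set-of-words document representation, and returning 0.0 for an unseen term are all assumptions made for this example.

```python
import math

def inverse_document_frequency(term, documents):
    # documents: a list of sets of words (an assumed representation).
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc)
    if doc_freq == 0:
        return 0.0  # assumption: treat an unseen term as carrying no weight
    return math.log(n_docs / doc_freq)

if __name__ == "__main__":
    docs = [{"the", "cat"}, {"the", "dog"}, {"the", "fish"}]
    print(inverse_document_frequency("the", docs))
    print(inverse_document_frequency("cat", docs))
```

The logic the sketch illustrates: a word that appears in every document gets a weight of zero, while a word found in few documents gets a larger weight, so rare terms count for more when comparing documents.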

9 Test 2 Sample Questions – Lec17
What does the statement do?
xp = lambda x: response.xpath(x).extract_first()
Write xpaths to select: .
Can you use BeautifulSoup with all of the other libraries that we have talked about?
Which library that we have discussed would be least likely to work with scrapy?
What are the major components of the Scrapy architecture?
Is it synchronous or asynchronous? (Answer: async.)
What routines from spiders interact with the engine?

10 Test 2 Sample Questions – Lec16
In the scrapy shell, what is the main object returned from a request?
Why is a scraping project "fragile," in that it can work one day and then the same code might not work the next day?
What should you do to lessen this fragility? (Answer: write the code to be as flexible and tolerant of webpage format changes as possible.)
What is the crawl template for genspider?
What does the parse routine on slide 18 of Lecture 16 do?

11 Test 2 Sample Questions – Lec15
What is a Named Entity?
What are some of the categories of Named Entities that were discussed? <slide 6>
What does IOB stand for?
What shape features can be used to help identify an NE?
What single POS is best for indicating an NE?
Slide 13 shows a general classifier working on a window of words before and after the current word. What features are used as input to the classifier?
Given a data set of IOB-tagged data, explain how to train and test a classifier.
Given a classifier and its results, give formulae in terms of tp, fp, tn, fn for Recall, Precision and F.
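
The last question can be checked numerically. The sketch below uses the standard definitions: precision = tp / (tp + fp), recall = tp / (tp + fn), and F as the harmonic-mean F-measure, with F1 as the default. The function names and the choice to return 0.0 when a denominator is zero are assumptions of this example.

```python
def precision(tp, fp):
    # Fraction of predicted positives that are correct.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Fraction of actual positives that were found.
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_measure(tp, fp, fn, beta=1.0):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta=1 gives F1.
    p, r = precision(tp, fp), recall(tp, fn)
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0

if __name__ == "__main__":
    tp, fp, fn = 8, 2, 8
    print(precision(tp, fp), recall(tp, fn), f_measure(tp, fp, fn))
```

Note that tn does not appear in any of the three formulas; it would matter for accuracy, which is (tp + tn) / (tp + fp + tn + fn).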

12 Test 2 Sample Questions – Lec14
What is scrapy? A library, a framework or both?
What do the following statements do?
q = response.css("div.quote")[0]
t = q.css("div.tags a.tag::text").extract()
t = q.css("span.text::text").extract_first()
Draw the project folder structure generated by "scrapy startproject x".
What are the two main functions in a spider?
What command could be used to generate a skeleton for a spider named "spidy"?
start_urls is used by scrapy to generate what function?

13 Test 2 Sample Questions – Lec13
How is Selenium different from a standard Python library such as BeautifulSoup?
Explain what driver_init does. The answer "it initializes the driver," while correct, is insufficient in detail.
In the statement link = driver.wait.until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, "Sign in to"))), what is EC?
Explain how to find a button named "SubmitButton" and click it.
What does the code on slide 13 of Lec 13 do?

14 Test 2 Sample Questions – Lec12
No sample questions from this lecture.

15 Test 2 Sample Questions – Lec11
What is UTF-8?
(a) An encoding of characters in 8 bytes
(b) An encoding of characters in 8 bits
(c) None of the above
What is the relation of ASCII and UTF-8?
How can you convert from UTF-8 text to byte streams that can be written with write? <ch 6 2-getUtf8Text.py>
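
The questions above can be explored with a short standard-library sketch. It is not the textbook's 2-getUtf8Text.py; the helper names and the in-memory `io.BytesIO` stream are assumptions chosen so the example runs without touching the network or disk.

```python
import io

def to_utf8_bytes(text):
    # str.encode turns a Python string into a bytes object.
    return text.encode("utf-8")

def write_utf8(stream, text):
    # A binary stream's write() accepts bytes, so encode first.
    data = to_utf8_bytes(text)
    stream.write(data)
    return len(data)

if __name__ == "__main__":
    # ASCII characters keep their single-byte ASCII values in UTF-8.
    print(to_utf8_bytes("A") == "A".encode("ascii"))
    # Non-ASCII characters take more than one byte each.
    print(len(to_utf8_bytes("é")), len(to_utf8_bytes("€")))
    buffer = io.BytesIO()
    print(write_utf8(buffer, "naïve €"), buffer.getvalue())
```

The byte counts show why neither "8 bytes" nor "8 bits" describes UTF-8 in general: it is a variable-width encoding that uses one to four bytes per character.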

