WEB TEXT DAY 34 - 11/14/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.

1 WEB TEXT DAY 34 - 11/14/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization
• http://www.tulane.edu/~howard/LING3820/
• The syllabus is under construction.
• http://www.tulane.edu/~howard/CompCultEN/
• Chapter numbering
• 3.7. How to deal with non-English characters
• 4.5. How to create a pattern with Unicode characters
• 6. Control
14-Nov-2014, NLP, Prof. Howard, Tulane University

3 Open Spyder

4 Twitter Review

5 Finding text on the web

6 http://sethgodin.typepad.com/

7 Firefox: Tools > Web Developer > Page Source
Safari: Preferences > Advanced > Show Develop menu, then Develop > Show Page Source
• If someone asked you how to do something …. By all means, you still need pictures, even video. But there's nothing to replace the specificity that comes from the alphabet. Use labels. Use words.

8 We need
• requests
• % pip install feedparser
• % pip install BeautifulSoup4
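Before going further, it can help to confirm that the three packages are visible to the interpreter that Spyder runs. The following is a minimal sketch, assuming Python 3; the helper name `check_modules` is hypothetical and not part of the course materials. Note that BeautifulSoup4 is imported under the module name `bs4`.

```python
import importlib.util


def check_modules(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # BeautifulSoup4 installs as the module 'bs4'.
    for name, found in check_modules(["requests", "feedparser", "bs4"]).items():
        print(name, "OK" if found else "missing: try pip install")
```

Any package reported as missing can be installed with the pip commands listed on this slide.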

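9 Get the text
    # Written for Python 3: print() handles Unicode, so no .encode('utf8') is needed.
    import requests
    from bs4 import BeautifulSoup

    url = 'http://sethgodin.typepad.com/'
    # Download the page and keep its HTML as text.
    html = requests.get(url).text
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(html, 'html.parser')
    # Print the text of the first <div class="entry-body"> on the page.
    print(soup.find("div", {"class": "entry-body"}).text)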
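To see what "find the first div whose class is entry-body and take its text" amounts to, here is an offline sketch that runs without a network connection or third-party packages. It is not the slide's method: it swaps in the standard library's `html.parser` as a stand-in for BeautifulSoup, and the class name `EntryBodyText`, the function `entry_body_text`, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class EntryBodyText(HTMLParser):
    """Collect the text inside the first <div class="entry-body">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth of <div> tags inside the target div
        self.done = False   # True once the first entry body has been closed
        self.chunks = []    # pieces of text found inside the target div

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            self.depth += 1   # a nested div inside the entry body
        elif "entry-body" in (dict(attrs).get("class") or "").split():
            self.depth = 1    # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def entry_body_text(html):
    """Return the text of the first entry-body div, or '' if there is none."""
    parser = EntryBodyText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical page fragment: a menu, then two posts.
sample = ('<div class="nav">menu</div>'
          '<div class="entry-body"><p>Use labels.</p> <div>Use words.</div></div>'
          '<div class="entry-body">second post</div>')
print(entry_body_text(sample))
```

Only the first matching div is returned, and the menu text and the second post are skipped. On real pages, BeautifulSoup is the more robust choice because it tolerates malformed markup.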

10 Install feedparser by hand
• https://pypi.python.org/pypi/feedparser
• click on the Downloads button
• choose the .zip file
• $ cd /Users/harryhow/Downloads/feedparser-5.1.3
• $ python setup.py install

11 Get the RSS feed
    # Written for Python 3: print() handles Unicode, so no .encode('utf8') is needed.
    from bs4 import BeautifulSoup
    import feedparser

    url = 'feed://feeds.feedblitz.com/sethsblog'
    fp = feedparser.parse(url)
    print("Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title))

    # Collect each post's title, plain-text content, and link.
    blog_posts = []
    for e in fp.entries:
        blog_posts.append({
            'title': e.title,
            # Strip the HTML markup from the entry body.
            'content': BeautifulSoup(e.content[0].value, 'html.parser').get_text(),
            'link': e.links[0].href})
    print(blog_posts[0]['content'])
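The shape of the `blog_posts` list can be explored offline with a small sketch. This is not feedparser: it uses the standard library's `xml.etree.ElementTree` on a made-up RSS 2.0 string, and removes tags with a crude regular expression instead of BeautifulSoup. The names `SAMPLE_RSS` and `posts_from_rss` and the example.com link are hypothetical. Real feeds vary (Atom, `content:encoded`, and so on), which is the variation feedparser is designed to smooth over.

```python
import re
import xml.etree.ElementTree as ET
from html import unescape

# A made-up, minimal RSS 2.0 feed with one item whose body is escaped HTML.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>&lt;p&gt;Use &lt;b&gt;labels&lt;/b&gt;. Use words.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>"""


def posts_from_rss(xml_text):
    """Return (feed title, list of dicts with 'title', 'content', 'link')."""
    channel = ET.fromstring(xml_text).find("channel")
    posts = []
    for item in channel.findall("item"):
        html = item.findtext("description", default="")
        # Crude tag removal for illustration only; get_text() is more robust.
        text = unescape(re.sub(r"<[^>]+>", "", html))
        posts.append({"title": item.findtext("title", default=""),
                      "content": text,
                      "link": item.findtext("link", default="")})
    return channel.findtext("title"), posts


feed_title, blog_posts = posts_from_rss(SAMPLE_RSS)
print("Fetched %s entries from '%s'" % (len(blog_posts), feed_title))
print(blog_posts[0]["content"])
```

Each entry ends up as a dictionary with the same three keys used on the slide, so later processing steps can be tried on this sample before running against the live feed.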

12 Next time
• something else, maybe a quiz

