WEB TEXT DAY /14/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University
Course organization
14-Nov-2014 NLP, Prof. Howard, Tulane University
The syllabus is under construction.
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
Open Spyder
Twitter Review
Finding text on the web
Firefox: Tools > Web Developer > Page Source
Safari: Preferences > Advanced > Show Develop menu, then Develop > Show Page Source
If someone asked you how to do something …. By all means, you still need pictures, even video. But there's nothing to replace the specificity that comes from the alphabet. Use labels. Use words.
We need requests
% pip install feedparser
% pip install BeautifulSoup4
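After installing, it can help to confirm that Spyder's interpreter actually sees the packages. The following is only a minimal sketch: the helper name check_modules and the list REQUIRED are made up for illustration. Note that BeautifulSoup4 is imported under the name bs4. print is called with parentheses so that the sketch also runs under Python 3.

```python
import importlib

# Import names, not pip names: BeautifulSoup4 is imported as bs4.
REQUIRED = ["requests", "feedparser", "bs4"]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


status = check_modules(REQUIRED)
for name in REQUIRED:
    # One line per module: "ok" if the import worked, "MISSING" otherwise.
    print("%-10s %s" % (name, "ok" if status[name] else "MISSING"))
```

Any module reported as MISSING needs another pip install (or the by-hand install) before the later slides will run.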
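Get the text
import requests
from bs4 import BeautifulSoup
url = ''  # address of the blog post to fetch (not given here)
html = requests.get(url).text  # download the page source as Unicode text
soup = BeautifulSoup(html, 'html.parser')  # parse with Python's built-in HTML parser
# the post itself sits in the <div> whose class is "entry-body"
print soup.find("div", {"class": "entry-body"}).text.encode('utf8')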
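To see what the find() call does without going to the network, the same lookup can be tried on a small piece of HTML typed in by hand. This is a sketch under assumptions: the HTML string below is invented and only imitates a page whose post sits in a div of class "entry-body"; the real page source will be much longer.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, standing in for what requests.get(url).text returns.
html = u"""
<html><body>
  <div class="sidebar">Archives</div>
  <div class="entry-body"><p>Use labels.</p><p>Use words.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
entry = soup.find("div", {"class": "entry-body"})

# Join the text of the paragraphs with a space; fall back to an empty string.
text = entry.get_text(" ", strip=True) if entry is not None else u""
print(text)
```

Because find() returns None when no tag matches, asking for .text directly on its result raises an AttributeError on a page that lacks the div. Checking for None first, as above, avoids that.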
Install feedparser by hand
click on Downloads button
choose .zip file
$ cd /Users/harryhow/Downloads/feedparser
$ python setup.py install
Get the RSS feed
from bs4 import BeautifulSoup
import feedparser
url = 'feed://feeds.feedblitz.com/sethsblog'
fp = feedparser.parse(url)  # download and parse the feed
print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)
blog_posts = []
for e in fp.entries:
    # strip the HTML markup from each post and keep its title, text and link
    blog_posts.append({'title': e.title,
                       'content': BeautifulSoup(e.content[0].value, 'html.parser').get_text().encode('utf8'),
                       'link': e.links[0].href})
print blog_posts[0]['content']  # text of the first post in the feed
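The 'content' step does two things in a row: get_text() drops the HTML tags, and encode('utf8') turns the resulting Unicode string into bytes. A minimal sketch of just that step follows, assuming an invented string in place of e.content[0].value; the non-English character is there to show why the encoding matters.

```python
from bs4 import BeautifulSoup

# Hypothetical value of e.content[0].value for one feed entry.
content_html = u"<p>Caf\u00e9 owners: <b>use words</b>.</p>"

# Step 1: strip the markup, keeping only the text (a Unicode string).
text = BeautifulSoup(content_html, "html.parser").get_text()

# Step 2: encode the Unicode string as UTF-8 bytes.
encoded = text.encode("utf8")

print(repr(text))
print(repr(encoded))
print("%d characters, %d bytes" % (len(text), len(encoded)))
```

The encoded version is one byte longer than the text has characters, because the accented letter takes two bytes in UTF-8.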
Next time
Something else, maybe a quiz