Published by Cecil Hood. Modified over 9 years ago.
1
WEB TEXT
DAY 34 - 11/14/14
LING 3820 & 6820 Natural Language Processing
Harry Howard, Tulane University
2
Course organization
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
14-Nov-2014, NLP, Prof. Howard, Tulane University
3
Open Spyder
4
Twitter Review
5
Finding text on the web
6
http://sethgodin.typepad.com/
7
Firefox: Tools > Web Developer > Page Source
Safari: Preferences > Advanced > Show Develop menu, then Develop > Show Page Source
If someone asked you how to do something …. By all means, you still need pictures, even video. But there's nothing to replace the specificity that comes from the alphabet. Use labels. Use words.
8
We need requests
% pip install feedparser
% pip install BeautifulSoup4
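Before running the scripts, it may help to confirm that the packages can actually be imported. The following is a minimal sketch using only the standard library. The helper name `missing_packages` and the `REQUIRED` table are hypothetical, introduced here for illustration. Note that BeautifulSoup4 is imported under the name `bs4`.

```python
import importlib.util

# Import name -> name given to pip. The two can differ (bs4 vs. BeautifulSoup4).
# This table is an illustrative assumption, not part of the original slides.
REQUIRED = {
    "requests": "requests",
    "feedparser": "feedparser",
    "bs4": "BeautifulSoup4",
}


def missing_packages(required):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name
            for import_name, pip_name in required.items()
            if importlib.util.find_spec(import_name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Still to install: " + ", ".join(missing))
else:
    print("All packages are importable.")
```

Anything reported as still to install can then be added with pip as shown on the slide.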
9
Get the text
    import requests
    from bs4 import BeautifulSoup

    url = 'http://sethgodin.typepad.com/'
    # Download the page and keep the HTML as text.
    html = requests.get(url).text
    # Name the parser explicitly so BeautifulSoup does not have to guess one.
    soup = BeautifulSoup(html, 'html.parser')
    # Print the text of the first <div class="entry-body">.
    print(soup.find("div", {"class": "entry-body"}).text)
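To see what the search for the "entry-body" div is doing, here is a rough offline sketch that does the same job on a small inline page. It swaps in `html.parser` from the standard library as a stand-in for BeautifulSoup, so it runs without a network connection or extra packages. The class `EntryBodyExtractor`, the function `entry_body_text`, and the sample HTML are all hypothetical, and the sketch assumes every non-void tag is properly closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class EntryBodyExtractor(HTMLParser):
    """Collect the text inside the first <div class="entry-body">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the target div (0 = outside)
        self.found = False  # True once the target div has been opened
        self.done = False   # True once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if "entry-body" in classes:
                self.depth = 1
                self.found = True

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def entry_body_text(html):
    """Return the text of the first entry-body div, or None if there is none."""
    parser = EntryBodyExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip() if parser.found else None


# Made-up page with two posts; only the first one should be returned.
SAMPLE_HTML = """
<html><body>
  <div class="entry-header">Title</div>
  <div class="entry-body"><p>Use labels. Use <b>words</b>.</p></div>
  <div class="entry-body"><p>Second post.</p></div>
</body></html>
"""

print(entry_body_text(SAMPLE_HTML))
```

Returning None when no matching div exists makes the failure case explicit. With BeautifulSoup, `find` likewise returns None when nothing matches, so calling `.text` on the result would raise an AttributeError on a page without an "entry-body" div.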
10
Install feedparser by hand
https://pypi.python.org/pypi/feedparser
Click on the Downloads button.
Choose the .zip file.
$ cd /Users/harryhow/Downloads/feedparser-5.1.3
$ python setup.py install
11
Get the RSS feed
    from bs4 import BeautifulSoup
    import feedparser

    url = 'feed://feeds.feedblitz.com/sethsblog'
    fp = feedparser.parse(url)
    print("Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title))
    # Keep the title, plain-text content, and link of each entry.
    blog_posts = []
    for e in fp.entries:
        blog_posts.append({'title': e.title,
                           'content': BeautifulSoup(e.content[0].value, 'html.parser').get_text(),
                           'link': e.links[0].href})
    print(blog_posts[0]['content'])
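The loop above builds one dictionary per post. As a rough illustration of where those three fields come from, the sketch below reads a small made-up RSS 2.0 document with the standard library (`xml.etree.ElementTree` and `html.parser`) in place of feedparser and BeautifulSoup. The sample feed, `parse_feed`, and `strip_tags` are hypothetical, and the sketch assumes the post body is stored in a `content:encoded` element, which not every feed uses.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespace of the RSS "content" module, which defines <content:encoded>.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"

# Made-up feed for illustration; not the real contents of any blog.
SAMPLE_RSS = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Sample blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <content:encoded><![CDATA[<p>Hello, <b>world</b>.</p>]]></content:encoded>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <content:encoded><![CDATA[<p>Caf&eacute; notes.</p>]]></content:encoded>
    </item>
  </channel>
</rss>
"""


class _TextOnly(HTMLParser):
    """Keep the text of an HTML fragment and drop the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the plain text of an HTML fragment."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def parse_feed(xml_text):
    """Return (feed title, list of post dictionaries) for an RSS 2.0 string."""
    channel = ET.fromstring(xml_text).find("channel")
    feed_title = channel.findtext("title", default="")
    posts = []
    for item in channel.findall("item"):
        body = item.findtext(CONTENT_NS + "encoded", default="")
        posts.append({"title": item.findtext("title", default=""),
                      "content": strip_tags(body),
                      "link": item.findtext("link", default="")})
    return feed_title, posts


feed_title, blog_posts = parse_feed(SAMPLE_RSS)
print("Fetched %s entries from '%s'" % (len(blog_posts), feed_title))
print(blog_posts[0]["content"])
```

In practice feedparser is the better choice, since it copes with the several feed formats and the malformed XML found on the web; the sketch only shows the shape of the data.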
12
Next time
Something else, maybe a quiz