WEB TEXT DAY 34 - 11/14/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.

Course organization
14-Nov-2014, NLP, Prof. Howard, Tulane University

  http://www.tulane.edu/~howard/LING3820/
    The syllabus is under construction.
  http://www.tulane.edu/~howard/CompCultEN/
    Chapter numbering
      3.7. How to deal with non-English characters
      4.5. How to create a pattern with Unicode characters
      6. Control

Open Spyder

Twitter Review

Finding text on the web

http://sethgodin.typepad.com/

Firefox: Tools > Web Developer > Page Source
Safari: Preferences > Advanced > Show Develop menu, then Develop > Show Page Source

  If someone asked you how to do something …. By all means, you still need pictures, even video. But there's nothing to replace the specificity that comes from the alphabet. Use labels. Use words.

We need
  requests
  % pip install feedparser
  % pip install BeautifulSoup4
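Before running the slides' code, it can help to confirm that the three packages are importable. The following is a minimal Python 3 sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of the course material. Note that the BeautifulSoup4 package is imported under the name `bs4`.

```python
import importlib.util


def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


# Import names for the packages listed on the slide
# (BeautifulSoup4 installs a module called bs4).
needed = ["requests", "feedparser", "bs4"]

for name in missing_packages(needed):
    print("Not installed: %s" % name)
```

Any name that is printed still needs to be installed with pip or by hand.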

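Get the text

    # Python 2 syntax (print statement)
    import requests
    from bs4 import BeautifulSoup

    url = 'http://sethgodin.typepad.com/'
    # download the page and keep its HTML as text
    html = requests.get(url).text
    # parse the HTML; naming the parser explicitly avoids guessing
    soup = BeautifulSoup(html, 'html.parser')
    # print the text of the first <div class="entry-body">
    print soup.find("div", {"class": "entry-body"}).text.encode('utf8')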
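To see what the `find("div", {"class": "entry-body"})` step is doing without going to the network, here is a rough Python 3 sketch that pulls the same kind of text out of a small inline page. It swaps in the standard library's `html.parser` for BeautifulSoup, so it is an illustration of the idea rather than the slide's method. The names `EntryBodyText`, `entry_body_text`, and `sample_html` are made up for this example, and unlike `soup.find` it collects every matching `div`, not just the first.

```python
from html.parser import HTMLParser


class EntryBodyText(HTMLParser):
    """Collect the text inside <div class="entry-body"> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # greater than 0 while inside a matching <div>
        self.chunks = []  # pieces of text found so far

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1   # a nested <div> inside the entry body
        elif "entry-body" in classes:
            self.depth = 1    # entering an entry body

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def entry_body_text(html):
    """Return the text of all entry-body divs in an HTML string."""
    parser = EntryBodyText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# A made-up page standing in for the downloaded blog HTML.
sample_html = """
<html><body>
<div class="header">Blog title</div>
<div class="entry-body"><p>Use labels. <b>Use words.</b></p></div>
<div class="footer">Footer text</div>
</body></html>
"""

print(entry_body_text(sample_html))
```

The header and footer text are skipped because they sit outside the `entry-body` div, which is the same reason the slide's code filters on that class.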

Install feedparser by hand
  https://pypi.python.org/pypi/feedparser
  click on the Downloads button
  choose the .zip file
  $ cd /Users/harryhow/Downloads/feedparser-5.1.3
  $ python setup.py install

Get the RSS feed

    # Python 2 syntax (print statement)
    from bs4 import BeautifulSoup
    import feedparser

    url = 'feed://feeds.feedblitz.com/sethsblog'
    # download and parse the feed
    fp = feedparser.parse(url)
    print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)

    # keep the title, plain-text content, and link of each post
    blog_posts = []
    for e in fp.entries:
        blog_posts.append({'title': e.title,
                           'content': BeautifulSoup(e.content[0].value).get_text().encode('utf8'),
                           'link': e.links[0].href})
    print blog_posts[0]['content']
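The loop above turns each feed entry into a small dictionary. As an offline illustration of that data structure, the Python 3 sketch below builds the same `blog_posts` list from an inline RSS string. It uses the standard library's `xml.etree.ElementTree` and `html.parser` in place of feedparser and BeautifulSoup, so it is a stand-in, not the slide's approach. The feed text, the `example.com` links, and the helper names `TextOnly` and `strip_tags` are invented for the example, and it assumes the post body lives in each item's `description` element, which real feeds may not do.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep only the text of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the plain text of an HTML fragment."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# A made-up two-item RSS feed; post bodies are escaped HTML.
sample_rss = """<rss version="2.0">
  <channel>
    <title>Sample blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>&lt;p&gt;Use &lt;b&gt;labels&lt;/b&gt;.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>&lt;p&gt;Use words.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
"""

channel = ET.fromstring(sample_rss).find("channel")

# Same dictionary layout as on the slide: title, content, link.
blog_posts = []
for item in channel.findall("item"):
    blog_posts.append({"title": item.findtext("title"),
                       "content": strip_tags(item.findtext("description")),
                       "link": item.findtext("link")})

print("Fetched %d entries from '%s'" % (len(blog_posts), channel.findtext("title")))
print(blog_posts[0]["content"])
```

Keeping each post as a dictionary means later steps can ask for `post["content"]` by name instead of remembering a position in a tuple.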

Next time
  something else, maybe a quiz
