Flat text 3 Day 8 - 9/16/16 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.



Course organization http://www.tulane.edu/~howard/NLP/ 1.1.7. Schedule of assignments NLP, Prof. Howard, Tulane University 16-Sep-2016

Review

Project Gutenberg
http://www.gutenberg.org/ebooks/28554

Add this code to the function
    # In "textProc.py"
    def gutenLoader(url, name):
        import requests
        try:
            download = requests.get(url).text
        except requests.exceptions.RequestException:
            # Stop here; otherwise 'download' would be undefined below.
            return 'Download failed!'
        # Keep only the text between the Project Gutenberg markers.
        lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
        startIndex = download.index('\n', lineIndex)
        endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
        text = download[startIndex:endIndex]
        with open(name, 'w') as tempFile:
            text = text.encode('utf8')
            tempFile.write(text)
        return 'File was written.'
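The slicing step inside gutenLoader can be tried without a network connection. What follows is a minimal sketch in Python 3 (the slides themselves use Python 2). The helper name extract_body and the sample string are hypothetical, made up for illustration only.

```python
START_MARKER = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END_MARKER = '*** END OF THIS PROJECT GUTENBERG EBOOK'


def extract_body(download):
    """Return the text between the Project Gutenberg start and end markers."""
    # Find the start marker, then move to the end of the line it sits on.
    lineIndex = download.index(START_MARKER)
    startIndex = download.index('\n', lineIndex)
    # The body stops where the end marker begins.
    endIndex = download.index(END_MARKER)
    return download[startIndex:endIndex]


# Made-up sample standing in for a downloaded e-book.
sample = (
    'Some license header\n'
    '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
    'First line of the book.\n'
    'Last line of the book.\n'
    '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
    'Some license footer\n'
)

body = extract_body(sample)
print(repr(body))
```

Note that str.index raises ValueError when a marker is missing, so a text whose markers are worded differently would need its own handling.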

Call the function as before
    (>>> import textProc)
    >>> reload(textProc)
    >>> from textProc import gutenLoader
    >>> print gutenLoader(url, name)

Practices
    Practice 1: try this with another PG text.
    Practice 2: add comments.

5. Flat text
The code for this chapter can be downloaded as nlp5.py, presumably to your Downloads folder, and then moved to pyScripts, where you can open it in Spyder and run it line by line.

Install textract & chardet
http://textract.readthedocs.io/en/latest/installation.html
    Mac
        Install homebrew: http://brew.sh/
        $ brew install poppler antiword unrtf tesseract
    All
        ($ pip install tesseract)
        $ pip install textract
        $ pip install chardet

EPUBs

Convert EPUBs
    >>> from requests import get
    >>> url = 'http://www.gutenberg.org/ebooks/28554.epub.noimages'
    >>> response = get(url)
    >>> type(response)
    <class 'requests.models.Response'>

More about the Response object
    >>> response.headers
    {'content-length': '16922', 'x-varnish': '1988218503', 'x-powered-by': '3',
     'set-cookie': 'session_id=c91e2c01ad330b816664af3600b141ed13f5be94; Domain=.gutenberg.org; expires=Thu, 15 Sep 2016 13:05:50 GMT; Path=/',
     'age': '0', 'server': 'Apache', 'x-connection': 'Close', 'via': '1.1 varnish',
     'x-rate-limiter': 'ratelimiter2.php57', 'date': 'Thu, 15 Sep 2016 12:35:50 GMT',
     'x-frame-options': 'sameorigin', 'content-type': 'application/epub+zip'}
    >>> response.text[:150]
    >>> response.content[:150]
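The content-type header above, application/epub+zip, indicates that the download is a zip archive, that is, binary data rather than text. The sketch below (Python 3, standard library only) builds a tiny stand-in archive in memory, so the difference between bytes and text can be seen without downloading anything. The file name inside the archive and its contents are made up.

```python
import io
import zipfile

# Build a small zip archive in memory as a stand-in for an EPUB download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('chapter1.txt', 'Hypothetical chapter text.')
raw = buffer.getvalue()  # plays the role of response.content

print(type(raw))   # the data is bytes, not str
print(raw[:2])     # zip archives begin with the signature b'PK'

# Reading the archive back requires treating the data as bytes.
with zipfile.ZipFile(io.BytesIO(raw)) as archive:
    names = archive.namelist()
    chapter = archive.read('chapter1.txt').decode('utf8')
print(names, chapter)
```

This is why binary data of this kind is better taken from response.content than from response.text, and why it is written to disk in 'wb' mode.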

Textract expects to read a file from disk
    with open('Wub.epub','wb') as tempFile:
        tempFile.write(response.content)
    from textract import process
    rawText = process('Wub.epub')
    type(rawText)
    from chardet import detect
    detect(rawText)
    len(rawText) # 34361
    rawText[:150]
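A byte string has to be decoded before it can be treated as text, and its length in bytes can differ from its length in characters. A minimal sketch in Python 3, assuming the bytes are UTF-8 (the kind of guess that detect() is meant to check). The sample string is made up and merely stands in for the output of process().

```python
# Made-up stand-in for a byte string such as the one returned by process().
rawText = 'na\u00efve caf\u00e9'.encode('utf8')

# Decoding turns the bytes into a text string.
text = rawText.decode('utf8')

print(type(rawText), len(rawText))  # length counted in bytes
print(type(text), len(text))        # length counted in characters
```

The two accented letters take two bytes each in UTF-8, which is why the two lengths differ.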

Review
    from requests import get
    from textract import process
    url = 'http://www.cwanderson.org/wp-content/uploads/2011/11/Philip-K-Dick-The-Minority-Report.pdf'
    response = get(url)
    with open('MinorityReport.pdf', "wb") as tempFile:
        tempFile.write(response.content)
    rawText = process('MinorityReport.pdf')

PDFs

Download & convert a PDF
    url = 'http://www.cwanderson.org/wp-content/uploads/2011/11/Philip-K-Dick-The-Minority-Report.pdf'
    response = get(url)
    with open('MinorityReport.pdf', "wb") as tempFile:
        tempFile.write(response.content)
    rawText = process('MinorityReport.pdf')

Images

Download & write an image
    from requests import get
    url = 'http://i.stack.imgur.com/t3qWG.png'
    try:
        response = get(url)
    except:
        print 'Download failed!'
    else:
        # Write the file only if the download succeeded.
        with open('ocrTest.png', "wb") as tempFile:
            tempFile.write(response.content)
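The download-then-write pattern used for the EPUB, the PDF and the image can be wrapped in one helper. The sketch below is Python 3 and swaps in urllib.request from the standard library for requests, so that it can be demonstrated offline with a file:// URL. The function name download_to_file and the sample bytes are hypothetical.

```python
import pathlib
import tempfile
import urllib.request


def download_to_file(url, name):
    """Fetch url and write the raw bytes to name; return True on success."""
    try:
        with urllib.request.urlopen(url) as response:
            data = response.read()
    except (OSError, ValueError):
        # URLError is a subclass of OSError; ValueError covers malformed URLs.
        print('Download failed!')
        return False
    # Binary mode, because the data are bytes rather than text.
    with open(name, 'wb') as tempFile:
        tempFile.write(data)
    return True


# Offline demonstration: a local file reached through a file:// URL.
workDir = pathlib.Path(tempfile.mkdtemp())
source = workDir / 'source.png'
source.write_bytes(b'\x89PNG made-up bytes')
target = workDir / 'ocrTest.png'

ok = download_to_file(source.as_uri(), str(target))
missing = download_to_file((workDir / 'missing.png').as_uri(),
                           str(workDir / 'never.png'))
print(ok, missing)
```

Returning early on failure means that nothing is written when nothing was downloaded.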

Try to OCR it
    >>> from textract import process
    >>> rawText = process('ocrTest.png')
    # Switch to Terminal for tesseract
    $ cd /Users/harryhow/Documents/pyScripts
    $ tesseract ocrTest.png ocrText

Q1 stats
          Q1
    MIN   6.0
    AVG   9.3
    MAX  10.0

Next time
    Q2
    Regex