Corpus Linguistics I ENG 617


Corpus Linguistics I ENG 617
Rania Al-Sabbagh
Department of English, Faculty of Al-Alsun (Languages), Ain Shams University
rsabbagh@alsun.asu.edu.eg
Week 10

Installation Prerequisites
We need to download and install the following before we start:
Visual C++ 2015 Build Tools
Editra: Python editor
Notepad++

Corpus Compilation
It is always a good idea to look for a ready-made corpus first, either from sources such as the LDC and ELRA or from individual researchers. Sometimes, however, you have to compile your own corpus. As you compile it, you need to make sure that it meets the criteria of a well-designed corpus. Do you remember what those criteria are? In corpus and computational linguistics, corpus compilation is also referred to as corpus harvesting.

Resources for Corpus Harvesting: Print Books
Depending on your study, you may compile your corpus from print books, online written resources, or audiovisual resources. For print books, check whether a machine-readable text version already exists in one of the following:
Project Gutenberg
Oxford
Internet Archive
If no such version exists, you may need to work from a scanned copy of the book and use an Optical Character Recognition (OCR) program. OCR programs convert scanned images into text files. They are never 100% accurate, so the output needs proofreading, but they save a great deal of typing time. Many free online OCR tools are available.

Resources for Corpus Harvesting: Web as Corpus
When we compile data from online resources, we are using the “Web as Corpus”. The term was coined a few years ago, and there is an entire series of workshops that carries the same name, as well as a special interest group (SIG). Software programs used to compile corpora from the Web are referred to as scrapers, spiders, or crawlers. Can you guess why? Today, we will learn how to scrape texts from news websites, Twitter, and Facebook using Python.

Installing Python
Python is a high-level programming language widely used in corpus and computational linguistics. It comes with many libraries, or modules, that can help us harvest Web-based texts. To start, we need to download Python from here. Double-click the executable file to start the installation. Note the folder in which Python will be installed. To check that Python is correctly installed, open the command prompt and type:
python
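
Once the interpreter starts, a few lines can confirm which version is running and where it lives on disk. The following is a minimal sketch and not part of the original slides; the function name python_version_string is made up for illustration.

```python
import sys


def python_version_string():
    """Return the running interpreter's version as 'major.minor.micro'."""
    info = sys.version_info
    return f"{info.major}.{info.minor}.{info.micro}"


if __name__ == "__main__":
    # Print the version and the path of the interpreter that is running.
    print("Python version:", python_version_string())
    print("Installed at:", sys.executable)
```

The path printed on the second line is the folder to remember, since its "Scripts" subfolder is where pip is found on Windows.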

Harvesting Newspaper Websites 1
Python has a very nice library/module for harvesting articles from Arabic and English newspaper websites. It is called newspaper. To install it, navigate in the command prompt (cmd) to the folder where Python is installed and then into its “Scripts” subfolder. Downloading and installing newspaper is as simple as typing:
pip install newspaper3k
To make sure that the module is correctly installed, run the Python shell and type:
import newspaper
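
The same check can be done without triggering an error message when the module is missing. This is a small sketch using only the standard library; is_installed is a hypothetical helper name, not something the newspaper package provides.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "newspaper" is the import name of the newspaper3k package.
    if is_installed("newspaper"):
        print("newspaper is installed")
    else:
        print("newspaper is missing; run: pip install newspaper3k")
```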

Harvesting Newspaper Websites 2
Now we will run the following script from the command line:
python getNewsArticles.py > getNewsArticles_Output.txt
The script takes as input a list of URLs, one URL per line, like this one. It returns the articles’ titles and texts. Let’s take a closer look at this very basic code to understand it.
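
The course script itself is not reproduced in this transcript, so what follows is only a sketch of what such a script might look like, not the original getNewsArticles.py. It assumes Python 3 and the newspaper3k package, and it assumes the URL list is passed as a command-line argument; the function names read_urls, fetch_article, and harvest are made up for illustration.

```python
import sys


def read_urls(lines):
    """Return the stripped, non-empty URLs from an iterable of text lines."""
    return [line.strip() for line in lines if line.strip()]


def fetch_article(url, language="en"):
    """Download and parse one article; return its (title, text)."""
    # Imported here so the rest of the file works without newspaper3k.
    from newspaper import Article

    article = Article(url, language=language)  # use language="ar" for Arabic
    article.download()
    article.parse()
    return article.title, article.text


def harvest(url_file, language="en"):
    """Print the title and text of every article listed in url_file."""
    with open(url_file, encoding="utf-8") as handle:
        urls = read_urls(handle)
    for url in urls:
        try:
            title, text = fetch_article(url, language)
        except Exception as error:  # e.g. a failed download
            print(f"Skipped {url}: {error}", file=sys.stderr)
            continue
        print(title)
        print(text)
        print()


if __name__ == "__main__" and len(sys.argv) == 2:
    harvest(sys.argv[1])
```

Saved under a name of your choice, a script like this would be run as python yourScript.py urls.txt > output.txt, where urls.txt is a hypothetical file with one URL per line. Error messages go to the screen rather than into the output file because they are written to stderr.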

Harvesting Newspaper Websites 3
Now, the question is: how do we get the URLs of the articles? For that purpose, we need another script. Run the code! Do you remember how? Do you remember how to direct the output to a file? Now examine the output. What are the newly acquired URLs?
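
One way to collect article URLs is to let newspaper crawl a site's front page. The sketch below is an assumption about how this could be done, not the script used in class. It relies on newspaper.build and the url attribute of each article it finds; same_site_urls and collect_article_urls are hypothetical helper names.

```python
import sys
from urllib.parse import urlparse


def same_site_urls(urls, site_url):
    """Keep unique URLs whose host matches the site's host, in order."""
    site_host = urlparse(site_url).netloc
    seen = set()
    kept = []
    for url in urls:
        if urlparse(url).netloc == site_host and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


def collect_article_urls(site_url):
    """Return the article URLs that newspaper discovers on a news site."""
    # Imported here so the helper above works without newspaper3k.
    import newspaper

    # memoize_articles=False asks for all articles, not only unseen ones.
    paper = newspaper.build(site_url, memoize_articles=False)
    found = (article.url for article in paper.articles)
    return same_site_urls(found, site_url)


if __name__ == "__main__" and len(sys.argv) == 2:
    for url in collect_article_urls(sys.argv[1]):
        print(url)
```

Redirecting the output to a file gives a URL list with one URL per line, which is the input format the article harvester expects.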

Harvesting Tweets
Twitter does not allow you to collect tweets unless you use its API (Application Programming Interface). The API enables Twitter to regulate scraping so that it does not generate too much traffic and so that private profiles are not violated. Before you scrape Twitter, you need to:
Download and install Python 2.7
Install the Tweepy library/module
Get access to the Twitter API
Download the following code and run it.
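
The course code is not included in this transcript. As a rough illustration only, the sketch below shows how a keyword search might look with Tweepy. It assumes Python 3 and Tweepy 4.x rather than the Python 2.7 setup named on the slide, and it assumes your API credentials still grant access to the search endpoint, which depends on the platform's current access rules. The names tweet_to_line and search_tweets are made up for illustration.

```python
import re


def tweet_to_line(text):
    """Collapse all whitespace so each tweet fits on one corpus line."""
    return re.sub(r"\s+", " ", text).strip()


def search_tweets(query, limit, consumer_key, consumer_secret,
                  access_token, access_token_secret):
    """Return up to `limit` tweet texts matching `query` (assumes Tweepy 4.x)."""
    # Imported here so tweet_to_line works without Tweepy installed.
    import tweepy

    auth = tweepy.OAuth1UserHandler(
        consumer_key, consumer_secret, access_token, access_token_secret
    )
    # wait_on_rate_limit makes Tweepy pause instead of failing on rate limits.
    api = tweepy.API(auth, wait_on_rate_limit=True)
    cursor = tweepy.Cursor(api.search_tweets, q=query, tweet_mode="extended")
    return [tweet_to_line(status.full_text) for status in cursor.items(limit)]
```

Writing one tweet per line keeps the output easy to load into a concordancer later, which is why line breaks inside tweets are collapsed.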

Harvesting Facebook
Like Twitter, Facebook has its own API. You can scrape public pages and groups. Before scraping anything, you need to get an API key from here. We will be using Python 2.7 and the following two scripts:
To scrape groups
To scrape pages
You also need the urllib2 Python library, which is part of the Python 2.7 standard library. For groups, you will need to get the group ID.
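
The two course scripts are not included in this transcript. The sketch below only illustrates the general shape of a Graph API request. It assumes Python 3, where urllib.request takes the place of urllib2, and it assumes your access token is allowed to read the page or group, which Facebook has restricted over time. The helper names, the chosen fields, and the API version passed in are all assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_feed_url(object_id, access_token, api_version, edge="feed", limit=25):
    """Build a Graph API URL for the posts of a page or group."""
    query = urlencode({
        "fields": "message,created_time",
        "limit": limit,
        "access_token": access_token,
    })
    return f"https://graph.facebook.com/{api_version}/{object_id}/{edge}?{query}"


def fetch_messages(url):
    """Return the text of every post in one page of results."""
    with urlopen(url) as response:
        payload = json.load(response)
    # Posts without text (e.g. photo-only posts) have no "message" key.
    return [post["message"] for post in payload.get("data", []) if "message" in post]
```

Here object_id is the group ID or page ID mentioned on the slide, and api_version is whichever version your API key supports.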

Code Credits
Credit for Twitter Scraper
Credit for Facebook Scraper