1
Corpus Linguistics I ENG 617
Rania Al-Sabbagh Department of English Faculty of Al-Alsun (Languages) Ain Shams University Week 10
2
Installation Prerequisites
Before we start, we need to download and install the following: Visual C Build Tools; Editra (a Python editor); Notepad++.
3
Corpus Compilation
It is always a good idea to look for a ready-made corpus, either from sources such as the LDC and ELRA or from individual researchers. However, sometimes you have to compile your own corpus. As you compile it, you need to make sure that it meets the criteria of a well-designed corpus. Do you remember what those criteria are? In corpus and computational linguistics, corpus compilation is also referred to as corpus harvesting.
4
Resources for Corpus Harvesting: Print Books
Depending on your study, you may compile your corpus from print books, online written resources, or audiovisual resources. For print books, one can check the following for a machine-readable text version of the book: Project Gutenberg, Oxford, and the Internet Archive. If such a version does not exist, one may need to work on a scanned version of the book and use an Optical Character Recognition (OCR) program. OCR programs convert scanned images into text files. They are never 100% accurate, but they save much typing time. Many free OCR tools are available online.
5
Resources for Corpus Harvesting: Web as Corpus
When we compile data from online resources, we are using the “Web as Corpus”. The term was coined a few years ago, and there is an entire series of workshops that carries the same name, as well as a special interest group (SIG). Software programs used to compile corpora from the Web are referred to as scrapers, spiders, or crawlers. Can you guess why? Today, we will learn how to scrape texts from news websites, Twitter, and Facebook using Python.
6
Installing Python
Python is a high-level programming language widely used in corpus and computational linguistics. It comes with many libraries or modules that can help us harvest Web-based texts. To start, we need to download Python from here. Double-click the executable file to start the installation. Note the folder in which Python will be installed. To check that Python is correctly installed, type: > python
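Once the interpreter starts, a quick way to confirm which Python is running and where it lives is sketched below. It uses only the standard `sys` module; the function name `describe_python` is made up for this illustration.

```python
import sys


def describe_python():
    """Return the running interpreter's version string and executable path."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    return version, sys.executable


version, location = describe_python()
print("Python version:", version)
print("Installed at:", location)
```

The printed path should point into the installation folder noted during setup.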
7
Harvesting Newspaper Websites 1
Python has a very nice library/module for harvesting articles from Arabic and English newspaper websites. It is called newspaper. To install newspaper, navigate in cmd to the folder where Python is installed and then to its “Scripts” subfolder. Downloading and installing newspaper is as simple as typing: pip install newspaper3k To make sure that the module is correctly installed, run the Python shell and type: import newspaper
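As an alternative to typing the import by hand, the check can be scripted. The sketch below uses the standard `importlib.util.find_spec` to ask whether Python can locate a module without actually importing it; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# "json" ships with Python, so it serves as a sanity check next to newspaper.
for name in ("newspaper", "json"):
    status = "installed" if is_installed(name) else "NOT installed"
    print(name, "is", status)
```

If newspaper is reported as not installed, the pip command was probably run with a different Python installation than the one running this check.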
8
Harvesting Newspaper Websites 2
Now, we will run the following code as: python getNewsArticles.py > getNewsArticles_Output.txt The code takes as input a list of URLs, with one URL per line, like this one. It returns the articles' titles and texts. Let’s take a closer look at this very basic code to understand it.
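The script itself is not reproduced in these notes, so the following is only a minimal sketch of what a script like getNewsArticles.py might look like, not the course's actual code. It assumes newspaper3k's `Article` class with its `download()` and `parse()` methods and `title` and `text` attributes; the function names `read_urls`, `fetch_article`, and `print_articles` are hypothetical.

```python
def read_urls(path):
    """Read one URL per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def fetch_article(url):
    """Download one article and return its (title, text)."""
    # Imported here so read_urls() works even without newspaper3k installed.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the page's HTML
    article.parse()     # extract the title and body text
    return article.title, article.text


def print_articles(path):
    """Print the title and text of every article listed in the URL file."""
    for url in read_urls(path):
        try:
            title, text = fetch_article(url)
        except Exception as error:  # network or parsing failure
            print("FAILED:", url, error)
            continue
        print(title)
        print(text)
        print("-" * 40)
```

Calling `print_articles` on the URL list file at the end of the script, and running the script with the `>` redirect, would send everything that is printed into the output file.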
9
Harvesting Newspaper Websites 3
Now, the question is: how do we get the URLs of the articles? For that purpose, we will need another script. Run the code! Do you remember how? Do you remember how to direct the output to a file? Now examine the output. What are the newly acquired URLs?
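One possible way to gather article URLs, sketched here as an assumption rather than the course's own script, is newspaper3k's `newspaper.build`, which scans a news site's front page and exposes the articles it finds. The helper names `unique_urls` and `collect_article_urls` are hypothetical.

```python
def unique_urls(urls):
    """Drop duplicate URLs while keeping their original order."""
    seen = set()
    result = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


def collect_article_urls(site_url):
    """Return the article URLs that newspaper finds on a news site."""
    import newspaper  # third-party: pip install newspaper3k

    # memoize_articles=False asks newspaper not to skip articles it has
    # already seen in earlier runs.
    source = newspaper.build(site_url, memoize_articles=False)
    return unique_urls(article.url for article in source.articles)
```

Printing each returned URL on its own line and redirecting the output to a file yields a list in the one-URL-per-line format that the article-harvesting step expects.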
10
Harvesting Tweets
Twitter does not allow you to collect tweets unless you use its API (Application Programming Interface). The API enables Twitter to regulate the scraping process so that it does not generate too much traffic and no private profiles are violated. Before you scrape Twitter, you need to: download and install Python 2.7; install the Tweepy library/module; get access to the Twitter API; download the following code and run it.
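The course script is not included in these notes, so what follows is only a hedged sketch of a Tweepy-based search. It is written for Python 3 and Tweepy 4 (`OAuth1UserHandler`, `API.search_tweets`, `Cursor`), whose names differ from the older Python 2.7 releases mentioned above, and whether the search endpoint is available depends on the current API access terms. The four credential parameters, the query, and the function names are placeholders.

```python
def one_line(text):
    """Collapse line breaks and repeated spaces so each tweet fits on one line."""
    return " ".join(text.split())


def search_tweets(query, limit, consumer_key, consumer_secret,
                  access_token, access_token_secret):
    """Return up to `limit` tweet texts matching `query` (needs API credentials)."""
    import tweepy  # third-party: pip install tweepy

    auth = tweepy.OAuth1UserHandler(
        consumer_key, consumer_secret, access_token, access_token_secret
    )
    # wait_on_rate_limit=True makes Tweepy pause instead of failing when
    # the API's rate limit is reached.
    api = tweepy.API(auth, wait_on_rate_limit=True)
    cursor = tweepy.Cursor(api.search_tweets, q=query, tweet_mode="extended")
    return [one_line(status.full_text) for status in cursor.items(limit)]
```

Flattening each tweet to a single line keeps the output file in a one-text-per-line layout, which is convenient for later corpus processing.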
11
Harvesting Facebook
Similar to Twitter, Facebook has its own API. You can scrape public pages and groups. Before scraping anything, you need to get an API key from here. We will be using Python 2.7 and the following two scripts: one to scrape groups and one to scrape pages. You also need the urllib2 Python library, which is part of the Python 2.7 standard library. For groups, you will need to get the group ID.
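To illustrate the kind of request such scripts make, here is a minimal sketch, not the credited scripts themselves. It is written for Python 3, where `urllib.request` replaces Python 2.7's urllib2, and it assumes the Graph API's `/{id}/feed` edge; what that edge returns depends on the permissions attached to the access token. The function names, the ID, and the token are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GRAPH_BASE = "https://graph.facebook.com"


def build_feed_url(node_id, access_token, limit=25):
    """Build the request URL for the feed of a page or group ID."""
    query = urlencode({"access_token": access_token, "limit": limit})
    return "{}/{}/feed?{}".format(GRAPH_BASE, node_id, query)


def fetch_feed(node_id, access_token, limit=25):
    """Request one batch of posts and return the decoded JSON response."""
    with urlopen(build_feed_url(node_id, access_token, limit)) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder values: printing the URL needs no network access or real key.
print(build_feed_url("GROUP_OR_PAGE_ID", "YOUR_ACCESS_TOKEN", limit=5))
```

Separating URL construction from the network call makes it easy to inspect the exact request before spending any of the API quota.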
12
Code Credits
Credit for Twitter Scraper
Credit for Facebook Scraper