Corpus Linguistics I ENG 617

Corpus Linguistics I ENG 617
Rania Al-Sabbagh
Department of English, Faculty of Al-Alsun (Languages)
rsabbagh@alsun.asu.edu.eg
Week 3

Corpora from Brigham Young University 1
Brigham Young University (BYU) is a private research university in Provo, Utah, USA. It has developed a large number of online corpus processors, mostly for English. Online corpus processors have Web interfaces that let users search the corpora without downloading those large texts to their local machines.
Pros: (1) portable, (2) saves memory, (3) free, and (4) ready to use
Cons: (1) restricted to certain functions and (2) limited to certain texts

Corpora from Brigham Young University 2
For illustration purposes, we will use three of the BYU corpora:
The Corpus of Contemporary American English (COCA): a corpus representing American English. It comprises 520 million words of text (20 million words each year, 1990–2015), divided equally among spoken, fiction, popular magazine, newspaper, and academic texts.
The Corpus of Historical American English (COHA): a corpus representing historical American English, with more than 400 million words from the 1810s to the 2000s. The corpus is balanced by genre, decade by decade.
The Arabic Corpus Tools: a corpus of 173 million words of text representing Modern Standard and Egyptian Arabic.

COCA: Signing Up
Although the Web interfaces of COCA, COHA, and the Arabic Corpus are available for free, you need to sign up to use them.

COCA: Searching for Single Words
To search for a single word in COCA, all you need to do is type it into the search box. Each word you look for in the corpus is called a query. For example, typing ‘jump’, we get the following result: [result shown on slide]
What does the figure 19,993 stand for?

COCA: Single Words and Raw Frequencies
It is the raw frequency of the word ‘jump’ in COCA. It means that the word ‘jump’ occurs 19,993 times in the corpus. Put differently, the word ‘jump’ occurs in 19,993 contexts in the corpus. How do we get these 19,993 contexts? By clicking the word itself.
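Raw frequency can be illustrated offline with a few lines of Python. This is only a minimal sketch on a made-up sentence, not how COCA computes its counts; the `tokenize` helper and its lowercase, letters-only tokenization are simplifying assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def raw_frequency(word, tokens):
    """Return how many times `word` occurs in the list of tokens."""
    return Counter(tokens)[word]

# A made-up toy corpus, used only for illustration.
toy_corpus = "I jump. You jump. They jumped over the fence after the first jump."
tokens = tokenize(toy_corpus)

print(raw_frequency("jump", tokens))    # counts the exact form 'jump' only
print(raw_frequency("jumped", tokens))  # 'jumped' is counted as a separate word
```

Note that an exact-match count treats ‘jump’ and ‘jumped’ as different words.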

COCA: Filtering by Part of Speech
When we type ‘jump’ as a query, the results include all the parts of speech of jump, i.e. jump as a verb and as a noun. What if we want the raw frequency of jump as a noun only? We need to use the ‘POS’ option to the right of the search box.
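The idea behind a part-of-speech filter can be sketched as counting over (word, tag) pairs. The tagged tokens and tag names below are hypothetical and are not COCA's tagset; the function name `pos_frequencies` is also an assumption made for this example.

```python
from collections import Counter

# Hypothetical tagged tokens; the tag names are illustrative, not COCA's tagset.
tagged_tokens = [
    ("they", "PRON"), ("jump", "VERB"), ("at", "ADP"), ("the", "DET"),
    ("chance", "NOUN"), ("a", "DET"), ("big", "ADJ"), ("jump", "NOUN"),
    ("we", "PRON"), ("jump", "VERB"),
]

def pos_frequencies(word, tagged):
    """Count the occurrences of `word` separately for each part-of-speech tag."""
    return Counter(tag for token, tag in tagged if token == word)

counts = pos_frequencies("jump", tagged_tokens)
print(sum(counts.values()))  # raw frequency across all parts of speech
print(counts["NOUN"])        # raw frequency of 'jump' as a noun only
```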

COCA: Searching for Phrases
You can search for phrases the same way you search for single words. For example, if you search for ‘kick the bucket’, the result 24 is the raw frequency of the entire phrase. To view the contexts in which your query phrase is used, click the phrase itself. However, we can’t use the POS filter with phrases.

COCA: Searching with the Wildcard 1
What if you want to search for kick the bucket, kicks the bucket, kicked the bucket, and kicking the bucket? One option is to enter each phrase as a separate query, but this is tedious. Another option is to use the asterisk, or wildcard (*), as in: kick* the bucket
The wildcard means anything: it matches whatever comes in the position of the wildcard.
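One way to picture what a wildcard query does is to translate it into a regular expression. The sketch below assumes that * stands for any run of word characters, including none; COCA's actual matching rules may differ, and the function names are made up for this example.

```python
import re

def wildcard_to_regex(query):
    """Translate a query such as 'kick* the bucket' into a compiled regex.

    Assumption: '*' stands for any run of word characters (possibly empty).
    """
    parts = [re.escape(piece) for piece in query.split("*")]
    pattern = r"\w*".join(parts)
    # Only match at word boundaries, so 'kick*' does not match inside 'sidekick'.
    return re.compile(r"(?<!\w)" + pattern + r"(?!\w)", re.IGNORECASE)

def find_matches(query, text):
    """Return every stretch of `text` that matches the wildcard query."""
    return [m.group(0) for m in wildcard_to_regex(query).finditer(text)]

# A made-up text, used only for illustration.
text = ("He kicked the bucket. Nobody kicks the bucket twice, "
        "and kicking the bucket is final.")
print(find_matches("kick* the bucket", text))
```

Because the wildcard matches anything, the same query would also match a form such as ‘kickoff the bucket’ if it occurred.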

COCA: Searching with the Wildcard 2
If a wildcard means anything, then we can use it to identify fixed and flexible expressions. Fixed expressions do not allow any other words to come in between. For example, kick the bucket will always be the same; never kick the big bucket or kick the last bucket. To make sure, try this query in COCA: kick the * bucket
How about the expression ‘at first glance’? Is it fixed or flexible? Can ‘first’ be replaced by something else? To find out, we can try ‘at * glance’. What did you get?

COCA: Searching with the Wildcard 3
Wildcards can also be used to do morphological searches. What if we want to know which words are used with the suffix ‘-icity’? To find out, we can use the wildcard as in *icity. Notice the difference between *icity and *˽icity (where ˽ marks a space). What different results does each one of them give you?
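A suffix search of this kind can be imitated on a word list with Python's standard `fnmatch` module, which uses the same * convention. This is a sketch on a made-up sentence; the function name `wildcard_word_frequencies` is an assumption, and the code only handles single-word patterns.

```python
from collections import Counter
from fnmatch import fnmatchcase

def wildcard_word_frequencies(pattern, tokens):
    """Return (word, raw frequency) pairs for tokens matching a single-word wildcard pattern."""
    counts = Counter(t for t in tokens if fnmatchcase(t, pattern))
    return counts.most_common()  # most frequent first

# A made-up token list, used only for illustration.
tokens = ("electricity costs rose while publicity and electricity "
          "demand grew in the city").split()

print(wildcard_word_frequencies("*icity", tokens))
```

The word ‘city’ is not returned, because the pattern requires the full string ‘icity’ at the end of the word.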

Quiz
True or False?
1. POS stands for Point of Selection.
2. Fixed expressions are unmodifiable expressions.
3. Online corpus processors come with Web interfaces.
4. Online corpus processors use corpora stored on some servers.
5. Raw frequency is the total number of occurrences of a word in a corpus.
6. The wildcard is a good idea to look for all the derivations of pick in one step.
Use the COCA online corpus processor to get the raw frequency of:
1. book (v. & n.)
2. book (v.)
3. book a ticket
4. books, booking, and booked combined

Quiz
Use the COCA Web interface to find out:
1. the top three frequent words starting with the prefix ‘anti-’
2. the most frequent stem attached to the suffix ‘-ness’
3. whether ‘once upon a time’ is a fixed or a flexible expression

COCA: Searching for Parts of Speech
What if we want to know the most common noun in COCA? We can search for parts of speech using the tags provided in the interface. If we want the most common noun in COCA, we can use the following: [query shown on slide]
Try it and see what the most common noun is.

COCA: Searching for Lemmas
Although wildcards can be used to find word derivations, they only find affix-based derivations. What about zero-affix derivations such as ate (from eat)? To find all the derivations of a given lemma, including both affix-based and zero-affix derivations, we can try the following: [query shown on slide]
What is the result of your query?
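The difference between a wildcard search and a lemma search can be sketched with a small lookup table that maps each word form to its lemma. The table below is hand-written and hypothetical; a real corpus processor relies on a full lemmatized corpus, and the names `LEMMA_TABLE` and `lemma_frequencies` are assumptions for this example.

```python
from collections import Counter

# Hypothetical, hand-written lemma table covering one lemma only.
LEMMA_TABLE = {
    "eat": "eat", "eats": "eat", "ate": "eat", "eaten": "eat", "eating": "eat",
}

def lemma_frequencies(lemma, tokens, table=LEMMA_TABLE):
    """Count every token whose lemma (according to `table`) is `lemma`."""
    return Counter(t for t in tokens if table.get(t) == lemma)

# A made-up token list, used only for illustration.
tokens = "she ate early , he eats late , they have eaten and we eat now".split()

forms = lemma_frequencies("eat", tokens)
print(dict(forms))          # each form of the lemma with its raw frequency
print(sum(forms.values()))  # raw frequency of the lemma as a whole
```

A wildcard query such as eat* would find ‘eats’ and ‘eaten’ here but would miss ‘ate’, which is what the lemma lookup adds.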

COCA: Searching for Synonyms
We can search COCA for synonyms as well. To do so, all we need is the following: [query shown on slide]
Do you see something wrong in the results? How can we get better results?

Quiz
Use the COCA Web interface to get:
1. the synonym of skip as a verb
2. the derivations of speak
3. the most frequent preposition in COCA

COCA: Searching Genres and Periods of Time 1
COCA is a general corpus with many genres, including spoken, fiction, magazine, newspaper, and academic genres. What if we want to know the frequency of Egypt in each genre? COCA also includes texts from different periods of time: 1990–2015. What if we want to know the frequency of Egypt in each period? The best way to do so is to use the chart option.

COCA: Searching Genres and Periods of Time 2
There are three different numbers in the chart of Egypt. Freq. stands for raw frequency. Size (M) stands for the size of the texts in a given genre or period of time. How about Per Mil?

COCA: Raw vs. Normalized Frequencies
Per Mil is the normalized frequency per million words. Raw frequency is the number of occurrences in the corpus. It does not always give an accurate idea of which word is more frequent, because the genres, periods, or corpora being compared may differ in size. Hence, we typically use the normalized frequency, which is calculated as follows:
$$\text{Normalized Frequency}(w) = \frac{C(w)}{N} \times \text{common base}$$
where C(w) is the raw frequency of the given word, N is the total number of words in the corpus, and the common base ranges from 10 to 1,000,000 depending on the size of the corpus.
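The formula can be written as a short Python function. This is a minimal sketch: the function name, the default base of one million, and the figures in the example are assumptions chosen for illustration, not values taken from COCA.

```python
def normalized_frequency(raw_count, corpus_size, base=1_000_000):
    """Return C(w) / N * base, i.e. occurrences per `base` words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    # Multiply before dividing to keep the arithmetic exact for whole numbers.
    return raw_count * base / corpus_size

# Hypothetical figures: the same word in two corpora of different sizes.
small = normalized_frequency(50, 200_000)    # 50 occurrences in 200,000 words
large = normalized_frequency(80, 1_000_000)  # 80 occurrences in 1,000,000 words

print(small)  # per million words in the smaller corpus
print(large)  # per million words in the larger corpus
```

In this made-up example the raw frequency is higher in the larger corpus (80 against 50), yet the word is relatively more frequent in the smaller one (250 against 80 per million).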

Quiz
1. In a corpus of 2,000 words, book as a noun occurs 120 times, whereas book as a verb occurs 30 times. Calculate the normalized frequency of book as a noun.
2. In a corpus of 300,000 words, withdrawal occurs 20 times. Calculate the normalized frequency of withdrawal.
3. What are the raw and normalized frequencies of Cairo in the different genres of COCA?

COCA: Key Word In Context (KWIC) 1
Key Word In Context (KWIC) is the concordance function, which displays up to 1,000 random contexts of the query word. Two questions: What if I want to see more than 1,000 contexts? What is the difference between the KWIC and clicking the word frequency to see the contexts? The main difference is that with the KWIC, we get the contexts with the parts of speech encoded in colors.
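The basic idea of a concordance can be sketched as follows. This is not COCA's implementation: it does no random sampling and no color coding of parts of speech, and the function name `kwic` and its `window` parameter are assumptions made for this example.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit of `keyword`.

    `window` is the number of tokens shown on each side of the keyword.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

# A made-up token list, used only for illustration.
tokens = "the cat sat on the mat and the dog sat down".split()

for left, key, right in kwic(tokens, "sat", window=2):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>10}  {key}  {right}")
```

Aligning the keyword in a single column is what makes recurring patterns to its left and right easy to spot.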

COCA: Key Word In Context (KWIC) 2
What do these colors stand for?