Published by Hilary Black
1
Computer Corpora and What They Can Tell Us about How People Use Language
Introduction to Information Science (情報科学入門), 26 July 2012
2
“Corpus”?
Latin “corpus” = body
Latin “corpora” = bodies
English “corpus” = collection of texts
English “corpora” = collections of texts
Japanese “コーパス” = 文書などの集大成 (a compilation of documents and similar material)
3
What is a computer corpus?
A corpus is a collection of texts stored on a computer.
The texts can be books, magazines, letters, Internet pages, e-mails, or parts of these.
They can also be transcriptions of speeches, phone calls, or radio programs.
A corpus is often stored as a single file in simple text format.
4
How big is a computer corpus?
It can be very big or very small. The biggest (e.g. the British National Corpus and the Corpus of Contemporary American English) have many millions of words. A small corpus might have only a few hundred words.
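As a rough illustration of what "size" means here, the sketch below counts the words in a tiny corpus held in a Python string. The sample text and the `tokenize` helper are invented for this example; a real corpus would normally be read from a text file and tokenized with more care.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

# Hypothetical miniature corpus; a real one would be loaded from a text file.
corpus = "I won't forget you. Don't forget to write. Nobody will forget that day."

tokens = tokenize(corpus)
print(len(tokens))       # number of running words (tokens)
print(len(set(tokens)))  # number of different words (types)
```

Corpus sizes such as "425 million words" refer to the first figure, the number of running words.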
5
Benefits of computer corpora
In what way do you think computer corpora might be useful? Any ideas?
6
What are computer corpora for?
We can use corpora to study language.
What are the most common words?
What words are used together?
What words of a particular type are used together (e.g., under + NOUN)?
If we compare two corpora (e.g. e-mails and textbooks), is a word more common in one?
How do people use words in sentences?
7
Computer corpora and dictionaries
All major English dictionaries are now based on computer corpora. How common is a word? How many different meanings does it have? What are some examples of its use? Is it used in a good or bad sense? What grammatical patterns is it used with? What other words is it used with?
8
Word frequencies
What do you think are the most common words in English? Make a list of about five words.
9
The most common English words (http://www.world-english.org)
Of To And A In Is It You That
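A frequency list like this one can be produced by counting every word in a corpus. The following is a minimal sketch using Python's standard library; the sample sentence is invented and far too small to say anything about English as a whole.

```python
from collections import Counter
import re

def word_frequencies(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Hypothetical sample text, for illustration only.
sample = "The cat sat on the mat. The dog sat on the cat."

freqs = word_frequencies(sample)
for word, count in freqs.most_common(3):
    print(word, count)
```

Run on a large corpus, the same counting would give a ranking of the kind shown on this slide.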
10
In various situations
11
Concordances
One of the most common ways to study computer corpora is to use a concordance.
A concordance program finds all the instances of a word or phrase in a corpus.
It presents a list of the instances, often with the search word in the middle of the screen.
12
Example of a concordance list
13
What does this tell us?
In the words before forget, there are
many examples of negative words: not, won’t, don’t, couldn’t, shouldn’t, never, nobody
many contractions: won’t, don’t, you’ll, couldn’t, shouldn’t, you’d
several examples of to
14
What does this tell us?
In the words after forget, there are
several examples of to
several examples of –ing
several examples of what and that
several examples of the
several examples of he, she, you, it, and we
Notice also that forget usually comes in the middle of a sentence, not at the beginning or end.
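Observations like these can also be counted automatically. The sketch below tallies the word immediately before and after each occurrence of a search word; the `neighbors` helper and the sample text are invented for illustration, and the code ignores sentence boundaries for simplicity.

```python
from collections import Counter
import re

def neighbors(tokens, keyword):
    """Count the words immediately before and after each occurrence of `keyword`."""
    before, after = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            if i > 0:
                before[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                after[tokens[i + 1]] += 1
    return before, after

# Invented sample text; note that sentence boundaries are not respected here.
sample = "I won't forget you. Don't forget to write. You'll never forget what he said."
tokens = re.findall(r"[a-z']+", sample.lower())
before, after = neighbors(tokens, "forget")
print(before.most_common())
print(after.most_common())
```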
15
Open a concordance on your PC
Go to the COCA website. This site gives you access to the Corpus of Contemporary American English (COCA).
The largest free corpus in the world: 425 million words, 5 types of text
Spoken
Fiction
Magazine
Newspaper
Academic
17
Display
At the top left, you will see under Display:
List: Shows a list of words in the right column
Chart: Shows two charts in the right column
  Types of text (spoken, fiction, magazine, etc.)
  Time (5-year periods)
KWIC (Key Words in Context): Shows nouns, verbs, etc. around the search string
Compare: Shows results for two words
18
Search String
Under Search String, you will see:
Word: Type a word (e.g. head).
Collocates: Type a word used near head. The two boxes next to Collocates show
  the maximum number of words before head
  the maximum number of words after head
POS (Part Of Speech): Select a part of speech (e.g., noun, verb, etc.) used near head.
Random: Chooses a random search string.
Search: Click this to begin your search.
Reset: Clears the left column.
19
Sections
Show: Check this box to show charts for
  Type of text (Spoken, Magazine, etc.)
  Time
1: Choose a type of text for the search string
  Ignore (= all types)
  Spoken
  Magazine
  Newspaper
  Academic
2: If you are comparing two search strings, choose the type of text for the second string.
20
Search syntax
To find two words:
  To find “good luck”, type “good luck” in Word(s).
To find the neighboring word:
  To find what word comes after “dog”, type “dog *”.
  To find what word comes before “dog”, type “* dog”.
To find two words with up to four words between them:
  Word(s): dog
  Collocates: bark
  Number of words after: 5
  Finds “dog bark”, “dog will bark”, “dog will often bark”, “dog will not always bark”, “dog will in no situation bark”.
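The collocate search can be pictured as looking for a second word inside a window of tokens after the first. The following is a sketch of that idea only, not COCA's actual implementation; the function name, the `max_after` parameter, and the sample text are assumptions, and the window is allowed to cross sentence boundaries for simplicity.

```python
import re

def collocate_hits(tokens, word, collocate, max_after=5):
    """Return token spans that start at `word` and end at `collocate`,
    where `collocate` is one of the next `max_after` tokens."""
    hits = []
    for i, token in enumerate(tokens):
        if token != word:
            continue
        window = tokens[i + 1:i + 1 + max_after]
        if collocate in window:
            end = i + 1 + window.index(collocate)
            hits.append(" ".join(tokens[i:end + 1]))
    return hits

# Invented sample text, for illustration only.
sample = ("A dog will often bark. That dog bark is loud. "
          "The dog in the house across the street did not bark.")
tokens = re.findall(r"[a-z']+", sample.lower())
print(collocate_hits(tokens, "dog", "bark"))
```

In the last sentence, "bark" is more than five words after "dog", so that sentence is not reported.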
21
Search syntax (2)
To find different forms of a word:
  Word(s): [blow] away
  Finds “blow away”, “blows away”, “blew away”, “blowing away”, “blown away”.
To find all the words that begin the same way:
  Word(s): comp*
  Finds “compare”, “compute”, “computer”, “compiler”, “comply”, etc.
To find any of a set of words:
  Word(s): cut|cuts|cutting
  Finds “cut”, “cuts”, “cutting”.
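These three query types map fairly directly onto regular expressions. The sketch below translates a simplified version of the syntax into a Python regex. It is an illustration only: the `FORMS` table is hand-written for one verb and the `query_to_regex` function is hypothetical, whereas COCA relies on its own lemma data.

```python
import re

# Hand-written list of forms for one lemma; an assumption made for this sketch.
FORMS = {"blow": ["blow", "blows", "blew", "blowing", "blown"]}

def query_to_regex(query):
    """Translate a simplified query ([lemma], prefix*, a|b|c) into a compiled regex."""
    parts = []
    for term in query.split():
        if term.startswith("[") and term.endswith("]"):
            # [lemma]: match any listed form of the word
            parts.append("(?:" + "|".join(FORMS[term[1:-1]]) + ")")
        elif term.endswith("*"):
            # prefix*: match the prefix followed by any word characters
            parts.append(re.escape(term[:-1]) + r"\w*")
        else:
            # plain word or a|b|c alternatives
            options = [re.escape(t) for t in term.split("|")]
            parts.append("(?:" + "|".join(options) + ")")
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

# Invented sample text, for illustration only.
text = ("The wind blew away the papers. Computers compare and compute. "
        "She cuts paper; he is cutting cloth.")
print(query_to_regex("[blow] away").findall(text))
print(query_to_regex("comp*").findall(text))
print(query_to_regex("cut|cuts|cutting").findall(text))
```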
22
Try the COCA concordance
In the top right corner, log in with my e-mail address and our group password.
In the Word(s) box, type “playing”.
Click on “Search”.
In the top right column, click on “PLAYING”.
What topics are most of the examples about?
23
COCA concordance for “playing”
24
Findings for “playing”
Acting: playing Santa Claus, playing the mother
Sports: playing basketball, left the playing field
Other games: playing chess, playing the video game
Music: quiet music is playing, playing the guitar
Playing for fun (遊んだ): playing with him on the school’s grounds
25
Word frequency
At the top right, under TOT, you see “58676”.
The corpus contains 58,676 examples of playing.
Under Display, select CHART. Click the Search button.
The right column shows the frequency of playing in different types of text. In which type is it most common? Why?
You can also see the frequency for 5-year periods. In which period was it most common?
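When comparing types of text, raw counts can mislead because the sections differ in size, so frequencies are usually normalized, for example per million words. The sketch below shows the arithmetic; the counts and section sizes are invented and are not COCA figures.

```python
def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size

# Invented counts and section sizes, for illustration only; not COCA figures.
sections = {
    "spoken":   {"hits": 120, "words": 2_000_000},
    "academic": {"hits": 45,  "words": 1_500_000},
}

for name, data in sections.items():
    print(name, per_million(data["hits"], data["words"]))
```

With these made-up numbers the word is twice as frequent in the spoken section as in the academic one, even though the raw counts differ by a larger factor.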
26
Try a two-word search
Click the Reset button.
In the Word(s) box, type “* friend of *”.
Click on “Search”.
Notice the words before and after “friend of”. What did you find?
27
Collocations for “friend of”
28
Findings for “friend of”
Before: “a”, “good”, “close”, “old”
After: “mine”, “the”, “his”, “hers”, “ours”, “theirs”
29
Two words with an optional gap
Click the Reset button.
In the Word(s) box, type “a”.
Click on Collocates.
In the Collocates box, type “teacher”.
Set the number of words after to 5.
Click “Search”.
In the top right column, click on “TEACHER”.
Notice the words between “a” and “teacher”. What did you find?
30
Concordance for “a . . . teacher”
31
Findings for “a . . . teacher”
“a technology teacher” “a high school teacher” “a new head teacher” “a 29 year old teacher” “a German-language teacher” “a preschool teacher” “a highly qualified teacher” “a French teacher”, etc.
32
Word + Part of Speech (POS)
You can also search for a word with a POS, e.g., made me + VERB (動詞).
Click on the POS button in the left column.
noun.ALL: all common nouns (名詞)
verb.ALL: all verbs (動詞)
adj.ALL: all adjectives (形容詞)
adv.ALL: all adverbs (副詞)
neg.ALL: all instances of “not”, “n’t”
art.ALL: all articles (“a”, “an”, “the”)
det.ALL: all determiners (“this”, “these”, etc.)
pron.ALL: all pronouns (代名詞)
poss.ALL: all possessive pronouns (“my”, “your”, etc.)
prep.ALL: all prepositions (前置詞)
conj.ALL: all conjunctions (接続詞)
noun.ALL+: all common and proper nouns (名詞)
noun.SG: singular noun (単数の名詞)
noun.PL: plural noun (複数の名詞)
noun.CMN: common noun (普通名詞)
noun.+PROP: proper nouns (固有名詞)
verb.BASE: base form of verb (“know”, “think”, etc.)
verb.INF: infinitive form of verb (“be”, “have”, etc.)
verb.MODAL: modal form of verb (“may”, “might”, etc.)
verb.3SG: 3rd person singular verb (“has”, “goes”, etc.)
verb.ED: past tense verb (“went”, “played”, etc.)
verb.ING: “ing” form of verb (“going”, “playing”, etc.)
PUNC: all punctuation marks (. , ; : ! ? - etc.)
33
Search for a word + POS
In the left column, click “Reset”.
In the Word(s) box, type “made me”.
Click “POS”.
In the POS box, select verb.ALL.
Click “Search”.
Notice the words after “me”. What did you find?
34
Concordance for “made me VERB”
35
Findings for “made me”
All the words after “me” were bare infinitives.
The most common verb was “want” (328).
There were many “thinking” verbs, e.g., “realize”, “see”, “believe”, “think”, “understand”.
There were also some “action” verbs, e.g., “do”, “look”, “take”, “get”.
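A word + POS search depends on every word in the corpus carrying a part-of-speech tag. The sketch below imitates “made me + VERB” on a short list of (word, tag) pairs. The tags were assigned by hand for this example, and the tag names and the `words_after` helper are assumptions, not COCA's own tag set or method.

```python
from collections import Counter

# Hand-tagged (word, POS) pairs, invented for this sketch.
tagged = [
    ("it", "PRON"), ("made", "VERB"), ("me", "PRON"), ("want", "VERB"),
    ("to", "PART"), ("cry", "VERB"), ("she", "PRON"), ("made", "VERB"),
    ("me", "PRON"), ("realize", "VERB"), ("that", "CONJ"), ("he", "PRON"),
    ("made", "VERB"), ("me", "PRON"), ("a", "DET"), ("cake", "NOUN"),
    ("that", "PRON"), ("made", "VERB"), ("me", "PRON"), ("want", "VERB"),
    ("more", "ADJ"),
]

def words_after(tagged, phrase, pos):
    """Count words tagged `pos` that directly follow the word sequence `phrase`."""
    n = len(phrase)
    words = [word for word, _ in tagged]
    counts = Counter()
    for i in range(len(tagged) - n):
        if words[i:i + n] == phrase and tagged[i + n][1] == pos:
            counts[tagged[i + n][0]] += 1
    return counts

print(words_after(tagged, ["made", "me"], "VERB").most_common())
```

“made me a cake” is skipped because the word after “me” is tagged as a determiner, not a verb.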
36
Inflected forms
Click “Reset”.
In the Word(s) box, type “I wish I [be]”.
Click “Search”.
Notice the word after “I wish I”. What did you find?
37
Concordance for “I wish I [be]”
38
Findings for “I wish I”
“I wish I was” (224 cases)
“I wish I were” (205 cases)
Traditional grammar treats “I wish I were” as the correct form, yet “I wish I was” is slightly more frequent in the corpus.
Native English speakers do not always use English “correctly”.
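A count like this can be reproduced on any plain-text corpus with a regular expression. The sketch below lists the forms of “be” explicitly in place of the [be] lemma search; the `wish_forms` helper and the sample sentences are invented for illustration.

```python
import re
from collections import Counter

def wish_forms(text):
    """Count which form of "be" follows "I wish I" (case-insensitive)."""
    pattern = re.compile(
        r"\bI wish I (was|were|am|is|are|be|been|being)\b", re.IGNORECASE
    )
    return Counter(form.lower() for form in pattern.findall(text))

# Invented sentences, for illustration only.
sample = "I wish I was taller. I wish I were there. Sometimes I wish I was home."
print(wish_forms(sample))
```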
39
Pre-lecture quiz
What answers did you get?
happy ______
What a _______
I haven’t a _______
as good as ______
______ the winter
I’ve _____ arrived
Don’t be a ______
a ______ breakfast
He didn’t take any _______
She ______ her head
40
Answers to the pre-lecture quiz
happy [to, with, and, about, birthday]
What a [great, lot, good, wonderful, difference]
I haven’t a [clue, thing, single, choice]
as good as [the, it, they, any, a, I, you]
[in, during, for, of, through] the winter
I’ve [just, now, finally, always, already] arrived
Don’t be a [fool, stranger, hero, jerk, baby]
a [big, good, hearty, late, quick] breakfast
He didn’t take any [questions, of, shit, precautions]
She [shook, shakes, turned, tilted] her head
41
Summary
We can learn a lot about language from computer corpora.
In particular, concordances can show us how people really use language in practice.
Concordances are useful for students of English:
  to check how vocabulary is used
  to check grammatical constructions
42
Some other online concordances
Michigan Corpus of Academic Spoken English (MICASE)
Web Concordancer (English)
Corpus Concordance English
43
Post-lecture quiz
Please complete the quiz paper I gave you today.
Submit it to the lecturers’ room (講師室) by 5:30 p.m. on Wednesday.
If you don’t submit it, you will not get any points for attending this lecture.
That’s it, folks!