Published by Lydia Wilcox. Modified over 7 years ago.
1
Using Corpus Linguistics Tools to Aid Research in the Social Sciences
Mike Scott, Aston University
Friedrich-Alexander University Erlangen, 25 January 2016
3
Bootyful, cyw, scrims
Bootyful, cyw, scrims. Do you know these words? If not, you soon might. They are some of the fastest growing words from online niches around the world, as identified by new software that charts the rise of language online. Bootyful, an alternative spelling for beautiful, has had a dramatic rise in usage on Twitter in South Wales. Cyw (coming your way) has become popular in the north of the country. Scrims comes from gaming forums, where it refers to practice sessions before competitive games.
The software that found these words was developed by Daniel Kershaw and his supervisor, Matthew Rowe, at Lancaster University, UK. Kershaw and Rowe took established methods lexicographers use to chart the popularity of words, translated them into algorithms, then applied them to 22 million words' worth of Twitter and Reddit posts. Their goal is to peer into the niche portions of the internet, and chart novel language making its foray out into the mainstream. “If we see an innovation taking off on Reddit or Twitter, the question is what point is it going to appear in a newspaper,” says Rowe.
Kershaw and Rowe’s algorithms don’t just pick out frequently used words, but words that have gone through a sudden rise in popularity. This comes with some complications. The five fastest rising words in central London for the period they studied were all Spanish or Portuguese, unlikely to be reflecting the reality of London’s language scene. (
5
Agenda
Why Corpus Tools?
Which Corpus Tools?
How?
6
My reason for interest
language teaching
English for Academic Purposes
Latin America
students struggling to understand main points but getting bogged down in detail
7
Social Science Research Agenda
Explore events
Understand causes
Understand processes
12
Research Agenda
Make sense of complexity
13
Kintsch & van Dijk (1970s) macro-rules
Generalization: use “super-propositions”
Deletion: omit unwanted detail
Construction: use entailment to draw inferences
14
A Text Linguistic Objective of the 1970s
To come up with a set of macro-propositions which could represent the gist of the text.
15
Kintsch & van Dijk macro-rules
Generalization: use “super-propositions”
Deletion: omit unwanted detail
Construction: use entailment to draw inferences
16
Generalization
“Of a sequence of propositions we may substitute any subsequence by a proposition defining the immediate superconcept of the micropropositions”
Mary was drawing a picture. Sally was jumping rope and Daniel was building something with Lego blocks.
The children were playing.
17
Deletion
“Of a sequence of propositions we may delete all those denoting an accidental property of a discourse referent”
A girl in a yellow dress passed by.
1. A girl passed by.
(2. She was wearing a dress. 3. The dress was yellow.)
18
Construction
“Of a sequence of propositions we may substitute each subsequence by a proposition if they denote normal conditions, components or consequences of the macroproposition substituting them.”
John went to the station. He bought a ticket, started running when he saw what time it was and was forced to conclude that his watch was wrong when he reached the platform.
John missed the train.
19
Corpus Linguistics but now in 2016?
20
Standard corpora
British National Corpus
Corpus of Contemporary American English (COCA)
22
Your own corpus
In contrast to other well-known corpora and corpus archives (such as the British National Corpus), however, the German Reference Corpus is explicitly not designed as a balanced corpus: The distribution of DeReKo texts across time or text types does not match some predefined percentages. This conception complies with the fact that whether or not a given corpus constitutes a balanced or even representative language sample may only be assessed with respect to a specific language domain (i.e., the statistical population). Because different linguistic investigations generally aim at different language domains, the declared purpose of the German Reference Corpus is to serve as a versatile superordinate sample, or primordial sample (German: Ur-Stichprobe) of contemporary written German, from which corpus users may draw a specialised subsample (a so-called virtual corpus) to represent the language domain they wish to investigate. ( Jan 2016)
23
Your own texts
From your students
Your research archive
LexisNexis and other standard text collections
Project Gutenberg
Oxford Text Archive
24
text patterns
25
Example of corpus
26
Twitter
27
txtLAB450
28
accompanying spreadsheet
29
Limitations
Corpus tools typically ignore:
images
numbers, dates
equations
variations in typeface
hyperlinks
related sound or video files
30
Problems
Multiple formats
PDF format
One text = one file?
Incomplete texts
Duplicated texts
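The "duplicated texts" problem can often be caught before analysis starts. The following is a minimal sketch, assuming the texts are already loaded as plain-text strings; the function names and the sample file names are hypothetical, and the normalisation (lower-casing, collapsing whitespace) is only one possible choice. It will not catch near-duplicates that differ in wording.

```python
import hashlib
import re


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial layout differences do not hide a duplicate."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_duplicates(texts: dict) -> dict:
    """Group text names by the SHA-256 hash of their normalised content."""
    groups = {}
    for name, text in texts.items():
        digest = hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()
        groups.setdefault(digest, []).append(name)
    # Keep only the groups that contain more than one text
    return {d: names for d, names in groups.items() if len(names) > 1}


# Hypothetical sample: a.txt and b.txt differ only in spacing and line breaks
sample = {
    "a.txt": "Austerity measures were announced today.",
    "b.txt": "Austerity  measures were\nannounced today.",
    "c.txt": "A different text entirely.",
}
print(list(find_duplicates(sample).values()))
```

Hashing the normalised text rather than the raw file means the same article saved twice with different line wrapping is still treated as one text.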
31
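Formats
ASCII (7-bit) and "ANSI" code pages: one byte per character
legacy formats from the 1960s to the 1990s (DOS, Windows, Mac, IBM etc.)
UTF-8: a variable number of bytes per character (one to four)
UTF-16: two bytes for each of the roughly 65,000 characters in the basic range, four bytes for characters beyond it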
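The byte counts behind these formats can be checked directly. This is a small sketch using only Python's built-in codecs; the helper name `byte_lengths` and the sample characters are chosen for illustration. Note that characters outside the first 65,536 code points (an emoji, for instance) take four bytes in UTF-16.

```python
def byte_lengths(ch: str) -> tuple:
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    # "utf-16-le" is used so that no 2-byte byte-order mark is added to the count
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in ["a", "é", "€", "😀"]:
    print(repr(ch), byte_lengths(ch))

# A legacy one-byte-per-character file: the byte 0xE9 is "é" in Latin-1 / Windows-1252
legacy_bytes = b"caf\xe9"
print(legacy_bytes.decode("latin-1"))

# The same bytes are not valid UTF-8, which is a common source of garbled corpora
try:
    legacy_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```

In practice this means a corpus tool has to be told (or has to guess) the encoding of each file, and converting everything to one Unicode encoding before analysis avoids mixed results.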
32
PDF save as?
33
converting PDF to plain text…
Adobe Reader Save As… Export as Word .docx
35
Adobe Acrobat OCR Save as .doc
36
Why Corpus Tools
process larger amounts of text
transform the text in varied ways, seeking patterns
37
Which Corpus Tools?
online corpus tools
stand-alone
grammar patterns
38
How
simple word lists
concordances
collocation patterns
dispersion plots
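The first two items in this list, word lists and concordances, can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular corpus package works: the tokeniser is a deliberately crude regular expression, and the function names and sample sentence are hypothetical.

```python
import re
from collections import Counter


def tokenise(text):
    """Very rough tokeniser: lower-cased runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_list(text, n=5):
    """Frequency-ordered word list: the n most common word forms."""
    return Counter(tokenise(text)).most_common(n)


def concordance(text, node, width=3):
    """Key-word-in-context lines: (left context, node word, right context)."""
    tokens = tokenise(text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


sample = ("The cuts were deep. Austerity hit the north hardest. "
          "Critics of austerity blamed the cuts.")
print(word_list(sample, 3))
for left, node, right in concordance(sample, "austerity", width=2):
    print(f"{left:>15}  {node}  {right}")
```

Aligning the node word in a column, as the final loop does, is what makes recurring patterns to its left and right easy to spot by eye.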
39
at a basic level…
let you see the overall vocabulary
multiple examples of words & phrases
can be broken down by
number of texts
location in text
context words
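Two of these breakdowns, number of texts and location in text, can be illustrated as follows. This is a rough sketch, assuming each text is a plain string; the function names, the crude tokeniser and the sample texts are hypothetical, and the one-line "plot" is only a text stand-in for a real dispersion plot.

```python
import re


def tokenise(text):
    """Very rough tokeniser: lower-cased runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def text_range(texts, word):
    """Number of texts in which the word occurs at least once."""
    return sum(1 for t in texts if word in tokenise(t))


def positions(text, word):
    """Relative position of each hit in one text (0.0 = start, 1.0 = end)."""
    tokens = tokenise(text)
    last = max(len(tokens) - 1, 1)  # avoid dividing by zero for one-word texts
    return [i / last for i, tok in enumerate(tokens) if tok == word]


def plot_line(text, word, width=20):
    """Crude one-line dispersion plot: '|' marks a hit, '.' marks no hit."""
    cells = ["."] * width
    for p in positions(text, word):
        cells[min(int(p * width), width - 1)] = "|"
    return "".join(cells)


texts = [
    "austerity came first and then the cuts",
    "nothing relevant here",
    "the cuts followed austerity",
]
print(text_range(texts, "austerity"))
for t in texts:
    print(plot_line(t, "austerity", width=10))
```

A word that appears in only one or two texts, or only at the start of each text, behaves very differently from one spread evenly through the corpus, even when the raw frequencies match.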
40
dealing with large amounts of data
sorting
filtering out
41
corpus-based or corpus-driven
corpus-based: uses a corpus to try to find examples of something where the underlying research categories already exist
Example: looking through a large text corpus for references to austerity and seeking collocates
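The austerity example can be sketched as a simple collocate count: every word found within a few words of the search term is tallied. This is a minimal illustration, not a full collocation measure (no association statistic is computed); the function names, the span, the stop list and the sample text are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenise(text):
    """Very rough tokeniser: lower-cased runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(text, node, span=4, stopwords=frozenset()):
    """Count the words occurring within `span` tokens either side of the node word."""
    tokens = tokenise(text)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        # Skip function words on the stop list and repeats of the node itself
        counts.update(w for w in window if w not in stopwords and w != node)
    return counts


sample = ("Austerity measures hurt. New austerity measures followed "
          "the austerity budget.")
print(collocates(sample, "austerity", span=1, stopwords={"the"}).most_common())
```

Because the search term is chosen in advance, this is corpus-based in the sense used here: the category (austerity) comes from the researcher, and the corpus supplies the evidence.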
42
corpus-based or corpus-driven
corpus-driven: explores a corpus trying to find out what is identified as typical or outstanding
Example: looking through a large text corpus concerning austerity, seeking typical key words
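One common way to find key words is to compare each word's frequency in the study corpus with its frequency in a reference corpus and score the difference. The sketch below uses the log-likelihood statistic for this; the slide does not name a statistic, so that choice is an assumption, as are the function names and the tiny token lists.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood for a word with frequency a in a study corpus of c tokens
    and frequency b in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def key_words(study_tokens, ref_tokens, top=5):
    """Words relatively more frequent in the study corpus, highest score first."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    c, d = len(study_tokens), len(ref_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:  # keep positive key words only
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


# Hypothetical toy corpora, already tokenised
study = "austerity cuts austerity budget the the".split()
reference = "the weather the football the news the garden".split()
for word, score in key_words(study, reference):
    print(word, round(score, 2))
```

Here nothing is searched for in advance: the comparison itself proposes which words are worth a closer look, which is the corpus-driven direction of travel.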
43
Issues
What is the unit of text we are working with?
Can you see the text patterns?
44
Levels of Context
What is the unit of text we are working with?
single words
n-grams
paragraphs
whole texts
genres
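The step from single words to n-grams is small in practice: an n-gram is just a run of n consecutive tokens. A minimal sketch, with a crude tokeniser and a made-up sample sentence as assumptions:

```python
import re
from collections import Counter


def tokenise(text):
    """Very rough tokeniser: lower-cased runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sample = "the cost of living rose and the cost of housing rose"
tokens = tokenise(sample)
trigram_counts = Counter(ngrams(tokens, 3))
print(trigram_counts.most_common(1))
```

The same counting logic applies at every level; only the unit handed to the counter changes.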
45
Corpus Linguistics can operate at any of these levels
and can use comparison
46
Choice of reference corpus
47
Finding patterns
49
Conclusions
Corpus Linguistics tools: relevant to the Social Scientist
traditional tools to be retained
limitations
do not give answers but pointers