Using Corpus Tools in Discourse Analysis
Discourse and Pragmatics, Week 12



What is a corpus?
- A collection of a large number of texts of a particular type, in digital format, that can be easily searched and manipulated with computer programs
What is corpus linguistics?
- The analysis of collections of texts (corpora) with computer tools in order to detect grammatical, lexical or discourse-level patterns, often with the aim of comparing those patterns with those found in other collections of texts

Examples of corpus-assisted discourse analysis
- Flowerdew (1997, 2002)
  - Analysis of the speeches of Gov. Chris Patten and CE Tung Chee Hwa
  - Common themes: free market economy, freedom of the individual, rule of law
  - Divergent themes: democracy, stability and harmony
- Rey (2001)
  - Star Trek characters from 1966 to 1993
  - Female language has shifted from being more relational to more informational
  - Male language has shifted from being more informational to more relational

Advantages of using corpora
- Easily detecting grammatical and lexical patterns in a large number of texts
- Reducing researcher bias
- Efficiently detecting differences among varieties, registers, genres, and Discourses
- Corpus-based (deductive) vs. corpus-driven (inductive) analysis

Disadvantages of using corpora
- Separation of discourse from its social context
- Corpus data usually confined to text (cannot account for images, non-verbal behavior and other aspects of multimodal discourse)
- Frequency does not equal importance (sometimes very important messages are implicit or 'taken for granted' rather than explicit)
- 'People don't say what they mean and people don't mean what they say'
- Words have multiple meanings, and word meanings change over time and according to the context in which they are used

Tools for corpus analysis
- Online corpora and concordancers
  - Collins Bank of English
  - British National Corpus
  - Corpus of Contemporary American English
  - International Corpus of English
- General vs. specialized corpora
- Software tools
  - AntConc
  - ConcApp
  - WordSmith Tools

Preparing corpora
- Collecting data (Internet? Scanning files?)
- .txt files
- Separate files for different texts
- 'Cleaning' files
- 'Tagging'
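The loading and cleaning steps above can be sketched in Python. This is only an illustration under assumptions: the folder layout (one .txt file per text), the function names `clean_text` and `load_corpus`, and the idea that 'cleaning' means stripping leftover markup and extra whitespace are all hypothetical, and tagging is not covered.

```python
import re
from pathlib import Path


def clean_text(raw: str) -> str:
    """Hypothetical cleaning step: drop markup-like tags, collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def load_corpus(folder: str) -> dict[str, str]:
    """Read every .txt file in `folder` into a {filename: cleaned text} mapping."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        corpus[path.name] = clean_text(path.read_text(encoding="utf-8"))
    return corpus
```

Keeping one entry per file preserves the boundary between texts, which later procedures that work text by text (for example dispersion plots) rely on.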

Procedures in corpus analysis
- Type token ratio
- Dispersion plots
- Frequency lists
- Concordance data
- Collocation calculations
- Keyword calculations

Example
- Lady Gaga's lyrics
- Total of 59 songs
- Reference corpus: 100 top songs from November 2010

Type Token Ratio
- Number of types divided by the number of tokens

Type Token Ratio
- A low ratio indicates a narrow range of subjects, lack of variety or frequent repetition
- A high ratio indicates a wide range of subjects, great variation, less frequent repetition
- BNC Written =
- BNC Spoken =
- Baker's Holiday Pamphlets =
- 100 Song Corpus = 9.07
- Gaga Corpus = 11.4
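As a rough illustration of the calculation, here is a minimal Python sketch. The tokenizer is an assumption: it lowercases the text and splits clitics such as 't and 'm into separate tokens, which may differ from the settings of the tool used for the slides. The figures reported above look like percentages, so the sketch also shows the ratio multiplied by 100.

```python
import re


def tokenize(text: str) -> list[str]:
    """Assumed tokenizer: lowercase words, with clitics ('t, 'm) as separate tokens."""
    return re.findall(r"'?[a-z]+", text.lower())


def type_token_ratio(tokens: list[str]) -> float:
    """Number of distinct word forms (types) divided by number of running words (tokens)."""
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty text")
    return len(set(tokens)) / len(tokens)


tokens = tokenize("Oh oh oh I want your love love love")
ratio = type_token_ratio(tokens)   # 5 types over 9 tokens
percent = 100 * ratio              # same ratio expressed as a percentage
```

Because the ratio falls as texts get longer, comparisons are only meaningful between corpora of similar size or when a standardized variant is used.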

Frequency lists

Frequency
- Function words (articles, prepositions, conjunctions, pronouns, etc.)
  - Useful in answering questions about style, register
  - Pronouns can be particularly important
- Content words (nouns, verbs, adjectives, adverbs)
  - Useful in answering questions about topics/Discourses
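A frequency list split into function and content words might be built along these lines. This is a sketch, not the procedure used for the slides: the `FUNCTION_WORDS` set is a deliberately tiny, hypothetical list, and the tokenizer is the same assumed one that separates clitics such as 't.

```python
import re
from collections import Counter

# Hypothetical, very small function-word list; a real study would use a fuller one.
FUNCTION_WORDS = {"i", "you", "the", "and", "it", "me", "my", "your",
                  "a", "to", "of", "in", "'t", "'m", "'ll"}


def tokenize(text):
    """Assumed tokenizer: lowercase words, clitics as separate tokens."""
    return re.findall(r"'?[a-z]+", text.lower())


def frequency_list(tokens):
    """Return (word, count, percent of all tokens) rows, most frequent first."""
    total = len(tokens)
    return [(w, n, 100 * n / total) for w, n in Counter(tokens).most_common()]


def split_by_word_class(freq_list):
    """Separate the rows into function words and content words."""
    function = [row for row in freq_list if row[0] in FUNCTION_WORDS]
    content = [row for row in freq_list if row[0] not in FUNCTION_WORDS]
    return function, content


freq = frequency_list(tokenize("I want your love and I want your revenge"))
function_rows, content_rows = split_by_word_class(freq)
```

Reporting the percentage alongside the raw count makes the lists comparable across corpora of different sizes.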

Top 5 function words
- 100 Song Corpus: I, you, the, and, it
- Gaga Corpus: I, you, the, oh, me
- I = 4.4%, me = 2.03%; I = 5.09%, me = 1.3%
- Murphey 1992: The word count revealed that the total referents in first person (I, me, my, mine, etc.) amounted to 10% of the total words

't (not)
- 100 Song Corpus: ranked 7, 1.3%
- Gaga Corpus: ranked 9, 1.59%

Top 5 content words
- 100 Song Corpus: like, no, can, baby, know; (love) (0.42%)
- Gaga Corpus: love (0.98%), baby, can, want, know

Concordances

- Can reveal contexts of frequent words
- Sorting strategies
- Searching for patterns
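A key-word-in-context (KWIC) concordance with a simple sorting strategy can be sketched as follows. The function names and the default context width are illustrative assumptions; dedicated concordancers offer richer sorting and search options.

```python
def concordance(tokens, node, width=4):
    """Return (left context, node, right context) for every hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def sort_concordance(lines, by="right"):
    """Sort hits alphabetically by their right or left context."""
    index = 2 if by == "right" else 0
    return sorted(lines, key=lambda line: line[index])


tokens = "i want your love and i want your revenge".split()
hits = concordance(tokens, "want", width=2)
by_left = sort_concordance(hits, by="left")
```

Sorting by the words to the right or left of the search word groups similar contexts together, which is what makes recurring patterns visible.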

Concordances

Collocation
- 'Co-location'
- The frequency with which words appear close to other words
- 'You shall know a word by the company it keeps.' (Firth 1957)
- Span (xL, xR): the number of words to the left and right of the search word that are counted
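Counting collocates within a span can be sketched like this. The sketch ranks collocates by raw co-occurrence frequency, in line with the definition above; the function name and parameters are illustrative, and concordancing software may rank by a statistical association measure instead.

```python
from collections import Counter


def collocates(tokens, node, left=1, right=1):
    """Count words occurring within `left` words before and `right` words after `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts


tokens = "i want your love and i want your revenge".split()
span_1_1 = collocates(tokens, "i", left=1, right=1)      # span 1L, 1R
span_5_5 = collocates(tokens, "love", left=5, right=5)   # span 5L, 5R
```

A narrow span such as 1L, 1R mostly picks up grammatical neighbours, while a wider span such as 5L, 5R also captures words that are thematically associated with the search word.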

Top 5 collocates for 'I'
- 100 Song Corpus: 'm, and, can, know, 'll
- Gaga Corpus: 'm, want, 'll, don't, can
- Span: 1L, 1R

Top 5 collocates of 'love'
- 100 Song Corpus: I, you, my, me, the
- Gaga Corpus: I, fu, want, 't, revenge
- Span: 5L, 5R

Keywords
- The frequency of words in a corpus in relation to another corpus
- The statistical significance of a keyword's frequency in a given corpus, relative to a reference corpus
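One common way to score keyness is the log-likelihood statistic, sketched below. The slides do not say which statistic was used, so the choice of log-likelihood is an assumption, as are the function names; only words that are relatively more frequent in the target corpus are kept.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """a, b: frequency of the word in target and reference; c, d: corpus sizes in tokens."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected_target)
    if b > 0:
        ll += b * math.log(b / expected_reference)
    return 2 * ll


def keywords(target_tokens, reference_tokens, top_n=10):
    """Rank words that are relatively more frequent in the target corpus."""
    if not target_tokens or not reference_tokens:
        raise ValueError("both corpora must contain at least one token")
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in target.items():
        b = reference.get(word, 0)
        if a / c > b / d:  # keep positive keywords only
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


target = ["love"] * 5 + ["the"] * 5
reference = ["the"] * 9 + ["love"]
result = keywords(target, reference)
```

A higher score means the difference in relative frequency between the two corpora is less likely to be due to chance; a word with the same relative frequency in both corpora scores zero.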

Keywords

Keywords: semantic domains
- lover*, romance*, love*, loves, fame*, fancy*, ribbons*, glitter, fashion, vanity, rich, presents, famous, retro*, bang*, shake*, dirty*, grease*, bad*, teeth, monster, filthy, oh*, eh*

What does this analysis tell us about Lady Gaga's lyrics?
- Style and texture
- 'Whos doing whats'
- Discourses and ideology