ENG 626 CORPUS APPROACHES TO LANGUAGE STUDIES
exploring frequencies in texts
Bambang Kaswanti Purwo


Adolphs, Svenja (2006), Ch. 3
role of frequency information in characterizing whole texts or collections of texts
techniques and practices in data analysis
▪ quantitative exploration of texts and text collections
  → different types of wordlists; how the wordlists can be used for contrastive studies of different texts
▪ generating hypotheses: frequency lists to inform the generation of hypotheses and research questions
▪ testing hypotheses: electronic text analysis to test existing hypotheses in any area that deals with the use of language
▪ facilitating manual processes: from “manual” to “automated”, e.g. extraction of frequency info not necessarily motivated by a particular research question

some of the software resources to facilitate the research process
▪ software packages to facilitate the manipulation and analysis of electronic texts
  ▫ the generation of frequency counts
  ▫ comparisons of frequency information in different texts
  ▫ different formats of concordance outputs [including Key Word In Context (KWIC)]
» [free of charge via internet]
  ◊ The Compleat Lexical Tutor (Tom Cobb)
  ◊ VIEW: Variation in English Words and Phrases (Mark Davies)
» [commercial]
  ◊ WordSmith Tools (Mike Scott)

basic information about the text
most software packages
▪ allow textual data to be sorted into concordance outputs
▪ produce some basic information about the text or collection of texts
  ▫ average sentence length
  ▫ word length
  ▫ number of paragraphs
  ▫ number of individual running words (tokens)
  ▫ number of different words (types)
  ▫ number of lexical items and number of grammatical items (in tagged corpora)
some of the info can be expressed in terms of ratios: the ratio between grammatical and lexical items in the text (lexical density)
» type-token ratio
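A minimal sketch of how such basic counts could be produced for a plain-text string. The sentence splitter and tokenizer are deliberately naive assumptions, the function name `basic_stats` is hypothetical, and real packages apply their own, more careful rules.

```python
import re

def basic_stats(text):
    """Return simple descriptive counts for a plain-text string."""
    # Assumption: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a token is a run of letters, digits or apostrophes, lower-cased.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    types = set(tokens)
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "types": len(types),
        "avg_sentence_length": len(tokens) / len(sentences),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
    }

sample = "The cat sat on the mat. The dog sat too."
stats = basic_stats(sample)
print(stats)
```

Lexical density is left out here because it needs a part-of-speech tagged corpus, as the slide notes.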

the type-token ratio
▪ to gain some basic understanding of the lexical variation within the text
tokens: the number of running words in a text
types: the number of different words
Example: “This chapter moves from the discussion of design and development of electronic text resources to techniques and practices in data analysis.”
How many tokens? 21
How many types? 19
The type-token ratio: divide the number of tokens by the number of types: 21/19 ≈ 1.11
What is it for?
→ to assess the level of complexity of a particular text or text collection (e.g. comparisons between documents for different types of audiences)
the higher the type-token ratio, the less varied the text
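The count for the example sentence can be checked with a short sketch. It follows the slide's definition (tokens divided by types); many tools report the inverse (types divided by tokens), so the direction of "higher" should be checked before comparing figures. The tokenizer and the function name `type_token_ratio` are assumptions made for illustration.

```python
import re

def type_token_ratio(text):
    """Return (tokens, types, tokens/types) using a naive lower-casing tokenizer."""
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)
    return len(tokens), len(types), len(tokens) / len(types)

sentence = (
    "This chapter moves from the discussion of design and development "
    "of electronic text resources to techniques and practices in data analysis."
)
n_tokens, n_types, ratio = type_token_ratio(sentence)
print(n_tokens, n_types, round(ratio, 2))  # 21 19 1.11
```

The two repeated words are "of" and "and", which is why 21 tokens give 19 types.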

watch out: the overall size of the text(s) on which the ratio is based
→ compare type-token ratios of text(s) of similar length
→ textual complexity
  ▪ sentence and word length
  ▪ linguistic analysis of grammatical structure
  ▪ semantic fields of the individual items
» word lists
● single words
frequency of a word or phrase in different text types is important for the description of the context of use (e.g. for English language teaching)
▪ various word lists exist in the ELT context, e.g. Academic Word List (Coxhead 2000)
▪ spoken vs. written discourse
▪ American vs. British English

word list
▪ frequency order
▪ alphabetical order
▪ lemmatized format
▪ grammatical tags
▪ other analytical tags
word list to account for
▪ individual items
▪ recurrent sequences of two or more items
lemmatized frequency lists group together words from the same lemma (all grammatical inflections of a word: e.g. say, said, saying, says)
▪ there are often variations of meaning between different variants of the lemma (Stubbs 1996, Tognini-Bonelli 2001)
▪ [ELT] beneficial to teach all forms of one lemma together and give priority to the most frequently used form
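A small sketch of the three list formats named above: frequency order, alphabetical order and a lemmatized list. The lemma table `LEMMAS` is a hypothetical hand-made mapping covering only the slide's "say" example; a real study would rely on a lemmatizer or a lemma list supplied with the software.

```python
from collections import Counter
import re

def word_list(text):
    """Count word forms using a naive lower-casing tokenizer (an assumption)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Hypothetical lemma table: maps inflected forms to their lemma.
LEMMAS = {"said": "say", "saying": "say", "says": "say"}

def lemmatized_list(freqs, lemmas=LEMMAS):
    """Merge the counts of all forms that share a lemma."""
    merged = Counter()
    for word, n in freqs.items():
        merged[lemmas.get(word, word)] += n
    return merged

text = "She says he said it. They say she is saying it again."
freqs = word_list(text)
by_frequency = freqs.most_common()   # frequency order
alphabetical = sorted(freqs.items()) # alphabetical order
lemmatized = lemmatized_list(freqs)  # lemmatized format
print(by_frequency[:3])
print(alphabetical[:3])
print(lemmatized["say"])
```

Keeping the unlemmatized counts alongside the merged ones makes it possible to see which form of a lemma is the most frequent.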

Table 3.1: basic information from a frequency list
ten most frequent items in the
▪ spoken CANCODE corpus
▪ written component of the BNC
some of the key differences between the two discourse modes are highlighted:
▪ both contain mainly grammatical items
▪ the spoken corpus includes the personal pronouns I and you (interactive nature of spoken discourse)
▪ yeah: a listener response token in conversation

● recurrent continuous sequences
other terms: “lexical bundles” (Biber et al. 1999), “clusters” (Scott 1996)
corpus research: a large proportion of language is phrasal in nature (observable tendency for particular items to co-occur in a non-random fashion)
collocation: attraction between two words (Ch. 4)
[overall length of the sequence to be determined at the outset; e.g. WordSmith Tools]
Table 3.2: ten most frequent two-word, three-word, and four-word recurrent sequences in the CANCODE corpus
most of the sequences are concerned with
▪ the management of discourse
▪ the deictics: you and I
▪ attempts to establish mutual understanding: know what I mean, I know, I think, do you think, etc.
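Recurrent sequences of a fixed length can be counted with a sliding window over the tokens. This is a sketch under stated assumptions: the function name `recurrent_sequences`, the tokenizer and the `min_freq` cut-off are illustrative choices, not the procedure of any particular package, and the window here runs across sentence boundaries.

```python
from collections import Counter
import re

def recurrent_sequences(text, n, min_freq=2):
    """Return n-word sequences occurring at least min_freq times, most frequent first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Build n-grams by zipping the token list with shifted copies of itself.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in ngrams)
    return [(seq, c) for seq, c in counts.most_common() if c >= min_freq]

sample = "You know what I mean. I think you know what I mean, do you think?"
four_word = recurrent_sequences(sample, 4)
two_word = recurrent_sequences(sample, 2)
print(four_word)
print(two_word)
```

As on the slide, the sequence length `n` has to be fixed before counting.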

● comparing frequencies in text collections of different sizes
How to compare the frequencies of individual items in two corpora of different sizes?
▪ represent them as a percentage of the overall number of words in the respective corpora
▪ use a norming technique for frequency counts
  ▫ divide the raw frequency of individual items by the total number of words in the text
  ▫ decide on an appropriate number of words to form the basis of the norm
  ▫ multiply the result by this figure
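The norming steps can be sketched as one function. The raw counts below are invented for illustration only; the corpus sizes echo the 35,000-word and five-million-word corpora discussed later, and the basis of 10,000 words is an arbitrary choice.

```python
def normalized_frequency(raw_count, corpus_size, base=10_000):
    """Divide the raw frequency by corpus size, then multiply by the chosen basis."""
    return raw_count / corpus_size * base

# Hypothetical raw counts of one item in two corpora of different sizes.
small = normalized_frequency(700, 35_000)        # per 10,000 words
large = normalized_frequency(50_000, 5_000_000)  # per 10,000 words
print(round(small, 1), round(large, 1))
```

With `base=100` the same function returns the percentage mentioned in the first bullet.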

» keywords
◊ keywords = items that occur
  ▪ either with a significantly higher frequency (positive keywords)
  ▪ or with a significantly lower frequency (negative keywords)
  in a text or collection of texts when compared to a larger reference corpus (Scott 1997)
◊ keywords are identified on the basis of
  ▪ statistical comparisons of word frequency lists derived from the target corpus and the reference corpus
  ▪ [via a chi-square or a log-likelihood analysis] each item in the target corpus is compared to its equivalent in the reference corpus, and the statistical significance of the difference is calculated
→ to generate words that are characteristic or uncharacteristic in a given target corpus
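A sketch of a log-likelihood keyness score for a single item, using the two-corpus formulation commonly applied to word frequency lists (observed versus expected counts in each corpus). The function names and all counts are assumptions for illustration; software packages may differ in detail, for example in how they treat zero frequencies.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood score comparing one item's frequency in two corpora."""
    total = freq_target + freq_ref
    combined_size = size_target + size_ref
    # Expected counts if the item were equally frequent in both corpora.
    expected_target = size_target * total / combined_size
    expected_ref = size_ref * total / combined_size
    score = 0.0
    if freq_target > 0:
        score += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score

def keyness(freq_target, size_target, freq_ref, size_ref):
    """Return the score and whether the item is relatively over- or under-used."""
    score = log_likelihood(freq_target, size_target, freq_ref, size_ref)
    overused = freq_target / size_target > freq_ref / size_ref
    label = "positive" if overused else "negative"
    return score, label

# Hypothetical counts: target corpus of 35,000 words, reference of 5,000,000 words.
pos_score, pos_label = keyness(700, 35_000, 50_000, 5_000_000)
neg_score, neg_label = keyness(5, 35_000, 50_000, 5_000_000)
print(pos_label, neg_label)
```

A score above 3.84 corresponds to p < 0.05 on a chi-square distribution with one degree of freedom; keyword tools usually let the user set a stricter threshold.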

● single keywords
◊ on the basis of a 35,000-word corpus: the spoken language of health professionals (HP)
  a study of telephone calls made to the British advice helpline provided by the National Health Service (NHS Direct)
  → the data from the medical consultations was recorded
◊ five-million-word CANCODE corpus of general spoken English
▪ most frequent items in both corpora: grammatical items
▪ distribution of personal pronouns: Health Service corpus is “other-oriented”, you most frequent
▪ the reverse frequency order of you and I
▪ right in Health Service, yeah in CANCODE: both are listener response tokens
  ▫ right signals a more transactional nature
  ▫ yeah signals an interactional nature (encourages the speaker to continue with the turn)

▪ comparison of frequency lists can help in the characterization of different spoken genres
▪ keyword analysis (below), based on a log-likelihood calculation, is better suited to highlight the main elements that are characteristic of a particular text or collection of texts
Table 3.4 shows the top 10 positive keywords
the list gives a better idea of the content of the texts in the HP corpus
▪ reference to medication (antibiotics)
▪ ailments (diarrhoea)
▪ the nature of the discourse (information)
▪ the mode of the discourse (call)
▪ the medical context (NHS, Direct)

▪ keywords that mark listener response in an advice-giving setting (ok, okay)
▪ patient-oriented nature (you, your)
Table 3.5 confirms the result of the analysis of positive keywords
▪ the discourse in the HP corpus is oriented towards the hearer who phones in with a health problem → you, your
▪ third person pronouns: negative keywords (low in HP corpus)
▪ past tense verb was: also a negative keyword → HP reports current medical concerns in the present tense
▪ laughter ([laughs]): significantly more in CANCODE → relatively serious nature of medical consultation in HP

● key sequences
analysis of keywords can be extended to include extended recurrent sequences
Table 3.6: key sequences provide even stronger evidence of the particular domain of HP discourse
▪ quite a few of the recurrent sequences are “automated responses” marking the beginning of telephone interaction with NHS Direct
▪ other sequences relate to the gathering of basic information about the caller
▪ the most significant negative key sequence in the HP corpus: I don’t know (professionals providing knowledge and advice)