ENG 626 CORPUS APPROACHES TO LANGUAGE STUDIES introduction (02) Bambang Kaswanti Purwo


corpus linguistics
▪ not a particular linguistic paradigm (such as sociolinguistics, psycholinguistics, computational linguistics)
▪ a way of doing linguistics
▪ a methodological basis for pursuing linguistic research [Meyer 2002]
▪ theory-driven approach
▪ data-based approach [to be elaborated later]

Corpus Linguistics
▪ first appeared in the early 1980s [McEnery et al. 2006]
▪ corpus-based language study, corpus methodology: pre-Chomskyan period
▪ Boas (1941); Sapir, Newman, Bloomfield, Pike
▪ data storage: shoeboxes filled with paper slips rather than computers
  ▪ simple collections of written or transcribed texts
  ▪ not representative
▪ corpus-based = empirical and based on observed data
late 1950s: corpus methodology
▪ severely criticized
▪ marginalized
▪ abandoned
▪ the size of “shoebox corpora”: very small

developments of powerful computers
▪ increasing power
▪ massive storage
▪ relatively low cost
corpus
▪ “any body of text” (McEnery and Wilson 2001), i.e. any collection of recorded instances of spoken or written language
▪ a collection of texts or parts of texts upon which some general linguistic analysis can be conducted (Meyer 2002)
▪ any collection of texts, written or spoken, which is stored on a computer (O’Keeffe et al. 2007); large amounts of texts can be stored and analyzed using analytical software
▪ collections of texts (or parts of texts) that are stored and accessed electronically (Hunston 2002)

linguistic theory and description
Chomsky’s three levels of adequacy:
▪ observational adequacy
▪ descriptive adequacy
▪ explanatory adequacy
What does it mean if a theory or a description achieves observational adequacy?
It is able to describe which sentences in a language are grammatically well formed.
(a) He studied for the exam.
(b) *Studied for the exam.
descriptive adequacy:
▪ not only describe
▪ specify the abstract grammatical properties making the sentences well formed: English requires an explicit subject
explanatory adequacy:
▪ use abstract principles applicable beyond the language under study → universal grammar (UG)
▪ English, unlike Spanish or Indonesian, is not a language which permits “pro-drop”

Chomsky’s theory of principles and parameters
language acquisition: “the parameters of UG” vs. “the norms of the language being acquired”
pro-drop is a consequence of the “null-subject parameter”
▪ speakers acquiring English set the parameter to negative
▪ speakers acquiring Indonesian set the parameter to positive
generative grammar
▪ emphasis is on universal grammar
▪ explanatory adequacy a high priority
elements of a language: part of the “core” or part of the “periphery”
▪ core: “pure instantiations of UG”
▪ periphery: “marked exceptions”

Generative Grammar (GG)
a. little concern for variation in a language
b. variation is limited to nonsubstantive elements of the lexicon and general properties of lexical items
c. (a) and (b) belong to the periphery of a language
d. only the elements that are part of the core are relevant for purposes of theory construction
e. (d) is the idealist view of language
f. this is the goal of the minimalist theory, “a theory of the initial state”: a theory of what humans know about language “in advance of experience”
e. the real world of the language and the complexity of the structure that comes out of it is not (yet) their concern

Corpus Linguistics (CL)
f. (e) is what CL is interested in studying
g. complexity and variation are inherent in language
h. very high priority on descriptive, not explanatory, adequacy
i. CL is very skeptical of the highly abstract and decontextualized discussion of language (promoted by GG)
j. such discussions are too far removed from actual language use
the primary concern of CL is an accurate description of language
the primary concern of GG is a theoretical discussion of language that advances our knowledge of universal grammar
“formalists” (generative grammarians) vs. “functionalists”

functionalists
▪ are interested in language as a communication tool: how speakers and writers use language to achieve various communicative goals
▪ approach the study of language from a perspective different from that of formalists (generative grammarians)
formalists
▪ are interested in describing the form of linguistic constructions
▪ use these descriptions to make general claims about Universal Grammar (UG)

I made mistakes [active] vs. Mistakes were made by me [passive]
generative grammarians
▪ are interested in the structural changes in word order
▪ make more general claims about the movement of constituents in natural language: the movement of NPs in English actives and passives is part of a more general process, “NP [noun phrase]-movement”
a functionalist
▪ is more interested in the communicative potential of actives and passives in English
▪ to study this potential, investigates the linguistic and social contexts favoring or disfavoring the use of, e.g., a passive rather than an active construction

context: a politician embroiled in a scandal
of these three possible constructions, which one to choose?
(1) I made mistakes.
(2) Mistakes were made by me.
(3) Mistakes were made.
the agentless passive construction (3) allows him/her
▪ to admit that something went wrong
▪ [at the same time] to evade responsibility for the wrongdoing by being quite imprecise about exactly who made the mistakes
corpora consist of texts (or parts of texts)
→ enable linguists to contextualize their analysis of language
→ very well suited to more functionally based discussion of language
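A functional study of this kind starts by finding the constructions in a corpus. The following is a minimal sketch of how sentences might be sorted into the three types above. The pattern `PASSIVE` and the function `classify` are hypothetical names, and the regular expression is a crude heuristic assumed for illustration (a form of "be" plus a word in -ed/-en or one of a few listed participles), not a syntactic parser. It treats a passive as having an agent only when "by" follows the participle directly.

```python
import re

# Crude heuristic (an assumption, not a parser): a form of "be" followed by
# a word ending in -ed/-en, or by one of a few listed irregular participles.
PASSIVE = re.compile(
    r"\b(?:is|are|was|were|been|being|be)\s+(?:\w+(?:ed|en)|made|done|said)\b",
    re.IGNORECASE,
)

def classify(sentence):
    """Label a sentence 'agentless passive', 'passive with agent' or 'other'."""
    match = PASSIVE.search(sentence)
    if not match:
        return "other"
    # Only text immediately after the participle is checked for a "by" phrase.
    rest = sentence[match.end():]
    if re.match(r"\s+by\b", rest):
        return "passive with agent"
    return "agentless passive"

for s in ["I made mistakes.", "Mistakes were made by me.", "Mistakes were made."]:
    print(f"{s:<28} {classify(s)}")
```

On real corpus data such a pattern would over- and under-match (e.g. adjectival "was tired", or an agent phrase separated from the verb), so the hits would still need manual checking in context.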

(1) Jack gave a flower to Ann.
(2) Jack gave Ann a flower.
(3) A flower was given to Ann by Jack.
(4) Ann was given a flower by Jack.
syntactic analysis of generative grammarians
▪ (1) → (2): “dative movement”; “preposition deletion”
▪ “passivization”: (3) and (4), two different analyses
  ▪ (1) → (3)
  ▪ (1) → (2) → (4)

functional analysis
(1) Jack gave a flower to Ann.
(2) Jack gave Ann a flower.
▪ what drives an English speaker to utter (1) instead of (2) or (2) instead of (1)?
▪ what question triggers the speaker to say (1) instead of (2) or (2) instead of (1)?
sentence [- context] vs. utterance [+ context]
A1: What did Jack give to Ann?
B1: Jack gave a flower to Ann.
B2: Jack gave Ann a flower.
A2: Whom did Jack give a flower to?

corpus
▪ naturally occurring language
▪ assembled with a particular purpose in mind
▪ assembled to be representative of some language or text type
▪ not a random collection of texts
representative
▪ a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language (McEnery et al. 2006)
▪ a corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety

corpus: a possible classification
▪ medium: spoken corpora (e.g. London-Lund corpus) vs. written corpora (e.g. Lancaster-Oslo/Bergen corpus (LOB)) vs. mixed corpora (British National Corpus (BNC) or Bank of English)
▪ national varieties: British corpora (e.g. Lancaster-Oslo/Bergen corpus) vs. American corpora (e.g. Brown corpus) vs. an international corpus of English
▪ historical variation: diachronic corpora (Helsinki corpus, cf. the ICAME home page) vs. synchronic corpora (Brown, LOB, BNC) vs. corpora which cover only one stage of language history (corpus of Old or Middle English, Shakespeare corpora)
▪ geographical variation/dialectal variation: corpus of dialect samples (e.g. Scots) vs. mixed corpora (the BNC spoken component includes samples of speakers from all over Britain)

▪ age: corpora of adult English vs. corpora of child English (English components of CHILDES)
▪ genre: corpora of literary texts vs. corpora of technical English vs. corpora of non-fiction (e.g. news texts) vs. mixed corpora covering all genres
▪ open-endedness: closed, unalterable corpora (e.g. LOB, Brown) vs. monitor corpora (Bank of English)
▪ availability: commercial vs. non-commercial research corpora, online corpora vs. corpora on ftp servers vs. corpora available on floppy disks or CD-ROMs

Why does corpus linguistics use computers to manipulate and exploit language data?
electronic corpora have advantages unavailable to their paper-based equivalents
▪ process and manipulate the data rapidly and easily (e.g. searching, selecting, sorting, and formatting)
▪ process machine-readable data accurately and consistently
▪ computers can avoid human bias in an analysis, making the result more reliable
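The searching, selecting, sorting and formatting mentioned above can be illustrated with a keyword-in-context (KWIC) display, the basic output of a concordancer. This is a minimal sketch only: the function name `kwic`, the `width` parameter and the three-sentence sample are assumptions made for illustration, and real concordance software tokenizes and sorts far more carefully.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left, node, right) hits with `width` tokens of context each side."""
    tokens = re.findall(r"[A-Za-z']+", text)           # searching: split into words
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():             # selecting: keep matches only
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    # sorting: order the hits alphabetically by their right-hand context
    return sorted(hits, key=lambda hit: hit[2].lower())

sample = "I made mistakes. Mistakes were made by me. Mistakes were made."
for left, node, right in kwic(sample, "made"):
    # formatting: align the keyword in a central column
    print(f"{left:>25} [{node}] {right}")
```

Sorting by the right-hand context groups together hits that continue in the same way, which is how recurring patterns after the keyword become visible at a glance.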

▪ The Brown corpus and the Lancaster-Oslo/Bergen corpus (LOB): Some well-known corpora from the beginnings of the computer age are the Brown corpus of written American English and the Lancaster-Oslo/Bergen corpus of written British English. The Brown corpus was compiled in the 1960s [the first modern corpus of the English language], its British counterpart in the 1970s. Both consist of around one million tokens (i.e. words, counted every time they appear).
▪ The London-Lund corpus is another corpus of British English created around that time, but it differs from the Brown and the LOB in that it exclusively contains transcripts of spoken material, collected at the Survey of English Usage at University College London. The London-Lund corpus, the Brown corpus, the LOB and other corpora are now available on CD-ROM as the ICAME collection of English texts. The International Computer Archive of Modern and Medieval English (ICAME), situated at Bergen in Norway, offers a wealth of information on these corpora.
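Counting tokens, i.e. words counted every time they appear, can be sketched as follows. The function name `token_type_counts` and the sample text are assumptions for illustration. The count of distinct word forms (often called types, a term the slide itself does not use) is included only for contrast, and the simple regular expression stands in for the more careful tokenization a real corpus would need.

```python
import re
from collections import Counter

def token_type_counts(text):
    """Return (number of tokens, number of distinct word forms, frequency table)."""
    # Lower-case the text and treat each run of letters/apostrophes as one token.
    tokens = re.findall(r"[a-z']+", text.lower())
    freq = Counter(tokens)
    return len(tokens), len(freq), freq

sample = "Jack gave a flower to Ann. Jack gave Ann a flower."
n_tokens, n_types, freq = token_type_counts(sample)
print(n_tokens, "tokens;", n_types, "distinct word forms")
print(freq.most_common(3))
```

Here "Jack" contributes two tokens but only one distinct word form, which is why a corpus size stated in tokens is always at least as large as its vocabulary.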

▪ The Bank of English was initiated in 1991 by COBUILD (a division of HarperCollins publishers) and the University of Birmingham. The main purpose of the Bank of English is and has been to provide a textual database for the compilation of dictionaries and for language studies. The Bank of English is a monitor corpus (i.e. new material is constantly added). By now the corpus comprises more than 320 million words.
▪ The British National Corpus was compiled by a consortium of British publishers and of academic institutions such as Oxford University Computing Services, Lancaster University's Centre for Computer Research on the English Language and the British Library. It is now a 100-million-word corpus of modern British English, both written and spoken, including everyday conversations [a hundred times larger than the Brown corpus]. It is available on CD-ROM for research purposes; we have a copy at our department.

▪ The International Corpus of English (ICE) will ultimately be a collection of one-million-word corpora from each country or region where English is spoken as a first language. The corpus consists of a written and a spoken component. The Survey of English Usage, situated at University College London, is responsible for this project. The home page of the Survey provides information on a variety of research projects, including the International Corpus of English (ICE).
▪ The CHILDES system (mirror of the American site in Antwerp): This is the home page for the Child Language Data Exchange System (CHILDES). In particular, you'll find the CHILDES database, a collection of child language transcript data from a number of projects in different languages (including English and German).

Examples of English language corpora
▪ The Bank of English: written and spoken English (used extensively by researchers and for the COBUILD series of English language books)
▪ The BNC: written and spoken British English (used extensively by researchers and by the Oxford University Press, Chambers and Longman publishing houses)
▪ CANCODE (Cambridge and Nottingham Corpus of Discourse in English): spoken British English (used extensively by researchers and Cambridge University Press)
▪ ICE (International Corpus of English): international varieties of spoken and written English (most of the corpus is not yet available)

▪ Brown University Corpus & LOB (Lancaster-Oslo-Bergen) Corpus: parallel corpora of written texts (but now rather outdated)
▪ London-Lund Corpus (Survey of English Usage): spoken British English (used very extensively by researchers, but it is now quite old)
▪ Santa Barbara Corpus: spoken American English (most of the corpus is not yet available)
▪ Hong Kong Corpus of Spoken English (still being compiled; 1 million of the target 1.5 million words have been collected so far)
▪ ICAME (International Computer Archive of Modern and Medieval English): a centre which aims to coordinate and facilitate the sharing of computer-based corpora

Online corpora
▪ Experimental BNC Website: The British National Corpus consortium currently offers a BNC online service which allows everyone with access to the internet to register for an account on the BNC server (free for twenty days of unlimited usage).
▪ Shakespeare Online Corpus
▪ Concordance browsing: This site allows you to search a number of English literary classics, including the Brontë novels, Shakespeare and James Joyce's Ulysses, with the help of the concordance program TactWeb. It is easy to use, even for absolute novices in the area.