McEnery, T., Xiao, R. and Y. Tono. Corpus-based language studies


McEnery, T., Xiao, R. and Y.Tono. 2006. Corpus-based language studies. Routledge. Unit A 1. Corpus linguistics: the basics (pp3-11) http://www.cl2011.org.uk/history-of-corpus-linguistics.html

A1.2 CORPUS LINGUISTICS: PAST AND PRESENT The term "corpus linguistics" first appeared in the early 1980s, but corpus-based language study has a substantial history. The earliest examples date to late 19th-century Germany. In 1897, the German linguist J. Käding used a large corpus of about 11 million words to analyse the distribution of letters and their sequences in the German language.

A1.2 CORPUS LINGUISTICS: PAST AND PRESENT Other early linguists who used corpora to study language include Franz Boas (Handbook of American Indian Languages, 1911), Zellig Harris (Methods in Structural Linguistics, 1951), Charles C. Fries (The Structure of English, 1952),

A1.2 CORPUS LINGUISTICS: PAST AND PRESENT Leonard Bloomfield (Language, 1933), Archibald A. Hill and others, mostly American structural and field linguists (see TERMS AND CONCEPTS). Some of them also started to use corpora in the pedagogical study of foreign languages. Thus, the corpus methodology dates back to the pre-Chomskyan period;

A1.2 CORPUS LINGUISTICS: PAST AND PRESENT In the late 1950s the corpus methodology was severely criticized – it became marginalized. Chomsky rejected the use of a corpus as a tool for linguistic studies, arguing that linguists must model language on competence (kompetencija) instead of performance (atliktis).

A1.2 CORPUS LINGUISTICS: PAST AND PRESENT Even though the early linguists used “shoeboxes” of paper slips instead of computers, their methodology was essentially “corpus-based” (empirical, i.e. based on observable data).

A1.2 CORPUS LINGUISTICS: PAST AND PRESENT Using paper slips and human hands and eyes, it was impossible to analyse large bodies of language data. Consequently the corpora of the time could rarely avoid being ‘skewed’.

A1.2 CORPUS LINGUISTICS: PAST AND PRESENT Corpus linguistics was not abandoned completely; however, it was not until the 1980s that linguists began to show an increased interest in the use of corpora for research.

A1.2 CORPUS LINGUISTICS: PAST AND PRESENT But with the development of powerful computers (especially their processing power and massive storage at relatively low cost), the exploitation of massive corpora became possible. The marriage of corpora with computer technology rekindled interest in the corpus methodology.

A1.2 CORPUS LINGUISTICS: PAST AND PRESENT Nowadays, the corpus methodology enjoys widespread popularity. It has opened up or foregrounded many new areas of research. Corpora have revolutionized nearly all branches of linguistics.

A1.3 WHAT IS A CORPUS? Sinclair (1996): “A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.” Thus a corpus is different from a random collection of texts;

A1.3 WHAT IS A CORPUS? Definition of a corpus: “A collection of sampled texts, written or spoken, in machine-readable form which may be annotated with various forms of linguistic information” (McEnery et al. 2006:4)

A1.3 WHAT IS A CORPUS? There are many ways to define a corpus, but there is an increasing consensus that a corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is

A1.3 WHAT IS A CORPUS? (3) sampled to be (4) representative of a particular language or language variety. The problem: what counts as representative? (See Unit A2.)

A1.4 WHY USE COMPUTERS TO STUDY LANGUAGE? Advantages of electronic corpora: the speed of processing data; the accuracy of processing data; the ease of manipulating data (e.g. searching, selecting, sorting); minimal costs; computers avoid human bias in an analysis, so the result is more reliable.
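The "searching, selecting, sorting" that a computer makes easy can be illustrated with a keyword-in-context (KWIC) listing. The following is only a minimal sketch: the function name kwic, the sample sentences and the choice to sort by right-hand context are assumptions made for illustration, not something taken from the book.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left, node, right) tuples for every hit of `keyword`.

    `width` is the number of tokens kept on each side of the hit.
    """
    # Very simple tokenisation: lower-case runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:                      # searching
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))    # selecting
    # Sorting by right-hand context groups similar continuations together.
    return sorted(lines, key=lambda line: line[2])

# Hypothetical three-sentence sample, invented for this sketch.
sample = "The corpus is large. A corpus can be annotated. This corpus was sampled."
for left, node, right in kwic(sample, "corpus"):
    print(f"{left:>20} | {node} | {right}")
```

Doing the same by hand with paper slips would mean re-reading the whole text for every new search word, which is the practical point the slide makes about speed and ease of manipulation.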

A1.5 THE CORPUS-BASED APPROACH VS. THE INTUITION-BASED APPROACH By using the intuition-based approach, researchers can invent purer examples for analysis. However, it should be applied with caution. WHY?

A1.5 THE CORPUS-BASED APPROACH VS. THE INTUITION-BASED APPROACH It is possible to be influenced by one’s dialect or sociolect: what appears acceptable to one speaker may not be so for another. When one invents an example to support or disprove an argument, the utterance may not represent typical language use.

A1.5 THE CORPUS-BASED APPROACH VS. THE INTUITION-BASED APPROACH The corpus-based approach, in contrast, draws upon authentic or real texts. Results based on introspection (i.e. intuition) are difficult to verify – introspection is not observable. But a corpus can yield reliable quantitative data.

A1.6 CORPUS LINGUISTICS: A METHODOLOGY OR A THEORY? Corpus linguistics is indeed a methodology rather than an independent branch of linguistics (but see Tognini-Bonelli 2001:1); CL is not restricted to a particular aspect of language; CL is a whole system of methods and principles of how to apply corpora in language studies and teaching/learning;

A1.7 CORPUS-BASED VS. CORPUS-DRIVEN APPROACHES Corpus linguistics is a research approach that has developed over the past few decades to support empirical investigations of language variation and use. Corpus studies have used two major research approaches: ‘corpus-based’ (tekstynais paremtas tyrimas) and ‘corpus-driven’ (tekstyno inspiruotas tyrimas).

A1.7 A CORPUS-BASED APPROACH Corpus-based research assumes the validity (pagrįstumas) of linguistic forms and structures derived from linguistic theory. The primary goal of research is to analyse the systematic patterns of variation and use for those pre-defined linguistic features. Corpora are used mainly to test or exemplify theories formulated before large corpora became available.

A1.7 A CORPUS-DRIVEN APPROACH Corpus-driven research is more inductive, so that the linguistic constructs themselves emerge from analysis of a corpus. The theoretical statements reflect directly the evidence provided by the corpus. However, the distinction between ‘corpus-based’ and ‘corpus-driven’ is overstated. In this book the term ‘corpus-based’ is used in a broad sense, encompassing both approaches.

THE SCOPE OF CORPUS LINGUISTICS Kennedy, G. 1998. An introduction to Corpus linguistics. Longman.

Corpus linguistics Corpus linguistics is based on bodies of text as the source of evidence for linguistic description and argumentation. It is a methodology for linguistic description. The focus of study is on performance rather than competence, and on observation of language in use leading to theory rather than vice versa (cf. a corpus-driven approach). It is NOT a separate branch of linguistics.

Corpus-based research In the case of corpus-based research, the evidence of what is possible in a language is derived directly from texts. Work in corpus linguistics is currently associated with several quite different activities.

Activities in Corpus linguistics The first group of researchers consists of corpus makers or compilers. These scholars are concerned with the design and compilation of corpora, the collection of texts and their preparation and storage for later analysis.

Activities in Corpus linguistics A second group of researchers is concerned with developing tools for the analysis of corpora. A third group of researchers consists of descriptive linguists whose main concern has been to describe reliably the lexicon and grammar of languages.

Concerns of descriptive linguistics Corpus-based descriptive linguistics is concerned with how often particular forms are used. This model allows us to study variation in text types, language change and regional and other varieties of language.
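Comparing "how often" a form is used across text types or varieties usually requires normalising raw counts, because the sub-corpora being compared rarely have the same size. The sketch below expresses counts per million words; the function name per_million and all the figures are hypothetical and serve only to show the arithmetic.

```python
def per_million(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    # Multiply before dividing so that simple cases stay exact.
    return raw_count * 1_000_000 / corpus_size

# Hypothetical figures: (hits for some form, total words in the sub-corpus).
counts = {
    "spoken": (120, 400_000),
    "written": (150, 1_000_000),
}

for name, (hits, size) in counts.items():
    print(name, per_million(hits, size))
```

With these invented numbers the written sub-corpus has more raw hits (150 against 120), yet the form is twice as frequent in the spoken one once size is taken into account.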

Concerns of descriptive linguistics The corpus provides contexts for the study of meaning in use. The corpus makes it possible to extract linguistic information from texts on a scale previously undreamed of.

Activities in Corpus linguistics A fourth area of activity is concerned with using corpus material for language learning and teaching, and natural language processing by machine, including speech recognition and translation.

Activities in Corpus linguistics Corpus linguistics is also concerned with the statistical distribution of linguistic items in the context of use, e.g. word count to discover the most frequent words and grammatical structures for language teaching purposes.
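A word count of this kind can be sketched in a few lines. This is an illustrative example only: the tokenisation rule (lower-cased runs of letters and apostrophes), the function name word_frequencies and the sample text are assumptions, and a real study would use a properly sampled corpus and a more careful tokeniser.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count word forms in `text`, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Tiny invented sample standing in for a corpus.
sample = "The cat sat on the mat. The dog sat too."
freq = word_frequencies(sample)

# List the most frequent word forms first, as in a frequency list
# prepared for language teaching.
for word, count in freq.most_common(3):
    print(word, count)
```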

The current concerns of corpus linguistics The current concerns of corpus linguistics include: improved ways of annotating corpora (i.e. adding interpretive linguistic information), the tagging (i.e. attaching a label) of parts of speech and the senses of polysemous word forms, improved automated parsing (i.e. syntactic analysis),

Analyses of the corpus can contribute to: the making of dictionaries, word lists, descriptive grammars (cf. LGSWE), diachronic and synchronic comparative studies of speech varieties, and to stylistic, pedagogical and other applications.

TERMS AND CONCEPTS Annotation – the process of encoding interpretive linguistic information in a corpus (in the general sense, annotating means adding short notes to explain something). Tagging – an alternative term for annotation, most often attaching a word-class label to each word. Parsing (also treebanking or bracketing) – syntactic analysis of sentences into their constituents. POS – part of speech.
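To make POS tagging concrete, the sketch below reads a line in which each token carries its word-class label after an underscore, and then uses the labels to select words. The word_TAG layout, the Penn-Treebank-style labels (DT, NN, VBD, VBN), the function name read_tagged and the sample line are all assumptions chosen for illustration; actual corpora use a variety of tagsets and encodings.

```python
def read_tagged(line):
    """Split a 'word_TAG word_TAG ...' line into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST underscore, so a word that itself
        # contains an underscore keeps it.
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs

# Hypothetical tagged sentence, invented for this sketch.
tagged = "The_DT corpus_NN was_VBD sampled_VBN ._."
pairs = read_tagged(tagged)

# Once the text is annotated, the labels can be used for retrieval,
# e.g. selecting every noun.
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(pairs)
print(nouns)
```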

TERMS AND CONCEPTS (Chomsky) competence (kompetencija; internalised knowledge of a language) vs. performance (atliktis; external evidence of language competence, its usage on particular occasions). Empirical data – the empirical method is generally taken to mean the approach of using a collection of data as the basis for a theory or to derive a conclusion in science.

TERMS AND CONCEPTS Corpus-based (tekstynais paremtas tyrimas) research assumes the validity of linguistic forms and structures derived from linguistic theory. Corpus-driven (tekstyno inspiruotas tyrimas) research is more inductive, so that the linguistic constructs themselves emerge from analysis of a corpus. The theoretical statements reflect directly the evidence provided by the corpus.

TERMS AND CONCEPTS Field linguists (e.g. Boas 1940) - field linguistics is concerned with the description and analysis of previously undescribed languages. Because many undescribed languages are spoken by small groups of people, many field linguists dedicate a great deal of their time to language documentation and language revitalization.

TERMS AND CONCEPTS Structuralism – any approach to linguistic description which views the grammar of a language primarily as a system of relations. Structuralism in this sense derives largely from the work of the Swiss linguist Ferdinand de Saussure (1857-1913). Virtually all 20th-century approaches to linguistics are structuralist in this sense.