Representativeness, balance and sampling in corpus linguistics

Introduction
A corpus is designed to represent a particular language or language variety. It is impossible to analyse every extant utterance or sentence of a given language, so sampling is unavoidable. How can you be sure that the sample you are studying is representative of the language or language variety under consideration?

To ensure representativeness, one must consider two things: balance and sampling.

Representativeness in corpus linguistics
Biber (1993:242): “Representativeness refers to the extent to which a sample includes the full range of variability in a population.”
Population: the language or language variety under study. The corpus is a sample of that population.

Two factors determine the representativeness of a corpus
The range of genres included in the corpus (i.e., balance).
How the text chunks for each genre are selected (i.e., sampling).

Criteria used to select texts for a corpus
External: situational (genre / register).
Internal: linguistic (text types).

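Internal criteria are problematic and circular
Internal criteria, such as the distribution of words or grammatical features, cannot be the primary parameters for selecting corpus data. If those distributions are fixed when the corpus is designed, there is no point in analysing the corpus to discover them: the corpus has been skewed by design.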

Sinclair (1995): The texts or parts of texts to be included in a corpus should be selected according to external criteria so that their linguistic characteristics are independent of the selection process.

Another aspect of representativeness: change over time
A corpus can be viewed as a static language model (this applies to a sample corpus) or as a dynamic language model (this applies to a monitor corpus).

Static view of language
Examples: the Helsinki Diachronic Corpus, the Lancaster-Oslo/Bergen Corpus (LOB) and the Freiburg-LOB Corpus (FLOB).
Sampling frame: LOB covers British English of the early 1960s; FLOB covers British English of the early 1990s.

Representativeness of general and specialized corpora
General corpora (e.g., the BNC) serve as a basis for an overall description of a language or language variety, so they should include samples from a broad range of genres. Specialized corpora tend to be specific to a domain (e.g., medicine or law) or a genre (e.g., fiction, newspaper texts, academic prose).

Balance
The range of text categories included in a corpus determines how balanced it is, and the acceptable balance is determined by the intended uses of the corpus. A balanced corpus usually covers a wide range of text categories that are taken to be representative of the language or language variety under consideration. These categories are typically sampled proportionally for inclusion in the corpus.
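Proportional sampling of text categories can be illustrated with a small sketch. The category names and counts below are hypothetical, and the largest-remainder rounding is one reasonable assumption for turning proportions into whole numbers of samples; the slides do not specify a rounding rule.

```python
def proportional_allocation(category_sizes, total_samples):
    """Split total_samples across categories in proportion to their size.

    category_sizes maps a category name to its size in the population
    (for example, the number of texts in the sampling frame).
    """
    total = sum(category_sizes.values())
    exact = {c: total_samples * n / total for c, n in category_sizes.items()}
    # Start from the rounded-down share of each category.
    alloc = {c: int(x) for c, x in exact.items()}
    leftover = total_samples - sum(alloc.values())
    # Give the remaining samples to the largest fractional parts.
    by_fraction = sorted(exact, key=lambda c: exact[c] - alloc[c], reverse=True)
    for c in by_fraction[:leftover]:
        alloc[c] += 1
    return alloc


# Hypothetical population: 1,000 texts in three categories, 50 samples wanted.
allocation = proportional_allocation(
    {"news": 300, "fiction": 500, "academic": 200}, 50
)
print(allocation)  # {'news': 15, 'fiction': 25, 'academic': 10}
```

Whether proportions should follow the number of texts published, the number of words, or estimated readership is a design decision that depends on the intended uses of the corpus.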

Balance
There is no scientific measure of corpus balance. More typically, corpus builders adopt an existing corpus model when building their own corpus, assuming that balance will be achieved through the adopted model. The BNC is generally accepted as a balanced corpus, and its model has been followed by the American National Corpus, the Korean National Corpus, the Polish National Corpus and the Russian Reference Corpus.

Balance
Balance is a more important issue for a static sample corpus than for a dynamic monitor corpus. Builders of monitor corpora treat size as more important than balance, assuming that a corpus will balance itself once it reaches a substantial size.

Composition of written BNC (table columns: Domain %, Date %, Medium %)
Domain: Imaginative, Arts, Belief and thought, Commerce / finance, Leisure, Natural / pure science, Applied science, Social science, World affairs, Unclassified (1.93).
Date: Unclassified.
Medium: Book, Periodical (31.08), Misc. published, Misc. published, To-be-spoken, Unclassified.

Composition of spoken BNC (table columns: Region %, Interaction type %, Context-governed %)
Region: South, Midlands, North, Unclassified.
Interaction type: Monologue, Dialogue, Unclassified.
Context-governed: Educational / informative, Business, Institutional, Leisure, Unclassified.

Sampling
To obtain a representative sample from a population, first define the sampling unit and the boundaries of the population. For written texts, the sampling unit may be a book, a periodical or a newspaper. The list of sampling units is referred to as the sampling frame.

LOB corpus
Target population: all written English texts published in the United Kingdom in 1961.
Sampling frame: the British National Bibliography Cumulated Subject Index for books, and Willing’s Press Guide 1961 for periodicals.

Sample size
Samples can be full texts (i.e., whole texts) or text chunks (ideal: text fragments of 2,000 running words).
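Drawing a text chunk of about 2,000 running words might look like the sketch below. Two things here are assumptions rather than anything stated in the slides: words are counted by splitting on whitespace, and the chunk starts at a random position in the text (the function name `sample_chunk` is also just illustrative).

```python
import random


def sample_chunk(text, chunk_size=2000, rng=None):
    """Return a contiguous run of chunk_size words from text.

    Assumption: a 'running word' is any whitespace-separated token.
    Texts shorter than chunk_size are returned whole.
    """
    rng = rng or random.Random()
    words = text.split()
    if len(words) <= chunk_size:
        return words
    # Pick a start so that the chunk fits entirely inside the text.
    start = rng.randrange(len(words) - chunk_size + 1)
    return words[start:start + chunk_size]


# Toy text of 5,000 numbered tokens; a seeded generator makes the draw repeatable.
toy_text = " ".join(f"w{i}" for i in range(5000))
chunk = sample_chunk(toy_text, rng=random.Random(0))
print(len(chunk))  # 2000
```

A random start position avoids always sampling text openings, which can differ linguistically from middles and endings.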

A population can be defined in terms of
Language production (demographically oriented).
Language reception (demographically oriented).
Language as a product (text category / genre of language).

Different sampling techniques
Simple random sampling: all sampling units within the sampling frame are numbered, and the sample is chosen using a table of random numbers.
Stratified random sampling: the whole population is first divided into relatively homogeneous groups (strata), and each stratum is then sampled at random.
Demographic sampling, which categorizes sampling units by speaker or writer age, sex and social class, is also a type of stratified sampling.
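Simple random sampling can be sketched in a few lines. This is only an illustration: a pseudo-random number generator stands in for the printed table of random numbers, and the sampling frame is a hypothetical list of titles.

```python
import random


def simple_random_sample(sampling_frame, k, seed=None):
    """Number every unit in the frame, then draw k units without replacement."""
    rng = random.Random(seed)
    # Numbering starts at 1, as it would in a printed sampling frame.
    numbered = list(enumerate(sampling_frame, start=1))
    return rng.sample(numbered, k)


# Hypothetical sampling frame of 100 book titles.
frame = [f"Book {i}" for i in range(1, 101)]
for number, title in simple_random_sample(frame, k=5, seed=42):
    print(number, title)
```

Every unit has the same chance of selection, so rare text types may be missed entirely; this is the weakness that stratification addresses.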

Stratified random sampling: Brown and LOB corpora
The target population for each corpus was first grouped into 15 text categories, such as news reportage, academic prose and different types of fiction. Samples were then drawn from each text category.
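The two-step procedure (group into categories, then sample within each) can be sketched as follows. The frame, category names and per-category quotas are hypothetical and do not reproduce the actual Brown or LOB design.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, per_stratum, seed=None):
    """Group units into strata, then draw a random sample from each stratum.

    stratum_of: function mapping a unit to its stratum name.
    per_stratum: dict mapping a stratum name to the number of units to draw.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = {}
    for name in sorted(strata):
        members = strata[name]
        # Never ask for more units than the stratum contains.
        k = min(per_stratum.get(name, 0), len(members))
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical frame: 30 texts, each tagged with one of three categories.
categories = ["news reportage", "academic prose", "fiction"]
frame = [(f"text{i}", categories[i % 3]) for i in range(30)]
quotas = {"news reportage": 4, "academic prose": 3, "fiction": 2}

picked = stratified_sample(frame, lambda unit: unit[1], quotas, seed=1)
for category, texts in picked.items():
    print(category, len(texts))
```

Because each category is sampled separately, every category is guaranteed to appear in the corpus at its planned size.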

Conclusion
In constructing a balanced, representative corpus, stratified random sampling is to be preferred. For written texts, a text typology established on the basis of external criteria is relevant. For spoken data, demographic sampling is appropriate, but it must be complemented by context-governed sampling so that some context-governed linguistic variation is included in the resulting corpus.