Corpus design & analysis techniques

• Monolingual: general, specialized, comparable
• Bi/Multilingual: parallel, comparable

| Corpus design | (1) MONOLINGUAL | (2a) COMPARABLE | (2b) COMPARABLE | (3) PARALLEL |
|---|---|---|---|---|
| Number of corpora | 1 corpus | 2+ corpora | 2+ corpora | 2+ corpora |
| Type of analysis | Intralingual | Intralingual | Interlingual (cross-linguistic) | Interlingual (cross-linguistic) |
| Number of languages | Monolingual (1 language) | Monolingual (1 language) | Bilingual/multilingual (2+ languages) | Bilingual/multilingual (2+ languages) |
| Type | typical linguistic corpus | translation-driven corpus | translation-driven corpus | translation corpus |
| Corpus content | non-translated language A | translated versus non-translated language A | non-translated language A and B | non-translated language A aligned with translation in B |
| What may be examined | legal language against other genres | translated language against non-translated language | differences and similarities between languages | translation process |

• Monolingual corpus: the most typical corpus used by linguists. It contains non-translated texts written in a single language and supports intralingual analysis, that is, analysis within one language. It can serve descriptive purposes or, when a reference corpus is used, a comparison of legal language with everyday language or other genres. This type of corpus is mainly used in forensic linguistics, but also in monolingual lexicography and in foreign language teaching to prepare study materials. One example is the Cambridge Corpus of Legal English, a 20-million-word collection of legal books and newspaper articles compiled by Cambridge University Press.

• Comparable corpora: a set of at least two monolingual corpora, which may involve one language (a) or at least two languages (b). Zanettin refers to them as “translation-driven corpora” since their design is motivated by translation research or training, yet they do not contain source texts (STs) and corresponding target texts (TTs) (2000: 106).
• Monolingual comparable corpora: they contain a corpus of translations and a corpus of texts created spontaneously in the same language (non-translated language). The main object of analysis is how the translated language differs from the non-translated language (to be discussed later as the ‘textual fit’). An example is the Translational English Corpus at the University of Manchester. This type of corpus is used in translation studies.
• Bilingual or multilingual comparable corpora: they do not contain translated language but spontaneously created texts in two or more different languages. They consist of monolingual corpora designed according to similar criteria and are used for cross-linguistic analysis. In addition to translation studies, this type of corpus is typically associated with contrastive and comparative linguistics. An example is the BOnonia Legal Corpus (BoLC) at the University of Bologna, with an Italian legal subcorpus of 33.5 million words and an English legal subcorpus of 21 million words.

• Parallel corpus: a translation corpus in the strictest sense. It is bilingual or multilingual and may be bi-directional. It contains STs aligned with their translations. Alignment makes parallel corpora more time-consuming to build and, as a result, they are relatively rare. Examples include the MultiJur Multilingual Corpus of Legal Texts at the University of Helsinki, the legal sections of the CLUVI Parallel Corpus at the University of Vigo (Galician-Spanish, Basque-Spanish) and the GENTT Corpus of Textual Genres for Translation at the Jaume I University. This type of corpus is mainly used for research into the translation process and in applied translation studies: to prepare dictionaries, extract terms for terminological databases, train information extraction software, and train translators.

• Narodowy Korpus Języka Polskiego (National Corpus of Polish)
• Korpus Języka Polskiego PWN (PWN Corpus of Polish; full free access at BUG Oliwa)
• Korpus Języka Polskiego IPI PAN (IPI PAN Corpus of Polish)
• British National Corpus (100-million-word collection: 10% spoken, 90% written)
• Proceedings of the Old Bailey, London's Central Criminal Court
• Parallel corpus: JRC-Acquis Multilingual Parallel Corpus, http://langtech.jrc.it/JRC-Acquis.html (Polish corpus of approx. 30 million words)

CORPUS SOFTWARE
• Description of various programs
• Monolingual
• Comparison of KfNgram, N-Gram Phrase Extractor, Wordsmith
• Wordsmith
• KfNgram
• Lexical Tutor/N-Gram
• Corsis (open-access answer to Wordsmith)

• Purpose
• Balance
• Representativeness
Sampling criteria
• Language variety
• Time span
• Full text / extracts
• Sample size
• Target audience
• Overall size
• Translators represented (e.g. according to sociolinguistic variables: gender, mother tongue)

• Wordlists
• Alphabetical lists
• Frequency-ranked lists
• Keywords
• Lists of clusters
• KWIC concordance
• Collocates
• Statistics: average sentence/word length; type/token ratio; lexical density

• What is the purpose of preparing a wordlist?
  - Make a wordlist and analyse lexical vs. function words
  - Make a batch
• Statistics:
  - Average sentence/word length
  - Type/token ratio: if a text is 1,000 words long, it is said to have 1,000 tokens. Many of these words will be repeated, however, and there may be only, say, 400 different words in the text. ‘Types’ are the different words, so the type/token ratio here would be 400/1,000 = 40%.
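As a rough illustration of the wordlist and type/token ratio described on this slide, here is a minimal Python sketch. It is not how Wordsmith or any other tool listed earlier works; the tokenization rule, the function names (`tokenize`, `wordlist`, `type_token_ratio`) and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def wordlist(tokens):
    """Return (word, frequency) pairs, most frequent first; ties are sorted alphabetically."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def type_token_ratio(tokens):
    """Number of distinct words (types) divided by the number of running words (tokens)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Invented sample text, used only to exercise the functions.
sample = "The court held that the contract was void. The contract was not signed."
tokens = tokenize(sample)
freq = wordlist(tokens)

print(freq[:3])  # the three most frequent words
print(f"tokens={len(tokens)} types={len(set(tokens))} TTR={type_token_ratio(tokens):.0%}")
```

Because the ratio falls as texts get longer, comparing raw type/token ratios is only meaningful for texts of similar length.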

• Clusters – words which are found repeatedly together in sequence; recurrent expressions regardless of their idiomaticity and regardless of their structural status
• Related terms: n-grams, p-frames (phrase-frames), lexical bundles, multi-word units, conversational routines, fixed expressions
• Examples of 4-grams: "I don’t think so", "I don’t think I", "but I don’t think"
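Cluster extraction of this kind can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the tokenizer ignores sentence boundaries, so clusters may run across them, and the function names and the sample text are invented for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping forms such as don't."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def clusters(tokens, n=4, min_freq=2):
    """Count every run of n consecutive tokens; keep those seen at least min_freq times."""
    grams = Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [(gram, count) for gram, count in grams.most_common() if count >= min_freq]

# Invented sample text built around the 4-grams quoted on the slide.
text = ("I don't think so. But I don't think I can. "
        "Well, I don't think so, but I don't think I know.")
recurrent = clusters(tokenize(text), n=4, min_freq=2)

for gram, count in recurrent:
    print(count, gram)
```

The `min_freq` threshold is what separates recurrent clusters from one-off word sequences; real tools usually add a minimum number of texts in which the cluster must occur.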

• Keyword – a word whose frequency in a text is unusually high compared with its frequency in another, generally larger, text or corpus of texts (the reference corpus)
• Key words are lexemes which have become cognitively salient through their repetitive, unusually frequent use. They characterise a given text in that they “are used over and over in the text and are crucial to the theme or topic under discussion. (...) Key words are most often words which represent an essential or basic concept of the text” (Larson 1984: 177).

• KWIC concordance – a list of all the occurrences of a specified word or expression in a corpus, each set in the middle of one line of context. KWIC concordances help identify collocates.
• Collocates – words which occur in the neighbourhood (co-text) of the search word
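A KWIC display and a collocate count can be sketched as follows. This is a minimal illustration, not the behaviour of any particular concordancer: the window size (`span`), the function names and the sample sentences are assumptions, and the window ignores sentence boundaries.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def kwic(tokens, node, span=4):
    """Return one (left context, node, right context) triple per occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines

def collocates(tokens, node, span=4):
    """Count the words found within span tokens to the left or right of node.
    A word that falls inside two overlapping windows is counted once per window."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i])
            counts.update(tokens[i + 1:i + 1 + span])
    return counts

# Invented sample sentences, used only to exercise the functions.
text = ("Smoking can cause cancer. Fog may cause delay on the roads. "
        "Charities provide food and provide help.")
tokens = tokenize(text)
lines = kwic(tokens, "cause", span=3)
near_cause = collocates(tokens, "cause", span=3)

for left, node, right in lines:
    print(f"{left:>20}  {node}  {right}")  # node word aligned in a centre column
print(near_cause.most_common(5))
```

Raw co-occurrence counts favour frequent function words, so collocate lists are usually filtered with a stop list or ranked by an association measure instead.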

• Semantic prosody refers to the positive or negative connotative meaning which is transferred to the focus word by the semantic fields of its common collocates (Louw 1993). Stubbs (1995, 1996: 173–4) examines collocates of causal verbs and finds in his corpus that the vast majority of collocates of "cause" are negative, e.g. "accident", "cancer", "commotion", "crisis" and "delay". On the other hand, the verb "provide" has a positive semantic prosody, with collocates such as "care", "food", "help", "jobs", "relief" and "support".