Corpus design & analysis techniques
Monolingual: general, specialized, comparable
Bi/Multilingual: parallel, comparable
| Corpus design | (1) MONOLINGUAL | (2a) COMPARABLE | (2b) COMPARABLE | (3) PARALLEL |
|---|---|---|---|---|
| Number of corpora | 1 corpus | 2+ corpora | 2+ corpora | 2+ corpora |
| Type of analysis | Intralingual | Intralingual | Interlingual (cross-linguistic) | Interlingual (cross-linguistic) |
| Number of languages | Monolingual (1 language) | Monolingual (1 language) | Bilingual/multilingual (2+ languages) | Bilingual/multilingual (2+ languages) |
| Type | typical linguistic corpus | translation-driven corpus | translation-driven corpus | translation corpus |
| Corpus content | non-translated language A | translated versus non-translated language A | non-translated language A and B | non-translated language A aligned with translation in B |
| What may be examined | legal language against other genres | translated language against non-translated language | differences and similarities between languages | translation process |
Monolingual corpus: the most typical corpus used by linguists. It contains non-translated texts written in a single language and supports intralingual analysis, that is, analysis within one language. It can be used for descriptive purposes or, when a reference corpus is available, to compare legal language against everyday language or other genres. This type of corpus is mainly used in forensic linguistics, but also in monolingual lexicography and in foreign language teaching to prepare study materials. An example is the Cambridge Corpus of Legal English, a 20-million-word collection of legal books and newspaper articles compiled by Cambridge University Press.
Comparable corpora: a set of at least two monolingual corpora, which may involve one language (a) or at least two languages (b). Zanettin refers to them as “translation-driven corpora”, since their design is motivated by translation research or training, yet they do not contain source texts (STs) and corresponding target texts (TTs) (2000: 106).

Monolingual comparable corpora: they contain a corpus of translations and a corpus of texts created spontaneously in the same language (non-translated language). The main object of analysis is how translated language differs from non-translated language (to be discussed later as the ‘textual fit’). An example is the Translational English Corpus at the University of Manchester. This type of corpus is used in translation studies.

Bilingual or multilingual comparable corpora: they do not contain translated language but spontaneously created texts in two or more languages. They consist of monolingual corpora designed according to similar criteria and are used for cross-linguistic analysis. In addition to translation studies, this type of corpus is typically associated with contrastive and comparative linguistics. An example is the BOnonia Legal Corpus (BoLC) at the University of Bologna, with an Italian legal subcorpus of 33.5 million words and an English legal subcorpus of 21 million words.
Parallel corpus: a translation corpus in the strictest sense. It is bilingual or multilingual and may be bi-directional. It contains STs aligned with their translations. Alignment makes parallel corpora more time-consuming to build and, as a result, they are relatively rare. Examples include the MultiJur Multilingual Corpus of Legal Texts at the University of Helsinki, the legal sections of the CLUVI Parallel Corpus at the University of Vigo (Galician-Spanish, Basque-Spanish) and the GENTT Corpus of Textual Genres for Translation at the Jaume I University. This type of corpus is mainly used for research into the translation process and in applied translation studies: to prepare dictionaries, extract terms for terminological databases, train information extraction software, and train translators.
- Narodowy Korpus Języka Polskiego (National Corpus of Polish)
- Korpus Języka Polskiego PWN (PWN Corpus of Polish; full free access at BUG Oliwa)
- Korpus Języka Polskiego IPI PAN (IPI PAN Corpus of Polish)
- British National Corpus (100-million-word collection; 10% spoken, 90% written)
- Proceedings of the Old Bailey, London's Central Criminal Court
Parallel corpus:
- JRC-Acquis Multilingual Parallel Corpus, http://langtech.jrc.it/JRC-Acquis.html (the Polish corpus contains approx. 30 million words)
CORPUS SOFTWARE
Description of various programs
Monolingual
- Comparison of KfNgram, N-Gram Phrase Extractor, Wordsmith
- Wordsmith
- KfNgram
- Lexical Tutor/N-Gram
- Corsis (open-access answer to Wordsmith)
- Purpose
- Balance
- Representativeness
- Sampling criteria
- Language variety
- Time span
- Full text / extracts
- Sample size
- Target audience
- Overall size
- Translators represented (e.g. according to sociolinguistic variables: gender, mother tongue)
- Wordlists: alphabetical lists, frequency-ranked lists
- Keywords
- Lists of clusters
- KWIC concordance
- Collocates
- Statistics: average sentence/word length; type/token ratio; lexical density
What is the purpose of preparing a wordlist?
- Make a wordlist and analyse lexical vs. function words
- Make a batch
Statistics:
- Average sentence/word length
- Type/token ratio: if a text is 1,000 words long, it is said to have 1,000 tokens. Many of these words will be repeated, so there may be only, say, 400 different words in the text. ‘Types’ are the different words. The type/token ratio here would be 400/1,000 = 40%.
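The statistics above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular corpus tool computes them: the tokenisation rule (lowercased alphabetic strings, with an optional apostrophe part) is an assumption, and the function names and sample sentence are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and extract word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def type_token_ratio(tokens):
    """Number of distinct words (types) divided by number of running words (tokens)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def average_word_length(tokens):
    """Mean number of characters per token."""
    if not tokens:
        return 0.0
    return sum(len(t) for t in tokens) / len(tokens)


# Invented sample sentence, used only to exercise the functions.
sample = "The court held that the contract was void and the claim was dismissed"
tokens = tokenize(sample)

print("tokens:", len(tokens))
print("types:", len(set(tokens)))
print("type/token ratio:", round(type_token_ratio(tokens), 3))
print("average word length:", round(average_word_length(tokens), 2))

# The slide's example: 1,000 tokens built from 400 different words.
slide_example = [f"w{i % 400}" for i in range(1000)]
print(type_token_ratio(slide_example))  # 0.4, i.e. 40%
```

Because longer texts repeat more words, the raw ratio falls as text length grows, so it is best compared across texts of similar size.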
Clusters: words which are found repeatedly together in sequence; recurrent expressions regardless of their idiomaticity and regardless of their structural status.
Related terms: n-grams, p-frames (phrase-frames), lexical bundles, multi-word units, conversational routines, fixed expressions.
4-grams: I don’t think so; I don’t think I; but I don’t think
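A minimal sketch of cluster extraction, assuming a simple tokeniser and a frequency threshold of 2: it slides a window of n words over the token list and keeps the sequences that recur. The function names and the sample text are hypothetical, and this sketch lets n-grams cross sentence boundaries.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract word tokens, keeping apostrophes inside words."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngrams(tokens, n):
    """Return every sequence of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clusters(tokens, n=4, min_freq=2):
    """Return (n-gram, frequency) pairs occurring at least min_freq times, most frequent first."""
    counts = Counter(ngrams(tokens, n))
    return [(" ".join(gram), freq) for gram, freq in counts.most_common() if freq >= min_freq]


# Invented sample text; sentence boundaries are ignored by this sketch.
sample = ("I don't think so. But I don't think I can. "
          "I don't think so, but I don't think I will.")
tokens = tokenize(sample)

for gram, freq in clusters(tokens, n=4):
    print(freq, gram)
```

Raising `min_freq` or changing `n` gives shorter or longer lists of recurrent sequences.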
Keyword: a word whose frequency in a text is outstanding with reference to its frequency in another, generally larger, text or corpus of texts. Key words are lexemes which have become cognitively salient through their repetitive, unusually frequent use. They characterise a given text in that they “are used over and over in the text and are crucial to the theme or topic under discussion. (...) Key words are most often words which represent an essential or basic concept of the text” (Larson 1984: 177).
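One way to make “outstanding in its frequency” concrete is a keyness score. The slides do not name a statistic, so the log-likelihood measure used below is an assumption, chosen because it is a common option in corpus tools. The function names and the two toy texts are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score.

    a: frequency of the word in the study text
    b: frequency of the word in the reference corpus
    c: total number of tokens in the study text
    d: total number of tokens in the reference corpus
    """
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_study)
    if b > 0:
        score += b * math.log(b / expected_ref)
    return 2 * score


def keywords(study_tokens, reference_tokens, top=10):
    """Rank words that are relatively more frequent in the study text than in the reference."""
    study = Counter(study_tokens)
    reference = Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = reference.get(word, 0)
        if a / c > b / d:  # keep only over-represented (positive) keywords
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


# Toy data, invented for illustration; both lists must be non-empty.
study = "the tenant shall pay the rent and the landlord shall repair the premises".split()
reference = "the cat sat on the mat and the dog ate the food while the sun shone".split()

for word, score in keywords(study, reference, top=5):
    print(f"{word:10s} {score:.2f}")
```

With real data the reference corpus would be much larger than the study text, and a significance threshold would normally be applied to the scores.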
Concordance: a list of all the occurrences of a specified word or expression in a corpus, each set in the middle of one line of context (KWIC, key word in context). KWIC concordances help identify collocates.
Collocates: words which occur in the neighbourhood (co-text) of the search word.
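The two notions can be illustrated with a minimal sketch, assuming a pre-tokenised text and a fixed window of words on either side of the search word. The function names, the window sizes and the toy sentence are hypothetical.

```python
from collections import Counter


def kwic(tokens, node, width=4):
    """Return (left context, node, right context) for every occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, node, span=4):
    """Count the words occurring within span tokens to the left or right of node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i])
            counts.update(tokens[i + 1:i + 1 + span])
    return counts


def print_kwic(lines):
    """Print concordance lines with the search word aligned in one column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    for left, node, right in lines:
        print(f"{left:>{pad}}  {node}  {right}")


# Toy sentence, invented for illustration.
tokens = ("smoking can cause cancer and delay can cause a crisis "
          "while stress may cause delay").split()

print_kwic(kwic(tokens, "cause", width=2))
print(collocates(tokens, "cause", span=2).most_common(3))
```

Sorting the concordance lines by the word immediately left or right of the search word makes recurrent collocates easier to spot by eye.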
Semantic prosody refers to the positive or negative connotative meaning which is transferred to the focus word by the semantic fields of its common collocates (Louw 1993). Stubbs (1995, 1996: 173–4) examines collocates of causal verbs and finds in his corpus that the vast majority of collocates of ‘cause’ are negative, e.g. accident, cancer, commotion, crisis and delay. On the other hand, the verb ‘provide’ has a positive semantic prosody, with collocates such as care, food, help, jobs, relief and support.