Exploring Variation in Lexis and Genre in the Sketch Engine
Adam Kilgarriff
Lexical Computing Ltd., UK
Supported by EU Project PRESEMT


How to compare language varieties
- Qualitative
- Quantitative
- Quantitative means corpus
- Corpus represents variety
- Compare corpora
LSB, Brussels, Jan 2013. Kilgarriff: Variation

My big question
- How to compare corpora
- How else can corpus methods/corpus linguistics be scientific
- Roles
  - How do varieties contrast
  - How do corpora contrast
  - When we don’t know if they are different
  - Find bugs in corpus construction

Corpus comparison
- Qualitative
- Quantitative

Qualitative
- Take keyword lists
  - [a-z]{3,}
  - Lemma if lemmatisation identical, else word
- C1 vs C2, top 100/200
- C2 vs C1, top 100/200
- Study
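The keyword-list procedure on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: tokens are already lower-cased words or lemmas, frequencies are normalised over the tokens that pass the `[a-z]{3,}` filter, and the score is the (nf1+k)/(nf2+k) ratio described later in the talk. The function names are hypothetical and are not part of the Sketch Engine.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z]{3,}")  # filter pattern taken from the slide


def per_million(counts):
    """Convert raw counts to frequencies per million tokens."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: c * 1_000_000 / total for w, c in counts.items()}


def top_keywords(focus_tokens, ref_tokens, top=100, k=100.0):
    """Return the `top` items most typical of the focus corpus.

    Assumption: normalisation is done over the filtered tokens only.
    """
    focus = Counter(t for t in focus_tokens if WORD_RE.fullmatch(t))
    ref = Counter(t for t in ref_tokens if WORD_RE.fullmatch(t))
    nf_focus, nf_ref = per_million(focus), per_million(ref)
    scores = {w: (nf + k) / (nf_ref.get(w, 0.0) + k)
              for w, nf in nf_focus.items()}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]
```

Calling `top_keywords(c1, c2)` and then `top_keywords(c2, c1)` gives the two lists (C1 vs C2, C2 vs C1) that are then studied by hand.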

Qualitative: example, OCC and OEC
- OEC: general reference corpus
- OCC: writing for children
Look at fiction only
Top 200 keywords (each way)
What are they?

- Paper
  - Getting to know your corpus
- Technology
  - The Sketch Engine

The Sketch Engine
- Not 25, but 10’s not bad
- Word sketches
  - One-page, corpus-driven summaries of a word’s grammatical and collocational behaviour
  - First used: Macmillan English Dictionary
- Dictionary-making
  - UK: almost a monopoly
  - OUP, CUP, Macmillan, Collins
  - National/supranational institutes including INL
- Linguistic research
- Teaching: languages, translation

Do it
- Sketch Engine does the grunt work
- It’s ever so interesting

Quantitative
- Methods, evaluation
  - Kilgarriff 2001, Comparing Corpora, Int J Corp Ling
- Then: not many corpora to compare
- Now: many
  - Ad hoc, from web
- First question: is it any good, how does it compare
- Let’s make it easy: offer it in Sketch Engine

Original method
- C1 and C2: same size, by design
- Put together, find 500 highest freq words
- For each of these words
  - Freqs: f1 in C1, f2 in C2, mean=(f1+f2)/2
  - (f1-f2)^2/mean (chi-square statistic)
- Sum
- Divide by 500: CBDF
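The steps on this slide translate fairly directly into code. The sketch below is a minimal illustration, assuming each corpus is given as a list of tokens and that both corpora are the same size, as the slide requires. The function name `cbdf` is a label chosen here. When the joint vocabulary has fewer than 500 words, the sketch divides by the number of words actually used rather than by 500.

```python
from collections import Counter


def cbdf(tokens1, tokens2, top_n=500):
    """Chi-square-based distance between two same-size corpora."""
    c1, c2 = Counter(tokens1), Counter(tokens2)
    joint = c1 + c2  # "put together"
    words = [w for w, _ in joint.most_common(top_n)]
    if not words:
        return 0.0
    total = 0.0
    for w in words:
        f1, f2 = c1[w], c2[w]
        mean = (f1 + f2) / 2  # always > 0: w occurs in the joint corpus
        total += (f1 - f2) ** 2 / mean
    return total / len(words)
```

Identical corpora score 0; the score grows as the frequencies of the commonest words diverge.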

Evaluated
- Known-similarity corpora
- Shows it worked
- Used to set parameter (500)
- CBDF better than alternative measures tested

Adjustments for SkE
- Problem: non-identical tokenisation
  - Some awkward words: can’t
  - undermine stats as one corpus has zero
- Solution
  - commonest 5000 words in each corpus
  - intersection only
  - commonest 500 in intersection
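A small sketch of the word-selection step may help. The slide does not say how "commonest in the intersection" is ranked across two corpora of different sizes; the code below assumes ranking by the sum of relative frequencies, which is a choice made here for illustration. `comparison_words` is a hypothetical name.

```python
from collections import Counter


def relative(counts):
    """Relative frequency of each word within one corpus."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def comparison_words(counts1, counts2, per_corpus=5000, shared=500):
    """Pick the words on which two corpora are compared."""
    top1 = {w for w, _ in Counter(counts1).most_common(per_corpus)}
    top2 = {w for w, _ in Counter(counts2).most_common(per_corpus)}
    both = top1 & top2  # intersection only
    r1, r2 = relative(counts1), relative(counts2)
    # Assumption: "commonest" = highest summed relative frequency.
    ranked = sorted(both, key=lambda w: (-(r1[w] + r2[w]), w))
    return ranked[:shared]
```

Words that one tokeniser splits and the other keeps whole (such as can’t) never reach the intersection, so they cannot contribute a zero count to the statistic.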

Adjustments for SkE
- Corpus size highly variable
  - Chi-square not so dependable
  - Also not consistent with our keyword lists
- Link to keyword lists: link quant to qual
- Keyword lists
  - nf = normalised (per million) frequencies
  - Keyword lists: (nf1+k)/(nf2+k)
  - Default value for k=100
- We use: if nf1>nf2, (nf1+k)/(nf2+k), else (nf2+k)/(nf1+k)
- Evaluated on Known-Sim Corpora
  - as good as/better than chi-square
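The symmetric ratio can be written as a short function. How the per-word ratios are combined into a single corpus-distance score is not stated on the slide; the sketch below assumes a plain average over the comparison words. Both function names are hypothetical.

```python
def symmetric_ratio(nf1, nf2, k=100.0):
    """(larger nf + k) / (smaller nf + k); always >= 1."""
    hi, lo = max(nf1, nf2), min(nf1, nf2)
    return (hi + k) / (lo + k)


def keyword_distance(nf_c1, nf_c2, words, k=100.0):
    """Average symmetric ratio over `words` (averaging is an assumption).

    nf_c1 and nf_c2 map words to per-million frequencies.
    """
    if not words:
        raise ValueError("need at least one comparison word")
    ratios = [symmetric_ratio(nf_c1.get(w, 0.0), nf_c2.get(w, 0.0), k)
              for w in words]
    return sum(ratios) / len(ratios)
```

Because the larger frequency always goes on top, the result does not depend on which corpus is called C1, and two corpora with identical frequencies score exactly 1.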


What’s missing
- Heterogeneity
  - “how similar is BNC to WSJ”?
  - We need to know heterogeneity before we can interpret
  - The leading diagonal
- 2001 paper: randomising halves
  - Inelegant and inefficient
  - Depended on standard size of document

New definition, method
- Heterogeneity (def)
  - Distance between most different partitions
- Cluster to find ‘most different partitions’
  - Bottom-up clustering
  - until largest cluster has over one third of data
  - Rest: the other partition
- Problem
  - nxn distance matrix where n > 1 million
- Solution: do it in steps
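As a rough illustration of the definition, here is a naive bottom-up clustering over a handful of documents. It is a sketch under several assumptions that the slide does not settle: clusters are merged by pooling their word counts, the closest pair is merged at each step, and the distance is an average symmetric (nf+k) ratio. It recomputes all pairwise distances on every step, so it is nowhere near the stepwise method needed for a million documents.

```python
from collections import Counter
from itertools import combinations


def distance(c1, c2, k=100.0):
    """Average symmetric (nf+k) ratio over the union vocabulary."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    words = set(c1) | set(c2)
    total = 0.0
    for w in words:
        nf1 = c1[w] * 1_000_000 / n1
        nf2 = c2[w] * 1_000_000 / n2
        total += (max(nf1, nf2) + k) / (min(nf1, nf2) + k)
    return total / len(words)


def size(cluster):
    return sum(cluster.values())


def heterogeneity(docs, dist=distance):
    """Distance between the largest cluster and the rest of the corpus."""
    clusters = [Counter(d) for d in docs]
    if len(clusters) < 2 or any(size(c) == 0 for c in clusters):
        raise ValueError("need at least two non-empty documents")
    total = sum(size(c) for c in clusters)
    # Merge the closest pair until one cluster holds over a third of the data.
    while len(clusters) > 2 and max(map(size, clusters)) * 3 <= total:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    largest = max(clusters, key=size)
    rest = Counter()
    for c in clusters:
        if c is not largest:
            rest.update(c)  # "Rest: the other partition"
    return dist(largest, rest)
```

With this distance, a corpus of identical documents scores 1, and a corpus made of two unrelated kinds of document scores far higher.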

Summary
- Corpus comparison
  - Qualitative: use keywords
  - Quantitative
    - On beta
    - Heterogeneity (to complete the task) to follow (soon)

Going multilingual
- Bilingual word sketches
- How do we connect between languages?
- Three flavours
  - Parallel corpus
  - Dictionary
    - With comparable corpus
  - Manual

In Sketch Engine
- Parallel
  - language pairs
  - Credit: Joerg Tiedemann, OPUS project


Going live
- For parallel corpora
- For manual clustering

Simple maths for keywords
This word is twice as common in this text type as that

              N      freq   freq per m
Focus corp    2m     80     40
Ref corp      15m    300    20
ratio                       2

- Intuitive
- Nearly right but:
  - How well matched are corpora
    - Not here
  - Burstiness
    - Not here
  - Can’t divide by zero
  - Commoner vs. rarer words

You can’t divide by zero
- Standard solution: add one
- Problem solved

word        fc     rc   ratio
buggle      10     0    ?
stort       100    0    ?
nammikin    1000   0    ?

word        fc     rc   ratio
buggle      11     1    11
stort       101    1    101
nammikin    1001   1    1001
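The arithmetic from the "Simple maths for keywords" slide and the add-one fix can be checked with a few lines of Python. The function names are hypothetical; the figures in the usage note come from the slides.

```python
def per_million(freq, corpus_size):
    """Normalise a raw frequency to occurrences per million tokens."""
    return freq * 1_000_000 / corpus_size


def ratio(focus, ref, add=0.0):
    """Focus/reference ratio, with an optional constant added to both."""
    return (focus + add) / (ref + add)
```

For the focus corpus, `per_million(80, 2_000_000)` gives 40, and for the reference corpus `per_million(300, 15_000_000)` gives 20, so `ratio(40, 20)` is 2. For buggle, `ratio(10, 0)` raises `ZeroDivisionError`, while `ratio(10, 0, add=1)` gives 11.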

High ratios more common for rarer words

word    fc    rc    ratio   interesting?
spug    10    1     10      no
grod                        yes

- some researchers: grammar, grammar words
- some researchers: lexis, content words
- No right answer
- Slider?

Solution: don’t just add 1, add n
- n=1

word        fc   rc   fc+n   rc+n   ratio   rank
obscurish
middling
common

- n=100

word        fc   rc   fc+n   rc+n   ratio   rank
obscurish
middling
common

Solution
- n=1000

word        fc   rc   fc+n   rc+n   ratio   rank
obscurish
middling
common

- Summary

word        fc   rc   n=1   n=100   n=1000
obscurish   10   0    1st   2nd     3rd
middling              2nd   1st     2nd
common                3rd   3rd     1st
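The effect of n on the ranking can be seen in a short sketch. The three words and their frequencies below are invented for illustration and are not the figures from the slides; `rank_keywords` is a hypothetical name.

```python
def rank_keywords(freqs, n):
    """Rank words by (fc + n) / (rc + n), highest ratio first.

    freqs maps word -> (fc, rc), i.e. per-million frequencies in the
    focus and reference corpora.
    """
    scores = {w: (fc + n) / (rc + n) for w, (fc, rc) in freqs.items()}
    return sorted(scores, key=lambda w: (-scores[w], w))


# Hypothetical frequencies, chosen only to show the ranking shift.
example = {
    "rare_word": (30, 0),
    "mid_word": (200, 100),
    "common_word": (12000, 10000),
}

for n in (1, 100, 1000):
    print(n, rank_keywords(example, n))
```

With these made-up numbers, n=1 puts the rare word first, n=100 puts the mid-frequency word first, and n=1000 puts the common word first, which is the kind of control a slider over n would give.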