Measuring Monolinguality Chris Biemann NLP Department, University of Leipzig LREC-06 Workshop on Quality Assurance and Quality Measurement for Language and Speech Resources, Genova 27 May 2006
2 Why Monolinguality? Alien-language noise disturbs the statistics used by corpus-based methods: language models (e.g. n-gram), lexical acquisition, semantic indexing, co-occurrence statistics.
3 What is Monolinguality? Foreign-language sentences should be removed. Sentences containing only a few foreign-language words or phrases, such as movie titles or terminology, should remain.
4 Korean Example A: Yes. The traffic cop said I had one too many and made me take the sobriety test, but I passed it. B: Lucky you! 무인도 표류 소년 25 명 통해 인간의 야만 성 그려 영국 소설가 윌리엄 골딩의 83 년 노벨문학상 수상작 을 영화화한 ` 파리대 왕 '(Lord of the flies) 은 결코 편안하게 감 상할 수 있는 영화는 아니다.
5 Recall Zipf's Law It also holds for random samples of words. Top-frequency words
6 Measuring Monolinguality Given a corpus of language A with x% noise of language B, the amount of noise is measured as follows: for each top-frequency word of B, divide its relative frequency in the corpus by its relative frequency in a clean B corpus. The amount of noise is the predominant ratio: many ratios will be close to x%.
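The ratio computation described on this slide can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names `relative_frequencies` and `frequency_ratios`, whitespace-tokenized input lists, and the handling of words missing from the noisy corpus (they are simply skipped) are all assumptions.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def frequency_ratios(noisy_tokens, clean_b_tokens, top_n=1000):
    """For the top_n words of the clean B corpus, return
    rel_freq(noisy corpus) / rel_freq(clean B corpus).

    Assumption: words of B that never occur in the noisy corpus are skipped.
    """
    rel_noisy = relative_frequencies(noisy_tokens)
    counts_b = Counter(clean_b_tokens)
    total_b = sum(counts_b.values())
    ratios = {}
    for word, count in counts_b.most_common(top_n):
        if word in rel_noisy:
            ratios[word] = rel_noisy[word] / (count / total_b)
    return ratios


# Toy example: 900 tokens of language A plus 100 tokens of language B,
# so every B word should come out with a ratio near 0.1 (10% noise).
clean_b = ["der"] * 50 + ["die"] * 30 + ["und"] * 20
noisy = ["the"] * 900 + clean_b
ratios = frequency_ratios(noisy, clean_b, top_n=3)
```

In the toy example all three ratios land near 0.1 because none of the B words occurs in language A; on real corpora the ratios scatter, which is why the slide speaks of the predominant ratio.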
7 The top-frequency words of B w.r.t. A (1) Words that do not occur in language A. Their frequency ratio will be around x%. (2) Words that are also amongst the highest-frequency words of language A and moreover have the same function. Their frequency ratio will be around 1. (3) Words that occur in language A, but at different frequency bands. They are a random sample of words of L and follow a Zipf distribution. (4) Words of B that are often used in named entities and titles (such as capitalized stop words). They appear in the corpus of language A more frequently than the expected x% of noise. The second group of words is only present in languages that are very similar to each other.
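One way to read off the predominant ratio from such a mixture of word groups is to bin the ratios on a logarithmic scale and take the fullest bin. The slides do not say how the predominant ratio is determined, so the sketch below is an assumption throughout: the function name `predominant_ratio`, the bin width `bins_per_decade`, and the use of the median inside the fullest bin are illustrative choices.

```python
import math
from collections import defaultdict
from statistics import median


def predominant_ratio(ratios, bins_per_decade=4):
    """Return the median of the most populated log10-scale bin of the ratios.

    Assumption: 'predominant' is taken to mean the densest region of the
    ratio distribution on a logarithmic axis.
    """
    bins = defaultdict(list)
    for ratio in ratios:
        if ratio > 0:
            bins[math.floor(math.log10(ratio) * bins_per_decade)].append(ratio)
    if not bins:
        raise ValueError("no positive ratios given")
    fullest = max(bins.values(), key=len)
    return median(fullest)


# Toy ratios: four words near 1% (the noise level), plus a few outliers
# standing in for shared function words and named-entity words.
toy_ratios = [0.012, 0.011, 0.013, 0.0115, 1.0, 0.95, 0.3, 0.05]
estimate = predominant_ratio(toy_ratios)
```

Words near ratio 1 or well above the noise level fall into other bins and do not shift the estimate as long as the words absent from language A form the largest group.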
8 Lexical overlap in top 1000 words
9 Experiment 1 Artificial noise mixtures: injecting alien-language material into monolingual corpora. Experiment 1a: injecting different amounts of German noise into a chunk of the British National Corpus (~20 million words). Experiment 1b: injecting 1% noise of Norwegian, Swedish and Dutch into a Danish corpus (~17 million words). For measuring, we used the top 1000 words.
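An artificial noise mixture of this kind could be built along the following lines. This is a sketch under stated assumptions: the slides do not say whether noise is counted in words or sentences, so the hypothetical helper `inject_noise` counts whitespace-separated tokens, adds whole sentences, and may therefore slightly overshoot the target fraction.

```python
import random


def inject_noise(base_sentences, alien_sentences, noise_fraction, seed=0):
    """Mix alien sentences into a base corpus so that roughly noise_fraction
    of all tokens in the result are alien.

    Returns the shuffled mixed sentence list and the number of alien tokens
    that were added. Assumption: tokens are whitespace-separated.
    """
    if not 0 <= noise_fraction < 1:
        raise ValueError("noise_fraction must be in [0, 1)")
    rng = random.Random(seed)
    base_tokens = sum(len(s.split()) for s in base_sentences)
    # alien / (base + alien) = f  =>  alien = base * f / (1 - f)
    target = base_tokens * noise_fraction / (1 - noise_fraction)
    pool = list(alien_sentences)
    rng.shuffle(pool)
    chosen, alien_tokens = [], 0
    for sentence in pool:
        if alien_tokens >= target:
            break
        chosen.append(sentence)
        alien_tokens += len(sentence.split())
    mixed = list(base_sentences) + chosen
    rng.shuffle(mixed)
    return mixed, alien_tokens


# Toy example: 300 base tokens with a 10% noise target.
base = ["the cat sat"] * 100
alien = ["der hund"] * 50
mixed, added = inject_noise(base, alien, noise_fraction=0.1)
```

Fixing the seed keeps the mixture reproducible, which matters when the same corpus is measured at several noise levels.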
10 German in BNC
11 Invading Denmark
12 Experiment 2 For a collection of web documents (~700 million words from .de domains), we measure the effect of a corpus cleaning method that strips alien-language material. Table: for German, English, French, Dutch and Turkish, the number of top words found and the (approx.) frequency ratio, before cleaning and after cleaning.
13 Cleaning .de web
14 Conclusion Measure captures well the amount of noise Noise measured down to a ratio of Effective: involves 1000 frequency counts per language
15 Application: Monolingual Corpora Screenshot corpora
16 Workflow Diagram labels: Texts (Web / Newspapers); URLs; Crawling; Text; Language detection, Cleaning; lang. 1, lang. 2, ..., lang. n; Standard Size Corpora; POS Tagging; Language Statistics; Web Statistics; Small Worlds; Co-occurrences etc.; Clustering; Classification; Classified Objects; Neologisms; Trend Mining; Topic Tracking; Language + Time; Tools; Words; Dictionaries (Dornseiff, WordNets, Wikipedia, ...); Resources; Techniques; Results; Similar objects (words, sentences, documents, URLs); Classification (semantic properties, subject areas, ...); Combined objects (NE recognition, terminology, ...): determine patterns, extract multi-words; Decomposition; Morphology; Inflection; Translation pairs
17 Corpus Browser Per word: Frequency Example sentences Co-occurrences: left and right neighbours, sentence-based Co-occurrence graph
18 Only a few copies left! DVD: 15 languages Corpus Browser Corpora in plain text and database format
19 Questions?? THANK YOU!