1
Measuring Monolinguality
Chris Biemann
NLP Department, University of Leipzig
LREC-06 Workshop on Quality Assurance and Quality Measurement for Language and Speech Resources, Genova, 27 May 2006
2
Why Monolinguality?
Alien language noise disturbs statistics for corpus-based methods:
  Language models, e.g. n-gram
  Lexical acquisition
  Semantic indexing
  Co-occurrence statistics
3
What is Monolinguality?
Foreign-language sentences should be removed.
Sentences containing only a few foreign-language words or phrases, such as movie titles or terminology, should remain.
4
Korean Example
A: Yes. The traffic cop said I had one too many and made me take the sobriety test, but I passed it. B: Lucky you!
무인도 표류 소년 25명 통해 인간의 야만성 그려 영국 소설가 윌리엄 골딩의 83년 노벨문학상 수상작을 영화화한 `파리대왕'(Lord of the flies)은 결코 편안하게 감상할 수 있는 영화는 아니다.
5
Recall Zipf's Law
It also holds for random samples of words.
Top frequent words
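For reference, Zipf's law in its usual textbook form relates a word's frequency to its frequency rank. The symbols f, r, C and α are notation assumed here, not taken from the slides:

```latex
f(r) \approx \frac{C}{r^{\alpha}}, \qquad \alpha \approx 1
```

Here f(r) is the frequency of the word at rank r, and C is a constant that depends on the corpus.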
6
Measuring Monolinguality
Given a corpus of language A with x% noise of language B, the amount of noise is measured as follows:
  For the top-frequency words of B, divide each word's relative frequency in the corpus by its relative frequency in a clean B corpus.
  The amount of noise is the predominant ratio: many ratios will be close to x%.
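The measure can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the slides: the function names, whitespace-level tokens, and in particular the way the "predominant" ratio is picked (the median of the fullest logarithmic histogram bin) are assumptions made for this example.

```python
from collections import Counter
import math


def noise_ratios(corpus_tokens, clean_b_tokens, top_n=1000):
    """Ratio of relative frequencies for the top_n words of language B."""
    corpus_counts = Counter(corpus_tokens)
    clean_counts = Counter(clean_b_tokens)
    corpus_total = sum(corpus_counts.values())
    clean_total = sum(clean_counts.values())
    ratios = {}
    for word, clean_count in clean_counts.most_common(top_n):
        corpus_count = corpus_counts.get(word, 0)
        if corpus_count == 0:
            continue  # word not found in the corpus: no ratio for it
        ratios[word] = (corpus_count / corpus_total) / (clean_count / clean_total)
    return ratios


def predominant_ratio(ratios, bins_per_decade=4):
    """Assumed estimator: median of the fullest log10 histogram bin."""
    if not ratios:
        return None

    def bin_of(ratio):
        return math.floor(math.log10(ratio) * bins_per_decade)

    bin_sizes = Counter(bin_of(r) for r in ratios.values())
    fullest_bin = max(bin_sizes, key=bin_sizes.get)
    members = sorted(r for r in ratios.values() if bin_of(r) == fullest_bin)
    return members[len(members) // 2]


# Toy data: 1900 English tokens plus 100 German tokens, i.e. 5% noise.
clean_german = ["der"] * 50 + ["die"] * 30 + ["und"] * 20
mixed_corpus = ["the"] * 1900 + clean_german
estimate = predominant_ratio(noise_ratios(mixed_corpus, clean_german))
print(f"estimated noise: {estimate:.3f}")
```

In the toy data every German word has a ratio of about 0.05, so the estimate matches the 5% noise built into the mixture. On real corpora the ratios spread out, which is why some notion of a predominant value is needed.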
7
The top frequency words of B w.r.t. A
1. Words that do not occur in language A. Their frequency ratio will be around x%.
2. Words that are also among the highest-frequency words of language A and, moreover, have the same function. Their frequency ratio will be around 1.
3. Words that occur in language A, but in different frequency bands. They are a random sample of words of L and follow a Zipfian distribution.
4. Words of B that are often used in named entities and titles (such as capitalized stop words). They appear in the corpus of language A more frequently than the expected x% of noise.
The second group of words is present only for languages that are very similar to each other.
8
Lexical overlap in top 1000 words
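One simple way to compute such an overlap is to intersect the top-N word lists of two corpora. The sketch below assumes pre-tokenized input and uses hypothetical helper names. It keeps case as-is, since capitalized stop words are treated as a separate phenomenon above.

```python
from collections import Counter


def top_words(tokens, n=1000):
    """The n most frequent word forms, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def lexical_overlap(tokens_a, tokens_b, n=1000):
    """Word forms that are in the top n of both corpora."""
    return set(top_words(tokens_a, n)) & set(top_words(tokens_b, n))


english = "in the house in the garden in".split()
german = "in dem haus in dem garten in".split()
shared = lexical_overlap(english, german, n=2)
print(sorted(shared))
```

The size of the returned set, divided by n, gives an overlap rate that can be compared across language pairs.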
9
Experiment 1
Artificial noise mixtures: injecting alien-language material into monolingual corpora.
Experiment 1a: injecting different amounts of German noise into a chunk of the British National Corpus (~20 million words).
Experiment 1b: injecting 1% noise of Norwegian, Swedish and Dutch into a Danish corpus (~17 million words).
For measuring, we used the top 1000 words.
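An artificial mixture of this kind could be built as follows. The slides do not say at which granularity the noise was injected, so this sketch assumes whole sentences are added and that the noise level is counted in whitespace-separated tokens. `inject_noise` and its parameters are hypothetical names.

```python
import random


def inject_noise(base_sentences, noise_sentences, noise_fraction, seed=0):
    """Mix noise sentences into base_sentences.

    Adds shuffled noise sentences until noise tokens make up roughly
    noise_fraction of all tokens. Returns (mixed sentences, actual fraction).
    """
    if not 0 <= noise_fraction < 1:
        raise ValueError("noise_fraction must be in [0, 1)")
    base_tokens = sum(len(s.split()) for s in base_sentences)
    # Solve noise / (base + noise) = fraction for the noise token count.
    target_tokens = noise_fraction / (1 - noise_fraction) * base_tokens

    rng = random.Random(seed)
    pool = list(noise_sentences)
    rng.shuffle(pool)

    mixed = list(base_sentences)
    added_tokens = 0
    for sentence in pool:
        if added_tokens >= target_tokens:
            break
        mixed.append(sentence)
        added_tokens += len(sentence.split())

    rng.shuffle(mixed)
    total = base_tokens + added_tokens
    return mixed, (added_tokens / total if total else 0.0)


danish = ["det er en god dag i dag"] * 99
swedish = ["det var en gång"] * 50
mixed, actual = inject_noise(danish, swedish, noise_fraction=0.01)
print(f"{len(mixed)} sentences, actual noise fraction {actual:.4f}")
```

Because whole sentences are added, the actual fraction only approximates the requested one, and it falls short if the noise pool is too small. The function therefore returns the fraction it achieved so that it can be compared with the measured value.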
10
German in BNC
11
Invading Denmark
12
Experiment 2
For a collection of web documents (~700 million words from .de domains), we measure the effect of a corpus cleaning method that strips alien-language material.

           Before cleaning                            After cleaning
Language   Top-1000 words found  Approx. freq. ratio  Top-1000 words found  Freq. ratio
German     1000                  0.708                1000                  0.946
English    995                   0.126                987                   0.0010
French     924                   0.0398               906                   0.00002
Dutch      995                   0.000891             775                   0.000006
Turkish    642                   0.0000631            562                   0.000006
13
Cleaning .de web
14
Conclusion
The measure captures the amount of noise well.
Noise is measured down to a ratio of 10^-5.
Effective: involves 1000 frequency counts per language.
15
Application: Monolingual Corpora
[Screenshot: corpora]
http://corpora.uni-leipzig.de
16
Workflow
Diagram labels, in extraction order:
Text; Language detection, Cleaning; lang. 1, lang. 2, ..., lang. n; POS Tagging; Classified Objects
Texts: Web / Newspapers; Crawling; Standard Size Corpora; URLs
Language Statistics; Small Worlds; Co-occurrences etc.; Clustering; Classification; Neologisms; Trend Mining; Topic Tracking; Language + Time
Tools; Dictionaries (Dornseiff, WordNets, Wikipedia, ...); Web Statistics; Small Worlds; Small Worlds; Words; Dictionaries
Resources; Techniques; Results
Similar objects (words, sentences, documents, URLs)
Classification (semantic properties, subject areas, ...)
Combined objects (NE-Recognition, terminology, ...): determine patterns, extract multi-words
Decomposition; Morphology; Inflection; Translation pairs
17
Corpus Browser
Per word:
  Frequency
  Example sentences
  Co-occurrences: left and right neighbours, sentence-based
  Co-occurrence graph
18
Only a few copies left!
DVD: 15 languages
  Corpus Browser
  Corpora in plain text and database format
19
Questions?
Thank you!