Measuring Monolinguality Chris Biemann NLP Department, University of Leipzig LREC-06 Workshop on Quality Assurance and Quality Measurement for Language.


1 Measuring Monolinguality Chris Biemann NLP Department, University of Leipzig LREC-06 Workshop on Quality Assurance and Quality Measurement for Language and Speech Resources, Genova 27 May 2006

2 Why Monolinguality? Alien language noise disturbs statistics for corpus-based methods: language models (e.g. n-gram), lexical acquisition, semantic indexing, co-occurrence statistics.

3 What is Monolinguality? Foreign-language sentences should be removed. Sentences containing only a few foreign-language words or phrases, such as movie titles or terminology, should remain.
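The distinction on this slide can be illustrated with a small sentence filter. This is only a sketch under stated assumptions, not the cleaning method used in the talk: the function names, the whitespace tokenisation, the `known_words` set, and the `max_foreign_share` threshold of 0.5 are all hypothetical.

```python
def foreign_share(sentence, known_words):
    """Fraction of tokens in the sentence that are not in known_words."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    # Strip common punctuation before the lookup (a deliberately crude tokeniser).
    unknown = sum(1 for t in tokens if t.strip(".,!?:;'\"()") not in known_words)
    return unknown / len(tokens)


def keep_sentence(sentence, known_words, max_foreign_share=0.5):
    """Keep a sentence unless most of its tokens are foreign (assumed threshold)."""
    return foreign_share(sentence, known_words) <= max_foreign_share
```

With a German word list, a German sentence that quotes an English film title stays below the threshold and is kept, while a fully English sentence is dropped.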

4 Korean Example A: Yes. The traffic cop said I had one too many and made me take the sobriety test, but I passed it. B: Lucky you! 무인도 표류 소년 25 명 통해 인간의 야만 성 그려 영국 소설가 윌리엄 골딩의 83 년 노벨문학상 수상작 을 영화화한 ` 파리대 왕 '(Lord of the flies) 은 결코 편안하게 감 상할 수 있는 영화는 아니다.

5 Recall Zipf's Law. It also holds for random samples of words. [Figure: top frequent words]

6 Measuring Monolinguality Given a corpus of language A with x% noise of language B, the amount of noise is measured as follows: for the top-frequency words of B, divide the relative frequency in the corpus by the relative frequency in a clean B corpus. The amount of noise is the predominant ratio: many ratios will be close to x%.
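The measure described on this slide can be sketched in Python. This is a minimal illustration, not the author's implementation: the function names and the pre-tokenised, non-empty input lists are assumptions. The slide does not say how the "predominant ratio" is picked, so the sketch assumes the fullest bin of a log-scale histogram, summarised by a geometric mean.

```python
import math
from collections import Counter


def frequency_ratios(corpus_tokens, clean_b_tokens, top_n=1000):
    """Ratio of relative frequencies for the top_n most frequent words of B."""
    corpus_counts = Counter(corpus_tokens)
    clean_counts = Counter(clean_b_tokens)
    corpus_total = sum(corpus_counts.values())
    clean_total = sum(clean_counts.values())
    ratios = {}
    for word, clean_count in clean_counts.most_common(top_n):
        rel_in_corpus = corpus_counts[word] / corpus_total
        rel_in_clean = clean_count / clean_total
        ratios[word] = rel_in_corpus / rel_in_clean
    return ratios


def predominant_ratio(ratios, bins_per_decade=4):
    """Assumed estimator: geometric mean of the fullest log10 histogram bin."""
    logs = [math.log10(r) for r in ratios.values() if r > 0]
    if not logs:
        return 0.0

    def bin_of(x):
        return math.floor(x * bins_per_decade)

    fullest_bin, _ = Counter(bin_of(x) for x in logs).most_common(1)[0]
    members = [x for x in logs if bin_of(x) == fullest_bin]
    return 10 ** (sum(members) / len(members))
```

Binning on a log scale is a design choice made here because the ratios reported later in the talk span several orders of magnitude. Words of B that never occur in the corpus get a ratio of 0 and are ignored by the estimator.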

7 The top-frequency words of B w.r.t. A fall into four groups: (1) Words that do not occur in language A. Their frequency ratio will be around x%. (2) Words that are also amongst the highest-frequency words of language A and, moreover, have the same function. Their frequency ratio will be around 1. (3) Words that occur in language A, but in different frequency bands. They are a random sample of words of L and are distributed in a Zipf way. (4) Words of B that are often used in named entities and titles (such as capitalized stop words). They appear in the corpus of language A more frequently than the expected x% of noise. The second group of words is only present in languages that are very similar to each other.

8 Lexical overlap in top 1000 words

9 Experiment 1 Artificial noise mixtures: injecting alien language material into monolingual corpora. Experiment 1a: injecting different amounts of German noise into a chunk of the British National Corpus (~20 million words). Experiment 1b: injecting 1% noise of Norwegian, Swedish and Dutch into a Danish corpus (~17 million words). For measuring, we used the top 1000 words.
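An artificial noise mixture of this kind could be built roughly as follows. This is a sketch under assumptions, not the setup used in the experiments: the function name, the sentence-level granularity, whitespace word counting, and the fixed random seed are all hypothetical, and the inputs are assumed to be non-empty with `noise_fraction` below 1.

```python
import random


def inject_noise(base_sentences, noise_sentences, noise_fraction, seed=0):
    """Mix noise sentences into a corpus so noise is ~noise_fraction of all words."""
    rng = random.Random(seed)
    base_words = sum(len(s.split()) for s in base_sentences)
    # Noise words needed so that noise / (base + noise) == noise_fraction.
    target = round(noise_fraction * base_words / (1.0 - noise_fraction))
    candidates = list(noise_sentences)
    rng.shuffle(candidates)
    injected, noise_words = [], 0
    for sentence in candidates:
        if noise_words >= target:
            break
        injected.append(sentence)
        noise_words += len(sentence.split())
    mixed = list(base_sentences) + injected
    rng.shuffle(mixed)
    # Report the fraction actually reached; it can differ from the request
    # because whole sentences are added or the noise pool may run out.
    actual_fraction = noise_words / (base_words + noise_words)
    return mixed, actual_fraction
```

Returning the fraction actually reached makes it possible to compare the measured frequency ratio against the true amount of injected noise rather than the requested one.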

10 German in BNC

11 Invading Denmark

12 Experiment 2 For a collection of web documents (~700 million words from .de domains), we measure the effect of a corpus cleaning method that strips alien language material.

Language   Before cleaning                       After cleaning
           Top-1000 words   Approx. frequency    Top-1000 words   Frequency
           found            ratio                found            ratio
German     1000             0.708                1000             0.946
English    995              0.126                987              0.0010
French     924              0.0398               906              0.00002
Dutch      995              0.000891             775              0.000006
Turkish    642              0.0000631            562              0.000006

13 Cleaning .de web

14 Conclusion The measure captures the amount of noise well. Noise is measured down to a ratio of 10^-5. Effective: involves 1000 frequency counts per language.

15 Application: Monolingual Corpora [Screenshot: corpora, http://corpora.uni-leipzig.de]

16 Workflow [Diagram; recoverable labels follow.] Texts: Web / Newspapers; Crawling; URLs; Text; Language detection, Cleaning; lang. 1 ... lang. n; Standard Size Corpora; POS Tagging; Language Statistics; Web Statistics; Small Worlds; Co-occurrences etc.; Clustering; Classification; Classified Objects; Neologisms; Trend Mining; Topic Tracking; Language + Time; Tools; Dictionaries (Dornseiff, WordNets, Wikipedia, ...); Words; Resources; Techniques; Results. Results listed: Similar objects (words, sentences, documents, URLs); Classification (semantic properties, subject areas, ...); Combined objects (NE-Recognition, terminology, ...): determine patterns, extract multi-words; Decomposition; Morphology; Inflection; Translation pairs.

17 Corpus Browser Per word: frequency; example sentences; co-occurrences (left and right neighbours, sentence-based); co-occurrence graph.

18 Only a few copies left! DVD: 15 languages; Corpus Browser; corpora in plain text and database format.

19 Questions? THANK YOU!
