Measuring Monolinguality


Measuring Monolinguality. Chris Biemann, NLP Department, University of Leipzig. LREC-06 Workshop on Quality Assurance and Quality Measurement for Language and Speech Resources, Genova, 27 May 2006.

Why Monolinguality? Alien-language noise disturbs the statistics of corpus-based methods: language models (e.g. n-gram), lexical acquisition, semantic indexing, co-occurrence statistics.

What is Monolinguality? Foreign-language sentences should be removed. Sentences containing only a few foreign-language words or phrases, such as movie titles or terminology, should remain.

Korean Example. A: Yes. The traffic cop said I had one too many and made me take the sobriety test, but I passed it. B: Lucky you! 무인도 표류 소년 25명 통해 인간의 야만성 그려 영국 소설가 윌리엄 골딩의 83년 노벨문학상 수상작을 영화화한 `파리대왕'(Lord of the flies)은 결코 편안하게 감상할 수 있는 영화는 아니다.

Recall Zipf's Law. It also holds for random samples of words. (Figure: top-frequency words.)

Measuring Monolinguality. Given a corpus of language A with x% noise from language B, the amount of noise is measured as follows. For each of the top-frequency words of B, divide its relative frequency in the corpus by its relative frequency in a clean B corpus. The amount of noise is the predominant ratio: many ratios will be close to x%.
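
The measure above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the slides: the function names are hypothetical, tokenization is assumed to have been done already, and the slides do not say how the "predominant" ratio is picked. Here it is assumed to be the fullest bin of a histogram over log10 of the ratios.

```python
from collections import Counter
import math


def relative_frequencies(tokens):
    """Map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def frequency_ratios(corpus_tokens, clean_b_tokens, top_n=1000):
    """For the top_n words of the clean B corpus, divide the relative
    frequency in the measured corpus by the relative frequency in clean B."""
    corpus_rel = relative_frequencies(corpus_tokens)
    clean_counts = Counter(clean_b_tokens)
    total_b = sum(clean_counts.values())
    ratios = {}
    for word, count in clean_counts.most_common(top_n):
        ratios[word] = corpus_rel.get(word, 0.0) / (count / total_b)
    return ratios


def predominant_ratio(ratios, bins_per_decade=4):
    """Assumption: take the fullest log10 bin as 'predominant' and return
    the geometric mean of the ratios in that bin. Zero ratios are ignored."""
    positive = [r for r in ratios.values() if r > 0]
    if not positive:
        return 0.0
    bins = {}
    for r in positive:
        key = math.floor(math.log10(r) * bins_per_decade)
        bins.setdefault(key, []).append(r)
    fullest = max(bins.values(), key=len)
    return math.exp(sum(math.log(r) for r in fullest) / len(fullest))
```

With a toy corpus of 980 tokens of A and 20 tokens of B drawn in the same proportions as the clean B corpus, every ratio comes out near 0.02, so the estimated noise is about 2%.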

The top frequency words of B w.r.t. A:
1. Words that do not occur in language A. Their frequency ratio will be around x%.
2. Words that are also amongst the highest frequency words of language A and moreover have the same function. Their frequency ratio will be around 1.
3. Words that occur in language A, but at different frequency bands. They are a random sample of words of L and are distributed in a Zipf way.
4. Words of B that are often used in named entities and titles (such as capitalized stop words). They appear in the corpus of language A more frequently than the expected x% of noise.
The second group of words is only present in languages that are very similar to each other.
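
Why the first group lands near x and the second near 1 follows from a simple mixture view, sketched below. This is an illustrative assumption rather than something stated on the slides: the corpus is treated as a token-level mixture with a fraction x of B tokens, and p_a and p_b are a word's relative frequencies in clean A and clean B. The function name is hypothetical.

```python
def expected_ratio(x, p_a, p_b):
    """Expected frequency ratio of a B word in a corpus that is a mixture
    of (1 - x) language A and x language B.

    Relative frequency in the mixture: (1 - x) * p_a + x * p_b.
    Dividing by the clean-B frequency p_b gives x + (1 - x) * p_a / p_b.
    """
    if p_b <= 0:
        raise ValueError("p_b must be positive for a top-frequency word of B")
    return ((1 - x) * p_a + x * p_b) / p_b


# Group 1: the word does not occur in A, so the ratio is the noise level x.
group1 = expected_ratio(x=0.01, p_a=0.0, p_b=0.05)

# Group 2: same word, same function, same frequency in A; ratio is about 1.
group2 = expected_ratio(x=0.01, p_a=0.05, p_b=0.05)

# Group 3: the word occurs in A at a much lower frequency band,
# so the ratio sits somewhere above x, depending on p_a / p_b.
group3 = expected_ratio(x=0.01, p_a=0.0005, p_b=0.05)
```

Group 4 is not covered by this model: titles and named entities add extra occurrences of B words inside genuine A text, which pushes their ratios above x.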

Lexical overlap in top 1000 words

Experiment 1. Artificial noise mixtures: injecting alien-language material into monolingual corpora. Experiment 1a: injecting different amounts of German noise into a chunk of the British National Corpus (~20 million words). Experiment 1b: injecting 1% noise of Norwegian, Swedish and Dutch into a Danish corpus (~17 million words). For measuring, we used the top 1000 words.
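
An artificial mixture of this kind could be built roughly as follows. The slides do not describe the injection procedure, so this is only a sketch under stated assumptions: mixing is done at sentence level, the noise fraction is counted in sentences rather than words, and the function name and parameters are hypothetical.

```python
import random


def inject_noise(base_sentences, alien_sentences, noise_fraction, seed=0):
    """Return the base corpus plus enough randomly chosen alien sentences
    that they make up about noise_fraction of the mixed corpus."""
    if not 0 <= noise_fraction < 1:
        raise ValueError("noise_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_alien / (n_base + n_alien) = noise_fraction for n_alien.
    n_alien = round(noise_fraction * len(base_sentences) / (1 - noise_fraction))
    if n_alien > len(alien_sentences):
        raise ValueError("not enough alien sentences for this noise level")
    mixed = list(base_sentences) + rng.sample(alien_sentences, n_alien)
    rng.shuffle(mixed)
    return mixed
```

For example, 99 base sentences with a noise fraction of 0.01 yield a mixed corpus of 100 sentences, one of which is alien.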

German in BNC

Invading Denmark

Experiment 2. For a collection of web documents (~700 million words from .de domains), we measure the effect of a corpus cleaning method that strips alien-language material.

Language   Before cleaning                              After cleaning
           Top-1000 words found   Approx. freq. ratio   Top-1000 words found   Freq. ratio
German     1000                   0.708                 (missing)              0.946
English    995                    0.126                 987                    0.0010
French     924                    0.0398                906                    0.00002
Dutch      (missing)              0.000891              775                    0.000006
Turkish    642                    0.0000631             562                    (missing)

Cells marked (missing) are absent from the transcript.

Cleaning .de web

Conclusion. The measure captures the amount of noise well. Noise is measured down to a ratio of 10^-5. Effective: involves only 1000 frequency counts per language.

Application: Monolingual Corpora. (Screenshot of the corpora portal.) http://corpora.uni-leipzig.de

Workflow (diagram with the columns Resources, Techniques, Results). Labels recoverable from the diagram: texts (web, newspapers); dictionaries (Dornseiff, WordNets, Wikipedia, ...); URLs; crawling; language detection, cleaning; languages 1 to n; language + time; tools (decomposition, morphology, inflection, POS tagging, co-occurrences etc.); Small Worlds; clustering; classification (semantic properties, subject areas, ...); similar objects (words, sentences, documents, URLs); combined objects (NE recognition, terminology, ...): determine patterns, extract multi-words; translation pairs; neologisms; trend mining; topic tracking; standard-size corpora; web statistics; classified objects; language statistics.

Corpus Browser. Per word: frequency; example sentences; co-occurrences (left and right neighbours, sentence-based); co-occurrence graph.

Only a few copies left! DVD: 15 languages; Corpus Browser; corpora in plain text and database format.

Questions?? THANK YOU!