Measuring Monolinguality Chris Biemann NLP Department, University of Leipzig LREC-06 Workshop on Quality Assurance and Quality Measurement for Language.

Presentation transcript:

Measuring Monolinguality Chris Biemann NLP Department, University of Leipzig LREC-06 Workshop on Quality Assurance and Quality Measurement for Language and Speech Resources, Genova 27 May 2006

2 Why Monolinguality? Alien-language noise disturbs the statistics that corpus-based methods rely on: language models (e.g. n-gram models), lexical acquisition, semantic indexing, co-occurrence statistics.

3 What is Monolinguality? Foreign-language sentences should be removed. Sentences containing only a few foreign-language words or phrases, such as movie titles or terminology, should remain.

4 Korean Example
A: Yes. The traffic cop said I had one too many and made me take the sobriety test, but I passed it. B: Lucky you!
무인도 표류 소년 25명 통해 인간의 야만성 그려 영국 소설가 윌리엄 골딩의 83년 노벨문학상 수상작을 영화화한 `파리대왕'(Lord of the flies)은 결코 편안하게 감상할 수 있는 영화는 아니다.

5 Recall Zipf's Law. It also holds for random samples of words. (Figure: top frequent words)
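As a reminder, Zipf's law is commonly written as the relation below. The slide itself shows only a figure, so the symbols here are assumptions introduced for illustration: f(r) is the frequency of the word at frequency rank r, C is a corpus-dependent constant, and s is an exponent close to 1.

```latex
f(r) \approx \frac{C}{r^{s}}, \qquad s \approx 1
```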

6 Measuring Monolinguality. Given a corpus of language A containing x% noise from language B, the amount of noise is measured as follows. For each top-frequency word of B, divide its relative frequency in the corpus by its relative frequency in a clean B corpus. The amount of noise is the predominant ratio: many ratios will be close to x%.
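The measure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the corpora are assumed to be already tokenized lists of words, and the median of the non-zero ratios is used as a simple stand-in for the "predominant ratio", which the slide does not define precisely.

```python
from collections import Counter
from statistics import median


def relative_frequencies(tokens):
    """Map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def noise_ratios(corpus_tokens, clean_b_tokens, top_n=1000):
    """For the top_n words of the clean B corpus, divide the relative
    frequency in the corpus under test by the relative frequency in B."""
    rel_corpus = relative_frequencies(corpus_tokens)
    clean_counts = Counter(clean_b_tokens)
    total_b = sum(clean_counts.values())
    ratios = {}
    for word, count in clean_counts.most_common(top_n):
        rel_b = count / total_b
        ratios[word] = rel_corpus.get(word, 0.0) / rel_b
    return ratios


def estimate_noise(ratios):
    """Assumption: take the median of the non-zero ratios as the
    predominant ratio; words not found in the corpus are ignored."""
    found = [r for r in ratios.values() if r > 0]
    return median(found) if found else 0.0


# Toy data: 990 tokens of language A plus 10 tokens of B (1% noise).
clean_b = ["der"] * 50 + ["die"] * 30 + ["und"] * 20
corpus = ["the"] * 600 + ["of"] * 390 + ["der"] * 5 + ["die"] * 3 + ["und"] * 2
noise = estimate_noise(noise_ratios(corpus, clean_b))
```

In the toy data every B word keeps its share within the noise, so all three ratios sit at the injected noise level. On real corpora the ratios scatter, which is why a robust summary such as the median (or a histogram mode) is needed.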

7 The top frequency words of B w.r.t. A
(1) Words that do not occur in language A. Their frequency ratio will be around x%.
(2) Words that are also amongst the highest frequency words of language A and moreover have the same function. Their frequency ratio will be around 1.
(3) Words that occur in language A, but at different frequency bands. They are a random sample of words of L and are distributed in a Zipfian way.
(4) Words of B that are often used in named entities and titles (such as capitalized stop words). They appear in the corpus of language A more frequently than the expected x% of noise.
The second group of words is only present in languages that are very similar to each other.

8 Lexical overlap in top 1000 words
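The overlap figures on this slide are not preserved in the transcript. As a rough sketch of how such an overlap could be computed, assuming tokenized corpora and hypothetical function names, one can intersect the two top-n word lists:

```python
from collections import Counter


def top_words(tokens, n=1000):
    """Return the n most frequent words of a tokenized corpus."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def lexical_overlap(tokens_a, tokens_b, n=1000):
    """Count the words shared by the top-n lists of both corpora."""
    return len(set(top_words(tokens_a, n)) & set(top_words(tokens_b, n)))


# Toy example with two tiny "corpora".
tokens_a = "i i i en en og".split()
tokens_b = "i i i och och en".split()
shared_top2 = lexical_overlap(tokens_a, tokens_b, n=2)
shared_top3 = lexical_overlap(tokens_a, tokens_b, n=3)
```

A large overlap signals closely related languages, for which the second word group of the previous slide (ratio around 1) becomes relevant.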

9 Experiment 1. Artificial noise mixtures: injecting alien-language material into monolingual corpora. Experiment 1a: injecting different amounts of German noise into a chunk of the British National Corpus (~20 million words). Experiment 1b: injecting 1% noise of Norwegian, Swedish and Dutch into a Danish corpus (~17 million words). For measuring, we used the top 1000 words.
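An artificial noise mixture of this kind could be built roughly as follows. This is a sketch under stated assumptions: the slide does not say whether noise is counted in words or sentences, so the hypothetical `inject_noise` function below counts sentences, and it draws the alien sentences at random with a fixed seed.

```python
import random


def inject_noise(clean_sentences, alien_sentences, noise_fraction, seed=0):
    """Return a shuffled mixture in which alien sentences make up
    roughly noise_fraction of all sentences (assumption: the fraction
    is measured by sentence count, not by word count)."""
    if not 0 <= noise_fraction < 1:
        raise ValueError("noise_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_alien / (n_clean + n_alien) = noise_fraction for n_alien.
    n_alien = round(len(clean_sentences) * noise_fraction / (1 - noise_fraction))
    if n_alien > len(alien_sentences):
        raise ValueError("not enough alien sentences for this noise level")
    mixed = list(clean_sentences) + rng.sample(list(alien_sentences), n_alien)
    rng.shuffle(mixed)
    return mixed


# Toy example: 99 Danish-like sentences plus 1% alien material.
clean = [f"dansk saetning {i}" for i in range(99)]
alien = [f"svensk mening {i}" for i in range(50)]
mixed = inject_noise(clean, alien, noise_fraction=0.01)
```

The mixture can then be fed to the ratio measure to check whether the known noise level is recovered.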

10 German in BNC

11 Invading Denmark

12 Experiment 2. For a collection of web documents (~700 million words from .de domains), we measure the effect of a corpus cleaning method that strips alien-language material.
Table (cell values not preserved in this transcript). Columns: number of top words found and approx. frequency ratio, each before cleaning and after cleaning. Rows: German, English, French, Dutch, Turkish.

13 Cleaning .de web

14 Conclusion. The measure captures the amount of noise well. Noise measured down to a ratio of (value not preserved). Effective: involves 1000 frequency counts per language.

15 Application: Monolingual Corpora Screenshot corpora

16 Workflow (diagram; labels only)
Resources: Texts (Web / Newspapers); URLs; Dictionaries (Dornseiff, WordNets, Wikipedia, ...).
Techniques: Crawling; Language detection, Cleaning (lang. 1, lang. 2, ..., lang. n); Standard Size Corpora; POS Tagging; Language Statistics; Web Statistics; Co-occurrences etc.; Small Worlds; Clustering; Classification.
Results: Similar objects (words, sentences, documents, URLs); Classification (semantic properties, subject areas, ...); Combined objects (NE-Recognition, terminology, ...): determine patterns, extract multi-words; Decomposition; Morphology; Inflection; Translation pairs; Classified Objects; Neologisms; Trend Mining; Topic Tracking (Language + Time).

17 Corpus Browser. Per word: frequency; example sentences; co-occurrences (left and right neighbours, sentence-based); co-occurrence graph.

18 Only a few copies left! DVD: 15 languages; Corpus Browser; corpora in plain text and database format.

19 Questions?? THANK YOU!
