A German Corpus for Similarity Detection


A German Corpus for Similarity Detection
Juan-Manuel Torres-Moreno, Gerardo Sierra and Peter Peinl
Tuxtla Gutiérrez, México, 19-22/11/2014

Similarity detection (SD)
SD aims at comparing different objects (words, sentences, paragraphs, documents) and observing the common features they share according to certain parameters. An object a is similar to b only with reference to a context c.
- Plagiarism: verbatim, paraphrasing
- Other similarity: topic similarity, style similarity
- Reuse: adaptation, translation

Measuring text similarity
d(A,B) = 0 ⟺ A = B (distance)
s(A,B) = 1 ⟺ A = B (similarity)
The metrics produce a score that can be normalized to lie between zero and one. For paraphrase detection a binary result is needed, whereas similarity measures yield a graded score; based on a threshold it is then decided whether the compared texts are the same, a paraphrase, or different.
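As a minimal sketch of this thresholding step (the threshold values and the normalization are illustrative placeholders, not values from the corpus), a graded score can be mapped to the ternary decision described above:

```python
def distance_to_similarity(d, d_max):
    """Normalize a distance d in [0, d_max] to a similarity in [0, 1]."""
    return 1.0 - d / d_max if d_max else 1.0

def classify(score, t_same=0.95, t_paraphrase=0.5):
    """Map a normalized similarity score in [0, 1] to a decision.

    The thresholds are illustrative, not taken from the paper.
    """
    if score >= t_same:
        return "same"
    if score >= t_paraphrase:
        return "paraphrase"
    return "different"
```

For example, classify(distance_to_similarity(1, 10)) would label two texts at distance 1 out of a maximum of 10 as "paraphrase" or "same" depending on the chosen thresholds.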

SD measures
- Vector space models (term-based): Cosine similarity, Dice similarity, Jaccard similarity, Euclidean distance, Manhattan distance
- String-based: N-gram simple, K-skip-n-gram, N-gram overlapping
- Edit distances: Jaro-Winkler distance, Levenshtein distance
- Text alignment (linguistic knowledge-based)
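Three of the term-based measures listed above can be sketched over bag-of-words representations as follows (the whitespace tokenizer is a simplification; the paper does not specify its preprocessing):

```python
import math
from collections import Counter

def tokens(text):
    """Naive whitespace tokenizer; a real system would normalize further."""
    return text.lower().split()

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over token sets."""
    A, B = set(tokens(a)), set(tokens(b))
    return len(A & B) / len(A | B) if A | B else 1.0

def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|) over token sets."""
    A, B = set(tokens(a)), set(tokens(b))
    return 2 * len(A & B) / (len(A) + len(B)) if A or B else 1.0

def cosine(a, b):
    """Cosine of the angle between term-frequency vectors."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

All three return 1.0 for identical token multisets (sets, for Jaccard and Dice) and 0.0 for disjoint vocabularies, matching the normalized range discussed above.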

Current corpora for SD

METER corpus: news texts collected manually from a news agency and nine daily newspapers that reuse these newswires. The resulting 1717 texts were manually classified at the document level into three categories: wholly derived (WD), partially derived (PD), and non-derived (ND). At the phrasal level, individual words and phrases were compared to find verbatim text, paraphrased text, or neither.

Microsoft Research Paraphrase Corpus: 5801 aligned sentence pairs exhibiting lexical and/or structural paraphrase alternations, extracted from news reports. It was created automatically using string edit distance and discourse-based heuristic extraction techniques. The corpus carries binary judgments indicating whether human evaluators considered each pair of sentences semantically equivalent.

PAN Plagiarism Corpus: a corpus for the evaluation of automatic plagiarism detection algorithms. Artificial plagiarism cases were generated automatically by heuristics that vary certain parameters.

Criteria for our corpus
The general purpose of our corpus is to support the automatic assessment of the similarity between a pair of texts, either at the document or at the sentence level. It contains paraphrases of different complexity levels, from basic up to more complex ones. The difference from existing corpora lies in the granularity, in the ranking of paraphrases, and in the annotation method used for evaluation purposes. Our granularity is phrase-to-phrase.

Subject and structure of our corpus
- Original text: Baumkuchen (Wikipedia), 30 sentences
- Basic/simple paraphrase subcorpus
- Complex paraphrase subcorpus
- No paraphrase subcorpus

Paraphrasing rules (basic vs. complex paraphrase)
- Length of an article: ± 1 sentence (basic); ± 5 sentences (complex)
- Order of sentences: may be changed while the text remains comprehensible and cohesive
- Exchange of segments (splitting & merging): forbidden (basic); encouraged (complex)
- Semantic changes: allowed

Semantic changes
- Abbreviations: "vs" or "versus"
- Numbers: 15, fifteen, XV
- Hyperonyms and hyponyms: sugary substance, honey, bee honey; wood, pinewood
- Synonyms and definitions: manuscript, handwritten document
- Compound words: "Produktion von Kuchen" or "Kuchenproduktion"
- Complex phrase structure: lengthy phrases of complex structure, i.e. containing several (possibly nested) sub-phrases
- Temporal: 1855 vs. "in the midst of the nineteenth century"
- Geographical: Japan vs. "the land of the rising sun"
- Personal: Prince Elector Frederick William vs. "the head of state of Brandenburg"
- Quantitative: 8 vs. "between 6 and 9" vs. "a one-digit number"

References
To enable evaluation with the corpus, the relationship between the original and the paraphrased document was annotated with phrase-level references:
- "0 : 29": phrase 0 of the original is found in phrase 29 of the paraphrased document
- "2 :": phrase 2 of the original has been removed
- "3 : 1" and "4 : 1": phrases 3 and 4 of the original have been merged into phrase 1 of the paraphrased document
- "13 : 11,12": phrase 13 of the original has been split and may be found in phrases 11 and 12 of the paraphrased document
- "5 : 5,6" and "6 : 5,6": segments of phrases 5 and 6 of the original have been exchanged
- ": 17": phrase 17 has been added with respect to the original
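A reader of such references could be sketched as follows. The exact file format is not specified on the slide, so this hypothetical parser assumes one "source : targets" pair per line with comma-separated phrase ids, following the examples above:

```python
def parse_reference(line):
    """Parse one alignment reference like '0 : 29', '2 :', ': 17' or '13 : 11,12'.

    Assumed format (not confirmed by the slide): 'source-ids : target-ids',
    either side possibly empty, ids separated by commas or spaces.
    """
    src, _, dst = line.partition(":")
    sources = [int(s) for s in src.replace(",", " ").split()]
    targets = [int(t) for t in dst.replace(",", " ").split()]
    if sources and not targets:
        return {"op": "removed", "src": sources, "dst": []}
    if targets and not sources:
        return {"op": "added", "src": [], "dst": targets}
    if len(targets) > 1:
        return {"op": "split", "src": sources, "dst": targets}
    return {"op": "mapped", "src": sources, "dst": targets}
```

Note that merges ("3 : 1" and "4 : 1") and segment exchanges ("5 : 5,6" and "6 : 5,6") only become visible when combining several lines, so a real tool would post-process the full list of parsed references.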

Application of simple SD measures
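The slide's results are not reproduced in this transcript. As an illustration of the simplest measures applied there, a standard dynamic-programming Levenshtein distance (a generic textbook implementation, not the authors' code) can be computed as:

```python
def levenshtein(a, b):
    """Classic edit distance between strings a and b, using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

A normalized similarity in [0, 1] can then be obtained as 1 - levenshtein(a, b) / max(len(a), len(b)), in line with the normalized scores discussed earlier.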

Conclusions
The German corpus for paraphrase detection has two main characteristics that distinguish it from existing corpora: a set of text pairs aligned at the paraphrase level with a set of references, and three levels of paraphrase for each document (high, low, and no paraphrase). The mapping between the source text and the paraphrased one lets even people who do not speak German study the corpus for similarity detection, making it language independent. The levels of paraphrase allow algorithms for paraphrase detection to be refined, in order to determine the degree of paraphrase more precisely. The application of the simple measures shows, as expected, that the similarity score increases in direct proportion to the degree of paraphrase.

More information
We are still incorporating new texts into the German corpus. We are also preparing two other corpora, one Spanish and one French, using the same protocol as the one presented here. The German corpus is available at: http://simtex.talne.eu