
Collecting Highly Parallel Data for Paraphrase Evaluation David L. Chen The University of Texas at Austin William B. Dolan Microsoft Research The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL) June 20, 2011

Machine Paraphrasing
Goal: semantically equivalent content
Many applications:
– Machine Translation
– Query Expansion
– Summary Generation
Lack of standard datasets
– No "professional paraphrasers"
Lack of standard metric
– BLEU does not account for sentence novelty

Two-pronged Solution
Crowdsourced paraphrase collection
– Highly parallel data
– Corpus released for community use
Simple n-gram based metric
– BLEU for semantic adequacy and fluency
– New metric PINC for lexical dissimilarity

Outline
– Data collection through Mechanical Turk
– New metric for evaluating paraphrases
– Correlation with human judgments

Annotation Task
Describe the video in a single sentence.

Data Collection
Descriptions of the same video → natural paraphrases
YouTube videos submitted by workers
– Short
– Single, unambiguous action/event
Bonus: descriptions in different languages → translations

Example Descriptions
Someone is coating a pork chop in a glass bowl of flour.
A person breads a pork chop.
Someone is breading a piece of meat with a white powdery substance.
A chef seasons a slice of meat.
Someone is putting flour on a piece of meat.
A woman is adding flour to meat.
A woman is coating a piece of pork with breadcrumbs.
A man dredges meat in bread crumbs.
A person breads a piece of meat.
A woman is breading some meat.
A woman coats a meat cutlet in a dish.

Quality Control
Tier 1: $0.01 per description
Tier 2: $0.05 per description
– The two tiers have identical tasks but different pay rates
– Initially, everyone has access only to Tier-1 tasks
– Good workers are promoted to Tier 2 based on the number of descriptions submitted, English fluency, and the quality of their descriptions

Statistics of data collected
– 122K descriptions for 2,089 videos
– Spent around $5,000

Paraphrase Evaluations
Human judges
ParaMetric (Callison-Burch et al., 2008)
– Precision/recall of paraphrases discovered between two parallel documents
Paraphrase Evaluation Metric (PEM) (Liu et al., 2010)
– Pivot language for semantic equivalence
– SVM trained on human ratings to combine semantic adequacy, fluency, and lexical dissimilarity scores

Semantic Adequacy and Fluency
– Use BLEU score with multiple references
– Highly parallel data captures a wide space of equivalent sentences
– Natural distribution of descriptions
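As a rough illustration of multi-reference BLEU, here is a minimal sketch using NLTK's sentence_bleu; this is not the authors' implementation, and the tokenization and smoothing choices are assumptions:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# All descriptions of the same video serve as references (tokenized).
references = [
    "a person breads a pork chop".split(),
    "a woman is adding flour to meat".split(),
    "a man dredges meat in bread crumbs".split(),
]
candidate = "a person breads a piece of meat".split()

# BLEU-4 with smoothing so short sentences do not score zero.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.4f}")
```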

Lexical Dissimilarity
PINC: Paraphrase In N-gram Changes
– The percentage of n-grams in the candidate that differ from the source
For source s and candidate c:

PINC(s, c) = \frac{1}{N} \sum_{n=1}^{N} \left( 1 - \frac{|\text{$n$-grams}_s \cap \text{$n$-grams}_c|}{|\text{$n$-grams}_c|} \right)

PINC Example
Source: a man fires a revolver at a practice range.

Candidate                                    | PINC
a man fires a gun at a practice range        | 36.41
a man shoots a gun at a practice range       | 56.75
someone is practice shooting at a gun range  | 87.05
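A minimal sketch of PINC, assuming lowercase whitespace tokenization, stripped punctuation, N = 4, and multiset n-gram counting; this combination reproduces the first two table values above, though the original implementation may differ in tokenization details:

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def pinc(source, candidate, max_n=4):
    """Mean percentage of candidate n-grams (n = 1..max_n) absent from the source.
    Assumes the candidate has at least max_n tokens."""
    src = source.lower().replace(".", "").split()
    cand = candidate.lower().replace(".", "").split()
    total = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        overlap = cand_counts & ngrams(src, n)  # multiset intersection
        total += 1.0 - sum(overlap.values()) / sum(cand_counts.values())
    return 100.0 * total / max_n

source = "a man fires a revolver at a practice range."
print(f"{pinc(source, 'a man fires a gun at a practice range'):.2f}")   # 36.41
print(f"{pinc(source, 'a man shoots a gun at a practice range'):.2f}")  # 56.75
```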

Building Paraphrase Model
Training data (source sentence → paraphrase):
A person breads a pork chop. → A woman is adding flour to meat.
A chef seasons a slice of meat. → A person breads a piece of meat.
A woman is adding flour to meat. → A woman is breading some meat.
Train Moses (English to English) on these pairs.

Constructing Training Pairs
Descriptions of the same video:
A person breads a pork chop.
A chef seasons a slice of meat.
Someone is putting flour on a piece of meat.
A woman is adding flour to meat.
A man dredges meat in bread crumbs.
A person breads a piece of meat.
A woman is breading some meat.

For each source sentence, randomly select n descriptions of the same video as target paraphrases. For n = 2, starting with the first sentence as the source:
A person breads a pork chop. → A woman is adding flour to meat.
A person breads a pork chop. → A person breads a piece of meat.
Move to the next sentence as the source:
A chef seasons a slice of meat. → A person breads a pork chop.
A chef seasons a slice of meat. → A woman is adding flour to meat.
Repeat until each sentence has served as the source once:
Someone is putting flour on a piece of meat. → A person breads a pork chop.
Someone is putting flour on a piece of meat. → A person breads a piece of meat.
(and so on for the remaining sentences; see the sketch below)
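A minimal sketch of this pairing scheme; the data structures are hypothetical, and the slides do not specify sampling details beyond random selection:

```python
import random

def build_training_pairs(descriptions, n):
    """For each sentence, pair it with n randomly chosen other
    descriptions of the same video as target paraphrases."""
    pairs = []
    for source in descriptions:
        others = [d for d in descriptions if d != source]
        if n == "all":
            targets = others
        else:
            targets = random.sample(others, min(n, len(others)))
        pairs.extend((source, target) for target in targets)
    return pairs

video = [
    "A person breads a pork chop.",
    "A chef seasons a slice of meat.",
    "A woman is adding flour to meat.",
]
for src, tgt in build_training_pairs(video, n=2):
    print(f"{src}  ->  {tgt}")
```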

Testing
Descriptions of the same video:
A person breads a pork chop.
A chef seasons a slice of meat.
Someone is putting flour on a piece of meat.
A woman is adding flour to meat.
A man dredges meat in bread crumbs.
A person breads a piece of meat.
A woman is breading some meat.

Use each sentence in the test set once as the source: feed it to Moses (English to English) to produce a candidate paraphrase (e.g., "A person breads a piece of meat." might yield "A person seasons some pork." or "A person breads meat.").
Reference sentences for BLEU: use all sentences in the same set as references.
Source sentence for PINC: compute PINC with just the selected source (see the evaluation sketch below).
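A sketch of the evaluation loop that combines the two metrics; paraphrase_model stands in for the trained Moses system, and bleu and pinc are metric functions like those sketched earlier:

```python
def evaluate(test_videos, paraphrase_model, bleu, pinc):
    """Score each generated paraphrase: BLEU against all descriptions
    of the same video, PINC against only the selected source."""
    bleu_scores, pinc_scores = [], []
    for descriptions in test_videos:      # one list of sentences per video
        for source in descriptions:
            candidate = paraphrase_model(source)
            references = descriptions      # all sentences in the same set
            bleu_scores.append(bleu(references, candidate))
            pinc_scores.append(pinc(source, candidate))
    n = len(bleu_scores)
    return sum(bleu_scores) / n, sum(pinc_scores) / n
```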

Paraphrase experiment
– Split videos into 90% for training, 10% for testing
– Use only Tier-2 sentences
– Train: source sentences; Test: 3,367 source sentences
– Train on different numbers of pairs:
  – n=1: 28,758 pairs
  – n=5: 143,776 pairs
  – n=10: 287,198 pairs
  – n=all: 449,026 pairs
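Since descriptions of the same video are near-paraphrases of one another, the split has to be made at the video level, not the sentence level, so that paraphrases of a test source never appear in training. A minimal sketch, assuming the corpus is a dict mapping video IDs to description lists:

```python
import random

def split_by_video(videos, train_frac=0.9, seed=0):
    """Split at the video level so paraphrases of a test sentence
    never leak into the training data."""
    ids = sorted(videos)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_frac)
    train = {v: videos[v] for v in ids[:cut]}
    test = {v: videos[v] for v in ids[cut:]}
    return train, test
```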

Example paraphrase output

Source                                   | n=1                            | n=all
a bunny is cleaning its paw              | a rabbit is licking its paw    | a rabbit is cleaning itself
a boy is doing karate                    | a man is doing karate          | a boy is doing martial arts
a big turtle is walking                  | a huge turtle is walking       | a large tortoise is walking
a guy is doing a flip over a park bench  | a man does a flip over a bench | a man is doing stunts on a bench

Paraphrase Evaluation

Human Judgments
– Two fluent English speakers
– 200 randomly selected sentences
– Candidates from two systems: n=1 and n=all
– Rated 1 to 4 on the following categories: Semantic Equivalence, Lexical Dissimilarity, Overall
– Measure correlation using Pearson's correlation coefficient
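Computing the metric-versus-human correlation is a one-liner with SciPy; the arrays below are illustrative, not the study's data:

```python
from scipy.stats import pearsonr

# Hypothetical paired scores: metric value and mean human rating per sentence.
bleu_scores = [0.62, 0.35, 0.48, 0.71, 0.22]
human_ratings = [3.5, 2.0, 3.0, 4.0, 1.5]

r, p_value = pearsonr(bleu_scores, human_ratings)
print(f"Pearson r = {r:.4f} (p = {p_value:.3f})")
```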

Correlation with Human Judgments
(Table of Pearson correlations; most cell values were lost in transcription.)
Rows: Judge A vs. B, BLEU vs. Human, PINC vs. Human, PEM (Liu et al. 2010) vs. Human
Columns: Semantic Equivalence, Lexical Dissimilarity, Overall
Surviving values: BLEU vs. Human is 0.5095 on Semantic Equivalence and N/A on Lexical Dissimilarity; PINC vs. Human is N/A on Semantic Equivalence.
Correlation strength legend: Strong / Medium / Weak / None

Combined BLEU/PINC vs. Human Overall
(Correlation values lost in transcription.) Combinations evaluated: Arithmetic Mean, Geometric Mean, Harmonic Mean.
Correlation strength legend: Strong / Medium / Weak / None
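For reference, the three combinations of a BLEU score b and a PINC score p (both scaled to [0, 1]) are straightforward; this is a sketch, since the exact scaling used in the slides is not recoverable here:

```python
from math import sqrt

def combine(b, p):
    """Arithmetic, geometric, and harmonic means of BLEU and PINC."""
    arithmetic = (b + p) / 2
    geometric = sqrt(b * p)
    harmonic = 2 * b * p / (b + p) if b + p else 0.0
    return arithmetic, geometric, harmonic

print(combine(0.45, 0.70))  # hypothetical BLEU and PINC values
```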

Conclusion
– Introduced a novel paraphrase collection framework using crowdsourcing
– Data available for download (or search for "Microsoft Research Video Description Corpus")
– Described a way of utilizing BLEU and a new metric, PINC, to evaluate paraphrases

Backup Slides

Video Description vs. Direct Paraphrasing
– Randomly selected 1000 sentences and asked the same pool of workers to paraphrase them directly
– 92% found video descriptions more enjoyable; 75% found them easier
– 50% preferred the video description task, versus only 16% who preferred direct paraphrasing
– Video descriptions showed more divergence as measured by PINC
– Only drawback is the time to load the videos

Example video

English Descriptions
A man eats sphagetti sauce.
A man is eating food.
A man is eating from a plate.
A man is eating something.
A man is eating spaghetti from a large bowl while standing.
A man is eating spaghetti out of a large bowl.
A man is eating spaghetti.
A man is eating.
A man tasting some food in the kitchen is expressing his satisfaction.
The man ate some pasta from a bowl.
The man is eating.
The man tried his pasta and sauce.

Statistics of data collected
– Total money spent: $5,000
– Total number of workers: 835

Quality Control
Worker has to prove actual task competence
– Novotney and Callison-Burch, NAACL 2010 AMT workshop
Promote workers based on work submitted
– # submissions
– English fluency
– Describing the videos well

PINC vs. Human (BLEU > threshold)
(Table of correlations at increasing BLEU thresholds; values lost in transcription.)
Columns: Threshold, Lexical Dissimilarity, Overall
Correlation strength legend: Strong / Medium / Weak / None

Combined BLEU/PINC vs. Human Overall
(Correlation values lost in transcription.) Combinations evaluated: Arithmetic Mean, Geometric Mean, Harmonic Mean, PINC × Oracle, Sigmoid(BLEU).
Correlation strength legend: Strong / Medium / Weak / None

Correlation with Human Judgments