Statistical Machine Translation

Presentation transcript:

How to Configure Statistical Machine Translation with Linked Open Data Resources
Ankit Srivastava, Felix Sasaki, Peter Bourgonje, Julian Moreno-Schneider, Jan Nehring, and Georg Rehm
German Research Center for Artificial Intelligence (DFKI GmbH), Language Technology Lab, Berlin, Germany
19th November 2016, London

Motivating Examples
Consider the following MT outputs:
(1) Source (en): A European Commission spokesman…
    Target (de): Ein Sprecher der European Commission…
(2) Source (en): MS Paint is a good option.
    Target (de): Frau Farbe ist eine gute Wahl.
TC38 - SMT/LOD - Nov 2016

Motivating Examples: What Went Wrong
(1) Unknown word: "European Commission" is left untranslated in "Ein Sprecher der European Commission…"; it should be translated into its German equivalent, "Europäische Kommission".
(2) Entity disambiguation: "MS Paint" is misidentified as a person and rendered as "Frau Farbe"; it should retain its form in translation.

Both types of errors can be rectified by interfacing the SMT system (Moses) with LOD resources, that is, a multilingual semantic knowledge graph (Linked Data) such as DBpedia.
SMT = Statistical Machine Translation
LOD = Linked Open Data

About this Presentation
- Overview of Background Technologies
- Step-by-Step Recipe for configuring SMT with LOD
- Experimental Evaluation
- Critical Analysis
- Endnote
Note: this slide outlines how the talk is structured and which ingredients it uses.



Background Technologies 1
- Phrase-based Statistical MT
- Other paradigms (TM, Hybrid, Neural, …)
- Moses (Open Source Toolkit)
Goal: enrich source-target translation models with knowledge leveraged from linked data resources on the web.
Outline: Statistical MT, Linked Open Data, Semantic Web Tools, Projects DKT & FREME
Potential question: Translation Memories vs. LOD-enriched SMT


Background Technologies 2
- Linguistic resources (lexical) linked via Uniform Resource Identifiers (URIs)
- Datasets such as DBpedia, BabelNet, …
- DBpedia: crowd-sourced knowledge base with 4.58 million entities, 125 languages, and 29.8 million links
Linked data (often capitalized as Linked Data) is a method of publishing structured data so that it can be interlinked and become more useful through semantic queries.

Other Examples of Linked Data
- BabelNet: multilingual dictionary, 14 million entries, 270 languages
- JRC-Names: multilingual named entity resource, more than 205,000 named entities, 20+ languages

Background Technologies 3
- Tools and technologies that help us access linked data on the web
- Semantic Web (Web 3.0): making links so that a person or a machine can explore the web of data


Background Technologies 3
- RDF (Resource Description Framework): XML-like formalism for data on the web
- NIF (NLP Interchange Format): RDF-based interoperability framework
- SPARQL (SPARQL Protocol and RDF Query Language): language to retrieve information from RDF-encoded data

Background Technologies 4
- Digital Curation Technologies (DKT): http://digitale-kuratierung.de
- FREME: http://www.freme-project.eu

Methodology / Recipe
1. Convert the sentence (to be translated) from plaintext to NIF
Each step is demonstrated graphically or with an example.

Methodology: NIF Document
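A minimal sketch of step 1 in Python, building the Turtle by hand rather than with an RDF library. The names `text_to_nif` and `escape_literal` and the default base URI are illustrative assumptions (the base mirrors the `http://freme-project.eu/` prefix that appears in this talk's NIF example); the `nif:` terms come from the NIF 2.0 Core ontology. It is not necessarily how the DKT or FREME services produce their NIF.

```python
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
XSD = "http://www.w3.org/2001/XMLSchema#"


def escape_literal(text):
    """Escape characters that cannot appear raw in a short Turtle string literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def text_to_nif(text, base="http://freme-project.eu/"):
    """Wrap a plaintext sentence in a NIF context resource (Turtle syntax).

    Offsets are counted in Unicode code points, which is what len() returns
    for a Python str. The base URI is a placeholder, not a required value.
    """
    end = len(text)
    subject = f"<{base}#char=0,{end}>"
    return "\n".join([
        f"@prefix nif: <{NIF}> .",
        f"@prefix xsd: <{XSD}> .",
        "",
        subject,
        "    a nif:Context , nif:RFC5147String ;",
        f'    nif:isString "{escape_literal(text)}" ;',
        '    nif:beginIndex "0"^^xsd:nonNegativeInteger ;',
        f'    nif:endIndex "{end}"^^xsd:nonNegativeInteger .',
    ])


if __name__ == "__main__":
    print(text_to_nif("MS Paint is a good option."))
```

The whole sentence becomes a single context resource; later steps attach entity annotations to substrings by pointing back at this context.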

Methodology / Recipe
1. Convert the sentence (to be translated) from plaintext to NIF
2. Perform Named Entity Recognition (tag the entities)
3. Entity Linking with DBpedia Spotlight (link to DBpedia entries)

Methodology: NIF with DBpedia Entity
<http://freme-project.eu/#char=0,2>
    a nif:RFC5147String , nif:Word ;
    nif:anchorOf "MS-Paint" ;
    nif:beginIndex "0" ;
    nif:endIndex "8" ;
    nif:nextWord <http://freme-project.eu/#char=3,8> ;
    nif:referenceContext <http://freme-project.eu/#char=0,26> ;
    nif:sentence <http://freme-project.eu/#char=0,26> ;
    itsrdf:taIdentRef <http://dbpedia.org/resource/Paint_(software)> .
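The sketch below shows one way an entity-linking result could be turned into a NIF annotation like the one above. It assumes the linker returns JSON shaped like DBpedia Spotlight's output, that is, a `Resources` list whose items carry `@URI`, `@surfaceForm` and `@offset`; verify those field names against the service you actually call. The function name `spotlight_to_nif`, the base URI, and the sample response are hypothetical and hand-written for illustration, and the `nif:` and `itsrdf:` prefixes are assumed to be declared elsewhere in the document.

```python
import json


def spotlight_to_nif(response_text, text, base="http://freme-project.eu/"):
    """Turn a Spotlight-style JSON response into NIF entity annotations (Turtle)."""
    resources = json.loads(response_text).get("Resources", [])
    context = f"<{base}#char=0,{len(text)}>"
    blocks = []
    for res in resources:
        surface = res["@surfaceForm"]
        begin = int(res["@offset"])
        end = begin + len(surface)
        if text[begin:end] != surface:
            continue  # skip annotations whose offsets do not match the text
        anchor = surface.replace("\\", "\\\\").replace('"', '\\"')
        blocks.append("\n".join([
            f"<{base}#char={begin},{end}>",
            "    a nif:RFC5147String , nif:Phrase ;",
            f'    nif:anchorOf "{anchor}" ;',
            f'    nif:beginIndex "{begin}" ;',
            f'    nif:endIndex "{end}" ;',
            f"    nif:referenceContext {context} ;",
            f"    itsrdf:taIdentRef <{res['@URI']}> .",
        ]))
    return "\n\n".join(blocks)


# Hand-written sample response (an assumption, not real service output).
SAMPLE = json.dumps({"Resources": [{
    "@URI": "http://dbpedia.org/resource/Paint_(software)",
    "@surfaceForm": "MS Paint",
    "@offset": "0",
}]})

if __name__ == "__main__":
    print(spotlight_to_nif(SAMPLE, "MS Paint is a good option."))
```

Deriving the end offset from the surface form keeps the `#char=begin,end` fragment, `nif:beginIndex` and `nif:endIndex` consistent with each other.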

Methodology / Recipe
1. Convert the sentence (to be translated) from plaintext to NIF
2. Perform Named Entity Recognition (tag the entities)
3. Entity Linking with DBpedia Spotlight (link to DBpedia entries)
4. Retrieve the target language translation (SPARQL query)


http://dbpedia.org/page/Paint_(software)
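A hedged sketch of the SPARQL retrieval step in Python. It only builds the query text and reads a response in the standard SPARQL JSON results format; it does not contact an endpoint. The function names are hypothetical, the query is one plausible formulation of a lookup on `rdfs:label` rather than the authors' exact query, and the sample result is invented for illustration (its label value is taken from the Moses command in this talk).

```python
import re


def page_to_resource(url):
    """Map a DBpedia HTML page URL (/page/) to the resource URI (/resource/)."""
    return url.replace("/page/", "/resource/", 1)


def build_label_query(resource_uri, lang="de"):
    """Build a SPARQL query for the rdfs:label of a resource in one language."""
    if not re.fullmatch(r"[A-Za-z]{2,3}(-[A-Za-z0-9]+)*", lang):
        raise ValueError(f"unexpected language tag: {lang!r}")
    if any(ch in resource_uri for ch in '<>" '):
        raise ValueError(f"unexpected character in URI: {resource_uri!r}")
    return "\n".join([
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "SELECT ?label WHERE {",
        f"  <{resource_uri}> rdfs:label ?label .",
        f'  FILTER (lang(?label) = "{lang}")',
        "}",
    ])


def extract_labels(results):
    """Pull label strings out of a SPARQL JSON results document."""
    return [b["label"]["value"] for b in results["results"]["bindings"]]


# Invented sample in the SPARQL JSON results layout, for illustration only.
SAMPLE_RESULT = {
    "head": {"vars": ["label"]},
    "results": {"bindings": [
        {"label": {"type": "literal", "xml:lang": "de", "value": "Microsoft Paint"}},
    ]},
}

if __name__ == "__main__":
    uri = page_to_resource("http://dbpedia.org/page/Paint_(software)")
    print(build_label_query(uri))
    print(extract_labels(SAMPLE_RESULT))
```

The query needs the `/resource/` URI rather than the `/page/` URL shown on the slide, because the latter is only the HTML view of the entity.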


Methodology / Recipe
1. Convert the sentence (to be translated) from plaintext to NIF
2. Perform Named Entity Recognition (tag the entities)
3. Entity Linking with DBpedia Spotlight (link to DBpedia entries)
4. Retrieve the target language translation (SPARQL query)
5. Translate using Moses (xml-input)
6. Display the output

Methodology: Moses Command
% echo '<np translation="Microsoft Paint">MS Paint</np> is a good option .' | moses -xml-input exclusive -f moses.ini
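The markup passed to Moses can be generated from the linked entities. The sketch below is a minimal Python illustration, assuming each entity is given as a `(begin, end, translation)` tuple of character offsets into the tokenized source sentence. The function name `mark_forced_translations` is hypothetical, and XML-escaping the surrounding text is an assumption that should be checked against how your Moses setup tokenizes and escapes its input.

```python
from xml.sax.saxutils import escape, quoteattr


def mark_forced_translations(text, entities):
    """Wrap entity spans in <np translation="..."> tags for Moses xml-input.

    entities: iterable of (begin, end, translation) tuples with character
    offsets into text. Spans must not overlap.
    """
    pieces = []
    pos = 0
    for begin, end, translation in sorted(entities):
        if begin < pos or end > len(text) or begin >= end:
            raise ValueError(f"invalid or overlapping span: {begin},{end}")
        pieces.append(escape(text[pos:begin]))  # text before the entity
        pieces.append(
            f"<np translation={quoteattr(translation)}>"
            f"{escape(text[begin:end])}</np>"
        )
        pos = end
    pieces.append(escape(text[pos:]))  # remaining text after the last entity
    return "".join(pieces)


if __name__ == "__main__":
    print(mark_forced_translations(
        "MS Paint is a good option .", [(0, 8, "Microsoft Paint")]
    ))
```

The resulting line can then be piped to `moses -xml-input exclusive -f moses.ini`, as in the command above.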

Methodology / Recipe
1. Identify the DBpedia entry for an entity
2. Retrieve the linked target language translation via a SPARQL query on rdfs:label
3. Send the alternate translation to the MT decoder (Moses)
4. Get the correct MT output

To further illustrate the mechanism, consider example (1) from the motivating examples, "European Commission":
Step 1: Execute NER on the input text using DBpedia as a resource. Identify "European Commission" as an entity and retrieve its resource link: http://dbpedia.org/page/European_Commission
Step 2: Via a SPARQL query on the properties rdfs:label and owl:sameAs, retrieve the corresponding German DBpedia page "dbpedia-de:European Commission": http://de.dbpedia.org/page/Europäische_Kommission
Step 3: Send it to the Moses SMT system (in-house DKT) with the decoder feature xml-input switched on, to force the decoder to use this translation for "European Commission".
Step 4: Display the output.

Experimental Evaluation
- Language pair and domain: English-German, IT domain (WMT 2016 Shared Task)
- Setup: forced translation of named entities
- Test set: 1000 segments
- BLEU score improved from 34.0 to 34.8
- 12% more terms were translated correctly than with the baseline

Critical Analysis
- Competing alternatives: other ontology schemas
- Advantages: user-defined, constantly updated data; consistency of terminology
- Weaknesses: user-defined data is error-prone; entity linking errors

Endnote
- Easily implementable modules
- Available on GitHub: https://github.com/dkt-projekt
- A step towards making Machine Translation Semantic Web aware

Thanks! Any questions?
Ankit.Srivastava@dfki.de

Links to References
Contact: Ankit.Srivastava@dfki.de
- DBpedia: http://wiki.dbpedia.org
- DBpedia Spotlight: https://github.com/dbpedia-spotlight/
- DKT: http://digitale-kuratierung.de
- DKT GitHub: https://github.com/dkt-projekt
- FREME: http://www.freme-project.eu
- FREME GitHub: https://github.com/freme-project
- Moses: http://www.statmt.org/moses/
- NIF: http://persistence.uni-leipzig.org/nlp2rdf/
- SPARQL: http://www.w3.org/TR/rdf-sparql-query/