Multilingual Information Extraction framework for real-time detection of terrorist propaganda threats in on-line communication
Bogdan Babych, Centre for Translation Studies, School of Languages, Cultures and Societies, Faculty of Arts
Eric Atwell, Artificial Intelligence Research Group, School of Computing, Faculty of Engineering
XI International Conference “Military education and science: the present and the future”, Military Institute of Taras Shevchenko National University, Kyiv, Ukraine, 27 November 2015
Overview
- NLP for detection of direct terrorist threats is not enough
- Propaganda threats: radicalization, recruitment, justification
- State propaganda as an extension of ‘soft power’, used as a military instrument
- EU Horizon 2020 proposal: automated real-time multilingual detection of security & terrorist propaganda threats
- Technologies: Machine Translation (MT) + Information Extraction (IE)
- Innovative challenges: IE template filling task for propaganda messages
- Exploitation: community intelligence and response development
- Future work: technological outlook & invitation for collaboration
Natural Language Processing (NLP) for direct threat detection is not enough
- Traditionally, NLP techniques are used for identification of direct terrorist threats
- Focus on illegal activities (planned attacks)
- Discovering actionable information: preventing an attack, uncovering a network
- Alerts for analysts about suspect communication
- Database of connected facts
- Intelligent decision-support systems
- US DARPA DEFT project: own-deep-learning-project-for-natural-language-processing/
- UK IDEAS Factory - Detecting Terrorist Activities: Making Sense (included Leeds team), EPSRC/ESRC/CPNI P/H023135/1
Natural Language Processing (NLP) for direct threat detection is not enough
- Problem: propaganda is not captured by traditional direct threat detection
- Terrorist propaganda and fundamentalist radicalization are not strictly illegal
- Increasingly used by terrorist groups & state sponsors of terrorism for:
  - Radicalization
  - Recruiting fighters
  - Creation of local cells, a ‘5th column’
  - Ideological justification of causes for terrorism, manipulation of public opinion
- Crowdsourcing political influence: ‘soft power’ turned into a ‘hard’ military instrument
- State propaganda targets international public opinion and political decisions
- Has direct military consequences
Computational Linguistics in propaganda wars: tasks of creating and countering propaganda
- In Russia, at least since 2004: evidence of funding for research on linguistic means of manipulating public opinion
- Models based on Melchuk’s ‘Meaning-Text Theory’
Computational Linguistics in propaganda wars: tasks of creating and countering propaganda
Technologies rely on a combination of:
- Machine Translation (MT): Statistical + Rule-Based = Hybrid
  - Linguistic features for Part-of-Speech Tagging + Lemmatization
  - Parsing (string-to-tree MT)
- Information Extraction (IE) from MT-translated texts (en)
  - Named Entity recognition (Person, Organization, Location… names)
  - Scenario template filling (detection of Events, Relations, Participants)
- Text similarity detection: e.g., lexical overlap (L) + structure (S) + keywords (K) + named entities (N) (Su and Babych, 2012)
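One way to picture the text similarity item above is a weighted combination of per-feature overlaps. The following is only a minimal sketch and not the method of Su and Babych (2012): the function names, the choice of Jaccard overlap, the equal weights, and the toy documents are all assumptions made for illustration, and the structural feature (S) is left out because it would need a parser.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(
    doc_a: Dict[str, Set[str]],
    doc_b: Dict[str, Set[str]],
    weights: Dict[str, float],
) -> float:
    """Weighted average of per-feature Jaccard overlaps.

    Each document maps a feature name to a set of items, e.g.
    "L" (lexical items), "K" (keywords), "N" (named entities).
    """
    total = sum(weights.values())
    score = sum(
        w * jaccard(doc_a.get(name, set()), doc_b.get(name, set()))
        for name, w in weights.items()
    )
    return score / total if total else 0.0


# Hypothetical weights and toy documents, for illustration only.
WEIGHTS = {"L": 1.0, "K": 1.0, "N": 1.0}
doc1 = {"L": {"troops", "cross", "border"}, "K": {"border"}, "N": {"Kyiv"}}
doc2 = {"L": {"troops", "near", "border"}, "K": {"border"}, "N": {"Kyiv"}}
print(round(combined_similarity(doc1, doc2, WEIGHTS), 3))
```

In practice the weights would presumably be tuned on annotated data rather than set equal.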
Technologies for Text and Speech processing (propaganda sites)
Statistical / Hybrid MT
- Open-source ‘Moses’ decoder
- Euronews site dump ~ (ar, de, en, fr, gr, hu, it, pe, pt, ru, tr, uk)
- Plain text extraction & tokenization; Hunalign sentence alignment: ign/
- Part-of-speech tagging (for factored models: lemma/PoS/word): TnT saarland.de/~thorsten/tnt/ + parameter files
- Leeds MT system (file translation): ar-en, fr-en, es-en, de-en, ru-en, uk-en: file.html
[Diagram: parallel texts (translations) are used to train the phrase table (translation model); target texts are used to train the target language model; the statistical decoder uses both, together with linguistic features & analysis, to turn source text (ST) into target text (TT)]
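The factored representation mentioned above (word, lemma and PoS attached to every token) can be illustrated with a small formatter. This is a sketch under assumptions: the function name and the hand-tagged sentence are invented, the factor order word|lemma|PoS is one possible choice, and the vertical bar is used because it is the separator conventionally seen in Moses factored corpora.

```python
from typing import Iterable, Tuple

Token = Tuple[str, str, str]  # (word, lemma, PoS tag)


def to_factored(tokens: Iterable[Token], sep: str = "|") -> str:
    """Join each (word, lemma, PoS) triple into one factored token."""
    out = []
    for word, lemma, pos in tokens:
        for factor in (word, lemma, pos):
            # A factor containing the separator or a space would corrupt the line.
            if sep in factor or " " in factor:
                raise ValueError(f"factor contains separator or space: {factor!r}")
        out.append(sep.join((word, lemma, pos)))
    return " ".join(out)


# Hand-tagged toy sentence; the tags are illustrative, not real tagger output.
sentence = [
    ("troops", "troop", "NNS"),
    ("crossed", "cross", "VBD"),
    ("the", "the", "DT"),
    ("border", "border", "NN"),
]
print(to_factored(sentence))
# prints: troops|troop|NNS crossed|cross|VBD the|the|DT border|border|NN
```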
Technologies for Text and Speech processing (propaganda sites)
Information Extraction (IE)
- Identification of relevant information, NOT full text understanding
- Scenario template filling task = structured database of events from text
- PoS tagging + chunking + Named Entity recognition + co-reference resolution
- System used: GATE (University of Sheffield)
- Traditionally: for direct threat detection
[Diagram: GATE ANNIE (NER + co-reference), Scenario Template Filling, Ontology]
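To make "scenario template filling = structured database of events" concrete, here is a minimal sketch of a template as a record with typed slots. It does not use the GATE API: the class, the trigger lexicon, the slot names and the example sentence are all hypothetical, and the named entities are assumed to have been annotated already by an upstream NER component.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Annotation = Tuple[str, str]  # (entity type, surface text)


@dataclass
class EventTemplate:
    event_type: str
    persons: List[str] = field(default_factory=list)
    organizations: List[str] = field(default_factory=list)
    locations: List[str] = field(default_factory=list)


# Hypothetical trigger lexicon mapping words to event types.
TRIGGERS = {"attack": "ATTACK", "attacked": "ATTACK", "recruited": "RECRUITMENT"}

# Which template slot each entity type fills.
SLOTS = {"Person": "persons", "Organization": "organizations", "Location": "locations"}


def fill_template(tokens: List[str], entities: List[Annotation]) -> Optional[EventTemplate]:
    """Return a filled template if a trigger word is present, else None."""
    event_type = next(
        (TRIGGERS[t.lower()] for t in tokens if t.lower() in TRIGGERS), None
    )
    if event_type is None:
        return None
    template = EventTemplate(event_type)
    for ent_type, text in entities:
        slot = SLOTS.get(ent_type)
        if slot is not None:  # entity types without a slot are ignored
            getattr(template, slot).append(text)
    return template


# Invented example sentence with pre-annotated entities.
tokens = "Acme Group recruited volunteers in Springfield".split()
entities = [("Organization", "Acme Group"), ("Location", "Springfield")]
print(fill_template(tokens, entities))
```

Each filled template then corresponds to one row in the event database; a real system would also have to resolve co-reference and attach entities to the right event when a text contains several.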
Challenge: IE templates for detecting state- and terrorist propaganda messages
Scenario template filling
- Templates for identification of factual inconsistencies in texts
- Alerts about propaganda threats
- Tracking source (multilingual)
- Resources (facts) for real-time development of a response
Challenge: IE templates for detecting state- and terrorist propaganda messages (continued)
[Figure: scenario template filling applied to ru-en MT output]
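The idea of templates that flag factual inconsistencies can be sketched as a slot-by-slot comparison of two filled templates describing the same event. This is a simplified illustration under assumptions: the slot names, the fillers and the alert format are invented, and deciding that two reports describe the same event is taken as given.

```python
from typing import Dict, List, Tuple

Template = Dict[str, str]  # slot name -> filler


def slot_conflicts(a: Template, b: Template) -> List[Tuple[str, str, str]]:
    """Slots filled in both templates whose fillers differ (case-insensitive)."""
    conflicts = []
    for slot in sorted(a.keys() & b.keys()):
        if a[slot].strip().lower() != b[slot].strip().lower():
            conflicts.append((slot, a[slot], b[slot]))
    return conflicts


# Invented fillers for two reports of the same event.
report_a = {"event": "shelling", "location": "Town X", "date": "2015-01-24", "actor": "Group A"}
report_b = {"event": "shelling", "location": "Town X", "date": "2015-01-24", "actor": "Group B"}

for slot, left, right in slot_conflicts(report_a, report_b):
    print(f"ALERT {slot}: {left!r} vs {right!r}")
# prints: ALERT actor: 'Group A' vs 'Group B'
```

The agreeing slots (here event, location and date) are the kind of facts that could feed the development of a response, while the conflicting slot is what would trigger the alert.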
Challenge: IE templates for detecting state- and terrorist propaganda messages
- More complex templates: attitude frameworks
- Consistent response needs an alternative framework
- How to identify resources for a response: European values system (?)
Challenge: Information Extraction from MT output
- Sensitivity to MT quality for Organization Names, scenario template filling
- Precision OK; recall goes down
- Solution: adapting MT to IE?
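The two metrics named above are computed against a gold-standard set of entities. Below is a minimal sketch of that computation; the organization names are toy data invented for illustration (chosen so that recall is the metric that suffers) and are not results from the study.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision = correct / extracted; recall = correct / gold."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Toy data: organizations in the reference text vs. those found in MT output.
gold = {"United Nations", "Red Cross", "European Commission", "NATO"}
from_mt = {"United Nations", "NATO"}

p, r = precision_recall(from_mt, gold)
print(f"precision={p:.2f} recall={r:.2f}")
# prints: precision=1.00 recall=0.50
```

In this toy case every name the extractor finds is correct, but half of the names are missed, for instance because the MT system translated or distorted them so that they no longer look like organization names.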
Future work
Invitation for collaboration
- Community response to propaganda threats
  - Beyond security analysts: anti-terrorist volunteers and crowd intelligence
- Automatic creation of IE propaganda templates
  - Template similarity and event similarity detection; argumentative texts
- Learning defense and security ontologies from corpora
  - Automated reasoning using ontologies (predicate & description logic)
- Modeling language distortion for real-world communication
  - Dialectal and graphical variation, misspelling, abbreviations
- MT and IE for non-literal language usage
  - Metaphors, euphemisms, indirect references
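As a rough illustration of the language distortion item, misspellings and abbreviations could be normalized before MT and IE by looking tokens up in an abbreviation table and then falling back to fuzzy matching against a lexicon. This is only a sketch: the abbreviation table, the lexicon and the 0.8 cutoff are assumptions, and the fuzzy matching uses the `difflib` module from the Python standard library rather than any model of dialectal variation.

```python
import difflib
from typing import Dict, List

# Hypothetical resources; a real system would load much larger ones.
ABBREVIATIONS: Dict[str, str] = {"govt": "government", "intl": "international"}
LEXICON: List[str] = ["government", "international", "propaganda", "security"]


def normalize(token: str, cutoff: float = 0.8) -> str:
    """Map an abbreviation or misspelling to a lexicon word if one is close enough."""
    low = token.lower()
    if low in ABBREVIATIONS:
        return ABBREVIATIONS[low]
    if low in LEXICON:
        return low
    match = difflib.get_close_matches(low, LEXICON, n=1, cutoff=cutoff)
    # Leave unknown tokens untouched rather than guess.
    return match[0] if match else token


print([normalize(t) for t in ["govt", "propoganda", "Security", "xyz"]])
# prints: ['government', 'propaganda', 'security', 'xyz']
```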