Towards an NLP ‘module’: The role of an utterance-level interface

Modular architecture
[Diagram labels: language-independent application; meaning representation; language module; text or speech; utterance-level interface]

Desiderata for NLP module
1. Application- and domain-independent
2. Bidirectional processing
3. No grammar-specific information should be needed in the application
4. Architecture should support multiple languages
5. Practical
6. Coverage: all well-formed input should be accepted, robust to speaker errors

Why?
- developers could build ‘intelligent’ responsive applications without being NLP experts themselves
- less time-consuming and expensive than doing the NLP for each application domain
- multilingual applications
- support further research

LinGO/DELPH-IN
- Software and ‘lingware’ for application- and domain-independent NLP
- Linguistically motivated (HPSG), deep processing
- Multiple languages
- Analysis and generation
- Informal collaboration since c. 1995
- NLP research and development, theoretical research and teaching

What’s different?
- Open Source, integrated systems
- Data-driven techniques combined with linguistic expertise
- Testing: empirical basis, evaluation, linguistic motivation
- No toy systems!
  - large-scale grammars
  - maintainable software
  - development and runtime tools

Progress
1. Application- and domain-independent: reasonable (lexicons, text structure)
2. Bidirectional processing: yes
3. No grammar-specifics in applications: yes
4. Multiple languages: English, Japanese, German, Norwegian, Korean, Greek, Italian, French; plus grammar sharing via the Matrix
5. Practical: efficiency OK for some applications and improving; interfaces?
6. Coverage and robustness: 80%+ coverage on English, good parse selection, not robust

Integrating deep and shallow processing
- Shallow processing: speed and robustness, but lacks precision
- Pairwise integration of systems is time-consuming, brittle
- Common semantic representation language: shallow processing underspecified
- Demonstrated effectiveness on IE (Deep Thought)
- Requires that systems share tokenization (undesirable and impractical) or that output can be precisely aligned with original document
- Markup complicates this

Utterance-level interface
- text or speech
- complex cases:
  - text structure (e.g., headings, lists)
  - non-text (e.g., formulae, dates, graphics)
  - segmentation (esp. Japanese, Chinese)
  - speech lattices
- integration of multiple analyzers

Utterance interface
- Standard interface language
  - allow for ambiguity at all levels
  - XML
  - collaborating with ISO working group (MAF)
- processors deliver standoff annotations to original text
- Plan to develop finite-state preprocessors for some text types, allow for others
- Plan to experiment with speech lattices

Assumptions about tokenization
- tokenization: input data is transformed to a form suitable for morphological processing or lexical lookup:
    What’s in those 234 dogs’ bowls, Fred?
    what ’s in those dogs ’s bowls, Fred ?
- tokenization is therefore pre-lexical and cannot depend on lexical lookup
- normalization (case, numbers, dates, formulae) as well as segmentation
- it used to be common to strip punctuation, but large-coverage systems utilize it
- in generation: go from tokens to final output
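
The pre-lexical assumption on this slide can be illustrated with a minimal Python sketch. It is not the DELPH-IN tokenizer: the function name `tokenize` and the regular expression are assumptions made for illustration. The point is only that the split relies on character patterns alone and never consults a lexicon. Normalization (lowercasing, rewriting the plural possessive as 's) is deliberately left out.

```python
import re

# Hypothetical pre-lexical tokenizer: character patterns only, no lexicon.
TOKEN_PATTERN = re.compile(
    r"""
    \d+(?:[.,]\d+)*    # numbers such as 234 or 1,000.5
  | [’']s\b            # clitic 's (possessive or contracted form)
  | \w+                # ordinary word
  | [^\w\s]            # any other single non-space character
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Split text into word, number, clitic and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

if __name__ == "__main__":
    print(tokenize("What's in those 234 dogs' bowls, Fred?"))
```

Because punctuation is kept as separate tokens rather than stripped, a later stage can still use it, as the slide notes for large-coverage systems.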

Tokenization ambiguity
- Unusual to find cases where humans have any difficulty: the problem arises because we need a pipelined system
- Some examples:
    ‘I washed the dogs’ bowls’, I said.
      (first ’ could be end of quote)
    The ’keeper’s reputations are on the line.
      (first ’ actually indicates the abbreviation for goalkeeper, but could be start of quote in text where ’ is not distinct from ‘)
    I want a laptop-with a case.
      (it is common not to have spaces round a dash)
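
One way to keep a pipeline from committing too early is to record every candidate reading of an ambiguous character and let later stages choose. The Python sketch below does this for the apostrophe, assuming text in which opening and closing quotes are both written as the ASCII character '. The function name, the label names and the context rules are illustrative assumptions, not part of any DELPH-IN component.

```python
def apostrophe_readings(text):
    """Return (char_offset, candidate_labels) for each ASCII apostrophe.

    The rules look only at the neighbouring characters, so they stay
    pre-lexical; they list the possible readings without picking one.
    """
    results = []
    for i, ch in enumerate(text):
        if ch != "'":
            continue
        prev_is_letter = i > 0 and text[i - 1].isalpha()
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if prev_is_letter and next_is_letter:
            labels = {"clitic"}                      # keeper's
        elif prev_is_letter:
            labels = {"possessive", "close_quote"}   # dogs' bowls
        elif next_is_letter:
            labels = {"open_quote", "abbreviation"}  # 'keeper
        else:
            labels = {"open_quote", "close_quote"}
        results.append((i, labels))
    return results

if __name__ == "__main__":
    for offset, labels in apostrophe_readings("'I washed the dogs' bowls', I said."):
        print(offset, sorted(labels))
```

For the first example on the slide, the apostrophe after "dogs" comes back with both the possessive and the closing-quote reading, which is exactly the ambiguity the slide describes.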

Modularity problems
- lexicon developers may assume particular tokenization: e.g., hyphen removal
- different systems tokenize differently: big problem for system integration
- DELPH-IN ‘characterization’: record original string character positions in token and all subsequent units
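
The idea behind characterization can be sketched in a few lines of Python: each token carries the character positions it came from, so any later unit built from tokens can be mapped back to the original string even if the token form has been normalized. The `Token` type, the function names and the lowercasing step are assumptions for illustration and do not reproduce the DELPH-IN data structures.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    form: str    # possibly normalized form (lowercased here, as an example)
    start: int   # offset of the first character in the original string
    end: int     # offset one past the last character

def tokenize_with_spans(text):
    """Tokenize on words and punctuation, recording character positions."""
    return [
        Token(m.group().lower(), m.start(), m.end())
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]

def surface(text, units):
    """Recover the original substring covered by a non-empty run of units."""
    return text[units[0].start:units[-1].end]

if __name__ == "__main__":
    text = "I want a laptop-with a case."
    tokens = tokenize_with_spans(text)
    print(tokens[0])                   # normalized form plus original span
    print(surface(text, tokens[3:6]))  # original text behind three tokens
```

Two systems that tokenize "laptop-with" differently can still be aligned by comparing spans rather than token sequences.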

Speech output
- Speech output from a transcribing recognizer is treated as a lattice of tokens
- may actually require retokenization
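
A token lattice can be pictured as a directed acyclic graph whose edges carry tokens, with each path from the start node to the end node being one hypothesis. The Python sketch below uses a made-up lattice and a hypothetical helper `lattice_paths`; it assumes the lattice is acyclic and ignores recognizer scores.

```python
from collections import defaultdict

def lattice_paths(edges, start, end):
    """Enumerate all token sequences from start to end in an acyclic lattice.

    edges is an iterable of (source_node, target_node, token) triples.
    """
    outgoing = defaultdict(list)
    for src, dst, token in edges:
        outgoing[src].append((dst, token))

    def walk(node):
        if node == end:
            yield []
            return
        for dst, token in outgoing[node]:
            for rest in walk(dst):
                yield [token] + rest

    return list(walk(start))

if __name__ == "__main__":
    # Invented example: two competing hypotheses that share nodes 0, 3 and 4.
    edges = [
        (0, 3, "recognize"),
        (0, 1, "wreck"), (1, 2, "a"), (2, 3, "nice"),
        (3, 4, "speech"), (3, 4, "beach"),
    ]
    for path in lattice_paths(edges, 0, 4):
        print(" ".join(path))
```

Enumerating paths is only practical for small lattices; a real consumer would more likely process the lattice directly.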

Non-white space languages
- Segmentation in Japanese (e.g., ChaSen) is (in effect) accompanied by lexical lookup / morphological analysis
- definitely do not want to assume this for English: for some forms of processing we may not have a lexicon