Towards an NLP ‘module’
The role of an utterance-level interface
Modular architecture
    Language-independent application
    Meaning representation
    Language module
    text or speech
    Utterance-level interface
Desiderata for NLP module
    1. Application- and domain-independent
    2. Bidirectional processing
    3. No grammar-specific information should be needed in the application
    4. Architecture should support multiple languages
    5. Practical
    6. Coverage: all well-formed input should be accepted, robust to speaker errors
Why?
    developers could build ‘intelligent’ responsive applications without being NLP experts themselves
    less time-consuming and expensive than doing the NLP for each application domain
    multilingual applications
    support further research
LinGO/DELPH-IN
    Software and ‘lingware’ for application- and domain-independent NLP
    Linguistically-motivated (HPSG), deep processing
    Multiple languages
    Analysis and generation
    Informal collaboration since c.1995
    NLP research and development, theoretical research and teaching
What’s different?
    Open Source, integrated systems
    Data-driven techniques combined with linguistic expertise
    Testing
        empirical basis, evaluation
        linguistic motivation
    No toy systems!
        large scale grammars
        maintainable software
        development and runtime tools
Progress
    1. Application- and domain-independent: reasonable (lexicons, text structure)
    2. Bidirectional processing: yes
    3. No grammar-specifics in applications: yes
    4. Multiple languages: English, Japanese, German, Norwegian, Korean, Greek, Italian, French, plus grammar sharing via the Matrix
    5. Practical: efficiency OK for some applications and improving, interfaces?
    6. Coverage and robustness: 80%+ coverage on English, good parse selection, not robust
Integrating deep and shallow processing
    Shallow processing: speed and robustness, but lacks precision
    Pairwise integration of systems is time-consuming, brittle
    Common semantic representation language: shallow processing underspecified
    Demonstrated effectiveness on IE (Deep Thought)
    Requires that systems share tokenization (undesirable and impractical) or that output can be precisely aligned with original document
    Markup complicates this
Utterance-level interface
    text or speech
    complex cases:
        text structure (e.g., headings, lists)
        non-text (e.g., formulae, dates, graphics)
        segmentation (esp. Japanese, Chinese)
        speech lattices
        integration of multiple analyzers
Utterance interface
    Standard interface language
        allow for ambiguity at all levels
        XML
        collaborating with ISO working group (MAF)
        processors deliver standoff annotations to original text
    Plan to develop finite-state preprocessors for some text types, allow for others
    Plan to experiment with speech lattices
Assumptions about tokenization
    tokenization: input data is transformed to a form suitable for morphological processing or lexical lookup:
        What’s in those 234 dogs’ bowls, Fred?
        what ’s in those dogs ’s bowls, Fred ?
    tokenization is therefore pre-lexical and cannot depend on lexical lookup
    normalization (case, numbers, dates, formulae) as well as segmentation
    it used to be common to strip punctuation, but large-coverage systems utilize it
    in generation: go from tokens to final output
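A minimal sketch of pre-lexical segmentation, assuming a purely regular-expression tokenizer with no lexicon behind it. The pattern and the function name `tokenize` are illustrative assumptions, not the DELPH-IN tokenizer, and the sketch performs segmentation only: it does not attempt the case or possessive normalization shown in the slide's example.

```python
import re

# Hypothetical pre-lexical tokenizer: regular expressions only, no lexical lookup.
# Alternatives are tried left to right at each position.
TOKEN_RE = re.compile(
    r"""
    \d+(?:[.,]\d+)*              # numbers such as 234 or 1,000.5
  | ['\u2019]s\b                 # clitic 's (straight or curly apostrophe)
  | [A-Za-z]+(?:-[A-Za-z]+)*     # words, including hyphenated compounds
  | ['\u2019]                    # a bare apostrophe or closing quote
  | [^\sA-Za-z\d]                # any other single punctuation character
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Split text into tokens; whitespace is dropped, punctuation is kept."""
    return [m.group(0) for m in TOKEN_RE.finditer(text)]

tokens = tokenize("What\u2019s in those 234 dogs\u2019 bowls, Fred?")
```

Punctuation is kept as separate tokens rather than stripped, in line with the point that large-coverage systems make use of it.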
Tokenization ambiguity
    Unusual to find cases where humans have any difficulty: the problem arises because we need a pipelined system
    Some examples:
        `I washed the dogs’ bowls’, I said. (first ’ could be end of quote)
        The ’keeper’s reputations are on the line. (first ’ actually indicates an abbreviation for goalkeeper, but could be start of quote in text where ’ is not distinct from `)
        I want a laptop-with a case. (it is common not to have spaces round a dash)
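The apostrophe cases can be sketched as follows: a pre-lexical component cannot decide what each ’ means, so it records every candidate role and leaves the choice to later stages. The role labels and function names here are hypothetical, chosen only for illustration, and the context test (letter or non-letter on each side) is a deliberate simplification.

```python
from itertools import product

def apostrophe_readings(text, quote_chars="'\u2019"):
    """Return (offset, candidate_roles) for each apostrophe-like character."""
    readings = []
    for i, ch in enumerate(text):
        if ch not in quote_chars:
            continue
        prev = text[i - 1] if i > 0 else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "
        if prev.isalpha() and nxt.isalpha():
            roles = ["word-internal"]               # e.g. keeper's
        elif prev.isalpha():
            roles = ["possessive", "close-quote"]   # e.g. dogs' bowls
        elif nxt.isalpha():
            roles = ["open-quote", "abbreviation"]  # e.g. 'keeper
        else:
            roles = ["close-quote", "open-quote"]
        readings.append((i, roles))
    return readings

def enumerate_analyses(text):
    """Expand the per-character ambiguity into complete analyses."""
    readings = apostrophe_readings(text)
    offsets = [offset for offset, _ in readings]
    return [list(zip(offsets, combo))
            for combo in product(*(roles for _, roles in readings))]

analyses = enumerate_analyses("`I washed the dogs\u2019 bowls\u2019, I said.")
```

With two ambiguous apostrophes the first example yields four analyses, which is why an interface that allows for ambiguity is preferable to forcing an early decision.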
Modularity problems
    lexicon developers may assume particular tokenization: e.g., hyphen removal
    different systems tokenize differently: big problem for system integration
    DELPH-IN ‘characterization’: record original string character positions in token and all subsequent units
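A rough sketch of the characterization idea, assuming half-open character offsets into the original string. The `Unit` class and the helper names are hypothetical and do not reflect the actual DELPH-IN data structures; the point is only that a normalized form can differ from the surface text while the offsets still lead back to it.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    form: str    # possibly normalized form (here: lowercased)
    start: int   # offset of first character in the original string
    end: int     # offset one past the last character

def characterized_tokens(text):
    """Tokenize and normalize, keeping each token's original character span."""
    return [Unit(m.group(0).lower(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def combine(label, children):
    """Build a larger unit whose span covers all of its children."""
    return Unit(label,
                min(c.start for c in children),
                max(c.end for c in children))

def surface(text, unit):
    """Recover the original text that a unit was derived from."""
    return text[unit.start:unit.end]

text = "I want a laptop-with a case."
tokens = characterized_tokens(text)
phrase = combine("NP", tokens[6:8])   # the tokens "a" and "case"
original = surface(text, phrase)      # "a case"
```

Because every unit carries offsets rather than token indices, two systems that tokenize differently can still be aligned against the same source document.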
Speech output
    Speech output from a transcribing recognizer is treated as a lattice of tokens
    may actually require retokenization
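To make the lattice idea concrete, here is a small sketch that assumes a lattice is a list of edges `(source, target, token, cost)` between nodes numbered in time order, with lower cost meaning a better hypothesis. The representation, the costs, and the example words are all invented for illustration and are not taken from any particular recognizer.

```python
def best_path(edges, start, end):
    """Return (total_cost, tokens) for the cheapest path, or None if unreachable.

    Assumes source < target for every edge, so processing edges in sorted
    order visits all edges into a node before any edge leaving it.
    """
    best = {start: (0, [])}
    for src, dst, token, cost in sorted(edges):
        if src not in best:
            continue  # node not reachable from the start
        total = best[src][0] + cost
        if dst not in best or total < best[dst][0]:
            best[dst] = (total, best[src][1] + [token])
    return best.get(end)

# Hypothetical lattice with two competing transcriptions.
lattice = [
    (0, 2, "recognize", 10),
    (0, 1, "wreck", 6),
    (1, 2, "a", 7),
    (2, 4, "speech", 9),
    (2, 3, "nice", 8),
    (3, 4, "beach", 9),
]
cost, tokens = best_path(lattice, 0, 4)
```

A parser would normally consume the whole lattice rather than only the best path, and the recognizer's tokens may still need retokenizing to match what the grammar expects.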
Non-white space languages
    Segmentation in Japanese (e.g., ChaSen) is (in effect) accompanied by lexical lookup / morphological analysis
    definitely do not want to assume this for English: for some forms of processing we may not have a lexicon