1
Towards an NLP ‘module’: The role of an utterance-level interface
2
Modular architecture
[Diagram labels: language-independent application; meaning representation; language module; utterance-level interface; text or speech]
3
Desiderata for NLP module
1. Application- and domain-independent
2. Bidirectional processing
3. No grammar-specific information should be needed in the application
4. Architecture should support multiple languages
5. Practical
6. Coverage: all well-formed input should be accepted, robust to speaker errors
4
Why?
- developers could build ‘intelligent’ responsive applications without being NLP experts themselves
- less time-consuming and expensive than doing the NLP for each application domain
- multilingual applications
- support further research
5
LinGO/DELPH-IN
- Software and ‘lingware’ for application- and domain-independent NLP
- Linguistically motivated (HPSG), deep processing
- Multiple languages
- Analysis and generation
- Informal collaboration since c. 1995
- NLP research and development, theoretical research and teaching
- www.delph-in.net
6
What’s different?
- Open Source, integrated systems
- Data-driven techniques combined with linguistic expertise
- Testing
  - empirical basis, evaluation
  - linguistic motivation
- No toy systems!
  - large-scale grammars
  - maintainable software
  - development and runtime tools
7
Progress
1. Application- and domain-independent: reasonable (lexicons, text structure)
2. Bidirectional processing: yes
3. No grammar-specifics in applications: yes
4. Multiple languages: English, Japanese, German, Norwegian, Korean, Greek, Italian, French, plus grammar sharing via the Matrix
5. Practical: efficiency OK for some applications and improving; interfaces?
6. Coverage and robustness: 80%+ coverage on English, good parse selection, not robust
8
Integrating deep and shallow processing
- Shallow processing: speed and robustness, but lacks precision
- Pairwise integration of systems is time-consuming and brittle
- Common semantic representation language: shallow processing yields underspecified representations
- Demonstrated effectiveness on IE (Deep Thought)
- Requires that systems share tokenization (undesirable and impractical) or that output can be precisely aligned with the original document
  - Markup complicates this
9
Utterance-level interface
- text or speech
- complex cases
  - text structure (e.g., headings, lists)
  - non-text (e.g., formulae, dates, graphics)
  - segmentation (esp. Japanese, Chinese)
  - speech lattices
  - integration of multiple analyzers
10
Utterance interface
- Standard interface language
  - allow for ambiguity at all levels
  - XML
  - collaborating with ISO working group (MAF)
  - processors deliver standoff annotations to original text
- Plan to develop finite-state preprocessors for some text types, allow for others
- Plan to experiment with speech lattices
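The idea of standoff annotation with ambiguity can be illustrated with a minimal Python sketch. The element and attribute names (`annotations`, `span`, `from`, `to`, `type`) and the example sentence are hypothetical and are not taken from MAF or from any DELPH-IN format. The original text is left untouched, each annotation points back to it by character offsets, and overlapping spans record competing analyses.

```python
import xml.etree.ElementTree as ET


def to_standoff(spans):
    """Build a standoff XML tree; element and attribute names are hypothetical."""
    root = ET.Element("annotations")
    for i, (start, end, kind) in enumerate(spans):
        ET.SubElement(
            root,
            "span",
            {"id": f"s{i}", "from": str(start), "to": str(end), "type": kind},
        )
    return root


text = "Call Fred on 12/05/2004."

# Overlapping spans express ambiguity: characters 13..23 may be one date,
# while 13..15 is also offered as a plain number.
spans = [
    (0, 4, "word"),
    (5, 9, "word"),
    (10, 12, "word"),
    (13, 23, "date"),
    (13, 15, "number"),
    (23, 24, "punct"),
]

root = to_standoff(spans)
xml_string = ET.tostring(root, encoding="unicode")


def covered_text(element):
    """Recover the stretch of original text that an annotation points to."""
    return text[int(element.get("from")):int(element.get("to"))]


for element in root:
    print(element.get("type"), repr(covered_text(element)))
```

Because the annotations never modify the text, several processors could in principle add their own spans over the same document without having to agree on a tokenization.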
11
Assumptions about tokenization
- tokenization: input data is transformed into a form suitable for morphological processing or lexical lookup:
    What’s in those 234 dogs’ bowls, Fred?
    what ’s in those dogs ’s bowls, Fred ?
- tokenization is therefore pre-lexical and cannot depend on lexical lookup
- normalization (case, numbers, dates, formulae) as well as segmentation
- it used to be common to strip punctuation, but large-coverage systems utilize it
- in generation: go from tokens to final output
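A pre-lexical tokenizer of this kind can be sketched with a few regular-expression rules and no lexicon. The rules below and the `<num>` placeholder are assumptions made for illustration. They do not reproduce the tokenization shown on the slide (for example, no case normalization is attempted), and they are not the rules of any DELPH-IN preprocessor.

```python
import re

# Hypothetical pre-lexical rules: no dictionary is consulted anywhere.
TOKEN_RE = re.compile(
    r"""
      \d+              # a run of digits
    | [A-Za-z]+        # a run of letters
    | ['’]s\b          # clitic or possessive 's, split from its host word
    | ['’]             # a bare apostrophe (e.g. plural possessive)
    | [^\sA-Za-z\d]    # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Segment text and normalize digit runs to a placeholder token."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        form = match.group()
        if form.isdigit():
            form = "<num>"  # assumed normalization of numbers
        tokens.append(form)
    return tokens


print(tokenize("What's in those 234 dogs' bowls, Fred?"))
```

Punctuation is kept as separate tokens rather than stripped, in line with the point that large-coverage systems make use of it.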
12
Tokenization ambiguity
- Unusual to find cases where humans have any difficulty: the problem arises because we need a pipelined system
- Some examples:
  - ‘I washed the dogs’ bowls’, I said. (first ’ could be end of quote)
  - The ’keeper’s reputations are on the line. (first ’ actually indicates an abbreviation for goalkeeper, but could be the start of a quote in text where ’ is not distinct from ‘)
  - I want a laptop-with a case. (common in email not to have spaces around a dash)
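One way to keep such ambiguity available to later stages of a pipeline is to pass on a token lattice instead of a single token sequence. The sketch below is a minimal illustration using the `laptop-with` example. The edge representation (start offset, end offset, form) is an assumption and not a format defined in these slides.

```python
def paths(edges, start, end):
    """Enumerate every token sequence that spans start..end in the lattice."""
    if start == end:
        return [[]]
    result = []
    for edge_start, edge_end, form in edges:
        if edge_start == start:
            for rest in paths(edges, edge_end, end):
                result.append([form] + rest)
    return result


text = "laptop-with"

# Each edge covers a character span of the original text.
edges = [
    (0, 11, "laptop-with"),  # reading 1: a single hyphenated token
    (0, 6, "laptop"),        # reading 2: word, dash, word
    (6, 7, "-"),
    (7, 11, "with"),
]

readings = paths(edges, 0, len(text))
for reading in readings:
    print(reading)
```

A later component, such as a parser, can then choose between the readings using information that is not available at tokenization time.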
13
Modularity problems
- lexicon developers may assume a particular tokenization: e.g., hyphen removal
- different systems tokenize differently: big problem for system integration
- DELPH-IN ‘characterization’: record original string character positions in token and all subsequent units
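The idea behind characterization can be sketched as follows: each token carries the character positions of its source string, so the original text remains recoverable even after a normalization such as hyphen removal. The `Token` class, the `characterize` function, and the normalization rules are hypothetical and are not the DELPH-IN implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # normalized form handed to later components
    start: int  # offset of the first character in the original string
    end: int    # offset one past the last character


def characterize(text):
    """Split on whitespace and record original offsets for every token."""
    tokens = []
    for match in re.finditer(r"\S+", text):
        # Assumed normalization: lowercase and remove hyphens.
        form = match.group().lower().replace("-", "")
        tokens.append(Token(form, match.start(), match.end()))
    return tokens


text = "A Well-formed input"
tokens = characterize(text)

# The normalized forms differ from the source, but the offsets
# still identify exactly which characters each token came from.
originals = [text[t.start:t.end] for t in tokens]
print([t.form for t in tokens])
print(originals)
```

Two systems that tokenize differently can then be aligned by comparing character spans, without needing to agree on token boundaries.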
14
Speech output
- Speech output from a transcribing recognizer is treated as a lattice of tokens
- may actually require retokenization
15
Non-white space languages
- Segmentation in Japanese (e.g., ChaSen) is (in effect) accompanied by lexical lookup / morphological analysis
- definitely do not want to assume this for English: for some forms of processing we may not have a lexicon.