23.3 Information Extraction
More complicated than an IR (Information Retrieval) system.
Requires a limited notion of syntax and semantics.

Attribute-Based Systems
Assume the entire text refers to a single object.
Often use regular expressions to pull out values for attributes:
– [0-9] (one digit), ? (optional), + (one or more), * (zero or more)

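A minimal sketch of attribute extraction with regular expressions, using Python's `re` module. The sample text, the attribute names (`price`, `screen_size`) and the patterns are assumptions made up for illustration; they are not taken from the slides.

```python
import re

# Hypothetical ad text; attribute names and patterns are illustrative assumptions.
TEXT = "IBM ThinkBook 970. Our price: $399.00, 15 inch screen."

# [0-9] matches one digit, + means one or more, ? means optional,
# * means zero or more.
PATTERNS = {
    "price": re.compile(r"[$][0-9]+([.][0-9][0-9])?"),
    "screen_size": re.compile(r"[0-9]+ *inch"),
}

def extract_attributes(text):
    """Return the first match of each attribute pattern, or None if absent."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(0) if match else None
    return result

print(extract_attributes(TEXT))
# {'price': '$399.00', 'screen_size': '15 inch'}
```

Because the whole text is assumed to describe one object, taking the first match per attribute is enough here; a text describing several objects would need the relational approach.
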
Relational-Based Systems
The text might refer to multiple objects.
FASTUS uses cascaded finite-state transducers to perform the following steps:
– Tokenization
– Complex word handling
– Basic group handling (NG noun group, VG verb group, PR preposition, CJ conjunction)
– Complex phrase handling
– Structure merging

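The cascade idea can be sketched as a chain of small stages, each consuming the output of the previous one. This is a toy illustration, not FASTUS itself: the lexicon, the one-letter tag codes, and the function names are all assumptions, and only tokenization and basic group handling are covered.

```python
import re

# Toy lexicon (an assumption): D=determiner, A=adjective, N=noun,
# X=auxiliary, V=verb, P=preposition, C=conjunction.
LEXICON = {
    "the": "D", "a": "D", "joint": "A", "local": "A",
    "company": "N", "venture": "N",
    "has": "X", "formed": "V", "with": "P", "and": "C",
}

# Basic groups written as a regular expression over the tag string.
GROUP_PATTERN = re.compile(r"(?P<NG>D?A*N+)|(?P<VG>X*V+)|(?P<PR>P)|(?P<CJ>C)")

def tokenize(text):
    """Stage 1: split the text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tag(tokens):
    """Stage 2: attach a one-letter tag; unknown words default to noun."""
    return [(tok, LEXICON.get(tok, "N")) for tok in tokens]

def basic_groups(tagged):
    """Stage 3: merge runs of tags into NG, VG, PR and CJ groups.

    Tokens that fit no group (e.g. a stray determiner) are skipped.
    """
    tag_string = "".join(t for _, t in tagged)
    groups = []
    for m in GROUP_PATTERN.finditer(tag_string):
        words = [w for w, _ in tagged[m.start():m.end()]]
        groups.append((m.lastgroup, " ".join(words)))
    return groups

def cascade(text):
    """Run the stages in order, each feeding the next."""
    return basic_groups(tag(tokenize(text)))

print(cascade("The company has formed a joint venture with a local company"))
```

Each stage is finite-state on its own; the later steps (complex phrases, structure merging) would be further stages applied to the group sequence.
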
23.4 Machine Translation
Rough translation (“gist”)
Restricted source (weather)
Pre-edited (Caterpillar English)
Literary (unsolved)
Interlingua: a representation language that captures all meanings of an idea

Transfer System, Figure 23.5
Lexical rule, e.g. ENG[cat] → FR[chat]
Syntactic rule, e.g. ENG[adj noun] → FR[noun adj]
Memory-based rule, e.g. ENG[The cat comes] → FR[Le chat arrive]

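A rough sketch of how the three kinds of transfer rule might be applied. It works on flat word lists rather than the syntax trees a real transfer system would use, and the dictionaries, word classes, and the `translate` function are hypothetical names introduced only for this example.

```python
# Memory-based rules: whole phrases translated as a unit (toy data).
MEMORY = {("the", "cat", "comes"): ["le", "chat", "arrive"]}

# Lexical rules: word-for-word substitutions (toy data).
LEXICON = {"the": "le", "cat": "chat", "black": "noir", "comes": "arrive"}

# Word classes needed by the syntactic rule (toy data).
ADJECTIVES = {"black"}
NOUNS = {"cat"}

def translate(words):
    """Translate a list of lowercase English words into French words."""
    # 1. Memory-based rule: use a stored translation if the phrase is known.
    key = tuple(words)
    if key in MEMORY:
        return list(MEMORY[key])

    # 2. Syntactic rule: ENG[adj noun] becomes FR[noun adj].
    reordered = list(words)
    i = 0
    while i < len(reordered) - 1:
        if reordered[i] in ADJECTIVES and reordered[i + 1] in NOUNS:
            reordered[i], reordered[i + 1] = reordered[i + 1], reordered[i]
            i += 2
        else:
            i += 1

    # 3. Lexical rules: replace each word, keeping unknown words as-is.
    return [LEXICON.get(w, w) for w in reordered]

print(translate(["the", "cat", "comes"]))  # ['le', 'chat', 'arrive']
print(translate(["the", "black", "cat"]))  # ['le', 'chat', 'noir']
```

Trying the memory-based rule first reflects the idea that a stored phrase translation is more reliable than composing one from word-level rules.
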
Statistical Machine Translation
$$\operatorname*{argmax}_F P(F \mid E) = \operatorname*{argmax}_F \frac{P(E \mid F)\, P(F)}{P(E)} = \operatorname*{argmax}_F P(E \mid F)\, P(F)$$
$P(F)$: language model (e.g. bigram model)
$P(E \mid F)$: translation model, p. 856
– $P(\text{fertility} = n \mid \text{word}_F)$: fertility model
– $P(\text{word}_E \mid \text{word}_F)$: word choice model
– $P(\text{offset} = o \mid \text{pos}, \text{len}_E, \text{len}_F)$: offset model

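To make the argmax concrete, here is a small sketch that scores a few candidate French sentences by log P(E | F) + log P(F) and picks the best. Every probability is a made-up illustrative value, and the translation model is simplified to the word choice model alone, assuming fertility 1 and offset 0 (word i of E comes from word i of F). The table and function names are assumptions for this example.

```python
import math

# All probabilities below are invented for illustration, not estimated from data.
# BIGRAM[(prev, word)] stands for P(word | prev); "<s>" marks sentence start.
BIGRAM = {
    ("<s>", "le"): 0.3, ("le", "chat"): 0.1,
    ("chat", "arrive"): 0.05, ("chat", "vient"): 0.04,
    ("<s>", "chat"): 0.01, ("chat", "le"): 0.001, ("le", "arrive"): 0.001,
}
# WORD_CHOICE[(e, f)] stands for P(word_E = e | word_F = f).
WORD_CHOICE = {
    ("the", "le"): 0.7, ("cat", "chat"): 0.8,
    ("comes", "arrive"): 0.3, ("comes", "vient"): 0.5,
}
UNSEEN = 1e-6  # crude floor for pairs missing from the tables

def log_language_model(french):
    """log P(F) under the bigram model."""
    total, prev = 0.0, "<s>"
    for word in french:
        total += math.log(BIGRAM.get((prev, word), UNSEEN))
        prev = word
    return total

def log_translation_model(english, french):
    """log P(E | F), assuming a one-to-one, in-order word alignment."""
    if len(english) != len(french):
        return float("-inf")
    return sum(math.log(WORD_CHOICE.get((e, f), UNSEEN))
               for e, f in zip(english, french))

def decode(english, candidates):
    """Return the candidate F maximizing P(E | F) * P(F)."""
    e_words = english.split()
    return max(candidates,
               key=lambda f: log_translation_model(e_words, f.split())
                             + log_language_model(f.split()))

candidates = ["le chat arrive", "le chat vient", "chat le arrive"]
print(decode("the cat comes", candidates))  # le chat vient
```

P(E) is dropped because it is the same for every candidate F, which is why the last step of the derivation needs only the two models. With these invented numbers the word choice model prefers "vient" strongly enough to outweigh the language model's slight preference for "arrive".
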
Learning Probabilities
Given a French text and an English text:
– Segment into sentences
– Estimate the French language model
– Align sentences
– Estimate the fertility model
– Estimate the word choice model
– Estimate the offset model
– Improve the models using a technique such as EM
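As one illustration of the EM step, the sketch below re-estimates only the word choice model P(word_E | word_F) from sentence-aligned pairs, in the style of IBM Model 1. This is a swapped-in simplification: it ignores the fertility and offset models, and the three-sentence corpus and the function name `train_word_choice` are assumptions for the example.

```python
from collections import defaultdict

# Tiny sentence-aligned corpus (toy data): (English words, French words).
CORPUS = [
    ("the cat".split(), "le chat".split()),
    ("the dog".split(), "le chien".split()),
    ("a cat".split(), "un chat".split()),
]

def train_word_choice(pairs, iterations=10):
    """Estimate t[(e, f)] = P(word_E = e | word_F = f) with EM."""
    # Start uniform over every word pair that co-occurs in some sentence pair.
    t = {(e, f): 1.0 for es, fs in pairs for e in es for f in fs}

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f producing e
        total = defaultdict(float)  # expected count of f producing anything

        # E-step: share each English word among the French words in its sentence.
        for es, fs in pairs:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac

        # M-step: renormalize the expected counts for each French word.
        t = {(e, f): c / total[f] for (e, f), c in count.items()}
    return t

t = train_word_choice(CORPUS)
for (e, f), p in sorted(t.items(), key=lambda item: -item[1])[:4]:
    print(f"P({e} | {f}) = {p:.2f}")
```

No word alignment is given in advance: "le" co-occurs with "the" in two sentences, so each iteration shifts more probability toward that pairing, which in turn frees "chat" and "chien" to explain "cat" and "dog".
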