Computing & Information Sciences, Kansas State University. CIS 490 / 730: Artificial Intelligence, Lecture 39 of 42, Wednesday, 29 November 2006.

1 Computing & Information Sciences, Kansas State University
Wednesday, 29 Nov 2006
CIS 490 / 730: Artificial Intelligence
Lecture 39 of 42
Wednesday, 29 November 2006
William H. Hsu
Department of Computing and Information Sciences, KSU
KSOL course page: http://snipurl.com/v9v3
Course web site: http://www.kddresearch.org/Courses/Fall-2006/CIS730
Instructor home page: http://www.cis.ksu.edu/~bhsu
Reading for Next Class: Sections 22.1, 22.6-7, Russell & Norvig 2nd edition
Natural Language Processing (NLP)
Discussion: Machine Translation (MT)

2 Lecture Outline
• Reference: Sections 6.9-6.10, Mitchell
• Simple Bayes, aka Naïve Bayes
  – More examples
  – Classification: choosing between two classes; general case
  – Robust estimation of probabilities
• Learning in Natural Language Processing (NLP)
  – Learning over text: problem definitions
  – Case study: Newsweeder (Naïve Bayes application)
  – Probabilistic framework
  – Bayesian approaches to NLP
    • Issues: word sense disambiguation, part-of-speech tagging
    • Applications: spelling correction, web and document searching
• Related Material, Mitchell; Pearl
  – Read: “Bayesian Networks without Tears”, Charniak
  – Go over Chapter 14, Russell and Norvig; Heckerman tutorial (slides)

3 Learning Framework for Natural Language: (Hidden) Markov Models
• Definition of Hidden Markov Models (HMMs)
  – Stochastic state transition diagram (HMMs: states, aka nodes, are hidden)
  – Compare: probabilistic finite state automaton (Mealy/Moore model)
  – Annotated transitions (aka arcs, edges, links)
    • Output alphabet (the observable part)
    • Probability distribution over outputs
• Forward Problem: One Step in ML Estimation
  – Given: model h, observations (data) D
  – Estimate: P(D | h)
• Backward Problem: Prediction Step
  – Given: model h, observations D
  – Maximize: P(h(X) = x | h, D) for a new X
• Forward-Backward (Learning) Problem
  – Given: model space H, data D
  – Find: h ∈ H such that P(h | D) is maximized (i.e., MAP hypothesis)
• HMMs Also A Case of LSQ (f Values in [Roth, 1999])
[Figure: state transition diagram with three states (1, 2, 3). Arcs carry transition probabilities (0.4, 0.5, 0.6, 0.8, 0.2, 0.5) and output distributions (A 0.4 / B 0.6; A 0.5 / G 0.3 / H 0.2; E 0.1 / F 0.9; E 0.3 / F 0.7; C 0.8 / D 0.2; A 0.1 / G 0.9).]
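The forward problem (estimate P(D | h)) can be illustrated with a minimal sketch of the forward algorithm. This is an assumption-laden illustration, not the lecture's own code: it uses a state-emission HMM rather than the arc-emission diagram on the slide, and the two-state model, the state names `s1`/`s2`, and all probabilities are hypothetical.

```python
def forward_likelihood(observations, initial, transition, emission):
    """Return P(observations | model) for a discrete, state-emission HMM."""
    if not observations:
        return 1.0  # the empty sequence has probability 1
    states = list(initial)
    # alpha[s] = P(o_1 .. o_t, state at time t = s)
    alpha = {s: initial[s] * emission[s].get(observations[0], 0.0) for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emission[s].get(symbol, 0.0)
            * sum(alpha[prev] * transition[prev][s] for prev in states)
            for s in states
        }
    # Marginalize over the hidden final state.
    return sum(alpha.values())


# Hypothetical two-state model (not the three-state model in the figure).
initial = {"s1": 0.5, "s2": 0.5}
transition = {"s1": {"s1": 0.5, "s2": 0.5}, "s2": {"s1": 0.5, "s2": 0.5}}
emission = {"s1": {"A": 0.5, "B": 0.5}, "s2": {"A": 1.0}}

likelihood = forward_likelihood(["A", "B"], initial, transition, emission)
```

Summing over hidden states at each step keeps the cost linear in the sequence length, instead of enumerating every state path.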

4 NLP Issues: Word Sense Disambiguation (WSD)
• Problem Definition
  – Given: m sentences, each containing a usage of a particular ambiguous word
  – Example: “The can will rust.” (auxiliary verb versus noun)
  – Label: v_j ≡ s ≡ correct word sense (e.g., s ∈ {auxiliary verb, noun})
  – Representation: m examples (labeled attribute vectors)
  – Return: classifier f: X → V that disambiguates new x ≡ (w_1, w_2, …, w_n)
• Solution Approach: Use Bayesian Learning (e.g., Naïve Bayes)
  – Caveat: can’t observe s in the text!
  – A solution: treat s in P(w_i | s) as missing value, impute s (assign by inference)
  – [Pedersen and Bruce, 1998]: fill in using Gibbs sampling, EM algorithm (later)
  – [Roth, 1998]: Naïve Bayes, sparse networks of Winnows (SNOW), TBL
• Recent Research
  – T. Pedersen’s research home page: http://www.d.umn.edu/~tpederse/
  – D. Roth’s Cognitive Computation Group: http://l2r.cs.uiuc.edu/~cogcomp/
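A minimal sketch of the Naïve Bayes approach to WSD, assuming (unlike the caveat on the slide) that a few sense-labeled examples are available. The training contexts for the ambiguous word “can” are invented for illustration, and add-one smoothing stands in for the robust probability estimates mentioned in the outline.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (context_words, sense) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        for word in words:
            word_counts[sense][word] += 1
            vocabulary.add(word)
    return sense_counts, word_counts, vocabulary


def classify(words, model):
    """Return argmax over s of log P(s) + sum_i log P(w_i | s)."""
    sense_counts, word_counts, vocabulary = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)
        denominator = sum(word_counts[sense].values()) + len(vocabulary)
        for word in words:
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[sense][word] + 1) / denominator)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Invented toy contexts for the ambiguous word "can".
examples = [
    (["the", "tin", "rust"], "noun"),
    (["a", "soda", "opened"], "noun"),
    (["you", "swim"], "auxiliary verb"),
    (["we", "go", "now"], "auxiliary verb"),
]
model = train_naive_bayes(examples)
sense_a = classify(["the", "soda"], model)
sense_b = classify(["we", "swim"], model)
```

When the sense s is unobserved, as the slide warns, the counts above would have to be replaced by expected counts (for example via EM); that step is not shown here.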

5 NLP Issues: Part-of-Speech (POS) Tagging
• Problem Definition
  – Given: m sentences containing untagged words
  – Example: “The can will rust.”
  – Label (one per word, out of ~30-150): v_j ≡ s ∈ (art, n, aux, vi)
  – Representation: labeled examples
  – Return: classifier f: X → V that tags x ≡ (w_1, w_2, …, w_n)
  – Applications: WSD, dialogue acts (e.g., “That sounds OK to me.” → ACCEPT)
• Solution Approaches: Use Transformation-Based Learning (TBL)
  – [Brill, 1995]: TBL - mistake-driven algorithm that produces sequences of rules
    • Each rule of the form (t_i, v): a test condition (constructed attribute) and a tag
    • t_i: “w occurs within ±k words of w_i” (context words); collocations (windows)
  – For more info: see [Roth, 1998], [Samuel, Carberry, Vijay-Shanker, 1998]
• Recent Research
  – E. Brill’s page: http://www.cs.jhu.edu/~brill/
  – K. Samuel’s page: http://www.eecis.udel.edu/~samuel/work/research.html
[Figure: layered NLP tasks: Discourse Labeling; Speech Acts; Natural Language Parsing / POS Tagging; Lexical Analysis]
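How an ordered sequence of TBL rules retags a sentence can be sketched as follows. This shows only rule application, not the mistake-driven learning of the rules; the rule encoding `(target_word, context_word, k, new_tag)`, the initial tags, and the two rules are hypothetical choices made for this example.

```python
def apply_rules(words, tags, rules):
    """Apply each rule in order: retag target_word when context_word
    occurs within +/- k positions of it."""
    tags = list(tags)
    for target_word, context_word, k, new_tag in rules:
        for i, word in enumerate(words):
            if word != target_word:
                continue
            window = words[max(0, i - k):i] + words[i + 1:i + 1 + k]
            if context_word in window:
                tags[i] = new_tag
    return tags


words = ["the", "can", "will", "rust"]
# Hypothetical initial guess (e.g., from a most-frequent-tag lexicon).
initial_tags = ["art", "aux", "aux", "n"]
# Hypothetical learned rules, applied in sequence.
rules = [("can", "the", 1, "n"), ("rust", "will", 1, "vi")]

final_tags = apply_rules(words, initial_tags, rules)
```

Because rules are applied in order, a later rule can correct or rely on the effect of an earlier one, which is why TBL produces a sequence rather than a set of rules.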

6 NLP Applications: Info Retrieval (IR) and Digital Libraries
• Information Retrieval (IR)
  – One role of learning: produce classifiers for documents (see [Sahami, 1999])
  – Query-based search engines (e.g., for WWW: AltaVista, Lycos, Yahoo)
  – Applications: bibliographic searches (citations, patent intelligence, etc.)
• Bayesian Classification: Integrating Supervised and Unsupervised Learning
  – Unsupervised learning: organize collections of documents at a “topical” level
  – e.g., AutoClass [Cheeseman et al, 1988]; self-organizing maps [Kohonen, 1995]
  – More on this topic (document clustering) soon
• Framework Extends Beyond Natural Language
  – Collections of images, audio, video, other media
  – Five Ss: Source, Stream, Structure, Scenario, Society
  – Book on IR [van Rijsbergen, 1979]: http://www.dcs.gla.ac.uk/Keith/Preface.html
• Recent Research
  – M. Sahami’s page (Bayesian IR): http://robotics.stanford.edu/users/sahami
  – Digital libraries (DL) resources: http://fox.cs.vt.edu

7 Statistical Machine Translation
Kevin Knight
USC/Information Sciences Institute
USC/Computer Science Department

8 Machine Translation
美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件,威胁将会向机场等公众地方发动生化袭击後,关岛经保持高度戒备。
The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
• The classic acid test for natural language processing.
• Requires capabilities in both interpretation and generation.
• About $10 billion spent annually on human translation.

9 MT Strategies (1954-2004)
[Figure: MT approaches plotted on two axes. Knowledge Acquisition Strategy runs from “All manual” to “Fully automated”; Knowledge Representation Strategy runs from “Shallow/Simple” to “Deep/Complex”. Labels on the chart: Hand-built by experts; Hand-built by non-experts; Learn from annotated data; Learn from un-annotated data; Electronic dictionaries; Word-based only; Phrase tables; Syntactic Constituent Structure; Semantic analysis; Interlingua; Original direct approach; Typical transfer system; Classic interlingual system; Example-based MT; Original statistical MT; New Research Goes Here!]
Slide courtesy of Laurie Gerber

10 Data-Driven Machine Translation
[Figure: translated documents, with two thought bubbles]
“Hmm, every time he sees “banco”, he either types “bank” or “bench” … but if he sees “banco de…”, he always types “bank”, never “bench”…”
“Man, this is so boring.”

11 Recent Progress in Statistical MT
2002:
insistent Wednesday may recurred her trips to Libya tomorrow for flying Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment. And said the official " the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air, a situation her receiving replying are so a trip will pull to Libya a morning Wednesday ".
2003:
Egyptair Has Tomorrow to Resume Its Flights to Libya Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya. " The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning ".
slide from C. Wayne, DARPA

12 Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

13 Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
1a. ok-voon ororok sprok. 1b. at-voon bichat dat.
2a. ok-drubel ok-voon anok plok sprok. 2b. at-drubel at-voon pippat rrat dat.
3a. erok sprok izok hihok ghirok. 3b. totat dat arrat vat hilat.
4a. ok-voon anok drok brok jok. 4b. at-voon krat pippat sat lat.
5a. wiwok farok izok stok. 5b. totat jjat quat cat.
6a. lalok sprok izok jok stok. 6b. wat dat krat quat cat.
7a. lalok farok ororok lalok sprok izok enemok. 7b. wat jjat bichat wat dat vat eneat.
8a. lalok brok anok plok nok. 8b. iat lat pippat rrat nnat.
9a. wiwok nok izok kantok ok-yurp. 9b. totat nnat quat oloat at-yurp.
10a. lalok mok nok yorok ghirok clok. 10b. wat nnat gat mat bat hilat.
11a. lalok nok crrrok hihok yorok zanzanok. 11b. wat nnat arrat mat zanzanat.
12a. lalok rarok nok izok hihok mok. 12b. wat nnat forat arrat vat gat.

22 Centauri/Arcturan [Knight, 1997]
(Same twelve sentence pairs and assignment as slide 13.)
Annotation: process of elimination

23 Centauri/Arcturan [Knight, 1997]
(Same twelve sentence pairs and assignment as slide 13.)
Annotation: cognate?

24 Centauri/Arcturan [Knight, 1997]
Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }
(Same twelve sentence pairs as slide 13.)
Annotation: zero fertility
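The detective work on the Centauri/Arcturan slides can be sketched as a co-occurrence heuristic: a candidate translation must appear in every Arcturan sentence whose Centauri side contains the word, and candidates that also appear opposite sentences lacking the word are eliminated. This is an illustrative simplification, not the statistical model from [Knight, 1997]; the function name `candidate_translations` is hypothetical.

```python
# The twelve Centauri/Arcturan sentence pairs from slide 13 (periods removed).
PAIRS = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok", "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]


def candidate_translations(source_word, pairs=PAIRS):
    """Target words seen in every pair containing source_word and in no other pair."""
    with_word, without_word = [], []
    for source, target in pairs:
        bucket = with_word if source_word in source.split() else without_word
        bucket.append(set(target.split()))
    if not with_word:
        return set()
    shared = set.intersection(*with_word)        # must co-occur every time
    seen_elsewhere = set().union(*without_word)  # process of elimination
    return shared - seen_elsewhere


farok = candidate_translations("farok")
yorok = candidate_translations("yorok")
kantok = candidate_translations("kantok")
```

The heuristic leaves `kantok` with two candidates, which is where the cognate hint (ok-yurp / at-yurp) helps. It also maps both `crrrok` and `zanzanok` to `zanzanat`, because it cannot represent a word that translates to nothing, the zero-fertility case flagged on slide 24.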

25 It’s Really Spanish/English
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa
1a. Garcia and associates. 1b. Garcia y asociados.
2a. Carlos Garcia has three associates. 2b. Carlos Garcia tiene tres asociados.
3a. his associates are not strong. 3b. sus asociados no son fuertes.
4a. Garcia has a company also. 4b. Garcia tambien tiene una empresa.
5a. its clients are angry. 5b. sus clientes estan enfadados.
6a. the associates are also angry. 6b. los asociados tambien estan enfadados.
7a. the clients and the associates are enemies. 7b. los clientes y los asociados son enemigos.
8a. the company has three groups. 8b. la empresa tiene tres grupos.
9a. its groups are in Europe. 9b. sus grupos estan en Europa.
10a. the modern groups sell strong pharmaceuticals. 10b. los grupos modernos venden medicinas fuertes.
11a. the groups do not sell zanzanine. 11b. los grupos no venden zanzanina.
12a. the small groups are not modern. 12b. los grupos pequenos no son modernos.

26 Data for Statistical MT and data preparation

27 Ready-to-Use Online Bilingual Data
[Figure: chart of available bilingual data; axis: millions of words (English side)]
(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

28 Ready-to-Use Online Bilingual Data
[Figure: chart of available bilingual data; axis: millions of words (English side)]
(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).
+ 1m-20m words for many language pairs

29 Ready-to-Use Online Bilingual Data
[Figure: chart of available bilingual data; axis: millions of words (English side)]
One Billion? ???

30 From No Data to Sentence Pairs
• Easy way: Linguistic Data Consortium (LDC)
• Really hard way: pay $$$
  – Suppose one billion words of parallel data were sufficient
  – At 20 cents/word, that’s $200 million
• Pretty hard way: Find it, and then earn it!
  – De-formatting
  – Remove strange characters
  – Character code conversion
  – Document alignment
  – Sentence alignment
  – Tokenization (also called Segmentation)

31 Sentence Alignment
English: The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.
Spanish: El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.

32 Sentence Alignment
English:
1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.
Spanish:
1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.

34 Sentence Alignment
English:
1. The old man is happy. He has fished many times.
2. His wife talks to him.
3. The sharks await.
Spanish:
1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.
Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).
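The note on this slide (drop unaligned sentences, merge n-to-m alignments) can be sketched as a post-processing step. The sketch assumes an aligner has already produced the alignment; the `beads` list of index tuples below is hand-written for this example, and `build_sentence_pairs` is a hypothetical helper name.

```python
english = [
    "The old man is happy.",
    "He has fished many times.",
    "His wife talks to him.",
    "The fish are jumping.",
    "The sharks await.",
]
spanish = [
    "El viejo está feliz porque ha pescado muchas veces.",
    "Su mujer habla con él.",
    "Los tiburones esperan.",
]

# Each bead is (English sentence indices, Spanish sentence indices).
# Hand-written here; a real aligner would produce these.
beads = [((0, 1), (0,)), ((2,), (1,)), ((3,), ()), ((4,), (2,))]


def build_sentence_pairs(source, target, beads):
    """Merge n-to-m beads into single pairs and drop unaligned sentences."""
    pairs = []
    for source_ids, target_ids in beads:
        if not source_ids or not target_ids:
            continue  # unaligned on one side: thrown out
        merged_source = " ".join(source[i] for i in source_ids)
        merged_target = " ".join(target[j] for j in target_ids)
        pairs.append((merged_source, merged_target))
    return pairs


sentence_pairs = build_sentence_pairs(english, spanish, beads)
```

The result has three pairs, matching the three numbered rows on the slide; “The fish are jumping.” has no Spanish counterpart and is discarded.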

35 Tokenization (or Segmentation)
• English
  – Input (some byte stream): "There," said Bob.
  – Output (7 “tokens” or “words”): " There , " said Bob .
• Chinese
  – Input (byte stream):
  – Output: 美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件。

36 Lower-Casing
• English
  – Input (7 words): " There , " said Bob .
  – Output (7 words): " there , " said bob .
Idea of tokenizing and lower-casing:
[Figure: the surface forms The, the, “The, and “the all map to the single token the]
Smaller vocabulary size. More robust counting and learning.
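The English tokenization and lower-casing steps can be sketched with a single regular expression. This is a minimal illustration for the slide's example only: the pattern treats every run of word characters as one token and every other non-space character as its own token, which is cruder than a production tokenizer and does not address Chinese segmentation.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def lower_case(tokens):
    """Map every token to lower case to shrink the vocabulary."""
    return [token.lower() for token in tokens]


tokens = tokenize('"There," said Bob.')
lowered = lower_case(tokens)
```

For the slide's input this yields the seven tokens `"`, `There`, `,`, `"`, `said`, `Bob`, `.`, and lower-casing then merges forms such as `There` and `there` into one vocabulary entry.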

37 It Is Possible to Draw Learning Curves: How Much Data Do We Need?
[Figure: learning curve. Axes: amount of bilingual training data; quality of automatically trained machine translation system.]

38 MT Evaluation

39 MT Evaluation
• Manual:
  – SSER (subjective sentence error rate)
  – Correct/Incorrect
  – Error categorization
• Testing in an application that uses MT as one sub-component
  – Question answering from foreign language documents
• Automatic:
  – WER (word error rate)
  – BLEU (Bilingual Evaluation Understudy)
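Of the automatic metrics listed, WER is simple enough to sketch directly: it is the word-level edit distance between the machine output and a reference, divided by the reference length. The version below is a minimal illustration assuming whitespace tokenization and a single reference; the example sentences are taken from the “BLEU in Action” slide with final periods dropped.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the ref prefix so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "the gunman was shot to death by the police"
wer = word_error_rate(reference, "the gunman was shot dead by the police")
```

Here two edits (“to” substituted by “dead”, “death” deleted) over nine reference words give a WER of 2/9. Because the distance is not bounded by the reference length, WER can exceed 1.0 for long, wrong outputs.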

40 BLEU Evaluation Metric (Papineni et al, ACL-2002)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
• N-gram precision (score is between 0 & 1)
  – What percentage of machine n-grams can be found in the reference translation?
  – An n-gram is a sequence of n words
  – Not allowed to use same portion of reference translation twice (can’t cheat by typing out “the the the the the”)
• Brevity penalty
  – Can’t just type out single word “the” (precision 1.0!)
*** Amazingly hard to “game” the system (i.e., find a way to change machine output so that BLEU goes up, but quality doesn’t)

41 BLEU Evaluation Metric (Papineni et al, ACL-2002)
(Reference and machine translations repeated from slide 40.)
• BLEU4 formula (counts n-grams up to length 4)
$$\mathrm{BLEU4} = \exp\left(1.0 \log p_1 + 0.5 \log p_2 + 0.25 \log p_3 + 0.125 \log p_4 - \max\left(\frac{\text{words-in-reference}}{\text{words-in-machine}} - 1,\ 0\right)\right)$$
p_1 = 1-gram precision
p_2 = 2-gram precision
p_3 = 3-gram precision
p_4 = 4-gram precision
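The BLEU4 formula on this slide can be sketched in a few lines. This follows the slide's formula and weights (1.0, 0.5, 0.25, 0.125) rather than the uniform weights of the original BLEU definition, and it makes several simplifying assumptions: one reference, whitespace tokenization, and a score of 0 whenever any n-gram precision is 0 (where the logarithm is undefined). The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(machine, reference, n):
    """Share of machine n-grams found in the reference, each reference
    n-gram usable at most as often as it occurs there."""
    machine_counts = ngram_counts(machine, n)
    reference_counts = ngram_counts(reference, n)
    total = sum(machine_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, reference_counts[gram])
                  for gram, count in machine_counts.items())
    return matched / total


def bleu4_slide(machine_text, reference_text):
    """BLEU4 as written on the slide, for a single reference."""
    machine, reference = machine_text.split(), reference_text.split()
    weights = (1.0, 0.5, 0.25, 0.125)
    precisions = [clipped_precision(machine, reference, n) for n in range(1, 5)]
    if min(precisions) == 0.0:
        return 0.0  # assumption: treat log(0) as a zero score
    log_score = sum(w * math.log(p) for w, p in zip(weights, precisions))
    brevity_penalty = max(len(reference) / len(machine) - 1, 0)
    return math.exp(log_score - brevity_penalty)


reference = "the gunman was shot to death by the police"
truncated = bleu4_slide("the gunman was shot to death", reference)
```

The truncated output has perfect n-gram precision, so its score of exp(-0.5) comes entirely from the brevity penalty (9/6 - 1 = 0.5). Clipping is what defeats “the the the the the”: only two of its five unigrams can be matched, and none of its bigrams.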

42 Multiple Reference Translations
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places.
Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden, which threatens to launch a biochemical attack on such public places as airport. Guam authority has been on alert.
Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia. They said there would be biochemistry air raid to Guam Airport and other public places. Guam needs to be in high precaution about this matter.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.

43 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence BLEU Tends to Predict Human Judgments slide from G. Doddington (NIST) (variant of BLEU)

45 BLEU in Action
枪手被警方击毙。 (Foreign Original)
the gunman was shot to death by the police. (Reference Translation)
#1 the gunman was police kill.
#2 wounded police jaya of
#3 the gunman was shot dead by the police.
#4 the gunman arrested by police kill.
#5 the gunmen were killed.
#6 the gunman was shot to death by the police.
#7 gunmen were killed by police
#8 al by the police.
#9 the ringer is killed by the police.
#10 police killed the gunman.
green = 4-gram match (good!) red = word not matched (bad!)

46 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Sample Learning Curves Swedish/English French/English German/English Finnish/English # of sentence pairs used in training BLEU score Experiments by Philipp Koehn

47 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Word-Based Statistical MT

48 Statistical MT Systems
[Diagram] Spanish/English Bilingual Text → Statistical Analysis; English Text → Statistical Analysis; Spanish → Broken English → English
Example: Que hambre tengo yo → candidates: What hunger have I, Hungry I am so, I am so hungry, Have I that hunger … → I am so hungry

49 Statistical MT Systems
[Diagram] Spanish/English Bilingual Text → Statistical Analysis → Translation Model P(s|e); English Text → Statistical Analysis → Language Model P(e); Spanish → Broken English → English
Example: Que hambre tengo yo → I am so hungry
Decoding algorithm: argmax_e P(e) * P(s|e)

50 Bayes Rule
[Diagram] Spanish → Broken English → English; Translation Model P(s|e); Language Model P(e); Decoding algorithm: argmax_e P(e) * P(s|e)
Example: Que hambre tengo yo → I am so hungry
Given a source sentence s, the decoder should consider many possible translations … and return the target string e that maximizes P(e | s).
By Bayes Rule, we can also write this as P(e) x P(s | e) / P(s) and maximize that instead.
P(s) never changes while we compare different e’s, so we can equivalently maximize P(e) x P(s | e).

51 Three Problems for Statistical MT
• Language model
  – Given an English string e, assigns P(e) by formula
  – good English string -> high P(e)
  – random word sequence -> low P(e)
• Translation model
  – Given a pair of strings <f, e>, assigns P(f | e) by formula
  – <f, e> look like translations -> high P(f | e)
  – <f, e> don’t look like translations -> low P(f | e)
• Decoding algorithm
  – Given a language model, a translation model, and a new sentence f … find translation e maximizing P(e) * P(f | e)
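To make the division of labor concrete, here is a small sketch that rescores a fixed candidate list with P(e) * P(f | e). The candidate strings come from the earlier "Que hambre tengo yo" slide; every probability below is a made-up illustrative number, not an estimate from any corpus, and `decode` is a hypothetical helper name.

```python
import math

# Made-up illustrative probabilities (assumptions, not corpus estimates).
language_model = {            # P(e): how good the English string is
    "What hunger have I": 1e-9,
    "Hungry I am so": 1e-8,
    "I am so hungry": 1e-5,
    "Have I that hunger": 1e-9,
}
translation_model = {         # P(f | e) for f = "Que hambre tengo yo"
    "What hunger have I": 0.02,
    "Hungry I am so": 0.01,
    "I am so hungry": 0.005,
    "Have I that hunger": 0.02,
}

def decode(candidates, lm, tm):
    """Return the candidate e maximizing log P(e) + log P(f | e)."""
    return max(candidates, key=lambda e: math.log(lm[e]) + math.log(tm[e]))

best = decode(list(language_model), language_model, translation_model)
print(best)
```

With these numbers the literal candidates win on the translation model alone, but the language model term overrules them in favor of the fluent string. A real decoder searches a far larger space than a fixed list.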

52 The Classic Language Model: Word N-Grams
Goal of the language model -- choose among:
• He is on the soccer field / He is in the soccer field
• Is table the on cup the / The cup is on the table
• Rice shrine / American shrine
• Rice company / American company

53 The Classic Language Model: Word N-Grams
Generative approach:
  w1 = START
  repeat until END is generated:
    produce word w2 according to a big table P(w2 | w1)
    w1 := w2
P(I saw water on the table) = P(I | START) * P(saw | I) * P(water | saw) * P(on | water) * P(the | on) * P(table | the) * P(END | table)
Probabilities can be learned from online English text.
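The "big table" P(w2 | w1) can be estimated by counting adjacent word pairs. The sketch below assumes a three-sentence toy corpus invented for illustration and uses unsmoothed relative frequencies, so any unseen bigram gets probability zero; `train_bigrams` and `sentence_probability` are hypothetical names.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Estimate P(w2 | w1) by relative frequency, with START/END markers."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["START"] + sentence.split() + ["END"]
        for w1, w2 in zip(words, words[1:]):
            pair_counts[w1][w2] += 1
    return {
        w1: {w2: c / sum(following.values()) for w2, c in following.items()}
        for w1, following in pair_counts.items()
    }

def sentence_probability(table, sentence):
    """Multiply the bigram probabilities along the sentence (0.0 if any is unseen)."""
    words = ["START"] + sentence.split() + ["END"]
    prob = 1.0
    for w1, w2 in zip(words, words[1:]):
        prob *= table.get(w1, {}).get(w2, 0.0)
    return prob

# Toy corpus (an assumption for illustration only).
corpus = [
    "I saw water on the table",
    "I saw the cup on the table",
    "the cup is on the table",
]
table = train_bigrams(corpus)
print(sentence_probability(table, "I saw water on the table"))
print(sentence_probability(table, "Is table the on cup the"))
```

In practice the table is trained on far more text and smoothed so that unseen word pairs do not zero out a whole sentence.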

54 Translation Model?
Generative approach: Mary did not slap the green witch → Source-language morphological analysis → Source parse tree → Semantic representation → Generate target structure → Maria no dió una bofetada a la bruja verde

55 Translation Model?
Generative story: Mary did not slap the green witch → Source-language morphological analysis → Source parse tree → Semantic representation → Generate target structure → Maria no dió una bofetada a la bruja verde
What are all the possible moves and their associated probability tables?

56 The Classic Translation Model: Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]
Generative approach:
  Mary did not slap the green witch
  → fertility, n(3|slap): Mary not slap slap slap the green witch
  → NULL insertion, P-Null: Mary not slap slap slap NULL the green witch
  → word translation, t(la|the): Maria no dió una bofetada a la verde bruja
  → distortion, d(j|i): Maria no dió una bofetada a la bruja verde
Probabilities can be learned from raw bilingual text.

57 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Statistical Machine Translation … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … All word alignments equally likely All P(french-word | english-word) equally likely

58 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Statistical Machine Translation … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … “la” and “the” observed to co-occur frequently, so P(la | the) is increased.

59 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Statistical Machine Translation … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … “house” co-occurs with both “la” and “maison”, but P(maison | house) can be raised without limit, to 1.0, while P(la | house) is limited because of “the” (pigeonhole principle)

60 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Statistical Machine Translation … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … settling down after another iteration

61 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Statistical Machine Translation … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … Inherent hidden structure revealed by EM training! For details, see: • “A Statistical MT Tutorial Workbook” (Knight, 1999). • “The Mathematics of Statistical Machine Translation” (Brown et al, 1993) • Software: GIZA++
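The EM iterations pictured on these slides can be reproduced on the three-sentence toy corpus. The sketch below uses the simplest word-translation-only model (IBM Model 1 style, no NULL word, no fertility or distortion), which is an assumption on my part rather than the exact model behind the slides; `train_model1` is a hypothetical name.

```python
from collections import defaultdict

# The toy parallel corpus from the slides: (French words, English words).
corpus = [
    ("la maison".split(), "the house".split()),
    ("la maison bleue".split(), "the blue house".split()),
    ("la fleur".split(), "the flower".split()),
]

def train_model1(corpus, iterations=50):
    """EM for word translation probabilities t[e][f] = P(f | e)."""
    french_vocab = {f for fs, _ in corpus for f in fs}
    # Start with all P(french-word | english-word) equally likely.
    t = defaultdict(dict)
    for fs, es in corpus:
        for e in es:
            for f in fs:
                t[e][f] = 1.0 / len(french_vocab)
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        # E-step: spread each French word over the English words in its sentence.
        for fs, es in corpus:
            for f in fs:
                norm = sum(t[e][f] for e in es)
                for e in es:
                    fraction = t[e][f] / norm
                    count[e][f] += fraction
                    total[e] += fraction
        # M-step: renormalize the expected counts for each English word.
        for e in count:
            for f in count[e]:
                t[e][f] = count[e][f] / total[e]
    return t

t = train_model1(corpus)
for e in ("the", "house", "blue", "flower"):
    print(e, {f: round(p, 3) for f, p in t[e].items()})
```

As on the slides, "la" is claimed by "the" because the two co-occur in every sentence pair, which leaves "house" free to put its probability on "maison".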

62 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Statistical Machine Translation … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … P(juste | fair) = 0.411 P(juste | correct) = 0.027 P(juste | right) = 0.020 … new French sentence Possible English translations, to be rescored by language model

63 Decoding for “Classic” Models
• Of all conceivable English word strings, find the one maximizing P(e) x P(f | e)
• Decoding is an NP-complete challenge (Knight, 1999)
• Several search strategies are available
• Each potential English output is called a hypothesis.

64 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Greedy decoding (Germann et al, ACL-2001)

66 Dynamic Programming Beam Search [Jelinek, 1969; Brown et al, 1996 US Patent; Och, Ueffing, and Ney, 2001]
[Diagram] start → 1st target word → 2nd target word → 3rd target word → 4th target word → end (all source words covered); each hypothesis keeps a best predecessor link
Each partial translation hypothesis contains:
- Last English word chosen + source words covered by it
- Next-to-last English word chosen
- Entire coverage vector (so far) of source sentence
- Language model and translation model scores (so far)
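A heavily simplified sketch of this search follows. It assumes one English word per source word, a bigram language model (so a hypothesis is keyed by its coverage set and last English word only, not the last two), and made-up toy probabilities; `beam_decode` and the tables are hypothetical names and values for illustration.

```python
import math

def beam_decode(source, t_table, lm, beam_size=5):
    """Beam search over hypotheses keyed by (covered source positions, last English word)."""
    n = len(source)
    # stacks[k] holds hypotheses that cover exactly k source words:
    # key -> (log score so far, English words so far).
    stacks = [dict() for _ in range(n + 1)]
    stacks[0][(frozenset(), "START")] = (0.0, ())
    for k in range(n):
        # Prune: expand only the best beam_size hypotheses in this stack.
        kept = sorted(stacks[k].items(), key=lambda kv: kv[1][0], reverse=True)[:beam_size]
        for (covered, last), (score, output) in kept:
            for j in range(n):
                if j in covered:
                    continue
                for e, p_f_given_e in t_table[source[j]].items():
                    new_score = score + math.log(p_f_given_e) + math.log(lm(last, e))
                    key = (covered | {j}, e)
                    # Recombination: keep only the best predecessor for each key.
                    if key not in stacks[k + 1] or new_score > stacks[k + 1][key][0]:
                        stacks[k + 1][key] = (new_score, output + (e,))
    finished = [
        (score + math.log(lm(last, "END")), output)
        for (_, last), (score, output) in stacks[n].items()
    ]
    best_score, best_output = max(finished)
    return list(best_output), best_score

# Toy tables (made-up numbers): t_table[f][e] = P(f | e), bigram[(w1, w2)] = P(w2 | w1).
t_table = {"la": {"the": 0.9}, "maison": {"house": 0.8, "home": 0.2}, "bleue": {"blue": 0.9}}
bigram = {("START", "the"): 0.5, ("the", "blue"): 0.3, ("the", "house"): 0.3,
          ("blue", "house"): 0.4, ("house", "END"): 0.5}

def lm(prev, word):
    return bigram.get((prev, word), 0.01)  # small floor for unseen bigrams

words, score = beam_decode("la maison bleue".split(), t_table, lm)
print(words)
```

The language model is what moves "blue" in front of "house" here; the translation table alone has no preference about word order.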

67 The Classic Results
• la politique de la haine. (Foreign Original)
  politics of hate. (Reference Translation)
  the policy of the hatred. (IBM4+N-grams+Stack)
• nous avons signé le protocole. (Foreign Original)
  we did sign the memorandum of agreement. (Reference Translation)
  we have signed the protocol. (IBM4+N-grams+Stack)
• où était le plan solide ? (Foreign Original)
  but where was the solid plan ? (Reference Translation)
  where was the economic base ? (IBM4+N-grams+Stack)
Sample output: the Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40.007 billion US dollars today provide data include that year to November china actually using foreign 46.959 billion US dollars and

68 Flaws of Word-Based MT
• Multiple English words for one French word
  – IBM models can do one-to-many (fertility) but not many-to-one
• Phrasal Translation
  – “real estate”, “note that”, “interest in”
• Syntactic Transformations
  – Verb at the beginning in Arabic
  – Translation model penalizes any proposed re-ordering
  – Language model not strong enough to force the verb to move to the right place

69 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Phrase-Based Statistical MT

70 Phrase-Based Statistical MT
• Foreign input segmented into phrases
  – “phrase” is any sequence of words
• Each phrase is probabilistically translated into English
  – P(to the conference | zur Konferenz)
  – P(into the meeting | zur Konferenz)
• Phrases are probabilistically re-ordered
See [Koehn et al, 2003] for an intro. This is state-of-the-art!
Example: Morgen | fliege | ich | nach Kanada | zur Konferenz → Tomorrow | I | will fly | to the conference | in Canada

71 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Advantages of Phrase-Based  Many-to-many mappings can handle non-compositional phrases  Local context is very useful for disambiguating  “Interest rate”  …  “Interest in”  …  The more data, the longer the learned phrases  Sometimes whole sentences

72 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence How to Learn the Phrase Translation Table?  One method: “alignment templates” (Och et al, 1999)  Start with word alignment, build phrases from that. Mary did not slap the green witch Maria no dió una bofetada a la bruja verde This word-to-word alignment is a by-product of training a translation model like IBM-Model-3. This is the best (or “Viterbi”) alignment.

74 IBM Models are 1-to-Many
• Run IBM-style aligner both directions, then merge:
  E → F best alignment + F → E best alignment → MERGE (Union or Intersection)
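If each directional alignment is stored as a set of links, the merge step is a plain set operation. The links below are hypothetical (english_index, foreign_index) pairs chosen for illustration, and `merge_alignments` is an assumed helper name.

```python
def merge_alignments(e_to_f, f_to_e, how="intersection"):
    """Combine two directional word alignments given as sets of (e_index, f_index) links."""
    if how == "intersection":
        return e_to_f & f_to_e   # high precision: links both directions agree on
    if how == "union":
        return e_to_f | f_to_e   # high recall: links found by either direction
    raise ValueError("how must be 'intersection' or 'union'")

# Hypothetical links, not the output of a real aligner.
e_to_f = {(0, 0), (2, 1), (3, 2), (3, 3), (3, 4)}
f_to_e = {(0, 0), (2, 1), (3, 4), (4, 6)}

print(sorted(merge_alignments(e_to_f, f_to_e, "intersection")))
print(sorted(merge_alignments(e_to_f, f_to_e, "union")))
```

The intersection keeps fewer but more reliable links, while the union recovers many-to-many structure that neither 1-to-many direction can express alone.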

75 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence How to Learn the Phrase Translation Table?  Collect all phrase pairs that are consistent with the word alignment Mary did not slap the green witch Maria no dió una bofetada a la bruja verde one example phrase pair

76 Consistent with Word Alignment
Phrase alignment must contain all alignment points for all the words in both phrases!
[Diagram] Three alignment grids over “Mary did not slap” / “Maria no dió”: one consistent phrase pair and two inconsistent ones (marked x).

81 Word Alignment Induced Phrases
Mary did not slap the green witch
Maria no dió una bofetada a la bruja verde
• (Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
• (a la, the) (dió una bofetada a, slap the)
• (Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the) (bruja verde, green witch)
• (Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) …
• (Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
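Phrase pair extraction can be sketched as a brute-force scan over all span pairs, keeping those that are consistent with the word alignment. The alignment links below are my own assumption (the slides show them only as a diagram), with "did" and "a" left unaligned, so the output need not match the slide's list exactly; `extract_phrases` is a hypothetical name.

```python
def extract_phrases(e_words, f_words, links):
    """Return all (foreign phrase, English phrase) pairs consistent with the alignment.

    A span pair is consistent if it contains at least one link and no link
    has one end inside the pair and the other end outside it.
    """
    pairs = set()
    for e1 in range(len(e_words)):
        for e2 in range(e1, len(e_words)):
            for f1 in range(len(f_words)):
                for f2 in range(f1, len(f_words)):
                    inside = [(e1 <= e <= e2, f1 <= f <= f2) for e, f in links]
                    has_link = any(e_in and f_in for e_in, f_in in inside)
                    crosses = any(e_in != f_in for e_in, f_in in inside)
                    if has_link and not crosses:
                        pairs.add((" ".join(f_words[f1:f2 + 1]),
                                   " ".join(e_words[e1:e2 + 1])))
    return pairs

english = "Mary did not slap the green witch".split()
spanish = "Maria no dió una bofetada a la bruja verde".split()
# Assumed links as (english_index, spanish_index); "did" and "a" are unaligned.
links = [(0, 0), (2, 1), (3, 2), (3, 3), (3, 4), (4, 6), (5, 8), (6, 7)]

phrases = extract_phrases(english, spanish, links)
for pair in sorted(phrases, key=lambda p: (len(p[0]), p)):
    print(pair)
```

Unaligned words are what make one English phrase map to several foreign spans, for example both "la" and "a la" for "the".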

82 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Phrase Pair Probabilities  A certain phrase pair (f-f-f, e-e-e) may appear many times across the bilingual corpus.  We hope so!  So, now we have a vast list of phrase pairs and their frequencies – how to assign probabilities?

83 Phrase Pair Probabilities
• Basic idea:
  – No EM training
  – Just relative frequency: P(f-f-f | e-e-e) = count(f-f-f, e-e-e) / count(e-e-e)
• Important refinements:
  – Smooth using word probs P(f | e) for individual words connected in the word alignment
    · Some low count phrase pairs now have high probability, others have low probability
  – Discount for ambiguity
    · If phrase e-e-e can map to 5 different French phrases, due to the ambiguity of unaligned words, each pair gets a 1/5 count
  – Count BAD events too
    · If phrase e-e-e doesn’t map onto any contiguous French phrase, increment event count(BAD, e-e-e)
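The basic relative-frequency estimate is a pair of counters. This sketch covers only the "basic idea" bullet, not the smoothing, ambiguity discount, or BAD-event refinements; the extracted pairs are hypothetical counts invented for illustration, and `phrase_probabilities` is an assumed name.

```python
from collections import Counter

def phrase_probabilities(extracted_pairs):
    """P(f | e) = count(f, e) / count(e) over a list of (foreign, english) phrase pairs."""
    pair_counts = Counter(extracted_pairs)
    english_counts = Counter(e for _, e in extracted_pairs)
    return {(f, e): c / english_counts[e] for (f, e), c in pair_counts.items()}

# Hypothetical extraction output: each tuple is one occurrence in the corpus.
extracted = (
    [("la", "the")] * 3
    + [("a la", "the")]
    + [("bruja verde", "green witch")] * 2
)

probs = phrase_probabilities(extracted)
for (f, e), p in sorted(probs.items()):
    print(f"P({f} | {e}) = {p:.2f}")
```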

84 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Advanced Training Methods

85 Basic Model, Revisited
argmax_e P(e | f)
  = argmax_e P(e) x P(f | e) / P(f)
  = argmax_e P(e) x P(f | e)

86 Basic Model, Revisited
argmax_e P(e | f)
  = argmax_e P(e) x P(f | e) / P(f)
  = argmax_e P(e) x P(f | e)
argmax_e P(e)^2.4 x P(f | e) … works better!

87 Basic Model, Revisited
argmax_e P(e | f)
  = argmax_e P(e) x P(f | e) / P(f)
argmax_e P(e)^2.4 x P(f | e) x length(e)^1.1
Rewards longer hypotheses, since these are unfairly punished by P(e)

88 Basic Model, Revisited
argmax_e P(e)^2.4 x P(f | e) x length(e)^1.1 x KS^3.7 …
Lots of knowledge sources vote on any given hypothesis.
“Knowledge source” = “feature function” = “score component”.
Feature function simply scores a hypothesis with a real value. (May be binary, as in “e has a verb”).
Problem: How to set the exponent weights?

89 Maximum BLEU Training (Och, 2003)
[Diagram] Farsi → Translation System (Automatic, Trainable) → English MT Output → Translation Quality Evaluator (Automatic), compared against English Reference Translations (sample “right answers”) → BLEU score → Learning Algorithm for Directly Reducing Translation Error
Translation system components: Language Model #1, Translation Model, Language Model #2, Length Model, Other Features
Yields big improvements in quality.

90 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence MT Pyramid SOURCETARGET words syntax semantics interlingua phrases

91 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Why Syntax?  Need much more grammatical output  Need accurate control over re-ordering  Need accurate insertion of function words  Word translations need to depend on grammatically-related words

92 Yamada/Knight 01: Modeling and Training
[Diagram] Parse Tree(E) for “he adores listening to music” → Reorder → Insert → Translate → Take Leaves → Sentence(J): Kare ha ongaku wo kiku no ga daisuki desu

93 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Japanese/English Reorder Table For French/English, useful parameters like P(N ADJ | ADJ N).

94 Decoded Tree
[Diagram] Two decoded parse trees:
• Decoding with Trigram LM: he briefed reporters statement major contents
• Decoding with Charniak Tree-Based LM: he briefed reporters on main contents of the stmt

95 Casting Syntax MT Models As Tree Transducer Automata [Graehl & Knight 04]
[Diagram] Four tree transducer rule examples:
• Non-local Re-Ordering (English/Arabic)
• Non-constituent Phrasal Translation (English/Spanish), e.g. “there are two men” / “hay dos hombres”
• Lexicalized Re-Ordering (English/Chinese)
• Long-distance Re-Ordering (English/Japanese)

96 Summary
• Phrase-based models are state-of-the-art
  – Word alignments
  – Phrase pair extraction & probabilities
  – N-gram language models
  – Beam search decoding
  – Feature functions & learning weights
• But the output is not English
  – Fluency must be improved
  – Better translation of person names, organizations, locations
  – More automatic acquisition of parallel data, exploitation of monolingual data across a variety of domains/languages
  – Need good accuracy across a variety of domains/languages

97 Available Resources
• Bilingual corpora
  – 100m+ words of Chinese/English and Arabic/English, LDC (www.ldc.upenn.edu)
  – Lots of French/English, Spanish/French/English, LDC
  – European Parliament (sentence-aligned), 11 languages, Philipp Koehn, ISI (www.isi.edu/~koehn/publications/europarl)
  – 20m words (sentence-aligned) of English/French, Ulrich Germann, ISI (www.isi.edu/natural-language/download/hansard/)
• Sentence alignment
  – Dan Melamed, NYU (www.cs.nyu.edu/~melamed/GMA/docs/README.htm)
  – Xiaoyi Ma, LDC (Champollion)
• Word alignment
  – GIZA, JHU Workshop ’99 (www.clsp.jhu.edu/ws99/projects/mt/)
  – GIZA++, RWTH Aachen (www-i6.Informatik.RWTH-Aachen.de/web/Software/GIZA++.html)
  – Manually word-aligned test corpus (500 French/English sentence pairs), RWTH Aachen
  – Shared task, NAACL-HLT’03 workshop
• Decoding
  – ISI ReWrite Model 4 decoder (www.isi.edu/licensed-sw/rewrite-decoder/)
  – ISI Pharaoh phrase-based decoder
• Statistical MT Tutorial Workbook, ISI (www.isi.edu/~knight/)
• Annual common-data evaluation, NIST (www.nist.gov/speech/tests/mt/index.htm)

98 Some Papers Referenced on Slides
• ACL
  – [Och, Tillmann, & Ney, 1999]
  – [Och & Ney, 2000]
  – [Germann et al, 2001]
  – [Yamada & Knight, 2001, 2002]
  – [Papineni et al, 2002]
  – [Alshawi et al, 1998]
  – [Collins, 1997]
  – [Koehn & Knight, 2003]
  – [Al-Onaizan & Knight, 2002]
  – [Och & Ney, 2002]
  – [Och, 2003]
  – [Koehn et al, 2003]
• EMNLP
  – [Marcu & Wong, 2002]
  – [Fox, 2002]
  – [Munteanu & Marcu, 2002]
• AI Magazine
  – [Knight, 1997]
• www.isi.edu/~knight
  – [MT Tutorial Workbook]
• AMTA
  – [Soricut et al, 2002]
  – [Al-Onaizan & Knight, 1998]
• EACL
  – [Cmejrek et al, 2003]
• Computational Linguistics
  – [Brown et al, 1993]
  – [Knight, 1999]
  – [Wu, 1997]
• AAAI
  – [Koehn & Knight, 2000]
• IWNLG
  – [Habash, 2002]
• MT Summit
  – [Charniak, Knight, Yamada, 2003]
• NAACL
  – [Koehn, Marcu, Och, 2003]
  – [Germann, 2003]
  – [Graehl & Knight, 2004]
  – [Galley, Hopkins, Knight, Marcu, 2004]

99 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Terminology  Simple Bayes, aka Naïve Bayes  Zero counts: case where an attribute value never occurs with a label in D  No match approach: assign an   c/m probability to P(x ik | v j )  m-estimate aka Laplace approach: assign a Bayesian estimate to P(x ik | v j )  Learning in Natural Language Processing (NLP)  Training data: text corpora (collections of representative documents)  Statistical Queries (SQ) oracle: answers queries about P(x ik, v j ) for x ~ D  Linear Statistical Queries (LSQ) algorithm: classification using f(oracle response) •Includes: Naïve Bayes, BOC •Other examples: Hidden Markov Models (HMMs), maximum entropy  Problems: word sense disambiguation, part-of-speech tagging  Applications •Spelling correction, conversational agents •Information retrieval: web and digital library searches

100 Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Summary Points  More on Simple Bayes, aka Naïve Bayes  More examples  Classification: choosing between two classes; general case  Robust estimation of probabilities: SQ  Learning in Natural Language Processing (NLP)  Learning over text: problem definitions  Statistical Queries (SQ) / Linear Statistical Queries (LSQ) framework •Oracle •Algorithms: search for h using only (L)SQs  Bayesian approaches to NLP •Issues: word sense disambiguation, part-of-speech tagging •Applications: spelling; reading/posting news; web search, IR, digital libraries  Next Week: Section 6.11, Mitchell; Pearl and Verma  Read: Charniak tutorial, “Bayesian Networks without Tears”  Skim: Chapter 15, Russell and Norvig; Heckerman slides

