1 Learning Language from its Perceptual Context Ray Mooney Department of Computer Sciences University of Texas at Austin Joint work with David Chen Joohyun Kim Rohit Kate

2 Current State of Natural Language Learning
Most current state-of-the-art NLP systems are constructed by training on large supervised corpora:
–Syntactic Parsing: Penn Treebank
–Word Sense Disambiguation: SenseEval
–Semantic Role Labeling: PropBank
–Machine Translation: Hansards corpus
Constructing such annotated corpora is difficult, expensive, and time-consuming.

3 Semantic Parsing
A semantic parser maps a natural-language (NL) sentence to a complete, detailed formal semantic representation: a logical form or meaning representation (MR). For many applications, the desired output is a computer language that is immediately executable by another program.

4 CLang: RoboCup Coach Language
In the RoboCup Coach competition, teams compete to coach simulated soccer players. The coaching instructions are given in a formal language called CLang. [Figure: a coach instructing players on the simulated soccer field]
NL: If the ball is in our penalty area, then all our players except player 4 should stay in our half.
→ Semantic Parsing →
CLang: ((bpos (penalty-area our)) (do (player-except our{4}) (pos (half our))))
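For concreteness, here is a hedged sketch, not part of CLang or of any system in this talk, showing how such an MR can be read into a nested data structure:

```python
# Minimal reader that turns a CLang-style s-expression into nested Python
# lists, just to show how structured the MR is. This is a toy simplification;
# real CLang has a richer grammar (the {4} set notation is kept as raw tokens).

def parse_sexpr(text):
    for ch in "(){}":
        text = text.replace(ch, f" {ch} ")
    tokens = text.split()

    def read(pos):
        if tokens[pos] == "(":
            node, pos = [], pos + 1
            while tokens[pos] != ")":
                child, pos = read(pos)
                node.append(child)
            return node, pos + 1          # skip the closing ")"
        return tokens[pos], pos + 1       # atom

    tree, _ = read(0)
    return tree

mr = "((bpos (penalty-area our)) (do (player-except our{4}) (pos (half our))))"
print(parse_sexpr(mr))
# [['bpos', ['penalty-area', 'our']],
#  ['do', ['player-except', 'our', '{', '4', '}'], ['pos', ['half', 'our']]]]
```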

5 Learning Semantic Parsers
Manually programming robust semantic parsers is difficult due to the complexity of the task. Semantic parsers can instead be learned automatically from sentences paired with their logical forms.
[Diagram: NL→MR training examples feed a semantic-parser learner, which produces a semantic parser mapping natural language to meaning representations]
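The diagram describes an interface more than an algorithm. The sketch below is a deliberately naive stand-in that satisfies the same train/parse contract; the class and method names are invented, and the real learners on the next slide generalize compositionally rather than memorizing.

```python
# A naive stand-in for a learned semantic parser: memorize the training
# pairs and return the MR of the lexically closest training sentence.
# Real learners (WASP, KRISP, ...) generalize compositionally; this does not.

class NearestSentenceParser:
    def train(self, pairs):                      # pairs: [(sentence, mr), ...]
        self.pairs = [(set(s.lower().split()), mr) for s, mr in pairs]

    def parse(self, sentence):
        words = set(sentence.lower().split())
        overlap = lambda item: len(words & item[0])
        return max(self.pairs, key=overlap)[1]   # MR of best-matching sentence

parser = NearestSentenceParser()
parser.train([("the dog threw the ball", "threw(dog, ball)"),
              ("the dog broke the box", "broke(dog, box)")])
print(parser.parse("the dog threw a ball"))      # -> threw(dog, ball)
```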

6 Our Semantic-Parser Learners
CHILL+WOLFIE (Zelle & Mooney, 1996; Thompson & Mooney, 1999) –Separates parser-learning and semantic-lexicon learning. –Learns a deterministic parser using ILP techniques.
COCKTAIL (Tang & Mooney, 2001) –Improved ILP algorithm for CHILL.
SILT (Kate, Wong & Mooney, 2005) –Learns symbolic transformation rules for mapping directly from NL to MR.
SCISSOR (Ge & Mooney, 2005) –Integrates semantic interpretation into Collins' statistical syntactic parser.
WASP (Wong & Mooney, 2006; 2007) –Uses syntax-based statistical machine translation methods.
KRISP (Kate & Mooney, 2006) –Uses a series of SVM classifiers employing a string kernel to iteratively build semantic representations.
SynSem (Ge & Mooney, 2009) –Uses an existing statistical syntactic parser & word alignment.

7 Learning Language from Perceptual Context Children do not learn language from annotated corpora. Neither do they learn language from just reading the newspaper, surfing the web, or listening to the radio. –Unsupervised language learning –DARPA Learning by Reading Program The natural way to learn language is to perceive language in the context of its use in the physical and social world. This requires inferring the meaning of utterances from their perceptual context.

8 Language Grounding
The meanings of many words are grounded in our perception of the physical world: red, ball, cup, run, hit, fall, etc. –Symbol Grounding: Harnad (1990)
Even many abstract words and meanings are metaphorical abstractions of terms grounded in the physical world: up, down, over, in, etc. –Lakoff and Johnson's Metaphors We Live By: "It's difficult to put my ideas into words."
Most NLP work represents meaning without any connection to perception, circularly defining the meanings of words in terms of other words or meaningless symbols with no firm foundation.

9 Sample Circular Definitions from WordNet
sleep (v) –"be asleep"
asleep (adj) –"in a state of sleep"

10 ??? “Mary is on the phone”

11 “Mary is on the phone” ???

12 “Mary is on the phone” ???

13 Ironing(Mommy, Shirt) “Mary is on the phone” ???

14 Ironing(Mommy, Shirt) Working(Sister, Computer) “Mary is on the phone” ???

15 Ironing(Mommy, Shirt) Working(Sister, Computer) Carrying(Daddy, Bag) “Mary is on the phone” ???

16 Ironing(Mommy, Shirt) Working(Sister, Computer) Carrying(Daddy, Bag) Talking(Mary, Phone) Sitting(Mary, Chair) “Mary is on the phone” Ambiguous Training Example ???

17 Ironing(Mommy, Shirt) Working(Sister, Computer) Talking(Mary, Phone) Sitting(Mary, Chair) “Mommy is ironing a shirt” Next Ambiguous Training Example ???

Ambiguous Supervision for Learning Semantic Parsers
Our model of ambiguous supervision corresponds to the type of data that will be gathered from a temporal sequence of perceptual contexts with occasional language commentary. We assume each sentence has exactly one meaning in its perceptual context. –Recently extended to handle sentences with no meaning in their perceptual context. Each meaning is associated with at most one sentence.

19 Sample Ambiguous Corpus
Sentences:
Daisy gave the clock to the mouse.
Mommy saw that Mary gave the hammer to the dog.
The dog broke the box.
John gave the bag to the mouse.
The dog threw the ball.
Candidate meanings:
ate(mouse, orange), gave(daisy, clock, mouse), ate(dog, apple), saw(mother, gave(mary, dog, hammer)), broke(dog, box), gave(woman, toy, mouse), gave(john, bag, mouse), threw(dog, ball), runs(dog), saw(john, walks(man, dog))
Forms a bipartite graph
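A hedged sketch of the data structure this corpus gives rise to: each sentence is linked to every MR that occurred in its perceptual context, forming the bipartite graph the slide mentions. The candidate sets below are illustrative, not the exact contexts from the slide.

```python
# Ambiguous supervision as a bipartite graph: each sentence maps to the set
# of candidate MRs that occurred in its perceptual context. The candidate
# sets here are illustrative.

ambiguous_corpus = {
    "Daisy gave the clock to the mouse.":
        {"ate(mouse, orange)", "gave(daisy, clock, mouse)", "ate(dog, apple)"},
    "Mommy saw that Mary gave the hammer to the dog.":
        {"ate(dog, apple)", "saw(mother, gave(mary, dog, hammer))"},
    "The dog broke the box.":
        {"broke(dog, box)", "gave(woman, toy, mouse)"},
    "John gave the bag to the mouse.":
        {"gave(woman, toy, mouse)", "gave(john, bag, mouse)"},
    "The dog threw the ball.":
        {"threw(dog, ball)", "runs(dog)", "saw(john, walks(man, dog))"},
}

# Edges of the bipartite graph:
for sentence, candidates in ambiguous_corpus.items():
    for mr in candidates:
        print(sentence, "<->", mr)
```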

KRISPER (Kate & Mooney, 2007): KRISP with EM-like Retraining
Extension of KRISP that learns from ambiguous supervision. Uses an iterative EM-like self-training method to gradually converge on a correct meaning for each sentence.

21 KRISPER's Training Algorithm
Step 1: Assume every possible meaning for a sentence is correct.
[Figure: the sample ambiguous corpus drawn as a bipartite graph, each sentence connected to every candidate MR in its context]

23 KRISPER's Training Algorithm
Step 2: The resulting weighted NL-MR pairs are given to KRISP (the edge weights 1/2, 1/4, 1/5, 1/3 in the figure reflect how many candidate meanings each sentence has).

24 KRISPER's Training Algorithm
Step 3: Estimate the confidence of each NL-MR pair using the resulting trained parser.

25 KRISPER's Training Algorithm
Step 4: Use maximum-weight matching on the bipartite graph to find the best NL-MR pairs [Munkres, 1957].
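A hedged sketch of step 4 in isolation: given confidence scores on the sentence/MR edges, the best one-to-one matching can be computed with the Hungarian algorithm (Munkres, 1957); scipy's linear_sum_assignment is one standard implementation. The scores below are invented.

```python
# Maximum-weight bipartite matching over sentence/MR confidence scores
# via the Hungarian algorithm (Munkres, 1957). Scores are invented.
import numpy as np
from scipy.optimize import linear_sum_assignment

sentences = ["The dog broke the box.", "The dog threw the ball."]
mrs = ["broke(dog, box)", "threw(dog, ball)", "runs(dog)"]

# confidence[i, j] = parser's confidence that sentence i means MR j
confidence = np.array([[0.9, 0.1, 0.2],
                       [0.2, 0.8, 0.4]])

rows, cols = linear_sum_assignment(-confidence)   # negate to maximize
for i, j in zip(rows, cols):
    print(sentences[i], "->", mrs[j])
# The dog broke the box. -> broke(dog, box)
# The dog threw the ball. -> threw(dog, ball)
```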

27 KRISPER's Training Algorithm
Step 5: Give the best pairs to KRISP in the next iteration, and repeat until convergence.
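Putting the five steps together, a minimal runnable sketch of the EM-like loop, reusing ambiguous_corpus from the earlier sketch. train_parser and confidence are toy stand-ins (KRISP actually trains string-kernel SVM classifiers, so in the real system the confidence scores change as the parser is retrained), and the greedy matching approximates the Hungarian matching of step 4.

```python
# Runnable schematic of KRISPER's EM-like retraining loop. The helpers are
# toy stand-ins for KRISP's actual training and scoring routines.

def train_parser(weighted_pairs):
    return weighted_pairs          # toy "parser": just remembers its data

def confidence(parser, sentence, mr):
    # Toy score: overlap between sentence words and MR symbols. In KRISPER
    # this comes from the retrained KRISP parser, so it improves each round.
    words = set(sentence.lower().strip(".").split())
    syms = set(mr.replace("(", " ").replace(")", " ").replace(",", " ").split())
    return len(words & syms)

def best_matching(scored):
    # Greedy max-weight matching; a stand-in for Munkres (see step 4 above).
    used_s, used_m, chosen = set(), set(), []
    for s, mr, w in sorted(scored, key=lambda x: -x[2]):
        if s not in used_s and mr not in used_m:
            chosen.append((s, mr, w))
            used_s.add(s); used_m.add(mr)
    return chosen

def krisper_style_training(corpus, n_iters=5):
    # Step 1: every candidate meaning is initially assumed correct,
    # weighted uniformly (1 / number of candidates for that sentence).
    pairs = [(s, mr, 1.0 / len(c)) for s, c in corpus.items() for mr in c]
    parser = train_parser(pairs)                         # Step 2
    for _ in range(n_iters):
        scored = [(s, mr, confidence(parser, s, mr))     # Step 3
                  for s, c in corpus.items() for mr in c]
        parser = train_parser(best_matching(scored))     # Steps 4-5
    return parser

for s, mr, w in krisper_style_training(ambiguous_corpus):
    print(s, "->", mr)
```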

28 New Challenge: Learning to Be a Sportscaster Goal: Learn from realistic data of natural language used in a representative context while avoiding difficult issues in computer perception (i.e. speech and vision). Solution: Learn from textually annotated traces of activity in a simulated environment. Example: Traces of games in the Robocup simulator paired with textual sportscaster commentary.

29 Tactical Generation
Learn how to generate NL from MR. Example: pass(Pink2, Pink3) → "Pink2 kicks the ball to Pink3"

30 WASP / WASP⁻¹ (Wong & Mooney, 2006, 2007)
Supervised system for learning both a semantic parser and a tactical language generator. Uses a probabilistic version of a synchronous context-free grammar (SCFG) that generates two corresponding strings (NL & MR) simultaneously.
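A hedged sketch of the core SCFG idea: a rule rewrites a nonterminal into an NL string and an MR string at the same time, so one derivation yields both. The toy grammar below is invented; WASP learns many competing rules with probabilities and searches for the best derivation, while this sketch always takes the first rule.

```python
# Toy synchronous CFG: each rule pairs an NL template with an MR template
# over the same nonterminals. Grammar and derivation strategy are toys;
# WASP attaches probabilities to rules and searches for the best derivation.

rules = {
    "EVENT":   [("PLAYER1 kicks the ball to PLAYER2", "pass(PLAYER1, PLAYER2)")],
    "PLAYER1": [("Pink2", "Pink2")],
    "PLAYER2": [("Pink3", "Pink3")],
}

def derive(symbol):
    nl, mr = rules[symbol][0]                 # toy: always take the first rule
    for nt in rules:                          # expand nonterminals in lockstep
        if nt in nl:
            sub_nl, sub_mr = derive(nt)
            nl, mr = nl.replace(nt, sub_nl), mr.replace(nt, sub_mr)
    return nl, mr

print(derive("EVENT"))
# ('Pink2 kicks the ball to Pink3', 'pass(Pink2, Pink3)')
```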

31 Grounded Language Learning in RoboCup
[Architecture diagram: the RoboCup simulator feeds simulated perception (perceived facts) and the human sportscaster's commentary ("Score!!!!") to the grounded language learner; the learner induces an SCFG shared by a semantic parser and a language generator, which can produce commentary ("Score!!!!") itself]

Sample Human Sportscast in Korean 32

33 RoboCup Sportscaster Trace
Natural Language Commentary:
Purple goalie turns the ball over to Pink8
Pink11 looks around for a teammate
Pink8 passes the ball to Pink11
Purple team is very sloppy today
Pink11 makes a long pass to Pink8
Pink8 passes back to Pink11
Meaning Representation (extracted events):
badPass(Purple1, Pink8), turnover(Purple1, Pink8), pass(Pink11, Pink8), pass(Pink8, Pink11), ballstopped, pass(Pink8, Pink11), kick(Pink11), kick(Pink8), kick(Pink11), kick(Pink8)

36 RoboCup Sportscaster Trace (as the learner sees it)
Natural Language Commentary: the same six sentences as above
Meaning Representation: P6(C1, C19), P5(C1, C19), P2(C22, C19), P2(C19, C22), P0, P2(C19, C22), P1(C22), P1(C19), P1(C22), P1(C19)
(The predicates and constants are opaque symbols to the learner; e.g. the badPass and turnover events above appear only as P6 and P5.)

37 Strategic Generation
Generation requires not only knowing how to say something (tactical generation) but also what to say (strategic generation). For automated sportscasting, one must be able to effectively choose which events to describe.

38 Example of Strategic Generation
pass(purple7, purple6)
ballstopped
kick(purple6)
pass(purple6, purple2)
ballstopped
kick(purple2)
pass(purple2, purple3)
kick(purple3)
badPass(purple3, pink9)
turnover(purple3, pink9)

40 RoboCup Data
Collected human textual commentary for the 4 RoboCup championship games from 2001 to 2004.
–Avg # events/game = 2,613
–Avg # English sentences/game = 509
–Avg # Korean sentences/game = 499
Each sentence matched to all events within the previous 5 seconds.
–Avg # MRs/sentence = 2.5 (min 1, max 12)
Manually annotated with correct matchings of sentences to MRs (for evaluation purposes only).
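A hedged sketch of how the ambiguous training pairs were constructed: each comment is paired with every extracted event in the preceding five seconds. The timestamps and events below are invented.

```python
# Pair each commentary sentence with all events in the previous 5 seconds.
# Timestamps and events are invented for illustration.

events = [   # (time in seconds, meaning representation)
    (12.0, "kick(Pink8)"),
    (12.4, "pass(Pink8, Pink11)"),
    (16.0, "ballstopped"),
    (19.0, "kick(Pink11)"),
]
comments = [ # (time in seconds, sentence)
    (14.0, "Pink8 passes the ball to Pink11"),
    (20.5, "Pink11 looks around for a teammate"),
]

WINDOW = 5.0
ambiguous_pairs = {
    sentence: [mr for (te, mr) in events if t - WINDOW <= te <= t]
    for (t, sentence) in comments
}
print(ambiguous_pairs)
# {'Pink8 passes the ball to Pink11': ['kick(Pink8)', 'pass(Pink8, Pink11)'],
#  'Pink11 looks around for a teammate': ['ballstopped', 'kick(Pink11)']}
```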

41 WASPER
WASP with EM-like retraining to handle ambiguous training data; the same augmentation that was added to KRISP to create KRISPER.

42 KRISPER-WASP
First train KRISPER to disambiguate the data. Then train WASP on the resulting unambiguously supervised data.

43 WASPER-GEN
Determines the best matching based on generation (MR→NL). Score each potential NL/MR pair using the currently trained WASP⁻¹ generator: compute the NIST MT score [NIST report, 2002] between the generated sentence and the potential matching sentence.
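A hedged sketch of this matching criterion: generate a sentence from each candidate MR and score it against the actual comment with an MT metric. For brevity the sketch uses NLTK's sentence-level BLEU in place of the NIST metric named above, and generate() is a toy placeholder for the trained WASP⁻¹ generator.

```python
# Score candidate NL/MR pairs by generating from the MR and comparing the
# result to the actual comment. Sentence-level BLEU stands in for NIST;
# generate() is a toy placeholder for WASP^-1.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def generate(mr):
    # Toy generator; WASPER-GEN uses the currently trained WASP^-1.
    templates = {"pass(Pink8, Pink11)": "Pink8 passes the ball to Pink11",
                 "kick(Pink8)": "Pink8 kicks the ball"}
    return templates[mr]

comment = "Pink8 quickly passes the ball to Pink11".split()
smooth = SmoothingFunction().method1
for mr in ["pass(Pink8, Pink11)", "kick(Pink8)"]:
    hyp = generate(mr).split()
    score = sentence_bleu([comment], hyp, smoothing_function=smooth)
    print(mr, round(score, 3))
# The MR whose generated sentence best matches the comment wins the match.
```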

44 Strategic Generation Learning
For each event type (e.g. pass, kick), estimate the probability that it is described by the sportscaster. This requires a correct NL/MR matching: –Use the estimated matching from tactical generation –Iterative Generation Strategy Learning (see the sketch after the next slide)

45 Iterative Generation Strategy Learning (IGSL)
Estimates the likelihood of commenting on each event type directly from the ambiguous training data, using EM-like self-training iterations to compute the estimates.
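A hedged sketch of the quantity being estimated: given a (possibly estimated) matching of sentences to events, the fraction of each event type's occurrences that drew a comment. The events, the matching, and the helper names are invented, and this is a simplification of IGSL's actual EM-style updates.

```python
# Estimate P(event type is commented on) from a sentence-event matching;
# a simplification of IGSL's EM-style updates. Data is invented.
from collections import Counter

def event_type(mr):
    return mr.split("(")[0]          # e.g. "pass(Pink8, Pink11)" -> "pass"

all_events = ["pass(a,b)", "pass(b,c)", "pass(c,d)", "kick(a)", "kick(b)",
              "ballstopped", "turnover(a,b)"]
commented = ["pass(a,b)", "pass(c,d)", "turnover(a,b)"]  # matched to sentences

total = Counter(event_type(e) for e in all_events)
mentioned = Counter(event_type(e) for e in commented)
probs = {t: mentioned[t] / total[t] for t in total}
print(probs)
# e.g. {'pass': 0.67, 'kick': 0.0, 'ballstopped': 0.0, 'turnover': 1.0}
# (values rounded)
```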

English Demo
Game clip commentated using WASPER-GEN with IGSL strategic generation, since this gave the best results for generation. FreeTTS was used to synthesize speech from the textual output.

Machine Sportscast in English 47

Experimental Evaluation
Generated learning curves by training on all combinations of 1 to 3 games and testing on all games not used for training.
Baselines:
–Random Matching: WASP trained on a random choice of possible MRs for each comment.
–Gold Matching: WASP trained on the correct matching of MRs for each comment.
Metrics:
–Precision: % of system's annotations that are correct
–Recall: % of gold-standard annotations correctly produced
–F-measure: harmonic mean of precision and recall
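For concreteness, a minimal sketch of the three metrics computed over sets of (sentence, MR) annotations; the predicted and gold sets are invented.

```python
# Precision/recall/F-measure over predicted vs. gold (sentence, MR) pairs.
predicted = {("s1", "pass(a,b)"), ("s2", "kick(a)"), ("s3", "turnover(a,b)")}
gold      = {("s1", "pass(a,b)"), ("s2", "pass(b,c)"), ("s3", "turnover(a,b)")}

correct = len(predicted & gold)
precision = correct / len(predicted)   # fraction of system annotations right
recall = correct / len(gold)           # fraction of gold annotations produced
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)           # 0.667 0.667 0.667 (rounded)
```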

49 Evaluating NL-MR Matching
How well does the learner figure out which event (if any) each sentence refers to?
Natural Language Commentary: Purple goalie turns the ball over to Pink8 / Pink8 passes the ball to Pink11 / Purple team is very sloppy today
Meaning Representation: badPass(Purple1, Pink8), turnover(Purple1, Pink8), pass(Pink8, Pink11), kick(Pink8), kick(Pink11)

Matching Results (F-Measure) 50

51 Evaluating Semantic Parsing
How well does the system learn to interpret the meaning of a novel sentence? Compare the result to the correct MR from the gold-standard matches.
Natural Language Commentary: Purple goalie looses the ball to Pink8
Meaning Representation: turnover(Purple1, Pink8)

Semantic Parsing Results (F-Measure) 52

53 Evaluating Tactical Generation
How accurately does the system generate natural-language descriptions of events? Use gold-standard matches to determine the correct sentence for each MR that has one.
Evaluation Metric: –BLEU score [Papineni et al., 2002], N=4
Natural Language Commentary: Purple goalie looses the ball to Pink8
Meaning Representation: turnover(Purple1, Pink8)

Tactical Generation Results (BLEU Score) 54

55 Evaluating Strategic Generation
How well does the system predict which events the human sportscaster will mention?
pass(purple7, purple6), ballstopped, kick(purple6), pass(purple6, purple2), ballstopped, kick(purple2), pass(purple2, purple3), kick(purple3), badPass(purple3, pink9), turnover(purple3, pink9)

Strategic Generation Results 56

Used Amazon’s Mechanical Turk to recruit human judges (36 English, 7 Korean judges per video) 8 commented game clips – 4 minute clips randomly selected from each of the 4 games – Each clip commented once by a human, and once by the machine Presented in random counter-balanced order Judges were not told which ones were human or machine generated 57 Human Evaluation “Pseudo Turing Test”

58 Human Evaluation Metrics
Score | English Fluency | Semantic Correctness | Sportscasting Ability
5     | Flawless        | Always               | Excellent
4     | Good            | Usually              | Good
3     | Non-native      | Sometimes            | Average
2     | Disfluent       | Rarely               | Bad
1     | Gibberish       | Never                | Terrible
Human? –Judges were also asked to predict whether a human or the machine generated each sportscast, knowing there was some of each in the data.

English Human Evaluation Results 59

Korean Human Evaluation Results 60

61 Future Direction #1
Grounded language learning for direction following in virtual environments. Eventual goal: virtual agents in video games and educational software that can take and give instructions in natural language.

Challenge on Generating Instructions in Virtual Environments (GIVE) 62

63 Learning Approach for Grounded Instructional Language Learning
Passive learning –Observes a human instructor guiding a human follower
Interactive learning as follower –Tries to follow human instructions
Interactive learning as instructor –Generates instructions to guide a human follower

Future Direction #2: Learning for Language and Vision
Natural Language Processing (NLP) and Computer Vision (CV) are both very challenging problems. Machine Learning (ML) is now extensively used to automate the construction of effective NLP and CV systems. This generally uses supervised ML, requiring difficult and expensive human annotation of large text or image/video corpora for training.

Cross-Supervision of Language and Vision
Use naturally co-occurring perceptual input to supervise language learning. Use naturally co-occurring linguistic input to supervise visual learning.
[Diagram: the sentence "Blue cylinder on top of a red cube." and the corresponding image each serve as input to one learner and as supervision for the other (Language Learner / Vision Learner)]

66 Activity Recognition in Video
Recognizing activities in video generally uses supervised learning trained on human-labeled video clips. Linguistic information in closed captions (CCs) can be used as "weak supervision" for training activity recognizers. Automatically trained activity recognizers can then be used to improve the precision of video retrieval.

Sample Soccer Videos
Kick: "I do not think there is any real intent, just trying to make sure he gets his body across, but it was a free kick." / "Lovely kick." / "Goal kick."
Save: "Good save as well." / "I think brown made a wonderful fingertip save there." / "And it is a really chopped save."

Throw: "If you are defending a lead, your throw back takes it that far up the pitch and gets a throw-in." / "And Carlos Tevez has won the throw." / "Another shot for a throw."
Touch: "When they are going to pass it in the back, it is a really pure touch." / "Look at that, Henry, again, he had time on the ball to take another touch and prepare that ball properly." / "All it needed was a touch."

69 Conclusions
Current language-learning work uses expensive, unrealistic training data. We have developed language-learning systems that can learn from sentences paired with an ambiguous perceptual environment. We have evaluated them on learning to sportscast simulated RoboCup games, where they learn to commentate games about as well as humans. Learning to connect language and perception is an important and exciting research problem.