Towards Understanding End-of-trip Instructions in a Taxi Ride Scenario

Towards Understanding End-of-trip Instructions in a Taxi Ride Scenario
Deepthi Karkada*1, Ramesh Manuvinakurike*2, Kallirroi Georgila2 (* equal contributions; 1 Intel Corp.; 2 USC Institute for Creative Technologies)
Ongoing project. Feedback and ideas welcome.

Motivation
End-of-trip scenario in a taxi ride (e.g., Uber, Lyft, a taxi cab). Before the final 'bye', the last few exchanges in a taxi ride are usually the rider telling the driver where they prefer to be dropped off.
Driver: Where would you like me to stop?
Passenger: Could you please drop me off in front of the white car?

Motivation
Understanding [and responding to] such instructions requires complex vision and language understanding capabilities.
Vision component: object labeling, position identification, etc.
Language component: dialogue-act classification (e.g., Request); parameter identification (e.g., Action, Referent, etc.).
Combining information from the vision and language modalities: referent identification; target-location identification.
[Figure: street scene with labeled objects (white van, blue car, street sign, black car, white car, red car)]
Driver: Where would you like me to stop?
Passenger: Could you please [ACTION: drop me off] [DD: in front of] [REF: the white car]?
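As a concrete illustration of the language component, dialogue-act classification can be sketched as a simple bag-of-words baseline. This is a minimal illustrative sketch, not the authors' model; the training utterances and label set below are hypothetical.

```python
# Illustrative dialogue-act classifier sketch (not the authors' model):
# TF-IDF features + logistic regression over toy, hypothetical data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = [
    "could you please drop me off in front of the white car",  # Request
    "park behind the blue pickup truck",                       # Request
    "where would you like me to stop",                         # Question
    "thanks bye",                                              # Closing
]
labels = ["Request", "Request", "Question", "Closing"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(utterances, labels)
print(clf.predict(["please stop next to the fire hydrant"]))  # e.g. ['Request']
```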

End-of-trip in a taxi ride (outline)
- Images data collection: synthetic images; real-world images.
- Instructions data collection.
- Annotations.
- Models: referent identification.
- Conclusions & future work: destination identification; real-world images dataset.
In this work, we are interested only in single-utterance user instructions, e.g., "Stop right in front of the cop car", "Please park behind the street sign".

Images construction (synthetic)
Synthetic images: constructed from vehicle and object templates, rendered from a bird's-eye view. Objects (street lamp, tree, fire hydrant, safety cone) are placed on the sidewalk. Vehicles (limo, generic car, van, taxi, cop car) are parked on either side of the street in a left-hand-drive scenario. The target location is placed on the street randomly.
Pros: abstracts away the vision problem; easy to construct large sets of images with known configurations of objects.
Cons: not real; doesn't capture the dynamics of a real-world scenario.
[Figure: example synthetic image with target location.]
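A sketch of how such a synthetic scene could be generated, assuming simple template lists and a slot-based street layout. The layout scheme, probabilities, and names below are our assumptions for illustration, not the authors' actual generator.

```python
import random

# Hypothetical synthetic-scene generator sketch: vehicles parked on either
# side of the street, objects on the sidewalks, and a random target location.
VEHICLES = ["limo", "generic car", "van", "taxi", "cop car"]
OBJECTS = ["street lamp", "tree", "fire hydrant", "safety cone"]

def generate_scene(num_slots=5, seed=None):
    rng = random.Random(seed)
    scene = {"vehicles": [], "objects": [], "target": None}
    for side in ("left", "right"):
        for slot in range(num_slots):
            if rng.random() < 0.7:  # leave some parking slots empty
                scene["vehicles"].append(
                    {"type": rng.choice(VEHICLES), "side": side, "slot": slot})
            if rng.random() < 0.3:  # sparse sidewalk objects
                scene["objects"].append(
                    {"type": rng.choice(OBJECTS), "side": side, "slot": slot})
    # Target location placed randomly along the street.
    scene["target"] = {"side": rng.choice(["left", "right"]),
                       "slot": rng.randrange(num_slots)}
    return scene

print(generate_scene(seed=42))
```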

Images construction (real)
Real images: constructed using Google Street View, with "target locations" manually placed on the street at random.
Pros: real scenes; captures the dynamics of a real-world scenario.
Cons: expensive dataset construction (image annotations, target-location identification, etc.).

User instructions collection
We used the Amazon Mechanical Turk (AMT) crowdsourcing platform for corpus collection. Users on AMT, called turkers, are presented with a hypothetical scenario: they are asked to imagine a taxi ride in which the location where they prefer the taxi to stop is marked by a red cross. If they were to instruct the driver, how would they do it? Three such descriptions were collected for each image, for variety.

A typical description
[ACTION: Please stop] [DIRECTION: in front of] [REFERENT: the cop car]
[ACTION: Park] [DIRECTION: behind] [REFERENT: the blue pickup truck]

Sample data from the synthetic 2D images

Sample data from the real-world 3D street images
Target location description samples, with annotations:
Stop next to the first white car you see. → [ACTION: Stop] [DD: next to] [REF: the first white car you see]
Stop next to the car behind the blue car. → [ACTION: Stop] [DD: next to] [REF: the car behind the blue car]
Stop next to the white car. → [ACTION: Stop] [DD: next to] [REF: the white car]
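The bracketed annotations follow a regular format, so a small parser is enough to recover structured fields from an annotated line. A minimal sketch; the tag set is taken from the examples above, and the function name is ours:

```python
import re

# Parse annotations of the form
# "[ACTION: Stop] [DD: next to] [REF: the white car]" into a dict.
TAG_PATTERN = re.compile(r"\[(ACTION|DD|REF):\s*([^\]]+)\]")

def parse_annotation(line):
    return {tag: value.strip() for tag, value in TAG_PATTERN.findall(line)}

print(parse_annotation("[ACTION: Stop] [DD: next to] [REF: the white car]"))
# {'ACTION': 'Stop', 'DD': 'next to', 'REF': 'the white car'}
```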

Natural language description annotation statistics:
              Actions   Referents   Directional descriptions
Synthetic       273       408         372
Real-world      173       217         219
The inter-rater reliability for word-level annotations, measured with Cohen's kappa, is 0.81.
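For reference, word-level inter-rater agreement of this kind can be computed with off-the-shelf tools. A minimal sketch with hypothetical per-token tags from two annotators (not the paper's actual data):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-token tags from two annotators over the same utterance.
annotator_1 = ["ACTION", "ACTION", "DD", "DD", "REF", "REF", "REF"]
annotator_2 = ["ACTION", "ACTION", "DD", "REF", "REF", "REF", "REF"]
print(cohen_kappa_score(annotator_1, annotator_2))
```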

Baseline model: task pipeline
The task pipeline identifies the target location from the user descriptions.

Referent identification
Identify the vehicle/object described, among the distractors (e.g., which of the scene's vehicles is the "blue car"?). Once the referent (language) has been identified, we locate the object (visual). The objects' ground-truth labels are used. Embeddings are extracted for the visual objects and the referents.
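A minimal sketch of the matching step, assuming pretrained word vectors are available as a dict. The averaging-and-cosine scheme below is illustrative, not necessarily the authors' exact setup: embed the referent phrase and each candidate object label by averaging word vectors, then rank by cosine similarity.

```python
import numpy as np

def phrase_vector(phrase, word_vectors):
    """Average the vectors of in-vocabulary tokens in a phrase."""
    vecs = [word_vectors[w] for w in phrase.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else None

def rank_objects(referent, object_labels, word_vectors):
    """Rank scene object labels by cosine similarity to the referent phrase."""
    ref = phrase_vector(referent, word_vectors)
    ranked = []
    for label in object_labels:
        obj = phrase_vector(label, word_vectors)
        if ref is not None and obj is not None:
            sim = float(np.dot(ref, obj) /
                        (np.linalg.norm(ref) * np.linalg.norm(obj)))
            ranked.append((label, sim))
    return sorted(ranked, key=lambda x: x[1], reverse=True)

# Usage with a toy random vocabulary standing in for real pretrained vectors:
rng = np.random.default_rng(0)
word_vectors = {w: rng.normal(size=50)
                for w in "the white blue black car van sign street".split()}
print(rank_objects("the white car",
                   ["white van", "blue car", "white car", "street sign"],
                   word_vectors))
```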

Comparison of reference resolution approaches
The embedding-based approach performs better than a simple sub-string matching method. Embeddings trained on a Wikipedia corpus perform better than embeddings trained on the in-domain utterances.

A typical description
Directional description: describes the direction to stop w.r.t. the referent. The same description can refer to multiple directions (e.g., "next to"), so context, such as the direction of motion, is important. The approach we will take is region prediction; predicting a region from the directional descriptions and the referents is challenging.

Future work: target location/region identification
Once the referent is identified, the goal is to identify the target location. The target location is a function of r (the distance from the referent) and theta (the direction from the referent).
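Concretely, once the referent's position is known, a predicted (r, theta) pair maps to a target point via a polar-to-Cartesian offset. A minimal sketch of that geometry; the coordinate-frame convention is our assumption for illustration:

```python
import math

def target_location(referent_xy, r, theta):
    """Offset a target point from the referent by distance r and direction
    theta (radians in the scene's ground plane). The frame convention here
    (x along the street, y across it) is an illustrative assumption."""
    x, y = referent_xy
    return (x + r * math.cos(theta), y + r * math.sin(theta))

# e.g. "in front of" the referent, 3 units ahead along the street axis:
print(target_location((10.0, 4.0), r=3.0, theta=0.0))  # -> (13.0, 4.0)
```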

Future work
We are extending the work with a multi-turn dialogue corpus. We plan to use the BDD corpus for real-world images, as the images are pre-annotated, and transfer learning to identify car brands, colors, etc.
BDD: Berkeley Deep Drive dataset (Yu et al., 2018); Cars dataset (Krause et al., 2018).

Contributions
- A novel corpus containing user descriptions of target locations for synthetic and real-world street images.
- Natural language description annotations, along with visual annotations, for the task of target location prediction.
- A baseline model for the task of identifying referents from user descriptions.

Thank you
Special thanks for discussions: David Traum, Ron Artstein, Maike Paetzel.