1
Making Virtual Conversational Agent Aware of the Addressee of Users' Utterances in Multi-user Conversation
13th International Conference on Multimodal Interaction (ICMI 2011), Alicante, Spain, Nov. 16, 2011
Hung-Hsuan Huang, Department of Information & Communication Science, Ritsumeikan University, Japan (COMmunication Software Lab.)
Naoya Baba, Graduate School of Science and Technology, Seikei University, Japan
Yukiko Nakano, Department of Computer and Information Science, Seikei University, Japan
2
Virtual Conversational Agents
– Life-like CG characters with expressive faces and bodies
– Engage in face-to-face communication with users through multiple modalities such as facial expressions, speech, gestures, and postures
– Expected in training, pedagogical, and entertainment applications
– One target application: an information kiosk in public museums and exhibitions, where visitors typically come in groups
Examples: Tactical Iraqi (Alelo Inc. / USC), MAX (Bielefeld University)
3
Example: a tour guide kiosk agent talking with two users
4
Objective and Background
Objective
– Make the agent distinguish the addressee of each user utterance and respond appropriately
Naïve hypotheses
– The speaker looks at the addressee longer while speaking
– The addressee can be identified from the discourse structure alone
Related work
– Rule-based [Akker et al. 2009]
– Based on linguistic cues [Dowding et al. 2006]
– Linguistic, video, and audio (voice localization) cues [Bohus et al. 2011]
– Analysis of acoustic cues [Terken et al. 2007]
5
Methodology
1. A subject experiment with a human-operated agent (Wizard-of-Oz, WOZ)
2. Analysis of nonverbal cues (head orientation and prosody) in the collected data
3. Model building and pattern discovery with machine learning techniques
4. Implementation of a proactive information-providing agent
6
Target Situation of Conversation
– The users want to collaboratively make a decision among multiple candidates, with help from the agent, who is knowledgeable about the task domain
– The users have a rough image of what they want but no prior knowledge of the particular candidates
– The users discuss among themselves and acquire new information from the agent
– The conversation ends when the users have made their final decision
7
Settings of the WOZ Experiment
Subjects
– Users: 21 pairs, 42 in total (28 male, 14 female; average age 20.8; all native Japanese speakers)
– Agent operators: 4
Tasks
– Lecture registration: two students choose 3 out of 12 lectures that they want to take together (schedule, textbook, lecturer, prerequisite knowledge, tests)
– Travel planning: two tourists plan a visit to Kyoto (京都), the ancient capital of Japan; they have a coupon that lets them visit 3 out of 14 sightseeing spots for free (history, highlights, nearby restaurants, famous people)
Collected data
– Video
– Logs: users' face orientation and movements, agent's utterances
8
Experiment Environment
[Setup diagram: the two users sit in front of a projector screen, wearing wireless headsets and recorded by a web camera (Okao) and a video camera; the agent operator sits behind a curtain at a monitor]
9
Instructions to the User-role Participants
About the agent
– The agent is autonomous
– They can talk to the agent not only about the tasks but also about any other topic, e.g., about the agent herself
About the situation of the conversation
– The two users have conflicting schedules (lecture registration)
– It is their first visit to Kyoto (travel planning)
About general rules
– The reward is doubled if they make the most popular choices
– They can register their decisions with the agent temporarily
10
Agent Operators
Operation of the agent
– Make the agent speak by choosing predefined sentences from a GUI or by typing freely
– Practiced with the GUI for one hour before the experiment to get used to operating the agent
Instructions
– Help the users reach the decision that best fits their needs
– Try to end each interaction session within 10 minutes if possible
11
Example Dialog during the Experiment
12
Collected Interaction Corpus
– 17 pairs (10 male and 7 female pairs) are analyzed
– The average session length is 524 seconds
– The voice track is segmented into subject utterances at silent periods of 200 ms or longer; the utterance is the unit of analysis (a sketch of such segmentation follows below)
– 5 coders annotated the addressee of every utterance in the corpus
– Utterances addressed to the other subject are labeled "Partner"

Number of utterances by subject gender and addressee:

Subject    Agent    Partner    Total
Male       509      522        1,031
Female     354      445        799
Total      863      967        1,830
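The 200 ms silence-based segmentation can be approximated with a simple frame-energy threshold. Below is a minimal sketch, assuming mono WAV input and an energy threshold that would need tuning; this is not the authors' implementation:

```python
import numpy as np
import soundfile as sf  # assumed I/O library; any WAV reader works

def segment_utterances(path, silence_ms=200, energy_thresh=0.01, frame_ms=10):
    """Split a voice track into utterances separated by >= silence_ms of silence."""
    audio, sr = sf.read(path)                 # assumes a mono recording
    frame = int(sr * frame_ms / 1000)
    n = len(audio) // frame
    # Frame-level RMS energy and a crude voiced/silent decision
    rms = np.sqrt(np.mean(audio[: n * frame].reshape(n, frame) ** 2, axis=1))
    voiced = rms > energy_thresh
    min_gap = silence_ms // frame_ms          # silent frames that end an utterance
    segments, start, gap = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Close the segment just after the last voiced frame
                segments.append((start * frame / sr, (i - gap + 1) * frame / sr))
                start, gap = None, 0
    if start is not None:
        segments.append((start * frame / sr, n * frame / sr))
    return segments  # list of (start_sec, end_sec)
```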
13
Visual Information Analysis – Gaze Direction Approximation
– The video track of the corpus is labeled with commercial face-tracking software, FaceAPI
– The following FaceAPI measurements of the subjects are used:
  – Head position (x, y, z)
  – Head rotation (pitch, yaw, roll)
  – Confidence score of the measurement
14
Post-Processing of the Face Tracking Data
Why
– The face tracker loses the faces quite frequently
– What really matters is whom the subjects are looking at
How
– Whom each subject is looking at (the agent, the partner, or elsewhere) is manually labeled in 4 sessions
– These labels are used to train a C4.5 decision tree with the Weka toolkit (10-fold cross-validation accuracy: 97%), which is then applied over the whole corpus (see the sketch below)
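A comparable gaze-target classifier can be sketched with scikit-learn, whose CART tree stands in for Weka's C4.5 (J48); the file names and label strings here are assumptions, not artifacts from the study:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Per-frame head-tracking features: position (x, y, z), rotation
# (pitch, yaw, roll), and the tracker's confidence score.
X = np.loadtxt("head_tracking_frames.csv", delimiter=",")  # hypothetical file
y = np.loadtxt("gaze_labels.csv", dtype=str)  # "agent" / "partner" / "elsewhere"

# CART stands in for C4.5; both are axis-aligned decision trees.
clf = DecisionTreeClassifier(min_samples_leaf=20)
print("10-fold CV accuracy:", cross_val_score(clf, X, y, cv=10).mean())

# Train on the manually labeled sessions, then relabel every frame in the corpus.
clf.fit(X, y)
gaze_targets = clf.predict(X)
```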
15
Can the Addressee be Judged Merely by How Long They are Looked at?
– In total, the corpus contains 49,129 frames of video from the 30 fps camera
– When the agent is the addressee, the speaker gazes at the agent for 93% of the speaking time
– When the partner is the addressee, the speaker gazes at the partner for only 33% of the time (and at the agent for 65%)
16
Can the Addressee be Judged Merely by How Long They are Looked at? (cont.)
Gaze duration may be useful, but it is not enough on its own
17
Prosodic Information Analysis
The pitch, intensity (power), speech rate (phonemes/second), and duration of each utterance are analyzed (a sketch of the extraction follows below)
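Per-utterance pitch, intensity, and duration can be extracted along these lines; librosa is an assumed stand-in for whatever tool the authors used, and the speech rate would additionally need a phoneme recognizer or aligner:

```python
import numpy as np
import librosa

def prosodic_features(path):
    """Mean F0, mean intensity (RMS in dB), and duration of one utterance."""
    y, sr = librosa.load(path, sr=None)
    # Fundamental frequency via probabilistic YIN; unvoiced frames come back as NaN
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    mean_f0 = float(np.nanmean(f0))
    # Intensity as frame-level RMS energy converted to dB
    rms = librosa.feature.rms(y=y)[0]
    mean_intensity = float(np.mean(librosa.amplitude_to_db(rms)))
    duration = len(y) / sr
    return mean_f0, mean_intensity, duration
```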
18
Prosodic Information Analysis
Pitch, intensity, and duration are higher in utterances addressed to the agent than in those addressed to the partner
19
Prosodic Information Analysis
Speech rate is higher in utterances addressed to the partner than in those addressed to the agent
20
Visual Features
The time ratio of the approximated gaze direction
– Agent
– Partner
– Elsewhere
The number of the following head-orientation transitions
– Agent → Partner
– Agent → Elsewhere
– Partner → Agent
– Elsewhere → Agent
7 head-orientation features are extracted for each participant, 14 in total for the two participants (see the sketch below)
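Given a per-frame gaze-target sequence for one participant over one utterance, the seven features can be computed roughly as follows (a sketch; the label strings are assumptions):

```python
from collections import Counter

def head_features(gaze):
    """gaze: per-frame labels over one utterance, e.g. ["agent", "partner", ...].
    Returns the 7 head-orientation features for one participant."""
    n = len(gaze)
    # Time ratio of gaze toward each target
    counts = Counter(gaze)
    ratios = [counts[t] / n for t in ("agent", "partner", "elsewhere")]
    # Counts of the four transition types listed on the slide
    pairs = list(zip(gaze, gaze[1:]))
    transitions = [
        pairs.count(("agent", "partner")),
        pairs.count(("agent", "elsewhere")),
        pairs.count(("partner", "agent")),
        pairs.count(("elsewhere", "agent")),
    ]
    return ratios + transitions  # 3 ratios + 4 transition counts = 7 features
```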
21
Prosodic Features
– Average pitch (F0)
– Average intensity
– Duration
– Speech rate (number of phonemes per second)
– The difference of the utterance's F0 from the average over all subjects
– The difference of the utterance's intensity from the average over all subjects
(a sketch of the two difference features follows below)
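The two difference features are simple offsets from the corpus-wide averages; a minimal sketch, with the column layout assumed:

```python
import numpy as np

def add_difference_features(feats):
    """feats: array of shape (n_utterances, 4) with columns
    [mean_f0, mean_intensity, duration, speech_rate].
    Appends each utterance's F0 and intensity offsets from the averages
    over all subjects' utterances, giving the 6 prosodic features."""
    f0_offset = feats[:, 0] - feats[:, 0].mean()
    intensity_offset = feats[:, 1] - feats[:, 1].mean()
    return np.column_stack([feats, f0_offset, intensity_offset])
```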
22
Results of Automatic Classification
The 20 features (7 × 2 head-orientation + 6 prosodic) are used to train an SVM classifier (a training sketch follows below):

Group             Feature set         Accuracy   F-measure: Agent   F-measure: Partner
Male              Prosody             74.49%     0.729              0.759
                  Head orientation    74.4%      0.782              0.691
                  Prosody + Head      80.0%      0.807              0.792
Female            Prosody             75.97%     0.693              0.802
                  Head orientation    65.29%     0.709              0.571
                  Prosody + Head      80.1%      0.793              0.808
General (M + F)   Prosody             75.3%      0.717              0.781
                  Head orientation    71.62%     0.759              0.656
                  Prosody + Head      80.28%     0.799              0.806
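The classification setup can be reproduced in outline with scikit-learn standing in for whatever SVM implementation was used; the kernel, parameters, and file names are assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate

# X: one row per utterance, 20 columns = 7 head-orientation features x 2
# participants + 6 prosodic features. y: addressee, "agent" or "partner".
X = np.load("features.npy")                      # hypothetical file
y = np.load("labels.npy", allow_pickle=True)     # hypothetical file

# RBF-kernel SVM with feature scaling; the slides do not specify the kernel.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_validate(clf, X, y, cv=10, scoring=("accuracy", "f1_macro"))
print("accuracy:", scores["test_accuracy"].mean())
print("macro F1:", scores["test_f1_macro"].mean())
```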
23–25
Results of Automatic Classification (cont.)
Observations on the table above:
– Accuracy is slightly over 80% for all three groups when prosodic and head-orientation features are combined
– Head-orientation features identify the agent as the addressee better
– Prosodic cues perform better than visual cues in identifying the partner as the addressee
26
Architecture of the Real-time Agent System
27
Evaluation of the Real-time Addressee Identification Subsystem
Another WOZ experiment was conducted
– When the addressee estimation fails, the dialog breaks down and cannot recover
– 6 pairs of subjects (4 male and 2 female pairs; average age 22.7)
– The same tasks as in the corpus-collection experiment, but only the lecture-registration session is used, because of the location of the university
28
Results of the Evaluation Experiment
– Mistakenly captured sound (e.g., the voice of the other participant or of the agent) degrades the accuracy
– The accuracy is similar to that on the corpus once these errors are removed

                  Accuracy   F-measure: Agent   F-measure: Partner
Errors excluded   83.00%     0.749              0.872
Errors included   68.11%     0.742              0.751
29
Conclusions
– Proposed a method, grounded in empirical results, that identifies the addressee of user utterances in human-human-agent triadic, information-seeking conversation using nonverbal cues
– The accuracy is slightly over 80%
– Visual cues (duration of gaze toward each participant, gaze transitions) and audio cues (intensity, pitch, speech rate, duration) are used
– A real-time system was implemented and evaluated; its accuracy is similar to the offline analysis results
30
Future Work
– Improve the reliability and robustness of the input modalities
– As long as the estimation is not 100% correct, some method is needed to detect errors and repair the conversation
– Combine other methods, such as natural language understanding, to improve the accuracy
– Handle a varying number of users
– Analyze the appropriate timing for the agent to proactively intervene in inter-user conversation
– Investigate dialogs in other contexts and differences across cultures, genders, and ages
31
Thank you very much for your attention! Any questions?
Contact: hhhuang@acm.org