
WOZ acoustic data collection for interactive TV
A. Brutti*, L. Cristoforetti*, W. Kellermann+, L. Marquardt+, M. Omologo*
* Fondazione Bruno Kessler (FBK) - irst, Via Sommarive 18, Povo (TN), ITALY
+ Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg (FAU), Cauerstr. 7, Erlangen, GERMANY
LREC 2008 – Marrakech, 28-30/05/08

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 The DICIT EU project • Distant-talking Interfaces for Control of Interactive TV • A user-friendly human-machine interface enabling speech-based interaction with a TV and related digital devices (set-top box, STB) • Interaction in a natural and spontaneous way, without a close-talk microphone 2

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 The DICIT EU project Distant-talking Interfaces for Control of Interactive TV 3

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 The DICIT EU project Distant-talking Interfaces for Control of Interactive TV – robustness in a real reverberant environment: Who is she? Where is she? What is her head orientation? When did she speak? What does she say? Other speakers? Noise sources? What is the output from each loudspeaker? How is it at each microphone?

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 The DICIT Project • STREP Project – FP6 • Strategic objective: – Multimodal Interfaces • Duration: October 2006 – September 2009

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 What is a Wizard of Oz (WOZ) experiment? • A subject is asked to complete specific tasks using an artificial system • The user is told that the system is fully functional and is asked to use it intuitively • The system is actually operated by a person (the wizard) who is not visible to the subject • The wizard can react in a more flexible way than an automatic system and can create particular situations 6

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 Why a WOZ data collection? • We needed to collect an acoustic database for testing pre-processing algorithms: – acoustic scene analysis – speaker ID and verification – echo cancellation – blind source separation – beamforming (a rough sketch follows this slide) – speaker localization and tracking – distant automatic speech recognition • With a WOZ, realistic scenarios can be simulated at a preliminary stage, allowing for repeatable experiments • There is no need for a fully working system in order to collect real data • Naïve users do not behave like expert users; they use the system in a realistic way 7
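The slides do not show any of the actual DICIT front-end implementations; purely as an illustration of the kind of algorithm listed above, the following Python sketch implements a basic delay-and-sum beamformer for a linear array under a far-field plane-wave assumption. The function name, array geometry and steering direction are hypothetical placeholders, not project code.

    import numpy as np

    def delay_and_sum(signals, mic_positions, doa_deg, fs, c=343.0):
        # signals: (n_mics, n_samples) synchronised microphone signals
        # mic_positions: positions along the array axis in metres
        # doa_deg: assumed direction of arrival (0 = broadside), fs: sampling rate in Hz
        mic_positions = np.asarray(mic_positions, dtype=float)
        doa = np.deg2rad(doa_deg)
        # per-microphone delay in samples for a far-field plane wave
        delays = mic_positions * np.sin(doa) / c * fs
        n = signals.shape[1]
        freqs = np.fft.rfftfreq(n)              # normalised frequency (cycles per sample)
        out = np.zeros(n)
        for sig, d in zip(signals, delays):
            # compensate each fractional delay with a linear phase shift, then accumulate
            spec = np.fft.rfft(sig) * np.exp(2j * np.pi * freqs * d)
            out += np.fft.irfft(spec, n)
        return out / len(signals)

In a real system this kind of spatial filtering would be combined with the other blocks listed above (echo cancellation, localization, etc.); the sketch only shows the steering-and-summing step.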

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 The DICIT WOZ • Experiments were conducted in the laboratories at FBK and FAU • One room was set up as a living room with a TV, loudspeakers and seats; an adjacent room, not visible to the users, hosted the wizard and the simulation system • Users watched the TV and had to interact with it by voice and remote control, to change channels and to retrieve information from the teletext pages • At some point, they had to move around and speak with the system 8

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 Strategy for the recordings • Four users sat in the room, but one of them was the co-wizard, who ensured the smooth running of the experiment and produced some acoustic events • Users were recorded by close-talk and far-field microphones • Interactions were also recorded by 3 fixed cameras, allowing automatic tracking of the users' movements • Recordings were made with Italian, German and English groups 9

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 Hints for wizard and co-wizard • Wizard: – simulate some recognition errors – do not accept speech for 10” after a volume change (to allow the algorithms to converge) • Co-wizard (in the room): – lead the first phase (user registration) – produce noises during the teletext interaction (key jingle, cough, phone ring, etc.) – keep the situation under control (give hints to the real users) 10

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 Script for the interaction (person A) • Enter the room and sit on the seat marked A • Wait for user D to switch on the system, then say your name and read the four phonetically rich sentences on the screen • When user D gives you the remote control, try using it to change channels and volume (next/previous channel, volume up/down, mute) • Connect to the system using your voice: “DICIT activate” • Use your voice to change channels and volume, e.g. “I want to see CNN” • Select the Euronews channel and start the teletext (using your voice or the remote control) • Use the teletext to obtain the requested news and weather info; please move to different positions in the room when interacting with the system • Log off from the system (“DICIT logoff”) and give the remote control to user B 11

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 The FBK experimental room 12 A harmonic 15-electret-microphone array, developed specifically for the project, was placed above the TV

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 Audio and video sensor setup • A harmonic 15-electret-microphone array, developed specifically for the project, was placed above the TV (an illustrative geometry sketch follows this slide) • A NIST MarkIII 64-electret-microphone linear array was used for comparison • A table microphone and 2 side microphones were used (omnidirectional pattern) • Every participant wore a close-talk microphone for reference • 3 video cameras recorded the sessions for monitoring and to derive 3D reference positions 13
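The exact geometry of the harmonic array is not given in the slides; purely as an illustration of how a harmonically nested linear design can reach 15 sensors, the sketch below assumes three seven-microphone sub-arrays with spacings d, 2d and 4d sharing the same centre. All parameters (including d = 4 cm) are hypothetical and are not the actual DICIT layout.

    import numpy as np

    def nested_array_positions(d=0.04, mics_per_level=7, levels=3):
        # positions (in metres) of a harmonically nested linear array:
        # each level is a uniform sub-array with spacing d * 2**k, all centred
        # on the same point; sensors shared between levels are counted once
        pos = set()
        half = (mics_per_level - 1) // 2
        for k in range(levels):
            spacing = d * 2 ** k
            for i in range(-half, half + 1):
                pos.add(round(i * spacing, 6))
        return np.array(sorted(pos))

    print(nested_array_positions())        # 15 positions from -0.48 m to +0.48 m
    print(len(nested_array_positions()))   # 15

The attraction of such nested designs is that each sub-array covers a different frequency band with roughly constant beamwidth while reusing the central sensors.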

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 Clip from a recorded session 14

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 WOZ preparation • 12 video clips and 100 teletext pages were recorded from real TV; everything was available in 3 languages • The stereo audio channels were extracted and decorrelated (by FAU) for the echo canceller, and the clips were recreated to fit the simulation (a generic decorrelation sketch follows this slide) • The system was controlled by a PC running the Elektrobit EB GUIDE Studio simulation tool • An infrared receiver was integrated into the system, so that users could control the TV with a real remote control 15
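The slide does not specify which decorrelation method FAU applied; a widely used technique for stereo acoustic echo cancellation (Benesty et al.) adds a small half-wave-rectified nonlinearity with opposite polarity to the two channels. The sketch below assumes that textbook approach rather than the actual FAU processing; the function name and the value of alpha are illustrative.

    import numpy as np

    def decorrelate_stereo(left, right, alpha=0.5):
        # add a scaled half-wave-rectified copy of each channel, with opposite
        # polarity on the two channels, to reduce their linear dependence
        left = np.asarray(left, dtype=float)
        right = np.asarray(right, dtype=float)
        l = left + alpha * (left + np.abs(left)) / 2.0    # positive half-wave rectifier
        r = right + alpha * (right - np.abs(right)) / 2.0  # negative half-wave rectifier
        return l, r

Breaking the linear relation between the two loudspeaker signals is what allows a stereo echo canceller to identify the two acoustic paths unambiguously.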

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 Recording hardware setup 3 PCs were used to record all the data: 2 Linux machines and 1 Windows machine 16

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 Recorded sessions • FBK and FAU recorded different sessions using a similar setup, in different languages • Each user interaction lasted about 10 minutes, for a total of 360 minutes of recordings • 24 or 26 synchronous channels were recorded at 48 kHz with 16-bit precision, plus 64 channels from the MarkIII array at 44.1 kHz and 24 bits

Site  Language  Number of sessions
FBK   Italian   6
FAU   German    5
FAU   English   1

17

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 Data annotation • The 6 Italian sessions have been manually transcribed and segmented at the word level, using Transcriber • An automatic segmentation was first obtained with a tool based on the energy of the close-talk signals, then adjusted manually where necessary (a minimal sketch follows this slide) • A stereo file was created, with one channel for the close-talk signal and one for the environment sounds, to ease the annotation process • The annotation comprises the speaker ID, the transcription of the uttered sentence and any noise included in the acoustic event list • Specific labels for acoustic events have been introduced, following defined guidelines • Video data were used to derive 3D coordinates of the speaker's head, and reference files were created at a rate of 5 labels per second 18
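The energy-based segmentation tool is not described in detail; the following is a minimal sketch of how frame-energy thresholding on a close-talk signal can produce a first-pass speech/non-speech segmentation. The frame length, threshold and minimum duration are illustrative values, not the ones used in DICIT.

    import numpy as np

    def energy_segments(x, fs, frame_ms=20, thresh_db=-40.0, min_dur=0.25):
        # frame-energy thresholding on a close-talk signal;
        # returns a list of (start_s, end_s) speech segments
        x = np.asarray(x, dtype=float)
        frame = int(fs * frame_ms / 1000)
        n_frames = len(x) // frame
        energy = np.array([np.mean(x[i * frame:(i + 1) * frame] ** 2)
                           for i in range(n_frames)])
        e_db = 10.0 * np.log10(energy + 1e-12)
        active = e_db > e_db.max() + thresh_db       # threshold relative to the loudest frame
        segments, start = [], None
        for i, a in enumerate(active):
            if a and start is None:
                start = i
            elif not a and start is not None:
                if (i - start) * frame / fs >= min_dur:   # drop very short bursts
                    segments.append((start * frame / fs, i * frame / fs))
                start = None
        if start is not None:
            segments.append((start * frame / fs, n_frames * frame / fs))
        return segments

Such an automatic first pass only has to be roughly right, since the slide notes that the boundaries were adjusted manually afterwards.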

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 Data exploitation / testing • The data have been used for a preliminary evaluation of some FBK algorithms: – localization techniques: precision is around 30 cm (a GCC-PHAT sketch follows this slide) – audio segments have been used for the acoustic event classification system: 92% accuracy – data have been used to test the speaker verification and identification system, but the close-talk signal still performs better than the beamformed signal • Room impulse response measurements have been carried out at both sites, in different positions; they are useful, e.g., for speech contamination purposes 19
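The localization technique is not detailed on the slide; time-difference-of-arrival estimation with GCC-PHAT between microphone pairs is a standard building block for this kind of speaker localization, so the sketch below shows only that generic step, not the FBK implementation.

    import numpy as np

    def gcc_phat(sig, ref, fs, max_tau=None):
        # time difference of arrival between two microphone signals via GCC-PHAT
        sig = np.asarray(sig, dtype=float)
        ref = np.asarray(ref, dtype=float)
        n = len(sig) + len(ref)
        cross = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
        cross /= np.abs(cross) + 1e-12               # PHAT weighting: keep phase, drop magnitude
        cc = np.fft.irfft(cross, n)
        max_shift = n // 2
        if max_tau is not None:
            max_shift = min(int(max_tau * fs), max_shift)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        shift = np.argmax(np.abs(cc)) - max_shift
        return shift / float(fs)                     # TDOA in seconds

Pairwise TDOAs from an array like the one above the TV would then be combined, e.g. by least-squares triangulation over the known microphone positions, to obtain the 3D speaker position that is compared against the video-derived reference.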

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 Transcriber session

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 Conclusions • This data collection is the first of its kind and is of significant benefit for the development of acoustic front-end algorithms and dialogue strategies • 36 naïve subjects were recorded, yielding 360 minutes of signals on multiple synchronously recorded channels (125 GB of data) • Users enjoyed the system and tolerated some recognition errors; they preferred the voice modality over remote-control interaction 21

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 Current status of the project • The project is in its second year • We have just finished integrating the first prototype • We are ready to start the evaluation of the prototype • More information and demo clips can be found at 22

Luca Cristoforetti LREC 2008 – Marrakech, 28-30/05/08 Thank You!