Towards Dolphin Recognition
Tanja Schultz, Alan Black, Bob Frederking
Carnegie Mellon University
West Palm Beach, March 28, 2003

Outline
1. Speech-to-Speech Recognition
   - Brief Introduction
   - Lab, Research
   - Data Requirements
     - Audio data
     - 'Transcriptions'
2. Towards Dolphin Recognition
   - Applications
   - Current Approaches
   - Preliminary Results

Part 1
1. Speech-to-Speech Recognition
   - Brief Introduction
   - Lab, Research
   - Data Requirements
     - Audio data
     - 'Transcriptions'
2. Towards Dolphin Recognition
   - Applications
   - Current Approaches
   - Preliminary Results

Speech Processing Terms
- Speech Recognition: converts spoken input into written text output
- Natural Language Understanding (NLU): derives the meaning of spoken or written input
- (Speech-to-Speech) Translation: transforms text/speech in language A into text/speech in language B
- Speech Synthesis (Text-to-Speech, TTS): converts written text input into audible output

Speech Recognition
[Slide figure: pipeline from speech input through preprocessing, decoding/search, and postprocessing to synthesis (TTS); competing hypotheses for the input "hello" are shown as Hello, Hale, Bob, Hallo.]

Fundamental Equation of SR

P(W|x) = [ P(x|W) * P(W) ] / P(x)

- Acoustic Model: P(x|W), built from models of sound units (e.g., A-b, A-m, A-e)
- Pronunciation dictionary: Am = AE M, Are = A R, I = AI, you = J U, we = V E
- Language Model: P(W), estimated from text such as: I am, you are, we are
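As a worked illustration of this decision rule: the recognizer searches for the word sequence W maximizing P(x|W) * P(W); P(x) is the same for every candidate, so it can be dropped. A minimal sketch, with invented toy probabilities (not from the slides):

```python
# Toy illustration of the fundamental equation: pick the word sequence W
# maximizing P(x|W) * P(W). All numbers below are made up for this example.
candidates = {
    # W: (acoustic likelihood P(x|W), language model prior P(W))
    "i am":    (0.020, 0.40),
    "you are": (0.015, 0.35),
    "we are":  (0.018, 0.25),
}

best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
print(best)  # "i am", since 0.020 * 0.40 = 0.008 is the largest product
```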

SR: Data Requirements
- Audio data: defines the sound set; acoustic model units are built from these sounds
- Pronunciation dictionary: Am = AE M, Are = A R, I = AI, you = J U, we = V E
- Text data: for the language model, e.g., I am, you are, we are
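To make the three knowledge sources concrete, here is a minimal sketch of how they might be represented as data, reusing the slide's toy entries; the structures are assumptions for illustration, not the actual Janus formats:

```python
# Pronunciation dictionary: maps each word to its sequence of sound units
# (entries taken from the slide's toy example).
lexicon = {
    "am":  ["AE", "M"],
    "are": ["A", "R"],
    "i":   ["AI"],
    "you": ["J", "U"],
    "we":  ["V", "E"],
}

# Text data feeds the language model, e.g., bigram counts (made-up values).
bigram_counts = {("i", "am"): 10, ("you", "are"): 8, ("we", "are"): 6}

# Audio data defines the sound set; one acoustic model is trained per unit.
sound_set = sorted({unit for pron in lexicon.values() for unit in pron})
print(sound_set)  # ['A', 'AE', 'AI', 'E', 'J', 'M', 'R', 'U', 'V']
```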

Janus Speech Recognition Toolkit (JRTk)
- Unlimited and open vocabulary
- Spontaneous and conversational human-human speech
- Speaker-independent
- High bandwidth, telephone, car, broadcast
- Languages: English, German, Spanish, French, Italian, Swedish, Portuguese, Korean, Japanese, Serbo-Croatian, Chinese, Shanghai, Arabic, Turkish, Russian, Tamil, Czech
- Best performance on public benchmarks:
  - DoD (English), DARPA Hub-5 tests '96 and '97 (SWB task)
  - Verbmobil (German) benchmarks '95-'00 (travel task)

Mobile Device for Translation & Navigation

Multi-lingual Meeting Support
The Meeting Browser is a powerful tool that allows us to record a new meeting, review or summarize an existing meeting, or search a set of existing meetings for a particular speaker, topic, or idea.

Multilingual Indexing of Video
View4You / Informedia: automatically records broadcast news and allows the user to retrieve video segments of news items on different topics using spoken language input.
- Non-cooperative speaker on video
- Cooperative user
- Indexing requires only low-quality translation

Part 2
1. Speech-to-Speech Recognition
   - Brief Introduction
   - Lab, Research
   - Data Requirements
     - Audio data
     - 'Transcriptions'
2. Towards Dolphin Recognition
   - Applications
   - Current Approaches
   - Preliminary Results

Towards Dolphin Recognition
The classical speaker recognition tasks, restated for dolphins:
- Identification: Whose voice is this? -> Whose voice is it (which dolphin)?
- Verification/Detection: Is this Bob's voice? -> Is it Nippy's voice?
- Segmentation and Clustering: Which segments are from the same speaker/dolphin? Where are the speaker/dolphin changes?

Applications: 'off-line' (off the water, off the boat, off season)
- Data management and indexing
  - Automatic assignment/labeling of already recorded (archived) data
  - Automatic post-processing (indexing) for later retrieval
- Towards important/meaningful units = DOLPHONES
  - Segmentation and clustering of similar sounds/units
  - Find out about unit frequencies
  - Find out about correlations between sounds and other events
- Whistles correlated to family relationship
  - Who belongs to whom? Can we find out about the family tree?
  - Can we find out more about social structure?

Applications: 'on-line'
- Identification and tracking
  - Who is currently speaking?
  - Who is around?
- Towards important/meaningful units
  - Find out about correlations between sounds and other events
  - Whistles correlated to family relationship: who belongs to whom
=> Enables wide-range identification, tracking, and observation, since sound travels much longer distances underwater than images can be seen.

Common Approaches
Two distinct phases:
- Training phase: feature extraction followed by model training; training speech for each dolphin (e.g., Nippy, Havana) yields one model per dolphin.
- Detection phase: feature extraction on an unknown recording, then a detection decision against the trained models; the output is a hypothesis, e.g., "Havana".
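A minimal sketch of the two phases, assuming Gaussian mixture models (a standard choice for this kind of detection task; the slides do not name the model family) and synthetic stand-in features in place of real feature extraction:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Training phase: stand-in features per dolphin (a real system would extract
# spectral features, e.g. MFCC frames, from the training recordings).
train_features = {
    "Nippy":  rng.normal(0.0, 1.0, size=(500, 2)),
    "Havana": rng.normal(3.0, 1.0, size=(500, 2)),
}
models = {name: GaussianMixture(n_components=4, random_state=0).fit(feats)
          for name, feats in train_features.items()}

# Detection phase: extract features from the unknown recording and
# hypothesize the dolphin whose model scores it highest.
unknown = rng.normal(3.0, 1.0, size=(200, 2))
hypothesis = max(models, key=lambda name: models[name].score(unknown))
print("Hypothesis:", hypothesis)  # "Havana" for this synthetic data
```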

Current Approaches
- A likelihood ratio test is used for the detection decision:
  Λ = p(X|dolph) / p(X|~dolph)
- p(X|dolph) is the likelihood for the dolphin model when the features X = (x1, x2, ...) are given.
- p(X|~dolph) is the likelihood for an alternative, so-called background model, trained on all data except that of the dolphin in question.
- The detection decision accepts the dolphin hypothesis when Λ exceeds a threshold.
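A sketch of this likelihood ratio test in the same GMM setting as above; the background model is trained on pooled data from the other dolphins, and the threshold of 0.0 is an arbitrary assumption (in practice it would be tuned on held-out data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Target dolphin data vs. pooled data from all other dolphins (synthetic).
dolphin_feats    = rng.normal(0.0, 1.0, size=(500, 2))
background_feats = rng.normal(2.0, 1.5, size=(2000, 2))

dolphin_model    = GaussianMixture(n_components=4, random_state=0).fit(dolphin_feats)
background_model = GaussianMixture(n_components=8, random_state=0).fit(background_feats)

def detect(X, threshold=0.0):
    # Log-likelihood ratio: log p(X|dolph) - log p(X|~dolph).
    # GaussianMixture.score returns the average log-likelihood per frame.
    llr = dolphin_model.score(X) - background_model.score(X)
    return llr > threshold, llr

accepted, llr = detect(rng.normal(0.0, 1.0, size=(200, 2)))
print(accepted, round(llr, 2))  # expected True: test data matches the target
```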

First Experiments - Setup
- Take the data we got from Denise.
- Alan labeled about 160 files.
- Labels:
  - dolphin sounds: ~370 tokens
  - electric noise (machine, clicks, others): ~180 tokens
  - pauses: ~220 tokens
- Derive the dolphin ID from the file name (educated guess): Caroh, Havana, Lag, Lat, LG, LH, Luna, Mel, Nassau, Nippy.
- Train one model per dolphin and one 'garbage' model for the rest.
- Recognize each incoming audio file; hypotheses consist of a list of dolphin and garbage models.
- Count the number of occurrences of each model per audio file and return the dolphin with the highest count as the one identified (see the sketch below).
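A sketch of this counting decision rule; the token sequence below is a made-up stand-in for the recognizer's hypothesis over one audio file:

```python
from collections import Counter

DOLPHINS = {"Caroh", "Havana", "Lag", "Lat", "LG", "LH",
            "Luna", "Mel", "Nassau", "Nippy"}

def identify(hypothesis_tokens):
    """Return the dolphin whose model occurs most often in the hypothesis."""
    counts = Counter(t for t in hypothesis_tokens if t in DOLPHINS)
    return counts.most_common(1)[0][0] if counts else None

print(identify(["garbage", "Nippy", "Nippy", "garbage", "Havana", "Nippy"]))
# -> "Nippy" (3 occurrences vs. 1 for Havana)
```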

First Experiments - Results

Next steps
- Step 1: To build a 'real' system we need
  - MORE audio data, MORE audio data, MORE ...
  - Labels (the more accurate, the better)
    - Idea 1: automatic labeling; live with the errors
    - Idea 2: manual labeling
    - Idea 3: automatic labeling and post-editing
- Step 2: Given more data
  - Automatic clustering
  - Try first steps towards unit detection
- Step 3: Build a working system; make it small and fast enough for deployment.