Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Agenda Questions Group thinking session Speech retrieval Music retrieval

Shoah Foundation Collection 52,000 interviews –116,000 hours (13 years) –32 languages Full description cataloging –14,000 term thesaurus –4,000 interviews for $8 million

Audio Retrieval We have already discussed three approaches –Controlled vocabulary indexing –Ranked retrieval based on associated captions –Social filtering based on other users’ ratings Today’s focus is on content-based retrieval –Analogue of content-based text retrieval

Audio Retrieval Retrospective retrieval applications –Search music and nonprint media collections –Electronic finding aids for sound archives –Index audio files on the web Information filtering applications –Alerting service for a news bureau –Answering machine detection for telemarketing –Autotuner for a car radio

The Size of the Problem 30,000 hours in the Maryland Libraries –Unique collections with limited physical access 116,000 hours in the Shoah collection Millions of hours of streaming audio each year –Becoming available worldwide on the web Broadcast news (audio/video) –Ex. Television archive

HotBot Audio Search Results

Audio Genres Speech-centered –Radio programs –Telephone conversations –Recorded meetings Music-centered –Instrumental, vocal Other sources –Alarms, instrumentation, surveillance, …

Detectable Speech Features Content –Phonemes, one-best word recognition, n-best Identity –Speaker identification, speaker segmentation Language –Language, dialect, accent Other measurable parameters –Time, duration, channel, environment

How Speech Recognition Works Three stages –What sounds were made? Convert from waveform to subword units (phonemes) –How could the sounds be grouped into words? Identify the most probable word segmentation points –Which of the possible words were spoken? Based on likelihood of possible multiword sequences All three stages are learned from training data –Using hill-climbing estimation of a Hidden Markov Model

Using Speech Recognition (pipeline diagram): Phone Detection → Word Construction → Word Selection. Phone detection yields phone n-grams and a phone lattice; word construction uses a transcription dictionary to build words and a word lattice; word selection uses a language model to produce the one-best transcript.
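
As a purely illustrative aside (not part of the original slides), the sketch below shows the core idea behind the last stage: a Viterbi search that combines per-frame acoustic scores with transition scores, which play the role of a language model, to recover the most probable hidden sequence. The states, probabilities, and observations are invented toy values.

```python
import math

# Toy HMM: two "word" states and a tiny observation alphabet.
# Everything below is invented for illustration; a real recognizer
# works over phone/word lattices with learned models.
states = ["hello", "world"]
start_p = {"hello": 0.6, "world": 0.4}
trans_p = {  # plays the role of a (bigram) language model
    "hello": {"hello": 0.3, "world": 0.7},
    "world": {"hello": 0.4, "world": 0.6},
}
emit_p = {  # plays the role of the acoustic model
    "hello": {"h": 0.5, "e": 0.3, "w": 0.2},
    "world": {"h": 0.1, "e": 0.2, "w": 0.7},
}

def viterbi(observations):
    """Return the most probable state sequence for the observations."""
    # log-prob of the best path ending in each state, plus backpointers
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            score, prev = max(
                (best[t - 1][p] + math.log(trans_p[p][s]) +
                 math.log(emit_p[s][observations[t]]), p)
                for p in states
            )
            best[t][s] = score
            back[t][s] = prev
    # trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(["h", "e", "w"]))   # -> ['hello', 'world', 'world']
```

A real recognizer runs the same dynamic programming over thousands of context-dependent phone states and a word lattice rather than two toy words.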

ETHZ Broadcast News Retrieval Segment broadcasts into 20 second chunks Index phoneme n-grams –Overlapping one-best phoneme sequences –Trained using native German speakers Form phoneme trigrams from typed queries –Rule-based system for “open” vocabulary Vector space trigram matching –Identify ranked segments by time

Phoneme Trigrams Manage -> m ae n ih jh –Dictionaries provide accurate transcriptions But valid only for a single accent and dialect –Rule-based transcription handles unknown words Index every overlapping 3-phoneme sequence –m ae n –ae n ih –n ih jh
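
To make the indexing step concrete, here is a minimal sketch of overlapping-trigram extraction plus the vector-space matching mentioned on the ETHZ slide; the phone strings and the two indexed “segments” are invented for illustration.

```python
from collections import Counter
from math import sqrt

def phone_trigrams(phones):
    """Every overlapping 3-phoneme sequence, e.g. manage -> m ae n ih jh."""
    return [" ".join(phones[i:i + 3]) for i in range(len(phones) - 2)]

print(phone_trigrams(["m", "ae", "n", "ih", "jh"]))
# ['m ae n', 'ae n ih', 'n ih jh']

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    num = sum(a[t] * b[t] for t in a)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# Invented 20-second "segments", already converted to one-best phone strings.
segments = {
    "seg-01": ["m", "ae", "n", "ih", "jh", "m", "ax", "n", "t"],   # "management"-ish
    "seg-02": ["b", "r", "ao", "d", "k", "ae", "s", "t"],          # "broadcast"-ish
}
index = {sid: Counter(phone_trigrams(p)) for sid, p in segments.items()}

query = Counter(phone_trigrams(["m", "ae", "n", "ih", "jh"]))      # "manage"
ranking = sorted(index, key=lambda sid: cosine(query, index[sid]), reverse=True)
print(ranking)   # seg-01 ranks first
```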

ETHZ Broadcast News Retrieval

Cambridge Video Mail Retrieval Added personal audio (and video) to email –But subject lines still typed on a keyboard Indexed most probable phoneme sequences

Cambridge Video Mail Retrieval Translate queries to phonemes with dictionary –Skip stopwords and words with fewer than 3 phonemes Find no-overlap matches in the lattice –Queries take about 4 seconds per hour of material Vector space exact word match –No morphological variations checked –Normalize using most probable phoneme sequence Select from a ranked list of subject lines
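
A small sketch of the query-side processing just described: look each typed query word up in a pronunciation dictionary and skip stopwords and very short pronunciations before searching the lattice. The dictionary, stopword list, and length threshold below are placeholders, not the actual Cambridge resources.

```python
# Toy pronunciation dictionary and stopword list (placeholders only).
PRON = {
    "budget":  ["b", "ah", "jh", "ih", "t"],
    "meeting": ["m", "iy", "t", "ih", "ng"],
    "the":     ["dh", "ah"],
    "report":  ["r", "ih", "p", "ao", "r", "t"],
}
STOPWORDS = {"the", "a", "an", "of", "to"}
MIN_PHONES = 3   # short pronunciations are too unreliable to spot in a lattice

def query_to_phone_sequences(query):
    """Translate a typed query into phone sequences to search for in the lattice."""
    sequences = []
    for word in query.lower().split():
        if word in STOPWORDS:
            continue                      # stopwords are skipped
        phones = PRON.get(word)
        if phones is None:
            continue                      # unknown word: no dictionary entry
        if len(phones) < MIN_PHONES:
            continue                      # too short to match reliably
        sequences.append((word, phones))
    return sequences

print(query_to_phone_sequences("the budget meeting report"))
# [('budget', [...]), ('meeting', [...]), ('report', [...])]
```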

Contrast of Approaches Rule-based transcription –Potentially errorful –Broad coverage, handles unknown words Dictionary-based transcription –Good for smaller settings –Accurate Both susceptible to the problem of variability

BBN Radio News Retrieval

AT&T Radio News Retrieval

IBM Broadcast News Retrieval Large vocabulary continuous speech recognition –64,000 word forms cover most utterances When suitable training data is available About 40% word error rate in the TREC 6 evaluation –Slow indexing (1 hour per hour) limits collection size Standard word-based vector space matching –Nearly instant queries –N-gram triage plus lattice match for unknown words Ranked list showing source and broadcast time

Comparison with Text Retrieval Detection is harder –Speech recognition errors Selection is harder –Date and time are not very informative Examination is harder –Linear medium is hard to browse –Arbitrary segments produce unnatural breaks

Speaker Identification Gender –Classify speakers as male or female Identity –Detect speech samples from same speaker –To assign a name, need a known training sample Speaker segmentation – Identify speaker changes –Count number of speakers
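
The slide lists speaker segmentation without saying how it is done. One common approach (not named on the slide) is model-based change detection: compare how well a single Gaussian explains a window of acoustic features against two Gaussians split at a candidate boundary, a BIC-style criterion. The sketch below runs on synthetic one-dimensional “features” purely to illustrate the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_loglik(x):
    """Log-likelihood of samples under a single Gaussian fit to them."""
    var = x.var() + 1e-6
    return -0.5 * len(x) * (np.log(2 * np.pi * var) + 1)

def delta_bic(window, split, penalty=2.0):
    """Positive values favour a speaker change at `split` (BIC-style criterion)."""
    whole = gaussian_loglik(window)
    parts = gaussian_loglik(window[:split]) + gaussian_loglik(window[split:])
    # penalise the extra parameters of the two-Gaussian hypothesis
    return parts - whole - 0.5 * penalty * np.log(len(window))

# Synthetic "acoustic features": speaker A then speaker B with a shifted mean.
speaker_a = rng.normal(0.0, 1.0, 200)
speaker_b = rng.normal(3.0, 1.0, 200)
window = np.concatenate([speaker_a, speaker_b])

scores = {split: delta_bic(window, split) for split in range(50, 350, 25)}
best = max(scores, key=scores.get)
print(best, round(scores[best], 1))   # the best split lands at frame 200
```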

A Richer View of Speech Speaker identification –Known speaker and “more like this” searches –Gender detection for search and browsing Topic segmentation via vocabulary shift –More natural breakpoints for browsing Speaker segmentation –Visualize turn-taking behavior for browsing –Classify turn-taking patterns for searching

Other Possibly Useful Features Channel characteristics –Cell phone, landline, studio mike,... Accent –Another way of grouping speakers Prosody –Detecting emphasis could help search or browsing Non-speech audio –Background sounds, audio cues

Competing Demands on the Interface Query must result in a manageable set –But users prefer simple query interfaces Selection interface must show several segments –Representations must be compact, but informative Rapid examination should be possible –But complete access to the recordings is desirable

Iterative Prototyping Strategy Select a user group and a collection Observe information seeking behaviors –To identify effective search strategies Refine the interface –To support effective search strategies Integrate needed speech technologies Evaluate the improvements with user studies –And observe changes to effective search strategies

The VoiceGraph Project Exploring rich queries –Content-based, speaker-based, structure-based Multiple cues in the selection interface –Turn-taking, gender, query terms Flexible examination –Text transcript, audio skims

Depicting Turn Taking Behavior Time is depicted from left to right Speakers separated vertically within a depiction Depictions stacked vertically in rank order Actual recordings are more complex

Bootstrapping the Prototype Select a user population and a collection –Journalists and historians –Broadcast news from the 1960’s and 1970’s Mock up an interface –Pilot study to see if we’re on the right track Integrate “back end” speech processing –Recognition, identification, segmentation,... Observe information seeking behaviors

New Zealand Melody Index Index musical tunes as contour patterns –Rising, descending, and repeated pitch –Note duration as a measure of rhythm Users sing queries using words or la, da, … –Pitch tracking accommodates off-key queries Rank order using approximate string match –Insert, delete, substitute, consolidate, fragment Display title, sheet music, and audio
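
A sketch of the contour encoding described above: compare each note’s pitch with the previous one and emit U, D, or R, with a tolerance so slightly off-key repetitions still count as repeats. The pitch values are invented stand-ins for pitch-tracker output.

```python
def contour(pitches, tolerance=0.5):
    """Encode a pitch sequence as * (first note), U (up), D (down), R (repeat)."""
    if not pitches:
        return ""
    symbols = ["*"]
    for prev, cur in zip(pitches, pitches[1:]):
        if abs(cur - prev) <= tolerance:      # small deviations count as a repeat,
            symbols.append("R")               # which tolerates off-key singing
        elif cur > prev:
            symbols.append("U")
        else:
            symbols.append("D")
    return "".join(symbols)

# Invented pitch-tracker output in MIDI-style semitone numbers.
sung_query = [64, 62, 60, 64, 62, 60]
print(contour(sung_query))   # *DDUDD -- the first six symbols of the contour below
```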

Contour Matching Example “Three Blind Mice” is indexed as: –*DDUDDUDRDUDRD * represents the first note D represents a descending pitch (U is ascending) R represents a repetition (detectable split, same pitch) My singing produces: –*DDUDDUDRRUDRR Approximate string match finds 2 substitutions
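
The matching step on this slide is, at its core, an edit-distance computation. The sketch below is a plain insert/delete/substitute version (the real system also allows consolidate and fragment operations) and reproduces the distance of 2 for the two contour strings above.

```python
def edit_distance(a, b, sub=1, ins=1, delete=1):
    """Minimal insert/delete/substitute cost to turn string a into string b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * delete
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + delete,      # delete from a
                          d[i][j - 1] + ins,         # insert into a
                          d[i - 1][j - 1] + cost)    # match or substitute
    return d[m][n]

indexed = "*DDUDDUDRDUDRD"   # "Three Blind Mice" as indexed
sung    = "*DDUDDUDRRUDRR"   # the off-key query from the slide
print(edit_distance(indexed, sung))   # 2 (the two substitutions)
```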

Muscle Fish Audio Retrieval Compute 4 acoustic features for each time slice –Pitch, amplitude, brightness, bandwidth Segment at major discontinuities –Find average, variance, and smoothness of segments Store pointers to segments in 13 sorted lists –Use a commercial database for proximity matching 4 features, 3 parameters for each, plus duration –Then rank order using statistical classification Display file name and audio
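
A sketch in the spirit of the per-slice features listed above, using common definitions: RMS level for amplitude, spectral centroid for brightness, and spectral spread for bandwidth, followed by per-segment average, variance, and smoothness statistics. This is not the Muscle Fish implementation, pitch estimation is omitted, and the test signal is a synthetic tone.

```python
import numpy as np

def frame_features(signal, rate, frame=1024):
    """Per-slice amplitude, brightness, and bandwidth for a mono signal."""
    rows = []
    freqs = np.fft.rfftfreq(frame, d=1.0 / rate)
    for start in range(0, len(signal) - frame, frame):
        x = signal[start:start + frame]
        amplitude = np.sqrt(np.mean(x ** 2))                     # RMS level
        spectrum = np.abs(np.fft.rfft(x)) + 1e-12
        brightness = np.sum(freqs * spectrum) / np.sum(spectrum)  # spectral centroid
        bandwidth = np.sqrt(np.sum(((freqs - brightness) ** 2) * spectrum)
                            / np.sum(spectrum))                   # spectral spread
        rows.append((amplitude, brightness, bandwidth))
    return np.array(rows)

def segment_statistics(features):
    """Average, variance, and smoothness (mean frame-to-frame change) per feature."""
    return {
        "mean": features.mean(axis=0),
        "variance": features.var(axis=0),
        "smoothness": np.abs(np.diff(features, axis=0)).mean(axis=0),
    }

# Synthetic one-second 440 Hz test tone instead of a real recording.
rate = 16000
t = np.arange(rate) / rate
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

stats = segment_statistics(frame_features(tone, rate))
print({k: np.round(v, 3) for k, v in stats.items()})
```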

Muscle Fish Audio Retrieval

Summary Limited audio indexing is practical now –Audio feature matching, answering machine detection Present interfaces focus on a single technology –Speech recognition, audio feature matching –Matching technology is outpacing interface design

-

October 1, 2001 LBSC 708R Speech-Based Retrieval Systems Douglas W. Oard, College of Library and Information Services, University of Maryland

The Size of the Problem 30,000 hours in the Maryland Libraries –Unique collections with limited physical access Over 100,000 hours in the National Archives –With new material arriving at an increasing rate Millions of hours broadcast each year –Over 2,500 radio stations are now Webcasting!

Outline Retrieval strategies Some examples Comparing speech and text retrieval Speech-based retrieval interface design

Global Internet Audio source: Mar 2001 Over 2500 Internet-accessible Radio and Television Stations

Shoah Foundation Collection 52,000 interviews –116,000 hours (13 years) –32 languages Full description cataloging –14,000 term thesaurus –4,000 interviews for $8 million

Speech Retrieval Approaches Controlled vocabulary indexing Ranked retrieval based on associated text Automatic feature-based indexing Social filtering based on other users’ ratings

Supporting the Search Process (process diagram): Source Selection → Query Formulation (Query) → Search (Ranked List) → Selection (Document) → Examination → Document Delivery, supported by the IR System, with feedback loops for Query Reformulation and Relevance Feedback and for Source Reselection. Other labels in the figure: Nominate, Choose, Predict.

HotBot Audio Search Results

ETH Zurich Radio News Retrieval

BBN Radio News Retrieval

AT&T Radio News Retrieval

MIT “Speech Skimmer”

Cambridge Video Mail Retrieval

CMU Television News Retrieval

Comparison with Text Retrieval Detection and ranking are harder –Because of speech recognition errors Selection is harder –Useful titles are sometimes hard to obtain –Date and time alone may not be informative Examination is harder –Browsing is harder in strictly linear media

A Richer View of Speech Speaker identification –Known speakers –Gender labeling –“More like this” searches Topic segmentation –Find natural breakpoints for browsing Speaker segmentation –Extract turn-taking behavior

Visualizing Turn-Taking

Other Available Features Channel characteristics –Cell phone, landline, studio mike,... Cultural factors –Language, accent, speaking rate Prosody –Emphasis detection Non-speech audio –Background sounds, audio cues

Competing Demands on the Interface Query must result in a manageable set –But users prefer simple query interfaces Selection interface must show several segments –Representations must be compact, but informative Rapid examination should be possible –But complete access to the recordings is desirable

The VoiceGraph Project Exploring rich queries –Content-based, speaker-based, structure-based Multiple cues in the selection interface –Turn-taking, gender, query terms Flexible examination –Text transcript, audio skims

Pilot Study Student focus groups –15 from Journalism, 3 from Library Science Preliminary drawing exercise Static screen shots and mock-ups Focused discussion User satisfaction questionnaire Structured interviews with domain experts –Journalism and Library Science faculty

Pilot Study Results Graphical speech representations appear viable –Expected to be useful for high level browsing When coupled with text transcripts and audio replay –Some training will be needed Suggested improvements –Adjust result set spacing to facilitate rapid selection –Identify categories (monologue, conversation, …) Potentially useful for search or browsing

For More Information Speech-based information retrieval – The VoiceGraph project –
