Audio Retrieval
LBSC 708A, Session 11, November 20, 2001
Philip Resnik
Agenda
Questions
Group thinking session
Speech retrieval
Music retrieval
Shoah Foundation Collection
52,000 interviews
– 116,000 hours (13 years)
– 32 languages
Full description cataloging
– 14,000-term thesaurus
– 4,000 interviews for $8 million
Audio Retrieval
We have already discussed three approaches
– Controlled vocabulary indexing
– Ranked retrieval based on associated captions
– Social filtering based on other users’ ratings
Today’s focus is on content-based retrieval
– The analogue of content-based text retrieval
Audio Retrieval
Retrospective retrieval applications
– Search music and nonprint media collections
– Electronic finding aids for sound archives
– Index audio files on the web
Information filtering applications
– An alerting service for a news bureau
– Answering machine detection for telemarketing
– An autotuner for a car radio
The Size of the Problem
30,000 hours in the Maryland Libraries
– Unique collections with limited physical access
116,000 hours in the Shoah collection
Millions of hours of streaming audio each year
– Becoming available worldwide on the web
Broadcast news (audio/video)
– e.g., television archives
HotBot Audio Search Results
Audio Genres
Speech-centered
– Radio programs
– Telephone conversations
– Recorded meetings
Music-centered
– Instrumental, vocal
Other sources
– Alarms, instrumentation, surveillance, …
Detectable Speech Features
Content
– Phonemes, one-best word recognition, n-best lists
Identity
– Speaker identification, speaker segmentation
Language
– Language, dialect, accent
Other measurable parameters
– Time, duration, channel, environment
How Speech Recognition Works
Three stages:
– What sounds were made? Convert the waveform into subword units (phonemes)
– How could the sounds be grouped into words? Identify the most probable word segmentation points
– Which of the possible words were spoken? Base the choice on the likelihood of possible multiword sequences
All three stages are learned from training data
– Using hill-climbing estimation of a hidden Markov model
Using Speech Recognition
A three-step pipeline (see the sketch below):
– Phone detection: the waveform is converted into phone n-grams and a phone lattice
– Word construction: a transcription dictionary turns the phone lattice into words (a word lattice)
– Word selection: a language model picks the one-best transcript from the word lattice
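To make word selection concrete, here is a minimal Python sketch of picking the one-best transcript from a word lattice with a bigram language model. It is illustrative only, not any of these systems' actual code; the lattice, probabilities, and names are invented toy data.

```python
from math import log

# Toy word lattice: position -> candidate words with acoustic scores.
lattice = [
    {"recognize": 0.6, "wreck a nice": 0.4},
    {"speech": 0.5, "beach": 0.5},
]

# Toy bigram language model: P(word | previous word).
bigram = {
    ("<s>", "recognize"): 0.01, ("<s>", "wreck a nice"): 0.001,
    ("recognize", "speech"): 0.2, ("recognize", "beach"): 0.001,
    ("wreck a nice", "speech"): 0.001, ("wreck a nice", "beach"): 0.1,
}

def best_transcript(lattice, bigram):
    """Viterbi search: keep only the best-scoring path to each word."""
    paths = {"<s>": 0.0}              # word -> best log-probability so far
    backpointer = {}
    for i, candidates in enumerate(lattice):
        new_paths = {}
        for word, acoustic in candidates.items():
            best_prev, best_score = max(
                ((prev, score
                  + log(bigram.get((prev, word), 1e-9))
                  + log(acoustic))
                 for prev, score in paths.items()),
                key=lambda pair: pair[1])
            new_paths[word] = best_score
            backpointer[(i, word)] = best_prev
        paths = new_paths
    word = max(paths, key=paths.get)  # best final word
    result = [word]
    for i in range(len(lattice) - 1, 0, -1):
        word = backpointer[(i, word)]
        result.append(word)
    return " ".join(reversed(result))

print(best_transcript(lattice, bigram))  # -> "recognize speech"
```

A real recognizer searches far larger lattices with acoustic and language models learned from training data, but the underlying dynamic program is the same.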
ETHZ Broadcast News Retrieval
Segment broadcasts into 20-second chunks
Index phoneme n-grams
– Overlapping one-best phoneme sequences
– Trained using native German speakers
Form phoneme trigrams from typed queries
– A rule-based system for “open” vocabulary
Vector space trigram matching
– Identify ranked segments by time
Phoneme Trigrams
“Manage” -> m ae n ih jh
– Dictionaries provide accurate transcriptions, but they are valid only for a single accent and dialect
– Rule-based transcription handles unknown words
Index every overlapping 3-phoneme sequence:
– m ae n
– ae n ih
– n ih jh
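A minimal sketch of how such trigram indexing and vector-space matching can work (illustrative Python, not the ETHZ code; the phoneme strings and segment IDs are invented):

```python
from collections import Counter
from math import sqrt

def trigrams(phonemes):
    """Every overlapping 3-phoneme sequence, e.g. for 'manage':
    (m ae n), (ae n ih), (n ih jh)."""
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]

def cosine(a, b):
    """Cosine similarity between two trigram-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Index: each 20-second segment becomes a bag of phoneme trigrams.
segments = {
    "seg-001": "m ae n ih jh m ax n t".split(),  # toy "management"
    "seg-002": "b r ao d k ae s t".split(),       # toy "broadcast"
}
index = {sid: Counter(trigrams(ph)) for sid, ph in segments.items()}

# Query: transcribe the typed word by rule, then match its trigrams.
query = Counter(trigrams("m ae n ih jh".split()))  # "manage"
ranked = sorted(index, key=lambda sid: cosine(query, index[sid]),
                reverse=True)
print(ranked)  # seg-001 ranks first
```

Because every overlapping trigram is indexed, a query word can match even when the recognizer heard it inside a longer phoneme stream, which is what keeps the vocabulary "open".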
ETHZ Broadcast News Retrieval
Cambridge Video Mail Retrieval
Added personal audio (and video) to email
– But subject lines were still typed on a keyboard
Indexed the most probable phoneme sequences
Cambridge Video Mail Retrieval
Translate queries to phonemes with a dictionary (see the sketch below)
– Skip stopwords and words with fewer than 3 phonemes
Find non-overlapping matches in the lattice
– Queries take about 4 seconds per hour of material
Vector space exact word matching
– No morphological variations are checked
– Normalize using the most probable phoneme sequence
Select from a ranked list of subject lines
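A hedged sketch of the query side just described: dictionary lookup that drops stopwords and very short transcriptions. The dictionary, stopword list, and threshold below are illustrative stand-ins, not the Cambridge system's data.

```python
STOPWORDS = {"the", "a", "of", "in", "and", "to"}

PRONOUNCING_DICT = {           # toy transcription dictionary
    "budget":  ["b", "ah", "jh", "ih", "t"],
    "meeting": ["m", "iy", "t", "ih", "ng"],
    "at":      ["ae", "t"],
}

def query_phonemes(query, min_phones=3):
    """Translate a typed query into phoneme strings to match against
    the lattice, skipping stopwords and words whose transcriptions are
    too short to match reliably."""
    result = []
    for word in query.lower().split():
        if word in STOPWORDS:
            continue
        phones = PRONOUNCING_DICT.get(word)
        if phones is None or len(phones) < min_phones:
            continue  # unknown or too short: no reliable lattice match
        result.append((word, phones))
    return result

print(query_phonemes("the budget meeting at a"))
# -> [('budget', [...]), ('meeting', [...])]  ('at' is too short)
```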
Contrast of Approaches
Rule-based transcription
– Potentially error-prone
– Broad coverage; handles unknown words
Dictionary-based transcription
– Good for smaller settings
– Accurate
Both are susceptible to the problem of variability
BBN Radio News Retrieval
AT&T Radio News Retrieval
IBM Broadcast News Retrieval
Large-vocabulary continuous speech recognition
– 64,000 word forms cover most utterances when suitable training data is available
– About 40% word error rate in the TREC-6 evaluation
– Slow indexing (1 hour of computation per hour of audio) limits collection size
Standard word-based vector space matching
– Nearly instant queries
– N-gram triage plus lattice matching for unknown words
A ranked list showing source and broadcast time
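The 40% figure is a word error rate. As a reminder of how that is computed, here is the standard definition (edit distance over words, divided by the reference length); the sentences are toy examples.

```python
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the budget was approved today",
          "a budget was improved today"))  # 2 errors / 5 words = 0.4
```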
Comparison with Text Retrieval
Detection is harder
– Speech recognition errors
Selection is harder
– Date and time are not very informative
Examination is harder
– A linear medium is hard to browse
– Arbitrary segments produce unnatural breaks
Speaker Identification
Gender
– Classify speakers as male or female
Identity
– Detect speech samples from the same speaker
– To assign a name, a known training sample is needed
Speaker segmentation (see the sketch below)
– Identify speaker changes
– Count the number of speakers
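As a toy illustration of speaker-change detection (one common approach, not attributed to any system named here): compare feature statistics in adjacent windows and flag a change where they differ sharply. Real systems use richer acoustic features and model-based criteria.

```python
def change_points(features, window=10, threshold=1.0):
    """features: one acoustic feature value per frame (toy: 1-D).
    Flag frames where adjacent-window means differ by > threshold."""
    changes = []
    for t in range(window, len(features) - window):
        left = features[t - window:t]
        right = features[t:t + window]
        gap = abs(sum(left) / window - sum(right) / window)
        if gap > threshold:
            changes.append(t)
    return changes

# Toy data: 20 frames of one "speaker", then 20 of another.
frames = [1.0] * 20 + [3.0] * 20
print(change_points(frames))  # a run of frames near index 20
```

Counting speakers then reduces to clustering the segments between detected changes.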
A Richer View of Speech
Speaker identification
– Known-speaker and “more like this” searches
– Gender detection for search and browsing
Topic segmentation via vocabulary shift
– More natural breakpoints for browsing
Speaker segmentation
– Visualize turn-taking behavior for browsing
– Classify turn-taking patterns for searching
Other Possibly Useful Features
Channel characteristics
– Cell phone, landline, studio mike, …
Accent
– Another way of grouping speakers
Prosody
– Detecting emphasis could help search or browsing
Non-speech audio
– Background sounds, audio cues
Competing Demands on the Interface
The query must result in a manageable set
– But users prefer simple query interfaces
The selection interface must show several segments
– Representations must be compact but informative
Rapid examination should be possible
– But complete access to the recordings is desirable
Iterative Prototyping Strategy
Select a user group and a collection
Observe information-seeking behaviors
– To identify effective search strategies
Refine the interface
– To support effective search strategies
Integrate needed speech technologies
Evaluate the improvements with user studies
– And observe changes in effective search strategies
The VoiceGraph Project
Exploring rich queries
– Content-based, speaker-based, structure-based
Multiple cues in the selection interface
– Turn-taking, gender, query terms
Flexible examination
– Text transcripts, audio skims
Depicting Turn-Taking Behavior
Time is depicted from left to right
Speakers are separated vertically within a depiction
Depictions are stacked vertically in rank order
Actual recordings are more complex
[Figure: four turn-taking depictions, ranked 1 through 4]
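One way such a depiction could be drawn, sketched with matplotlib (an assumption for illustration; this is not the VoiceGraph code, and the turn data is invented): each speaker gets a horizontal track, and each speaking turn becomes a bar running from its start time for its duration.

```python
import matplotlib.pyplot as plt

# (speaker, start_seconds, duration_seconds) for one recording
turns = [
    ("A", 0, 12), ("B", 12, 5), ("A", 17, 8),
    ("C", 25, 15), ("B", 40, 10),
]

speakers = sorted({spk for spk, _, _ in turns})
fig, ax = plt.subplots(figsize=(8, 2))
for row, spk in enumerate(speakers):
    bars = [(start, dur) for s, start, dur in turns if s == spk]
    ax.broken_barh(bars, (row - 0.4, 0.8))  # (y position, bar height)
ax.set_yticks(range(len(speakers)))
ax.set_yticklabels(speakers)
ax.set_xlabel("time (seconds)")
plt.show()
```

Stacking one such strip per retrieved recording, in rank order, gives the selection display the slide describes.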
Bootstrapping the Prototype
Select a user population and a collection
– Journalists and historians
– Broadcast news from the 1960s and 1970s
Mock up an interface
– A pilot study to see if we’re on the right track
Integrate “back end” speech processing
– Recognition, identification, segmentation, …
Observe information-seeking behaviors
New Zealand Melody Index
Index musical tunes as contour patterns
– Rising, descending, and repeated pitches
– Note duration as a measure of rhythm
Users sing queries using words or “la”, “da”, …
– Pitch tracking accommodates off-key queries
Rank order using approximate string matching
– Insert, delete, substitute, consolidate, fragment
Display title, sheet music, and audio
Contour Matching Example
“Three Blind Mice” is indexed as *DDUDDUDRDUDRD
– * represents the first note
– D represents a descending pitch (U is ascending)
– R represents a repetition (a detectable split at the same pitch)
My singing produces *DDUDDUDRRUDRR
Approximate string matching finds 2 substitutions
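A minimal sketch of the approximate match on contour strings. The full method also allows consolidate and fragment operations; this sketch implements only insert, delete, and substitute, which already suffices for the example above.

```python
def contour_distance(indexed, sung):
    # d[i][j] = cheapest way to align indexed[:i] with sung[:j]
    d = [[0] * (len(sung) + 1) for _ in range(len(indexed) + 1)]
    for i in range(len(indexed) + 1):
        d[i][0] = i
    for j in range(len(sung) + 1):
        d[0][j] = j
    for i in range(1, len(indexed) + 1):
        for j in range(1, len(sung) + 1):
            sub = 0 if indexed[i - 1] == sung[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete a note
                          d[i][j - 1] + 1,        # insert a note
                          d[i - 1][j - 1] + sub)  # substitute (or match)
    return d[-1][-1]

print(contour_distance("*DDUDDUDRDUDRD", "*DDUDDUDRRUDRR"))  # -> 2
```

Ranking tunes by this distance tolerates the off-key and mis-remembered notes that sung queries inevitably contain.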
Muscle Fish Audio Retrieval
Compute 4 acoustic features for each time slice
– Pitch, amplitude, brightness, bandwidth
Segment at major discontinuities
– Find the average, variance, and smoothness of each segment
Store pointers to segments in 13 sorted lists
– 4 features × 3 statistics each, plus duration (see the sketch below)
– Use a commercial database for proximity matching
Then rank order using statistical classification
Display file name and audio
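An illustrative sketch of the 13-value segment signature: mean, variance, and smoothness for each of the four per-slice features, plus duration. The smoothness definition used here (mean absolute slice-to-slice change) and the slice length are assumptions for illustration, not the Muscle Fish definitions.

```python
from statistics import mean, pvariance

FEATURES = ["pitch", "amplitude", "brightness", "bandwidth"]

def signature(slices, slice_seconds=0.01):
    """slices: list of dicts, one per time slice, keyed by FEATURES."""
    sig = {}
    for f in FEATURES:
        values = [s[f] for s in slices]
        deltas = [abs(b - a) for a, b in zip(values, values[1:])]
        sig[f + "_mean"] = mean(values)
        sig[f + "_var"] = pvariance(values)
        sig[f + "_smooth"] = mean(deltas) if deltas else 0.0  # assumed
    sig["duration"] = len(slices) * slice_seconds  # the 13th value
    return sig

# Toy segment: three 10 ms slices.
segment = [
    {"pitch": 220, "amplitude": 0.50, "brightness": 0.30, "bandwidth": 0.20},
    {"pitch": 222, "amplitude": 0.60, "brightness": 0.30, "bandwidth": 0.25},
    {"pitch": 219, "amplitude": 0.55, "brightness": 0.35, "bandwidth": 0.20},
]
print(len(signature(segment)))  # -> 13
```

Each of the 13 values feeds one sorted list, so a database range query can find segments close to a query segment on any dimension.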
Muscle Fish Audio Retrieval
Summary
Limited audio indexing is practical now
– Audio feature matching, answering machine detection
Present interfaces focus on a single technology
– Speech recognition, audio feature matching
– Matching technology is outpacing interface design
Speech-Based Retrieval Systems
Douglas W. Oard
College of Library and Information Services, University of Maryland
LBSC 708R, October 1, 2001
The Size of the Problem
30,000 hours in the Maryland Libraries
– Unique collections with limited physical access
Over 100,000 hours in the National Archives
– With new material arriving at an increasing rate
Millions of hours broadcast each year
– Over 2,500 radio stations are now webcasting!
Outline
Retrieval strategies
Some examples
Comparing speech and text retrieval
Speech-based retrieval interface design
Global Internet Audio
Over 2,500 Internet-accessible radio and television stations
(Source: www.real.com, March 2001)
Shoah Foundation Collection
52,000 interviews
– 116,000 hours (13 years)
– 32 languages
Full description cataloging
– 14,000-term thesaurus
– 4,000 interviews for $8 million
Speech Retrieval Approaches
Controlled vocabulary indexing
Ranked retrieval based on associated text
Automatic feature-based indexing
Social filtering based on other users’ ratings
Supporting the Search Process
[Flow diagram: Source Selection → Query Formulation → Query → Search (IR System) → Ranked List → Selection → Document → Examination → Document Delivery, with loops for Query Reformulation and Relevance Feedback and for Source Reselection; stages labeled Nominate, Choose, Predict]
HotBot Audio Search Results
ETH Zurich Radio News Retrieval
BBN Radio News Retrieval
AT&T Radio News Retrieval
MIT “Speech Skimmer”
Cambridge Video Mail Retrieval
CMU Television News Retrieval
Comparison with Text Retrieval
Detection and ranking are harder
– Because of speech recognition errors
Selection is harder
– Useful titles are sometimes hard to obtain
– Date and time alone may not be informative
Examination is harder
– Browsing is harder in strictly linear media
A Richer View of Speech
Speaker identification
– Known speakers
– Gender labeling
– “More like this” searches
Topic segmentation
– Find natural breakpoints for browsing
Speaker segmentation
– Extract turn-taking behavior
Visualizing Turn-Taking
Other Available Features
Channel characteristics
– Cell phone, landline, studio mike, …
Cultural factors
– Language, accent, speaking rate
Prosody
– Emphasis detection
Non-speech audio
– Background sounds, audio cues
Competing Demands on the Interface
The query must result in a manageable set
– But users prefer simple query interfaces
The selection interface must show several segments
– Representations must be compact but informative
Rapid examination should be possible
– But complete access to the recordings is desirable
The VoiceGraph Project
Exploring rich queries
– Content-based, speaker-based, structure-based
Multiple cues in the selection interface
– Turn-taking, gender, query terms
Flexible examination
– Text transcripts, audio skims
Pilot Study
Student focus groups
– 15 from Journalism, 3 from Library Science
Preliminary drawing exercise
Static screen shots and mock-ups
Focused discussion
User satisfaction questionnaire
Structured interviews with domain experts
– Journalism and Library Science faculty
Pilot Study Results
Graphical speech representations appear viable
– Expected to be useful for high-level browsing when coupled with text transcripts and audio replay
– Some training will be needed
Suggested improvements
– Adjust result-set spacing to facilitate rapid selection
– Identify categories (monologue, conversation, …), potentially useful for search or browsing
For More Information
Speech-based information retrieval
– http://www.clis.umd.edu/dlrg/speech/
The VoiceGraph project
– http://www.clis.umd.edu/dlrg/voicegraph/
Comparison with Text Retrieval
Detection is harder
– Speech recognition errors
Selection is harder
– Date and time are not very informative
Examination is harder
– A linear medium is hard to browse
– Arbitrary segments produce unnatural breaks