Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.


2 Agenda Questions Group thinking session Speech retrieval Music retrieval

3 Shoah Foundation Collection 52,000 interviews –116,000 hours (13 years) –32 languages Full description cataloging –14,000 term thesaurus –4,000 interviews for $8 million

4 Audio Retrieval We have already discussed three approaches –Controlled vocabulary indexing –Ranked retrieval based on associated captions –Social filtering based on other users’ ratings Today’s focus is on content-based retrieval –Analogue of content-based text retrieval

5 Audio Retrieval Retrospective retrieval applications –Search music and nonprint media collections –Electronic finding aids for sound archives –Index audio files on the web Information filtering applications –Alerting service for a news bureau –Answering machine detection for telemarketing –Autotuner for a car radio

The Size of the Problem 30,000 hours in the Maryland Libraries –Unique collections with limited physical access 116,000 hours in the Shoah collection Millions of hours of streaming audio each year –Becoming available worldwide on the web Broadcast news (audio/video) –Ex. Television archive


8 HotBot Audio Search Results

9 Audio Genres Speech-centered –Radio programs –Telephone conversations –Recorded meetings Music-centered –Instrumental, vocal Other sources –Alarms, instrumentation, surveillance, …

10 Detectable Speech Features Content –Phonemes, one-best word recognition, n-best Identity –Speaker identification, speaker segmentation Language –Language, dialect, accent Other measurable parameters –Time, duration, channel, environment

11 How Speech Recognition Works Three stages –What sounds were made? Convert from waveform to subword units (phonemes) –How could the sounds be grouped into words? Identify the most probable word segmentation points –Which of the possible words were spoken? Based on likelihood of possible multiword sequences All three stages are learned from training data –Using hill climbing to train a statistical model (a “Hidden Markov Model”)

12 Using Speech Recognition [Diagram labels: Phone Detection, Word Construction, Word Selection, Phone n-grams, Phone lattice, Words, Transcription dictionary, Language model, One-best transcript, Word lattice]

13 ETHZ Broadcast News Retrieval Segment broadcasts into 20 second chunks Index phoneme n-grams –Overlapping one-best phoneme sequences –Trained using native German speakers Form phoneme trigrams from typed queries –Rule-based system for “open” vocabulary Vector space trigram matching –Identify ranked segments by time

14 Phoneme Trigrams Manage -> m ae n ih jh –Dictionaries provide accurate transcriptions But valid only for a single accent and dialect –Rule-based transcription handles unknown words Index every overlapping 3-phoneme sequence –m ae n –ae n ih –n ih jh
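
The trigram indexing step can be sketched in a few lines. This is a minimal illustration only: the function name phoneme_trigrams is an assumption, and the phoneme string is the transcription of "manage" given on the slide.

```python
def phoneme_trigrams(phonemes):
    """Return every overlapping 3-phoneme sequence, in order of occurrence."""
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]

# Transcription of "manage" as given on the slide.
manage = "m ae n ih jh".split()
trigrams = phoneme_trigrams(manage)

for t in trigrams:
    print(" ".join(t))
```

A sequence shorter than three phonemes yields no trigrams at all, which is one reason a system built this way might choose to skip very short query words.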

15 ETHZ Broadcast News Retrieval
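
Vector space trigram matching, as named for the ETHZ system, might look roughly like the sketch below. It assumes raw trigram counts and cosine similarity; the actual term weighting used at ETHZ is not stated on the slides, and the segments, their start times, and the function names are hypothetical.

```python
import math
from collections import Counter

def trigram_counts(phonemes):
    """Count overlapping phoneme trigrams in a phoneme sequence."""
    return Counter(tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_segments(query_phonemes, segments):
    """Return (score, start_time) pairs, best match first."""
    query = trigram_counts(query_phonemes)
    scored = [(cosine(query, trigram_counts(p)), start) for start, p in segments]
    return sorted(scored, key=lambda pair: -pair[0])

# Hypothetical one-best phoneme output for three 20 second segments,
# keyed by start time in seconds.
segments = [
    (0, "dh ax m ae n ih jh er s eh d".split()),
    (20, "w eh dh er f ao r k ae s t".split()),
    (40, "m ae n ih jh m ax n t".split()),
]
ranking = rank_segments("m ae n ih jh".split(), segments)
for score, start in ranking:
    print(f"{start:3d}s  {score:.3f}")
```

Because matching is on phoneme trigrams rather than words, a query can partially match a segment even when the recognizer never produced the query word itself.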

16 Cambridge Video Mail Retrieval Added personal audio (and video) to email –But subject lines still typed on a keyboard Indexed most probable phoneme sequences

17 Cambridge Video Mail Retrieval Translate queries to phonemes with dictionary –Skip stopwords and words with  3 phonemes Find no-overlap matches in the lattice –Queries take about 4 seconds per hour of material Vector space exact word match –No morphological variations checked –Normalize using most probable phoneme sequence Select from a ranked list of subject lines



20 Contrast of Approaches Rule-based transcription –Potentially errorful –Broad coverage, handles unknown words Dictionary-based transcription –Good for smaller settings –Accurate Both susceptible to the problem of variability

21 BBN Radio News Retrieval

22 AT&T Radio News Retrieval

23 IBM Broadcast News Retrieval Large vocabulary continuous speech recognition –64,000 word forms covers most utterances When suitable training data is available About 40% word error rate in the TREC 6 evaluation –Slow indexing (1 hour per hour) limits collection size Standard word-based vector space matching –Nearly instant queries –N-gram triage plus lattice match for unknown words Ranked list showing source and broadcast time
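
Word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between the recognizer output and a reference transcript, divided by the number of reference words. The sketch below assumes that standard definition; the sentences are invented for illustration and are not from the TREC 6 data.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Assumes the reference transcript is non-empty.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Invented example: two of five reference words are misrecognized.
wer = word_error_rate("the minister spoke to parliament",
                      "a minister spoke two parliament")
print(f"{wer:.0%}")
```

At a 40% error rate, retrieval still works because relevant segments tend to repeat their content words, so some occurrences survive recognition.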

24 Comparison with Text Retrieval Detection is harder –Speech recognition errors Selection is harder –Date and time are not very informative Examination is harder –Linear medium is hard to browse –Arbitrary segments produce unnatural breaks

25 Speaker Identification Gender –Classify speakers as male or female Identity –Detect speech samples from same speaker –To assign a name, need a known training sample Speaker segmentation – Identify speaker changes –Count number of speakers

26 A Richer View of Speech Speaker identification –Known speaker and “more like this” searches –Gender detection for search and browsing Topic segmentation via vocabulary shift –More natural breakpoints for browsing Speaker segmentation –Visualize turn-taking behavior for browsing –Classify turn-taking patterns for searching
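
One way to picture topic segmentation via vocabulary shift is to compare the words on either side of each candidate breakpoint and place the boundary where the two sides share the least vocabulary. The sketch below is an assumed illustration of that idea, not the method of any system named in these slides; the word sequence, window size, and function names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def gap_similarities(words, window):
    """Map each gap position i to the similarity of the windows around it."""
    sims = {}
    for i in range(window, len(words) - window + 1):
        left = Counter(words[i - window:i])
        right = Counter(words[i:i + window])
        sims[i] = cosine(left, right)
    return sims

# Hypothetical recognizer output (content words only) covering two stories.
words = ("rain storm rain flood storm rain "
         "vote senate vote election senate vote").split()
sims = gap_similarities(words, window=3)
boundary = min(sims, key=sims.get)  # gap with the largest vocabulary shift
print(boundary, words[boundary])
```

In practice the windows would be far longer than three words, and recognition errors would blur the similarity curve, so a real system would likely smooth it before picking breakpoints.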

27 Other Possibly Useful Features Channel characteristics –Cell phone, landline, studio mike,... Accent –Another way of grouping speakers Prosody –Detecting emphasis could help search or browsing Non-speech audio –Background sounds, audio cues

28 Competing Demands on the Interface Query must result in a manageable set –But users prefer simple query interfaces Selection interface must show several segments –Representations must be compact, but informative Rapid examination should be possible –But complete access to the recordings is desirable

29 Iterative Prototyping Strategy Select a user group and a collection Observe information seeking behaviors –To identify effective search strategies Refine the interface –To support effective search strategies Integrate needed speech technologies Evaluate the improvements with user studies –And observe changes to effective search strategies

30 The VoiceGraph Project Exploring rich queries –Content-based, speaker-based, structure-based Multiple cues in the selection interface –Turn-taking, gender, query terms Flexible examination –Text transcript, audio skims

31 Depicting Turn Taking Behavior Time is depicted from left to right Speakers separated vertically within a depiction Depictions stacked vertically in rank order Actual recordings are more complex



34 Bootstrapping the Prototype Select a user population and a collection –Journalists and historians –Broadcast news from the 1960s and 1970s Mock up an interface –Pilot study to see if we’re on the right track Integrate “back end” speech processing –Recognition, identification, segmentation,... Observe information seeking behaviors

35 New Zealand Melody Index Index musical tunes as contour patterns –Rising, descending, and repeated pitch –Note duration as a measure of rhythm Users sing queries using words or la, da, … –Pitch tracking accommodates off-key queries Rank order using approximate string match –Insert, delete, substitute, consolidate, fragment Display title, sheet music, and audio
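
Contour indexing can be sketched as follows, assuming pitch tracking has already produced one pitch value per note. The symbols are * for the first note, U for rising, D for descending, and R for repeated pitch. The function name, the tolerance parameter, and the MIDI note numbers (an assumed transcription of the opening of "Three Blind Mice") are illustrative assumptions, not details of the New Zealand system.

```python
def contour(pitches, tolerance=0.0):
    """Encode a note sequence as a contour string.

    '*' marks the first note; each later note is 'U' (up), 'D' (down),
    or 'R' (repeat) relative to the note before it. Pitch changes no
    larger than `tolerance` semitones are treated as repeats.
    """
    symbols = ["*"]
    for prev, curr in zip(pitches, pitches[1:]):
        if curr - prev > tolerance:
            symbols.append("U")
        elif prev - curr > tolerance:
            symbols.append("D")
        else:
            symbols.append("R")
    return "".join(symbols)

# Assumed transcription as MIDI note numbers: E D C, E D C, G F F E, G F F E.
three_blind_mice = [64, 62, 60, 64, 62, 60, 67, 65, 65, 64, 67, 65, 65, 64]
print(contour(three_blind_mice))

# A slightly off-key sung repeat still maps to 'R' with some tolerance.
print(contour([60.0, 60.3, 62.1], tolerance=0.5))
```

Keeping only the direction of each pitch change discards the exact intervals, which is what lets a singer who is off-key still produce a usable query.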

36 Contour Matching Example “Three Blind Mice” is indexed as: –*DDUDDUDRDUDRD * represents the first note D represents a descending pitch (U is ascending) R represents a repetition (detectable split, same pitch) My singing produces: –*DDUDDUDRRUDRR Approximate string match finds 2 substitutions
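
The approximate string match in this example can be reproduced with a standard edit distance. The sketch below models only insert, delete, and substitute; the consolidate and fragment operations mentioned for the melody index are not modeled, and the function name is an assumption.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]

indexed = "*DDUDDUDRDUDRD"  # "Three Blind Mice" as indexed
sung = "*DDUDDUDRRUDRR"     # the sung query from the slide
distance = edit_distance(indexed, sung)
print(distance)
```

Ranking then amounts to computing this distance between the query and every indexed tune and listing the smallest distances first.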


38 Muscle Fish Audio Retrieval Compute 4 acoustic features for each time slice –Pitch, amplitude, brightness, bandwidth Segment at major discontinuities –Find average, variance, and smoothness of segments Store pointers to segments in 13 sorted lists –Use a commercial database for proximity matching 4 features, 3 parameters for each, plus duration –Then rank order using statistical classification Display file name and audio
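
The count of 13 follows from 4 features times 3 statistics, plus duration. The sketch below builds such a vector for one segment. It assumes lag-1 autocorrelation as the smoothness measure, which the slide does not specify, and the feature values, slice length, and function names are hypothetical.

```python
from statistics import fmean, pvariance

FEATURES = ("pitch", "amplitude", "brightness", "bandwidth")

def smoothness(values):
    """Lag-1 autocorrelation, an assumed stand-in for 'smoothness'."""
    mean = fmean(values)
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return 1.0  # a constant trajectory is treated as perfectly smooth
    num = sum((a - mean) * (b - mean) for a, b in zip(values, values[1:]))
    return num / denom

def segment_vector(segment, slice_seconds):
    """Return [mean, variance, smoothness] per feature, then duration."""
    vector = []
    for name in FEATURES:
        values = segment[name]
        vector += [fmean(values), pvariance(values), smoothness(values)]
    vector.append(len(segment[FEATURES[0]]) * slice_seconds)
    return vector

# Hypothetical per-slice measurements for one segment (4 time slices).
segment = {
    "pitch": [220.0, 222.0, 221.0, 219.0],
    "amplitude": [0.5, 0.6, 0.6, 0.5],
    "brightness": [1500.0, 1520.0, 1510.0, 1490.0],
    "bandwidth": [800.0, 800.0, 800.0, 800.0],
}
vector = segment_vector(segment, slice_seconds=0.01)
print(len(vector))
```

Each of the 13 positions would correspond to one sorted list, so that segments with nearby values on any dimension can be found by proximity lookup.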

39 Muscle Fish Audio Retrieval

40 Summary Limited audio indexing is practical now –Audio feature matching, answering machine detection Present interfaces focus on a single technology –Speech recognition, audio feature matching –Matching technology is outpacing interface design

42 October 1, 2001 LBSC 708R Speech-Based Retrieval Systems Douglas W. Oard College of Library and Information Services University of Maryland

43 The Size of the Problem 30,000 hours in the Maryland Libraries –Unique collections with limited physical access Over 100,000 hours in the National Archives –With new material arriving at an increasing rate Millions of hours broadcast each year –Over 2,500 radio stations are now Webcasting!

44 Outline Retrieval strategies Some examples Comparing speech and text retrieval Speech-based retrieval interface design

45 Global Internet Audio source:, Mar 2001 Over 2500 Internet-accessible Radio and Television Stations

46 Shoah Foundation Collection 52,000 interviews –116,000 hours (13 years) –32 languages Full description cataloging –14,000 term thesaurus –4,000 interviews for $8 million

47 Speech Retrieval Approaches Controlled vocabulary indexing Ranked retrieval based on associated text  Automatic feature-based indexing Social filtering based on other users’ ratings

48 Supporting the Search Process [Diagram labels: Source Selection, Search, Query, Selection, Ranked List, Examination, Document, Delivery, Document, Query Formulation, IR System, Query Reformulation and Relevance Feedback, Source Reselection, Nominate, Choose, Predict]


50 HotBot Audio Search Results

51 ETH Zurich Radio News Retrieval

52 BBN Radio News Retrieval

53 AT&T Radio News Retrieval

54 MIT “Speech Skimmer”

55 Cambridge Video Mail Retrieval


57 CMU Television News Retrieval

58 Comparison with Text Retrieval Detection and ranking are harder –Because of speech recognition errors Selection is harder –Useful titles are sometimes hard to obtain –Date and time alone may not be informative Examination is harder –Browsing is harder in strictly linear media

59 A Richer View of Speech Speaker identification –Known speakers –Gender labeling –“More like this” searches Topic segmentation –Find natural breakpoints for browsing Speaker segmentation –Extract turn-taking behavior

60 Visualizing Turn-Taking

61 Other Available Features Channel characteristics –Cell phone, landline, studio mike,... Cultural factors –Language, accent, speaking rate Prosody –Emphasis detection Non-speech audio –Background sounds, audio cues

62 Competing Demands on the Interface Query must result in a manageable set –But users prefer simple query interfaces Selection interface must show several segments –Representations must be compact, but informative Rapid examination should be possible –But complete access to the recordings is desirable

63 The VoiceGraph Project Exploring rich queries –Content-based, speaker-based, structure-based Multiple cues in the selection interface –Turn-taking, gender, query terms Flexible examination –Text transcript, audio skims

64 Pilot Study Student focus groups –15 from Journalism, 3 from Library Science Preliminary drawing exercise Static screen shots and mock-ups Focused discussion User satisfaction questionnaire Structured interviews with domain experts –Journalism and Library Science faculty

65 Pilot Study Results Graphical speech representations appear viable –Expected to be useful for high level browsing When coupled with text transcripts and audio replay –Some training will be needed Suggested improvements –Adjust result set spacing to facilitate rapid selection –Identify categories (monologue, conversation, …) Potentially useful for search or browsing

66 For More Information Speech-based information retrieval – The VoiceGraph project –


