Speech and Music Retrieval LBSC 796/CMSC828o Session 12, April 19, 2004 Douglas W. Oard.

1 Speech and Music Retrieval LBSC 796/CMSC828o Session 12, April 19, 2004 Douglas W. Oard

2 Agenda Questions Speech retrieval Music retrieval

3 Spoken Word Collections Broadcast programming –News, interview, talk radio, sports, entertainment Scripted stories –Books on tape, poetry reading, theater Spontaneous storytelling –Oral history, folklore Incidental recording –Speeches, oral arguments, meetings, phone calls

4 Some Statistics 2,000 U.S. radio stations webcasting 250,000 hours of oral history in British Library 35 million audio streams indexed by SingingFish –Over 1 million searches per day ~100 billion hours of phone calls each year

5 Economics of the Web

                                         Text in 1995   Speech in 2003
Storage (words per $)                    300K           1.5M
Internet Backbone (simultaneous users)   250K           30M
Modem Capacity (% utilization)           100%           20%
Display Capability (% US population)     10%            38%
Search Systems                           Lycos, Yahoo   SpeechBot, SingingFish

6 Audio Retrieval Retrospective retrieval applications –Search music and nonprint media collections –Electronic finding aids for sound archives –Index audio files on the web Information filtering applications –Alerting service for a news bureau –Answering machine detection for telemarketing –Autotuner for a car radio

7 The Size of the Problem 30,000 hours in the Maryland Libraries –Unique collections with limited physical access Over 100,000 hours in the National Archives –With new material arriving at an increasing rate Millions of hours broadcast each year –Over 2,500 radio stations are now Webcasting!

8 Speech Retrieval Approaches Controlled vocabulary indexing Ranked retrieval based on associated text Automatic feature-based indexing Social filtering based on other users’ ratings

9 Supporting Information Access (process diagram): Source Selection → Query Formulation → Search (via the search system) → Selection from a Ranked List → Examination of a Recording → Delivery of the Recording, with feedback loops for Query Reformulation and Relevance Feedback and for Source Reselection

10 Description Strategies Transcription –Manual transcription (with optional post-editing) Annotation –Manually assign descriptors to points in a recording –Recommender systems (ratings, link analysis, …) Associated materials –Interviewer’s notes, speech scripts, producer’s logs Automatic –Create access points with automatic speech processing

11 (image-only slide)

12 HotBot Audio Search Results

13 Detectable Speech Features Content –Phonemes, one-best word recognition, n-best Identity –Speaker identification, speaker segmentation Language –Language, dialect, accent Other measurable parameters –Time, duration, channel, environment

14 How Speech Recognition Works Three stages –What sounds were made? Convert from waveform to subword units (phonemes) –How could the sounds be grouped into words? Identify the most probable word segmentation points –Which of the possible words were spoken? Based on likelihood of possible multiword sequences All three stages are learned from training data –Using hill climbing to train Hidden Markov Models
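
Picking the most probable hidden sequence at each stage is classic Viterbi decoding over an HMM. Below is a minimal sketch (Python, chosen only for illustration); the two-phoneme model, its probabilities, and the "lo"/"hi" observation symbols are all invented for the example, not taken from any real recognizer:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state (phoneme) sequence for an observation sequence."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    paths = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_paths = {}
        for s in states:
            # Best predecessor for state s at time t
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            new_paths[s] = paths[prev] + [s]
        paths = new_paths
    best = max(states, key=lambda s: V[-1][s])
    return paths[best], V[-1][best]

# Toy two-phoneme model; every probability here is invented for illustration.
states = ["m", "ae"]
start_p = {"m": 0.6, "ae": 0.4}
trans_p = {"m": {"m": 0.3, "ae": 0.7}, "ae": {"m": 0.4, "ae": 0.6}}
emit_p = {"m": {"lo": 0.8, "hi": 0.2}, "ae": {"lo": 0.3, "hi": 0.7}}

print(viterbi(["lo", "hi", "hi"], states, start_p, trans_p, emit_p))
```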

15 Using Speech Recognition (pipeline diagram): Phone Detection produces phone n-grams and a phone lattice; Word Construction uses a transcription dictionary to form words; Word Selection uses a language model to produce a one-best transcript and a word lattice

16 ETHZ Broadcast News Retrieval Segment broadcasts into 20 second chunks Index phoneme n-grams –Overlapping one-best phoneme sequences –Trained using native German speakers Form phoneme trigrams from typed queries –Rule-based system for “open” vocabulary Vector space trigram matching –Identify ranked segments by time

17 Phoneme Trigrams Manage -> m ae n ih jh –Dictionaries provide accurate transcriptions But valid only for a single accent and dialect –Rule-based transcription handles unknown words Index every overlapping 3-phoneme sequence –m ae n –ae n ih –n ih jh
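
The trigram indexing step above is a one-liner in practice. A minimal sketch (Python, used here purely for illustration), applied to the "manage" transcription from the slide:

```python
def phoneme_trigrams(phonemes):
    """Every overlapping 3-phoneme sequence from a phonemic transcription."""
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]

# "manage" -> m ae n ih jh, per the dictionary example above
print(phoneme_trigrams(["m", "ae", "n", "ih", "jh"]))
# [('m', 'ae', 'n'), ('ae', 'n', 'ih'), ('n', 'ih', 'jh')]
```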

18 ETHZ Broadcast News Retrieval

19 Cambridge Video Mail Retrieval Added personal audio (and video) to email –But subject lines still typed on a keyboard Indexed most probable phoneme sequences

20 Key Results from TREC/TDT Recognition and retrieval can be decomposed –Word recognition/retrieval works well in English Retrieval is robust with recognition errors –Up to 40% word error rate is tolerable Retrieval is robust with segmentation errors –Vocabulary shift/pauses provide strong cues

21 Cambridge Video Mail Retrieval Translate queries to phonemes with dictionary –Skip stopwords and words with fewer than 3 phonemes Find no-overlap matches in the lattice –Queries take about 4 seconds per hour of material Vector space exact word match –No morphological variations checked –Normalize using most probable phoneme sequence Select from a ranked list of subject lines
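
A minimal sketch of that query-side processing; the stopword list and pronunciation dictionary here are toys standing in for the real resources, and the short-word cutoff follows the slide:

```python
STOPWORDS = {"the", "a", "of"}           # toy stopword list (illustrative)
DICTIONARY = {                            # toy pronunciation dictionary
    "video": ["v", "ih", "d", "iy", "ow"],
    "mail":  ["m", "ey", "l"],
    "new":   ["n", "uw"],
}

def query_to_phonemes(query):
    """Phoneme sequences for query words, skipping stopwords and
    words with fewer than 3 phonemes (short sequences match too easily)."""
    out = []
    for word in query.lower().split():
        if word in STOPWORDS or word not in DICTIONARY:
            continue
        phones = DICTIONARY[word]
        if len(phones) < 3:
            continue
        out.append((word, phones))
    return out

print(query_to_phonemes("the new video mail"))
# [('video', ['v', 'ih', 'd', 'iy', 'ow']), ('mail', ['m', 'ey', 'l'])]
```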

22 (image-only slide)

23 (image-only slide)

24 Visualizing Turn-Taking

25 MIT “Speech Skimmer”

26 BBN Radio News Retrieval

27 AT&T Radio News Retrieval

28 IBM Broadcast News Retrieval Large vocabulary continuous speech recognition –64,000 word forms cover most utterances When suitable training data is available About 40% word error rate in the TREC 6 evaluation –Slow indexing (1 hour per hour) limits collection size Standard word-based vector space matching –Nearly instant queries –N-gram triage plus lattice match for unknown words Ranked list showing source and broadcast time
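
Word-based vector space matching over (error-prone) transcripts can be sketched as plain TF-IDF with cosine scoring. This is a generic stand-in, not IBM's actual term weighting, and the two tiny "transcripts" are invented:

```python
import math
from collections import Counter

def build_index(docs):
    """TF-IDF vectors for tokenized transcripts, plus the IDF table."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(1 + n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs], idf

def rank(query, vectors, idf):
    """Cosine similarity between the query and each document vector."""
    q = {t: idf.get(t, 0.0) for t in query}
    qn = math.sqrt(sum(v * v for v in q.values()))
    scores = []
    for vec in vectors:
        num = sum(q[t] * vec.get(t, 0.0) for t in q)
        dn = math.sqrt(sum(v * v for v in vec.values()))
        scores.append(num / (qn * dn) if qn and dn else 0.0)
    return sorted(enumerate(scores), key=lambda x: -x[1])

transcripts = [["fire", "in", "tunnel"], ["election", "results", "today"]]
vectors, idf = build_index(transcripts)
print(rank(["tunnel", "fire"], vectors, idf))  # document 0 ranks first
```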

29 Comparison with Text Retrieval Detection is harder –Speech recognition errors Selection is harder –Date and time are not very informative Examination is harder –Linear medium is hard to browse –Arbitrary segments produce unnatural breaks

30 Speaker Identification Gender –Classify speakers as male or female Identity –Detect speech samples from same speaker –To assign a name, need a known training sample Speaker segmentation – Identify speaker changes –Count number of speakers
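
None of these tasks reduces to a few lines, but gender classification can at least be caricatured with a pitch threshold. This toy sketch assumes a fundamental-frequency track is already available; the 165 Hz threshold is only a rough rule of thumb, and real systems use trained models over much richer features:

```python
import statistics

def classify_gender(f0_track, threshold_hz=165.0):
    """Toy gender classifier: median fundamental frequency vs. a threshold.
    Adult male speech runs roughly 85-180 Hz, female roughly 165-255 Hz;
    this is an illustration, not a production method."""
    voiced = [f for f in f0_track if f > 0]   # 0 marks unvoiced frames
    if not voiced:
        return "unknown"
    return "female" if statistics.median(voiced) >= threshold_hz else "male"

print(classify_gender([0, 110, 120, 115, 0, 118]))  # -> "male"
print(classify_gender([210, 0, 195, 220, 205]))     # -> "female"
```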

31 A Richer View of Speech Speaker identification –Known speaker and “more like this” searches –Gender detection for search and browsing Topic segmentation via vocabulary shift –More natural breakpoints for browsing Speaker segmentation –Visualize turn-taking behavior for browsing –Classify turn-taking patterns for searching

32 Other Possibly Useful Features Channel characteristics –Cell phone, landline, studio mike,... Accent –Another way of grouping speakers Prosody –Detecting emphasis could help search or browsing Non-speech audio –Background sounds, audio cues

33 Competing Demands on the Interface Query must result in a manageable set –But users prefer simple query interfaces Selection interface must show several segments –Representations must be compact, but informative Rapid examination should be possible –But complete access to the recordings is desirable

34 Iterative Prototyping Strategy Select a user group and a collection Observe information seeking behaviors –To identify effective search strategies Refine the interface –To support effective search strategies Integrate needed speech technologies Evaluate the improvements with user studies –And observe changes to effective search strategies

35 Broadcast News Retrieval Study NPR Online –Manually prepared transcripts –Human cataloging SpeechBot –Automatic speech recognition –Automatic indexing

36 NPR Online

37 (image-only slide)

38 SpeechBot

39 (image-only slide)

40 Study Design Seminar on visual and sound materials –Recruited 5 students After training, we provided 2 topics –3 searched NPR Online, 2 searched SpeechBot All then tried both systems with a 3rd topic –Each choosing their own topic Rich data collection –Observation, think aloud, semi-structured interview Model-guided inductive analysis –Coded to the model with QSR NVivo

41 Criterion-Attribute Framework Relevance criteria: Topicality, Story Type, Authority Associated attributes (across NPR Online and SpeechBot): story title, program title, brief summary, short summary, detailed summary, speaker name, speaker’s affiliation, highlighted terms, audio

42 Shoah Foundation’s Collection Enormous scale –116,000 hours; 52,000 interviews; 180 TB Grand challenges –32 languages, accents, elderly, emotional, … Accessible –$100 million collection and digitization investment Annotated –10,000 hours (~200,000 segments) fully described Users –A department working full time on dissemination

43 English ASR Accuracy Training: 200 hours from 800 speakers

44 ASR Game Plan (as of May 2003)

Language   Hours Transcribed   Word Error Rate
English    200                 39.6%
Czech      84                  39.4%
Russian    20 (of 100)         66.6%
Polish     –                   –
Slovak     –                   –

45 Who Uses the Collection? Based on analysis of 280 access requests Disciplines: History, Linguistics, Journalism, Material culture, Education, Psychology, Political science, Law enforcement Products: Book, Documentary film, Research paper, CD-ROM, Study guide, Obituary, Evidence, Personal use

46 Question Types Content –Person, organization –Place, type of place (e.g., camp, ghetto) –Time, time period –Event, subject Mode of expression –Language –Displayed artifacts (photographs, objects, …) –Affective reaction (e.g., vivid, moving, …) Age appropriateness

47 Observational Studies 8 independent searchers –Holocaust studies (2) –German Studies –History/Political Science –Ethnography –Sociology –Documentary producer –High school teacher 8 teamed searchers –All high school teachers Thesaurus-based search Rich data collection –Intermediary interaction –Semi-structured interviews –Observational notes –Think-aloud –Screen capture Qualitative analysis –Theory-guided coding –Abductive reasoning

48 “Old Indexing” (segments ordered by interview time)

Subject                           Person                       Location-Time
Employment                        Josef Stein                  Berlin-1939
Family life                       Gretchen Stein, Anna Stein   Berlin-1939
Schooling                         Gunter Wendt, Maria          Dresden-1939
Relocation, Transportation-rail   –                            Dresden-1939

49 Table 5. Mentions of relevance criteria by searchers (Workshops 1 and 2)

Relevance Criteria        All (N=703)   Think-Aloud Relevance Judgment (N=300)   Query Formulation (N=248)
Topicality                535 (76%)     219                                      234
Richness                  39 (5.5%)     14                                       0
Emotion                   24 (3.4%)     7                                        0
Audio/Visual Expression   16 (2.3%)     5                                        0
Comprehensibility         14 (2%)       1                                        10
Duration                  11 (1.6%)     9                                        0
Novelty                   10 (1.4%)     4                                        2

50 Topicality: total mentions by 8 searchers (Workshops 1 and 2)

51 (Workshops 1 and 2) Person, by name Person, by characteristics: date of birth, gender, occupation (interviewee), country of birth, religion, social status (interviewee), social status (parents), nationality/language, family status, address, occupation (parents), immigration history, level of education, political affiliation, marital status Personal event/experience: resistance, deportation, suicide, immigration, liberation, escaping, forced labor, hiding, abortion, wedding, murder/death, adaptation, abandonment, incarceration Historical event/experience: November Pogrom, Beautification, Polish Pogrom, Olympic games

52 Relevance Criteria (Workshop 3)
1 Relationship to theme
2 Specific topic was less important (different from many researchers, for whom a specific topic, such as the name of a camp or a physician as aid giver, is important)
3 Match of the demographic characteristics; specifically, the age of the interviewee at the time of the events recounted as related to the age of the students for whom a lesson is intended
4 Language of clip; relates to student audience and teaching purpose (English first, Spanish second)
5 Age-appropriateness of the material
6 Acceptability to the other stakeholders (parents, administrators, etc.)
7 The extent to which the segment provides a useful basis for learning and internalizing vocabulary
8 The extent to which the clip can be used in several subjects, for example, English, history, and art
9 Ease of comprehension, mainly clarity of enunciation
10 Expressive power, both body language and voice
11 Length of the segment, in relation to what it contributes
12 Does the segment communicate enough context?

53 MALACH Overview (component diagram): Speech Recognition, Boundary Detection, Content Tagging, Query Formulation, Automatic Search, Interactive Selection

54 MALACH Test Collection (diagram): interviews pass through speech recognition, boundary detection, and content tagging to form a comparable collection; topic statements pass through query formulation into automatic search, which produces ranked lists; evaluation compares the ranked lists against relevance judgments and reports mean average precision
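
Mean average precision, the figure of merit in that diagram, reduces to a short computation. A sketch (the segment IDs below are invented for illustration):

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant item appears,
    divided by the total number of relevant items."""
    hits, precisions = 0, []
    for i, seg in enumerate(ranked, start=1):
        if seg in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Relevant segments found at ranks 2 and 4 -> AP = (1/2 + 2/4) / 2 = 0.5
print(average_precision(["s3", "s1", "s9", "s2"], {"s1", "s2"}))
```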

55 Design Decision #1: Search for Segments “Old indexing” used manual segmentation 4,000 indexed interviews (10,000 hours) 1,514 have been digitized (3,780 hours) 800 of those were used to train ASR 714 remain for (fair) IR testing 246 were selected (625 hours) 199 complete interviews, 47 partial interviews 9,947 segments resulted (17 MB of text) –Average length: 380 words (~3 minutes)

56 Design Decision #2: Topic Construction 600 written requests, in folders at VHF –From scholars, teachers, broadcasters, … 280 topical requests –Others just requested a single interview 70 recast in a TREC-like format –Some needed to be “broadened” 50 selected for use in the collection 30 assessed during Summer 2004 28 yielded at least 5 relevant segments

57 Design Decision #3: Defining “Relevance” Topical relevance –Direct: they saw the plane crash –Indirect: they saw the engine hit the building Confounding factors –Context: statistics on causes of plane crashes –Comparison: saw a different plane crash –Pointer: list of witnesses at the investigation

58 Design Decision #4: Selective Assessment Exhaustive assessment is impractical –300,000 judgments (~20,000 hours)! Pooled assessment is not yet possible –Requires a diverse set of ASR and IR systems Search-guided assessment is viable –Iterate topic research/search/assessment –Augment with review, adjudication, reassessment –Requires an effective interactive search system

59 Quality Assurance 14 topics independently assessed –0.63 topic-averaged kappa (over all judgments) –44% topic-averaged overlap (relevant judgments) –Assessors later met to adjudicate 14 topics assessed and then reviewed –Decisions of the reviewer were final
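
For two assessors making binary relevance judgments, the kappa statistic quoted above (agreement corrected for chance) can be computed as follows; the judgment vectors are invented for the example:

```python
def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa for two parallel lists of binary (0/1) judgments."""
    assert len(judge_a) == len(judge_b)
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    pa1 = sum(judge_a) / n                     # P(judge A says relevant)
    pb1 = sum(judge_b) / n                     # P(judge B says relevant)
    expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)  # chance agreement
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

a = [1, 1, 0, 0, 1, 0, 1, 0]   # invented judgments, assessor A
b = [1, 0, 0, 0, 1, 0, 1, 1]   # invented judgments, assessor B
print(round(cohens_kappa(a, b), 2))  # -> 0.5
```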

60 Comparing Index Terms Topical relevance, adjudicated judgments, Inquery

61 Leveraging Automatic Categorization Motivation: MAP on thesaurus terms, title queries: 28.3% (was 7.5% on text) Training segments: text (manual transcripts) with manually assigned keywords; test segments: text (ASR output) with keywords assigned by automatic categorization, feeding the index Categorizer: k Nearest Neighbors trained on 3,199 manually transcribed segments; micro-averaged F1 = 0.192

62 ASR-Based Search (chart): Mean Average Precision, +30%; title queries, topical relevance, adjudicated judgments

63 Comparing ASR and Manual (chart): uninterpolated average precision; topical relevance, adjudicated judgments, title queries, Inquery

64 Error Analysis (title queries, adjudicated judgments, Inquery; chart shows ASR results as % of metadata) Query terms were grouped as: somewhere in ASR results (bold terms occur in <35 segments), in ASR lexicon, or only in metadata Terms included: wit, eichmann, jew, volkswagen, labor camp, ig farben, slave labor, telefunken, aeg, minsk ghetto, underground, wallenberg, bomb, birkeneneau, sonderkommando, auschwicz, liber, buchenwald, dachau, kapo, kindertransport, ghetto life, fort ontario, refugee camp, jewish partisan, poland, shanghai, bulgaria, save

65 Correcting Relevant Segments Title+Description+Narrative queries

66
Term          Relevant Segments   Occurrences/Segment   Misses/Segment   Miss %   False Alarms/Segment   False Alarm %
fort          13                  1.2                   0.85             69       0                      0
Ontario       13                  1.2                   0.77             63       0                      0
refugee       2                   1.0                   0                0        0                      0
refugees      6                   2.0                   0.67             33       0.17                   8.3
camp          21                  3.6                   1.20             33       0                      0
camps         2                   1.0                   0                0        0                      0
Jewish        8                   2.1                   0.75             35       0.38                   18
kapo          9                   2.6                   2.00             78       0                      0
kapos         7                   2.7                   2.70             100      0                      0
Minsk         9                   1.7                   1.70             100      0                      0
ghetto        19                  3.2                   1.10             35       0.32                   10
ghettos       3                   1.0                   1.00             100      0                      0
underground   3                   1.3                   0.33             25       0                      0

67 What Have We Learned? IR test collection yields interesting insights –Real topics, real ASR, ok assessor agreement Named entities are important to real users –Word error rate can mask key ASR weaknesses Knowledge structures seem to add value –Hand-built thesaurus + text classification

68 For More Information The MALACH project –http://www.clsp.jhu.edu/research/malach/ NSF/EU Spoken Word Access Group –http://www.dcs.shef.ac.uk/spandh/projects/swag/ Speech-based retrieval –http://www.glue.umd.edu/~dlrg/speech/

69 New Zealand Melody Index Index musical tunes as contour patterns –Rising, descending, and repeated pitch –Note duration as a measure of rhythm Users sing queries using words or la, da, … –Pitch tracking accommodates off-key queries Rank order using approximate string match –Insert, delete, substitute, consolidate, fragment Display title, sheet music, and audio

70 Contour Matching Example “Three Blind Mice” is indexed as: –*DDUDDUDRDUDRD * represents the first note D represents a descending pitch (U is ascending) R represents a repetition (detectable split, same pitch) My singing produces: –*DDUDDUDRRUDRR Approximate string match finds 2 substitutions
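
That match can be reproduced with ordinary edit distance. The sketch below handles insert, delete, and substitute; the consolidate and fragment operations mentioned on the previous slide would need additional cases:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute (or match)
    return d[m][n]

indexed = "*DDUDDUDRDUDRD"   # "Three Blind Mice", as indexed above
sung    = "*DDUDDUDRRUDRR"   # the sung query contour above
print(edit_distance(indexed, sung))  # -> 2 (the two substitutions)
```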

71 (image-only slide)

72 Muscle Fish Audio Retrieval Compute 4 acoustic features for each time slice –Pitch, amplitude, brightness, bandwidth Segment at major discontinuities –Find average, variance, and smoothness of segments Store pointers to segments in 13 sorted lists –Use a commercial database for proximity matching 4 features, 3 parameters for each, plus duration –Then rank order using statistical classification Display file name and audio
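
A sketch of the segmentation-and-statistics step, with invented feature values; extracting the actual per-slice features (pitch, amplitude, brightness, bandwidth) is a separate signal-processing task not shown here:

```python
import numpy as np

def segment_stats(feature_track):
    """Average, variance, and smoothness (mean absolute frame-to-frame delta)
    of one acoustic feature over one segment."""
    x = np.asarray(feature_track, dtype=float)
    smoothness = np.abs(np.diff(x)).mean() if len(x) > 1 else 0.0
    return {"mean": x.mean(), "var": x.var(), "smooth": smoothness}

def segment_boundaries(track, threshold):
    """Indices where a feature jumps by more than `threshold` (a major
    discontinuity, in the spirit of the slide)."""
    x = np.asarray(track, dtype=float)
    return [i + 1 for i in np.flatnonzero(np.abs(np.diff(x)) > threshold)]

pitch = [220, 222, 219, 440, 445, 442]           # invented pitch track (Hz)
cuts = segment_boundaries(pitch, threshold=100)  # -> [3]
print(cuts, segment_stats(pitch[:cuts[0]]))
```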

73 Muscle Fish Audio Retrieval

74 Summary Limited audio indexing is practical now –Audio feature matching, answering machine detection Present interfaces focus on a single technology –Speech recognition, audio feature matching –Matching technology is outpacing interface design

