Speech and Music Retrieval LBSC 796/CMSC828o Session 12, April 19, 2004 Douglas W. Oard.

Agenda Questions Speech retrieval Music retrieval

Spoken Word Collections Broadcast programming –News, interview, talk radio, sports, entertainment Scripted stories –Books on tape, poetry reading, theater Spontaneous storytelling –Oral history, folklore Incidental recording –Speeches, oral arguments, meetings, phone calls

Some Statistics 2,000 U.S. radio stations webcasting 250,000 hours of oral history in British Library 35 million audio streams indexed by SingingFish –Over 1 million searches per day ~100 billion hours of phone calls each year

Economics of the Web

                                          Text in 1995    Speech in 2003
Storage (words per $)                     300K            1.5M
Internet backbone (simultaneous users)    250K            30M
Modem capacity (% utilization)            100%            20%
Display capability (% of US population)   10%             38%
Search systems                            Lycos, Yahoo    SpeechBot, SingingFish

Audio Retrieval Retrospective retrieval applications –Search music and nonprint media collections –Electronic finding aids for sound archives –Index audio files on the web Information filtering applications –Alerting service for a news bureau –Answering machine detection for telemarketing –Autotuner for a car radio

The Size of the Problem 30,000 hours in the Maryland Libraries –Unique collections with limited physical access Over 100,000 hours in the National Archives –With new material arriving at an increasing rate Millions of hours broadcast each year –Over 2,500 radio stations are now Webcasting!

Speech Retrieval Approaches Controlled vocabulary indexing Ranked retrieval based on associated text → Automatic feature-based indexing Social filtering based on other users’ ratings

Supporting Information Access
Source Selection → Query Formulation (Query) → Search → Ranked List → Selection → Examination (Recording) → Delivery
Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection

Description Strategies Transcription –Manual transcription (with optional post-editing) Annotation –Manually assign descriptors to points in a recording –Recommender systems (ratings, link analysis, …) Associated materials –Interviewer’s notes, speech scripts, producer’s logs Automatic –Create access points with automatic speech processing

HotBot Audio Search Results

Detectable Speech Features Content –Phonemes, one-best word recognition, n-best Identity –Speaker identification, speaker segmentation Language –Language, dialect, accent Other measurable parameters –Time, duration, channel, environment

How Speech Recognition Works Three stages –What sounds were made? Convert from waveform to subword units (phonemes) –How could the sounds be grouped into words? Identify the most probable word segmentation points –Which of the possible words were spoken? Based on likelihood of possible multiword sequences All three stages are learned from training data –Using hill climbing (a “Hidden Markov Model”)
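The decoding idea behind all three stages can be sketched with a toy Viterbi decoder over an HMM. Everything below is invented for illustration: the states are three phonemes, the observations are coarse acoustic labels, and the probabilities are made up, not values from any trained recognizer.

```python
# Toy Viterbi decoding for a hand-built HMM: hidden states are phonemes,
# observations are coarse acoustic frame labels. All probabilities are
# illustrative inventions, not from a real recognizer.

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = probability of the best state sequence ending in s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, p = max(
                ((r, V[t - 1][r] * trans_p[r][s] * emit_p[s][obs[t]]) for r in states),
                key=lambda x: x[1])
            V[t][s], back[t][s] = p, prev
    # Trace back the most probable state sequence
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["m", "ae", "n"]
start_p = {"m": 0.8, "ae": 0.1, "n": 0.1}
trans_p = {"m": {"m": 0.4, "ae": 0.5, "n": 0.1},
           "ae": {"m": 0.1, "ae": 0.4, "n": 0.5},
           "n": {"m": 0.1, "ae": 0.1, "n": 0.8}}
emit_p = {"m": {"lo": 0.7, "hi": 0.3},
          "ae": {"lo": 0.2, "hi": 0.8},
          "n": {"lo": 0.6, "hi": 0.4}}
print(viterbi(["lo", "hi", "lo"], states, start_p, trans_p, emit_p))  # → ['m', 'ae', 'n']
```

A real recognizer runs the same dynamic program over tens of thousands of states, with transition and emission probabilities learned from training data.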

Using Speech Recognition
Phone detection → phone n-grams, phone lattice
Word construction (using a transcription dictionary) → words, word lattice
Word selection (using a language model) → one-best transcript

Segment broadcasts into 20 second chunks Index phoneme n-grams –Overlapping one-best phoneme sequences –Trained using native German speakers Form phoneme trigrams from typed queries –Rule-based system for “open” vocabulary Vector space trigram matching –Identify ranked segments by time ETHZ Broadcast News Retrieval

Phoneme Trigrams Manage -> m ae n ih jh –Dictionaries provide accurate transcriptions But valid only for a single accent and dialect –Rule-base transcription handles unknown words Index every overlapping 3-phoneme sequence –m ae n –ae n ih –n ih jh
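The trigram indexing above, plus the vector-space matching the ETHZ slide describes, can be sketched in a few lines. The cosine here is over raw trigram counts, and the segment text is an invented example; the real system also applied rule-based transcription for unknown words.

```python
from collections import Counter
from math import sqrt

def phoneme_trigrams(phones):
    """Every overlapping 3-phoneme sequence, as in the 'manage' example above."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]

def cosine(a, b):
    # Plain vector-space match over trigram counts
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

query = "m ae n ih jh".split()               # "manage"
segment = "dh ax m ae n ih jh er".split()    # invented segment: "...the manager..."
print(phoneme_trigrams(query))
print(round(cosine(phoneme_trigrams(query), phoneme_trigrams(segment)), 2))  # → 0.71
```

All three query trigrams occur in the segment, so the score is high even though the recognizer never had to know the word "manage".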

ETHZ Broadcast News Retrieval

Cambridge Video Mail Retrieval Added personal audio (and video) to email –But subject lines still typed on a keyboard Indexed most probable phoneme sequences

Key Results from TREC/TDT Recognition and retrieval can be decomposed –Word recognition/retrieval works well in English Retrieval is robust with recognition errors –Up to 40% word error rate is tolerable Retrieval is robust with segmentation errors –Vocabulary shift/pauses provide strong cues
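Word error rate, the measure cited above, is the edit distance between reference and hypothesis transcripts divided by the reference length. A minimal sketch (the example sentences are invented):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with standard Levenshtein alignment over words."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One substitution ("the"→"a") plus one deletion ("the") over 5 words = 40% WER
print(word_error_rate("the plane hit the building", "a plane hit building"))  # → 0.4
```

Note that WER can exceed 100% when the recognizer inserts many spurious words, which is why the 40% figure above is a rate, not a percentage of words.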

Cambridge Video Mail Retrieval Translate queries to phonemes with dictionary –Skip stopwords and words with fewer than 3 phonemes Find no-overlap matches in the lattice –Queries take about 4 seconds per hour of material Vector space exact word match –No morphological variations checked –Normalize using most probable phoneme sequence Select from a ranked list of subject lines

Visualizing Turn-Taking

MIT “Speech Skimmer”

BBN Radio News Retrieval

AT&T Radio News Retrieval

IBM Broadcast News Retrieval Large vocabulary continuous speech recognition –64,000 word forms covers most utterances When suitable training data is available About 40% word error rate in the TREC 6 evaluation –Slow indexing (1 hour per hour) limits collection size Standard word-based vector space matching –Nearly instant queries –N-gram triage plus lattice match for unknown words Ranked list showing source and broadcast time

Comparison with Text Retrieval Detection is harder –Speech recognition errors Selection is harder –Date and time are not very informative Examination is harder –Linear medium is hard to browse –Arbitrary segments produce unnatural breaks

Speaker Identification Gender –Classify speakers as male or female Identity –Detect speech samples from same speaker –To assign a name, need a known training sample Speaker segmentation – Identify speaker changes –Count number of speakers

A Richer View of Speech Speaker identification –Known speaker and “more like this” searches –Gender detection for search and browsing Topic segmentation via vocabulary shift –More natural breakpoints for browsing Speaker segmentation –Visualize turn-taking behavior for browsing –Classify turn-taking patterns for searching

Other Possibly Useful Features Channel characteristics –Cell phone, landline, studio mike,... Accent –Another way of grouping speakers Prosody –Detecting emphasis could help search or browsing Non-speech audio –Background sounds, audio cues

Competing Demands on the Interface Query must result in a manageable set –But users prefer simple query interfaces Selection interface must show several segments –Representations must be compact, but informative Rapid examination should be possible –But complete access to the recordings is desirable

Iterative Prototyping Strategy Select a user group and a collection Observe information seeking behaviors –To identify effective search strategies Refine the interface –To support effective search strategies Integrate needed speech technologies Evaluate the improvements with user studies –And observe changes to effective search strategies

Broadcast News Retrieval Study
NPR Online → manually prepared transcripts → human cataloging
SpeechBot → automatic speech recognition → automatic indexing

NPR Online

SpeechBot

Study Design Seminar on visual and sound materials –Recruited 5 students After training, we provided 2 topics –3 searched NPR Online, 2 searched SpeechBot All then tried both systems with a 3rd topic –Each choosing their own topic Rich data collection –Observation, think aloud, semi-structured interview Model-guided inductive analysis –Coded to the model with QSR NVivo

Criterion-Attribute Framework
[table mapping relevance criteria (topicality, story type, authority) to the associated attributes of each system: story title, brief/short/detailed summaries, speaker name, speaker’s affiliation, program title, highlighted terms, and audio, for NPR Online and SpeechBot]

Shoah Foundation’s Collection Enormous scale –116,000 hours; 52,000 interviews; 180 TB Grand challenges –32 languages, accents, elderly, emotional, … Accessible –$100 million collection and digitization investment Annotated –10,000 hours (~200,000 segments) fully described Users –A department working full time on dissemination

English ASR Accuracy Training: 200 hours from 800 speakers

ASR Game Plan (as of May 2003)

Language   Hours Transcribed   Word Error Rate
English    -                   - %
Czech      84                  39.4%
Russian    20 (of 100)         66.6%
Polish     -                   -
Slovak     -                   -

Who Uses the Collection?

Disciplines: history, linguistics, journalism, material culture, education, psychology, political science, law enforcement
Products: book, documentary film, research paper, CD-ROM, study guide, obituary, evidence, personal use

Based on analysis of 280 access requests

Question Types Content –Person, organization –Place, type of place (e.g., camp, ghetto) –Time, time period –Event, subject Mode of expression –Language –Displayed artifacts (photographs, objects, …) –Affective reaction (e.g., vivid, moving, …) Age appropriateness

Observational Studies 8 independent searchers –Holocaust studies (2) –German Studies –History/Political Science –Ethnography –Sociology –Documentary producer –High school teacher 8 teamed searchers –All high school teachers Thesaurus-based search Rich data collection –Intermediary interaction –Semi-structured interviews –Observational notes –Think-aloud –Screen capture Qualitative analysis –Theory-guided coding –Abductive reasoning

“Old Indexing”: manual segments along interview time, each annotated with subject, person, and location-time:

Subject                          Person                      Location-Time
Employment                       Josef Stein                 Berlin-1939
Family life                      Gretchen Stein, Anna Stein  Berlin-1939
Schooling                        Gunter Wendt, Maria         Dresden-1939
Relocation, Transportation-rail  -                           -

Table 5. Mentions of relevance criteria by searchers (Workshops 1 and 2)

Relevance Criteria        All (N=703)   Think-Aloud Relevance Judgment (N=300)   Query Formulation (N=248)
Topicality                535 (76%)     -                                        -
Richness                  39 (5.5%)     14                                       0
Emotion                   24 (3.4%)     7                                        0
Audio/Visual Expression   16 (2.3%)     5                                        0
Comprehensibility         14 (2%)       1                                        10
Duration                  11 (1.6%)     9                                        0
Novelty                   10 (1.4%)     4                                        2

Topicality Workshops 1 and 2 Total mentions by 8 searchers

Person, by name
Person, by characteristics: date of birth, gender, occupation (interviewee), country of birth, religion, social status (interviewee), social status (parents), nationality, language, family status, address, occupation (parents), immigration history, level of education, political affiliation, marital status
Personal event/experience: resistance, deportation, suicide, immigration, liberation, escaping, forced labor, hiding, abortion, wedding, murder/death, adaptation, abandonment, incarceration
Historical event/experience: November Pogrom, Beautification, Polish Pogrom, Olympic games
Workshops 1 and 2

Relevance Criteria (Workshop 3)
1 Relationship to theme
2 Specific topic was less important (unlike many researchers, for whom a specific topic, such as the name of a camp or a physician as aid giver, is important)
3 Match of demographic characteristics; specifically, the age of the interviewee at the time of the events recounted relative to the age of the students for whom a lesson is intended
4 Language of clip, relative to the student audience and teaching purpose: English first, Spanish second
5 Age-appropriateness of the material
6 Acceptability to the other stakeholders (parents, administrators, etc.)
7 The extent to which the segment provides a useful basis for learning and internalizing vocabulary
8 The extent to which the clip can be used in several subjects, for example, English, history, and art
9 Ease of comprehension, mainly clarity of enunciation
10 Expressive power, both body language and voice
11 Length of the segment, in relation to what it contributes
12 Does the segment communicate enough context?

MALACH Overview Automatic Search Boundary Detection Interactive Selection Content Tagging Speech Recognition Query Formulation

MALACH Test Collection Automatic Search Boundary Detection Speech Recognition Query Formulation Topic Statements Ranked Lists Evaluation Relevance Judgments Mean Average Precision Interviews Comparable Collection Content Tagging

Design Decision #1: Search for Segments “Old indexing” used manual segmentation 4,000 indexed interviews (10,000 hours) 1,514 have been digitized (3,780 hours) 800 of those were used to train ASR 714 remain for (fair) IR testing 246 were selected (625 hours) 199 complete interviews, 47 partial interviews 9,947 segments resulted (17 MB of text) –Average length: 380 words (~3 minutes)

Design Decision #2: Topic Construction 600 written requests, in folders at VHF –From scholars, teachers, broadcasters, … 280 topical requests –Others just requested a single interview 70 recast in a TREC-like format –Some needed to be “broadened” 50 selected for use in the collection 30 assessed during the summer yielded at least 5 relevant segments

Design Decision #3: Defining “Relevance”
Topical relevance
–Direct: they saw the plane crash
–Indirect: they saw the engine hit the building
Confounding factors
–Context: statistics on causes of plane crashes
–Comparison: saw a different plane crash
–Pointer: list of witnesses at the investigation

Design Decision #4: Selective Assessment Exhaustive assessment is impractical –300,000 judgments (~20,000 hours)! Pooled assessment is not yet possible –Requires a diverse set of ASR and IR systems Search-guided assessment is viable –Iterate topic research/search/assessment –Augment with review, adjudication, reassessment –Requires an effective interactive search system

Quality Assurance 14 topics independently assessed –0.63 topic-averaged kappa (over all judgments) –44% topic-averaged overlap (relevant judgments) –Assessors later met to adjudicate 14 topics assessed and then reviewed –Decisions of the reviewer were final
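The kappa statistic reported above corrects raw agreement for the agreement two assessors would reach by chance. A minimal sketch with invented judgments (the slide's 0.63 comes from the real assessments, not this toy data):

```python
def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is the agreement expected by chance."""
    n = len(judge_a)
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    labels = set(judge_a) | set(judge_b)
    p_e = sum((judge_a.count(l) / n) * (judge_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Invented relevance judgments (1 = relevant) from two assessors on 10 segments
a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(round(cohens_kappa(a, b), 2))  # → 0.58
```

Here the assessors agree on 8 of 10 segments (80%), but because most segments are non-relevant, chance agreement is already 52%, so kappa drops to about 0.58.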

Comparing Index Terms Topical relevance, adjudicated judgments, Inquery

Leveraging Automatic Categorization
Motivation: MAP on thesaurus terms, title queries: 28.3% (was 7.5% on text)
Training segments: text (manual transcripts) with keywords; test segments: text (ASR output) → automatic categorization → keywords → index
Categorizer: k Nearest Neighbors trained on 3,199 manually transcribed segments; micro-averaged F1 = 0.192
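A minimal sketch of the kNN categorization step: each ASR test segment receives the majority keyword among its k most similar manually transcribed training segments. The training texts, keywords, and query below are invented for illustration; the real system used 3,199 transcribed segments, thesaurus descriptors, and weighted term vectors.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    # Cosine similarity over bag-of-words counts (Counters)
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_keyword(train, text, k=3):
    """Assign the majority keyword among the k nearest training segments."""
    vec = Counter(text.split())
    ranked = sorted(train, key=lambda ex: cosine(vec, Counter(ex[0].split())),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented training data: (transcript text, thesaurus keyword)
train = [
    ("deported by rail to the camp", "deportation"),
    ("the transport train left the ghetto", "deportation"),
    ("we hid in the cellar for months", "hiding"),
    ("hidden by a farmer in the barn", "hiding"),
]
print(knn_keyword(train, "the train to the camp"))  # → deportation
```

Because the classifier is trained on clean transcripts but applied to errorful ASR output, recall on rare keywords suffers, which is consistent with the modest F1 reported above.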

ASR-Based Search Mean Average Precision Title queries, topical relevance, adjudicated judgments +30%

Topical relevance, adjudicated judgments, title queries, Inquery Uninterpolated Average Precision Comparing ASR and Manual

Error Analysis
[chart comparing stemmed title-query terms that appear somewhere in the ASR results (in the ASR lexicon) with terms that appear only in the metadata; bold terms occur in <35 segments. Terms include: eichmann, jew, volkswagen, labor camp, ig farben, slave labor, telefunken, aeg, minsk ghetto underground, wallenberg, bomb, birkeneneau, sonderkommando, auschwicz, liber, buchenwald, dachau, jewish kapo, kindertransport, ghetto life, fort ontario refugee camp, jewish partisan poland, jew shanghai, bulgaria save jew]
Title queries, adjudicated judgments, Inquery. ASR as % of metadata.

Correcting Relevant Segments Title+Description+Narrative queries

[table of per-term recognition statistics (relevant segments; occurrences, misses, and false alarms per segment; miss % and false alarm %) for the terms: fort Ontario, refugee, refugees, camp, camps, Jewish, kapo, kapos, Minsk, ghetto, ghettos, underground]

What Have We Learned? IR test collection yields interesting insights –Real topics, real ASR, ok assessor agreement Named entities are important to real users –Word error rate can mask key ASR weaknesses Knowledge structures seem to add value –Hand-built thesaurus + text classification

For More Information The MALACH project – NSF/EU Spoken Word Access Group – Speech-based retrieval –

New Zealand Melody Index Index musical tunes as contour patterns –Rising, descending, and repeated pitch –Note duration as a measure of rhythm Users sing queries using words or la, da, … –Pitch tracking accommodates off-key queries Rank order using approximate string match –Insert, delete, substitute, consolidate, fragment Display title, sheet music, and audio

Contour Matching Example “Three Blind Mice” is indexed as: –*DDUDDUDRDUDRD * represents the first note D represents a descending pitch (U is ascending) R represents a repetition (detectable split, same pitch) My singing produces: –*DDUDDUDRRUDRR Approximate string match finds 2 substitutions
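The slide's count of 2 substitutions can be reproduced with plain Levenshtein distance over the contour strings. The full New Zealand matcher also supports consolidate and fragment edits, which this sketch omits.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert/delete/substitute) between two
    contour strings; the full melody matcher adds consolidate/fragment."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # substitute/match
                          d[i - 1][j] + 1,                            # delete
                          d[i][j - 1] + 1)                            # insert
    return d[len(a)][len(b)]

indexed = "*DDUDDUDRDUDRD"   # "Three Blind Mice", as indexed
sung    = "*DDUDDUDRRUDRR"   # the sung query from the slide
print(edit_distance(indexed, sung))  # → 2
```

Ranking tunes by this distance makes retrieval robust to the pitch and rhythm errors that casual singers make.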

Muscle Fish Audio Retrieval Compute 4 acoustic features for each time slice –Pitch, amplitude, brightness, bandwidth Segment at major discontinuities –Find average, variance, and smoothness of segments Store pointers to segments in 13 sorted lists –Use a commercial database for proximity matching 4 features, 3 parameters for each, plus duration –Then rank order using statistical classification Display file name and audio
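A sketch of the segmentation-and-statistics step for a single feature trajectory: split at large frame-to-frame discontinuities, then summarize each segment by average, variance, and smoothness. The discontinuity threshold and the pitch values below are illustrative, not Muscle Fish's actual parameters.

```python
def segment_stats(frames, jump=10.0):
    """Segment a per-frame feature trajectory at major discontinuities and
    return (mean, variance, smoothness) for each segment. The 'jump'
    threshold is an invented illustration."""
    segments, cur = [], [frames[0]]
    for prev, x in zip(frames, frames[1:]):
        if abs(x - prev) > jump:      # major discontinuity → start a new segment
            segments.append(cur)
            cur = []
        cur.append(x)
    segments.append(cur)

    stats = []
    for seg in segments:
        mean = sum(seg) / len(seg)
        var = sum((x - mean) ** 2 for x in seg) / len(seg)
        # smoothness proxy: mean absolute frame-to-frame change
        smooth = (sum(abs(b - a) for a, b in zip(seg, seg[1:])) / (len(seg) - 1)
                  if len(seg) > 1 else 0.0)
        stats.append((mean, var, smooth))
    return stats

# Invented pitch track (Hz): one octave jump → two segments
pitch = [100, 101, 99, 100, 220, 221, 219]
print(segment_stats(pitch))  # two segments, means 100.0 and 220.0
```

Storing these per-segment statistics in sorted lists is what lets a conventional database answer proximity ("sounds like this") queries quickly.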

Muscle Fish Audio Retrieval

Summary Limited audio indexing is practical now –Audio feature matching, answering machine detection Present interfaces focus on a single technology –Speech recognition, audio feature matching –Matching technology is outpacing interface design