Multilingual Access to Large Spoken Archives Douglas W. Oard University of Maryland, College Park, MD, USA.



MALACH Project’s Goal Dramatically improve access to large multilingual spoken word collections … by capitalizing on the unique characteristics of the Survivors of the Shoah Visual History Foundation's collection of videotaped oral history interviews.

Spoken Word Collections Broadcast programming –News, interview, talk radio, sports, entertainment Scripted stories –Books on tape, poetry reading, theater Spontaneous storytelling –Oral history, folklore Incidental recording –Speeches, oral arguments, meetings, phone calls

Some Statistics
2,000 U.S. radio stations webcasting
250,000 hours of oral history in British Library
35 million audio streams indexed by SingingFish –Over 1 million searches per day
~100 billion hours of phone calls each year

Economics of the Web in 1995 Affordable storage –300,000 words/$ Adequate backbone capacity –25,000 simultaneous transfers Adequate “last mile” bandwidth –1 second/screen Display capability –10% of US population Effective search capabilities –Lycos, Yahoo

Spoken Word Collections Today (1995 Web figure → figure for spoken word today)
Affordable storage –300,000 words/$ → 1.5 million words/$
Adequate backbone capacity –25,000 simultaneous transfers → 30 million
Adequate “last mile” bandwidth –1 second/screen → 20% of capacity
Display capability –10% of US population → 38% recent use
Effective search capabilities –Lycos, Yahoo

Research Issues Acquisition Segmentation Description Synchronization Rights management Preservation MALACH

Description Strategies Transcription –Manual transcription (with optional post-editing) Annotation –Manually assign descriptors to points in a recording –Recommender systems (ratings, link analysis, …) Associated materials –Interviewer’s notes, speech scripts, producer’s logs Automatic –Create access points with automatic speech processing

Key Results from TREC/TDT Recognition and retrieval can be decomposed –Word recognition/retrieval works well in English Retrieval is robust with recognition errors –Up to 40% word error rate is tolerable Retrieval is robust with segmentation errors –Vocabulary shift/pauses provide strong cues
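Word error rate, the measure behind the 40% figure above, is conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch under that assumption; the function name, the whitespace tokenization, and the example sentences are illustrative and do not come from the TREC/TDT evaluations.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one substitution and one deletion in seven words.
    reference = "we lived in berlin until the war"
    hypothesis = "we live in berlin the war"
    print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

Because the denominator is the reference length, the rate can exceed 100% when the recognizer inserts many extra words.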

Supporting Information Access Source Selection Search Query Selection Ranked List Examination Recording Delivery Recording Query Formulation Search System Query Reformulation and Relevance Feedback Source Reselection

Broadcast News Retrieval Study NPR Online  Manually prepared transcripts  Human cataloging SpeechBot  Automatic Speech Recognition  Automatic indexing

NPR Online

SpeechBot

Study Design Seminar on visual and sound materials –Recruited 5 students After training, we provided 2 topics –3 searched NPR Online, 2 searched SpeechBot All then tried both systems with a 3rd topic –Each choosing their own topic Rich data collection –Observation, think aloud, semi-structured interview Model-guided inductive analysis –Coded to the model with QSR NVivo

Criterion-Attribute Framework
Relevance Criteria: Topicality, Story Type, Authority
Associated Attributes (NPR Online, SpeechBot): Story title, Brief summary, Audio, Detailed summary, Speaker name, Audio, Detailed summary, Short summary, Story title, Program title, Speaker name, Speaker’s affiliation, Detailed summary, Brief summary, Audio, Highlighted terms, Audio, Program title

Some Useful Insights Recognition errors may not bother the system, but they do bother the user! Segment-level indexing can be useful

Shoah Foundation’s Collection Enormous scale –116,000 hours; 52,000 interviews; 180 TB Grand challenges –32 languages, accents, elderly, emotional, … Accessible –$100 million collection and digitization investment Annotated –10,000 hours (~200,000 segments) fully described Users –A department working full time on dissemination

Example Video

Existing Annotations 72 million untranscribed words –From ~4,000 speakers Interview-level ground truth –Pre-interview questionnaire (names, locations, …) –Free-text summary Segment-level ground truth –Topic boundaries: average ~3 min/segment –Labels: Names, topic, locations, year(s) –Descriptions: summary + cataloguer’s scratchpad

Annotated Data Example
Columns: Subject, Person, Location-Time (plotted against interview time)
Berlin-1939 Employment Josef Stein Berlin-1939 Family life Gretchen Stein Anna Stein Dresden-1939 Schooling Gunter Wendt Maria Dresden-1939 Relocation Transportation-rail

MALACH Overview Automatic Search Boundary Detection Interactive Selection Content Tagging Speech Recognition Query Formulation ASR Spontaneous Accented Language switching NLP Components Multi-scale segmentation Multilingual classification Entity normalization Prototype Evidence integration Translingual search Spatial/temporal User Needs Observational studies Formative evaluation Summative evaluation

MALACH Overview Automatic Search Boundary Detection Interactive Selection Content Tagging Speech Recognition Query Formulation ASR Spontaneous Accented Language switching

ASR Research Focus Accuracy –Spontaneous speech –Accented/multilingual/emotional/elderly –Application-specific loss functions Affordability –Minimal transcription –Replicable process

Application-Tuned ASR Acoustic model –Transcribe short segments from many speakers –Unsupervised adaptation Language model –Transcribed segments –Interpolation
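The slide names interpolation as a language-model technique without giving details. One common form is linear interpolation, in which a small in-domain model estimated from the transcribed segments is mixed with a larger background model. The sketch below assumes that form and uses unigram models for brevity; the function names, the toy texts, and the weight of 0.5 are illustrative assumptions, not the project's configuration.

```python
from collections import Counter


def unigram_probs(text: str) -> dict[str, float]:
    """Maximum-likelihood unigram probabilities from whitespace tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(in_domain: dict[str, float],
                background: dict[str, float],
                weight: float) -> dict[str, float]:
    """Linear interpolation: weight * in_domain + (1 - weight) * background."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    vocabulary = set(in_domain) | set(background)
    return {
        word: weight * in_domain.get(word, 0.0)
        + (1.0 - weight) * background.get(word, 0.0)
        for word in vocabulary
    }


if __name__ == "__main__":
    # Toy stand-ins for transcribed interview segments and a background corpus.
    in_domain = unigram_probs("ghetto camp camp")
    background = unigram_probs("news camp weather news")
    mixed = interpolate(in_domain, background, weight=0.5)
    for word in sorted(mixed):
        print(f"{word}: {mixed[word]:.3f}")
```

The mixed distribution still sums to one, and words seen only in the background corpus keep a non-zero probability. In practice the weight would be tuned on held-out transcripts.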

ASR Game Plan (as of May 2003)

Language   Hours Transcribed   Word Error Rate
English                        %
Czech      84                  39.4%
Russian    20 (of 100)         66.6%
Polish
Slovak

English Transcription Time
~2,000 hours to manually transcribe 200 hours from 800 speakers
[Chart axes: Hours to transcribe 15 minutes of speech; Instances (N=830)]
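The figures on this slide imply a rough cost ratio, worked through below. The calculation assumes that each of the 830 charted instances is one 15-minute chunk of speech, which the slide suggests but does not state; the variable names are illustrative.

```python
# Approximate totals as reported on the slide.
transcription_hours = 2000   # manual effort
speech_hours = 200           # speech transcribed

# Hours of effort per hour of speech.
effort_ratio = transcription_hours / speech_hours

# Average effort for one 15-minute chunk at that ratio.
hours_per_quarter_hour = effort_ratio * 0.25

# Speech covered by 830 chunks, ASSUMING each instance is 15 minutes long.
instances = 830
implied_speech_hours = instances * 15 / 60

print(f"effort ratio: {effort_ratio:.1f} hours per hour of speech")
print(f"average per 15-minute chunk: {hours_per_quarter_hour:.1f} hours")
print(f"speech covered by {instances} chunks: {implied_speech_hours:.1f} hours")
```

Under that assumption the 830 chunks cover about 207.5 hours, in line with the roughly 200 hours quoted.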

English ASR Error Rate Training: 65 hours (acoustic model)/200 hours (language model)

MALACH Overview Automatic Search Boundary Detection Interactive Selection Content Tagging Speech Recognition Query Formulation User Needs Observational studies Formative evaluation Summative evaluation

Who Uses the Collection? (based on analysis of 280 access requests)
Discipline: History, Linguistics, Journalism, Material culture, Education, Psychology, Political science, Law enforcement
Products: Book, Documentary film, Research paper, CDROM, Study guide, Obituary, Evidence, Personal use

Question Types Content –Person, organization –Place, type of place (e.g., camp, ghetto) –Time, time period –Event, subject Mode of expression –Language –Displayed artifacts (photographs, objects, …) –Affective reaction (e.g., vivid, moving, …) Age appropriateness

Observational Studies
Workshop 1 (June): Four searchers –History/Political Science –Holocaust studies –Documentary filmmaker Sequential observation Rich data collection –Intermediary interaction –Semi-structured interviews –Observational notes –Think-aloud –Screen capture
Workshop 2 (August): Four searchers –Ethnography –German Studies –Sociology –High school teacher Simultaneous observation Opportunistic data collection –Intermediary interaction –Semi-structured interviews –Observational notes –Focus group discussions

Segment Viewer

Observed Selection Criteria Topicality (57%)  Judged based on: Person, place, … Accessibility (23%)  Judged based on: Time to load video Comprehensibility (14%)  Judged based on: Language, speaking style

References to Named Entities
Columns: Attributes, Mentions, Selection, Reformulation
Person (N=138): Gender; Country of birth; Nationality; Date of birth; Status, interviewee; Status, parents
Place (N=116): Camp; Country; Ghetto

Functionality Needed
Function:
Boolean Search and Ranked Retrieval (13)
Testimony summary (12)
Pre-Interview Questionnaire search/viewer (9)
Rapid access (7)
Related/Alternative search terms (3)
Adding multiple search terms at once (2)
Keywords linked to segment number for easy access (1)
Multi-tasking (1)
Searching testimonies by places under ‘Experience Search’ (1)
Extensive editing within ‘My Project’ (1)
Desired Function:
Temporary saving of selected testimonies (4)
Remote access (3)
Integrated user tools for note taking (3)
Map presentation (2)
Reference tool (1)
More repositories (1)
Introductory video of system tutorial (1)
Help (1)

MALACH Overview Automatic Search Boundary Detection Interactive Selection Content Tagging Speech Recognition Query Formulation NLP Components Multi-scale segmentation Multilingual classification Entity normalization

Topic Segmentation
“True” segmentation: transcripts aligned with cataloguer scratchpad-based boundaries

           Hours   Words       Sentences   Segments
Training   177.5   1,555,914   210,497     2,856
Test       7.5     58,913      7,

Effect of ASR Errors

Rethinking the Problem Segment-then-label models planned speech well –Producers assemble stories to create programs –Stories typically have a dominant theme The structure of natural speech is different –Creation: digressions, asides, clarification, … –Use: intended use may affect desired granularity Documentary film: brief snippet to illustrate a point Classroom teacher: longer self-contextualizing story

OntoLog: Labeling Unplanned Speech Manually assigned labels; start and end at any time –Ontology-based aggregation helps manage complexity

Goal Use available data to estimate the temporal extent of labels in a way that optimizes the utility of the resulting estimates for interactive searching and browsing

Multi-Scale Segmentation Time

Characteristics of the Problem Clear sequential dependencies –Living in Dresden negates living in Berlin Heuristic basis for class models –Persons, based on type of relationship –Date/Time, based on part-whole relationship –Topics, based on a defined hierarchy Heuristic basis for guessing without training –Text similarity between labels and spoken words Heuristic basis for smoothing –Sub-sentence retrieval granularity is unlikely
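The "guessing without training" heuristic above, text similarity between labels and spoken words, can be illustrated with a bag-of-words cosine score between each label's descriptive text and a window of transcript. This is a minimal sketch of the general idea rather than the project's method; the label descriptions, the transcript window, and the function names are all hypothetical.

```python
import math
from collections import Counter


def bag(text: str) -> Counter:
    """Lower-cased bag of whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_label(window: str, label_texts: dict[str, str]) -> tuple[str, float]:
    """Return the label whose descriptive text is most similar to the window."""
    window_bag = bag(window)
    scored = [(label, cosine(bag(text), window_bag))
              for label, text in label_texts.items()]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical label descriptions; a real thesaurus would supply these.
    label_texts = {
        "Schooling": "school teacher class lessons",
        "Relocation": "moved move relocation left city",
    }
    window = "we moved to dresden and left the city"
    label, score = best_label(window, label_texts)
    print(label, round(score, 3))
```

Scoring overlapping windows at several lengths would give one similarity curve per label, from which onset and extent could then be estimated and smoothed.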

Manually Assigned Onset Marks
Columns: Subject, Person, Location-Time (plotted against interview time)
Berlin-1939 Dresden-1939 Employment Josef Stein Gretchen Stein Anna Stein Relocation Transportation-rail Schooling Gunter Wendt Family Life Maria

Some Additional Results Named entity recognition –F > 0.8 (on manual transcripts) Cross-language ranked retrieval (on news) –Czech/English similar to other language pairs
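The slide reports F > 0.8 for named entity recognition without saying which F-measure was used. Assuming the common balanced F1 (the harmonic mean of precision and recall), the score can be computed as sketched below; the entity tuples and the function name are made up for illustration and are not project data.

```python
def f_measure(predicted: set, gold: set, beta: float = 1.0) -> float:
    """F-beta score of predicted items against gold items (beta=1 gives F1)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    # Entities are (text, type) pairs; both sets are invented examples.
    gold = {("Josef Stein", "PERSON"), ("Anna Stein", "PERSON"),
            ("Berlin", "PLACE"), ("Dresden", "PLACE")}
    predicted = {("Josef Stein", "PERSON"), ("Maria", "PERSON"),
                 ("Berlin", "PLACE"), ("Dresden", "PLACE")}
    print(f"F1 = {f_measure(predicted, gold):.2f}")
```

Here three of four predictions are correct and three of four gold entities are found, so precision, recall, and F1 all equal 0.75.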

Looking Forward: 2003 Component development –ASR, segmentation, classification, retrieval Ranked retrieval test collection –1,000 hours of English recognition –25 judged topics in English and Czech Interactive retrieval –Integrating free text and thesaurus-based search

Relevance Categories
Overall relevance –Assessment is informed by the assessments for the individual reasons for relevance (categories of relevance), but the relationship is not straightforward
Provides direct evidence
Provides indirect / circumstantial evidence
Provides context (e.g., causes for the phenomenon of interest)
Provides comparison (similarity or contrast, same phenomenon in different environment, similar phenomenon)
Provides pointer to source of information

Scale for overall relevance
Strictly from the point of view of finding out about the topic, how useful is this segment for the requester? This judgment is made independently of whether another segment (or 25 other segments) give the same information.
4 Makes an important contribution to the topic, right on target
3 Makes an important contribution to the topic
2 Should be looked at for an exhaustive treatment of the topic
1 Should be looked at if the user wants to leave no stone unturned
0 No need to look at this at all
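For scoring a ranked retrieval run, a graded scale like this one is typically encoded as ordered integers and, where a binary measure is needed, cut at a threshold. The sketch below shows one way to do that; the enum member names, the segment identifiers, and the choice of threshold are assumptions made for illustration and are not specified on the slides.

```python
from enum import IntEnum


class OverallRelevance(IntEnum):
    """The five-point overall relevance scale; member names are illustrative."""
    NO_NEED = 0            # no need to look at this at all
    NO_STONE_UNTURNED = 1  # only if the user wants to leave no stone unturned
    EXHAUSTIVE = 2         # for an exhaustive treatment of the topic
    IMPORTANT = 3          # important contribution to the topic
    RIGHT_ON_TARGET = 4    # important contribution, right on target


def segments_at_or_above(judgments: dict[str, int],
                         threshold: OverallRelevance) -> list[str]:
    """Segment ids whose judged relevance meets the threshold, sorted by id."""
    return sorted(
        segment for segment, score in judgments.items()
        if OverallRelevance(score) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical judgments for three segments on one topic.
    judgments = {"seg-a": 4, "seg-b": 1, "seg-c": 3}
    print(segments_at_or_above(judgments, OverallRelevance.IMPORTANT))
```

Converting each score through the enum rejects values outside 0 to 4, which helps catch data-entry errors in assessor files.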

Direct relevance
Direct evidence for what the user asks for. Directly on topic, direct aboutness. The information describes the events or circumstances asked for or otherwise speaks directly to what the user is looking for. First-hand accounts are preferred, e.g., the testimony contains a report on the interviewee's own experience, or an eye-witness account on what happened, or self-report on how a survivor felt. Second-hand accounts (hearsay) are acceptable, such as a report on what an eyewitness told the interviewee or a report on how somebody else felt.
Direct Evidence: Evidence that stands on its own to prove an alleged fact, such as testimony of a witness who says she saw a defendant pointing a gun at a victim during a robbery. Direct proof of a fact, such as testimony by a witness about what that witness personally saw or heard or did. ('Lectric Law Library's Lexicon)

Indirect relevance
Provides indirect evidence on the topic, indirect aboutness (data from which one could infer, with some probability, something about the topic; what in law is known as circumstantial evidence). Such evidence often deals with events or circumstances that could not have happened or would not normally have happened unless the event or circumstance of interest (to be proven) has happened. It may also deal with events or circumstances that precede the events or circumstances of interest, either enabling them (establishing their possibility) or establishing their impossibility. This category takes precedence over context. One could say that whatever provides indirect evidence also provides context (but the reverse is not true).
Circumstances, Circumstantial Evidence: Circumstantial evidence is best explained by saying what it is not: it is not direct evidence from a witness who saw or heard something. Circumstantial evidence is a fact that can be used to infer another fact.

Context
Provides background / context for the topic, sheds additional light on a topic, facilitates understanding that some piece of information is directly on topic. So this category covers a variety of things: things that influence, set the stage, or provide the environment for what the user asks for. (To take the law analogy again: anything in the history of a person who has committed a crime that might explain why he committed it.) Includes support for or hindrance of an activity that is the topic of the query, and activities or circumstances that immediately follow on the activity or circumstance of interest. In a way, this category is broader than indirect. If a context element can serve as indirect evidence, indirect takes precedence.

Comparison
Provides information on similar / parallel situations or on a contrasting situation for comparison. The basic theme of what the user is interested in, but played out in a different place or time or type of situation. Comparable segments will be those segments that provide information either on similar/parallel topics, or on contrasting topics. This type of relevance relationship identifies items that can aid understanding of the larger framework, perhaps contributing to identification of query terms or revision of search strategies. An example would be a segment in which an interviewee describes activities like those described in a topic description, but which occurred at a different place or time than the topic description specifies.

Pointer
Provides pointers to a source of more information. This could be a person, group, another segment, etc. Pointers will be segments that provide suggestions or explicit evidence of where to find more relevant information. An example of a pointer segment would be one in which an interviewee identifies another interviewee who had personal experiences directly associated with the topic. The value of these segments is in identifying other relevant segments, particularly but not limited to segments about a topic.

Quality Assurance
20 topics were redone, 10 were reviewed.
Redo: A second assessor did a topic from scratch.
Review: A second assessor reviewed the first assessor's work and did additional searches when needed.
Assessors would then get together to discuss their interpretation of the topic and resolve differences in relevance judgments. Assessors kept notes on the process.

Looking Forward: 2006 Working systems in five languages –Real users searching real data Rich experience beyond broadcast news –Frameworks, components, systems Affordable application-tuned systems –Oral history, lectures, speeches, meetings, …

For More Information The MALACH project – NSF/EU Spoken Word Access Group – Speech-based retrieval –