Large Digital Oral History Archives

Supporting Access to Large Digital Oral History Archives
Dagobert Soergel, College of Information Studies, University of Maryland

Part 1. The Shoah Visual History Foundation Digital Library of Holocaust Survivor Testimonies
- The collection
- Cataloging
- Search
Part 2. MALACH: Research on Access to Speech Data
Part 3. The development of a MALACH test collection, or: the concept of relevance in history

Context: Shoah Foundation tasks
- collecting and preserving survivor and witness testimony of the Holocaust
- cataloging these testimonies to make them available
- disseminating the testimonies for educational purposes to fight intolerance
- enabling others to collect testimonies of other atrocities and historical events, or perhaps doing so itself

The collection
Survivors of the Shoah Visual History Foundation Digital Archive
- Established 1994 by Steven Spielberg after filming Schindler’s List
- 52,000 Nazi Holocaust survivors, liberators, and witnesses from 57 countries
- 116,000 hours of speech in 32 languages (180 TB of digital video, 60 years of listening)
- In the process of being manually cataloged
- World’s largest and most complex single-topic archive of digitized videotaped oral history
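The "60 years of listening" figure can be checked with a little arithmetic. The sketch below uses the hours and terabytes quoted on the slide; the listening schedule (8 hours a day, 250 days a year) is an assumption made here, not something the slide states.

```python
# Figures quoted on the slide.
TOTAL_HOURS = 116_000
TOTAL_TERABYTES = 180


def years_of_listening(total_hours, hours_per_day, days_per_year):
    """Years needed to listen to everything at the given schedule."""
    return total_hours / (hours_per_day * days_per_year)


def gigabytes_per_hour(total_terabytes, total_hours):
    """Average storage per hour of video, using 1 TB = 1000 GB."""
    return total_terabytes * 1000 / total_hours


# Assumption (not from the slide): 8 hours a day, 250 days a year.
working_years = years_of_listening(TOTAL_HOURS, 8, 250)
# For comparison: listening around the clock, every day of the year.
continuous_years = years_of_listening(TOTAL_HOURS, 24, 365)
average_gb_per_hour = gigabytes_per_hour(TOTAL_TERABYTES, TOTAL_HOURS)

print(f"{working_years:.1f} years at 8 hours a day, 250 days a year")
print(f"{continuous_years:.1f} years of continuous listening")
print(f"{average_gb_per_hour:.2f} GB per hour of video on average")
```

Under the assumed working schedule the result is close to the slide's rounded 60 years; continuous listening would take roughly 13 years.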

Sample spoken text
Spoken words: It wasn't everybody living in one in one one ghetto you know was a little like the in this street a was a house ghetto in this street it had ghetto but people couldn't people wasn't allowed to go out in the streets when they came in the Nazis came in he wanted they made a Jewish committee the Jewish committee have to help him take where to live and took out the furniture from from the from the Jewish people and so and Jewish committee had eighteen people with me also I helped the Jewish committee I mean the reason is they had eighteen people we walked the street everyday two two people two friends we walked on each street as people doesn't go out on the street we had a very very bad very very Nazi
Thesaurus terms: Conditions under German Occupation, Ghetto Procedures, Jewish Committee

VHF system architecture

Levels of cataloging
- Testimony-level cataloging: Pre-Interview Questionnaire (PIQ)
- Passage-level / segment-level cataloging

Key sections of the PIQ
- Survivor information (name variants, vital dates, languages, education, occupations, military service, political identity, religious identity)
- Prewar life (prewar address, affiliations)
- Wartime (ghettos, camps, hiding, resistance, refugees, death marches)
- Postwar (displaced persons camps)
- Family background (parents, children, relatives, data about them)

Manual cataloging, original style
The cataloger identifies topically coherent segments (beginning and end to the second, avg. 3.5 min.) and assigns to each segment:
- persons mentioned
- a set of descriptors to identify one or more concepts or events, locations, and time periods
- a structured summary, 2-3 sentences
- objects such as maps, still images, and other segments (e.g., from documentary video) to provide context
The cataloger also prepares a summary of the whole testimony.
15 hours per hour of video; 4,000 testimonies cataloged

Insert example here

Manual cataloging, supporting structures
- VHF Thesaurus: 3,000 concepts, 30,000 place names; is-a, whole-part, and associative relationships
- A person database, which gives for each person mentioned in any PIQ or testimony all names and aliases; information about the pre-war, wartime, and postwar experiences of the person; and any other information that was provided about the person in either the PIQ or the testimony
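The three relationship types named for the thesaurus can be sketched as a small data structure. This is only an illustration under assumptions: the `Thesaurus` class, its method names, and every link in the example are hypothetical and do not reproduce the actual VHF Thesaurus structure or content.

```python
from collections import defaultdict


class Thesaurus:
    """Toy thesaurus with is-a, whole-part, and associative links."""

    def __init__(self):
        self.isa = defaultdict(set)      # narrower concept -> broader concepts
        self.part_of = defaultdict(set)  # part -> wholes
        self.related = defaultdict(set)  # symmetric associative links

    def add_isa(self, narrower, broader):
        self.isa[narrower].add(broader)

    def add_part_of(self, part, whole):
        self.part_of[part].add(whole)

    def add_related(self, first, second):
        # Associative links are stored in both directions.
        self.related[first].add(second)
        self.related[second].add(first)

    def ancestors(self, term):
        """All terms reachable upward through is-a or whole-part links."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            parents = self.isa.get(current, set()) | self.part_of.get(current, set())
            for parent in parents:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical entries, chosen only to exercise each relationship type.
thesaurus = Thesaurus()
thesaurus.add_isa("Ghetto Procedures", "Conditions under German Occupation")
thesaurus.add_part_of("Warsaw ghetto", "Warsaw")
thesaurus.add_part_of("Warsaw", "Poland")
thesaurus.add_related("Jewish Committee", "Ghetto Procedures")

print(sorted(thesaurus.ancestors("Warsaw ghetto")))
print(sorted(thesaurus.related["Ghetto Procedures"]))
```

Walking upward through is-a and whole-part links is one way a search for a broad concept or a country could also reach testimony indexed under narrower terms.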

Real-time cataloging
Streamlined cataloging process
- Cataloger assigns descriptors as she listens to the testimony
- Time-aligned to 1 minute
- No segment boundaries
- No segment summaries
- No testimony summary
1.1 hours per hour of testimony, to be completed within 5 years
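The difference between the two cataloging styles is easiest to see as total effort. The sketch below applies both rates (15 hours and 1.1 hours per hour of video) to the full 116,000 hours quoted for the collection. The 2,000 working hours per cataloger per year is an assumption made here, and the calculation ignores the 4,000 testimonies already cataloged in the original style, so the staffing figure is a rough upper bound.

```python
TOTAL_VIDEO_HOURS = 116_000   # collection size quoted for the archive
ORIGINAL_RATE = 15.0          # cataloging hours per hour of video, original style
REAL_TIME_RATE = 1.1          # cataloging hours per hour of video, real-time style
COMPLETION_YEARS = 5          # target stated for real-time cataloging

# Assumption (not from the slides): one cataloger works 2,000 hours a year.
HOURS_PER_CATALOGER_YEAR = 2_000


def cataloging_hours(video_hours, rate):
    """Total cataloger time needed for the given amount of video."""
    return video_hours * rate


original_total = cataloging_hours(TOTAL_VIDEO_HOURS, ORIGINAL_RATE)
real_time_total = cataloging_hours(TOTAL_VIDEO_HOURS, REAL_TIME_RATE)
catalogers_needed = real_time_total / COMPLETION_YEARS / HOURS_PER_CATALOGER_YEAR
effort_ratio = ORIGINAL_RATE / REAL_TIME_RATE

print(f"Original style:  {original_total:,.0f} cataloger hours")
print(f"Real-time style: {real_time_total:,.0f} cataloger hours")
print(f"About {catalogers_needed:.1f} full-time catalogers for {COMPLETION_YEARS} years")
print(f"Real-time cataloging needs about 1/{effort_ratio:.0f} of the effort")
```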

Search
Two interacting levels:
- the whole-testimony level, supported by Pre-Interview Questionnaire (PIQ) data
- the within-testimony level, supported by cataloging data, which enable both browsing within testimonies and retrieval access to specific places within testimonies
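The interaction of the two levels can be sketched as a filter followed by a lookup: PIQ data select whole testimonies, and cataloging data then locate places inside them. The record layout, field names, and sample values below are hypothetical and only illustrate the idea; they are not the VHF system's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start_seconds: int   # where the cataloged passage begins
    descriptors: set     # thesaurus descriptors assigned to the passage


@dataclass
class Testimony:
    testimony_id: str
    piq: dict            # testimony-level data from the questionnaire
    segments: list = field(default_factory=list)


def search(testimonies, piq_criteria, descriptor):
    """Return (testimony_id, start_seconds) for matching passages.

    Level 1: keep testimonies whose PIQ data match every criterion.
    Level 2: within those, find passages carrying the descriptor.
    """
    hits = []
    for testimony in testimonies:
        matches_piq = all(
            testimony.piq.get(key) == value for key, value in piq_criteria.items()
        )
        if not matches_piq:
            continue
        for segment in testimony.segments:
            if descriptor in segment.descriptors:
                hits.append((testimony.testimony_id, segment.start_seconds))
    return hits


# Hypothetical sample records.
archive = [
    Testimony(
        "T1",
        {"interview_language": "English"},
        [
            Segment(0, {"Prewar life"}),
            Segment(210, {"Ghetto Procedures", "Jewish Committee"}),
        ],
    ),
    Testimony(
        "T2",
        {"interview_language": "Czech"},
        [Segment(60, {"Ghetto Procedures"})],
    ),
]

results = search(archive, {"interview_language": "English"}, "Ghetto Procedures")
print(results)
```

With an empty set of PIQ criteria the same function searches passages across the whole archive, which is one way the two levels could be used independently.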

Solutions to high cost of providing access
- Streamlined cataloging
- Automatic speech recognition and metadata creation through natural language processing
  - using already cataloged testimonies as training data
  - using the whole collection as a test bed

Research on Access to Speech Data
Part 1. The Shoah Visual History Foundation Digital Library of Holocaust Survivor Testimonies
Part 2. A Research Agenda on Access to Speech Data
- User requirements for oral history data
- Access system architecture
- Automatic speech recognition
- Metadata creation from speech recognition data
- Retrieval algorithms
- The MALACH project

MALACH: Multilingual Access to Large spoken ArCHives
NSF grant, 2002-2006
The objective of MALACH is to dramatically improve access to large multilingual spoken archives by capitalizing on the unique characteristics of the Survivors of the Shoah Visual History Foundation's multimedia digital archive of oral histories.

MALACH Partners
- Survivors of the Shoah Visual History Foundation (VHF): cataloging, user workshops, project management
- IBM T. J. Watson Research Center, Human Language Technologies: automatic speech recognition (ASR) in English (French); natural language processing: segmentation, classification
- Johns Hopkins University, Center for Language and Speech Processing: ASR in Czech
- Charles University, Prague, and University of West Bohemia: ASR in Czech, Hungarian, Slovak, Russian
- University of Maryland: user requirements, use, information retrieval interfaces, usability

Access system architecture
Diagram components: user requirements; thesaurus and linguistic databases; person, place, event databases; automatic speech recognition (ASR), segmentation, summarization, categorization, named entity recognition; transcription; manual cataloging; professional catalogers; users (collaborative); metadata store; user interface (search interface, examination interface); information retrieval algorithms

Automatic speech recognition - challenges
- spontaneous, emotional, disfluent speech
- whispered speech
- heavily accented speech
- speech from elders
- speech with background noise and frequent interruptions
- speech that switches between languages
- words, such as names, obscure locations, unknown events, etc., that are outside the recognizer lexicon

Automatic speech recognition - approaches
- Methods for acoustic segmentation: dividing the acoustic signal into segments by the categories of speech (emotional speech, different languages, etc., see above) in order to adapt the acoustic model and perhaps the language model
- Methods for rapidly adjusting the acoustic model to the speaker
- Methods for optimizing the language model for retrieval, for example, by giving higher weight to words that are important for searching and/or for automatic classification

Automatic speech recognition - resources
- Manual transcriptions as training data
  - English: 200 hours
  - Czech: 50 hours
  - Other languages planned
- Lexical resources (to be discussed later)

Approaches to metadata creation
- Named entity recognition: names and places (closely tied to ASR)
- Automatic segmentation
- Automatic assignment of subject descriptors to segments
- Automatic summarization of segments, possibly using a template-filling approach
- Automatic assignment of descriptors by accumulating evidence as the ASR transcription is read
- Automatic creation of time-aligned themes
- Automatic derivation of descriptors and a summary for testimonies as a whole
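Assigning descriptors by accumulating evidence as the transcription is read can be illustrated with a very small sketch. It assumes a classifier lexicon that maps transcript terms to weighted descriptors and a fixed threshold; the lexicon entries, weights, threshold, and function name are all hypothetical, and the actual MALACH classifiers are not described in the slides. The transcript is an abridged excerpt of the sample spoken text.

```python
from collections import defaultdict

# Hypothetical classifier lexicon: transcript term -> {descriptor: weight}.
CLASSIFIER_LEXICON = {
    "ghetto": {"Ghetto Procedures": 1.0},
    "committee": {"Jewish Committee": 1.0},
    "nazis": {"Conditions under German Occupation": 1.0},
}


def assign_descriptors(words, lexicon, threshold):
    """Read words in order, add up evidence per descriptor, and assign a
    descriptor once its accumulated score reaches the threshold."""
    scores = defaultdict(float)
    assigned = []
    for word in words:
        for descriptor, weight in lexicon.get(word.lower(), {}).items():
            scores[descriptor] += weight
            if scores[descriptor] >= threshold and descriptor not in assigned:
                assigned.append(descriptor)
    return assigned, dict(scores)


# Abridged excerpt of the sample spoken text.
transcript = (
    "in this street a was a house ghetto in this street it had ghetto "
    "when they came in the Nazis came in they made a Jewish committee "
    "the Jewish committee have to help"
)

assigned, scores = assign_descriptors(transcript.split(), CLASSIFIER_LEXICON, 2.0)
print(assigned)
print(scores)
```

In this toy run a single mention of "Nazis" is not enough evidence to assign a descriptor, while the repeated "ghetto" and "committee" are; a real system would need weights learned from the already cataloged testimonies.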

Retrieval
Types of evidence used:
- Metadata available from the pre-interview questionnaires
- Whole-testimony summaries
- Time-aligned phonemes and terms
- Time-aligned subject descriptors from the thesaurus
- Segment boundaries
- Segment summaries, time-aligned themes
Will experiment with different combinations and weighting schemes
Cross-language retrieval
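One simple way to experiment with combinations and weighting schemes is a weighted sum over the evidence types. The sketch below is an assumption for illustration, not the MALACH retrieval algorithm: the evidence names, the per-source match scores, and both weighting schemes are invented.

```python
def combined_score(evidence, weights):
    """Weighted sum of per-source match scores; missing sources count as 0."""
    return sum(weight * evidence.get(source, 0.0) for source, weight in weights.items())


def rank(candidates, weights):
    """Order candidate passages by combined score, best first."""
    return sorted(
        candidates,
        key=lambda name: combined_score(candidates[name], weights),
        reverse=True,
    )


# Hypothetical match scores (0 to 1) of two passages against one query.
candidates = {
    "passage_a": {"descriptors": 1.0, "asr_terms": 0.2},
    "passage_b": {"asr_terms": 0.9, "segment_summary": 0.5},
}

# Two hypothetical weighting schemes to compare.
descriptor_heavy = {"descriptors": 0.5, "asr_terms": 0.3, "segment_summary": 0.2}
asr_heavy = {"descriptors": 0.1, "asr_terms": 0.7, "segment_summary": 0.2}

print(rank(candidates, descriptor_heavy))
print(rank(candidates, asr_heavy))
```

The two schemes rank the same passages in opposite orders, which is the kind of effect a comparison of weighting schemes would need to evaluate against relevance judgments.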

Lexical resources
- Thesaurus: subject descriptors, place names
- Person database with names and aliases
- Speech recognizer lexicon and language model
- Classifier lexicon, linking terms with descriptors
- Translation resources: dictionaries and probabilistic language-to-language mapping tables

User interface
Focus on three issues:
- Help users formulate their query, in particular finding good thesaurus descriptors
- Help users interact with testimonies in an interlinked network of information on people, places, and events
- Integrate search, reading testimonies, and the user’s own work

Some results

User requirements analysis methods
- Discount requirements analysis
- Consult experts and literature on potential users and the nature of their work
- Users as informants
- Talk to curators about intended use of collection
- Informed intuition
- Request analysis: insights from actual use; 280 “Advance Access” requests, coded by discipline, end product, access points, pieces of information required, etc.
- Observe users: cognitive processes, usability
- User workshops: scholars, teachers

User requirements for oral history
A wide variety of users and uses:
- Arts, humanities, and social sciences
- History
- Social sciences
- Literature and linguistics
- Publishing and journalism
- Material and non-material culture
- Education
- Science
- Psychology
- Law enforcement

User requirements for oral history
- Access by person, place, and time
- Access by abstract concepts such as
  - Jewish-Gentile relations
  - reasons for post-war emigration to non-European countries
  - psychological processing of Holocaust memories
  - material suitable for fourth-graders
The VHF Thesaurus, built by subject experts, includes many such abstract concepts that support searches of this type.

User requirements for oral history
For history and education: importance of context
Diagram: an interview mentions a person, place, time, and event. Linked context: more info on this person, this place, this time, this event, the event at that time, related events, and related policy.

Teacher Workshop, July 2003
- Eight teachers, grades 6-12
- Goal: develop an “edited reel” for use in the classroom, with associated lesson plans and educational activities
- Themes (determined before the workshop, refined during the workshop)
  - Character-defining moments
  - Difference (e.g., can one person make a difference?)
  - The power of language

Workshop process
- Two groups of teachers alternating between group lesson planning sessions (2 hours) and individual or paired searching (2 hours)
- “Free write”
- Patio chats
- Plenary sessions, including screening of a very rough cut edited reel

Research questions
- How do lesson planning and searching the collection / finding interesting material interact?
- How do teachers approach the searching task? What knowledge do they bring to it? What knowledge are they missing?
- What criteria do they use in judging relevance?
- What functionality does a retrieval system need to support material selection for curricular uses?
- How well do the VHA system and the VHF thesaurus support searching for curricular materials?

Methods
Qualitative study based on the following data:
- Data on searching
  - Observation notes of selected searches (five observers)
  - Think-aloud protocol, intermittent, but capturing interactions with intermediaries
  - Worksheets on segments
  - Interaction logs
- Data on lesson planning (groups and plenary)
  - Observer notes
  - Audio and video tapes
  - Lesson planning worksheets
- Teachers' notes and free-writes, group chat tapes and notes
- Three structured interviews (15/15/45 min.) (tapes + interviewer notes)
- Five brief evaluation questionnaires

Searching – lesson planning interaction

Teachers' relevance criteria 1
(1) Relationship to theme
(2) Specific topic less important (different from many researchers, for whom a specific topic, such as the name of a camp or a physician as aid giver, is important)
(3) Match of the demographic characteristics of the interviewee, or of the people about whom the interview tells a story, with the demographic characteristics of the intended audience; specifically, the age of the interviewee at the time of the events recounted as related to the age of the students for whom a lesson is intended
(4) Language of the segment (which is not always the predominant language of the testimony)

Teachers' relevance criteria 2
(5) Age-appropriateness of the material
(6) Acceptability to the other stakeholders (parents, administrators, etc.)
(7) The extent to which the segment provides a useful basis for learning and internalizing vocabulary
(8) The extent to which the segments can be used in several subjects, for example, English, history, and art
(9) Ease of comprehension, mainly clarity of enunciation
(10) Expressive power, both body language and voice
(11) Length of the segment, in relation to what it contributes
(12) Does the segment communicate enough context?
(Criteria 11 and 12 are specific to clips to be included in a reel. Such clips can be edited from several different pieces of a testimony, yet teachers sometimes appeared to base their assessment of these criteria on segments as they were created in cataloguing rather than imagining what could be made from a segment through editing, possibly augmented by short pieces from earlier in the testimony to provide context.)

www.vhf.org
www.clsp.jhu.edu/research/malach/
NSF grant IIS-0122466