1
Supporting Access to Large Digital Oral History Archives
Dagobert Soergel
College of Information Studies, University of Maryland
2
Part 1. The Shoah Visual History Foundation Digital Library of Holocaust Survivor Testimonies
  The collection
  Cataloging
  Search
Part 2. MALACH: Research on Access to Speech Data
Part 3. The development of a MALACH test collection, or: The concept of relevance in history
3
Context: Shoah Foundation tasks
collecting and preserving survivor and witness testimony of the Holocaust
cataloging these testimonies to make them available
disseminating the testimonies for educational purposes to fight intolerance
enabling others to collect testimonies of other atrocities and historical events, or perhaps doing so itself
4
The collection
Survivors of the Shoah Visual History Foundation Digital Archive
Established 1994 by Steven Spielberg after filming Schindler’s List
52,000 Nazi Holocaust survivors, liberators, and witnesses from 57 countries
116,000 hours of speech in 32 languages (180 TB of digital video, 60 years of listening)
In the process of being manually cataloged
World’s largest and most complex single-topic archive of digitized videotaped oral history
5
Sample spoken text
Spoken words:
It wasn't everybody living in one in one one ghetto you know was a little like the in this street a was a house ghetto in this street it had ghetto but people couldn't people wasn't allowed to go out in the streets when they came in the Nazis came in he wanted they made a Jewish committee the Jewish committee have to help him take where to live and took out the furniture from from the from the Jewish people and so and Jewish committee had eighteen people with me also I helped the Jewish committee I mean the reason is they had eighteen people we walked the street everyday two two people two friends we walked on each street as people doesn't go out on the street we had a very very bad very very Nazi
Thesaurus terms: Conditions under German Occupation, Ghetto Procedures, Jewish Committee
6
VHF system architecture
7
Levels of cataloging
Testimony-level cataloging: Pre-Interview Questionnaire (PIQ)
Passage-level / segment-level cataloging
8
Key sections of the PIQ
Survivor information (name variants, vital dates, languages, education, occupations, military service, political identity, religious identity)
Prewar life (prewar address, affiliations)
Wartime (ghettos, camps, hiding, resistance, refugees, death marches)
Postwar (displaced persons camps)
Family background (parents, children, relatives, data about them)
9
Manual cataloging, original style
The cataloger identifies topically coherent segments (beginning and end to the second, avg. 3.5 min.) and assigns to each segment:
  persons mentioned
  a set of descriptors to identify one or more concepts or events, locations, and time periods
  a structured summary, 2-3 sentences
  objects such as maps, still images, and other segments (e.g., from documentary video) to provide context
The cataloger prepares a summary of the whole testimony
15 hrs per hour of video; 4,000 testimonies cataloged
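The segment-level record described on this slide could be modeled roughly as follows. This is only a sketch: the class name `Segment` and all field names are illustrative assumptions, not the VHF schema, and the example values are invented (the descriptors are borrowed from the sample on slide 5).

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One topically coherent segment of a testimony (hypothetical schema)."""
    start_sec: int                                             # segment start, to the second
    end_sec: int                                               # segment end, to the second
    persons: list[str] = field(default_factory=list)           # persons mentioned
    descriptors: list[str] = field(default_factory=list)       # thesaurus terms
    summary: str = ""                                          # structured summary, 2-3 sentences
    context_objects: list[str] = field(default_factory=list)   # maps, still images, ...

    def duration_min(self) -> float:
        """Length of the segment in minutes."""
        return (self.end_sec - self.start_sec) / 60


# Invented example: a segment running from 10:00 to 13:30.
seg = Segment(
    start_sec=600,
    end_sec=810,
    descriptors=["Ghetto Procedures", "Jewish Committee"],
)
print(seg.duration_min())  # 3.5
```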
10
Insert example here
11
Manual cataloging, supporting structures
VHF Thesaurus: 3,000 concepts, 30,000 place names; isa, whole-part, and associative relationships
A person database, which gives for each person mentioned in any PIQ or testimony:
  all names and aliases
  information about the pre-war, wartime, and postwar experiences of the person
  any other information that was provided about the person in either the PIQ or the testimony
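One way to picture typed thesaurus relationships is a small store keyed by relationship kind, with a traversal that collects all narrower concepts along "isa" links. This is a sketch under stated assumptions: the function names are hypothetical, and the example concepts and links are invented for illustration and are not actual VHF Thesaurus entries.

```python
from collections import defaultdict

# Hypothetical store: relationship kind -> concept -> related concepts.
relations = defaultdict(lambda: defaultdict(set))


def add_relation(kind: str, source: str, target: str) -> None:
    """Record that `source` is related to `target` by `kind`."""
    relations[kind][source].add(target)


def narrower_terms(concept: str) -> set[str]:
    """Return every concept that is, directly or transitively, an 'isa' child of `concept`."""
    found: set[str] = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for child, parents in relations["isa"].items():
            if current in parents and child not in found:
                found.add(child)
                stack.append(child)
    return found


# Invented example links (not taken from the VHF Thesaurus).
add_relation("isa", "concentration camps", "camps")
add_relation("isa", "labor camps", "camps")
add_relation("associative", "camps", "deportation")

print(sorted(narrower_terms("camps")))  # ['concentration camps', 'labor camps']
```

A search for a broad concept could then be expanded to include its narrower terms, while associative links would be offered only as suggestions.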
12
Real-time cataloging
Streamlined cataloging process
The cataloger assigns descriptors as she listens to the testimony.
Time-aligned to 1 minute
No segment boundaries
No segment summaries
No testimony summary
1.1 hours per hour of testimony, to be completed within 5 years
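The difference between the two cataloging rates can be made concrete with a little arithmetic. The sketch below uses the figures quoted in the slides (116,000 hours of speech, 15 cataloger hours per hour for the original style, 1.1 for real-time cataloging). Applying both rates to the entire collection is a simplifying assumption; it ignores the testimonies already cataloged in the original style.

```python
COLLECTION_HOURS = 116_000   # hours of speech in the collection
SEGMENT_RATE = 15            # cataloger hours per hour of video, original style
REAL_TIME_RATE = 1.1         # cataloger hours per hour of testimony, real-time

# Assumption: both rates are applied to the whole collection.
segment_effort = COLLECTION_HOURS * SEGMENT_RATE
real_time_effort = COLLECTION_HOURS * REAL_TIME_RATE

print(f"{segment_effort:,.0f}")                  # 1,740,000
print(f"{real_time_effort:,.0f}")                # 127,600
print(round(SEGMENT_RATE / REAL_TIME_RATE, 1))   # 13.6
```

Under these assumptions, real-time cataloging needs roughly one fourteenth of the effort of segment-level cataloging.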
13
Search
Two interacting levels:
the whole-testimony level, supported by Pre-Interview Questionnaire (PIQ) data
the within-testimony level, supported by cataloging data, which enable both browsing within testimonies and retrieval access to specific places within testimonies
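The interaction of the two levels might look like the following sketch: PIQ data narrows the set of testimonies, and cataloging data then locates places within them. The data layout, identifiers, and the `search` function are assumptions made for illustration; they do not describe the actual VHF system.

```python
# Hypothetical in-memory data; identifiers and values are invented.
piq = {
    "t1": {"languages": {"English"}},
    "t2": {"languages": {"Czech"}},
}
# testimony id -> list of (minute offset, descriptors assigned at that point)
catalog = {
    "t1": [(12, {"Ghetto Procedures"}), (40, {"Jewish Committee"})],
    "t2": [(5, {"Ghetto Procedures"})],
}


def search(language: str, descriptor: str) -> list[tuple[str, int]]:
    """Filter testimonies by PIQ data, then find matching places within them."""
    hits = []
    for tid, record in piq.items():
        if language not in record["languages"]:
            continue  # whole-testimony level: PIQ filter
        for minute, descriptors in catalog.get(tid, []):
            if descriptor in descriptors:  # within-testimony level
                hits.append((tid, minute))
    return hits


print(search("English", "Ghetto Procedures"))  # [('t1', 12)]
```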
14
Solutions to high cost of providing access
Streamlined cataloging
Automatic speech recognition and metadata creation through natural language processing
  using already cataloged testimonies as training data
  using the whole collection as a test bed
15
A Research Agenda on Access to Speech Data
Part 1. The Shoah Visual History Foundation Digital Library of Holocaust Survivor Testimonies
Part 2. A Research Agenda on Access to Speech Data
  User requirements for oral history data
  Access system architecture
  Automatic speech recognition
  Metadata creation from speech recognition data
  Retrieval algorithms
  The MALACH project
16
Multilingual Access to Large spoken ArCHives
MALACH: NSF grant
The objective of MALACH is to dramatically improve access to large multilingual spoken archives by capitalizing on the unique characteristics of the Survivors of the Shoah Visual History Foundation's multimedia digital archive of oral histories.
17
MALACH Partners
Survivors of the Shoah Visual History Foundation (VHF)
  Cataloging, user workshops, project management
IBM T. J. Watson Research Center, Human Language Technologies
  Automatic Speech Recognition (ASR) in English (French)
  Natural Language Processing: segmentation, classification
Johns Hopkins University, Center for Language and Speech Processing
  ASR in Czech
Charles University, Prague; University of West Bohemia
  ASR in Czech, Hungarian, Slovak, Russian
University of Maryland
  User requirements, use, information retrieval interfaces, usability
18
Access system architecture
User requirements
Thesaurus and linguistic databases
Person, place, event databases
Automatic Speech Recognition (ASR), segmentation, summarization, categorization, named entity recognition
Transcription
Manual cataloging
Professional catalogers
Users (collaborative)
Metadata store
User interface
Search interface
Examination interface
Information retrieval algorithms
21
Automatic speech recognition - challenges
spontaneous, emotional, disfluent speech
whispered speech
heavily accented speech
speech from elders
speech with background noise and frequent interruptions
speech that switches between languages
words, such as names, obscure locations, unknown events, etc., that are outside the recognizer lexicon
22
Automatic speech recognition - approaches
Methods for acoustic segmentation: dividing the acoustic signal into segments by the categories of speech (emotional speech, different languages, etc., see above) in order to adapt the acoustic model and perhaps the language model
Methods for rapidly adjusting the acoustic model to the speaker
Methods for optimizing the language model for retrieval, for example, by giving higher weight to words that are important for searching and/or for automatic classification
23
Automatic speech recognition - resources
Manual transcriptions as training data
  English: 200 hours
  Czech: 50 hours
  Other languages planned
Lexical resources (to be discussed later)
24
Approaches to metadata creation
Named entity recognition: names and places (closely tied to ASR)
Automatic segmentation
Automatic assignment of subject descriptors to segments
Automatic summarization of segments, possibly using a template-filling approach
Automatic assignment of descriptors by accumulating evidence as the ASR transcription is read
Automatic creation of time-aligned themes
Automatic derivation of descriptors and a summary for testimonies as a whole
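The idea of assigning descriptors by accumulating evidence as the transcript is read can be sketched as a running score per descriptor that triggers an assignment once it passes a threshold. Everything here is an assumption for illustration: the lexicon entries, the weights, the threshold, and the function name are invented, and the word sequence is loosely abridged from the sample on slide 5. The slides do not specify the actual method.

```python
from collections import Counter

# Hypothetical classifier lexicon: term -> {descriptor: evidence weight}.
LEXICON = {
    "ghetto": {"Ghetto Procedures": 1.0},
    "committee": {"Jewish Committee": 1.0},
    "furniture": {"Ghetto Procedures": 0.5},
}


def assign_descriptors(words, threshold=1.5):
    """Yield (word index, descriptor) the first time a descriptor's
    accumulated evidence reaches `threshold`."""
    scores = Counter()
    emitted = set()
    for i, word in enumerate(words):
        for descriptor, weight in LEXICON.get(word.lower(), {}).items():
            scores[descriptor] += weight
            if scores[descriptor] >= threshold and descriptor not in emitted:
                emitted.add(descriptor)
                yield i, descriptor


# Loosely abridged from the sample transcript on slide 5.
words = (
    "one ghetto you know was a house ghetto they made a "
    "Jewish committee the Jewish committee"
).split()
print(list(assign_descriptors(words)))
# [(7, 'Ghetto Procedures'), (15, 'Jewish Committee')]
```

Because each assignment carries the word index at which the threshold was crossed, the result is naturally time-aligned once word indices are mapped to ASR timestamps.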
25
Retrieval
Types of evidence used:
  Metadata available from the pre-interview questionnaires
  Whole testimony summaries
  Time-aligned phonemes and terms
  Time-aligned subject descriptors from the thesaurus
  Segment boundaries
  Segment summaries, time-aligned themes
Will experiment with different combinations and weighting schemes
Cross-language retrieval
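A weighting scheme over several types of evidence can be pictured as a weighted sum of per-evidence match scores. The sketch below is one simple possibility, not the scheme used in MALACH: the evidence names, the weights, and the example scores are all invented, since the slides give no values.

```python
# Hypothetical evidence types and weights; the slides do not give values.
WEIGHTS = {
    "piq": 0.2,
    "asr_terms": 0.3,
    "descriptors": 0.4,
    "segment_summary": 0.1,
}


def combined_score(evidence: dict[str, float],
                   weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of per-evidence match scores, each assumed to lie in [0, 1]."""
    return sum(weights[name] * evidence.get(name, 0.0) for name in weights)


# Invented match scores for two candidate segments.
segments = {
    "seg_a": {"asr_terms": 1.0, "descriptors": 0.5},
    "seg_b": {"piq": 1.0, "segment_summary": 1.0},
}
ranked = sorted(segments, key=lambda s: combined_score(segments[s]), reverse=True)
print(ranked)  # ['seg_a', 'seg_b']
```

Experimenting with different combinations then amounts to swapping in a different `weights` dictionary and comparing the resulting rankings.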
26
Lexical resources
Thesaurus: subject descriptors, place names
Person database with names and aliases
Speech recognizer lexicon and language model
Classifier lexicon, linking terms with descriptors
Translation resources: dictionaries and probabilistic language-to-language mapping tables
27
User interface
Focus on three issues:
  Help users formulate their query, in particular finding good thesaurus descriptors
  Help users interact with testimonies in an interlinked network of information on people, places, and events
  Integrate search, reading testimonies, and the user’s own work
29
Some results
30
User requirements analysis methods
Discount requirements analysis
Consult experts and literature on potential users and the nature of their work
Users as informants
Talk to curators about intended use of collection
Informed intuition
Request analysis – insights from actual use
280 “Advance Access” requests, coded by discipline, end product, access points, pieces of information required, etc.
Observe users – cognitive processes, usability
User workshops: scholars, teachers
31
User requirements for oral history
A wide variety of users and uses
Arts, humanities, and social sciences
History
Social sciences
Literature and linguistics
Publishing and journalism
Material and non-material culture
Education
Science
Psychology
Law enforcement
32
User requirements for oral history
Access by person, place, and time
Access by abstract concepts such as
  Jewish-Gentile relations
  reasons for post-war emigration to non-European countries
  psychological processing of Holocaust memories
  material suitable for fourth-graders
The VHF Thesaurus, built by subject experts, includes many such abstract concepts that support searches of this type.
33
User requirements for oral history
For history and education: Importance of context
Interview mentions: Person, Place, Time, Event
More info on this person
More info on this place
More info on this time
More info on this event
More info on event at time
More info on related event
More info on related policy
34
Teacher Workshop July 2003
Eight teachers, grades 6-12
Goal: Develop an “edited reel” for use in the classroom and associated lesson plans and educational activities
Themes (determined before the workshop, refined during the workshop):
  Character-defining moments
  Difference (e.g., can one person make a difference?)
  The power of language
35
Workshop process
Two groups of teachers alternating between
  group lesson planning sessions (2 hours)
  individual or paired searching (2 hours)
“Free write”
Patio chats
Plenary sessions, including screening of a very rough cut edited reel
36
Research questions
How do lesson planning and searching the collection / finding interesting material interact?
How do teachers approach the searching task? What knowledge do they bring to it? What knowledge are they missing? What criteria do they use in judging relevance?
What functionality does a retrieval system need to support material selection for curricular uses?
How well do the VHA system and the VHF thesaurus support searching for curricular materials?
37
Methods
Qualitative study based on the following data
Data on searching
  Observation notes of selected searches (five observers)
  Think-aloud protocol, intermittent, but capturing interactions with intermediaries
  Worksheets on segments
  Interaction logs
Data on lesson planning (groups and plenary)
  Observer notes
  Audio and video tapes
  Lesson planning worksheets
Teacher’s notes and free-writes, group chat tapes & notes
Three structured interviews (15/15/45 min.) (tapes + interviewer notes)
Five brief evaluation questionnaires
38
Searching – lesson planning interaction
39
Teachers' relevance criteria 1
(1) Relationship to theme
(2) Specific topic less important (different from many researchers for whom specific topic, such as name of camp or physician as aid giver, is important)
(3) Match of the demographic characteristics of the interviewee or of a people about whom the interview tells a story with the demographic characteristics of the intended audience; specifically: age of the interviewee at the time of the events recounted as related to the age of the students for whom a lesson is intended.
(4) Language of the segment (which is not always the predominant language of the testimony).
40
Teachers' relevance criteria 2
(5) Age-appropriateness of the material
(6) Acceptability to the other stakeholders (parents, administrators, etc.)
(7) The extent to which the segment provides a useful basis for learning and internalizing vocabulary
(8) The extent to which the segments can be used in several subjects, for example, English, history, and art.
(9) Ease of comprehension, mainly clarity of enunciation
(10) Expressive power, both body language and voice
(11) Length of the segment, in relation to what it contributes
(12) Does the segment communicate enough context? (This criterion and the previous one are specific to clips to be included in a reel. Such clips can be edited from several different pieces of a testimony, yet teachers sometimes appeared to base their assessment of these criteria on segments as they were created in cataloguing rather than imagining what could be made from a segment through editing, possibly augmented by short pieces from earlier in the testimony to provide context.)
45
NSF grant IIS