Published by June Sherman. Modified over 9 years ago.
1
Language Technology Research Serving eHumanities
New Ways of Accessing the USC Shoah Foundation Archive in the Center for Visual History Malach
Jan Hajič
Institute of Formal and Applied Linguistics, Computer Science School
Charles University in Prague, Czech Republic
malach@knih.mff.cuni.cz | http://www.malach-centrum.cz
2
21.11.2012 J. Hajic: CVHM & Language Technology
From Testimonies to Flexible Access
- The USC VHI Archive
  - Testimonies of Holocaust survivors
- Center for Visual History Malach
  - Access Point to the USC Archive
  - Activities of CVHM
- Access using New Technology
  - Fulltext (transcript) Search
  - Cross-lingual Access
  - Thesaurus Translation
- Status and Future Plans
3
Center for Visual History Malach
Access Point to the USC VHI’s Archive, http://www.usc.edu/vhi
4
Contents of the Archive
Testimonies recorded in the 1990s
5
Recording the Testimonies
- Visual History Foundation
  - California, USA (Universal Studios)
  - 1990s
  - Analog video recording technology, 30-minute tapes
  - Teams of 3 people (moderator, video, audio)
- Volume
  - 56 countries, over 105,000 hours of video
  - 32 languages, ~52,000 testimonies
  - Half of them in English
6
Archiving the Testimonies
- Digitization
  - 100s of terabytes of data (NTSC/PAL quality)
- Cataloguing (indexing)
  - Thesaurus (55,000 keywords): hierarchical, timeline, places
- Goal:
  - Access (search)
  - Material for projects
7
Access (Search)
- Search by keywords
  - At 1-minute segments, beginning of topic
- Search by particular people, relations
- Filter search by
  - Language spoken
  - Country of survivor
  - Experience (survivor/liberator/...)
- Not possible: “fulltext” search
- Video access: locally available, or on order
- Player: usual controls, also by segment, search within video
8
Access Points
- Internet: only limited access so far
  - Throughput (technical limitations), legal & ethical issues, ...
- → Access Points
  - ~30 worldwide (USA; EU: Berlin, Budapest, Prague, Warsaw; secondary access available)
  - 2-20% of full archive locally
  - Fast “Internet2” connection
  - Additional Services
- Search and view: standard Internet browser
9
Center for Visual History Malach
Charles University in Prague, est. 2009, coordinator: Jakub Mlynář
10
Center for Visual History Malach
- Supported by Charles University
  - Faculty of Mathematics and Physics, CS School
  - CS School Library & Institute of Formal and Applied Linguistics
  - Part of LINDAT-Clarin, Language Data Infrastructure
  - Clarin ERIC – Pan-European network of LTH Centers
- 12 workplaces, AV technology, materials
- Technology (by Inst. of Formal and Applied Linguistics):
  - 1 Gbit network locally, dataserver (for video cache)
  - 2000 testimonies locally (all Czech, Slovak, Polish, many in English)
  - Geant connection, 5-10 min. for 30 min. video from USC
11
Center for Visual History Malach: Activities
- Seminars
  - Anniversary seminar (January)
  - Seminars for students, teachers
  - Also: foreign visitors (Ukraine – summer 2012)
- Workshops
  - Co-organization of the Raoul Wallenberg 100th Anniversary workshop, Nov. 2012
  - w/Czech Parliament, Jewish Museum in Prague, Embassies
- Tutorials
  - Using the Archive, How-To-...
- Research on Language Technology (with Institute of Formal and Applied Linguistics)
12
Center for Visual History Malach: Activities
- Newsletter
- Web:
13
Center for Visual History Malach Visitors
- Students
- Teachers, Researchers
- Journalists, Writers, Filmmakers
- Other (personal reasons, etc.)
mid-2010 – fall 2012
14
Why “Malach”?
- Technology and UI Research Project 2002-2007
  - Multilingual Access to Large Audio arCHives
  - malach – “angel” in Hebrew
- Support: NSF (National Science Foundation)
  - Visual History Foundation (predecessor of SFI/USC)
  - IBM Research, Yorktown Heights, NY, USA
  - Johns Hopkins Univ., Baltimore, MD, USA
  - Univ. of Maryland, College Park, MD, USA
  - Charles University in Prague, CZ (IFAL MFF UK)
  - Univ. of West Bohemia, Pilsen, CZ (Dept. of Cybernetics)
15
Research in the Malach Project
- Research in the area of
  - Automatic Speech Recognition (of the testimonies)
    - English, Czech, Slovak, Russian, Polish, Hungarian
  - Automatic Translation of Thesaurus
    - Keyword translation Czech, English
  - Cross-lingual Audio/Voice Search
    - Part of the world-wide CLEF 2006, 2007 competition
  - User interfaces
    - → current VHA search interface
17
Automatic Speech Recognition
- Core “Front-end” Technology
- Current State-of-the-Art: 95% in controlled conditions
- Problems:
  - English: non-native speakers (virtually all 26,000!)
  - Czech: colloquial speech
  - All: emotions, elderly people, imperfect recording
- Technology issues: not enough in-domain texts
- Some improvement reached by 2007
19
The AMALACH Project
- Applied research project, 2012-2015
  - Implement and integrate (some) MALACH project results
  - Czech National Cultural Heritage Funding
  - Partners: Charles Univ., Univ. of West Bohemia (and USC)
- Selling point: improved access for local (Czech) researchers
  - USC Archive: 558 Czech-language testimonies
    - only a fraction (~12%) of 4613 Czech survivors!
    - Rest: mostly English spoken
  - Also: 12,500 segments containing keyword “Czech”
- Solution: cross-lingual fulltext-like search
  - Needs speech recognition, automatic translation, thesaurus
20
Cross-lingual Search Scheme
[Diagram: archive transcript & query translation]
- Offline: the archive (all audio) → ASR in multiple languages → transcripts (A, B, C, ... Z), split into segments (Seg. 1 in A, Seg. 2 in A, …, Seg. N in A; Seg. 1 in B, Seg. 2 in B, …) → translation to E → archive transcript in E
- User query processing: query in A → translation to E → query in E → monolingual search
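The scheme on this slide can be illustrated with a minimal Python sketch. It is not the project's implementation: the word-by-word lexicon `CS_TO_EN` stands in for a real machine translation system, the segment IDs and transcripts are invented, and ASR is assumed to have already produced the transcripts. The sketch only shows the division of labour: transcripts are translated into the pivot language E (English here) and indexed offline, while at query time only the query is translated before an ordinary monolingual search.

```python
from collections import defaultdict

# Hypothetical toy lexicon standing in for a machine translation system.
CS_TO_EN = {"válka": "war", "rodina": "family", "škola": "school", "tábor": "camp"}

def translate_to_english(text, lexicon):
    """Word-by-word lookup; words missing from the lexicon pass through unchanged."""
    return " ".join(lexicon.get(word, word) for word in text.lower().split())

def build_index(segments):
    """OFFLINE: translate every (segment_id, language, transcript) into English
    and record which segments contain which English words."""
    index = defaultdict(set)
    for segment_id, language, transcript in segments:
        english = transcript if language == "en" else translate_to_english(transcript, CS_TO_EN)
        for word in english.lower().split():
            index[word].add(segment_id)
    return index

def search(query, query_language, index):
    """QUERY TIME: translate the query into English, then return the segments
    that contain all of the translated query words."""
    english = query if query_language == "en" else translate_to_english(query, CS_TO_EN)
    words = english.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Invented example data: two Czech segments and one English segment.
SEGMENTS = [
    ("seg-A-1", "cs", "rodina a válka"),
    ("seg-A-2", "cs", "škola"),
    ("seg-B-1", "en", "my family before the war"),
]
INDEX = build_index(SEGMENTS)
```

With this toy data, a Czech query such as `search("rodina válka", "cs", INDEX)` matches both the Czech segment and the English one, which is the point of searching in a shared pivot language.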
21
Phonetic and Word Search (monolingual)
[Diagram components: Automatic Speech Recognition (Univ. of WB); Word and Phonetic Lattice; Transcript Database; Search System]
Example entries shown: VHF04106-0047.18, VHF04167-0146.32, VHF05103-0192.98, …
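To make the idea of combined word and phonetic search concrete, here is a small Python sketch. It deliberately swaps in a much simpler technique than the lattice search named on the slide: an exact word index plus a crude sound-alike key (diacritics dropped, a few letters unified, repeated letters collapsed). The function names, the normalisation rules, and the segment IDs are all assumptions made for illustration.

```python
import unicodedata
from collections import defaultdict

def phonetic_key(word):
    """Crude sound-alike key: drop diacritics, unify a few letters, collapse repeats.
    This is an illustrative stand-in, not a real phonetic lattice."""
    stripped = "".join(
        ch for ch in unicodedata.normalize("NFD", word.lower())
        if not unicodedata.combining(ch)
    )
    stripped = stripped.replace("w", "v").replace("y", "i")
    key = []
    for ch in stripped:
        if not key or key[-1] != ch:
            key.append(ch)
    return "".join(key)

def build_indexes(transcripts):
    """Index every transcript word both literally and by its phonetic key."""
    word_index = defaultdict(set)
    phonetic_index = defaultdict(set)
    for segment_id, text in transcripts:
        for word in text.split():
            word_index[word.lower()].add(segment_id)
            phonetic_index[phonetic_key(word)].add(segment_id)
    return word_index, phonetic_index

def find(term, word_index, phonetic_index):
    """Try an exact word match first; fall back to the phonetic key."""
    exact = word_index.get(term.lower(), set())
    if exact:
        return set(exact)
    return set(phonetic_index.get(phonetic_key(term), set()))

# Invented example transcripts with hypothetical segment IDs.
TRANSCRIPTS = [
    ("segment-001", "transport do Terezína"),
    ("segment-002", "Warszawa ghetto"),
]
WORDS, PHONES = build_indexes(TRANSCRIPTS)
```

The fallback matters for names and places: a user typing `Terezina` without the diacritic gets no exact hit, but the phonetic key still leads to the segment that mentions `Terezína`.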
22
Machine Translation
- State-of-the-Art
  - Cf. Google (currently best for most language pairs)
  - Still imperfect (applications need varying levels of quality)
- Machine translation of speech transcripts
  - Big challenge: VERY noisy input
    - Speech recognition errors
    - Ungrammatical, non-native, emotional language
- Good news
  - Used in search only (will probably never be shown to users)
23
Statistical Machine Translation Technology
- The idea (1940s/1990s) - imagine this:
  [Diagram: Czech text, English text, “Coding”]
- Translation by the reverse process: “decoding”
  - Probabilistic model of the translation process
  - And probabilistic model of the target language
  - Probabilities learned from (human) translations
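The "coding"/"decoding" idea is usually written as the noisy-channel formula. The slide itself does not give it, so the notation below is an assumption: $c$ is the observed Czech sentence, $e$ a candidate English sentence, and translation is assumed to go from Czech into English.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid c)
        \;=\; \arg\max_{e} \frac{P(c \mid e)\,P(e)}{P(c)}
        \;=\; \arg\max_{e} \underbrace{P(c \mid e)}_{\text{translation model}}\;
                           \underbrace{P(e)}_{\text{target language model}}
```

The denominator $P(c)$ can be dropped because it does not depend on $e$. The two remaining factors correspond to the two models listed on the slide, and both are estimated from data, with the translation model learned from human translations.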
24
Speech and Language Technology in Search
[Diagram: the cross-lingual search scheme]
- Offline: the archive (all audio) → ASR in multiple languages → transcripts (A, B, C, ... Z), split into segments (Seg. 1 in A, Seg. 2 in A, …, Seg. N in A; Seg. 1 in B, Seg. 2 in B, …) → translation to E → archive transcript in E
- User query processing: query in A → translation to E → query in E → monolingual search
25
Status and Future Plans
- Czech testimonies
  - Monolingual Fulltext Search System operational in CVHM, users can use both VHA and the UWB UI
- English speech recognition of the testimonies
  - Work has started: data preparation ongoing
- Translation to Czech
  - Thesaurus: manually (high quality necessary)
    - Will be used in the current interface as well
  - Data: work ongoing, data preparation
    - “Lattice” translation experiments underway
- Cross-lingual search: work starts in 2013
26
Thank you!
- VHI: http://www.usc.edu/vhi
- Institute of Formal and Applied Linguistics: http://ufal.mff.cuni.cz
- Center for Visual History Malach: http://malach-centrum.cz
- Dept. of Cybernetics, Univ. of West Bohemia, Pilsen, CZ: http://www.kky.zcu.cz
- The project “Malach”: http://malach.umiacs.umd.edu
27
Closing
Presented at: Preserving Survivors’ Memories. Digital Testimony Collections about Nazi Persecution: History, Education and Media
Wednesday, Nov 21, 2012, 11:00, Section A
http://www.preserving-survivors-memories.org