Published by Nathan Hodges. Modified over 9 years ago.
1 Archiving
LingDy, 16 Feb 2012, TUFS, Tokyo
David Nathan
Endangered Languages Archive
Hans Rausing Endangered Languages Project
SOAS, University of London
2 What is an archive?
4 What is a digital language archive?
- a trusted repository created and maintained by an institution with a commitment to the long-term preservation of archived material
- has policies and processes for materials acquisition, cataloguing, preservation, dissemination, migration to new digital formats
- a platform for building and conducting relationships between data providers and data users
5 Why is language archiving different?
- what is a language?
- the data is not conventionalised (like $, age, year of publication etc.) – what and how to code?
- varying and competing expectations
6 And endangered languages archiving?
- extremely diverse context – languages, cultures, communities, individuals, projects
- typical source - fieldworkers
- typical materials - documentation
- difficult for archive staff to manage
- sensitivities and restrictions
- extremely high priority
7 What can a language archive offer?
- Security - keep your electronic materials safe
- Preservation - store your materials for the long term
- Discovery - help others to find out about your materials, and you to find out about users
- Protocols - respect and implement sensitivities, restrictions
- Sharing - share results of your work, if appropriate
- Acknowledgement - create citable acknowledgement
- Mobilisation - create usable language materials
- Quality and standards - advice for assuring your materials are of the highest quality and robust standards
8 Different kinds of language archives
- different contexts, systems, methods, collection policies
- you should consider placing your materials in more than one …
9 Why digital?
- preservation: digitisation is the only way that audio and video (non-symbolic material) can be preserved for the future … because it can be copied and transmitted with zero loss
- cataloguing, sharing, dissemination, repurposing
10 Digital disadvantages
- digital data is fragile and ephemeral
- cost (human, equipment, maintenance)
- requires strategy and luck to get right
- preservation depends on file and data formats, which depend on tools and software, which in turn depend on formats (prefer standard, open, explicit, long-lasting)
- materials may have to be converted and migrated
- some formats require particular software (can we archive the software?)
11 What is archiving of language materials?
- preparing materials
  - selecting
  - structuring
  - suitable encodings and formats
  - well-documented
- depositing them in a suitable archive(s)
- curation and accession by the archive
- ongoing management, dissemination
- new focus on form, presentation and user interaction/feedback
12 Users and potential users
- depositors – deposit, access or update materials
- speakers and their descendants (“majority of users of Berkeley Language Center archive are community members”)
- other researchers - comparative/historical linguists, typologists, theoreticians, anthropologists, historians, musicologists etc.
- other “stakeholders”, e.g. educationalists
- journalists and the wider public
13 Archives networks and bodies
- foundation concepts and technologies from library initiatives, e.g.
  - D-LIB http://www.dlib.org/
  - OAI (Open Archives Initiative)
  - OAIS (Open Archival Information System; NASA and space agencies incl. JAXA)
- Open Language Archives Community (OLAC)
- Digital Endangered Languages and Archives Network (DELAMAN)
- ELAR, DOBES, ANLC, Paradisec, EMELD, LACITO, AIATSIS, AMPM (Maori)
14 Citation examples from Heidi Johnson of AILLA
Collection:
Sherzer, Joel. "Kuna Collection." The Archive of the Indigenous Languages of Latin America: www.ailla.utexas.org. Media: audio, text, image. Access: 0% restricted.
File/resource:
Sherzer, Joel (Researcher). (1970). "Report of a curing specialist." Kuna Collection. Archive of the Indigenous Languages of Latin America: www.ailla.utexas.org. Type: transcription&translation. Media: text. Access: public. Resource ID: CUK001R001.
15 Endangered Languages ARchive (ELAR)
- one of 3 programs of the Hans Rausing Endangered Languages Project
- develop policies, preservation infrastructure, cataloguing and dissemination, facilities, training, advice, materials development and publishing
16 ELAR facts and figures
- archived collections: 110
- online (published) collections: 50
- average collection size: about 60 GB
- online data bundles: 9523
- total number of files held: around 200,000
- total volume of files held: around 10 TB
- online data bundles with unrestricted access: 5298
- registered users: >500
- annual downloads: >1,000
- annual number of website "hits": 230,000
17 ELAR facts and figures – user accounts
- increasing number of community members, including Aleut (Canada), Tai-Ahom, Wadar (India), Burushaski (Pakistan), Serrano, Cahuilla, Arapaho (USA), Iraqi Jewish (Iraq), Saami (Finland), Wabena (Tanzania), Torwali (Pakistan), Hani, Bai (China), Irish
- comments: “I found your site while looking up my grandmother, and i found her on your site speaking our language. and i would love for my children her great grandchildren to hear our language coming from her”.
- many interdisciplinary researchers, particularly archivists and anthropologists
18 Archiving and data management
- most data-related issues are really part of linguistic data/corpus management
- there are now few data-related issues that are archive-specific
  - metadata formats
  - video
  - presentation/exhibition of material
19 What can you archive (at ELAR)?
- media - sound, video
- graphics - images, scans
- texts - fieldnotes, grammars, description, analysis
- structured data - aligned and annotated transcriptions, databases, lexica
- metadata - contextual information about the materials, structured and unstructured
20 Archive objects
- an “object” could be a file, a set of files, a directory, or a set of files with their relationships explicitly defined
- these are often called “sessions” or “bundles”
- they should be made explicit through metadata
- our future catalogue system will provide the ability for depositors to directly create, label and update bundles
- See bundles at ELAR
21 Archive material should be selected
- example:
  - Depositor’s question: How much video can I archive?
  - answer:...
22 What is required to make a deposit?
- resource(s) for an endangered language
  - it could be just one file
- inventory / metadata
- deposit form
- existing deposits can also be updated, added to, and metadata added/modified
23 How can I deliver data?
- hard disks
  - we return them
  - we send them out
- email
  - good for samples for evaluation
  - OK for most text materials
- Dropbox etc.
- flash cards and USB sticks
- a web upload facility may be provided one day
- we download from your server
24 What about CDs and DVDs?
- we have found CDs, and especially DVDs, to be very unreliable
  - DVD fail rate > 10%
- cause confusion as files are allocated to fit on disks, not according to corpus structure
- create a lot of work for depositors and for ELAR
25 Protocol
- the sensitivities and access restrictions associated with EL resources need to be discussed, collected and recorded in the field
- global protocol (the overall, typical value) is entered into the deposit form
- specific protocol (for files, bundles) is entered via metadata (or any other explicit way)
26 Protocol and access control
- principles:
  - granularity – file, bundle or collection
  - access is a relation between object and user
  - protocol values can be changed over time
- ELAR’s URCS system
  - User
  - Researcher
  - Community member
  - Subscriber
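The principles on this slide can be sketched in code. This is only an illustration, not ELAR's actual system: the four role names come from the slide, but the rule that the most specific protocol wins, the deny-by-default fallback, and the names `effective_protocol` and `can_access` are all assumptions made for the example.

```python
# Roles named on the slide (URCS). How each role maps to permissions is
# NOT stated in the source; here a protocol is simply the set of roles
# allowed to access an object (an assumption for illustration).
ROLES = frozenset({"user", "researcher", "community member", "subscriber"})


def effective_protocol(file_id, bundle_id, collection_id, protocols):
    """Return the protocol recorded at the most specific level available.

    `protocols` maps (level, identifier) to a set of allowed roles.
    Granularity follows the slide: file, then bundle, then collection.
    """
    for key in (("file", file_id), ("bundle", bundle_id),
                ("collection", collection_id)):
        if key in protocols:
            return protocols[key]
    # Nothing recorded anywhere: deny by default (an assumption).
    return frozenset()


def can_access(user_roles, file_id, bundle_id, collection_id, protocols):
    """Access is a relation between an object and a user's roles."""
    allowed = effective_protocol(file_id, bundle_id, collection_id, protocols)
    return bool(set(user_roles) & allowed)
```

Because protocol values live in a lookup table rather than in the files themselves, they can be changed over time without touching the deposited material.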
27 “I have images”
- what kinds of images?
- what are their sources?
- what is their documentation value?
- what role do they play in the collection?
- … these should be reflected in the data structures/metadata
28 Metadata for images
- at least captions
- what else? …
- in what form?
  - narrative
  - tabular
  - fields
  - keywords
29 Integrating images into metadata
- get a list of image files
  - command (DOS) window in directory
  - type “dir > list.txt”
  - open text file (in Notepad++ or MS Word)
  - change font to Courier
  - get a “vertical selection”
  - (or use a file listing utility!)
  - paste into spreadsheet
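As an alternative to the command window and vertical selection, a short script can produce the same file list. This is a minimal sketch: the set of image extensions and the function names `list_image_files` and `write_listing` are assumptions chosen for the example, not part of the original workflow.

```python
from pathlib import Path

# Extensions treated as images here; adjust to match your own collection
# (this list is an assumption, not an ELAR requirement).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".gif", ".bmp"}


def list_image_files(directory):
    """Return the sorted names of image files directly inside `directory`."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def write_listing(directory, out_path):
    """Write one filename per line to `out_path` and return the names."""
    names = list_image_files(directory)
    Path(out_path).write_text("\n".join(names) + "\n", encoding="utf-8")
    return names
```

The resulting text file holds bare filenames only, so it can be pasted into a spreadsheet column without the clean-up that raw "dir" output needs.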
30 Integrating images into metadata
- make a new sheet for images
- paste in image file list (see previous)
- add an ID column
  - type “1” in first cell
  - select from first to last cell in ID column
  - Edit>Fill>Series>OK
- add other columns
- now you can refer to your images anywhere!
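The ID column step can also be done outside the spreadsheet. The sketch below numbers a list of filenames from 1 and writes a CSV that any spreadsheet program can open; the column headings and the function names `add_ids` and `write_image_sheet` are hypothetical.

```python
import csv


def add_ids(filenames, start=1):
    """Pair each filename with a running ID, like Edit>Fill>Series."""
    return [(i, name) for i, name in enumerate(filenames, start=start)]


def write_image_sheet(filenames, out_path):
    """Write an images sheet as CSV with an ID column and a filename column."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["ID", "filename"])  # hypothetical headings
        writer.writerows(add_ids(filenames))
```

Further columns (caption, source, keywords) can then be added in the spreadsheet, with the ID serving as the stable reference to each image.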
31 Using spreadsheet to access data
- you can turn a filename into a link to access files directly from a spreadsheet
- have the filename in cells
- use the formula =HYPERLINK(file, "Message")
- examples
  =HYPERLINK("E:\archiving\images\"&A2, "click here")
  =HYPERLINK(A1&A2, "click here")
  =HYPERLINK(A1&A2, A2)
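If the sheet is generated by a script, the HYPERLINK formulas can be generated too. This sketch builds the formula text for one file; the function name `hyperlink_formula` is hypothetical, and it assumes the spreadsheet program escapes a double quote inside a string by doubling it, as Excel and LibreOffice Calc do.

```python
def _quote(text):
    """Wrap text as a spreadsheet string literal, doubling any quotes."""
    return '"' + text.replace('"', '""') + '"'


def hyperlink_formula(base, filename, label=None):
    """Build =HYPERLINK(target, label) text for one file.

    `base` is the folder path including its trailing separator; when no
    label is given, the filename itself is shown as the link text.
    """
    shown = filename if label is None else label
    return "=HYPERLINK(" + _quote(base + filename) + ", " + _quote(shown) + ")"
```

For example, `hyperlink_formula("E:\\archiving\\images\\", "img001.jpg", "click here")` produces a formula equivalent to the first example on the slide, with the path written out instead of joined from a cell reference.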
32 My cells have multiple values!
- example: keywords
- this is probably OK, as keywords are atomic
- just consistently use a suitable delimiter
  - e.g. use comma - if data values cannot have commas
  - ELAR recommends double pipe “||”
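A delimiter only works if it is applied consistently and never appears inside a value. The sketch below shows one way to split and join multi-valued cells using the double pipe; the function names `split_values` and `join_values` are hypothetical.

```python
DELIMITER = "||"  # the double pipe recommended on the slide


def split_values(cell, delimiter=DELIMITER):
    """Split a multi-valued cell into clean values, dropping empty parts."""
    return [part.strip() for part in cell.split(delimiter) if part.strip()]


def join_values(values, delimiter=DELIMITER):
    """Join values into one cell, refusing values that contain the delimiter."""
    for value in values:
        if delimiter in value:
            raise ValueError(f"value contains the delimiter: {value!r}")
    return delimiter.join(values)
```

Checking for the delimiter at join time catches the one case that would silently corrupt the data: a value that already contains "||".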
33 My cells have multiple values!
- example: speakers in a recording
- speakers are probably not atomic – they have other attributes
- create a separate “speakers” sheet
- give each speaker an ID (number or initials)
- use the IDs in the original sheet, with delimiter (implements one to many)
- (advanced) or make another sheet to associate recordings with speakers (implements many to many)
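The two options on this slide can be illustrated with small in-memory tables. Everything in this sketch is invented for illustration: the speaker IDs, the attribute names, the filenames, and the functions `speakers_for` and `link_rows`.

```python
# "speakers" sheet: one row per speaker, keyed by ID (hypothetical data).
SPEAKERS = {
    "S1": {"name": "Speaker One", "role": "narrator"},
    "S2": {"name": "Speaker Two", "role": "interviewer"},
}

# Original sheet: each recording lists speaker IDs in one delimited cell.
RECORDINGS = [
    {"file": "rec001.wav", "speakers": "S1||S2"},
    {"file": "rec002.wav", "speakers": "S1"},
]


def speakers_for(recording, speakers_by_id, delimiter="||"):
    """Look up the full speaker rows for one recording (one to many)."""
    ids = [s.strip() for s in recording["speakers"].split(delimiter) if s.strip()]
    return [speakers_by_id[i] for i in ids]


def link_rows(recordings, delimiter="||"):
    """Expand delimited cells into (recording, speaker) pairs.

    This is the separate association sheet of the "advanced" option:
    one row per pairing, which implements many to many.
    """
    return [
        (r["file"], sid.strip())
        for r in recordings
        for sid in r["speakers"].split(delimiter)
        if sid.strip()
    ]
```

The association sheet is longer but easier to sort, filter and extend, for example with a column describing what each speaker does in each recording.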
34 Expressing “Relation” in spreadsheets
- one column is usually insufficient
- “relationship” has 2 parts
  - the target of the relationship
  - description of the relationship
- how would this work for images?
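One possible reading of the two-part relationship, applied to images, is a pair of columns: one naming the target and one describing the relation. The sketch below is only an illustration; the column names, the relation labels, the sample rows and the functions `relations` and `related_to` are all assumptions.

```python
from collections import namedtuple

Relation = namedtuple("Relation", ["source", "description", "target"])

# Images sheet with two relation columns (hypothetical rows and labels).
IMAGE_ROWS = [
    {"id": "IMG1", "filename": "img001.jpg",
     "relation": "illustrates", "target": "rec001.wav"},
    {"id": "IMG2", "filename": "img002.jpg",
     "relation": "scan of", "target": "notebook03"},
    {"id": "IMG3", "filename": "img003.jpg", "relation": "", "target": ""},
]


def relations(rows):
    """Return one Relation per row that names a target."""
    return [
        Relation(r["id"], r["relation"], r["target"])
        for r in rows
        if r.get("target")
    ]


def related_to(target, rows):
    """List the IDs of all rows related to `target`, whatever the relation."""
    return [rel.source for rel in relations(rows) if rel.target == target]
```

Keeping the description in its own column means the same target can be linked in different ways without inventing a new column for each kind of relation.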
35 How can I tell if it’s Unicode?
- use a browser or Notepad++
- paste text in
- examine the encoding (before and after)
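A script can make a similar check on a file's raw bytes. This is a rough sketch, not a full encoding detector: it looks only for a byte order mark and for whether the bytes are valid UTF-8, it does not distinguish UTF-32 from UTF-16, and legacy 8-bit encodings cannot be told apart this way. The function name `sniff_encoding` is hypothetical.

```python
import codecs


def sniff_encoding(data):
    """Make a rough guess at the encoding of `data` (a bytes object)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Probably a legacy 8-bit encoding; which one cannot be known here.
        return "unknown (not valid UTF-8)"
    # Pure ASCII is also valid UTF-8, so report it separately.
    return "ascii" if data.isascii() else "utf-8"
```

Read the file in binary mode before calling it, for example `sniff_encoding(open(path, "rb").read())`, so that Python does not decode the text first.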
36 Can I still use MS Word?
- ELAR no longer accepts MS Word files
- but Word is still useful
  - quicker to type up
  - useful tables, functions, macros etc.
- solutions
  - think “text only”
  - tables as spreadsheets (are they bad too?)
  - (advanced) complex materials formatted as styles, then export as marked up PDF/A – but not a perfect solution
37 End