DocLing 2016 David Nathan ELAR and Digital Archiving for Documentation of Endangered Languages.

Presentation transcript:

DocLing 2016 David Nathan ELAR and Digital Archiving for Documentation of Endangered Languages

4 What is different about ELAR?  implementation of protocol-driven access management  first archive to use social-networking principles  a platform for building and supporting relationships between data providers and data users  “level playing field” between researchers and community members/others  encourage, recognise and cater for diversity

5 OAIS model  OAIS archives define three types of ‘packages’: ingestion, archive, dissemination  [diagram: Producers → Ingestion → Archive → Dissemination → Designated communities]

6 ELAR - architecture  reduced boundaries between depositors, users and archive:  users add, update content; negotiate access  [diagram: Producers & Users ↔ Archive, with arrows labelled request, give access, contribute, edit]

7 Redefining the digital EL archive  a platform for developing and conducting relationships between knowledge producers and knowledge users – a social networking archive  level the playing field between researchers and community members/other stakeholders  encourage, recognise and cater for diversity

8 Data management and archiving  use good data management practices whether or not you plan to archive materials  document decisions, steps, conventions, structures, encodings  appropriate and conventional data encoding methods (e.g. Unicode)  be explicit and consistent  plan for flowing data, working with others, across different systems (cf Bird and Simons, ‘Seven Dimensions of Portability’)  good data management practices will make a future archiving process easier and better

9 Users and potential users  depositors – deposit, access or update materials  speakers and their descendants  other researchers - comparative/historical linguists, typologists, theoreticians, anthropologists, historians, musicologists etc etc  other “stakeholders”, eg educationalists, funders  journalists and the wider public

10 ELAR facts and figures  archived collections: ~200  online (published) collections: 150  average collection size about 80 GB  online data bundles: ~25,000  online bundles access: unrestricted 10,000, restricted 15,000  total number of files held: around 200,000  total volume of files held: around 10 TB  registered users: ~800  annual number of website "hits": 230,000

11 ELAR facts and figures – users  increasing number of community members, including Aleut (Canada), Tai-Ahom, Wadar (India), Burushaski (Pakistan), Serrano, Cahuilla, Arapaho (USA), Iraqi Jewish (Iraq), Saami (Finland), Wabena (Tanzania), Torwali (Pakistan), Hani, Bai (China), Irish  comments: “I found your site while looking up my grandmother, and i found her on your site speaking our language. and i would love for my children her great grandchildren to hear our language coming from her".  many interdisciplinary researchers, particularly archivists and anthropologists

12 Our task  … to preserve and disseminate documentation of endangered languages

13 Why is this important?  over 50% of the world’s 7000 languages:  are endangered  likely to cease to be spoken this century  little or nothing known about the majority of them  language documentations and the archives that support, preserve, and disseminate them, will become the means of transmission of many languages

14 A perfect storm?  documentation methods expose sensitivities & vulnerabilities  documentation performed by and for linguists and “others”  “big data” – resources channeled to analysis, broader audiences  “open data” – push for unmoderated access

15 Protocol  the sensitivities and access restrictions associated with EL resources  need to be discussed, collected and recorded in the field

16 Protocol and access control  principles:  granularity – file, bundle or collection  access is a relation between object and user  protocol values can be changed over time  ELAR’s URCS system  User  Researcher  Community member  Subscriber

17 ELAR’s protocol values  U – resource available to all registered users  R – resource available to users registered as researchers  C – resource available to users endorsed as members of relevant language community  S – resource available to users who have been given individual access rights for that resource
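
The U/R/C/S values can be read as a relation between a resource and a user. The following is a minimal sketch in Python of that reading, assuming a simplified data model; the names `User`, `Resource` and `can_access` are hypothetical and this is not ELAR's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    registered: bool = True
    researcher: bool = False
    # Languages for which the user is endorsed as a community member (assumed model).
    communities: set = field(default_factory=set)


@dataclass
class Resource:
    name: str
    protocol: str  # one of "U", "R", "C", "S"; may be changed over time
    language: str = ""
    # Names of users given individual access rights for this resource.
    subscribers: set = field(default_factory=set)


def can_access(user: User, res: Resource) -> bool:
    """Decide access for one (resource, user) pair."""
    if not user.registered:
        return False
    if res.protocol == "U":
        return True
    if res.protocol == "R":
        return user.researcher
    if res.protocol == "C":
        return res.language in user.communities
    if res.protocol == "S":
        return user.name in res.subscribers
    raise ValueError(f"unknown protocol value: {res.protocol!r}")
```

Because the protocol value is stored on the resource (which could equally be a file, a bundle or a collection), changing it later changes who passes the check without touching the user records.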

21 Subscription application: formal  User xx has just applied for access to restricted material in the deposit solega. The following message was attached to the application: "Hi [depositor], Please delegate me for access to the material on Solegas."

22 Subscription response: formal  This is to inform you that user xx's application for access to restricted material in the deposit musgrave2007tulehu has just been approved. The depositor included the following note to the user: "The researcher is known to me personally and I know that his interest is legitimate."

23 Subscription application: “curious”  User xx has just applied for access to restricted material in the deposit budd2008beirebo. The following message was attached to the application: "I'm xx. I like to learn Bislama language, but never heard what it sounds like. Am very curious "

24 Subscription application: establish credentials and reason  User xx has just applied for access to restricted material in the deposit verstraete2010paman. The following message was attached to the application: "I am currently doing my masters in Linguistics and I'm researching on an endangered language in Malaysia. I would like to see a sample of the data from the fieldwork since I'm not use to this yet. I hope that I can gain more understanding in carrying out the fieldwork."

25 Subscription response: rejected, with reason  This is to inform you that user xx's application for access to restricted material in the deposit verstraete2010paman has just been rejected. The depositor included the following note: "Dear xx, I am sorry we cannot give you access to this deposit. The Lamalama community has asked us to restrict access to community members. With best wishes, [depositor]"

26 Subscription response: offering further help  This is to inform you that user xx’s application for access to restricted material in the deposit caballero2009raramuri has just been approved. The depositor included the following note to the user: "Please let me know if you're looking for any specific materials or if you have any questions."

27 Response: further info and offer to meet  This is to inform you that user xx's application for access to restricted material in the deposit kunbarlang-389 has just been approved. The depositor included the following note to the user: "Hi xx I've approved your access to this collection, but you should know that there is an update in the material I've just deposited, with much more information on both music and texts. I'd be happy to give you access to that when it is processed. Next time I come to London (October or November this year) I'd be happy to meet up if you would like to discuss."

28 What can you archive (at ELAR)?  media - audio, video  graphics - images, scans  texts - fieldnotes, grammars, description, analysis  structured data - aligned and annotated transcriptions, databases, lexica  metadata, metadocumentation - contextual information about the materials, both structured and unstructured

29 Archive objects  an “object” could be a file, a set of files, a directory, or a set of files with their relationships explicitly defined  like other archives, ELAR uses a set principle, which we call “bundles” (like DoBeS’ sessions)

30 Archive objects  [diagram: ELAR → Collection → Bundle → File]

31 What is required to make a deposit?  resource(s) for an endangered language  it could be just one file  catalogue / metadata  deposit form  existing deposits can also be updated, added to, and metadata added/modified

32 Archive material should be selected  example: Depositor’s question: How much video can I archive?  answer:...

33 How can I deliver data?  hard disks  we return them  we also send them out  flash cards and USB sticks   good for samples for evaluation  OK for most text materials  Dropbox etc  a web upload facility may be provided one day  we can download from your server

34 What about CDs and DVDs?  we have found CDs, and especially DVDs, to be very unreliable  DVD fail rate > 10%  cause confusion as files are allocated to fit on disks, not according to corpus structure  create a lot of work for depositors and for ELAR

35 Express yourself - Metadata  metadata is  data about data containers  data about data  its functions: identification, management and retrieval of data  provides the context and understanding of that data  carries those understandings into the future, and to others

36 Express yourself - Metadata  metadata reflects the knowledge and practices of data providers  … and therefore defines and constrains audiences and usages for the data  all value-adding to recordings of events (annotations, transcriptions, translations, glosses, comments, interpretations, part of speech tagging etc) can be considered metadata  data and metadata lie on a spectrum and depend on how they are used rather than being absolutely different things

37 Express yourself - Metadata  distinguish between  metadata scheme (eg set of categories) and  the way that scheme is expressed

38 relational (filename: sessions.xls)
ID  audio         transcription
1   TRS00065.wav  bjt_02.txt
2   TRS00066.wav  krs_43.txt
tagged (filename: sessions.xml): TRS00065.wav bjt_02.txt; TRS00066.wav krs_43.txt

39 Express yourself - Metadata  example  you could choose categories from OLAC, IMDI etc schemes or formulate your own  this would be a scheme of logical categories (speaker, location, date etc)  you could express these in different language(s)  you could structure the categories and values in different ways, eg as spreadsheet, database, XML
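
To illustrate the difference between a scheme and its expression, here is a minimal Python sketch that writes the same two session records once as a relational table (CSV) and once as tagged XML. The file values come from the sessions example; the column names, element names and function names are assumptions made for the illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# One logical scheme (id, audio, transcription), two records.
sessions = [
    {"id": "1", "audio": "TRS00065.wav", "transcription": "bjt_02.txt"},
    {"id": "2", "audio": "TRS00066.wav", "transcription": "krs_43.txt"},
]


def as_csv(rows):
    """Relational expression: one row per session, one column per category."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["id", "audio", "transcription"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def as_xml(rows):
    """Tagged expression: one <session> element per session."""
    root = ET.Element("sessions")
    for row in rows:
        session = ET.SubElement(root, "session", id=row["id"])
        ET.SubElement(session, "audio").text = row["audio"]
        ET.SubElement(session, "transcription").text = row["transcription"]
    return ET.tostring(root, encoding="unicode")


print(as_csv(sessions))
print(as_xml(sessions))
```

Both outputs carry the same categories and values; only the encoding differs, which is why the choice of scheme and the choice of expression can be made separately.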

40 Express yourself - Metadata  you need to choose  a set of metadata categories applying across whole collection +  metadata categories that apply to particular types of objects (eg transcriptions, video), or to individual objects +  ways of expressing and encoding all that metadata

43 Example  Ju|’hoan (Biesele)

44 Potential sources of metadata  deposit form  spreadsheets  MS Word tables, CSV etc  IMDI and OLAC XML files  custom XML  notes, correspondence and reports  filenames  direct input to ELAR interface  audio files  images (/captions)  meta-metadata files

45 A survey  we collected information from about 50 ELAR deposits

46 About 80% of most frequently occurring categories can be mapped to OLAC ("-" = no OLAC mapping given)
count  term              OLAC
20     language          Subject.language
17     date              Date
17     description       Description
16     id                Identifier
16     speaker           Contributor
16     title             Title
15     format            Format
13     type              Type
12     creator           Creator
12     file name         Identifier
12     notes             -
11     rights            Rights
10     duration          Coverage
9      content           Description
9      contributor       Contributor
9      name              Contributor
9      relation          Relation
8      age               -
8      comment           -
8      genre             Type.linguistic
8      subject.language  Subject.language
7      date recorded     Date
7      document 1        -
7      gender            -
7      place             Coverage
6      directory         Identifier
5      location          Coverage
5      rec_date          Date
5      recorder          Contributor
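
The survey mapping can also be kept as data, which makes the "about 80%" figure checkable. This is a minimal Python sketch: the counts, terms and OLAC targets are transcribed from the slide, while the names `SURVEY`, `olac_for` and `mapped_share` are illustrative assumptions.

```python
# (number of deposits using the term, depositor's term, OLAC element or None)
SURVEY = [
    (20, "language", "Subject.language"), (17, "date", "Date"),
    (17, "description", "Description"), (16, "id", "Identifier"),
    (16, "speaker", "Contributor"), (16, "title", "Title"),
    (15, "format", "Format"), (13, "type", "Type"),
    (12, "creator", "Creator"), (12, "file name", "Identifier"),
    (12, "notes", None), (11, "rights", "Rights"),
    (10, "duration", "Coverage"), (9, "content", "Description"),
    (9, "contributor", "Contributor"), (9, "name", "Contributor"),
    (9, "relation", "Relation"), (8, "age", None),
    (8, "comment", None), (8, "genre", "Type.linguistic"),
    (8, "subject.language", "Subject.language"), (7, "date recorded", "Date"),
    (7, "document 1", None), (7, "gender", None),
    (7, "place", "Coverage"), (6, "directory", "Identifier"),
    (5, "location", "Coverage"), (5, "rec_date", "Date"),
    (5, "recorder", "Contributor"),
]


def olac_for(term):
    """Return the OLAC element for a depositor term, or None if unmapped or unknown."""
    for _count, name, olac in SURVEY:
        if name == term:
            return olac
    return None


def mapped_share(survey):
    """Fraction of distinct terms that have an OLAC mapping."""
    mapped = [row for row in survey if row[2] is not None]
    return len(mapped) / len(survey)


print(f"{mapped_share(SURVEY):.0%} of listed terms map to OLAC")
```

The share here is computed over distinct terms; weighting each term by how many deposits use it would be a different, equally defensible measure.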

47 Depositors also add categories such as:  detailed locations  metadata in Spanish  indigenous genres and titles (eg of songs)  parents’ and spouse’s mother tongues, birthplaces  number of children, their language competence  L2, L3 and competencies  languages heard  clan/moiety  occupation  education level

48 … more metadata:  date left home country  photos (/captions) of consultants, field sessions etc  equipment  microphone  workflow status  naming and organisational codes and principles  recorder/linguist experience level  biography and project description (“meta- documentation”)

49 What is the distribution?

50 [chart: term frequency against number of terms]

52 A visualisation

56 Discussion and conclusions  for endangered language documentation, the metadata framework is to be discovered, not predefined (cf Jeff Wallman, TBRC)

57 MD and resource discovery  “discovery” is not neutral:  what is emphasized/distilled?  who gains?  who does the work?  MD is also about the distribution of labor and resources

58 MD and users  MD is more responsible for the form, presentation, and usage of documentation than generally acknowledged  MD should be equally accessible to and relevant for community members – it may even be more relevant to them than any “linguistic” data

59 Common metadata standards  OLAC: Open Language Archives Community: Title, Identifier, Creator, Contributor, Language, Subject.language, Date, Description, Format, Type, Rights, Coverage, Relation  IMDI: ISLE Metadata Initiative: more categories, software specific  ELAR: for endangered language documentation, metadata framework is to be discovered, not predefined

60 Types of metadata  people metadata – creator’s / participants’ details  descriptive metadata – content of data  administrative metadata – eg. who did what when, relationships between objects, IPR and permissions  structural metadata – how collection and its objects are organised, associated, formatted  preservation metadata – character encoding, file format  access and usage protocols

61 Examples  example – XLS  example – XML  example – key  example – key XML  example – summary and requests  example – notes

62 Meta-documentation  Nathan (2010): “think of metadata as meta- documentation, the documentation of your data itself, and the conditions (linguistic, social, physical, technical, historical, biographical) under which it was produced. Such meta-documentation should be as rich and appropriate as the documentary materials themselves.”

63 Meta-documentation  identity of stakeholders involved, and their roles  attitudes of language consultants, towards their languages and towards the documenter and documentation project  relationships with consultants and community (Good 2010 mentions what he called ‘the 4 Cs’: ‘contact, consent, compensation, culture’);  goals and methodology of researcher, including research methods and tools, corpus theorisation (Woodbury 2011), theoretical assumptions behind annotation, potential for revitalisation

64 Meta-documentation  project and researcher biography: knowledge and experience of the researcher and consultants (eg. researcher’s knowledge at beginning of project, what training researcher and consultants received)  for funded projects: grant application, reports, communications  agreements entered into – formal or informal (eg. Memorandum of Understanding, compensation arrangements), and promises made to stakeholders  relationships between this and other projects

65 Formats/encoding  format choices at these levels:  representation of information  representation of characters  how characters are assembled into files (file formats)

66 Characters  use UTF-8 (an encoding of Unicode, ISO 10646)  be aware of using characters outside ASCII (common US keyboard characters) – these can break if UTF-8 is not used  distinguish character encoding and fonts (a font is simply a set of images for a “character set”)  something may be coded perfectly in UTF-8 but there is no suitable font applied  some fonts may display special characters correctly but this does not mean that encoding is correct
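
A short Python sketch of why non-ASCII characters "break" when UTF-8 is not used consistently. The sample string and the helper name `is_ascii_safe` are illustrative assumptions; the encode and decode calls are standard Python.

```python
text = "schwa: ə, eng: ŋ"  # contains two characters outside ASCII


def is_ascii_safe(s):
    """True if every character in s is plain ASCII."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# Writing as UTF-8 and reading back as UTF-8 preserves the text.
utf8 = text.encode("utf-8")
roundtrip = utf8.decode("utf-8")

# Reading the same bytes with a different encoding silently garbles them.
garbled = utf8.decode("latin-1")

print(is_ascii_safe(text))   # False: ə and ŋ are not ASCII
print(roundtrip == text)     # True
print(garbled == text)       # False
```

Note that the garbled case raises no error, which is why encoding problems often go unnoticed until someone else opens the file. Whether the characters then display correctly is a separate question of fonts, as the slide says.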

67 File formats  audio  WAV  (what if original is not WAV??)  resolution: 16 bit, 44.1 kHz, stereo or better  video  changing frequently  MPEG4 or MTS/H264/AVCHD  aspect, resolution: depends on project  get advice from archive before depositing
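
Audio properties can be checked before delivery. The sketch below uses Python's standard `wave` module to read bit depth and sample rate from a WAV file and compare them with the 16 bit / 44.1 kHz guideline. The function names and the in-memory test file are assumptions for the example; channel count is reported but not enforced, since how strictly "stereo" applies is a question for the archive.

```python
import io
import wave


def wav_properties(fileobj):
    """Read channel count, bit depth and sample rate from a WAV file or file object."""
    with wave.open(fileobj, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "bits": w.getsampwidth() * 8,
            "rate_hz": w.getframerate(),
        }


def meets_guideline(props, min_bits=16, min_rate_hz=44100):
    """Check bit depth and sample rate against the minimums."""
    return props["bits"] >= min_bits and props["rate_hz"] >= min_rate_hz


def make_silent_wav(bits, rate_hz, channels=2, seconds=0.01):
    """Build a tiny silent WAV in memory so the example needs no real recording."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(bits // 8)
        w.setframerate(rate_hz)
        n_frames = int(rate_hz * seconds)
        w.writeframes(b"\x00" * (bits // 8) * channels * n_frames)
    buf.seek(0)
    return buf


props = wav_properties(make_silent_wav(16, 44100))
print(props, meets_guideline(props))
```

For real files, `wav_properties("recording.wav")` accepts a path in place of the in-memory object; the filename here is a placeholder.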

68 File formats  images  TIFF **OR** original from device  resolution: archive quality is 300dpi or better

69 File formats  text  best is plain text  PDF/A often acceptable, may pose problem  if MS-Word or ODF, check with archive  structured data (spreadsheets, databases)  original format should be supplied  provide a preservable derivative as well (eg csv, PDF)  common linguistic software (ELAN, Transcriber, Toolbox, Praat etc)  their file formats are generally preservable

70 Can I still use MS Word?  ELAR no longer accepts MS Word files  but Word is still useful  quicker to type up  useful tables, functions, macros etc  solutions  think “text only”  tables as spreadsheets (are they bad too?)  (advanced) complex materials formatted as styles, then export as marked up  PDF/A – but not a perfect solution

71 My cells have multiple values!  example: keywords  this is probably OK, as keywords are atomic  just consistently use a suitable delimiter  e.g. use comma - if data values cannot have commas  ELAR recommends double pipe “||”
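
A minimal Python sketch of the delimiter convention for atomic values such as keywords. The double pipe comes from the slide; the function names and sample keywords are assumptions.

```python
DELIM = "||"


def split_cell(cell):
    """Split a multi-value cell into a list, trimming spaces and dropping empties."""
    return [value.strip() for value in cell.split(DELIM) if value.strip()]


def join_values(values):
    """Join values into one cell, refusing values that contain the delimiter."""
    for value in values:
        if DELIM in value:
            raise ValueError(f"value contains the delimiter: {value!r}")
    return DELIM.join(values)


print(split_cell("song || narrative||greeting"))
print(join_values(["hunting, fishing", "kinship"]))
```

The second call shows the advantage over a comma: a value may itself contain commas without being split.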

72 My cells have multiple values!  example: speakers in a recording  speakers are probably not atomic – they have other attributes  create a separate “speakers” sheet  give each speaker an ID (number or initials)  use the IDs in the original sheet, with delimiter (implements one to many)  (advanced) or make another sheet to associate recordings with speakers (implements many to many)
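
The two-sheet approach can be sketched in Python as follows. All speaker IDs, names, attributes and filenames are invented for the illustration, and the "sheets" are plain dictionaries and lists standing in for spreadsheet tabs.

```python
# "speakers" sheet: one row per speaker, keyed by ID, with further attributes.
speakers = {
    "S1": {"name": "Speaker One", "gender": "f"},
    "S2": {"name": "Speaker Two", "gender": "m"},
}

# Original sheet: each recording lists speaker IDs, separated by a delimiter.
recordings = [
    {"file": "rec01.wav", "speakers": "S1||S2"},
    {"file": "rec02.wav", "speakers": "S2"},
]


def speakers_for(recording):
    """Look up the full speaker rows for one recording."""
    return [speakers[sid] for sid in recording["speakers"].split("||")]


# (advanced) a separate link sheet: one row per recording/speaker pair.
links = [
    (rec["file"], sid)
    for rec in recordings
    for sid in rec["speakers"].split("||")
]


def recordings_for(speaker_id):
    """Query the link sheet in the other direction."""
    return [file for file, sid in links if sid == speaker_id]


print([s["name"] for s in speakers_for(recordings[0])])
print(recordings_for("S2"))
```

Speaker attributes are stored once, so a correction to a speaker's details is made in one place rather than in every recording row.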

73 Standards  we have already mentioned some standards – UTF-8, WAV etc  there are other relevant standards, eg  ISO (language/dialect names)  metadata systems  you can also establish project-local standards, eg  to handle special characters (eg \e = schwa)  data field names  document them! – for your usage and for correspondence to wider standards
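
A documented project-local convention can be turned into a conversion table. In this Python sketch only the `\e` = schwa code comes from the slide; the second code, the sample word and the names `LOCAL_CODES` and `to_unicode` are hypothetical.

```python
# Project-local codes and the Unicode characters they stand for.
LOCAL_CODES = {
    "\\e": "\u0259",  # \e = schwa (ə), from the slide
    "\\N": "\u014b",  # \N = eng (ŋ), hypothetical extra code
}


def to_unicode(text, codes=LOCAL_CODES):
    """Replace each local code with its Unicode character.

    Longer codes are replaced first so that one code cannot be
    mistaken for the beginning of another.
    """
    for code in sorted(codes, key=len, reverse=True):
        text = text.replace(code, codes[code])
    return text


print(to_unicode("k\\et\\eN"))
print(to_unicode("ta\\Na"))
```

Keeping the table in a file alongside the data serves as the documentation the slide asks for, and gives later users a direct route from the local codes to a wider standard.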

76 THANK YOU! David