Crowdsourced Manuscript Transcription Ben Brumfield Roots and Routes 2012
Not just crowdsourcing... Collaborative work Off-site solo work Private work
Not just manuscripts... Maps Textiles Music Flawed OCR
Not just transcription... Indexing Editing Identification Counting seals on Arctic ice caps.
What it isn't We'll concentrate on web-based tools for extracting text from images, not addressing: Oral History Video Audio Transcription Image Manipulation Transcription/Facsimile Display Tools exist for these tasks too, but they are out of scope here.
Break What materials are you working with outside of modern, printed books and websites?
Origins (Approaches) Two Approaches and one Dead End Indexing Editing Tagging
Indexing Structured Data Extracts from Text vs. Representing Text Databases for Search and Analysis Granular Quality Control Gamification
Editing Books, Diaries, Letters, Articles Representing Text Traditional Editorial Workflow Digital or Print Editions
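To make the contrast between the two approaches concrete, here is a minimal Python sketch. The sample menu line, the field names (`dish`, `price`), and both helper functions are hypothetical illustrations, not taken from any of the tools discussed here. Indexing pulls a structured record out of the text; editing keeps a representation of the text itself.

```python
import re
from dataclasses import dataclass

# Hypothetical line as it might appear on a menu page image (an assumption,
# not drawn from any real collection).
SOURCE_LINE = "Consomme in cup  .25"


@dataclass
class IndexRecord:
    """Indexing: a structured extract from the text, ready for a database."""
    dish: str
    price: float


def index_line(line: str) -> IndexRecord:
    """Split a line into a dish name and a trailing price."""
    match = re.fullmatch(r"(.+?)\s+(\d*\.\d{2})", line.strip())
    if match is None:
        raise ValueError(f"no price found in {line!r}")
    return IndexRecord(dish=match.group(1), price=float(match.group(2)))


def edit_line(line: str) -> str:
    """Editing: keep the text as written, trimming only trailing whitespace."""
    return line.rstrip()
```

The indexed record can be sorted, searched, and checked field by field, but it drops the original spacing and layout. The edited line keeps them, at the cost of being harder to analyze as data.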
Tagging Too small Too imprecise
Origins (Traditions) OCR Correction Documentary Editing Genealogy Natural Science Astronomy
Online Tools Recent (none older than 2005) Influenced by origin Still pretty raw Most require tech expertise for set-up and customization All require making trade-offs
Lab Session 1: Breadth NYPL What's on the Menu Indexing Wikisource Editing
Selection Factors Source Material Transcript Purpose Organizational/Project Management Fit Financial and Technical Resources
Source Material Evaluating your source material: Is it of interest to anyone else? Is it under copyright? Does it need restricted access? Is it composed of documents or records? Is it non-textual? How complex is the layout? How important is that layout?
Purpose How will you be using the transcribed data? Traditional print editions Searchable online editions Do you want to use the system to analyze the text? How do you want to analyze the text? Is public engagement a goal? Should the transcripts be open?
Organizational/Project Management Fit How important is traditional editorial workflow? Will you rely on volunteers? How will you motivate them? What is the duration of the project? Is there a "final version"? Is TEI a mandate?
Financial and Technical Resources Do you have or need: System administrators to install non-hosted software? Money to pay hosting costs? Programming skills to customize a tool? Money to pay programmers for customization? Support for on-going costs to keep the site running, however small?
Lab Session 2: Markup Options FromThePage TranscribeBentham
Technical Questions to Answer Where are the images now? How do images get into the system? How do transcripts get out of the system? How mature is the underlying technology? How configurable is the technology? How does the system work with the public face of your project? Where does the metadata live? Who will maintain this? How long? How many sites are using this system?
Wikisource Pro: MediaWiki plus its add-on modules (e.g. print-on-demand, export). Wikimedia community. Incredibly mature. Con: Wikimedia policy. Public editing. Limited mark-up.
Bentham Transcription Desk Pro: MediaWiki is very mature. TEI Toolbar (can also be used on other systems). Deployed outside original project. Con: Development efforts halted.
Scripto Pro: Team at CHNM has a great track record. Your CMS is your public face. MediaWiki is very mature. Deployed and under active development. Con: Your CMS handles all metadata. Mark-up is extremely limited.
FromThePage Pro: Designed for intensive editing and indexing. Semantic mark-up and analysis. Hosting available. Con: Single developer (me). No TEI mark-up.
Islandora TEI Editor Caveat: I don't know much about this tool or this team. Based on Drupal and Fedora. Supports TEI via friendly interface. Many Drupal-based projects considering it.
T-PEN Caveat: I don't know much about this tool. Designed for medieval manuscripts. Supports TEI natively. Line-by-line interface. Hosted version available.
Scribe Pro: Excellent for complex layout or non-documentary transcription. Zooniverse team is large, well-funded, experienced. Configurable. Con: No automated tool for loading images or viewing transcript database (yet!). No concept of image-as-a-text.
Pybossa Caveat: I don't know much about this tool or this team. Open Knowledge Foundation's crowdsourcing task management tool. Designed for tabular data. Google Spreadsheet data entry. Extremely young.
TextLab Caveat: I don't know much about this tool or this team. Melville Electronic Library. Direct addition of TEI tags to image.
Lab Session 3: Configuration Scribe Old Weather, What's the Score, Development deployments
Find me Ben Brumfield