A Standardized DigiTool Ingest Approach to Internet Archive Digitized Books Joseph Shubitowski IGeLU 2008, September 9, 2008.

Presentation transcript:

A Standardized DigiTool Ingest Approach to Internet Archive Digitized Books Joseph Shubitowski IGeLU 2008, September 9, 2008

Talking Points:
Scope / Background
Why?
Major hurdles
Manual / automated workflows
Outcomes
What can we share?
  – Results
  – Methodologies
  – Tools, etc.
IGeLU Conference 2008, September 9, 2008

Alfred P. Sloan Foundation
Getty Research Institute: Archaeology and antiquities
Boston Public Library: John Adams collection
Johns Hopkins: Anti-slavery materials
The Metropolitan Museum of Art: Museum publications
Bancroft Library: Gold Rush and westward expansion

Scope of the Digitization Project
2,000,000 pages or approx. 5,000 books
Self-evident collection
Public domain
  – pre-1923 for works published in U.S.
  – pre-1909 for works published outside of U.S.
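
The public-domain cutoffs above can be written as a small selection check. This is only a sketch: the function name and arguments are hypothetical, and "pre-1923" is read here as "published before 1923".

```python
def is_in_scope(pub_year: int, published_in_us: bool) -> bool:
    """Return True if a work falls inside the slide's public-domain cutoffs.

    Assumption: "pre-1923" / "pre-1909" mean strictly earlier than that year.
    """
    cutoff = 1923 if published_in_us else 1909
    return pub_year < cutoff


if __name__ == "__main__":
    print(is_in_scope(1910, published_in_us=True))
    print(is_in_scope(1910, published_in_us=False))
```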

Internet Archive Scribe Station
1 Pod = 10 Scribe Stations

Why Do It?
Internet Archive issues
  – Response/search time
  – Metadata only searching
  – No control
Full-text searching
Use in metasearch
More control!

Major Hurdles
Getting the files
Disk space issues, for general storage and for DTL
What/how to process all the files
Abbyy OCR vs. ALTO OCR
Thumbnail generation
Handle configuration/synchronization

List of OCRd books received from Internet Archive
Processed by GRI
URLs from Internet Archive
Link to URLs
Download files from Internet Archive
Zipped or tar files:
  – *_orig_jp2
  – *_jp2
  – *_raw_jp2
  – high & low resolution PDFs
  – *abbyy.gz
  – *meta.xml
  – *marc.xml
Process downloaded files
Ready for DigiTool Ingest
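
A minimal sketch of how the downloaded files might be sorted by type before processing. The suffix patterns come from the list above; the function name, the kind labels, and the example filenames are hypothetical and are not the GRI scripts.

```python
from pathlib import PurePosixPath

# Order matters: "_orig_jp2" and "_raw_jp2" must be tested before the
# more general "_jp2" marker, which would otherwise match them too.
KINDS = [
    ("_orig_jp2", "orig_jp2"),
    ("_raw_jp2", "raw_jp2"),
    ("_jp2", "jp2"),
    ("abbyy.gz", "abbyy"),
    ("meta.xml", "meta"),
    ("marc.xml", "marc"),
]


def classify(filename: str) -> str:
    """Map a downloaded file name to one of the file kinds on the slide."""
    name = PurePosixPath(filename).name
    for marker, kind in KINDS:
        if marker in name:
            return kind
    if name.lower().endswith(".pdf"):
        return "pdf"
    return "other"


if __name__ == "__main__":
    for f in ["book1_orig_jp2.tar", "book1_jp2.zip", "book1_abbyy.gz", "book1.pdf"]:
        print(f, "->", classify(f))
```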

Disk Space Issues
Each digitized book = 500 MB to 1.5 GB of raw files
Further untarring and processing consume even more disk!
DTL scratch/processing space, permanent storage space, and Oracle tablespace (including full-text indexing) consume even more disk space
3000 books in the queue will require TB for this project alone.
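
As a rough worked example, the per-book figure can be scaled to the 3000-book queue. This covers the raw downloaded files only; it does not include untarring, DTL scratch space, permanent storage, or Oracle tablespace, so it is not the project total the slide refers to. The function name is hypothetical and 1 TB is assumed to be 1000 GB.

```python
def raw_storage_range_tb(n_books: int, min_gb: float = 0.5, max_gb: float = 1.5):
    """Return the (low, high) raw-file storage estimate in TB.

    Defaults use the slide's 500 MB to 1.5 GB per book; 1 TB = 1000 GB is assumed.
    """
    return (n_books * min_gb / 1000, n_books * max_gb / 1000)


if __name__ == "__main__":
    low, high = raw_storage_range_tb(3000)
    print(f"Raw files only: {low:.1f} to {high:.1f} TB")
```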

DTL ingest package =
  – Archive = raw jpeg2000 (renamed to *.j2k)
  – View = use copy jpeg2000 (*.jp2)
  – Index = ALTO files
  – Thumbnail = appropriate thumb of title page for display of the complex object
  – PDF = high res PDF as additional manifestation
  – MARCXML record for IE level metadata
No TIF files from IA: everything is jpeg2000
Mapping file same for every ingest
CSV file is produced automatically
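
The automatically produced CSV could look something like the sketch below. The slides do not show the actual CSV layout or the mapping file, so the column names, usage labels, and file-naming scheme here are all placeholders chosen for illustration.

```python
import csv
import io


def build_rows(book_id: str, n_pages: int):
    """List (filename, usage) pairs for one book's ingest package.

    File names and usage labels are hypothetical, not DigiTool's own.
    """
    rows = []
    for page in range(1, n_pages + 1):
        stem = f"{book_id}_{page:04d}"
        rows.append((f"{stem}.j2k", "ARCHIVE"))  # raw jpeg2000, renamed
        rows.append((f"{stem}.jp2", "VIEW"))     # use-copy jpeg2000
        rows.append((f"{stem}.xml", "INDEX"))    # per-page ALTO file
    rows.append((f"{book_id}_thumb.jpg", "THUMBNAIL"))
    rows.append((f"{book_id}.pdf", "PDF"))
    rows.append((f"{book_id}_marc.xml", "METADATA"))
    return rows


def to_csv(rows) -> str:
    """Serialize the rows with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(("filename", "usage_type"))
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(build_rows("book1", 2)))
```

Because the mapping file is the same for every ingest, only this per-book CSV would need to change from one book to the next.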

Abbyy to ALTO
IA scanning produces one huge OCR file in Abbyy proprietary XML
Discussions with / proposal from CCS
Real need for an open source approach
  – Abbyy XSD can morph in future
  – Desire to share
Contract with Ex Libris to produce tool
  – Java based
  – Includes jar and class files
  – Free to share and redistribute
Tool transforms a single Abbyy file into one ALTO XML file per page
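
To illustrate the one-file-to-many-pages idea only, the sketch below splits a single Abbyy-style XML document into per-page lists of text lines. It is not the Ex Libris Java tool and it does not emit ALTO. The element names (page, line, charParams) are assumed from the Abbyy FineReader XML format and are matched by local name so that a schema namespace change does not break the split.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def split_pages(abbyy_xml: str):
    """Return one list of text lines per <page> element in the document."""
    root = ET.fromstring(abbyy_xml)
    pages = []
    for elem in root.iter():
        if local(elem.tag) != "page":
            continue
        lines = []
        for line in elem.iter():
            if local(line.tag) != "line":
                continue
            # Each charParams element is assumed to carry one character.
            text = "".join(
                c.text or "" for c in line.iter() if local(c.tag) == "charParams"
            )
            lines.append(text)
        pages.append(lines)
    return pages


def page_filename(book_id: str, page_number: int) -> str:
    """Hypothetical per-page output name for the split files."""
    return f"{book_id}_{page_number:04d}.xml"
```

A real transform would write each page out as an ALTO document under a name like the one returned by `page_filename`; that step is left out here.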

Thumbnail Creation
Initial ingest flow created complex object thumbnail from first page of PDF manifestation
Boring!
Ghostscript/PDF/ImageMagick problems
Decision to go semi-manual with script/cgi that:
  – creates thumbnails for first 15 jpeg2000 page images
  – sends URL in for each separate ingest
  – creates web page for page image viewing and thumbnail selection
  – adds chosen thumbnail to staging directory, cleans up, and sends confirmation
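
The first step of the semi-manual flow, picking the first 15 page images as thumbnail candidates, might be sketched as follows. The function name and the output naming are hypothetical, the actual image conversion is left out, and plain sorting assumes zero-padded page numbers in the file names.

```python
def thumbnail_candidates(filenames, limit: int = 15):
    """Pair the first `limit` jpeg2000 page images with thumbnail names.

    Assumes page files sort correctly by name (zero-padded page numbers).
    """
    pages = sorted(f for f in filenames if f.lower().endswith(".jp2"))
    return [(p, p[:-4] + "_thumb.jpg") for p in pages[:limit]]


if __name__ == "__main__":
    files = [f"book1_{i:04d}.jp2" for i in range(1, 301)] + ["book1.pdf"]
    for source, thumb in thumbnail_candidates(files):
        print(source, "->", thumb)
```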

Handle Generation
Setup per DTL docs
Firewall tweaks
Ingest flow tweaks
  – Handle for IE
  – Handles for all archive jpeg2000 images
DTL errors with mass publication of Handles
  – Fixed in SP21

Ingest Summary
Get/process/stage files
Generate ALTO OCR files
Web CGI for thumbnail selection
Load.sh script moves all files to locations DTL expects
Activate saved Ingest Flow from DTL Web Ingest client
Wait

Outstanding Issues
Ingest speed
  – Remedied somewhat in SP21
  – Digitized books are just darn big!
  – Low number of ingests per day
Handles
  – Manual publishing process
  – Need to populate Voyager bib record
METS viewer performance issues

Success Factors !!
Code to share
  – Get/process/staging scripts
  – Abbyy/ALTO transform code
  – Web cgi thumbnail code
  – YMMV
Handles provide true persistent IDs
Full-text multilingual searching
  – MetaLib QuickSet for metasearch of all local repositories

Demo and Thanks