UNT Libraries TRAIL Processing Mark Phillips April 26, 2016
There are currently two processes for digitizing technical reports with TRAIL
The bulk of content goes to the University of Michigan for scanning by Google and inclusion in HathiTrust
There are some formats that are not sent to UMich
Reports with foldouts
Reports with Maps
Reports with other random parts
Microforms
Microfiche
Microcard
Maps
These items are scanned at the UNT Libraries in the Digital Projects Unit
The workflow
UNT receives boxes of new items from Arizona for scanning.
These arrive at the DPU and are processed by Lee Fulton and his students.
All reports come to UNT with a unique identifier assigned to them. metadc303203
We remove the binding for the items that have been donated for destructive scanning
Items loaned that can't be cut are set aside in a different workflow
Lee and his students scan all of the pages of an item and all foldouts and oddly shaped pages
600 DPI bitonal, 400 DPI grayscale, 400 DPI color
All uncompressed TIFF files
They align the pagination with the sequence of files
0001.tif = Front Cover
0002.tif = Front Inside Cover
0003.tif = blank
0004.tif = blank
0005.tif = title page
0006.tif = blank
0007.tif = Page 1
0008.tif = Page 2
This is done so you can “jump to page 4, not image sequence 4”
000100fc.tif = Front Cover
000200fi.tif = Front Inside Cover
00030000.tif = blank
00040000.tif = blank
000500tp.tif = title page
00060000.tif = blank
00070001.tif = Page 1
00080002.tif = Page 2
We use a local naming convention called “magicknumbers” for this, which also helps with the QC of the items.
Each report is verified to have all of the pages scanned.
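As a rough illustration of how the magicknumbers names could support this check, here is a minimal Python sketch. It assumes the first four characters are the image sequence and the last four are the page designation, and it only knows the codes that appear in the example filenames (fc, fi, tp, 0000); the real convention may define more. The function names are hypothetical and are not part of any UNT tool.

```python
import re

# Page codes seen in the example filenames only (assumption: the
# actual convention may include additional codes).
PAGE_CODES = {
    "00fc": "Front Cover",
    "00fi": "Front Inside Cover",
    "00tp": "title page",
    "0000": "blank",
}

# Four sequence digits, then either four page digits or 00 + a two-letter code.
MAGICK = re.compile(r"^(\d{4})(\d{4}|00[a-z]{2})\.tif$")


def parse_magicknumber(filename):
    """Split a name like '00070001.tif' into (sequence, label)."""
    match = MAGICK.match(filename)
    if match is None:
        raise ValueError(f"not a magicknumber filename: {filename}")
    sequence = int(match.group(1))
    code = match.group(2)
    if code in PAGE_CODES:
        label = PAGE_CODES[code]
    elif code.isdigit():
        label = f"Page {int(code)}"
    else:
        label = f"unknown code {code}"
    return sequence, label


def missing_sequences(filenames):
    """Return sequence numbers absent between 1 and the highest one seen."""
    seen = {parse_magicknumber(name)[0] for name in filenames}
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]
```

Calling missing_sequences on the list of files in a report's image folder returns an empty list when no image sequence numbers are skipped.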
A descriptive metadata record using the UNTL metadata schema is created partly from the supplied MARC record from Arizona and augmented with additional information.
Each TIFF image is processed with Optical Character Recognition (OCR) software: PrimeOCR by Prime Recognition.
The RAW output is used so we have the coordinates of the words for highlighting later
A searchable PDF is created for each page along with the OCR output.
These PDFs are combined into a single PDF document for the report.
A finished report looks like this on disk.
metadc303234
  01_tif
    000100fc.tif
    000200fi.tif
    00030000.tif
    00040000.tif
    ...
  02_pdf
    metadc303234.pdf
  metadata.xml
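A folder like this could be sanity-checked before ingest with a short script. The sketch below is only an illustration: it assumes the PDF is named after the report folder and that metadata.xml sits at the top level of the folder, and check_report_layout is a hypothetical helper, not an existing UNT tool.

```python
from pathlib import Path


def check_report_layout(report_dir):
    """Return a list of problems found in one finished report folder."""
    root = Path(report_dir)
    problems = []

    # Scanned page images are expected under 01_tif.
    tif_dir = root / "01_tif"
    if not tif_dir.is_dir() or not any(tif_dir.glob("*.tif")):
        problems.append("01_tif is missing or has no .tif files")

    # Assumption: the combined PDF carries the folder's identifier.
    pdf = root / "02_pdf" / f"{root.name}.pdf"
    if not pdf.is_file():
        problems.append(f"missing 02_pdf/{root.name}.pdf")

    # Assumption: metadata.xml lives at the top level of the folder.
    if not (root / "metadata.xml").is_file():
        problems.append("missing metadata.xml")

    return problems
```

An empty list means the folder has the expected pieces; anything else names what is missing.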
Reports are ingested into the UNT Libraries Digital Infrastructure in batches
Web-scale images, PDFs, and metadata are added to the TRAIL collection in the UNT Digital Library
Master files are added to the Coda Repository for preservation.
Once reports are verified to be online, the physical copies are inventoried and discarded.
Loaned reports are returned to Arizona or the loaning institution.
UNT Digital Library
2015 TRAC Self-Audit
In 2015 the UNT Libraries completed a self-audit using the Trustworthy Repositories Audit & Certification (TRAC) framework
Full documentation for the self-audit is available via the UNT Libraries website. It includes:
Policies related to preservation, access, user feedback and usage
Content and partnership agreements
Detailed workflows and technical documentation
https://www.library.unt.edu/digital-libraries/trusted-digital-repository
UNT Libraries continues to value the partnership we have with TRAIL and looks forward to opportunities to expand our work to provide access to these resources for users around the world.
Thank you. Mark Phillips http://digital.library.unt.edu/