Understanding & Assessing the Million Book Project
Denise Troll Covey, Associate Dean, University Libraries, Carnegie Mellon
Pennsylvania Library Association Conference, Pittsburgh, PA – October 5, 2003
Million Book Project Vision “Attempt to understand & solve the technical, economic, & social policy issues of providing online access to all creative works of the human race.” – Dr. Raj Reddy
What is the Million Book Project? Effort to digitize & provide full-text searching & free-to-read access to a million books by 2007 Collection development Copyright permissions Acquisitions & shipping Scanning operations Proposal writing
Why is the Million Book Project? Democratize knowledge & empower citizenry Address disparity in library size & accessibility Facilitate new knowledge Combining old & new, east & west, technical & humanistic Enhance student learning & success of faculty research Address copyright absurdities
Why is the Million Book Project? Support digital library research Information distribution, management, & sustainability Security, copyright, & digital rights management Accuracy of optical character recognition (OCR) OCR of non-Romanic languages & scripts Automatic creation of structural metadata Automatic summarization Intelligent indexing Machine translation Storage formats Search engines
Who is involved? Carnegie Mellon University Libraries & School of Computer Science Other U.S. libraries Internet Archive India & China OCLC, DLF, & CRL Archival Resource Company Inc.
Funding
Collection development: NSF – $35,000 for initial planning meeting 2001; funded by partners
Copyright permission: UC Merced – $35,000; Carnegie Mellon – ?
Project administration: Carnegie Mellon – ?
Equipment & travel: NSF – $3.6 million (discounts from Minolta)
Labor for scanning: India – $1.5 million; China – ?
Acquisitions & shipping: Internet Archive – ?; NSF – pilot shipment
Collection Development Strategies Librarians as selectors Best books – cited in bibliographies Technical reports & government documents Priorities of participants & funding agencies Topics to get full collections What can we acquire Bulk, cheap, fast Libraries weeding, closing, renovating
Initial Collection of Collections In Copyright Indigenous Indian & Chinese materials Public Domain 100, , ,000 Multi-lingual & multi-script language processing November 2001 planning meeting funded by NSF Books for College Libraries & other selected bibliographies
Current Collection of Collections In Copyright Indigenous Indian & Chinese materials Public Domain 500, , ,000 Multi-lingual & multi-script language processing Books for College Libraries & other selected bibliographies New copyright permission strategy
Scanning Underway in India Multiple centers – each with terabyte storage Indigenous materials Shipments from U.S. 100,000 books by 2004 Above average wages
6000 Book Pilot Shipment to India 20 ft ocean container – 25 days NY to Chennai 243 boxes – 9 pallets – 11,298 lbs – duty free Mostly public domain – government documents, social science, biography, history, & literature Approximate cost $2 per book round trip 4000 books did not have to be returned 2000 books were returned in good condition August 2002 to August 2003
Lessons Learned from Pilot Reduce shipping cost per book Changed packing method: cost drops to $1 per book round trip, or 50 cents per book shipped one way Reduce turn-around time Learned procedures for clearing customs Establish 5 international centers for U.S. shipments Funded by Indian government To be inaugurated January 2004 Receive books, scan, check quality, return
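The per-book figures above can be turned into a rough shipment estimate. The sketch below is illustrative only: the function name is hypothetical, and applying the new rates to the pilot's mix of 4000 one-way and 2000 returned books is an assumption, not a figure from the slides.

```python
def shipping_cost(one_way_books, round_trip_books, one_way_rate, round_trip_rate):
    """Estimate shipping cost in US dollars from per-book rates."""
    return one_way_books * one_way_rate + round_trip_books * round_trip_rate

# Pilot: roughly $2 per book round trip, applied here to all 6000 books
# (an approximation; the slides give only the per-book figure).
pilot_estimate = 6000 * 2.00

# Same mix of books under the new packing method:
# 4000 books shipped one way at $0.50, 2000 books round trip at $1.00.
new_estimate = shipping_cost(4000, 2000, one_way_rate=0.50, round_trip_rate=1.00)

print(pilot_estimate)  # 12000.0
print(new_estimate)    # 4000.0
```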
Distribution in India Central Distribution Site: Deemed University, Thanjavur Pilot: Bangalore, Hyderabad Current: Allahabad, Calcutta, Delhi
Scanning Underway in China Customs & content issues prohibit shipping books to China Scanning indigenous materials & U.S. copyrighted works already in their libraries (with permission granted) Above average wages
Standards & Workflow National standards for digital preservation Developed by IMLS 2001 & endorsed by DLF National standards for cataloging Carnegie Mellon University Libraries Developed & documented workflow Provided training
Digitization Workflow Operators scan, post-process, & OCR 600 DPI TIFF (v5) images ScanFix post-processing ABBYY FineReader OCR 98% accuracy with English Some foreign languages OCR being developed for other languages & scripts
Optimum Scanner Throughput 4000 books per year per Minolta scanner One scanner, two shifts daily = 16 books per day 250 work days per year = 4000 books per year 72,000 books per year with current 18 scanners 400,000 books per year with 100 scanners Allowing 50% deterioration in throughput, 100 scanners can complete the project in 5 years
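The throughput arithmetic on this slide can be checked with a short calculation. This is a minimal sketch using only the slide's figures; the function names and the `efficiency` parameter (1.0 for optimum, 0.5 for the 50% deterioration case) are illustrative choices.

```python
BOOKS_PER_SCANNER_PER_DAY = 16  # one scanner, two shifts daily
WORK_DAYS_PER_YEAR = 250

def books_per_year(scanners, efficiency=1.0):
    """Books scanned per year by a number of scanners at a given efficiency."""
    return int(scanners * BOOKS_PER_SCANNER_PER_DAY * WORK_DAYS_PER_YEAR * efficiency)

def years_to_finish(target_books, scanners, efficiency=1.0):
    """Years needed to scan target_books at the given capacity."""
    return target_books / books_per_year(scanners, efficiency)

print(books_per_year(1))     # 4000
print(books_per_year(18))    # 72000
print(books_per_year(100))   # 400000
print(years_to_finish(1_000_000, 100, efficiency=0.5))  # 5.0
```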
Metadata Workflow Librarians capture Bibliographic metadata - for delivery system MARC from OCLC or create Dublin Core Guest IDs provided by OCLC Administrative metadata - for reporting system Bibliographic metadata Source library Return requested Copyright status – check renewal records Permission status – used by delivery system Collaborative development by India & Carnegie Mellon
Copyright Workflow #1 India Carnegie Mellon
Copyright Workflow #2 India Carnegie Mellon Internet Archive & Archival Resource Company, Inc.
Contextual Searching Legal to scan & create index without permission When no permission granted to display copyrighted book, search returns query terms in OCR context
Acquiring the Collection Archival Resource Company Inc. Packing, shipping, & tracking Help locate & acquire books Weeded collections Closing libraries Acquisitions web site Materials wanted Loaning & donating Insurance
Integrating the Collection Transporting files to Carnegie Mellon Inadequate Internet bandwidth Expense of copying to & from gold CD or DVD Physically transport files on disks
Sustaining the Collection Goal is for 10 organizations to host the Collection India – Digital Library of India, multiple locations China – site(s) not yet known U.S. – Carnegie Mellon, Internet Archive, & University of California Merced Discussions with OCLC, Library of Congress, & Digital Library of Alexandria Estimated cost is $1 million per host site Estimated size is 20 terabytes
Next Steps Ship 12,500 books one-way to Hyderabad, India from University of $6000 Negotiations UMI/ProQuest – print-on-demand service OCLC – digital registry & identifying source libraries CRL – supply or help acquire books November 2003 – collection meeting NSF proposal to create database of copyright renewal records
Print on Demand Service UMI/ProQuest Handle financial transactions Print, bind, & send books to customers Collaboration with Carnegie Mellon Negotiate royalties with publishers Develop suitable business model
Digital Rights Management – Lite Free-to-read by any Internet user Difficult to save or print books One page at a time using browser Secure servers restrict access Discourage hacking by offering affordable printed, bound books
Global Business Model
Hardback book = $30.00
Paperback book = $15.00
Digitalback book = cost of cup of coffee
Internet Archive POD = $1.00 paperback
India POD = $0.80 paperback; $2.00 hardback
Open Access Feasibility Study Couldn’t locate publisher for 11% of books If publisher located, half didn’t respond, even to second letter If response received, fewer than half gave permission Often permission was restricted 22% permission granted – statistically valid random sample
Permission by Publisher Type
Success Rate
Scholarly associations   45%
University presses       37%
Museums & galleries      31%
Commercial publishers    12%
Overall                  22%
Open Access Fine & Rare Books 367 titles in copyright (34% of collection) Couldn’t locate copyright holder for 13% of titles 127 letters & 44 follow-up calls to date 56% titles permission granted 6% with restrictions 3% titles permission denied Assumed if 3 contacts get no response
Transaction Costs
May 2003 through August 2003
$ 6,550  FTE labor
$   225  Phone calls
$    65  Paper & postage
$ 6,840  TOTAL
$37.00 per title
Does not include legal fees, cost of Internet connectivity, or administrator time
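The total can be verified by summing the three line items. The slide does not state how many titles the $37.00 figure was divided over, so the implied title count below is an inference from the two stated numbers, not a reported value.

```python
# Line items from the slide, May 2003 through August 2003 (US dollars).
costs = {"FTE labor": 6550, "Phone calls": 225, "Paper & postage": 65}

total = sum(costs.values())
print(total)  # 6840

# Inference only: the title count is not given on the slide.
STATED_COST_PER_TITLE = 37.00
implied_titles = total / STATED_COST_PER_TITLE
print(round(implied_titles))  # 185
```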
Copyright Negotiations Educate Find online, but use print Online access increases use Open access doesn’t decrease, & can increase sales Copyright absurdity Ask Non-exclusive permission to scan & provide open access Minimal system functionality Give Preservation-quality copies Metadata & OCR Motivate $$ Use in added-value, fee-based services $$ Print on demand for out-of-print titles $$ Buy button for in-print titles Million Book Project
Initial Copyright Approach Do not pay permission cost Focus on out-of-print, in-copyright titles Books for College Libraries has 50,000 titles Begin with scholarly associations & university presses Transaction cost per title is prohibitive Identifying & inserting titles in letters Negotiating & tracking permission per title
Epiphany & New Approach Focus on publishers of quality books Treat bibliographies as approval plan of publishers Books for College Libraries has 5600 publishers Ask for permission to digitize All out-of-print, in-copyright titles All titles published prior to a date of their choosing All titles published # or more years ago List of titles they provide Follow-up phone call or visit
Current Statistics 5600 publishers in Books for College Libraries Using intermittent labor Couldn’t locate 30 publishers (so far) 184 letters & 24 follow-up calls to date 4% permission granted 5% permission denied Full-time staff October 2003
Results of New Approach Estimate transaction costs remain the same But acquire more books for $$ spent National Academy Press – 99% increase 26 titles in Books for College Libraries Permission for 3,046 titles Brookings Institution – 96% increase Rand McNally – 60% increase
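The slide does not say how the "increase" percentages were computed. One reading that reproduces the National Academy Press figure is the share of permitted titles that lie beyond the Books for College Libraries list. The sketch below assumes that reading; the function name is hypothetical.

```python
def share_beyond_bibliography(bcl_titles, permitted_titles):
    """Percent of permitted titles that were not in the bibliography."""
    return 100 * (permitted_titles - bcl_titles) / permitted_titles

# National Academy Press: 26 titles in Books for College Libraries,
# permission granted for 3,046 titles.
nap = share_beyond_bibliography(26, 3046)
print(round(nap))  # 99
```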
“More Bang for the Buck” [Chart: Initial vs. Current collection, by Indigenous Materials, Public Domain, and In Copyright]
Projections
Success rate (# BCL publishers)   # of books per publisher   Million Book Collection
4% (224)                                                     ,000
6% (336)                                                     ,000
22% (1,232)                       1500                       1,848,000
We may need to negotiate with India for more labor
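The 22% row can be reproduced from the 5600 publishers in Books for College Libraries and 1500 books per publisher. The sketch below applies the same 1500 figure to the 4% and 6% rows, which is an assumption, because those cells did not survive in the source.

```python
BCL_PUBLISHERS = 5600  # publishers in Books for College Libraries

def projected_books(success_rate_pct, books_per_publisher):
    """Return (publishers granting permission, projected number of books)."""
    publishers = BCL_PUBLISHERS * success_rate_pct // 100
    return publishers, publishers * books_per_publisher

# 1500 books per publisher is stated only for the 22% row;
# using it for 4% and 6% is an assumption.
for rate in (4, 6, 22):
    publishers, books = projected_books(rate, 1500)
    print(f"{rate}% ({publishers:,}) -> {books:,} books")
# 4% (224) -> 336,000 books
# 6% (336) -> 504,000 books
# 22% (1,232) -> 1,848,000 books
```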
Usage Assessments Transaction log analysis – beginning 2004 Number of searches, browses, pages displayed Use per title online & print-on-demand Use by different geographic demographics Use per time of day, day of week, month of the year Outcomes assessment – User demographics – age, gender, location How users found the Collection Why they used it & what they did with it What difference the Collection made Their assessment of the quality of the Collection & the usability & functionality of the system Their view of the significance of the project
Copyright Assessments – 2006 Number of copyrighted books in the collection Success rate of permission requests Survey of participating publishers Overall satisfaction Quality of the copies What they did or plan to do with the copies Impact on revenue & view of open access
Dissemination Million Book Collection Books accessible via Google search Libraries can link to collection from web site Libraries can link books to catalog records Publisher database Successful negotiation strategies Research test bed
Thank you!