Massively Digitizing UC Library Collections: Google, Microsoft, and More
Learning in Retirement
Libraries – The Intersection of Tradition and Innovation
April 10, 2008
Ivy Anderson & Heather Christenson
California Digital Library
Two Complementary Roles
  Facilitate library collaboration across the ten campuses of the UC system (e.g. shared collection development)
  Distinctive services emphasizing digital stewardship, innovation in scholarly publishing, and open-access digital collections
Three Audiences
  UC libraries
  Broader UC community
  External constituencies and the general public
Five Programs
  Collection Development and Management (Licensed Content, Shared Print Collections, Mass Digitization)
  Bibliographic Services (Melvyl Catalog, SFX)
  Preservation (Digital Preservation Repository, Web Archiving)
  Digital Special Collections (Calisphere, Online Archive of California)
  Publishing Services (eScholarship Repository, eScholarship Editions, collaboration with UC Press)
“11th University Library” founded 1997
Part of UC Office of the President
Digitization of Library Collections Special Collections Manuscripts, archival collections, photographs, etc. CDL / UC Libraries Online Archive of California Calisphere Berkeley, University of California, Bancroft Library, UCB 150, f. 252v
Digitization of Library Collections Specialized Texts and Corpora Making of America: 10,000 texts in 10 years CDL eScholarship Editions
Digitization of Library Collections Commercial Partnerships EEBO: 100,000 important early English texts Licensed access via ProQuest Satans stratagems, copy from UCLA Library
…and Along Came Google Google Library Project 2005: The ‘Google Five:’ Harvard, Oxford, New York Public Library, Stanford, University of Michigan 2008: 20 library partners in 5 countries Google Publisher Partner Program
…and the Open Content Alliance October 2005 Founders: Internet Archive, University of California, U of Toronto… Large-scale digitization of out-of-copyright works only A project of the Internet Archive
…and Microsoft Out-of-Copyright Works Only
UC Involvement October 2005 August 2006 March 2007 Founding Member of Open Content Alliance UC Joins Google Library Project Microsoft Digitization Agreement
So: Three Projects, One Goal
Goal: Mass digitization of library book collections
Google
  In-copyright and out-of-copyright works
  Available via Google search engine and Google Book Search
Microsoft
  Out-of-copyright works only
  Available via Microsoft Live Search
Open Content Alliance
  Out-of-copyright works only
  Available (via the Internet Archive website) to any and all search engines
  Library and grant-funded
Why Are They Doing It?
Google’s vision: To put all the world’s information online
Google and Microsoft: To gain market share and competitive advantage for their search (and online advertising) services
  It’s all about Search
OCA: To put the world’s information online, for free, forever
  It’s all about the public good
Why Are We Doing It?
To enhance student and faculty research
  To put our collections where our users are – in Google! Mass digitization of these materials enhances access. It can make people aware of books they might not otherwise have discovered and lead them, through an internet search, back to our libraries.
  To support deeper textual analysis and research. When the full text is indexed and searchable by computer, scholars can trace the evolution of ideas and perform other sophisticated textual analysis, opening scholarship in new ways.
To fulfill our public service mission
  Many books of enduring general interest, including classic works of literature and rarer items such as early histories of the settlement of California and the West, can now be read by anyone, anywhere, anytime.
To preserve and protect our collections
  In earthquake- and fire-prone California, digitizing books in our collections may also help protect the university from catastrophic loss should disaster someday strike our libraries.
Microsoft/OCA Project Contributors Northern Regional Library Facility (NRLF) Southern Regional Library Facility (SRLF) UC Berkeley, Bancroft Library UCLA
Google Project Contributors Northern Regional Library Facility (NRLF) + UC Berkeley Systems UC Santa Cruz UC San Diego
CDL’s role, on behalf of UC Liaison with partners Planning & coordination Funding Stewardship of digital content New services
Campuses Provide the Books
The Book Digitization Process A world of barcodes, logistics, loading docks, packing materials, and scanning machines!
Reasons books might get rejected (images)
Costs to the UC Libraries
Staffing (2-5 FTE at each of 5 locations)
Physical space & facilities
  Scanning centers (where scanning machines are housed), book processing, queue storage (book trucks)
Costs to run campus systems
CDL servers for inventory database, digital preservation
Digital files Images OCR - Text OCR - Page coordinates Metadata
What sort of books are being digitized? American history Humanities Science Cookbooks Children’s books East Asian & Pacific Rim collections
Where can you access the books? Google Book Search: Microsoft Live Search Books: e=books Internet Archive: alifornia_libraries Test version of UC Union catalog:
Copyright status is a factor Out of copyright, pre-1923 “orphan works,” present
At the frontier…
What’s ahead Digital preservation – storage, storage, storage Copyright determination Print on demand
New modes of access & critical mass of digital books will transform scholarship
Full text search – new form of book discovery
Beyond search – text mining, computationally assisted research
Machines can interact with massive amounts of texts, and provide new structures
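As a rough illustration of what full-text search over digitized books involves, the sketch below builds a tiny inverted index (word to set of books) and answers a multi-word query. It is not how Google, Microsoft, or the Internet Archive implement search; the function names, book ids, and sample texts are all hypothetical stand-ins for real OCR output.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(books):
    """Map each word to the set of book ids whose text contains it."""
    index = defaultdict(set)
    for book_id, text in books.items():
        for word in tokenize(text):
            index[word].add(book_id)
    return index


def search(index, query):
    """Return the sorted ids of books containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    # Start from the books matching the first word, then intersect
    # with the matches for each remaining word.
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical sample data standing in for the OCR text of scanned books.
books = {
    "book-1": "An early history of the settlement of California",
    "book-2": "A cookbook of the Pacific Rim",
    "book-3": "California and the West: a history",
}
index = build_index(books)
print(search(index, "california history"))
```

The same word-to-book mapping is a starting point for simple text mining, for example counting how many books mention a term, once the OCR text of a whole collection is available.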
Questions? Heather Christenson, CDL Mass Digitization Project Manager Ivy Anderson, CDL Director of Collections For more information: sdig/