Published by Myra Whitehead. Modified over 9 years ago.
1
Project Updates: Posner & Million Book Projects
Denise Troll Covey, Principal Librarian for Special Projects
July 2004 – University Libraries Staff Meeting
2
Posner Project 2002 – 2004
Digitize the collection of Henry Posner Sr.
  – 642 titles (1106 volumes) + 800 archival folders
Acquire copyright permission
  – 27% of volumes & small % of archival documents
Create web site
Provide unrestricted access
Funded by Henry & Helen Posner Jr.
3
Preparation
Evaluated & purchased Zeutschel scanner
Interviewed & hired scanner operator
Organized working group
Designed workflow
4
Posner Working Group
2001–2002: Erika Linke, Project Director
2002–2004: Denise Troll Covey, Project Director
  – Linda Dujmic, Carole George, Bella Gerlich, Terry Hurlbert, Mary Kay Johnsen, Chris Kellen, Ann Marie Mesco, Joe Mesco, Gabrielle Michalek, Angel Morris, Jon Singletary, Brian Woolstrum
5
Scanning the Books
313,413 pages – all pages that can be scanned
  – 600 DPI full color TIFF images
  – Most books scanned on Zeutschel
  – 24 books scanned on loaner Indus
  – 20 books not scanned
  – 35 books partially scanned
200 gigabytes total storage
  – Scanned 40 gigabytes every 3–10 days
6
55 Books Not Scanned or Not Entirely
Reasons
  – Bound too tight, pages uncut, poor condition, too large, too small, vellum
Plans (20 surrogates located)
  – Locate & link to electronic versions at other sites
  – Digitize same title, different edition in our collection
7
Scanning the Archival Documents
1,714 pages – all docs that are not confidential
  – Correspondence, dealer catalogs, pamphlets, advertisements, newspaper clippings, & newsletters
Gloriana will verify that nothing confidential was scanned
8
Design, Implement & Maintain
http://posner.library.cmu.edu
Search & retrieval system (DIVA v2)
Usage statistics
User Interface design & functionality were informed by user studies
9
Search
Currently search via metadata
Full text search coming soon
10
Browse
11
Results
Click to display book
12
Display – Full Text Available
Click to display full text
13
Navigate
14
Display – Annotation Available
Click to display annotation
15
User Interface Messages
Blank pages
Books not scanned in entirety
No permission to display
Alternative copy
16
Posner Collection Usage
Books: 633 titles (1004 volumes) online
Archival documents not yet online
17
Workflow Plan 2003
Scan page images (create TIFFs)
Post–process (Wolfpack)
  – Fix – straighten, remove speckles, etc.
  – Convert to JPEG for display
  – Convert to text for full text search (OCR)
Backup to tape
Ingest into system for search & retrieval
Update DOI server & add PURL to library catalog
18
Problems
No OCR in initial workflow
  – Not specified in proposal
  – No OCR software for color images
Scanned faster than we could post–process
  – Huge backlogs of books
      Scanned to servers, waiting for post–processing
      On backup tape, waiting to restore & OCR
Identified OCR as bottleneck
Changed workflow to ingest books prior to OCR
19
Many–to–One, One–to–Many Problem
Books
  – 1 title, many volumes
  – Many titles, 1 volume
Archival folders
  – 1 folder, many titles
  – Many folders, 1 title
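One way to picture this problem is as a many-to-many relation between titles and physical volumes (the same shape applies to folders and titles). The sketch below is illustrative only: the class name and sample identifiers are hypothetical, and the slides do not say how the project's database actually modeled the relation.

```python
from collections import defaultdict


class TitleVolumeIndex:
    """Many-to-many index between titles and volumes (hypothetical model)."""

    def __init__(self):
        self._volumes_by_title = defaultdict(set)
        self._titles_by_volume = defaultdict(set)

    def link(self, title, volume):
        """Record that a title appears in a physical volume."""
        self._volumes_by_title[title].add(volume)
        self._titles_by_volume[volume].add(title)

    def volumes_for(self, title):
        """All volumes that carry the given title, sorted."""
        return sorted(self._volumes_by_title.get(title, ()))

    def titles_in(self, volume):
        """All titles bound into the given volume, sorted."""
        return sorted(self._titles_by_volume.get(volume, ()))


index = TitleVolumeIndex()
# One title spread over many volumes (made-up identifiers).
index.link("Title A", "vol-001")
index.link("Title A", "vol-002")
# Many titles bound together in one volume.
index.link("Title B", "vol-003")
index.link("Title C", "vol-003")

print(index.volumes_for("Title A"))
print(index.titles_in("vol-003"))
```

Keeping both directions of the lookup makes it possible to count either titles or volumes without double counting, which matters when reporting figures such as "titles (volumes) online".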
20
2004 Quality Control
Discovered project personnel had different versions of the “complete” list of Posner books
Created canonical spreadsheet
Checking status of each book
  – Scanned & fixed page images
  – Converted to JPEG
  – Converted to text (OCR)
  – Added PURL to library catalog
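The per-book status check can be sketched as a small script over the canonical list. This is a minimal sketch under assumptions: the field names and sample rows are hypothetical, and the actual spreadsheet layout is not described in the slides.

```python
# Status columns mirror the four checks listed on the slide; the field names
# themselves are hypothetical.
REQUIRED_STEPS = (
    "scanned_and_fixed",
    "converted_to_jpeg",
    "converted_to_text",
    "purl_in_catalog",
)


def missing_steps(record):
    """Return the required steps a book record has not completed yet."""
    return [step for step in REQUIRED_STEPS if not record.get(step, False)]


def qc_report(books):
    """Map each incomplete book ID to its outstanding steps."""
    report = {}
    for book_id, record in books.items():
        outstanding = missing_steps(record)
        if outstanding:
            report[book_id] = outstanding
    return report


# Made-up rows standing in for the canonical spreadsheet.
books = {
    "posner-0001": {"scanned_and_fixed": True, "converted_to_jpeg": True,
                    "converted_to_text": True, "purl_in_catalog": True},
    "posner-0002": {"scanned_and_fixed": True, "converted_to_jpeg": True,
                    "converted_to_text": False, "purl_in_catalog": False},
}

for book_id, outstanding in qc_report(books).items():
    print(book_id, "still needs:", ", ".join(outstanding))
```

Working from a single list means every team member sees the same outstanding steps for the same book.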
21
Copyright Permission
Goal: permission to digitize & provide open access
  – 302 copyrighted volumes (27% of Collection)
  – Archival catalogs & newsletters
22
Copyright Permissions Workflow
Identify copyrighted volumes (27%)
Identify & locate copyright holders
Send letters requesting non–exclusive permission
  – To digitize & provide open access
  – Option to restrict access to Carnegie Mellon only
Follow up with phone call or email
Negotiate to closure
Update statistics & database
23
Problems Seeking Permission
Identifying & locating copyright holders
Publishers
  – Lost or misplaced request letter
  – Don’t know what they published
  – Don’t know what rights they have
  – Afraid of open access & lost revenue
Learning copyright laws
  – Abandoned foreign works
24
Posner Study Success Rates
[Charts; legend: Permission granted, Permission denied, No response, Not located. Values as extracted, without a recoverable mapping to the legend: 29%, 71%, 24%, 22%, 54%, 32%, 17%, 15%, 36%]
Received permission for 48% of books
2% of publishers applied access restrictions
25
Analysis by Copyright Holder Type
[Chart: success rate based on responses, by copyright holder type: scholarly associations, university presses, commercial publishers, authors & estates, other]
26
Transaction Costs
May 2003 – October 2003
$ 10,808   FTE labor
$    379   Phone calls
$    100   Paper & postage
$ 11,287   TOTAL
Does not include legal fees, administrator time, or cost of Internet connectivity or database creation.
$78 per book/volume
174 letters & 159 follow–up calls or email
27
Other Work
Created student internships (funded by Posner)
Assisted Posner Center opening May 2004
  – DVD, “jade” reproduction, initial exhibit
Moving the Collection August 2004
Ongoing
  – Posner Center & Collection security
  – Staffing the Posner Center
  – Selecting & supervising interns
  – Preparing exhibits
28
The Million Book Project
Digitize & provide open access to a million books
Vision, leadership, & research – Carnegie Mellon
$$ Equipment & travel – NSF
$$ Labor & research – India & China
“Attempt to understand & solve the technical, economic, & social policy issues of providing online access to all creative works of the human race.” Raj Reddy
29
Rationale
Democratize knowledge & empower citizenry
  – Address disparity in library size & accessibility
Facilitate new knowledge
  – Combining old & new, east & west, technical & humanistic
Enhance student learning & success of faculty research
Preserve cultural treasures
Address copyright absurdities
30
Support Digital Library Research
Information distribution, management, & sustainability
Security, copyright, & digital rights management
Accuracy of optical character recognition (OCR)
OCR of non–Romanic languages & scripts
Automatic creation of structural metadata
Automatic summarization
Intelligent indexing
Machine translation
Storage formats
Search engines
31
Overview 2001–2002
2001
  – NSF funded collection & planning meeting
  – Meet with Chinese partners in U.S.
2002
  – NSF funded equipment & travel
  – Meet with Chinese partners in China
  – Meet with Indian partners in U.S.
  – Pilot shipment of books to India
[Photo: Gloriana & Mark Kamlet]
32
Overview 2003 – 2004
2003
  – All partners meet in India
  – Meet with partner OCLC
  – Partner U.C. Merced Libraries fund copyright permission work
2004
  – All partners meet in U.S.
  – Lesk work on identifying public domain books
  – Kahle v Ashcroft
[Photo: Meeting with Indian President Dr. A.P.J. Abdul Kalam]
33
Collection & Planning Meeting November 2001
Collection of collections
Copyright considerations
  – Books for College Libraries
  – Separate funding
Avoid duplicate scanning
Safe, affordable shipments
Pilot shipment to India
[Diagram labels: Indigenous Materials, Public Domain, In Copyright]
34
2002
Purchased scanners
Sent scanners to India & China
Trained scanner operators & librarians
Began scanning indigenous materials
Began working with DLF & OCLC to develop digital registry
[Photo: Incised palm leaves, Saraswathi Mahal Library]
35
Pilot Shipment to India
Questions to be answered
  – What does it take to pull & pack books for shipment abroad & track their scanning & return?
  – Should shipments be coordinated among participating U.S. libraries, centralized, or individualized?
  – How long will the books be away?
  – Will they be returned safely?
  – How will the scanning turn out?
36
Explore trade–offs
By air is faster, more expensive
  – Air containers hold 1500 kilos (3300 pounds)
By sea is slower, less expensive
  – Ocean containers hold four times as much
Preliminary talk with Emery
  – Space available on flights to India & China, but little space available on return flights
  – Suggested shrink–wrap & return books by sea
37
August 2002 Pilot Shipment to India
6,000 books – mostly public domain
  – 243 boxes, 11,298 pounds of books on nine pallets
  – Shipped from New York to Chennai in a 20–foot ocean container
Shipment took 25 days to reach Chennai
Round trip cost $2 per book
  – A third of the books had to be returned to the U.S.
38
Pilot Shipment Book Distribution
From Chennai, books went to the central distribution center in Tanjore
From Tanjore, books were distributed to scanning centers
[Photo: Pilot Central Distribution Site, Deemed University]
39
One Year Later...
Most of the books were returned in good condition in August 2003
Lost books located & returned 2004
[Photo: Scanning center in Hyderabad, India]
40
Lessons Learned & New Strategies
Reduce costs
  – Packing books in crates costs $1 per book round trip
  – Books that don’t need to be returned cost just $0.50
Reduce turn around time
  – Learned customs procedures
  – Changed distribution strategy
Books now go direct from seaport to 4 international megacenters
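To make the cost reduction concrete, the per-book rates can be applied to a shipment the size of the pilot (6,000 books at $2 per book round trip). The function below is a rough sketch: its name and the example split between returned and non-returned books are hypothetical, and amounts are kept in cents to avoid floating-point rounding.

```python
PILOT_CENTS_PER_BOOK = 200     # pilot shipment: $2 per book, round trip
CRATE_ROUND_TRIP_CENTS = 100   # crates: $1 per book, round trip
CRATE_ONE_WAY_CENTS = 50       # books that do not need to be returned: $0.50


def crate_shipping_cents(total_books, books_returned):
    """Cost in cents of a crate shipment where only some books come back."""
    if not 0 <= books_returned <= total_books:
        raise ValueError("books_returned must be between 0 and total_books")
    one_way_books = total_books - books_returned
    return (books_returned * CRATE_ROUND_TRIP_CENTS
            + one_way_books * CRATE_ONE_WAY_CENTS)


books = 6000  # size of the pilot shipment
pilot_cost = books * PILOT_CENTS_PER_BOOK

print(f"Pilot rate:            ${pilot_cost / 100:,.2f}")
print(f"Crates, all returned:  ${crate_shipping_cents(books, books) / 100:,.2f}")
print(f"Crates, none returned: ${crate_shipping_cents(books, 0) / 100:,.2f}")
# Hypothetical split: 2,000 books returned, 4,000 kept abroad.
print(f"Crates, 2,000 returned: ${crate_shipping_cents(books, 2000) / 100:,.2f}")
```

At these rates a shipment of the pilot's size would cost between a quarter and a half of the pilot's cost, depending on how many books must come back.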
41
Copyright Permission
Dedicated labor beginning November 2003
  – Funded by U.C. Merced Libraries
Send request letter + prompt follow up call or email
Books for College Libraries
  – 50,000 titles
  – 5,600 publishers
[Photo: Scanning center in Hyderabad, India]
42
“More Bang for the Buck!”
Shifted from per title to per publisher approach
[Diagram labels: Original, Current; Indigenous Materials in India & China, Public Domain, In Copyright]
43
Request Letter
Educate about open access & user behavior
Ask for non–exclusive permission to digitize
  – All out of print, in copyright titles
  – All titles published prior to a date of their choosing
  – All titles published # or more years ago
  – List of titles they provide
Assure – follow standards & laws; limit print & save
Give – images, metadata, & OCR $$$
44
Problems Seeking Permission
Identifying & locating copyright holders
Publishers
  – Lost or misplaced request letter
  – Afraid of open access & lost revenue
  – Don’t know what they published or what rights they have & don’t have the resources to figure it out
  – Involved in other projects – perhaps exclusively
  – Copyright reverts to author when book out of print
45
Preliminary Statistics
[Charts; legend: Permission granted, Permission denied, Still negotiating. Values as extracted, without a recoverable mapping to the legend: 46%, 54%, 73%, 12%, 14%]
Need 18% success rate with BCL publishers granting permission for 500 books each
371 publishers contacted
14% success rate
46,700+ titles
46
Analysis by Copyright Holder Type
Million Book Project
[Chart: success rate based on responses, by copyright holder type: scholarly associations, university presses, commercial publishers, authors & estates, other]
47
Estimated Transaction Costs
Nov 2003 – Jun 2004
$ 18,846   Labor
$    323   Follow up
$    194   Paper & postage
$ 19,363   TOTAL
Does not include legal fees, administrator time, or cost of Internet connectivity or database creation.
$0.42 per book
540 letters & 359 follow up calls or email
[Photo: Internet Café, India]
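The two transaction-cost slides (Posner study, May to October 2003, and Million Book Project, November 2003 to June 2004) can be checked and compared with a few lines of arithmetic. This is only a sanity-check sketch using the figures printed on the slides; the variable names are illustrative, and the per-book costs are taken as stated rather than recomputed.

```python
# Line items in whole dollars, as printed on the two slides.
posner_items = {"FTE labor": 10808, "Phone calls": 379, "Paper & postage": 100}
mbp_items = {"Labor": 18846, "Follow up": 323, "Paper & postage": 194}

posner_total = sum(posner_items.values())
mbp_total = sum(mbp_items.values())

# Per-book costs as stated on the slides.
posner_per_book = 78.00
mbp_per_book = 0.42
per_book_ratio = posner_per_book / mbp_per_book

# Follow-up calls or emails per request letter sent.
posner_followups_per_letter = 159 / 174
mbp_followups_per_letter = 359 / 540

print(f"Posner total: ${posner_total:,}")
print(f"Million Book Project total: ${mbp_total:,}")
print(f"Per-book cost ratio (Posner / MBP): about {per_book_ratio:.0f}x")
print(f"Follow-ups per letter: {posner_followups_per_letter:.2f} vs "
      f"{mbp_followups_per_letter:.2f}")
```

The line items add up to the totals shown on both slides, and the per-book figures illustrate how much the per-publisher approach lowers the cost of each permission.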
48
Experiment: Renewal Databases
Catalog of Copyright Entries (Digitized at Carnegie Mellon)
  – http://onlinebooks.library.upenn.edu/cce/
Library of Congress Copyright Catalog
  – http://www.copyright.gov/records/cohm.html
  – To find renewals 1973–1978 must consult another source for registration numbers
Lesk Copyright Renewal Records
  – http://www.scils.rutgers.edu/~lesk/copyrenew.html
  – Functionality created & enhanced by Michael Lesk
49
Experiment: U.S. Office of Copyright
Expedite identifying & locating copyright holder
  – Asked to identify & locate copyright holders of 7 titles
  – $150 fee – charged 3 days after request
  – Estimated 4 to 6 weeks
  – Nudged at 8 weeks
  – Took 15 weeks to respond
  – Confirmed one citation
[Photo: Scanning center in Hyderabad, India]
50
Experiment: Authors Registry
Expedite locating copyright holder
  – Asked to locate 25 authors or estates
  – $2.50 fee per author/estate found
  – Same day response
  – Found 52%
  – 92% accuracy rate
Authors likely to grant permission, but transaction cost per book is high
[Photo: Scanning center in Hyderabad, India]
51
Experiment: Workflow Time Trials
Need to generate lists of BCL titles for publishers that will consider a list that we provide
Expedite generating lists of titles (from dirty OCR)
  – Verify citation – 30% improvement using digital BCL over WorldCat
  – Verify copyright status – improvement using Lesk’s enhanced copyright renewal records
  – Verify print status – not cost effective
Reduced cost per title from $2.12 to $1.41
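The size of that reduction is easy to work out. The sketch below computes the percentage saved per title and, purely as an illustration, scales it to the 50,000 titles in Books for College Libraries; applying the saving to every BCL title is an assumption, not a figure from the slides.

```python
OLD_COST_CENTS = 212   # $2.12 per title before the workflow changes
NEW_COST_CENTS = 141   # $1.41 per title after the workflow changes
BCL_TITLES = 50000     # Books for College Libraries title count

saving_cents = OLD_COST_CENTS - NEW_COST_CENTS
percent_reduction = 100 * saving_cents / OLD_COST_CENTS

# Hypothetical upper bound: the saving applied to every BCL title.
bcl_saving_dollars = saving_cents * BCL_TITLES / 100

print(f"Saving per title: ${saving_cents / 100:.2f}")
print(f"Reduction: {percent_reduction:.1f}%")
print(f"If applied to all BCL titles: ${bcl_saving_dollars:,.0f}")
```

The reduction works out to roughly a third of the original cost per title.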
52
Next Steps
Partnerships pending
  – Vendor to provide print–on–demand service
  – U.S. Office of Copyright to study impact of CTEA
Standards & best practices
  – NISO solicited proposal to develop copyright metadata
  – Lead best practice for acquiring copyright permission
Research
  – Continue data gathering & analyses
  – Survey participating publishers’ views of open access
53
2004 Meeting at Carnegie Mellon
Partners from India & China
124,000+ books scanned in India, China & Egypt
Additional centers planned in India, China, Poland, Turkey, Korea
54
Scanning in China
$8.46M from Ministry of Education 2003–2006
45,000 books scanned
  – Capacity 0.5M pages scanned per day
  – Want to scan 1M books in 2 years
  – 14 scanning centers
25,000 books from our collection shipped to China July 2004
[Photo: Scanning center in Beijing]
55
Scanning in India
$25M annually to support research
70,000 books scanned
  – Capacity 1.5M pages per day
  – 20 scanning centers
  – 4 are megacenters
Cost 0.69 rupees per page
  – 0.89 rupees with cost of scanner
  – $0.015 – $0.019 U.S. dollars
 Books   Language
45,660   English
10,486   Telugu
 1,114   Sanskrit
 1,091   Tamil
 3,454   Urdu
   220   Kannada
   475   Hindi
   636   Marathi
 7,158   Other
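The figures on this slide can be cross-checked with a short calculation. The sketch below sums the per-language counts and converts the per-page cost to U.S. dollars. The exchange rate of 46 rupees per dollar is an assumption, inferred from the slide's own pairing of 0.69 rupees with $0.015; it is not stated in the presentation.

```python
# Books scanned in India by language, as listed on the slide.
books_by_language = {
    "English": 45660,
    "Telugu": 10486,
    "Sanskrit": 1114,
    "Tamil": 1091,
    "Urdu": 3454,
    "Kannada": 220,
    "Hindi": 475,
    "Marathi": 636,
    "Other": 7158,
}

total_books = sum(books_by_language.values())
english_share = 100 * books_by_language["English"] / total_books

# Assumed exchange rate, inferred from the slide (0.69 rupees ~ $0.015).
RUPEES_PER_DOLLAR = 46

cost_without_scanner = 0.69 / RUPEES_PER_DOLLAR
cost_with_scanner = 0.89 / RUPEES_PER_DOLLAR

print(f"Total books in table: {total_books:,}")
print(f"English share: {english_share:.1f}%")
print(f"Per page: ${cost_without_scanner:.3f} without scanner cost, "
      f"${cost_with_scanner:.3f} with scanner cost")
```

The table total is consistent with the "70,000 books scanned" headline, and the converted costs match the dollar range given on the slide.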
56
Scanning in India
Many books scanned, but no OCR available
Will be done with Indian books in 6 to 12 months
Working on quality control
[Photo: Librarians in Hyderabad, India]
57
Research
Access for visual & hearing impaired
Automatic summarization
Machine translation
OCR
Searching
User interfaces
[Photo: State Library in Hyderabad, India]
58
Sample Translation Telugu to English
59
Current book display & navigation
Search within the book
Next & previous page
Zoom in & out
Select format
Go to page
60
Proposed book display & navigation
Zoom in & out
Select format
Search within the book
Next & previous page
Go to page
Add or remove bookmark
Add to or view bookbag
Get info about the book
New search for other books
Return to results
Help
Title of book
Location within book
Navigate book
Beginning & end of book
61
Needs
More books
  – India requested 60K books
  – China wants 10K books per month
More scanners
More storage
Way to transfer content to Carnegie Mellon
  – DVD burners to be provided
Way to standardize the metadata
New policy on copyright
62
Copyright Policy
Public Domain Enhancement Act HR 2601
  – Pay $1.00 to maintain copyright after 50 years
  – Register copyright agent
  – U.S. Office of Copyright maintains freely accessible list of titles & agents
Options to compensate copyright holders
  – Compulsory licensing – like sound recordings
  – Public Lending Right – government pays
  – Subscription model – like HBO
  – Metered use – pay per view
  – Free to read – pay to print
Want a global model
63
Lesk Identifying Public Domain
Public domain appears to have 5.5M books in English
  – 2M published in U.S. pre–1923
  – 2M published in U.S. 1923–1963 (90+% not renewed)
  – 1.5M (out of 8M) published in foreign countries
1.5M books in French, German, Spanish & Italian
Expect half to be difficult to locate
64
Kahle v Ashcroft
92% of books are in copyright, but out of print
Challenge U.S. copyright system
  – No records of copyright ownership
  – Denies public access to orphaned works without providing any benefits
http://notabug.com/kahle/
  – Submit examples of impact of barriers to use of orphaned books
65
Use Internet Explorer
The Universal Library, China site: http://www.ulib.org.cn/
Digital Library of India: http://www.dli.gov.in/home.htm
The Universal Library, U.S. site: http://www.ulib.org/html/index.html
FAQ: http://www.library.cmu.edu/Libraries/MBP_FAQ.html
66
Thank you!
Denise Troll Covey – troll@andrew.cmu.edu