Google Confidential Making all the World’s Books Fully Searchable ICUDL 2006 Daniel Clancy Engineering Director, Google Book Search 19-Nov-2006.

Presentation transcript:

Google Confidential Making all the World’s Books Fully Searchable ICUDL 2006 Daniel Clancy Engineering Director, Google Book Search 19-Nov-2006

Making Information Accessible: Arpanet Team

Larry and Sergey’s idea

Google’s Mission: To organize the world’s information and make it universally accessible and useful. Online content: billions of web pages. Offline content: billions of items still unindexed.

Two Initiatives. GOAL: Create a comprehensive virtual card catalog of all books in all languages, while respecting publishers’ rights. Library Program: ~85% of books are out of print and/or out of copyright; these books are only found in libraries. Partner Program: only ~15% of books are in print.

Google Books Library Project, Current Partners: University of Michigan; Stanford University; Harvard University; Oxford University; New York Public Library; Library of Congress; University of California System; University of Virginia; University of Madrid.

Really, how many books? Library Holdings:
Library of Congress: 24,616,867
Harvard University: 14,437,361
Chicago Public Library: 10,994,943
New York Public Library: 10,608,570
Yale University: 10,492,812
Queens Borough Public Library: 10,357,159
Oxford University: 10,000,000
…
University of Michigan: 7,348,360
Stanford University: 7,286,437

The value is in the middle. 92% of the world's books are neither generating revenue for the copyright holder nor easily accessible to potential readers.* A Typical Library Collection:
Public Domain (books published before 1923): less than 20%**
Unclear copyright status (books after 1923 but… may be in copyright, but not for sale; rights may have reverted to author; may be in the public domain): ~65% or more
In-Print (Partner Program): 15%
*Source: Covey, Denise Troll. "Global Cooperation for Global Access: The Million Book Project"
**OCLC analysis of the Google Books Library Project: ~15%

Three User Experiences: Full Book View (20%); Snippet View (65% or more*); Sample Pages View (~15%).

A Closer Look at the Snippet View (for books we scan that are still in copyright). User can view: bibliographic info; a few sentences around the query. Restricted searching: same 3 snippets, never more. Links to purchase: in-print, online bookstores; out of print, used bookstores.
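The snippet restriction described on this slide can be illustrated with a minimal sketch. It is not Google's implementation: the function name `find_snippets`, the naive sentence splitter, and the choice to return only the matching sentence (rather than a few sentences around it) are all assumptions made for illustration.

```python
import re

MAX_SNIPPETS = 3  # the slide states "same 3 snippets, never more"


def find_snippets(text, query, max_snippets=MAX_SNIPPETS):
    """Return at most max_snippets sentences that contain the query.

    Hypothetical helper: sentences are split naively on ., ! or ?
    followed by whitespace, and matching is case-insensitive.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    needle = query.lower()
    hits = [s for s in sentences if needle in s.lower()]
    # Always keep the first hits in document order, so repeating the
    # same query returns the same snippets and never reveals more text.
    return hits[:max_snippets]
```

Because the selection is deterministic, rerunning a query cannot be used to page through the rest of the book.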

Sample Pages View. User limitations: 10%-20% of pages/user/month; scrolling 2 pages left or right. Full text of book is indexed. No cut/copy/print.
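A per-user monthly page limit like the one on this slide could be tracked roughly as follows. This is a sketch under stated assumptions, not the actual system: the class name `PreviewQuota`, the string month key, and the 20% default are hypothetical, and the "2 pages left or right" scrolling rule is not modeled.

```python
from collections import defaultdict


class PreviewQuota:
    """Hypothetical tracker for a pages-per-user-per-month preview limit."""

    def __init__(self, total_pages, fraction=0.2):
        # Maximum number of distinct pages one user may see in one month.
        self.limit = int(total_pages * fraction)
        # Maps (user, month) to the set of pages already shown.
        self.seen = defaultdict(set)

    def can_view(self, user, month, page):
        pages = self.seen[(user, month)]
        if page in pages:
            return True  # re-reading a page does not use more quota
        if len(pages) >= self.limit:
            return False  # quota for this month is exhausted
        pages.add(page)
        return True
```

Keying the set by month means the quota resets naturally when the month changes, with no cleanup job needed for correctness.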

Public Domain Books: pre-1923 (in US) books that are no longer in copyright. Public Domain UI: results displayed in onebox; full text of book is indexed; same as original UI with no browsing restrictions; no cut/copy/print; additional link to ‘Find it in a library’; no attribution to the library that provided the book. Google Print

Book Reference Page: links to “Buy this Book”; Google Adsense for Content contextual advertising; bibliographic data; synopsis; reviews; publisher branding.

Finding Books

It’s Not Just About Our Most Popular Searches… What are people searching for? Everything. [Chart: number of search queries per keyword, plotted across keywords; example keywords: Harry Potter, Wireless Home Networking, Peruvian Orchids, Jersey City]

Finding Books: “europe constitution”. One Box; Book Search Property; Blended Results.

Ego-searching and discovering your past: “Never before could I have found such an obscure and wonderful gem. Google Book Search prompted me to buy two copies of a book that I never would have known about, otherwise.” - Bernie Robichau, Columbia, SC

Book Flow Process: Scanning; Indexing & Serving; Logistics; Processing & Storage. Some principles: use smart software when possible; detect problems; fix them.

Assessing the scale of the problem. Estimated parameter. Note: all numbers here are simply example numbers to demonstrate the scale and challenges of such a project; they are not intended to indicate specific plans or commitments by Google or our partners.

Scanning Factors: quality; cost; speed; scalability; diversity of content.

Google Scanning Technology

Source Problems

Processing

Page Numbering: What is the appropriate page number of this page?
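One simple way to approach the page-numbering question is to vote on the offset between scan position and printed number. The sketch below is only an illustration and not the method used in the project: `infer_page_numbers` is a hypothetical name, and it assumes a single Arabic-numeral sequence with no separately numbered front matter or inserted plates.

```python
from collections import Counter


def infer_page_numbers(detected):
    """Assign a page number to every scan by majority vote.

    detected[i] is the number OCR read on scan i, or None if no number
    was found. OCR misreads are outvoted by the most common offset
    between the printed number and the scan index.
    """
    offsets = Counter(n - i for i, n in enumerate(detected) if n is not None)
    if not offsets:
        return [None] * len(detected)
    offset, _ = offsets.most_common(1)[0]
    # Scans before printed page 1 (covers, title pages) stay unnumbered.
    return [i + offset if i + offset >= 1 else None
            for i in range(len(detected))]
```

For example, `infer_page_numbers([None, None, 1, 2, 8, 4, None, 6])` treats the `8` as a misread and fills the gap, giving `[None, None, 1, 2, 3, 4, 5, 6]`.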

Some “interesting” cases: what to do with Math?
Indeed, it follows from (3.5') in view of (4.39) that (4.40) S k=l fc=i = (1 - a)C - (1 - a) C + From (4.40), � ^(u)x^= � and hence The last inequality means that TTQ 6 SF(f,N). Further, it follows from (4.38) that (4.41) P'{X# > /} = P*{/ -C � N(U) I{ZN f} = P'{I{ZN A} = (1 - a). Finally, we get from (4.38) and (4.41) that (4.42) P{X^>/}=E'IW � > A(l-a) > 1-a. The relations (4.41) and (4.42) show that the condition (4.35) holds for the strategy na, and hence TTQ is an a-((l - a)C, /, A^)-hedge. What has been obtained shows that it is possible to hedge a contingent claim with a specified probability (1 � a). Further, the initial funds can be reduced by the amount a C, though with a risk a the accepted contingent claim cannot be repaid. PROBLEMS 4.1. Prove that on a no- arbitrage (B, 5)-market we have for a standard Euro- pean option to buy (sell) that C(7V2) > C(Ni) (respectively, P(JV2) > P(M)) when 4.2. Prove that the fair price C = C(N, So, K) of a standard European option to buy, where N is the exercise time, So is the initial price of a share, and K is the exercise price, has the following properties: a) C(S0, K) is monotone in So and K; d) C(So, K) is convex in So and K; c) C(\S0,XK) = A C(S0,K) for A > 0.

Spell Correction?
no longer. Den she fall down an' Wolf catch her./ Wolf look up at Br'er Rabbit, cough- in' an' hangin' on de roof, an' 'e laugh and say, " Ah, ha ! ol' fellow, you is hang on still, is you? You done well, but I'll hab you yet. I'll fix you dis night." Well, Rabbit hang on while 'e kin. De smoke da choke um an' de cough da strangle um tel 'e mos' fall; an' at lars' 'e see 'e gots for go. So 'e put 'e ban' in 'e pocket an' full up 'e mout' wid terbacker an' chew um tel 'e mout' full up wid de juice. Den 'e le' go, an' Wolf look up for catch um as 'e fall. But Br'er Rabbit spit out 'e whole mout' f ul o'terbacker juice right in Wolf

Erroneous metadata. Metadata claims the book is Russian; however, 157 pages are in English! Automatically detect languages? Better search; better OCR.
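Per-page language detection of the kind this slide asks about can be approximated by counting common function words. The sketch below is an assumption-laden illustration, not the project's approach: the tiny `STOPWORDS` lists and the function names `guess_language` and `flag_mismatches` are hypothetical, and a real system would use a far larger model.

```python
import re

# Tiny illustrative stopword lists; real detectors use much richer data.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "un"},
    "ru": {"и", "в", "не", "на", "что", "с", "по", "это"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    tokens = re.findall(r"\w+", text.lower())
    counts = {lang: sum(t in words for t in tokens)
              for lang, words in STOPWORDS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def flag_mismatches(pages, claimed):
    """Return indices of pages whose guessed language contradicts metadata."""
    flagged = []
    for i, text in enumerate(pages):
        guess = guess_language(text)
        if guess is not None and guess != claimed:
            flagged.append(i)
    return flagged
```

Pages with no recognizable stopwords (blank pages, plates, tables of numbers) return `None` and are not flagged, which avoids treating them as evidence against the catalog record.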

Spanish with English OCR:
â¬*Mlnv- Toy aleertennof n^m^ritn. qaeva* desTdos eosto 4 mi padre d snstnerlos a tu oniosidad, qae d eseri- birlos. Se" qne cometa ana impradtncia iilirfirir«dn on femenil deseo qne te aearreara modiM dokns; pcro ew- tigo mas quiero pecar de tolerant* qne de wrcro. Pra- fanart COD el secrete la memoria de mi boen padre. mas anadirt qoilates a tu carioo: eatre 1« respeto* de- bidos a. la memoria de on padre nmerlo, j d amor

Page Ordering

Research Challenges, or “Scanning is the easy part”. What will the Digital Library of the future look like? What are some of the research challenges enabled by the massive amounts of content available via Google Book Search? How can Google act as a catalyst and enable some of this research through leveraged investment? How do you balance the rights of the copyright holder with the public interest?

Web of Off-line Content. How do you create an ontology of objects from the off-line world, with the myriad of links that connect these objects? Some relationships that exist: FRBR hierarchy (work, manifestation, hierarchy); references; authorship; criticism and review; inclusion; individuals; events; temporal relationships; different perspectives; topical similarity. E.g., how to record the supporting evidence for the research supporting Ismail’s presentation last night?

Finding Stuff: How do I find what I am looking for? Search: rich, explicit network structure does not exist; how to allow the user to be an active participant? Browsing content: what are the different dimensions of similarity between books, and how do you allow people to navigate this space effectively? Serendipitous discovery: how do you enable people to discover stuff they were not looking for? Assistance: how can one user’s effort help the experience of another user?

Interesting problems: What is the granularity of documents for indexing? OCR: pretty good, but not perfect. Internationalization. Old texts, manuscripts, marginal scribbles. Page number detection. Page ordering. Metadata and book equivalence. Segmentation and disaggregation. Book summarization. Other types of content and how to integrate them.

Some Discussion Topics: Role of Library in the Future; Access for everyone everywhere; Problems with current institutional subscription models; Publishing and User Generated Content; How can Google help support research?; Role of Private Companies.

Questions?