HONG KONG UNIVERSITY JANUARY 2003 © 2003 MICHAEL I. SHAMOS The Million Book Project Michael I. Shamos, Ph.D., J.D. Director, Universal Library School of.

Presentation transcript:

HONG KONG UNIVERSITY JANUARY 2003 © 2003 MICHAEL I. SHAMOS The Million Book Project Michael I. Shamos, Ph.D., J.D. Director, Universal Library School of Computer Science Carnegie Mellon University Pittsburgh, Pennsylvania, USA

Where is Pittsburgh?

The Universal Library Project of Carnegie Mellon University
All published works of mankind digitized and online
Instantly available
Free to read
In any language
Anywhere in the world
Searchable and browsable by humans and machines
DEMO

Why Digitize?
Books are inefficient carriers of information
Heavy, expensive
Environmentally harmful
Linear, not hyperlinked
Poorly indexed
Not searchable
Not easily transported
MOST IMPORTANT: not everyone has every book
IN FACT, no one has every book

How Do We Convey Information?
Books
Orally
Observation
Teaching (a combination of the above)
The book is
–Information
–AND a physical carrier
The information can be conveyed digitally
We don’t CARE about the carrier

Objections to Digital Books
People can’t read books from a screen
Books are convenient
–You can carry them
–You can write in them
–You can put a place marker in them
–You can lend them to people
Books are beautiful
Books smell nice

How Many Books Are There?
1996 world published output: 800,000 books
Total book titles ever published ~ 100M
1 book = 500 pp., 2000 char/page = 1 megabyte uncompressed (about 1 floppy disk)
–$10^8$ books = $10^{14}$ bytes = 100 terabytes
–Disk costs HK$10 per gigabyte
–100 terabytes costs about HK$1 million
Total books in WorldCat = 41,000,000
–Requires only 41 terabytes, HK$410,000
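The storage estimate on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the page, character, and price figures come from the slide, while the function name `storage_estimate` and the use of decimal units (1 terabyte = 10^12 bytes) are assumptions made for illustration.

```python
# Back-of-the-envelope storage estimate using the slide's figures.
PAGES_PER_BOOK = 500
CHARS_PER_PAGE = 2000        # one byte per character, uncompressed
HKD_PER_GIGABYTE = 10        # disk price quoted on the slide


def storage_estimate(num_books):
    """Return (terabytes, cost in HK$) for num_books, assuming decimal units."""
    bytes_per_book = PAGES_PER_BOOK * CHARS_PER_PAGE  # 1,000,000 bytes = 1 MB
    total_bytes = num_books * bytes_per_book
    terabytes = total_bytes / 10**12
    cost_hkd = total_bytes / 10**9 * HKD_PER_GIGABYTE
    return terabytes, cost_hkd


for label, n in [("all titles ever published", 10**8), ("WorldCat", 41_000_000)]:
    tb, cost = storage_estimate(n)
    print(f"{label}: {tb:.0f} TB, about HK${cost:,.0f}")
```

The two cases reproduce the slide's 100 terabytes for HK$1 million and 41 terabytes for HK$410,000.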

We Can Store Everything
100 terabytes can store:
3,000,000,000 photographs (compressed)
100,000,000 books
10,000 movies
300 years of music
100 terabytes occupies 240 cubic feet on DVD = 1 van (6 x 4 x 10 feet)

We Can Send Everything
Human speech: 30 bits/sec
Gigabit Internet: 1,000,000,000 bits/sec
(This talk: < 1 millisecond including slides)
Feb Fujitsu achieved 5 terabits per second on one optical fiber
100 terabytes = 800 terabits
It would take less than 3 minutes to transmit every book ever published
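The transmission claim follows from dividing the collection size by the link rate. The sketch below redoes that division; the rates and the 100-terabyte figure are from the slide, while the function name `transfer_seconds` and the decimal-unit convention are assumptions.

```python
# Transfer-time arithmetic for the slide's 100-terabyte collection.
TERABYTES = 100
BITS_PER_BYTE = 8
FIBER_BITS_PER_SEC = 5 * 10**12   # 5 terabits/sec on one optical fiber
GIGABIT_BITS_PER_SEC = 10**9      # gigabit Internet


def transfer_seconds(terabytes, bits_per_sec):
    """Seconds needed to send `terabytes` (decimal) over a link of the given rate."""
    total_bits = terabytes * 10**12 * BITS_PER_BYTE
    return total_bits / bits_per_sec


fiber = transfer_seconds(TERABYTES, FIBER_BITS_PER_SEC)
gigabit = transfer_seconds(TERABYTES, GIGABIT_BITS_PER_SEC)
print(f"5 Tbit/s fiber: {fiber:.0f} s ({fiber / 60:.1f} min)")
print(f"1 Gbit/s link:  {gigabit / 86400:.1f} days")
```

At 5 terabits per second the 800 terabits take 160 seconds, consistent with the "less than 3 minutes" on the slide; the same transfer over a single gigabit link would take days.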

Why a Universal Library?
The largest library in the world (U.S. Library of Congress) has less than 20% of all books
–Two hours to retrieve one book
–Must travel to Washington, DC
–No copying allowed
Largest university library: 14 million (Harvard)
Hong Kong University: 3 million
Typical large U.S. university: 1 million
Largest high school: 130,000 (Phillips Andover)
Largest public high schools: 30,000 (U.S.)
Average high school: 5,000 (U.S.)

Universal Library Goals
Democratization of information
–Knowledge is power
Education, distance learning
–“Library” for distance education
Research, technology transfer
Promotion of understanding
Preservation of human culture

The Million Book Project
A million books is a lot. CMU just reached 1 million.
Idea: scan 1 million books in each of several countries. Make them available to everyone
NSF provided $3 million to buy scanners for China and India
China and India are each providing 500 full-time people for scanning
Each country is scanning 1 million books over the next 3 years
CMU is hosting, indexing, building infrastructure

Million Book Project Operation


Effect of the Million Book Project
All books scanned (in many languages) will be available free to read to everyone over the Internet
Many cultural artifacts and treasures are being scanned
All works are fully keyword-indexed and searchable
All participating countries will have complete copies (mirrors) of all content
Knowledge will be available to all

Partners
China
–Beijing University
–Chinese Academy of Science
–Fudan University
–Ministry of Education of China
–Nanjing University
–Shanghai Jiaotung University
–State Planning Commission of China
–Tsinghua University
–Zhejiang University

Partners
India
–Arulmigu Kalasalingam College of Engineering
–Goa University
–Indian Institute of Information Technology - Allahabad
–Indian Institute of Science
–International Institute of Information Technology - Hyderabad
–Shanmugha Arts, Science, Technology & Research Academy
–Tirumala Tirupati Devasthanams
–Maharashtra Industrial Development Corporation
–University of Pune

The Copyright Problem
Compulsory License
–Owner CAN’T refuse; user MUST pay
–Limited in US (Music: 1.55¢/min, 8.0¢/song)
–Extensive compulsory licensing in Japan
Flat-fee subscription (e.g. HBO)
Free (subsidized by government)
Public Lending Right (UK)
“Buy” button
Metered use (electric company)
Micropayments

Roadblocks
Biggest obstacle: librarians
Belief that the project is too large
No funding
–In the U.S., everyone assumes it is being done
–Outside the U.S., everyone assumes the U.S. is doing it
Copyright
Myriad of small independent digital libraries

Policy Challenges
Convenience displaces quality (Gresham)
What to digitize first?
Suitable copyright law
Economics (Who pays? Who gets?)
Privacy
Reliability of information
Change in the nature of teaching, learning

Layered UL Model
[Diagram. Labels, in the order they appear: Universal Library: digitized items; navigation tools; retriever service; custom catalogs; hypertext generators; searchers; translators; news agents; human users; direct machine users; encyclopedia; value-added services; baseline UL services]

The Universal Dictionary
A glossary containing every word in every language, with a translation
Use: indexing the Universal Library
Now has 1 million words (26 languages)
2 million by February (50 languages)
3 million by May (80 languages)

Q & A

Multilingual Searching
Find all documents containing “elephant”
Find all documents about elephants
–Even if the word “elephant” does not occur in the document
Translation, transliteration
–Book titles, works of art, proper names
–Idioms, colloquial phrases

Use of © Content
Philosophy: must pay for use
–Authors, publishers must not lose
Implied license
Bulk licensing
Compulsory licensing

The Universal Dictionary
Lexicon of all words in all languages, with English translations, e.g.
Obtained from
–Web dictionaries
–Scanning + OCR
–Publishers machine-readable form
Uses:
–Indexing the Universal Library
–Machine translation
–Spelling correction
–Linguistic studies

Technological Challenges
Input (scanning, digitizing, OCR)
Data representation
–text, kset, notations, images, web pages
Navigation and Search
Multilingual Issues
Output (voice, pictures, virtual reality)
Synthetic documents

Navigation
Keyword searching does not scale
–Imagine $10^6$ hits
Browsing, finding, searching, flying
Fractal view
–Keys are granularity and connectivity
View whole collections or one glyph
–Hyperbolic trees, virtual reality, discovered similarities

Hyperbolic Tree Navigation

Multilingual Issues
Character sets
Representations
Íîäà ôèçè÷åñêè íàõîäèòñÿ â çäàíèè Èçâåñòèé
Нода физически находится в здании Известий
Multilingual navigation
Translation assistance

UNIVERSAL LIBRARY STATUS
>10,000 digital volumes
Public-domain issues of the New York Times
Portal to hundreds of other collections
Art, music, video, Internet radio
Magazines, newspapers, journals
Installing 1.25 terabytes
Visit

Language Identification
Given a string x, which language(s) is it from?
–What language is “peogwir” from?
Given x, which language(s) does it seem to be from?
–“contrefaçon” “dazs” “chalupa” “mbwewe”
Character set may be unknown
Brief input (e.g. single word)
Intermixed languages
–“Zeitgeist Fever”
Neologisms, slang, abbreviations

Generative Approach
Assume that the lexicon of a language L is generated by a probabilistic finite-state machine $M_L$
[Diagram: state machine over the characters a to z, with < marking the start of a word and > the end of a word. The first edges carry the probability that a word starts with a or with z; later edges carry conditional probabilities such as PROB(a|<a), PROB(>|<a), PROB(a|<z), PROB(z|<z), PROB(z|<za). The product of the probabilities along a path is the probability of the word it spells.]

Problems
Where do all the required probabilities come from?
How can they all be stored?
If string x does not actually occur in a language, its probability will be zero.
Won’t work for neologisms or misspellings.
“Moving trigrams” work

Generative Approach
Let $p_L(y \mid x)$ be the probability that string x is followed by string y in language L (i.e., the probability that, given a prefix x, the suffix is y)
Then $p_L(x)$, the probability that $x = {<}x_1 x_2 \ldots x_n{>}$ was generated by L, is
$p_L(x_1 \mid {<}) \, p_L(x_2 \mid {<}x_1) \cdots p_L({>} \mid {<}x_1 x_2 x_3 \ldots x_{n-1} x_n)$
This computation requires huge memory, so approximate:
Assume $p_L(x_n \mid {<}x_1 x_2 x_3 \ldots x_{n-1}) \approx p_L(x_n \mid x_{n-2} x_{n-1})$
So $p_L(x) \approx p_L(x_3 \mid x_1 x_2) \cdots p_L({>} \mid x_{n-1} x_n)$
Try it
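The trigram approximation can be sketched as a small character model per language. This is an illustration under stated assumptions, not the project's implementation: the class `TrigramModel`, the function `identify`, the add-one smoothing with an assumed alphabet size of 64, and the tiny word lists are all hypothetical. The sketch also treats the boundary marker < as an ordinary context character, so the product starts at the second character rather than the third.

```python
import math
from collections import Counter


def trigrams(word):
    """Yield (two-character context, next character) pairs; < and > mark word boundaries."""
    padded = "<" + word.lower() + ">"
    for i in range(2, len(padded)):
        yield padded[i - 2:i], padded[i]


class TrigramModel:
    """Character trigram model estimated from a word list (hypothetical helper)."""

    def __init__(self, words, alphabet_size=64):
        self.tri = Counter()                 # counts of (context, next char)
        self.ctx = Counter()                 # counts of each two-char context
        self.alphabet_size = alphabet_size   # assumed constant for smoothing
        for w in words:
            for context, ch in trigrams(w):
                self.tri[(context, ch)] += 1
                self.ctx[context] += 1

    def log_prob(self, word):
        """Sum of log p(next char | previous two chars) over the padded word."""
        total = 0.0
        for context, ch in trigrams(word):
            # Add-one smoothing keeps an unseen trigram from zeroing the product.
            num = self.tri[(context, ch)] + 1
            den = self.ctx[context] + self.alphabet_size
            total += math.log(num / den)
        return total


def identify(word, models):
    """Return the language whose model gives the word the highest probability."""
    return max(models, key=lambda lang: models[lang].log_prob(word))


# Toy training data, for illustration only.
models = {
    "english": TrigramModel(["the", "and", "that", "with", "this", "which",
                             "there", "thing", "weather", "other"]),
    "german": TrigramModel(["der", "und", "nicht", "sich", "auch", "schon",
                            "durch", "zeit", "geist", "zwischen"]),
}
print(identify("zeitgeist", models))
print(identify("the", models))
```

Because of the smoothing, a string that never occurs in the training words still gets a nonzero score, which is what lets this kind of model rank neologisms and misspellings.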

Searching Mathematics
Has this integral ever been evaluated?
$\int_0^\infty e^{-x^2} \sin(x^2) \, dx$

Searching Mathematics
MATHEMATICA C.F.:
Integrate[Times[Power[E, Times[-1, Power[V1, 2]]], Sin[Power[V1, 2]]], {V1, 0, Infinity}]
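One reason to search on a canonical form is that two people may write the same integral with different variable names. The sketch below illustrates that idea only; it is not Mathematica's algorithm. Expressions are modeled as nested tuples, the function `canonicalize` is hypothetical, and the rule that a variable is any single lowercase letter is an assumption.

```python
def canonicalize(expr, names=None):
    """Rename variables to V1, V2, ... in order of first appearance."""
    if names is None:
        names = {}
    if isinstance(expr, tuple):
        head, *args = expr
        return (head, *[canonicalize(a, names) for a in args])
    # Assumption: single lowercase letters are variables; E, Infinity, and
    # numbers are constants and are left untouched.
    if isinstance(expr, str) and len(expr) == 1 and expr.islower():
        return names.setdefault(expr, f"V{len(names) + 1}")
    return expr


def gaussian_sine_integral(var):
    """Build the slide's integrand with an arbitrary variable name."""
    return ("Integrate",
            ("Times",
             ("Power", "E", ("Times", -1, ("Power", var, 2))),
             ("Sin", ("Power", var, 2))),
            ("List", var, 0, "Infinity"))


# The same integral written with x and with t maps to one search key.
print(canonicalize(gaussian_sine_integral("x")) == canonicalize(gaussian_sine_integral("t")))
```

A fuller canonical form would also have to order the arguments of commutative heads such as Times, which this sketch does not attempt.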

Hierarchical Nature of Aboutness
What does it mean to say that a book is “about” chemistry?
Can a word be about chemistry?
If one paragraph is about chemistry, is the book about chemistry?
If the book is about chemistry, is every sentence in it about chemistry?
Aboutness is central to cataloging and retrieval

Aboutness Hierarchy
[Diagram. Nodes, in the order they appear: Universe, Word, Sentence, Paragraph, Section, Chapter, Collection, Book, Newspaper, Article, Photograph, Object, 3D Artifact, Glyph. Annotations: "keyword searching occurs here" and "subject searching occurs here".]

Thesauri and Aboutness
A set of numbered thesaurus entries defines a topic
Thesaurus is topic-hierarchical
1011 Hindrance – barrier, bar, gate, fence, wall, rampart, dam, moat …
A word is “about” any topic to which it belongs
Dam:
–241.1 lake
–293.7 close (v.)
– mother
–757.2 horse
– put a stop to (v.)
– barrier
Thesaurus + aboutness hierarchy can be used to disambiguate meanings without “understanding”
Note: topic numbers are language independent
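One way such disambiguation could work is to let the surrounding words vote on topic categories. The sketch below is an assumption about the mechanism, not the method used in the project: the three topic numbers listed for "dam" are from the slide, while every other thesaurus entry, the function names, and the rule that "241.1" belongs to parent category "241" are hypothetical.

```python
from collections import Counter

# Topic numbers for "dam" are taken from the slide; all other entries are
# hypothetical and exist only to make the example run.
THESAURUS = {
    "dam": {"241.1", "293.7", "757.2"},
    "reservoir": {"241.1"},
    "shore": {"241.3"},
    "mare": {"757.2"},
    "foal": {"757.4"},
}


def category(topic):
    """Coarsen a topic number like '241.1' to its assumed parent category '241'."""
    return topic.split(".")[0]


def disambiguate(word, context):
    """Pick the sense of `word` whose category is shared by the most context words."""
    votes = Counter()
    for other in context:
        for cat in {category(t) for t in THESAURUS.get(other, ())}:
            votes[cat] += 1
    senses = sorted(THESAURUS[word])
    return max(senses, key=lambda t: votes[category(t)])


print(disambiguate("dam", ["reservoir", "shore"]))
print(disambiguate("dam", ["mare", "foal"]))
```

No meaning is modeled anywhere: the choice falls out of set membership alone, and because only topic numbers are compared, the same procedure would apply to a thesaurus in another language.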

Set Theory of Aboutness
Given a finite universe W of objects (e.g. all words)
Define a topic $T \subseteq W$ to be a subset of W (a wordlist)
Topic inclusion (defines the hierarchy):
–Topic T includes topic S iff $S \subseteq T$
Definition of aboutness:
–A subset $P \subseteq W$ of the universe (e.g., a book) is about topic T iff $P \cap T \neq \emptyset$ (intersection is nonempty)
Hierarchical nature of aboutness:
–If P is about S and T includes S, then P is also about T
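These definitions map directly onto set operations. The sketch below is a minimal illustration; the function names `includes` and `is_about` and the word lists are hypothetical.

```python
def includes(t, s):
    """Topic t includes topic s iff s is a subset of t."""
    return s <= t


def is_about(p, t):
    """Object p (a set of words) is about topic t iff the intersection is nonempty."""
    return bool(p & t)


# Hypothetical word lists, for illustration only.
chemistry = frozenset({"acid", "molecule", "reaction", "titration"})
science = chemistry | frozenset({"gravity", "cell", "theorem"})
paragraph = frozenset({"the", "titration", "was", "slow"})

print(includes(science, chemistry))    # science includes chemistry
print(is_about(paragraph, chemistry))  # shares the word "titration"
print(is_about(paragraph, science))    # follows from the hierarchy property
```

The hierarchy property holds by construction: any word that P shares with S is also in T whenever S is a subset of T, so P's intersection with T cannot be empty.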

We Can Search a Few Things
Text
In the Roman alphabet
“Hidden” databases effectively unsearchable
No images or two-dimensional structures
–math
–music
–dance notation...
No subject index of photographs or art
–Corbis is one of the “best”