1 Automated Digital Libraries William Y. Arms Department of Computer Science Cornell University.

Presentation transcript:

1 Automated Digital Libraries William Y. Arms Department of Computer Science Cornell University

2 Two Questions

3 Before Digital Libraries Access to scientific, medical, legal information In the United States: -- excellent if you belonged to a rich organization (e.g., a major university) -- very poor otherwise In many countries of the world: -- very poor for everybody

4 Question 1 Must access to scientific and professional information be expensive?

5 Research Libraries are Expensive library materials buildings & facilities staff

6 The Potential of Digital Libraries materials open access computers & networks staff ?

7 Question 2 How effectively can computers be used for the skilled tasks of professional librarianship? -- Time horizon: 5 to 20 years -- All materials in digital form

8 Automated Library Services

9 Skilled Librarianship People are skilled at judgment, understanding, discrimination, etc.: -- selection -- cataloguing, indexing -- seeking for information -- evaluating information -- reference service Can computers provide equivalent services?

10 Equivalent Services Example: Cataloguing rules -- Application of cataloguing rules to monographs is skilled -- It is hard to imagine a computer system with these skills but Catalogs and cataloguing rules are the means not the end

11 Equivalent Services Information discovery Why are web search services the most widely used information discovery tools in universities today?

12 Conventional Criteria Web search services have many weaknesses -- selection is arbitrary -- index records are crude -- no authority control -- duplicate detection is weak -- search precision is deplorable yet they clearly satisfy important requirements...

13 Effectiveness of Web Search Inspec v. Google Google is usually superior for general computing and computer science questions > Broader coverage > Adequate indexing records > Better ranking

14 Simple Algorithms + Immense Computing Power

15 History: Licklider J. C. R. Licklider, in Libraries of the Future (1965), envisaged digital libraries for scientists at their place of work -- listed desiderata for a digital library -- studied construction of fully automated digital libraries -- put emphasis on artificial intelligence and natural language processing

16 History: Licklider Licklider's predictions for digital libraries were remarkably good, but he -- was overoptimistic about progress in artificial intelligence -- underestimated what can be done by brute force computing

17 Brute Force Computing Few people can appreciate the power of Moore's Law -- Computing power doubles every 18 months -- Increases about 100 times in 10 years -- Increases about 10,000 times in 20 years Simple algorithms + immense computing power may outperform human intelligence
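
The figures on this slide follow from compounding the 18-month doubling period. A minimal sketch of the arithmetic; the function name `growth_factor` is illustrative and not from the talk:

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Factor by which capacity grows if it doubles every `doubling_months`."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


if __name__ == "__main__":
    for years in (10, 20):
        # Roughly 100x after 10 years and 10,000x after 20 years.
        print(f"{years} years: about {growth_factor(years):,.0f} times")
```

Ten years is about 6.7 doublings and twenty years about 13.3, which is where the round figures of 100 and 10,000 come from.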

18 Brute Force Computing Example Creators of the world champion chess program (Deep Thought, later Deep Blue) -- moderate chess players -- simple tree-search algorithm -- very, very fast computer hardware
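
To illustrate what a "simple tree-search algorithm" looks like, here is a sketch of exhaustive game-tree search on a toy game. It is not the chess program's actual algorithm. The game is an assumption chosen for brevity: players alternately remove 1 to 3 stones, and whoever takes the last stone wins. The names `can_win` and `best_move` are hypothetical.

```python
from functools import lru_cache
from typing import Optional

MOVES = (1, 2, 3)  # assumption: a player may remove 1, 2 or 3 stones


@lru_cache(maxsize=None)
def can_win(stones: int) -> bool:
    """True if the player to move can force a win from this position."""
    # Try every legal move. The position is won if some move leaves the
    # opponent in a lost position. With 0 stones there are no moves, so
    # the player to move has already lost.
    return any(not can_win(stones - m) for m in MOVES if m <= stones)


def best_move(stones: int) -> Optional[int]:
    """Return a move that forces a win, or None if every move loses."""
    for m in MOVES:
        if m <= stones and not can_win(stones - m):
            return m
    return None


if __name__ == "__main__":
    for pile in range(1, 10):
        print(pile, can_win(pile), best_move(pile))
```

The search has no game knowledge beyond the rules. It simply examines every continuation, which is the point of the slide: skill comes from the volume of computation, not from the sophistication of the method.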

19 An Anecdote The question (Marvin Minsky) -- How would you design a computer system that can answer questions such as, "Why was the space station a bad idea?"? The answer (Danny Hillis) -- Design much more powerful computers!

20 Examples of Automated Digital Library Services

21 Web Search Brute force indexing and retrieval -- retrieve every page on the web -- index every word -- repeat every month Getting better all the time -- improved algorithms -- faster computers and networks -- analysis of users
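
The "index every word" step can be sketched as an inverted index: a map from each word to the set of pages that contain it. This is a minimal illustration, not a description of any real search service. The page URLs and texts are invented, and `build_index` and `search` are hypothetical names.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map every word to the set of page identifiers that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in WORD.findall(text.lower()):
            index[word].add(url)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the pages that contain every word of the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Invented sample data standing in for crawled pages.
PAGES = {
    "a.example/1": "Digital libraries and web search",
    "a.example/2": "Web search engines index every word",
    "a.example/3": "Libraries of the future",
}

if __name__ == "__main__":
    idx = build_index(PAGES)
    print(sorted(search(idx, "web search")))
```

Rebuilding the index from scratch on each crawl, as the slide describes, avoids any incremental-update logic at the price of more computation.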

22 Web Search Ranking algorithms Closeness of match -- vector space and statistical methods (Salton, et al., c. 1970) Importance of digital object -- Google ranks web pages by how many other pages link to them, giving greater weight to links from higher ranking pages. (NSF/DARPA/NASA Digital Libraries Initiative)
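
Both ranking ideas on this slide can be sketched in a few lines. The first function measures closeness of match as the cosine between raw term-count vectors, a simplification of the vector space methods. The second is a simplified link-based ranking in the spirit of the slide's description, not Google's production algorithm. The damping factor of 0.85, the iteration count, the sample link graph and the names `cosine` and `link_rank` are all assumptions for illustration.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between raw term-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def link_rank(links: dict[str, list[str]],
              damping: float = 0.85,
              iterations: int = 50) -> dict[str, float]:
    """Rank pages by incoming links, weighting links from high-ranked pages.

    Assumes every link target also appears as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page shares its rank equally among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank


# Invented link graph: A, B and D link to C; C links back to A.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}

if __name__ == "__main__":
    ranks = link_rank(LINKS)
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(page, round(ranks[page], 3))
```

In the sample graph, C ranks highest because three pages link to it, and A outranks B and D because its single incoming link comes from the highly ranked C.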

23 Archiving and Preservation Internet Archive -- Monthly, web crawler gathers every open access web page with associated images -- Web pages are preserved for future generations -- Files are available for scholarly research But not perfect -- HTML pages and images only; no Java applets or style sheets -- materials are dumped with no organization or indexing -- access for scholars is rudimentary

24 Reference Linking Web of Science (ISI) -- input: combination of automatic means, skilled people -- limited number of journals -- very expensive ResearchIndex (a.k.a. CiteSeer, a.k.a. ScienceIndex) (NEC) -- fully automatic -- all open access material in computer science -- a free service

25 Beyond Text Informedia (Carnegie Mellon) Automatic processing of segments of video, e.g., television news. Algorithms for: -- dividing raw video into discrete items -- generating short summaries -- indexing the sound track using speech recognition -- recognizing faces -- searching using natural language processing (NSF/DARPA/NASA Digital Libraries Initiative)

26 Costs and Benefits

27 Costs of Catalogs and Indexes Catalog, index and abstracting records are very expensive when created by skilled professionals -- only available for certain categories of material (e.g., monographs, scientific journals) -- contain limited fields of information (e.g., no contents page) -- restricted to static information High costs reduce effectiveness and access

28 Costs of Automated Digital Libraries The Google company -- ... million searches daily -- ... people (half technical, 14 with Ph.D. in computing) -- 2,500 PCs running Linux, with 80 terabytes of disk The Internet Archive -- 7 people with support from Alexa (March 2000)

29 Overall If you are rich Research libraries, using commercial information services, provide excellent service at very high cost to a favored few -- Automated digital libraries are a long way from providing the personal reference service available to a faculty member at a well-endowed university but...

30 The Model T Library The Model T Ford, with mass production, brought car travel to the masses Automated digital libraries, with open access materials, can already provide good service at low cost -- In the future automated digital libraries can bring scientific, scholarly, medical and legal information to everybody at negligible cost

31 A Footnote

32 Library Expertise The future of scientific and professional information is tied to computing, but automated digital libraries need small teams of highly skilled people -- development of automated digital libraries is bypassing libraries (Google, ResearchIndex, Informedia, Internet Archive) The level of computing expertise in U.S. research libraries is depressingly low

33 Further reading -- William Y. Arms, "Automated digital libraries." To be submitted to D-Lib Magazine, July/August -- William Y. Arms, "Economic models for open-access publishing." iMP, March

34 Automated Digital Libraries William Y. Arms Department of Computer Science Cornell University