LIS618lecture 2 history Thomas Krichel 2011-09-19.

Slides:



Advertisements
Similar presentations
Using CAB Abstracts to Search for Articles. Objectives Learn what CAB Abstracts is Know the main features of CAB Abstracts Learn how to conduct searches.
Advertisements

Chapter 5: Introduction to Information Retrieval
INSTRUCTOR: DR.NICK EVANGELOPOULOS PRESENTED BY: QIUXIA WU CHAPTER 2 Information retrieval DSCI 5240.
Project Proposal.
Lecture №2 State System of Scientific and Technical Information.
Search Engines and Information Retrieval
Intelligent Information Retrieval CS 336 Lisa Ballesteros Spring 2006.
1 CS 430 / INFO 430 Information Retrieval Lecture 15 Usability 3.
Introduction to Library Research Gabriela Scherrer Reference Librarian for English Languages and Literatures, University Library of Bern.
ISP 433/533 Week 8 IR in libraries. Goal Universal Access to Information Vannevar Bush 1945 article Memex A memex is a device in which an individual stores.
Introduction to Library Research Gabriela Scherrer Reference Librarian for English Languages and Literatures, University Library of Bern.
Information Retrieval in Practice
INFO 624 Week 3 Retrieval System Evaluation
Literature Review Week 3 Lecture 1. School of Information Technologies Faculty of Science, College of Sciences and Technology The University of Sydney.
1 CS 430: Information Discovery Lecture 2 Introduction to Text Based Information Retrieval.
Information Retrieval
PSYC512: Research Methods PSYC512: Research Methods Lecture 2 Brian P. Dyre University of Idaho.
Copyright © Allyn & Bacon (2007) Conducting Library Research Graziano and Raulin Research Methods: Appendix C This multimedia product and its contents.
Library and Information Center of the University of Crete
Indexing Overview Approaches to indexing Automatic indexing Information extraction.
OER Case Study TJTS569 Advanced Topics in Global Information Systems Savenkova Iuliia.
PubMed/How to Search, Display, Download & (module 4.1)
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
Search Engines and Information Retrieval Chapter 1.
1 DATABASES By: Hanna Ben-Or Phone: October 2011.
Library Resources Barbara Dorward November Previous session  Catalogues  Library resources  Finding information on the web  Evaluation of information.
1 Introduction to Library Databases Basic Searching.
LIS654lecture 1 Introduction Thomas Krichel
LIS510 lecture 3 Thomas Krichel information storage & retrieval this area is now more know as information retrieval when I dealt with it I.
LIS618 lecture 1 Thomas Krichel economic rational for traditional model In olden days the cost of telecommunication was high. database use.
Copyright © Allyn & Bacon 2008 Locating and Reviewing Related Literature Chapter 3 This multimedia product and its contents are protected under copyright.
Chapter 3 Copyright © Allyn & Bacon 2008 Locating and Reviewing Related Literature This multimedia product and its contents are protected under copyright.
Query Expansion By: Sean McGettrick. What is Query Expansion? Query Expansion is the term given when a search engine adding search terms to a user’s weighted.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
LIS654 lecture 1 Introduction Thomas Krichel
Search Engine Architecture
LIS618 lecture 0 Thomas Krichel Organization homepage Contents to be discussed today. Send mail.
LOGO A comparison of two web-based document management systems ShaoxinYu Columbia University March 31, 2009.
Intellectual Works and their Manifestations Representation of Information Objects IR Systems & Information objects Spring January, 2006 Bharat.
Basics of Information Retrieval and Query Formulation Bekele Negeri Duresa Nuclear Information Specialist.
Introduction to Information Retrieval Example of information need in the context of the world wide web: “Find all documents containing information on computer.
Digital Libraries1 David Rashty. Digital Libraries2 “A library is an arsenal of liberty” Anonymous.
Information Retrieval
Web Information Retrieval Prof. Alessandro Agostini 1 Context in Web Search Steve Lawrence Speaker: Antonella Delmestri IEEE Data Engineering Bulletin.
Chapter. 3: Retrieval Evaluation 1/2/2016Dr. Almetwally Mostafa 1.
A System for Automatic Personalized Tracking of Scientific Literature on the Web Tzachi Perlstein Yael Nir.
1 CS 430: Information Discovery Lecture 5 Ranking.
Search and Retrieval: Finding Out About Prof. Marti Hearst SIMS 202, Lecture 18.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
A Self-organizing Semantic Map for Information Retrieval Xia Lin, Dagobert Soergel, Gary Marchionini presented by Yi-Ting.
Presented By: Carlton Northern and Jeffrey Shipman The Anatomy of a Large-Scale Hyper-Textural Web Search Engine By Lawrence Page and Sergey Brin (1998)
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
The Sci-tech Information Industry: Changes, Challenges and the CAS Response Presented by Michael Walsh Manager of International Marketing Operations Chemical.
Information Retrieval in Practice
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
What is Information Retrieval (IR)?
Using computers to search electronic databases
Information Retrieval and Web Search
Search Engine Architecture
CS 430: Information Discovery
Information Retrieval and Web Search
Multimedia Information Retrieval
An Efficient method to recommend research papers and highly influential authors. VIRAJITHA KARNATAPU.
WIRED Week 2 Syllabus Update Readings Overview.
CS 430: Information Discovery
DATABASES By: Hanna Ben-Or Phone:
Introduction to Information Retrieval
Search Engine Architecture
Analyzing and Organizing Information
Presentation transcript:

LIS618lecture 2 history Thomas Krichel

contents based on a very fine paper by Michael Lesk “The Seven Ages of Information Retrieval”. That paper was written in 1997, so it does not cover recent advances.

general problem The general approach here is a computer science approach. How to build systems that will allow people to obtain information. This can be distinguished from the approach more positivistic approach to observe how people actually deal with obtaining information.

general overview In the last decades, no industry has been subjected to more technology change than any other industry. Nevertheless, there are two other factors of note – the political and economic environment – research and theoretical advances

the broad picture Through the life of information retrieval there has been a constant struggle about how to prepare the system by intellectual or statistical methods. Intellectual methods require substantial human input. Despite that, they have not been completely been outruled.

statistical methods When we use statistical methods, we interrogate a text for tokens of potential semantic significance, e.g. words. (details in next lecture). We look at how much they occur in documents. It’s a brute force job. We store a representation of this count in a computer.

information analysis method In this approach we have humans analyze the contents of documents to make judgments about them. This analysis predates computers. Early computer-based systems had to use this because the computers were not capable. In modern systems, such as web information retrieval, this is making a comeback.

1945 As We May Think This is very famous (but not a good IMHO) paper by Vannevar Bush. Bush envisages the memex. It stores the human knowledge in a desktop device. It lets users mange commented associations between its stored items. The essay pays scant attention to information retrieval (bar condemning existing approaches).

1950s The USSR launches the sputnik. The US is worried – need to improve the organization of science – need to understand what the Russians are doing learn Russian work on machine translation Suddenly it seems like a good idea to go back to something like a wartime science effort.

major problems There were about 100 computers in the United States. Computers operated in a batch-processing, rather than in an interactive mode. Supply of machine-readable text was very small.

special hardware There were some experiments with special hardware – Edge-notched cards by Calvin Moors – WRU Searching Selector of Allen Kent These were abandoned as the digital technology progressed.

Hans Peter Luhn (1896—1964) In 1957 he was the first to propose extracting information from text by automated means. – “It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements.”

KWIC Luhn also worked on KWIC (key word in context) indexing. The idea was that information retrieval should not only be based on words. Example the idea was that Information Retrieval should worked on KWIC key word in Hans Peter Luhn was the firs

1960 This is when online IR really started. There were the first systems being built, with heavy subsidies. Some of today’s available scientific IR systems can be trace to precursors built at that time.

fuller descriptions As computers become more powerful, more and more descriptors could be taken from texts to make them findable in a retrieval systems. Taking more descriptors, however meant more work for the people involved in finding the descriptive terms.

full-text indexing During the 60s people started to look at the situation where all words would be used as descriptors for the document. Such a strategy is – less work for the human – more work for the computer

IR performance criteria In order to compare full-text and selective indexing, criteria had to be found and experiments to be conducted. Cyril Cleverdon (1914—1997) developed criteria of precision and recall. He and his disciples worked an a standard document set that could be used for testing, about 14k abstract on aeronautics.

and they found They found that full-text indexing was superior to human-aided indexing. This was very controversial at the time. All catalogers got very upset!

new retrieval techniques Having an experimental dataset that was shared by researchers was key to developing experiments that were comparable. One idea was “relevance feedback”. The idea was the user would select relevant documents in a fist step, and then the term from these documents would be added to the query in a second step.

NASA ReCON Name stand for “remote console”. It is thought to be the first multi-site bibliographic online information system for scientific literature. First large scale information systems – online – interactive – distributed

NASA ReCon It dealt with citation strings. It searched for keywords manually extracted from text. It allowed fielded searches. It allowed combinations of fields related and combinations are results received.

evolution from testbeds to services ReCON was developed by Lockheed in It become a testbed at Lockheed using the ERIC database in It became the DIALOG online system in That system was the largest online general retrieval system, mainly for bibliographic data for a long time.

specialized engines COLEX was a system developed in 1965 for the Air Force by Systems Development Corporation. It become SDC’s ORBIT system. The National Library of Medicine used that system to build the MEDLARS system that would give online access to medical abstracts as early an 1991.

increase of available text The development of computer-aided typesetting, and later word processing made for a lot more text available online than before. Among the first were the larger abstracting database such as Chemical abstracts. Later newspapers followed.

other services In 1973 the Lexis system of US court records was the first large scale full text database. OCLC got tapes from LoC and printed catalog cards. Later they started shared cataloging, because the LoC only cataloged about 2/3 of what the member libraries had.

research Research into IR prototype systems actually declined in the US. NSF shifted away research funds to the development of actual systems. There was some work on AI is IR but it got pretty much nowhere.

research on models Gerard Salton (1927—1995) and his group at Cornell refined the vector space model and introduced tf/idf term weighting in the early 70s. This will be covered later. Later in the 70s the probabilistic model of information retrieval emerged in work by Keith van Rijsbergen (1943—) and friends. This will not be covered in the course.

80s This decade saw the introduction of inexpensive personal computers. In libraries – Card catalogs are replaced by OPACs allowing non- specialists access to online information. – Online information started to use full text.

full-text retrieval In the scientific abstract databases full-text is not that important. But in legal databases, it is. Lexis were the first to provide full-text access. Later they were followed by the newspapers. They were among the first non-reference publishers to adopt computerized typesetting. These databases contained text only.

research effort The vector model was still the start, but researcher tried to using dictionaries to disambiguate terms in documents and queries. Some work was done on bi-lingual retrieval. Some work was done on part of speech assignments. Research made no impact on real-world commercial systems.

CD ROM In the late 80s the CD-ROM appeared to make a dent in online information retrieval. The CD-ROM fits the standard print publishing model. Sharing CD-ROMs however, proved a nightmare in libraries.

90s The 90s were the decade of the Internet. Online information retrieval become common. The CD-ROM lost out.

integration of graphics The web became compelling as soon as a graphical user interface was introduced by the mosaic browser. This introduced the subject of multimedia information retrieval, largely defined to research experiments before, out in the open.

start of search engines Early search engines looked at the web with the eyes of earlier information retrieval engines. The essentially applied the vector model to the text extracted from web pages. But web pages also contain also pictures and more importantly, links.

Stanford digital library project Run at Stanford between 1995 and 2004 by a group of researchers to “provide an infrastructure that affords interoperability among heterogeneous, autonomous digital library services.” In 1996 as Sergei Brin and Larry Page started to work on a project to analyze the links between web pages.

web information retrieval I argue that web information retrieval is different from conventional retrieval because on the web pages have different intrinsic values. The PageRank algorithm is a simple algorithm to make such values appear.

Please shutdown the computers when you are done. Thank you for your attention!