Lecture 8 Information Retrieval Introduction
Databases
  - Very formal & logical
  - Input into them is (or can be) very tightly constrained
  - In turn, DB queries are written assuming those constraints
Information Retrieval Systems
  - Empirical
  - Cognitive modeling: the way we think
Queries based on ‘things already there’
  - Words, documents
What are the characteristics of these things?
  - Total number of words in the English language
  - What are the most common words? The least common words?
  - How many total ‘documents’ are there in the world?
  - How many web pages are there?
  - What kind of structure does the web have?
  - How rapidly is it changing?
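The questions about the most and least common words can be explored empirically on any text collection. The following is a minimal sketch, assuming a word is a run of letters or apostrophes; the `word_frequencies` helper and the sample sentence are illustrative and not part of the lecture.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each lowercased word occurs in text."""
    # Assumption: a "word" is a run of ASCII letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


sample = "the cat sat on the mat and the dog sat too"
freqs = word_frequencies(sample)

# The two most frequent words, with their counts.
most_common = freqs.most_common(2)

# Words that occur exactly once (the rare end of the distribution).
least_common = sorted(w for w, c in freqs.items() if c == 1)

print(most_common)
print(least_common)
```

On a real corpus the same counts would typically show a few very frequent words and a long tail of words seen only once.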
Users have:
  - An information need
  - Use of information
In an IR system, the user dynamically iterates with the system, e.g. “Was this helpful?”
Similar, but not identical, architectures

  DBMS                           IR
  Data                           Documents
  DBMS                           IRS
  Database Engine                Search Engine
  Query Processor                Query Processor
  UI                             UI
  Queries & Reports              Retrieved Output
  Interface to another system    Interface to another system
IR architecture, with examples
  - Documents: Medline, Westlaw, etc.
  - IRS: various retrieval methods (Boolean, ranked w/weights, vector space)
  - Search Engine: SilverPlatter, Dialog, Inktomi
  - Query Processor
  - UI: via Web GUI, command line
  - Retrieved Output: post-processing, value add
  - Interface to another system
IRS Components
  - Document preparation & analysis
  - Task definition
  - Databases
  - Indexing
  - Search/retrieval engines
  - Interfaces
  - Usability & cognitive tools
  - System evaluation
Document Preparation & Analysis
  - Formatting tools
  - Mapping to/from formats (XML, PDF, text, PostScript, etc.)
  - Natural language processing / feature extraction
      - Stemming
      - Parsing, word sense disambiguation, morphology
      - Tokenization
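Tokenization and stemming can be illustrated with a very small sketch. This is not the Porter stemmer or any standard algorithm: the `SUFFIXES` rule set and both helper functions are hypothetical, chosen only to show the idea of mapping word variants to a shared form.

```python
import re

# Hypothetical, deliberately tiny suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = tokenize("Indexing indexed documents, retrieves Document.")
stems = [naive_stem(t) for t in tokens]

print(tokens)
print(stems)
```

Here "Indexing" and "indexed" both reduce to "index", so a query on either form could match both documents. The crude rules also produce non-words such as "retriev", which is usually acceptable because stems are only used as index keys.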
Task Definition
  - Ad hoc
  - Filtering, selective dissemination
  - Cross-lingual retrieval
  - Categorization
  - Topic detection & tracking
  - Redundancy reduction
  - Info synthesis / value add
  - Cross-doc / cross-time summarization
  - Presentation / visualization
  - Info delivery when & where needed
  - Info assistance
  - Decision support
  - Online analysis
  - Resource discovery
IR Databases
  - Bibliographic
  - Full text
  - Multi-media
  - Audio & video
  - Web data
Human indexing & categorization
In Everything Is Miscellaneous, Weinberger describes three orders of categorization:
  - 1st order: organize the things themselves (made of atoms, so they take up space), such as silverware in a drawer or books on a shelf
  - 2nd order: a reference to the things themselves, such as a card catalogue that points to the physical location of the 1st-order thing (but doesn’t necessarily say much about what’s inside)
  - 3rd order: made of bits (takes up virtually no space) and can get to things ‘inside’
Reference: Everything Is Miscellaneous (Weinberger)
Indexing
  - Automatic indexing: algorithms to organize and weight text in documents
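One common way to organize and weight text automatically is an inverted index with TF-IDF weights. The lecture does not name a specific scheme, so the sketch below is an assumption: `build_index`, the whitespace tokenization, and the three sample documents are all illustrative.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: tf-idf weight} (a simple inverted index)."""
    # Term counts per document; whitespace splitting is a simplification.
    term_counts = {
        doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()
    }
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        for term, tf in counts.items():
            # Terms found in every document get idf = log(1) = 0.
            idf = math.log(n_docs / doc_freq[term])
            index[term][doc_id] = tf * idf
    return index


docs = {
    "d1": "boolean retrieval of documents",
    "d2": "ranked retrieval of documents",
    "d3": "vector space retrieval",
}
index = build_index(docs)

print(index["boolean"])
print(index["retrieval"])
```

In this sample, "retrieval" appears in every document and so receives weight zero, while "boolean" appears in only one and is weighted most heavily. That is the intended effect of the weighting: rare terms discriminate between documents better than common ones.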
Retrieval/Matching
  - Boolean & exact match
  - Weighted or partial match
  - Link analysis
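The difference between exact (Boolean) matching and partial matching can be sketched over a small postings table. The function names, the `postings` data, and the scoring rule (count of matching query terms) are assumptions made for illustration; link analysis is not covered by this sketch.

```python
def boolean_and(postings, terms):
    """Exact match: documents that contain every query term."""
    sets = [postings.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def partial_match(postings, terms):
    """Rank documents by how many query terms they contain."""
    scores = {}
    for t in terms:
        for doc_id in postings.get(t, set()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical postings: term -> set of documents containing it.
postings = {
    "ranked": {"d2"},
    "retrieval": {"d1", "d2", "d3"},
    "documents": {"d1", "d2"},
}

query = ["ranked", "documents"]
exact = boolean_and(postings, query)
ranked = partial_match(postings, query)

print(exact)
print(ranked)
```

The Boolean query returns only d2, the one document containing both terms. The partial match also returns d1, ranked lower because it contains only one of the two terms.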
Interfaces
  - Web GUI
  - ‘Local’ GUI
  - Command line
  - Gesture: James Bond (Quantum of Solace), Minority Report
Knowledge Tools
  - Dictionaries, thesauri
  - Gazetteers, CIA World Factbook
  - Encyclopedias
Evaluation
What questions to ask?
  - Is the system actually used?
  - Is it efficient?
  - Is the system effective?
  - Are users satisfied?
  - Do they find relevant information? Complete information?
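The last two questions, whether users find relevant information and whether it is complete, are commonly quantified as precision and recall. The lecture does not name these measures, so mapping the questions onto them is an assumption, and the document ids below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision: fraction of retrieved documents that are relevant.
    recall:    fraction of relevant documents that were retrieved.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found. Both measures depend on human relevance judgments, which ties them back to the question of whether users are satisfied.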
Reading
  - Read “As We May Think” by Vannevar Bush: http://www.theatlantic.com/doc/194507/bush