CS336: Intelligent Information Retrieval Why is Information Retrieval difficult?


What is information retrieval? What is a relational database?

Relational Databases
Can be thought of as a collection of tables of records.
–Records contain information pertaining to a particular data item (e.g. patient information)
Relationships are made explicit through labeled fields (e.g. date, name, age, …), each with its possible field values (e.g. 2001, Mary Smith, 29)

Relational Databases vs IR
How does the basic characteristic of data in a relational database differ from that of a text document?
–A relational database has structure!
   employee records
   store inventory
   student information: id number, name, year of graduation, etc.
Information retrieval is hard because textual data is unstructured …
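The contrast can be sketched in a few lines of Python. This is only an illustration: the `student` table, its column names, and the second student are assumptions made up for the example, using the standard-library `sqlite3` module.

```python
import sqlite3

# Structured data: every value sits in a labeled field (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER, name TEXT, grad_year INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Mary Smith", 2001), (2, "John Doe", 2003)],
)

# The field name tells the system exactly where to look.
rows = conn.execute("SELECT name FROM student WHERE grad_year = 2001").fetchall()
names = [row[0] for row in rows]

# Unstructured data: the same facts buried in free text.
text = "Mary Smith finished her degree in 2001, two years before John Doe."

# A substring test finds the string "2001" but cannot say what it means
# (a graduation year? a price? an id number?).
mentions_2001 = "2001" in text
```

The database query returns exactly the matching record; the text search only reports that a string occurs somewhere, leaving the interpretation to the reader.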

Let’s do an experiment … Why do crackers break into honeypots? What strategies did you use to answer this question?

Information Retrieval
The trick is to find a means of describing an object
–We’ll focus on text, but objects could include images, audio files, and video
–Language complicates our task
–What approach might you take to developing a document description?

Manual Indexing
Earliest method
–people read every document
–choose descriptors from a “controlled vocabulary”, like a card catalog
–categorize document via descriptors

Example “Ebola” document
Nat Med 1998 Jan;4(1):37-42
Immunization for Ebola virus infection.
Xu L, Sanchez A, Yang Z, Zaki SR, Nabel EG, Nichol ST, Nabel GJ
Department of Biological Chemistry, University of Michigan Medical Center, Ann Arbor, USA.
Infection by Ebola virus causes rapidly progressive, often fatal, symptoms of fever, hemorrhage and hypotension. Previous attempts to elicit protective immunity for this disease have not met with success. We report here that protection against the lethal effects of Ebola virus can be achieved in an animal model by immunizing with plasmids encoding viral proteins. We analyzed immune responses to the viral nucleoprotein (NP) and the secreted or transmembrane forms of the glycoprotein (sGP or GP) and their ability to protect against infection in a guinea pig infection model analogous to the human disease. Protection was achieved and correlated with antibody titer and antigen-specific T-cell responses to sGP or GP. Immunity to Ebola virus can therefore be developed through genetic vaccination and may facilitate efforts to limit the spread of this disease.

MeSH Indexing of Example Document
MH - Animal
MH - Antibody Formation
MH - Disease Models, Animal
MH - Ebola Virus/*immunology
MH - Female
MH - Guinea Pigs
MH - Hemorrhagic Fever, Ebola/*immunology/*prevention & control
MH - Human
MH - Male
MH - Mice
MH - Mice, Inbred BALB C
MH - Nucleocapsid Proteins/immunology
MH - Plasmids
MH - T-Lymphocytes/immunology
MH - Transfection
MH - *Vaccines, DNA
MH - Viral Proteins/biosynthesis/immunology
MH - *Viral Vaccines

Honeypots Build your own controlled vocabulary for this document.

Controlled Vocabulary
What kinds of difficulties do you think might arise?
–Maintenance of the vocabulary is costly
   changes over time
   must train specialists
–Many documents = a lot of person hours reading/indexing
–Searcher’s vocabulary may not match indexer’s

Free Text
Choose words from within the document text
–make two short lists:
   words you think would be useful
   words you don’t believe would be useful

What does an IR system do?
Generate a representation of each document
–essentially pick best words and/or phrases
Generate query representation
–if documents are processed specially, queries must be too
–possibly weight query words
Match queries and documents
–find relevant documents
Perhaps rank and sort documents
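The four steps above can be sketched as a toy pipeline. This is a minimal illustration, not how any particular engine works: the function names `represent` and `search`, the sample documents, and the choice of plain word overlap as the matching score are all assumptions made for the example.

```python
def represent(text):
    """Represent a text as the set of its lowercase word tokens."""
    return set(text.lower().split())


def search(query, docs):
    """Match a query against documents and rank by number of shared words."""
    # Queries go through the same processing as documents.
    query_terms = represent(query)
    scored = []
    for doc_id, text in docs.items():
        overlap = len(query_terms & represent(text))
        if overlap:  # keep only documents that share at least one word
            scored.append((overlap, doc_id))
    # Rank: most shared words first, ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Hypothetical mini-collection.
docs = {
    "d1": "honeypots attract crackers",
    "d2": "crackers and cheese",
    "d3": "ebola virus vaccine",
}
results = search("why do crackers break into honeypots", docs)
```

Here `results` lists `d1` ahead of `d2`, but `d2` (about food) is still returned, which hints at why word matching alone is imperfect.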

Ambiguity Complicates the Task
Synonyms: many ways to express a concept
–lorry/truck, elevator/lift, pump/impeller, hypertension/high blood pressure
–failure to use the specific words in a document => failure to retrieve that document
Words have many meanings
–How many different meanings are there for “bank”?
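Both problems show up even in a tiny example. The sketch below assumes exact word matching and a hand-made synonym table; `SYNONYMS`, `expand`, `match`, and the three documents are hypothetical names and data invented for illustration.

```python
# Hand-made synonym table (an assumption for this example, not a real thesaurus).
SYNONYMS = {"truck": {"lorry"}, "lift": {"elevator"}}


def expand(terms):
    """Add known synonyms to a set of query terms."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


def match(terms, docs):
    """Return ids of documents containing at least one of the terms."""
    return sorted(d for d, text in docs.items() if terms & set(text.split()))


docs = {
    "d1": "the lorry was parked outside",
    "d2": "the river bank flooded",
    "d3": "the bank raised interest rates",
}

exact = match({"truck"}, docs)             # synonym problem: nothing found
expanded = match(expand({"truck"}), docs)  # synonym table recovers d1
ambiguous = match({"bank"}, docs)          # both senses of "bank" come back
```

Synonym expansion helps with the first problem but does nothing for the second: the query "bank" still retrieves the river and the financial institution alike.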

Ambiguity Complicates the Task
Difficult to specify important but vague concepts
–e.g. will interest rates be raised in the next six months?
Spelling variants / spelling errors

Basic Automatic Indexing
Parse documents to recognize structure
–e.g. title, date, other fields
Scan for word tokens
–numbers, special characters, hyphenation, capitalization, etc.
–languages like Chinese need segmentation
–record positional information for proximity operators
Stopword removal
–based on short list of common words such as “the”, “and”, “or”
–saves storage overhead of very long indexes
–can be dangerous (e.g. “Mr. The”, “and-or gates”)
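The token scanning, stopword removal, and positional recording steps can be sketched as follows. This is a simplified illustration under stated assumptions: the regular expression treats any run of ASCII letters or digits as a token (so it ignores hyphenation and non-English segmentation), and `STOPWORDS`, `tokenize`, and `build_index` are names chosen for the example.

```python
import re
from collections import defaultdict

# Short stopword list taken from the examples on the slide.
STOPWORDS = {"the", "and", "or"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a positional inverted index: term -> doc id -> list of positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Positions are counted before stopwords are dropped, so distances
        # between the remaining words still reflect the original text.
        for position, token in enumerate(tokenize(text)):
            if token not in STOPWORDS:
                index[token][doc_id].append(position)
    return index


# Hypothetical two-document collection.
docs = {"d1": "The Ebola virus and the vaccine", "d2": "Virus protection"}
index = build_index(docs)
```

With this index, `index["virus"]` maps `d1` to position 2 and `d2` to position 0, which is the information a proximity operator would need. Note that this stopword list would also discard the name in “Mr. The”, the danger the slide mentions.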

Basic Automatic Indexing
Stem words
–group word variants such as plurals via morphological processing
   computer, computers, computing, computed, computation, computerized, computerize, computerizable
–can make mistakes but generally preferred
Optional
–phrase indexing
–thesaurus classes
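A toy suffix stripper shows the idea. This is not the Porter stemmer or any standard algorithm: the suffix list in `SUFFIXES` was hand-picked for the word family on the slide, and `stem` and `min_stem` are names assumed for the example.

```python
# Hand-picked suffixes, longest first so "erized" is tried before "ed".
SUFFIXES = ["erizable", "erized", "erize", "ation", "ers", "er", "ing", "ed", "s"]


def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


family = [
    "computer", "computers", "computing", "computed",
    "computation", "computerized", "computerize", "computerizable",
]
stems = {stem(word) for word in family}
```

All eight variants collapse to the single stem `comput`, so a query using any of them matches documents using any other. The same rules also turn “news” into “new”, an example of the mistakes stemming can make.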

How do you rank results? What does it mean for a document to be important/relevant? Word matching is imperfect, so how do we decide which documents are most important?

How do you rank results?
How do we decide which documents are most important?
–Count words
   high frequency words indicate document “aboutness”
–Weight infrequent corpus words more strongly
   can be strong signifiers of meaning; easier to partition
–Determine meaning by analyzing text surrounding a word
–Give extra weight to title words, etc.
–Make sense of references given, citations received, etc.
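The first two ideas, counting words within a document and weighting words that are rare in the corpus, are commonly combined as TF-IDF. The slides do not name a formula, so the sketch below assumes one simple variant, scoring each document as the sum over query terms of tf × ln(N / df); the function name `tf_idf_rank` and the sample documents are invented for the example.

```python
import math
from collections import Counter


def tf_idf_rank(query, docs):
    """Rank document ids by a simple TF-IDF score for the query."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)  # term frequency within this document
        scores[doc_id] = sum(
            tf[term] * math.log(n_docs / df[term])
            for term in query.lower().split()
            if term in df  # skip query words absent from the corpus
        )
    # Highest score first, ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical mini-collection.
docs = {
    "d1": "ebola virus vaccine protects against ebola",
    "d2": "computer virus spreads",
    "d3": "flu vaccine news",
}
ranking = tf_idf_rank("ebola virus", docs)
```

Here “ebola” occurs in only one document, so it carries more weight than “virus”, and `d1` (which also repeats it) ranks first, ahead of `d2`.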

Free Text Search Engines
Different engines use different ranking strategies (often a trade secret)
–Word frequency
–Placement in document
–Popularity of document
–Number of links to document
–Business relationships, etc.