CS336: Intelligent Information Retrieval Why is Information Retrieval difficult?


1 CS336: Intelligent Information Retrieval Why is Information Retrieval difficult?

2 What is information retrieval? What is a relational database?

3 Relational Databases Can think of them as containing a bunch of tables or records. –Records contain information pertaining to a particular data item (e.g. patient information) –Relationships are explicit: labels or fields (e.g. date, name, age, …) with possible field values (e.g. 2001, Mary Smith, 29)

4 Relational Databases vs IR How does the basic characteristic of data in a relational database differ from that of a text document? –A relational database has structure: employee records, store inventory, student information (ID number, name, year of graduation, etc.) –Information retrieval is hard because textual data is unstructured …

5 Let’s do an experiment … Why do crackers break into honeypots? What strategies did you use to answer this question?

6

7 Information Retrieval The trick is to find a means of describing an object –We’ll focus on text, but objects could include images, audio files, and video –Language complicates our task –What approach might you take to developing a document description?

8 Manual Indexing Earliest method –people read every document –choose descriptors from “controlled vocabulary” like card catalog –categorize document via descriptors

9 Example “Ebola” document Nat Med 1998 Jan;4(1):37-42 Immunization for Ebola virus infection. Xu L, Sanchez A, Yang Z, Zaki SR, Nabel EG, Nichol ST, Nabel GJ Department of Biological Chemistry, University of Michigan Medical Center, Ann Arbor 48109-0650, USA. Infection by Ebola virus causes rapidly progressive, often fatal, symptoms of fever, hemorrhage and hypotension. Previous attempts to elicit protective immunity for this disease have not met with success. We report here that protection against the lethal effects of Ebola virus can be achieved in an animal model by immunizing with plasmids encoding viral proteins. We analyzed immune responses to the viral nucleoprotein (NP) and the secreted or transmembrane forms of the glycoprotein (sGP or GP) and their ability to protect against infection in a guinea pig infection model analogous to the human disease. Protection was achieved and correlated with antibody titer and antigen-specific T-cell responses to sGP or GP. Immunity to Ebola virus can therefore be developed through genetic vaccination and may facilitate efforts to limit the spread of this disease.

10 MeSH Indexing of Example Document MH - Animal MH - Antibody Formation MH - Disease Models, Animal MH - Ebola Virus/*immunology MH - Female MH - Guinea Pigs MH - Hemorrhagic Fever, Ebola/*immunology/*prevention & control MH - Human MH - Male MH - Mice MH - Mice, Inbred BALB C MH - Nucleocapsid Proteins/immunology MH - Plasmids MH - T-Lymphocytes/immunology MH - Transfection MH - *Vaccines, DNA MH - Viral Proteins/biosynthesis/immunology MH - *Viral Vaccines

11 Honeypots Build your own controlled vocabulary for this document.

12 Controlled Vocabulary What kinds of difficulties do you think might arise?

13 Controlled Vocabulary What kinds of difficulties do you think might arise? –Maintenance of the vocabulary is costly: it changes over time, and specialists must be trained –Many documents = a lot of person-hours reading/indexing –Searcher’s vocabulary may not match indexer’s

14 Free Text Choose words from within the document text –make two short lists: words you think would be useful, and words you don’t believe would be useful

15 What does an IR system do? Generate a representation of each document –essentially pick the best words and/or phrases Generate a query representation –if documents are processed specially, queries must be processed the same way –possibly weight query words Match queries and documents –find relevant documents Perhaps rank and sort documents
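The steps on this slide can be sketched in a few lines. This is a minimal illustration rather than how any particular system works: it assumes a bag-of-words representation and scores a document by counting occurrences of query words. The names `represent`, `search`, and the sample `docs` are hypothetical.

```python
from collections import Counter

def represent(text):
    """Bag-of-words representation: lowercase tokens with their counts."""
    return Counter(text.lower().split())

def search(query, docs):
    """Return (doc_id, score) pairs for matching documents, best first."""
    q = represent(query)  # queries get the same processing as documents
    results = []
    for doc_id, text in docs.items():
        d = represent(text)
        score = sum(d[word] for word in q)  # occurrences of query words
        if score > 0:
            results.append((doc_id, score))
    # Rank by descending score; break ties by document id
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical toy collection
docs = {
    "d1": "ebola virus infection in guinea pigs",
    "d2": "honeypots attract crackers",
    "d3": "virus vaccines and virus immunity",
}

print(search("virus vaccines", docs))
```

Here `d3` ranks above `d1` because it contains more occurrences of the query words; `d2` shares no word with the query and is not returned.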

16 Ambiguity Complicates the Task Synonyms: many ways to express a concept –lorry/truck, elevator/lift, pump/impeller, hypertension/high blood pressure –failure to use specific words => failure to get the document Words have many meanings –How many different meanings are there for “bank”?
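One common response to the synonym problem is to expand a query term with its known equivalents before matching. The sketch below assumes a small hand-built synonym table taken from the examples on this slide; `SYNONYMS`, `expand`, and `matches` are hypothetical names, and the substring matching is deliberately naive.

```python
# Hypothetical synonym groups, using the examples from the slide
SYNONYMS = [
    {"lorry", "truck"},
    {"elevator", "lift"},
    {"hypertension", "high blood pressure"},
]

def expand(term):
    """Return the term together with any synonyms listed for it."""
    term = term.lower()
    for group in SYNONYMS:
        if term in group:
            return set(group)
    return {term}

def matches(term, document):
    """True if the document contains the term or one of its synonyms."""
    text = document.lower()
    # Naive substring test: "lift" would also match "weightlifting"
    return any(variant in text for variant in expand(term))

print(matches("lorry", "The truck broke down"))
print(matches("hypertension", "Patient has high blood pressure"))
```

Expansion helps with synonyms but does nothing for words with many meanings: a query for “bank” still matches both the river and the financial sense.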

17 Ambiguity Complicates the Task Difficult to specify important but vague concepts –e.g. will interest rates be raised in the next six months? Spelling variants / spelling errors
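Spelling variants and errors are often handled by approximate matching. As one possible illustration (not something the slides prescribe), the sketch below computes Levenshtein edit distance and uses it to find vocabulary terms close to a query word. The function names and the sample vocabulary are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def close_terms(word, vocabulary, max_distance=1):
    """Vocabulary terms within max_distance edits of word, sorted."""
    return sorted(t for t in vocabulary if edit_distance(word, t) <= max_distance)

print(close_terms("color", ["colour", "collar", "cold"]))
```

With `max_distance=1`, the British spelling “colour” is accepted as a variant of “color”, while “collar” and “cold” (two edits away) are not.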

18 Basic Automatic Indexing Parse documents to recognize structure –e.g. title, date, other fields Scan for word tokens –numbers, special characters, hyphenation, capitalization, etc. –languages like Chinese need segmentation –record positional information for proximity operators Stopword removal –based on short list of common words such as “the”, “and”, “or” –saves storage overhead of very long indexes –can be dangerous (e.g. “Mr. The”, “and-or gates”)
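The indexing steps on this slide (tokenize, drop stopwords, record positions) can be sketched as follows. This is a simplified illustration: the regular expression, the stopword list, and the names `tokenize`, `build_index`, and `near` are assumptions, and no document structure or non-English segmentation is handled.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "and", "or", "a", "of"}  # illustrative short list

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Positional inverted index: term -> doc_id -> list of token positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            if token not in STOPWORDS:  # stopwords are counted but not stored
                index[token][doc_id].append(position)
    return index

def near(index, first, second, max_gap=1):
    """Doc ids where `second` occurs within max_gap tokens after `first`."""
    hits = set()
    for doc_id, first_positions in index.get(first, {}).items():
        second_positions = index.get(second, {}).get(doc_id, [])
        if any(0 < q - p <= max_gap
               for p in first_positions for q in second_positions):
            hits.add(doc_id)
    return hits

# Hypothetical toy collection
docs = {
    "d1": "Immunization for Ebola virus infection",
    "d2": "The virus and the Ebola outbreak",
}
index = build_index(docs)
print(near(index, "ebola", "virus"))
```

Only `d1` satisfies the proximity query, because there “virus” directly follows “Ebola”. The danger noted on the slide is also visible: `tokenize("and-or gates")` yields `and`, `or`, `gates`, and the first two are discarded as stopwords.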

19 Basic Automatic Indexing Stem words –group word variants such as plurals via morphological processing, e.g. computer, computers, computing, computed, computation, computerized, computerize, computerizable –can make mistakes but generally preferred Optional –phrase indexing –thesaurus classes
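To make the idea of stemming concrete, here is a toy suffix stripper. It is not the Porter stemmer or any standard algorithm; the suffix list was chosen by hand (an assumption) so that the variants listed on this slide collapse to one stem.

```python
# Hand-picked suffixes, longest first; purely illustrative
SUFFIXES = ("izable", "ation", "ized", "ize", "ing", "ers", "er", "ed", "s")

def stem(word, min_stem=3):
    """Repeatedly strip a known suffix while at least min_stem letters remain."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[:-len(suffix)]
                changed = True
                break  # restart the scan on the shortened word
    return word

variants = ["computer", "computers", "computing", "computed",
            "computation", "computerized", "computerize", "computerizable"]
print({stem(w) for w in variants})
print(stem("news"), stem("new"))
```

All eight variants reduce to `comput`. The second line shows the kind of mistake the slide warns about: “news” and “new” are conflated even though they mean different things.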

20 How do you rank results? What does it mean for a document to be important/relevant? Word matching is imperfect, so how do we decide which documents are most important?

21 How do you rank results? How do we decide which documents are most important? –Count words: high-frequency words indicate document “aboutness” –Weight words that are infrequent in the corpus more strongly: they can be strong signifiers of meaning and make documents easier to partition –Determine meaning by analyzing the text surrounding a word –Give extra weight to title words, etc. –Make sense of references given, citations received, etc.
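Three of these ideas (count words, weight rare corpus words more, boost title words) can be combined in a small scoring function. The sketch below uses a plain TF-IDF-style score, which is one common choice and not necessarily what the course has in mind; `rank`, `title_boost`, and the sample documents are assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def rank(query, docs, title_boost=2.0):
    """docs maps doc_id -> (title, body). Returns doc ids, best first."""
    n = len(docs)
    term_counts = {}
    doc_freq = Counter()  # number of documents containing each word
    for doc_id, (title, body) in docs.items():
        counts = Counter(tokenize(body))
        for word in tokenize(title):
            counts[word] += title_boost  # title words count extra
        term_counts[doc_id] = counts
        doc_freq.update(counts.keys())
    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for word in tokenize(query):
            if counts[word] > 0:
                # Frequency in the document times rarity in the corpus
                score += counts[word] * math.log(n / doc_freq[word])
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical toy collection
docs = {
    "d1": ("Ebola vaccines", "the virus causes fever"),
    "d2": ("Honeypots", "the crackers break into honeypots"),
    "d3": ("Influenza", "the virus spreads"),
}
print(rank("ebola virus", docs))
print(rank("the", docs))
```

For the query “ebola virus”, `d1` comes first because “ebola” is rare in the corpus and appears in its title. The query “the” returns nothing: a word found in every document has a weight of log(1) = 0.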

22 Free Text Search Engines Different engines use different ranking strategies (often a trade secret) –Word frequency –Placement in document –Popularity of document –Number of links to document –Business relationships, etc.

