CS336 Lecture 8: Indexing Languages

1 CS336 Lecture 8: Indexing Languages

2 File organizations or indexes are used to increase the performance of a system
  –Inverted files, signature files, bitmaps
Text indexing is the process of deciding which terms will be used to represent a given document
Index terms are then used to build indexes for the documents
A retrieval model describes how the indexed terms are incorporated into a model
  –Relationship between retrieval model and indexing model
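As a rough illustration of the first file organization named above, here is a minimal sketch of an inverted file. The function name `build_inverted_index`, the whitespace tokenizer, and the three toy documents are assumptions made for this example; they do not come from the lecture.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical toy collection.
docs = {
    1: "ebola virus infection",
    2: "dengue virus vaccine",
    3: "ebola vaccine trial",
}
index = build_inverted_index(docs)
print(index["virus"])  # [1, 2]
print(index["ebola"])  # [1, 3]
```

A lookup then costs one dictionary access per query term rather than a scan of every document, which is the performance gain the slide refers to.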

3 Generating Document Representations
Want to use significant terms to build representations
Manual indexing: professional indexers
  –Manually assign terms from a controlled vocabulary
  –Typically phrases
Automatic indexing: machine selects
  –Terms can be single words, phrases, or other features from the text of documents
  –Takes ~1 hour to index 10 GB

4 Index Languages
Language used to describe docs and queries
Exhaustivity: number of different topics indexed; completeness or breadth
  –Increased exhaustivity => higher recall / lower precision
Specificity: accuracy of indexing; level of detail
  –Increased specificity => higher precision / lower recall
Pre-coordinate indexing
  –Combinations of terms (e.g. phrases) used as an indexing label
Post-coordinate indexing
  –Combinations generated at search time
  –Most common
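The recall and precision effects listed above can be made concrete with a small calculation. This is a sketch only: `precision_recall` is a hypothetical helper, and the document ids and result sets are invented to show a broad term and a narrow term against the same set of relevant documents.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four documents are relevant to the information need.
relevant = {1, 2, 3, 4}
broad_term_results = {1, 2, 3, 4, 5, 6, 7, 8}  # broad term: retrieves everything relevant, plus noise
narrow_term_results = {1, 2}                   # narrow term: retrieves little, all of it relevant

print(precision_recall(broad_term_results, relevant))   # (0.5, 1.0)
print(precision_recall(narrow_term_results, relevant))  # (1.0, 0.5)
```

The broad term trades precision for recall and the narrow term does the reverse, matching the two "=>" bullets on the slide.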

5 The Trade Off
[Figure: precision plotted against recall, each axis marked at 0.5, with points labeled "Broad terms" and "Narrow terms"]
Students want high precision: narrow terms.
Lawyers want high recall: broad terms.
For an unknown population, use terms in the middle.

6 MeSH: Medical Subject Headings
Faceted classification: http://www.nlm.nih.gov/mesh/2006/MeSHtree.html

7 Disadvantages of Manual Indexing
Human effort is considerable
Controlled vocabulary per collection
Subjective
  –Overlap between the terms assigned by different indexers is only about 40%
  –But …
  –Human experts who use indexing aids describing allowable vocabulary and usage (e.g. “scope notes”) achieve good indexing uniformity
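The slide does not say how the 40% figure is measured. One common way to quantify agreement between two indexers, assumed here purely for illustration, is the number of shared terms divided by the number of distinct terms either indexer assigned. The function name and both term sets below are hypothetical.

```python
def indexer_consistency(terms_a, terms_b):
    """Shared terms divided by all distinct terms assigned by either indexer."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical assignments by two indexers for the same document.
indexer_a = {"Ebola Virus", "Vaccines, DNA", "Guinea Pigs", "Plasmids"}
indexer_b = {"Ebola Virus", "Vaccines, DNA", "Viral Vaccines"}

print(indexer_consistency(indexer_a, indexer_b))  # 0.4
```

Here the indexers share 2 of 5 distinct terms, so two reasonable descriptions of one document agree on well under half of the vocabulary.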

8 Development of Automatic Methods
1960s: search services relied on manual approaches
  –Automatic methods were sometimes an add-on
  –Focus remained on the use of intermediaries (specialists)
  –Strong belief that manual must be better than natural language
What caused the focus to shift?
  –Sheer volume of text: very costly to maintain vocabulary and indexing
  –Full text of documents became more readily available … less reliance on abstracts and titles
  –Computing power and access increased
  –The Web! Encouraged direct searching by users and reduced dependence on professional searchers

9 Which is better?
Salton claims the results of automatic indexing are comparable to manual
  –Based on small databases
Can depend upon task and environment
Experiments have shown that using both manual and automatic indexing improves performance
  –“combination of evidence”
Typically, manual indexing is not a practical option
Why?

10 Automatic Indexing with Full Text
More flexible: no decisions about doc content are made at the time of indexing
  –No a priori assumptions about future search needs
  –Indexing effort not devoted to docs outside search scope
  –Document left open to a variety of index descriptions
Post-coordination indexing lets user define representation
But no effort is given to explaining document content
  –Pressures user to think more carefully about search
  –Pressures system designer to develop tools to aid user
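Post-coordination can be pictured as combining single-term postings only when the query arrives. The sketch below assumes a tiny hand-written postings table and a hypothetical `search_all` helper that treats the query as a Boolean AND; none of these names or values come from the lecture.

```python
# Hypothetical single-term postings: term -> set of document ids.
postings = {
    "ebola": {1, 3},
    "virus": {1, 2},
    "immunization": {1, 4},
}


def search_all(postings, terms):
    """Return the ids of documents that contain every query term (Boolean AND)."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


# The combination "ebola AND immunization" is formed at search time;
# no such phrase was ever stored as an index label.
print(search_all(postings, ["ebola", "immunization"]))  # {1}
```

A pre-coordinate index would instead store a label such as "Ebola immunization" at indexing time, which only helps users whose needs match the labels chosen in advance.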

11 Manual vs Automatic Indexing

12 MeSH: Medical Subject Headings
Faceted classification: http://www.nlm.nih.gov/mesh/2006/MeSHtree.html

13 Category C. Diseases
C1. Bacterial Infections and Mycoses
C2. Virus Diseases
C3. Parasitic Diseases
C4. Neoplasms
C5. Musculoskeletal Diseases
C6. Digestive System Diseases
C7. Stomatognathic Diseases
C8. Respiratory Tract Diseases
C9. Otorhinolaryngologic Diseases
C10. Nervous System Diseases
C11. Eye Diseases
C12. Urologic and Male Genital Diseases
C13. Female Genital Diseases and Pregnancy Complications
C14. Cardiovascular Diseases
C15. Hemic and Lymphatic Diseases
C16. Neonatal Diseases and Abnormalities
C17. Skin and Connective Tissue Diseases
C18. Nutritional and Metabolic Diseases
C19. Endocrine Diseases
C20. Immunologic Diseases
C21. Injuries, Poisonings, and Occupational Diseases
C22. Animal Diseases
C23. Symptoms and General Pathology
Category C2. Virus Diseases
---------------------------
Arbovirus Infections
African Horse Sickness
Bluetongue
Dengue
Dengue Hemorrhagic Fever
Encephalitis, Epidemic
Encephalitis, California
Encephalitis, Japanese
Encephalitis, St. Louis
Encephalitis, Tick-Borne
West Nile Fever
Encephalomyelitis, Equine
Encephalomyelitis, Venezuelan Equine
Phlebotomus Fever
Rift Valley Fever
Tick-Borne Diseases
African Swine Fever
Colorado Tick Fever
Encephalitis, Tick-Borne
Hemorrhagic Fever, Crimean
Hemorrhagic Fever, Omsk
Kyasanur Forest Disease
Nairobi Sheep Disease
West Nile Fever
Yellow Fever
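A hierarchy like this lets a search on a broad heading also pick up its narrower headings. The sketch below uses a small nested dictionary; the nesting shown is an assumption, since the indentation of the original tree is not preserved in this transcript, and `find_subtree`, `flatten`, and `expand` are hypothetical helper names.

```python
# Assumed fragment of the tree: heading -> dict of narrower headings.
tree = {
    "Virus Diseases": {
        "Arbovirus Infections": {
            "Dengue": {"Dengue Hemorrhagic Fever": {}},
            "Rift Valley Fever": {},
            "Yellow Fever": {},
        },
    },
}


def find_subtree(tree, heading):
    """Return the children of `heading`, or None if it is not in the tree."""
    for name, children in tree.items():
        if name == heading:
            return children
        found = find_subtree(children, heading)
        if found is not None:
            return found
    return None


def flatten(children):
    """List every heading below a node, depth first."""
    names = []
    for name, sub in children.items():
        names.append(name)
        names.extend(flatten(sub))
    return names


def expand(tree, heading):
    """Return the heading plus all narrower headings, or [] if unknown."""
    subtree = find_subtree(tree, heading)
    return [] if subtree is None else [heading] + flatten(subtree)


print(expand(tree, "Dengue"))  # ['Dengue', 'Dengue Hemorrhagic Fever']
```

Expanding a query term this way raises recall at some cost in precision, the same trade-off discussed earlier for broad and narrow terms.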

14 Example “Ebola” document
Nat Med 1998 Jan;4(1):37-42
Immunization for Ebola virus infection.
Xu L, Sanchez A, Yang Z, Zaki SR, Nabel EG, Nichol ST, Nabel GJ
Department of Biological Chemistry, University of Michigan Medical Center, Ann Arbor 48109-0650, USA.
Infection by Ebola virus causes rapidly progressive, often fatal, symptoms of fever, hemorrhage and hypotension. Previous attempts to elicit protective immunity for this disease have not met with success. We report here that protection against the lethal effects of Ebola virus can be achieved in an animal model by immunizing with plasmids encoding viral proteins. We analyzed immune responses to the viral nucleoprotein (NP) and the secreted or transmembrane forms of the glycoprotein (sGP or GP) and their ability to protect against infection in a guinea pig infection model analogous to the human disease. Protection was achieved and correlated with antibody titer and antigen-specific T-cell responses to sGP or GP. Immunity to Ebola virus can therefore be developed through genetic vaccination and may facilitate efforts to limit the spread of this disease.

15 Indexing
If you were to look for documents about immunization against the Ebola virus, what might your query look like?

16 Example “Ebola” document
Nat Med 1998 Jan;4(1):37-42
Immunization for Ebola virus infection.
Xu L, Sanchez A, Yang Z, Zaki SR, Nabel EG, Nichol ST, Nabel GJ
Department of Biological Chemistry, University of Michigan Medical Center, Ann Arbor 48109-0650, USA.
Infection by Ebola virus causes rapidly progressive, often fatal, symptoms of fever, hemorrhage and hypotension. Previous attempts to elicit protective immunity for this disease have not met with success. We report here that protection against the lethal effects of Ebola virus can be achieved in an animal model by immunizing with plasmids encoding viral proteins. We analyzed immune responses to the viral nucleoprotein (NP) and the secreted or transmembrane forms of the glycoprotein (sGP or GP) and their ability to protect against infection in a guinea pig infection model analogous to the human disease. Protection was achieved and correlated with antibody titer and antigen-specific T-cell responses to sGP or GP. Immunity to Ebola virus can therefore be developed through genetic vaccination and may facilitate efforts to limit the spread of this disease.
MH - Animal
MH - Antibody Formation
MH - Disease Models, Animal
MH - Ebola Virus/*immunology
MH - Female
MH - Guinea Pigs
MH - Hemorrhagic Fever, Ebola/*immunology/*prevention & control
MH - Human
MH - Male
MH - Mice
MH - Mice, Inbred BALB C
MH - Nucleocapsid Proteins/immunology
MH - Plasmids
MH - T-Lymphocytes/immunology
MH - Transfection
MH - *Vaccines, DNA
MH - Viral Proteins/biosynthesis/immunology
MH - *Viral Vaccines
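The MH lines in this record follow a regular pattern: a heading, optional subheadings separated by "/", and an asterisk marking a major topic of the article. A minimal parsing sketch follows; `parse_mh` is a hypothetical helper written for this example, not part of any MEDLINE tool, and it assumes every line starts with the "MH - " prefix as shown on the slide.

```python
def parse_mh(line):
    """Split one 'MH - ...' line into heading, subheadings, and a major-topic flag."""
    # Drop the 'MH' tag; split only on the first hyphen so that headings
    # such as 'T-Lymphocytes' keep their own hyphen.
    body = line.split("-", 1)[1].strip()
    heading, *subheadings = body.split("/")
    # An asterisk on the heading or on any subheading marks a major topic.
    major = heading.startswith("*") or any(s.startswith("*") for s in subheadings)
    return {
        "heading": heading.lstrip("*"),
        "subheadings": [s.lstrip("*") for s in subheadings],
        "major": major,
    }


print(parse_mh("MH - Ebola Virus/*immunology"))
# {'heading': 'Ebola Virus', 'subheadings': ['immunology'], 'major': True}
```

Filtering on the `major` flag gives a narrower, higher-precision view of what the indexer considered the document to be mainly about.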

