Indexing: Overview, Approaches to Indexing, Automatic Indexing, Information Extraction


Overview Indexing: the transformation of documents into searchable data structures.  May be manual or automatic.  Creates the basis for direct search, or for search through index files.  Historically performed by professional indexers associated with library organizations.  A critical process: a user's ability to find documents on a particular subject is limited by the indexer having created index terms for that subject.  Initial computerization still relied on human indexers, but encouraged using more index terms (index cards no longer being required for each index term).

Changes in Objectives of Indexing Due to Full Text Availability  Indexing identifies the major concepts of documents.  The use of a controlled vocabulary (the domain of the index) helps standardize the choice of terms.  Controlled vocabularies slow the indexing process, but aid users, because users know the domain the indexer had to use.  With the availability of full text, the need for manual indexing is diminishing:  Source information (citation data) can easily be extracted.  Every word of a document (after appropriate normalization) may be used as a term.  Thesauri compensate for the lack of controlled vocabularies. Hence, the importance of manual indexing shifts to its ability to:  Perform abstractions and determine additional related terms.  Judge the value of the information (e.g., it is more difficult to “cheat”).

Approaches: Scope  Exhaustivity: the extent to which concepts are indexed.  Should we index only the most important concepts, or also more minor concepts?  In a 10-page document, should a 2-sentence discussion of some subject be indexed?  Specificity: the preciseness of the index terms used.  Should we use general indexing terms or more specific terms?  Should we use the term “computer”, “personal computer”, or “IBM Aptiva Model M61”?  Main effect:  Low exhaustivity has an adverse effect on recall.  Low specificity has an adverse effect on precision.  Related issues:  Index the title and abstract only, or the entire document?  Should index terms be weighted?

Approaches: Pre-coordination  Post-coordination: terms are linked together only at query time, when a query combines a set of terms with AND.  Pre-coordination: links among terms are specified in the index. Pre-coordination improves retrieval for post-coordinated queries.  Example: a document discusses drilling of oil wells in Mexico by CITGO and introduction of oil refineries in Peru by the U.S. 1. No pre-coordination of terms: oil, wells, Mexico, CITGO, refineries, Peru, U.S.  Document retrieved if a query links “oil”, “Mexico” and “Peru”. 2. Simple pre-coordination: (oil wells, Mexico, CITGO) (oil refineries, Peru, U.S.)  Document not retrieved if a query links “oil”, “Mexico” and “Peru”.

Example (cont.) 3. Pre-coordination with position indicating role: (CITGO, drill, oil wells, Mexico) (U.S., introduce, oil refineries, Peru)  Discriminates which country introduces refineries into the other country. 4. Pre-coordination with modifier indicating role: (Subject: CITGO, Action: drill, Object: oil wells, Modifier: in Mexico) (Subject: U.S., Action: introduce, Object: oil refineries, Modifier: in Peru)  If a document discussed the U.S. introducing refineries in Peru, Bolivia, and Argentina, one entry is used with three Modifier fields.
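
The contrast between cases 1 and 2 above can be sketched in code. This is a minimal illustration with made-up data structures (a flat term set vs. a list of coordinated term groups), not an implementation from the lecture:

```python
# Unstructured index (case 1): a flat set of terms for the document.
flat_index = {"oil", "wells", "mexico", "citgo", "refineries", "peru", "u.s."}

# Pre-coordinated index (case 2): terms grouped by the statement they came from.
precoordinated_index = [
    {"oil wells", "mexico", "citgo"},
    {"oil refineries", "peru", "u.s."},
]

def matches_flat(index, query_terms):
    """Post-coordination only: the document matches if every query term
    appears anywhere in the flat index."""
    return query_terms <= index

def matches_precoordinated(index, query_terms):
    """With pre-coordination, all query terms must co-occur within a
    single coordinated group."""
    return any(query_terms <= group for group in index)

query = {"oil", "mexico", "peru"}
# Flat: all three terms occur somewhere, so the document is (spuriously) retrieved.
# Pre-coordinated: no single group contains all three, so it is not.
```

This makes the slide's point concrete: pre-coordination suppresses the false match that arises when unrelated facts from the same document satisfy a post-coordinated AND query.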

Automatic Indexing The system automatically determines the index terms assigned to documents. Relative advantages: –Human indexing: Ability to determine concept abstractions. Ability to judge the value of concepts. –Automatic indexing: Reduced cost: once the initial hardware cost is amortized, the operational cost is cheaper than compensation for human indexers. Reduced processing time: at most a few seconds vs. at least a few minutes. Improved consistency: algorithms select index terms much more consistently than humans.

Weighted and Unweighted Indexes Unweighted indexing: –No attempt to determine the value of the different terms assigned to a document. It is not possible to distinguish between major topics and casual references. –All retrieved documents are equal in value. –Typical of commercial systems through the 1980s. Weighted indexing: –An attempt is made to place a value on each term as a description of the document. –This value is related to the frequency of occurrence of the term in the document (higher is better), but also to the number of collection documents that use the term (lower is better). –Query weights and document weights are combined into a value describing the likelihood that a document matches a query, and a threshold value limits the number of documents returned. –Typically used only with automatic indexing.
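
The weighting scheme described above (term frequency in the document: higher is better; document frequency across the collection: lower is better) is the familiar tf-idf family. A minimal sketch of one common variant, since the slide does not fix a specific formula:

```python
import math

def tfidf_weights(docs):
    """Compute tf-idf weights for each document in a small collection.
    docs: list of documents, each a list of term strings."""
    n = len(docs)
    # Document frequency: how many documents contain each term.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in doc:
            w[term] = w.get(term, 0) + 1       # raw term frequency (tf)
        for term in w:
            w[term] *= math.log(n / df[term])  # idf: rarer terms weigh more
        weights.append(w)
    return weights

docs = [["oil", "wells", "oil"], ["oil", "refineries"], ["peru", "mines"]]
w = tfidf_weights(docs)
# "oil" appears in 2 of 3 documents, so its idf is low; "wells" appears
# in only 1 of 3, so it outweighs "oil" even with a lower tf.
```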

Automatic Indexing by Term and by Concept Indexing by term: the document is represented by terms extracted from the document itself. –The vector model –The Bayesian model –Natural language processing Indexing by concept: the document is represented by concepts not necessarily used in the document.

Indexing by Term: the Vector Model The SMART system, developed by Salton at Cornell University. –Each document is stored as a vector of weights. –Each vector position represents a term in the database domain (the dimension of these vectors is the size of the vocabulary). –The query is represented by a similar vector. –Search involves calculating the vector distance between the query vector and each document vector.
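
The slide says "vector distance" without fixing a measure; cosine similarity is the standard choice in SMART-style vector retrieval. A small sketch with illustrative weights:

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between a query vector and a document vector.
    1.0 means identical direction; 0.0 means no shared terms."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

# Vocabulary positions: [oil, wells, mexico, peru]; weights are made up.
doc1  = [0.8, 0.6, 0.4, 0.0]
doc2  = [0.2, 0.0, 0.0, 0.9]
query = [1.0, 1.0, 0.0, 0.0]
# doc1 shares more weighted terms with the query, so it ranks higher.
```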

Indexing by Term: the Bayesian Model Bayes’ rule of conditional probability: –P(A|B) = P(A,B)/P(B) = P(A)P(B|A)/P(B) Bayesian methods can be used to determine the processing tokens and their weights. Principle: calculate the (posterior) probability that a given document covers concept C, given the presence of features (words) F1, …, Fm in the document. To calculate this probability we need to know: –The prior probability that the document is relevant to the concept C. –The conditional probability that the features Fi are present in a document, given that the document is relevant to the concept C.
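
A worked sketch of the posterior calculation. It adds the naive assumption that features are conditionally independent given C (the slide does not state this, but it is the common simplification), and all probabilities below are hypothetical numbers chosen for illustration:

```python
def posterior(prior_c, p_f_given_c, p_f_given_not_c, features_present):
    """P(C | F1..Fm) via Bayes' rule, assuming features are
    conditionally independent given C (naive Bayes)."""
    p_c, p_not = prior_c, 1.0 - prior_c
    for f in features_present:
        p_c *= p_f_given_c[f]         # P(Fi | C)
        p_not *= p_f_given_not_c[f]   # P(Fi | not C)
    return p_c / (p_c + p_not)        # normalize over both hypotheses

# Hypothetical concept C = "petroleum industry".
prior = 0.1
p_f_given_c     = {"oil": 0.9,  "refinery": 0.7}
p_f_given_not_c = {"oil": 0.05, "refinery": 0.02}
p = posterior(prior, p_f_given_c, p_f_given_not_c, ["oil", "refinery"])
# Observing both features raises P(C | features) far above the 0.1 prior.
```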

Indexing by Term: Natural Language Processing The DR-LINK system. –Enhance indexing by using semantic information (in addition to statistical information). –Process the language, rather than treating each word as an independent entity. –Process documents at different levels: morphological, lexical, semantic, syntactic, and discourse (beyond the sentence).

Indexing by Concept There are many ways to represent the same idea, and increased retrieval performance comes from using a single representation. Hence, a single canonical set of concepts is determined and used for indexing all documents. The MatchPlus system: –A set of n features (concepts) is selected. –For each word stem, a context vector of dimension n is built, describing how strongly the stem reflects each feature. –The context vectors for the word stems are combined, with a weighted sum, to create a single context vector for the entire document. –This vector represents the document in terms of the concepts. –Queries go through the same analysis to determine vector representations. –During search, the query vector is compared to the document vectors.
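
The weighted-sum step can be sketched as follows. The stems, feature names, and numbers are illustrative only; MatchPlus itself learns its context vectors rather than using hand-set values:

```python
def document_context_vector(stems, stem_vectors, weights):
    """Combine per-stem context vectors into one document vector by a
    weighted sum, as in the MatchPlus description above.
    stem_vectors: stem -> n-dimensional list of concept strengths.
    weights: stem -> weight (defaults to 1.0 if a stem is missing)."""
    n = len(next(iter(stem_vectors.values())))
    doc = [0.0] * n
    for stem in stems:
        w = weights.get(stem, 1.0)
        for i, v in enumerate(stem_vectors[stem]):
            doc[i] += w * v
    return doc

# Three hypothetical concepts (features): [energy, geography, finance]
stem_vectors = {"oil": [0.9, 0.1, 0.3], "mexico": [0.1, 0.9, 0.1]}
doc = document_context_vector(["oil", "mexico"], stem_vectors, {"oil": 2.0})
# doc = [2*0.9 + 0.1, 2*0.1 + 0.9, 2*0.3 + 0.1] = [1.9, 1.1, 0.7]
```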

Information Extraction Two processes related to indexing: –Extraction of facts (e.g., when building indexes automatically). –Document summarization. Extraction of facts into a database: –Extract specific types of information using extraction criteria (whereas indexing attempts to represent the entire document). –Recall now refers to how much information was extracted from a document (vs. how much should have been extracted). –Precision now refers to the proportion of the extracted information that is accurate. –Experiments show that automatic extraction performs much worse than human extraction (55% precision and recall vs. about 80%), but operates about 20 times faster.
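
The redefined precision and recall can be computed directly. A small sketch using the oil-industry facts from the earlier example as a hypothetical gold standard:

```python
def extraction_precision_recall(extracted, gold):
    """Precision/recall in the extraction sense described above:
    recall    = fraction of the gold facts that were extracted,
    precision = fraction of extracted facts that are accurate."""
    extracted, gold = set(extracted), set(gold)
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

gold = {("CITGO", "drills", "oil wells", "Mexico"),
        ("U.S.", "introduces", "refineries", "Peru")}
extracted = {("CITGO", "drills", "oil wells", "Mexico"),
             ("U.S.", "introduces", "refineries", "Bolivia")}  # one wrong
p, r = extraction_precision_recall(extracted, gold)
# p = r = 0.5: one of two extracted facts is right,
# and one of two gold facts was found.
```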

Information Extraction (cont.) Document summarization: –Extract the most important ideas, while reducing the size significantly. –Example: the abstract of a document. –“True summarization” is not feasible. –Instead, most summarization techniques extract the “most significant” subsets (e.g., sentences) and concatenate them. –Each sentence is assigned a score, and the highest-scoring sentences are extracted. –No guarantee of a coherent narrative. –Heuristic algorithms, with no overall theory. For example: Consider sentences over 5 words in length. Look for “cues”; e.g., “in conclusion”. Focus on the first 10 and last 5 paragraphs.
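
A toy extract-style summarizer built from heuristics like those listed above: skip very short sentences, reward cue phrases, reward early position. The particular scores and cue list are illustrative, not taken from the slides:

```python
def heuristic_summary(sentences, k=2):
    """Score each sentence with simple heuristics and return the k
    highest-scoring ones, in their original order."""
    cues = ("in conclusion", "in summary")
    scored = []
    for pos, s in enumerate(sentences):
        if len(s.split()) <= 5:       # consider only sentences over 5 words
            continue
        score = 0.0
        if any(c in s.lower() for c in cues):
            score += 2.0              # cue-phrase bonus
        score += 1.0 / (pos + 1)      # earlier sentences score higher
        scored.append((score, pos, s))
    top = sorted(scored, reverse=True)[:k]
    # Re-order the selected sentences by original position for readability.
    return [s for _, pos, s in sorted(top, key=lambda t: t[1])]

sentences = [
    "Short one here.",
    "Oil indexing is an important problem for information retrieval systems.",
    "Some filler sentence that says very little of interest overall.",
    "In conclusion, automatic indexing is cheaper and more consistent than manual indexing.",
]
summary = heuristic_summary(sentences)
```

As the slide warns, concatenating the winners gives no guarantee of a coherent narrative; the heuristics only approximate significance.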