Modern Information Retrieval Chap. 02: Modeling (Structured Text Models)

Introduction
Keyword-based query answering treats documents as flat, i.e., a word in the title has the same weight as a word in the body of the document.
But the document structure is an additional piece of information that can be exploited.
For instance, words appearing in the title or in subtitles within the document could receive higher importance.

Introduction (cont.)
Consider the following information need:
–Retrieve all documents which contain a page in which the string “atomic holocaust” appears in italic in the text surrounding a Figure whose label contains the word earth
The corresponding query could be:
–same-page( near( italic(“atomic holocaust”), Figure( label( “earth” ))))

Introduction (cont.)
Advanced interfaces that facilitate the specification of the structure are also highly desirable.
Models which allow combining information on text content with information on document structure are called structured text models.
Structured text models include no ranking (open research problem).

Basic Definitions
Match point: the position in the text of a sequence of words that match the query
–Query: “atomic holocaust in Hiroshima”
–Doc dj: contains 3 lines with this string
–Then, doc dj contains 3 match points
Region: a contiguous portion of the text
Node: a structural component of the text such as a chapter, a section, etc.
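The match-point definition above can be illustrated with a minimal sketch. It assumes match points are character offsets of exact occurrences of the query string. The function name `match_points` and the sample document are hypothetical and are not part of the slides.

```python
def match_points(text: str, query: str) -> list[int]:
    """Return the character offsets at which `query` occurs in `text`."""
    points = []
    start = text.find(query)
    while start != -1:
        points.append(start)
        # Resume the search one character further to find later occurrences.
        start = text.find(query, start + 1)
    return points


# Hypothetical document dj: three lines, each containing the query string.
query = "atomic holocaust in Hiroshima"
doc_j = (
    "atomic holocaust in Hiroshima\n"
    "notes on the atomic holocaust in Hiroshima\n"
    "after the atomic holocaust in Hiroshima."
)
points = match_points(doc_j, query)
```

Here `points` holds one offset per occurrence, so the document has three match points. A real system would read these offsets from an index rather than scanning the text.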

Non-Overlapping Lists
Due to Burkowski, 1992.
Idea: divide the text in non-overlapping regions which are collected in a list
Multiple ways to divide the text in non-overlapping parts yield multiple lists:
–a list for chapters
–a list for sections
–a list for subsections
Text regions from distinct lists might overlap

Non-Overlapping Lists
[Figure: four region lists, labeled L0, L1, L2, and L3, for the Chapter, Sections, SubSections, and SubSubSections of a text.]

Non-Overlapping Lists
Implementation:
–single inverted file that combines keywords and text regions
–to each entry in this inverted file is associated a list of text regions
–lists of text regions can be merged with lists of keywords

Non-Overlapping Lists
Regions are non-overlapping, which limits the queries that can be asked
Types of queries:
–select a region that contains a given word
–select a region A that does not contain a region B (regions A and B belong to distinct lists)
–select a region not contained within any other region
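The three query types can be sketched as follows. This is an illustration only: it assumes each list is a Python list of `(start, end)` offset pairs (inclusive) and that a keyword's occurrences are given as a list of offsets. All names and sample offsets are hypothetical.

```python
from bisect import bisect_left

Region = tuple[int, int]  # (start, end) offsets, both inclusive


def containing_word(regions: list[Region], positions: list[int]) -> list[Region]:
    """Regions that contain at least one occurrence of the word."""
    positions = sorted(positions)
    result = []
    for start, end in regions:
        i = bisect_left(positions, start)  # first occurrence at or after start
        if i < len(positions) and positions[i] <= end:
            result.append((start, end))
    return result


def not_containing(regions_a: list[Region], regions_b: list[Region]) -> list[Region]:
    """Regions of list A that contain no region of list B."""
    return [a for a in regions_a
            if not any(a[0] <= b[0] and b[1] <= a[1] for b in regions_b)]


def not_contained(regions_a: list[Region], others: list[Region]) -> list[Region]:
    """Regions of list A that are not inside any region of the other list."""
    return [a for a in regions_a
            if not any(o[0] <= a[0] and a[1] <= o[1] for o in others)]


chapters = [(0, 99), (100, 199)]
sections = [(10, 40), (50, 90)]
word_positions = [25, 150]
```

With these sample lists, `containing_word(sections, word_positions)` selects the first section, and `not_containing(chapters, sections)` selects the second chapter. The linear scans are kept for clarity; sorted, non-overlapping lists would also allow a merge-style pass.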

Non-Overlapping Lists
The non-overlapping lists model is simple and allows efficient implementation
But the types of queries that can be asked are limited
Also, the model does not include any provision for ranking the documents by degree of similarity to the query
What does structural similarity mean?

Hybrid Model
The model sees the database as composed of a set of documents (or files, if no structure is defined), which may have fields.
Those fields need not fully cover the text, and can nest and overlap.
There are a number of operations to obtain match points: prefix search, proximity, etc.
There are operations for union, intersection, difference and complement of both documents and match points;

Hybrid Model (cont.)
to restrict matches to only some fields, and to retrieve fields containing some match point.
Since it is not possible to determine whether a field is included in another (except under certain assumptions on the hierarchy), we say that the model is “flat”,
and since it is not possible to make certain compositions of expressions involving fields, we say that it is not “compositional”.

PAT Expressions
The only index is on match points; there is no index on structure.
Instead, the language allows structures to be defined dynamically, based on match point expressions for the beginning and end of regions. It also allows the use of externally computed regions.
Structures can have substructures of other types; this fact is not explicit, but derived from the inclusion relationship between regions.

PAT Expressions
Recursive structures (e.g. sections having other sections inside) are not allowed; each structure owns a set of non-overlapping areas of the text.
Despite these drawbacks, the model is a good example of structuring and querying documents by mixing content and structure.
Most importantly, since all operations are based on the PAT array, they are extremely fast. Operations on areas are also fast, since the areas are non-overlapping and non-recursive.

Overlapped Lists
The original idea was to have lists of disjoint segments, originating from textual searches or from “regions” like chapters.
The model enhances the algebra with overlapping capabilities, some new operators and a framework for an implementation.

Overlapped Lists
With these enhancements, the model becomes a reworking of PAT expressions that elegantly solves their semantic problems.
The new operators make it possible to perform set union and to combine areas.
Combination means selecting the minimal text areas including two segments, for any two segments taken from two sets.
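One possible reading of the combination operator is sketched below. It assumes segments are `(start, end)` offset pairs, and it interprets “minimal” as discarding any candidate area that contains another candidate. That interpretation and the function name `combine` are assumptions, not something the slides specify.

```python
Segment = tuple[int, int]  # (start, end) offsets, both inclusive


def combine(xs: list[Segment], ys: list[Segment]) -> list[Segment]:
    """Minimal areas that cover one segment from xs and one from ys."""
    # Smallest area covering each pair (a, b).
    candidates = {(min(a[0], b[0]), max(a[1], b[1])) for a in xs for b in ys}

    def contains_other(seg: Segment) -> bool:
        # Assumed meaning of "minimal": no other candidate lies inside seg.
        return any(o != seg and seg[0] <= o[0] and o[1] <= seg[1]
                   for o in candidates)

    return sorted(s for s in candidates if not contains_other(s))


overlapping = combine([(0, 4), (20, 24)], [(10, 14)])
nested = combine([(0, 4), (8, 9)], [(10, 14)])
```

In the first call the two resulting areas overlap each other, which a non-overlapping list could not represent. In the second call the wider candidate is dropped because it contains the narrower one. The pairwise enumeration is quadratic and only meant to show the semantics.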

Lists of References
Although the structure of documents is hierarchical (with only one strict hierarchy), answers to queries are flat (only the top-level elements qualify), and all elements must be of the same type (e.g. only sections, or only paragraphs).
Answers to queries are seen as lists of “references”.
A reference is a pointer to a region of the database. This elegantly integrates answers to queries and hypertext links, since all are lists of references.

Lists of References
The model also has navigational features to traverse those lists.
This model is very powerful and, because of this, has efficiency problems. To make the model suitable for our comparisons, we consider only the portion related to querying structures. Even this portion is quite powerful, and queries can be solved efficiently by first locating the text matches and then filtering the candidates with the structural restrictions.

Proximal Nodes
Due to Navarro and Baeza-Yates, 1997
Idea: define a strict hierarchical index over the text. This enriches the previous model that used flat lists.
Multiple index hierarchies might be defined
Two distinct index hierarchies might refer to text regions that overlap

Definitions
Each indexing structure is a strict hierarchy composed of
–chapters
–sections
–subsections
–paragraphs
–lines
Each of these components is called a node
To each node is associated a text region

Proximal Nodes
[Figure: hierarchical index (Chapter, Sections, SubSections, SubSubSections) alongside the inverted list entry for “holocaust”, with positions 10, 256 and 48,324.]

Proximal Nodes
Key points:
–In the hierarchical index, one node might be contained within another node
–But two nodes of the same hierarchy cannot overlap
–The inverted list for keywords complements the hierarchical index
–The implementation here is more complex than that for non-overlapping lists

Proximal Nodes
Queries are now regular expressions:
–search for strings
–references to structural components
–combination of these
Model is a compromise between expressiveness and efficiency
Queries are simple but can be processed efficiently
Further, model is more expressive than non-overlapping lists

Proximal Nodes
Query: find the sections, the subsections, and the subsubsections that contain the word “holocaust”
–[(*section) with (“holocaust”)]
Simple query processing:
–traverse the inverted list for “holocaust” and determine all match points
–use the match points to search in the hierarchical index for the structural components

Proximal Nodes
Query: [(*section) with (“holocaust”)]
Sophisticated query processing:
–get the first entry in the inverted list for “holocaust”
–use this match point to search in the hierarchical index for the structural components
–Innermost matching component: the smallest one
–Check if the innermost matching component includes the second entry in the inverted list for “holocaust”
–If it does, check the third entry and so on
–This allows efficient matching of the nearby (or proximal) nodes
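The processing steps above can be sketched as follows. This is one possible reading, not the authors' implementation: the `Node` class, the function `sections_with`, and all offsets are hypothetical. The sketch keeps the path from the root to the current innermost node, so the search for each new match point starts from the previous, nearby node instead of from the root.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str   # e.g. "chapter", "section", "subsection"
    start: int  # first offset of the node's text region
    end: int    # last offset of the node's text region (inclusive)
    children: list["Node"] = field(default_factory=list)

    def covers(self, pos: int) -> bool:
        return self.start <= pos <= self.end


def sections_with(root: Node, postings: list[int]) -> list[Node]:
    """Answer [(*section) with (word)] given the word's sorted match points."""
    result, seen, path = [], set(), []
    for pos in postings:
        # Climb up from the previous innermost node until one covers pos.
        while path and not path[-1].covers(pos):
            path.pop()
        if not path:
            if not root.covers(pos):
                continue  # match point outside the indexed hierarchy
            path.append(root)
        # Descend to the innermost node that covers pos.
        while True:
            child = next((c for c in path[-1].children if c.covers(pos)), None)
            if child is None:
                break
            path.append(child)
        # Every "*section" node on the path contains the match point.
        for node in path:
            if node.kind.endswith("section") and id(node) not in seen:
                seen.add(id(node))
                result.append(node)
    return result


# Hypothetical hierarchy; the match points reuse the values from the figure.
sss = Node("subsubsection", 48000, 49000)
ss2 = Node("subsection", 40000, 50000, [sss])
ss1 = Node("subsection", 200, 300)
s1 = Node("section", 0, 300, [ss1])
s2 = Node("section", 301, 50000, [ss2])
chapter = Node("chapter", 0, 50000, [s1, s2])
answer = sections_with(chapter, [10, 256, 48324])
```

For the second match point the search resumes inside the first section and only descends one level. For the third it climbs back to the chapter before descending again. Consecutive match points that fall in the same innermost node cost a single containment check.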

Proximal Nodes
Model allows formulating queries that are more sophisticated than those allowed by non-overlapping lists
To speed up query processing, nearby nodes are inspected
Types of queries that can be asked are somewhat limited (all nodes in the answer must come from the same index hierarchy!)
Model is a compromise between efficiency and expressiveness

Tree Matching
A model relying on a single primitive, tree inclusion, is proposed.
The idea of tree inclusion is, seeing both the structure of the database and the query (a pattern on structure) as trees, to find an embedding of the pattern into the database which respects the hierarchical relationships between nodes of the pattern.

Tree Matching
Ordered inclusion forces the embedding to respect the left-to-right relations among siblings in the pattern, while unordered inclusion does not.
Tree inclusion is a way to query on structural properties in which the user does not need to be aware of all the details of the structure, but only of the parts he/she is interested in. This provides “data independence”.
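A minimal sketch of ordered tree inclusion follows. It assumes that inclusion means the pattern can be obtained from the database tree by deleting nodes, where the children of a deleted node take its place. The tuple representation, the function names, and the sample trees are hypothetical. The sketch is a plain memoized recursion for small trees and covers the ordered variant only.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees in left-to-right order.


@lru_cache(maxsize=None)
def forest_included(pattern: tuple, target: tuple) -> bool:
    """True if the pattern forest results from deleting nodes of target."""
    if not pattern:
        return True   # every remaining target node can be deleted
    if not target:
        return False  # pattern nodes are left over with nothing to match
    t_label, t_children = target[-1]
    # Option 1: delete the root of the rightmost target tree.
    if forest_included(pattern, target[:-1] + t_children):
        return True
    # Option 2: map the rightmost pattern root to the rightmost target root.
    p_label, p_children = pattern[-1]
    return (p_label == t_label
            and forest_included(p_children, t_children)
            and forest_included(pattern[:-1], target[:-1]))


def is_included(pattern: tuple, target: tuple) -> bool:
    return forest_included((pattern,), (target,))


title, para = ("title", ()), ("para", ())
database = ("book", (
    ("chapter", (title, ("section", (title, para)))),
    ("chapter", (title,)),
))
chapter_with_para = ("book", (("chapter", (para,)),))
para_then_title = ("chapter", (para, title))
```

The pattern `chapter_with_para` is included even though it does not mention the intermediate section, which illustrates the data independence described above. The pattern `para_then_title` is not included under ordered inclusion, because no chapter has a paragraph followed by a title.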

Parsed Strings
The language used to express a database schema is a context-free grammar; that is, the database is structured by giving a grammar to parse its text.
The fundamental data structure is the p-string, or parsed string, which is composed of a derivation tree plus the underlying text.

Parsed Strings
The parsing process implicitly comprises the work of pattern matching; there are no further operations to express it.
There are a number of powerful operations that can be performed to manipulate parsed strings: they can be reparsed by another grammar, some nonterminals can be hidden, etc.
The problem is efficiency. Being such a dynamic approach, it is hard to implement efficiently.

Expressiveness Analysis

A Taxonomy of Models
An analysis in three parts:
–structuring power
–query language
–efficiency

Structuring Power

Query Language

Query Time Complexity
From the description of the implementation of the different models, we classify them according to querying times. We measure the efficiency of a query as a function of n, the total size of intermediate results, unless otherwise specified.

Query Time Complexity

Conclusion
No model is the best for all applications, especially because the more expressive the model, the less efficient it can be.
Each application has its own set of requirements, and should select the most efficient model supporting them.
Another important issue is the perspective of the user. When we incorporate operators and evaluate the cost of implementing them, we are implicitly assuming that they are useful for the user of the system.

Additional Reading
Integrating Contents and Structure in Text Retrieval (paper)
