2 Hsin-Hsi Chen1 Chapter 2 Modeling Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University

3 Hsin-Hsi Chen2 Indexing

4 Hsin-Hsi Chen3 Indexing indexing: assign identifiers to text items. assign: manual vs. automatic indexing identifiers: –objective vs. nonobjective text identifiers cataloging rules define, e.g., author names, publisher names, dates of publications, … –controlled vs. uncontrolled vocabularies instruction manuals, terminological schedules, … –single-term vs. term phrase

5 Hsin-Hsi Chen4 Two Issues Issue 1: indexing exhaustivity –exhaustive: assign a large number of terms –nonexhaustive Issue 2: term specificity –broad terms (generic) cannot distinguish relevant from nonrelevant items –narrow terms (specific) retrieve relatively fewer items, but most of them are relevant

6 Hsin-Hsi Chen5 Parameters of retrieval effectiveness Recall Precision Goal: high recall and high precision
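In the usual notation, with respect to a given query (the slide's own formulas are not preserved in this transcript, so the standard definitions are shown):

```latex
\text{recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items in the collection})},
\qquad
\text{precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{items retrieved})}
```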

7 Hsin-Hsi Chen6 [Figure: the collection partitioned into relevant and nonrelevant items, with the retrieved part overlapping both; the four resulting regions are labeled a, b, c, d.]

8 Hsin-Hsi Chen7 A Joint Measure F-score –β is a parameter that encodes the relative importance of recall and precision. –β=1: equal weight –β>1: precision is more important –β<1: recall is more important
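A standard harmonic-mean form of this measure is shown below; note that whether β>1 favors precision or recall depends on the exact parametrization chosen, and textbook conventions differ, so the direction stated on the slide follows the parametrization used in the course text.

```latex
F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}
```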

9 Hsin-Hsi Chen8 Choices of Recall and Precision Both recall and precision vary from 0 to 1. In principle, the average user wants to achieve both high recall and high precision. In practice, a compromise must be reached because simultaneously optimizing recall and precision is not normally achievable.

10 Hsin-Hsi Chen9 Choices of Recall and Precision (Continued) Particular choices of indexing and search policies have produced variations in performance ranging from 0.8 precision and 0.2 recall to 0.1 precision and 0.8 recall. In many circumstances, recall and precision values between 0.5 and 0.6 are more satisfactory for the average user.

11 Hsin-Hsi Chen10 Term-Frequency Consideration Function words –for example, "and", "or", "of", "but", … –the frequencies of these words are high in all texts Content words –words that actually relate to document content –varying frequencies in the different texts of a collection –indicate term importance for content

12 Hsin-Hsi Chen11 A Frequency-Based Indexing Method Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words. Compute the term frequency tf ij for all remaining terms T j in each document D i, specifying the number of occurrences of T j in D i. Choose a threshold frequency T, and assign to each document D i all terms T j for which tf ij > T.
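A minimal sketch of this procedure in Python; the stop list, the threshold, and the sample documents are illustrative placeholders, not the slide's data:

```python
from collections import Counter

STOP_LIST = {"and", "or", "of", "but", "the", "a", "is", "are", "for", "in"}

def index_terms(documents, threshold=1):
    """Assign to each document D_i all terms T_j whose frequency tf_ij exceeds the threshold."""
    index = {}
    for doc_id, text in documents.items():
        words = [w for w in text.lower().split() if w not in STOP_LIST]
        tf = Counter(words)                      # tf_ij for the remaining terms
        index[doc_id] = {t for t, f in tf.items() if f > threshold}
    return index

docs = {"D1": "information retrieval systems retrieve information items",
        "D2": "the system is effective and the system is fast"}
print(index_terms(docs))   # {'D1': {'information'}, 'D2': {'system'}}
```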

13 Hsin-Hsi Chen12 Discussions High-frequency terms favor recall. High precision requires the ability to distinguish individual documents from each other. A high-frequency term is good for precision only when its frequency is not equally high in all documents.

14 Hsin-Hsi Chen13 Inverse Document Frequency Inverse document frequency (IDF) for term T j, where df j (the document frequency of term T j ) is the number of documents in which T j occurs. Terms that occur frequently in individual documents but rarely in the remainder of the collection fulfil both the recall and the precision requirements.
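The usual definition, with N the total number of documents in the collection, is:

```latex
\mathrm{idf}_j = \log\frac{N}{\mathrm{df}_j}
```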

15 Hsin-Hsi Chen14 New Term Importance Indicator weight w ij of a term T j in a document D i Eliminating common function words Computing the value of w ij for each term T j in each document D i Assigning to the documents of a collection all terms with sufficiently high (tf x idf) factors
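In symbols, the usual form of this indicator is:

```latex
w_{ij} = \mathrm{tf}_{ij} \times \mathrm{idf}_j = \mathrm{tf}_{ij} \times \log\frac{N}{\mathrm{df}_j}
```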

16 Hsin-Hsi Chen15 Term-discrimination Value Useful index terms distinguish the documents of a collection from each other Document Space –two documents have very similar term sets when the corresponding points in the document configuration appear close together –when a high-frequency term without discrimination is assigned, it will increase the document space density

17 Hsin-Hsi Chen16 A Virtual Document Space [Figure: the document space in its original state, after assignment of a good discriminator, and after assignment of a poor discriminator.]

18 Hsin-Hsi Chen17 Good Term Assignment When a term is assigned to the documents of a collection, the few items to which the term is assigned will be distinguished from the rest of the collection. This should increase the average distance between the items in the collection and hence produce a document space less dense than before.

19 Hsin-Hsi Chen18 Poor Term Assignment A high-frequency term is assigned that does not discriminate between the items of a collection. Its assignment will render the documents more similar. This is reflected in an increase in document space density.

20 Hsin-Hsi Chen19 Term Discrimination Value definition dv j = Q - Q j, where Q and Q j are the space densities before and after the assignment of term T j. dv j >0: T j is a good term; dv j <0: T j is a poor term.
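A small sketch of how the discrimination value can be computed, assuming the space density Q is measured as the average pairwise cosine similarity of the document vectors (one common choice); the toy term-document matrix is illustrative:

```python
import numpy as np

def density(doc_vectors):
    """Space density: average pairwise cosine similarity of the documents."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    unit = doc_vectors / np.where(norms == 0, 1, norms)
    sims = unit @ unit.T
    n = len(doc_vectors)
    return (sims.sum() - n) / (n * (n - 1))   # exclude self-similarities

def discrimination_value(doc_vectors, j):
    """dv_j = Q (density before assigning term j) - Q_j (density after assigning term j)."""
    without_j = doc_vectors.copy()
    without_j[:, j] = 0
    return density(without_j) - density(doc_vectors)

# toy term-document matrix (rows = documents, columns = terms); weights are illustrative
D = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 1.0]])
for j in range(D.shape[1]):
    print(j, round(discrimination_value(D, j), 3))
```

Terms that appear with similar frequency in every document (like the last column above) tend to get negative values, i.e., they are poor discriminators.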

21 Hsin-Hsi Chen20 Variations of Term-Discrimination Value with Document Frequency Low document frequency: dv j =0 (candidates for thesaurus transformation). Medium document frequency: dv j >0. High document frequency: dv j <0 (candidates for phrase transformation).

22 Hsin-Hsi Chen21 Another Term Weighting w ij = tf ij x dv j compared with w ij = tf ij x idf j –idf j : decreases steadily with increasing document frequency –dv j : increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger.

23 Hsin-Hsi Chen22 Term Relationships in Indexing Single-term indexing –Single terms are often ambiguous. –Many single terms are either too specific or too broad to be useful. Complex text identifiers –subject experts and trained indexers –linguistic analysis algorithms, e.g., NP chunker –term-grouping or term clustering methods

24 Hsin-Hsi Chen23 Term Classification (Clustering)

25 Hsin-Hsi Chen24 Term Classification (Clustering) Column part Group terms whose corresponding column representations reveal similar assignments to the documents of the collection. Row part Group documents that exhibit sufficiently similar term assignments.

26 Hsin-Hsi Chen25 Linguistic Methodologies Indexing phrases: nominal constructions including adjectives and nouns –Assign syntactic class indicators (i.e., part of speech) to the words occurring in document texts. –Construct word phrases from sequences of words exhibiting certain allowed syntactic markers (noun-noun and adjective-noun sequences).

27 Hsin-Hsi Chen26 Term-Phrase Formation Term Phrase a sequence of related text words that carries a more specific meaning than the single terms e.g., “computer science” vs. computer (cf. the document-frequency diagram above: phrase transformation applies to high-frequency terms with dv j <0)

28 Hsin-Hsi Chen27 Simple Phrase-Formation Process the principal phrase component (phrase head) a term with a document frequency exceeding a stated threshold, or exhibiting a negative discriminator value the other components of the phrase medium- or low-frequency terms with stated co-occurrence relationships with the phrase head common function words not used in the phrase-formation process

29 Hsin-Hsi Chen28 An Example Effective retrieval systems are essential for people in need of information. –“are”, “for”, “in” and “of”: common function words –“system”, “people”, and “information”: phrase heads

30 Hsin-Hsi Chen29 The Formatted Term-Phrases [Table of phrases formed from “effective retrieval systems … essential … people … need … information”; *: phrases assumed to be useful for content identification.]

31 Hsin-Hsi Chen30 The Problems A phrase-formation process controlled only by word co-occurrences and the document frequencies of certain words is not likely to generate a large number of high-quality phrases. Additional syntactic criteria for phrase heads and phrase components may provide further control in phrase formation.

32 Hsin-Hsi Chen31 Additional Term-Phrase Formation Steps Syntactic class indicators are assigned to the terms, and phrase formation is limited to sequences of specified syntactic markers, such as adjective-noun and noun-noun sequences (adverb-adjective and adverb-noun sequences are excluded). The phrase elements are all chosen from within the same syntactic unit, such as subject phrase, object phrase, and verb phrase.

33 Hsin-Hsi Chen32 Consider Syntactic Unit effective retrieval systems are essential for people in need of information subject phrase –effective retrieval systems verb phrase –are essential object phrase –people in need of information

34 Hsin-Hsi Chen33 Phrases within Syntactic Components Adjacent phrase heads and components within syntactic components –retrieval systems* –people need –need information* Phrase heads and components co-occur within syntactic components –effective systems [ subj effective retrieval systems] [ vp are essential ] for [ obj people need information]

35 Hsin-Hsi Chen34 Problems More stringent phrase formation criteria produce fewer phrases, both good and bad, than less stringent methodologies. Prepositional phrase attachment, e.g., The man saw the girl with the telescope. Anaphora resolution He dropped the plate on his foot and broke it.

36 Hsin-Hsi Chen35 Problems (Continued) Any phrase matching system must be able to deal with the problems of –synonym recognition –differing word orders –intervening extraneous words Example –retrieval of information vs. information retrieval

37 Hsin-Hsi Chen36 Equivalent Phrase Formulation Base form: text analysis system Variants: –system analyzes the text –text is analyzed by the system –system carries out text analysis –text is subjected to system analysis Related term substitution –text: documents, information items –analysis: processing, transformation, manipulation –system: program, process

38 Hsin-Hsi Chen37 Thesaurus-Group Generation Thesaurus transformation –broadens index terms whose scope is too narrow to be useful in retrieval –a thesaurus must assemble groups of related specific terms under more general, higher-level class indicators (cf. the document-frequency diagram above: thesaurus transformation applies to low-frequency terms with dv j =0)

39 Hsin-Hsi Chen38 Sample Classes of Roget’s Thesaurus

40 Hsin-Hsi Chen39 The Indexing Prescription (1) Identify the individual words in the document collection. Use a stop list to delete from the texts the function words. Use a suffix-stripping routine to reduce each remaining word to word-stem form. For each remaining word stem T j in document D i, compute w ij. Represent each document D i by D i =(T 1, w i1 ; T 2, w i2 ; …, T t, w it )

41 Hsin-Hsi Chen40 Word Stemming effectiveness --> effective --> effect picnicking --> picnic but not: king --> k

42 Hsin-Hsi Chen41 Some Morphological Rules Restore a silent e after suffix removal from certain words to produce “hope” from “hoping” rather than “hop”. Delete certain doubled consonants after suffix removal, so as to generate “hop” from “hopping” rather than “hopp”. Restore a final y for an i in forms such as “easier”, so as to generate “easy” instead of “easi”.

43 Hsin-Hsi Chen42 The Indexing Prescription (2) Identify individual text words. Use stop list to delete common function words. Use automatic suffix stripping to produce word stems. Compute term-discrimination value for all word stems. Use thesaurus class replacement for all low-frequency terms with discrimination values near zero. Use phrase-formation process for all high-frequency terms with negative discrimination values. Compute weighting factors for complex indexing units. Assign to each document single term weights, term phrases, and thesaurus classes with weights.

44 Hsin-Hsi Chen43 Query vs. Document Differences –Query texts are short. –Fewer terms are assigned to queries. –The occurrence of query terms rarely exceeds 1. Q=(w q1, w q2, …, w qt ) where w qj : inverse document frequency D i =(d i1, d i2, …, d it ) where d ij : term frequency*inverse document frequency

45 Hsin-Hsi Chen44 Query vs. Document When non-normalized documents are used, the longer documents with more assigned terms have a greater chance of matching particular query terms than do the shorter document vectors.

46 Hsin-Hsi Chen45 Relevance Feedback Terms present in previously retrieved documents that have been identified as relevant to the user’s query are added to the original formulations. The weights of the original query terms are altered by replacing the inverse document frequency portion of the weights with term-relevance weights obtained by using the occurrence characteristics of the terms in the previously retrieved relevant and nonrelevant documents of the collection.

47 Hsin-Hsi Chen46 Relevance Feedback Q = (w q1, w q2,..., w qt ) D i = (d i1, d i2,..., d it ) The new query may take the following form Q’ = α·{w q1, w q2,..., w qt } + β·{w’ qt+1, w’ qt+2,..., w’ qt+m } The weights of the newly added terms T t+1 to T t+m may consist of a combined term-frequency and term-relevance weight.
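A minimal sketch of this kind of query reformulation in the style of Rocchio feedback; the α/β values, the vector representation, and the simplification of taking the centroid of the relevant documents (rather than computing full term-relevance weights) are illustrative assumptions:

```python
import numpy as np

def expand_query(q, relevant_docs, alpha=1.0, beta=0.5):
    """Reweight the original query and add terms from known-relevant documents."""
    centroid = np.mean(relevant_docs, axis=0)   # average vector of the relevant documents
    return alpha * q + beta * centroid

q = np.array([0.0, 0.0, 2.0])           # original query weights
R = np.array([[2.0, 3.0, 5.0],          # documents judged relevant by the user
              [3.0, 1.0, 4.0]])
print(expand_query(q, R))               # the new query picks up weight on terms 1 and 2
```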

48 Hsin-Hsi Chen47 Final Indexing Identify individual text words. Use a stop list to delete common words. Use suffix stripping to produce word stems. Replace low-frequency terms with thesaurus classes. Replace high-frequency terms with phrases. Compute term weights for all single terms, phrases, and thesaurus classes. Compare query statements with document vectors. Identify some retrieved documents as relevant and some as nonrelevant to the query.

49 Hsin-Hsi Chen48 Final Indexing Compute term-relevance factors based on available relevance assessments. Construct new queries with added terms from relevant documents and term weights based on combined frequency and term-relevance weight. Return to step (7). Compare query statements with document vectors ……..

50 Hsin-Hsi Chen49 Summary of expected effectiveness of automatic indexing
–Basic single-term automatic indexing: baseline
–Use of thesaurus to group related terms in the given topic area: +10% to +20%
–Use of automatically derived term associations obtained from joint term assignments found in sample document collections: 0% to -10%
–Use of automatically derived term phrases obtained by using co-occurring terms found in the texts of sample collections: +5% to +10%
–Use of one iteration of relevance feedback to add new query terms extracted from previously retrieved relevant documents: +30% to +60%

51 Hsin-Hsi Chen50 Models

52 Hsin-Hsi Chen51 Ranking central problem of IR –Predict which documents are relevant and which are not Ranking –Establish an ordering of the documents retrieved IR models –Different models provide distinct sets of premises to deal with document relevance

53 Hsin-Hsi Chen52 Information Retrieval Models Classic Models –Boolean model set theoretic documents and queries are represented as sets of index terms compare Boolean query statements with the term sets used to identify document content. –Vector model algebraic model documents and queries are represented as vectors in a t-dimensional space compute global similarities between queries and documents. –Probabilistic model probabilistic documents and queries are represented on the basis of probability theory compute the relevance probabilities for the documents of a collection.

54 Hsin-Hsi Chen53 Information Retrieval Models (Continued) Structured Models –reference to the structure present in written text –non-overlapping list model –proximal nodes model Browsing –flat –structure guided –hypertext

55 Hsin-Hsi Chen54 Taxonomy of Information Retrieval Models USER TASK: Retrieval (Ad hoc, Filtering) and Browsing. Classic Models: boolean, vector, probabilistic. Set Theoretic: Fuzzy, Extended Boolean. Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Network. Probabilistic: Inference Network, Belief Network. Structured Models: Non-Overlapping Lists, Proximal Nodes. Browsing: Flat, Structure Guided, Hypertext.

56 Hsin-Hsi Chen55 Issues of a retrieval system Models –boolean –vector –probabilistic Logical views of documents –full text –set of index terms User task –retrieval –browsing

57 Hsin-Hsi Chen56 Combinations of these issues (USER TASK vs. LOGICAL VIEW OF DOCUMENTS)
Retrieval –Index Terms: Classic, Set Theoretic, Algebraic, Probabilistic –Full Text: Classic, Set Theoretic, Algebraic, Probabilistic –Full Text + Structure: Structured
Browsing –Index Terms: Flat –Full Text: Flat, Hypertext –Full Text + Structure: Structure Guided, Hypertext

58 Hsin-Hsi Chen57 Retrieval: Ad hoc and Filtering Ad hoc retrieval –Documents remain relatively static while new queries are submitted Filtering –Queries remain relatively static while new documents come into the system e.g., newswire services and the stock market –User profile describes the user’s preferences The filtering task indicates to the user which documents might be of interest to him Which ones are really relevant is fully reserved to the user –Routing: a variation of filtering Rank the filtered documents and show this ranking to the user

59 Hsin-Hsi Chen58 User profile Simplistic approach –The profile is described through a set of keywords –The user provides the necessary keywords Elaborate approach –Collect information from the user –initial profile + relevance feedback (relevant information and nonrelevant information)

60 Hsin-Hsi Chen59 Formal Definition of IR Models [D, Q, F, R(q i, d j )] –D: a set composed of logical views (or representations) for the documents in the collection –Q: a set composed of logical views (or representations) for the user information needs (queries) –F: a framework for modeling document representations, queries, and their relationships –R(q i, d j ): a ranking function which associates a real number with q i ∈ Q and d j ∈ D

61 Hsin-Hsi Chen60 Formal Definition of IR Models (continued) classic Boolean model –set of documents –standard operations on sets classic vector model –t-dimensional vector space –standard linear algebra operations on vectors classic probabilistic model –sets –standard probabilistic operations, and Bayes’ theorem

62 Hsin-Hsi Chen61 Basic Concepts of Classic IR index terms (usually nouns): index and summarize weight of index terms Definition –K={k 1, …, k t }: a set of all index terms –w i,j : a weight of an index term k i of a document d j –d j =(w 1,j, w 2,j, …, w t,j ): an index term vector for the document d j –g i (d j )= w i,j assumption –index term weights are mutually independent: w i,j associated with (k i,d j ) tells us nothing about w i+1,j associated with (k i+1,d j ) (a simplification: e.g., the terms computer and network in the area of computer networks are clearly correlated)

63 Hsin-Hsi Chen62 Boolean Model The index term weight variables are all binary, i.e., w i,j ∈{0,1} A query q is a Boolean expression (and, or, not) q dnf : the disjunctive normal form of q q cc : a conjunctive component of q dnf sim(d j,q): similarity of d j to q –1: if ∃ q cc such that q cc ∈ q dnf and ∀ k i, g i (d j )=g i (q cc ) –0: otherwise If sim(d j,q)=1, d j is relevant to q

64 Hsin-Hsi Chen63 Boolean Model (Continued) Example –q=k a ∧ (k b ∨ ¬k c ) –q dnf =(1,1,1) ∨ (1,1,0) ∨ (1,0,0) Derivation: k a ∧ (k b ∨ ¬k c ) = (k a ∧ k b ) ∨ (k a ∧ ¬k c ) = (k a ∧ k b ∧ k c ) ∨ (k a ∧ k b ∧ ¬k c ) ∨ (k a ∧ ¬k b ∧ ¬k c ), i.e., the tuples (1,1,1), (1,1,0), (1,0,0) over (k a, k b, k c )
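A sketch of the matching rule: a document is relevant if its binary pattern over the query terms equals one of the conjunctive components of q_dnf. The term names and example documents are illustrative.

```python
def boolean_sim(doc_terms, q_dnf, query_terms):
    """sim(d, q) = 1 if the document's 0/1 pattern over the query terms
    matches one of the conjunctive components of q_dnf, else 0."""
    pattern = tuple(1 if t in doc_terms else 0 for t in query_terms)
    return 1 if pattern in q_dnf else 0

# q = ka AND (kb OR NOT kc), written in disjunctive normal form over (ka, kb, kc)
q_dnf = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}
print(boolean_sim({"ka", "kb"}, q_dnf, ("ka", "kb", "kc")))  # 1 (pattern 1,1,0)
print(boolean_sim({"kb"},       q_dnf, ("ka", "kb", "kc")))  # 0 (pattern 0,1,0)
```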

65 Hsin-Hsi Chen64 Boolean Model (Continued) advantage: simple disadvantage –binary decision (relevant or non-relevant) without a grading scale –exact match (no partial match) e.g., d j =(0,1,0) is non-relevant to q=k a ∧ (k b ∨ ¬k c ) –retrieves too few or too many documents

66 Hsin-Hsi Chen65 Basic Vector Space Model Term vector representation of documents D i =(a i1, a i2, …, a it ) queries Q j =(q j1, q j2, …, q jt ) t distinct terms are used to characterize content. Each term is identified with a term vector T. t vectors are linearly independent. Any vector is represented as a linear combination of the t term vectors. The rth document D r can be represented as a document vector, written as
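The linear-combination form referred to above is presumably the standard one:

```latex
\vec{D_r} = \sum_{i=1}^{t} a_{ri}\,\vec{T_i}
```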

67 Hsin-Hsi Chen66 Document representation in vector space a document vector in a two-dimensional vector space

68 Hsin-Hsi Chen67 Similarity Measure measured by the inner product of two vectors: x · y = |x| |y| cos α document-query similarity: how to determine the vector components and the term correlations?

69 Hsin-Hsi Chen68 Similarity Measure (Continued) vector components

70 Hsin-Hsi Chen69 Similarity Measure (Continued) term correlations T i · T j are not available assumption: term vectors are orthogonal, i.e., T i · T j =0 (i ≠ j) and T i · T i =1 Assume that terms are uncorrelated. The similarity between a document and a query then reduces to a sum of products of the corresponding vector components.

71 Hsin-Hsi Chen70 Sample query-document similarity computation D 1 =2T 1 +3T 2 +5T 3 D 2 =3T 1 +7T 2 +1T 3 Q=0T 1 +0T 2 +2T 3 similarity computations for uncorrelated terms sim(D 1,Q)=2·0+3·0+5·2=10 sim(D 2,Q)=3·0+7·0+1·2=2 D 1 is preferred

72 Hsin-Hsi Chen71 Sample query-document similarity computation (Continued) term correlations: T 1 ·T 1 =1, T 1 ·T 2 =0.5, T 1 ·T 3 =0, T 2 ·T 2 =1, T 2 ·T 3 =-0.2, T 3 ·T 3 =1 similarity computations for correlated terms sim(D 1,Q)=(2T 1 +3T 2 +5T 3 )·(0T 1 +0T 2 +2T 3 ) =4T 1 ·T 3 +6T 2 ·T 3 +10T 3 ·T 3 =4·0+6·(-0.2)+10·1=8.8 sim(D 2,Q)=(3T 1 +7T 2 +1T 3 )·(0T 1 +0T 2 +2T 3 ) =6T 1 ·T 3 +14T 2 ·T 3 +2T 3 ·T 3 =6·0+14·(-0.2)+2·1=-0.8 D 1 is preferred
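Both computations can be reproduced with the term-correlation matrix T from the slide, assuming the similarity is the bilinear form dᵀ T q (with T = I in the uncorrelated case):

```python
import numpy as np

T = np.array([[1.0, 0.5, 0.0],     # term-term correlations from the slide
              [0.5, 1.0, -0.2],
              [0.0, -0.2, 1.0]])
D1 = np.array([2.0, 3.0, 5.0])
D2 = np.array([3.0, 7.0, 1.0])
Q  = np.array([0.0, 0.0, 2.0])

print(D1 @ Q, D2 @ Q)          # orthogonal (uncorrelated) terms: 10.0, 2.0
print(D1 @ T @ Q, D2 @ T @ Q)  # correlated terms: 8.8, -0.8
```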

73 Hsin-Hsi Chen72 Vector Model w i,j : a positive, non-binary weight for (k i,d j ) w i,q : a positive, non-binary weight for (k i,q) q=(w 1,q, w 2,q, …, w t,q ): a query vector, where t is the total number of index terms in the system d j = (w 1,j, w 2,j, …, w t,j ): a document vector

74 Hsin-Hsi Chen73 Similarity of document d j w.r.t. query q The correlation between vectors d j and q | q | does not affect the ranking | d j | provides a normalization [Figure: the angle θ between q and d j ; sim(d j,q)=cos(d j,q)]
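In standard notation, the cosine ranking formula referred to here is:

```latex
\mathrm{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|}
  = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}
         {\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\;\sqrt{\sum_{i=1}^{t} w_{i,q}^2}}
```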

75 Hsin-Hsi Chen74 document ranking Similarity (i.e., sim(q, d j )) varies from 0 to 1. Retrieve the documents with a degree of similarity above a predefined threshold (allow partial matching)

76 Hsin-Hsi Chen75 term weighting techniques IR problem: one of clustering –user query: a specification of a set A of objects –clustering problem: determine which documents are in the set A (relevant), and which ones are not (non-relevant) –intra-cluster similarity the features that better describe the objects in the set A tf factor in the vector model: the raw frequency of a term k i inside a document d j –inter-cluster similarity the features that better distinguish the objects in the set A from the remaining objects in the collection C idf factor (inverse document frequency) in the vector model: the inverse of the frequency of a term k i among the documents in the collection

77 Hsin-Hsi Chen76 Definition of tf N: total number of documents in the system n i : the number of documents in which the index term k i appears freq i,j : the raw frequency of term k i in the document d j f i,j : the normalized frequency of term k i in document d j, obtained by dividing freq i,j by the maximum raw frequency of any term l in d j (so f i,j lies in 0~1)

78 Hsin-Hsi Chen77 Definition of idf and tf-idf scheme idf i : inverse document frequency for k i w i,j : term-weighting by tf-idf scheme query term weight (Salton and Buckley) freq i,q : the raw frequency of the term k i in q
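The formulas on these two slides (normalized tf, idf, the tf-idf document weight, and the Salton-Buckley query weight) are commonly given as follows:

```latex
f_{i,j} = \frac{\mathrm{freq}_{i,j}}{\max_l \mathrm{freq}_{l,j}}, \qquad
\mathrm{idf}_i = \log\frac{N}{n_i}, \qquad
w_{i,j} = f_{i,j} \times \log\frac{N}{n_i}, \qquad
w_{i,q} = \Bigl(0.5 + \frac{0.5\,\mathrm{freq}_{i,q}}{\max_l \mathrm{freq}_{l,q}}\Bigr)\times \log\frac{N}{n_i}
```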

79 Hsin-Hsi Chen78 Analysis of vector model advantages –its term-weighting scheme improves retrieval performance –its partial matching strategy allows retrieval of documents that approximate the query conditions –its cosine ranking formula sorts the documents according to their degree of similarity to the query disadvantages –index terms are assumed to be mutually independent

80 Hsin-Hsi Chen79 Probabilistic Model Given a query, there is an ideal answer set –a set of documents which contains exactly the relevant documents and no other query process –a process of specifying the properties of an ideal answer set problem: what are the properties?

81 Hsin-Hsi Chen80 Probabilistic Model (Continued) Generate a preliminary probabilistic description of the ideal answer set Initiate an interaction with the user –User looks at the retrieved documents and decide which ones are relevant and which ones are not –System uses this information to refine the description of the ideal answer set –Repeat the process many times.

82 Hsin-Hsi Chen81 Probabilistic Principle Given a user query q and a document d j in the collection, the probabilistic model estimates the probability that user will find d j relevant assumptions –The probability of relevance depends on query and document representations only –There is a subset of all documents which the user prefers as the answer set for the query q Given a query, the probabilistic model assigns to each document dj a measure of its similarity to the query

83 Hsin-Hsi Chen82 Probabilistic Principle w i,j ∈{0,1}, w i,q ∈{0,1}: the index term weight variables are all binary q: a query, which is a subset of index terms R: the set of documents known to be relevant R̄ (complement of R): the set of documents known to be non-relevant P(R|d j ): the probability that the document d j is relevant to the query q P(R̄|d j ): the probability that d j is non-relevant to q

84 Hsin-Hsi Chen83 similarity sim(d j,q): the similarity of the document d j to the query q By definition, sim(d j,q) = P(R|d j ) / P(R̄|d j ); by Bayes’ rule this equals [P(d j |R) P(R)] / [P(d j |R̄) P(R̄)] P(R) and P(R̄) are the same for all documents, so they can be dropped from the ranking P(d j |R): the probability of randomly selecting the document d j from the set R of relevant documents P(R): the probability that a document randomly selected from the entire collection is relevant

85 Hsin-Hsi Chen84 P(k i |R): the probability that the index term k i is present in a document randomly selected from the set R. P(¬k i |R): the probability that the index term k i is not present in a document randomly selected from the set R. The independence assumption of index terms is then applied.
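With the independence assumption, and after dropping factors that are constant for all documents, the ranking formula takes its usual form (a standard result, shown here since the slide's equation is not preserved):

```latex
\mathrm{sim}(d_j, q) \;\sim\; \sum_{i=1}^{t} w_{i,q}\, w_{i,j}
  \left( \log\frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
       + \log\frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
```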

86 Hsin-Hsi Chen85 Problem: where is the set R?

87 Hsin-Hsi Chen86 Initial guess P(k i |R) is constant for all index terms k i (typically 0.5). The distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection (assuming N>>|R|, so N-|R| ≈ N).

88 Hsin-Hsi Chen87 Initial ranking V: a subset of the documents initially retrieved and ranked by the probabilistic model (top r documents) V i : subset of V composed of documents which contain the index term k i Approximate P(k i |R) by the distribution of the index term k i among the documents retrieved so far. Approximate P(k i |R̄) by considering that all the non-retrieved documents are non-relevant.
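In formulas, as usually stated for this model:

```latex
\text{initial guess: } P(k_i \mid R) = 0.5, \quad P(k_i \mid \bar{R}) = \frac{n_i}{N};
\qquad
\text{after retrieving } V: \quad P(k_i \mid R) = \frac{V_i}{V}, \quad
P(k_i \mid \bar{R}) = \frac{n_i - V_i}{N - V}
```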

89 Hsin-Hsi Chen88 Small values of V and V i cause problems (e.g., V=1 and V i =0); two alternative adjustment factors are commonly used, as shown below.
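The two adjustments usually cited add either a constant 0.5 or the fraction n_i/N to the counts:

```latex
P(k_i \mid R) = \frac{V_i + 0.5}{V + 1}, \quad
P(k_i \mid \bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}
\qquad\text{or}\qquad
P(k_i \mid R) = \frac{V_i + n_i/N}{V + 1}, \quad
P(k_i \mid \bar{R}) = \frac{n_i - V_i + n_i/N}{N - V + 1}
```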

90 Hsin-Hsi Chen89 Analysis of Probabilistic Model advantage –documents are ranked in decreasing order of their probability of being relevant disadvantages –the need to guess the initial separation of documents into relevant and non-relevant sets –do not consider the frequency with which an index terms occurs inside a document –the independence assumption for index terms

91 Hsin-Hsi Chen90 Comparison of classic models Boolean model: the weakest classic model Vector model is expected to outperform the probabilistic model with general collections (Salton and Buckley)

92 Hsin-Hsi Chen91 Alternative Set Theoretic Models - Fuzzy Set Model Model –a query term: a fuzzy set –a document: has a degree of membership in this set –membership function Associates a membership value with each element (document) of the class 0: no membership in the set 1: full membership 0~1: marginal elements of the set

93 Hsin-Hsi Chen92 Fuzzy Set Theory A fuzzy subset A (a class) of a universe of discourse U (here, the documents) is characterized by a membership function µ A : U → [0,1] which associates with each element u of U a number µ A (u) in the interval [0,1] –complement: µ ¬A (u) = 1 - µ A (u) –union: µ A∪B (u) = max(µ A (u), µ B (u)) –intersection: µ A∩B (u) = min(µ A (u), µ B (u))

94 Hsin-Hsi Chen93 Examples Assume U={d 1, d 2, d 3, d 4, d 5, d 6 }. Let A and B be {d 1, d 2, d 3 } and {d 2, d 3, d 4 }, respectively. Assume µ A ={d 1 :0.8, d 2 :0.7, d 3 :0.6, d 4 :0, d 5 :0, d 6 :0} and µ B ={d 1 :0, d 2 :0.6, d 3 :0.8, d 4 :0.9, d 5 :0, d 6 :0}. Complement of A = {d 1 :0.2, d 2 :0.3, d 3 :0.4, d 4 :1, d 5 :1, d 6 :1}; A ∪ B = {d 1 :0.8, d 2 :0.7, d 3 :0.8, d 4 :0.9, d 5 :0, d 6 :0}; A ∩ B = {d 1 :0, d 2 :0.6, d 3 :0.6, d 4 :0, d 5 :0, d 6 :0}.
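A small sketch that reproduces this example with the operators given above (1-x for complement, max for union, min for intersection); under these operators the intersection value for d1 is min(0.8, 0) = 0:

```python
def f_complement(A):
    return {d: round(1 - m, 2) for d, m in A.items()}

def f_union(A, B):
    return {d: max(A[d], B[d]) for d in A}

def f_intersection(A, B):
    return {d: min(A[d], B[d]) for d in A}

A = {"d1": 0.8, "d2": 0.7, "d3": 0.6, "d4": 0.0, "d5": 0.0, "d6": 0.0}
B = {"d1": 0.0, "d2": 0.6, "d3": 0.8, "d4": 0.9, "d5": 0.0, "d6": 0.0}
print(f_complement(A))       # d1: 0.2, d2: 0.3, d3: 0.4, d4: 1.0, ...
print(f_union(A, B))         # d1: 0.8, d2: 0.7, d3: 0.8, d4: 0.9, ...
print(f_intersection(A, B))  # d1: 0.0, d2: 0.6, d3: 0.6, d4: 0.0, ...
```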

95 Hsin-Hsi Chen94 Fuzzy Information Retrieval basic idea –Expand the set of index terms in the query with related terms (from the thesaurus) such that additional relevant documents can be retrieved –A thesaurus can be constructed by defining a term-term correlation matrix c whose rows and columns are associated to the index terms in the document collection keyword connection matrix

96 Hsin-Hsi Chen95 Fuzzy Information Retrieval (Continued) normalized correlation factor c i,l between two terms k i and k l (0~1) In the fuzzy set associated to each index term k i, a document d j has a degree of membership µ i,j where n i is # of documents containing term k i n l is # of documents containing term k l n i,l is # of documents containing k i and k l
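In the usual formulation (the slide's own equations are not preserved in the transcript):

```latex
c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}, \qquad
\mu_{i,j} = 1 - \prod_{k_l \in d_j} \bigl(1 - c_{i,l}\bigr)
```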

97 Hsin-Hsi Chen96 Fuzzy Information Retrieval (Continued) physical meaning –A document d j belongs to the fuzzy set associated to the term k i if its own terms are related to k i, i.e., µ i,j =1. –If there is at least one index term k l of d j which is strongly related to the index k i, then µ i,j ≈1. k i is a good fuzzy index –When all index terms of d j are only loosely related to k i, µ i,j ≈0. k i is not a good fuzzy index

98 Hsin-Hsi Chen97 Example q=k a ∧ (k b ∨ ¬k c ) = (k a ∧ k b ∧ k c ) ∨ (k a ∧ k b ∧ ¬k c ) ∨ (k a ∧ ¬k b ∧ ¬k c ) = cc 1 + cc 2 + cc 3 D a : the fuzzy set of documents associated with the index term k a ; d j ∈ D a has a degree of membership µ a,j greater than a predefined threshold K D̄ a : the fuzzy set of documents associated with ¬k a (the negation of index term k a )

99 Hsin-Hsi Chen98 Example Query q=k a ∧ (k b ∨ ¬k c ); disjunctive normal form q dnf =(1,1,1) ∨ (1,1,0) ∨ (1,0,0) (1) the degree of membership in a disjunctive fuzzy set is computed using an algebraic sum (instead of the max function), which behaves more smoothly (2) the degree of membership in a conjunctive fuzzy set is computed using an algebraic product (instead of the min function)

100 Hsin-Hsi Chen99 Alternative Algebraic Model: Generalized Vector Space Model independence of index terms –k i : a vector associated with the index term k i –the set of vectors {k 1, k 2, …, k t } is linearly independent –orthogonal: k i · k j = 0 for i ≠ j –The index term vectors are assumed linearly independent but are not pairwise orthogonal in the generalized vector space model –The index term vectors, which are not seen as the basis of the space, are composed of smaller components derived from the particular collection.

101 Hsin-Hsi Chen100 Generalized Vector Space Model {k 1, k 2, …, k t }: index terms in a collection w i,j : binary weights associated with the term-document pair {k i, d j } The patterns of term co-occurrence (inside documents) can be represented by a set of 2 t minterms g i (m j ): return the weight {0,1} of the index term k i in the minterm m j (1  i  t) m 1 =(0, 0, …, 0): point to documents containing none of index terms m 2 =(1, 0, …, 0): point to documents containing the index term k 1 only m 3 =(0,1,…,0): point to documents containing the index term k 2 only m 4 =(1,1,…,0): point to documents containing the index terms k 1 and k 2 … m 2 t =(1, 1, …, 1): point to documents containing all the index terms

102 Hsin-Hsi Chen101 Generalized Vector Space Model (Continued) m i (2 t -tuple vector) is associated with minterm m i (t-tuple vector) e.g., m 4 is associated with m 4 containing k 1 and k 2, and no others co-occurrence of index terms inside documents: dependencies among index terms (the set of m i are pairwise orthogonal)

103 Hsin-Hsi Chen102 Example (t=3). Minterms m r and their associated vectors m r : m 1 =(0,0,0) → m 1 =(1,0,0,0,0,0,0,0); m 2 =(0,0,1) → m 2 =(0,1,0,0,0,0,0,0); m 3 =(0,1,0) → m 3 =(0,0,1,0,0,0,0,0); m 4 =(0,1,1) → m 4 =(0,0,0,1,0,0,0,0); m 5 =(1,0,0) → m 5 =(0,0,0,0,1,0,0,0); m 6 =(1,0,1) → m 6 =(0,0,0,0,0,1,0,0); m 7 =(1,1,0) → m 7 =(0,0,0,0,0,0,1,0); m 8 =(1,1,1) → m 8 =(0,0,0,0,0,0,0,1). Documents: d1 (t1), d2 (t3), d3 (t3), d4 (t1), d5 (t2), d6 (t2), d7 (t2 t3), d8 (t2 t3), d9 (t2), d10 (t2 t3), d11 (t1 t2), d12 (t1 t3), d13 (t1 t2), d14 (t1 t2), d15 (t1 t2 t3), d16 (t1 t2), d17 (t1 t2), d18 (t1 t2), d19 (t1 t2 t3), d20 (t1 t2).

106 Hsin-Hsi Chen105 Generalized Vector Space Model (Continued) Determine the index vector k i associated with the index term k i Collect all the vectors m r in which the index term k i is in state 1. Sum up w i,j associated with the index term k i and document d j whose term occurrence pattern coincides with minterm m r
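The construction described above is usually written as follows, where the sums run over the minterms m r in which k i is in state 1, and c i,r collects the weights of the documents whose term-occurrence pattern coincides with m r (shown here because the slide's formulas are missing):

```latex
\vec{k_i} = \frac{\sum_{r:\, g_i(m_r)=1} c_{i,r}\, \vec{m_r}}
                 {\sqrt{\sum_{r:\, g_i(m_r)=1} c_{i,r}^{\,2}}},
\qquad
c_{i,r} = \sum_{d_j:\ \mathrm{pattern}(d_j)=m_r} w_{i,j}
```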

107 Hsin-Hsi Chen106 Generalized Vector Space Model (Continued) k i · k j quantifies a degree of correlation between k i and k j the standard cosine similarity is adopted

109 Hsin-Hsi Chen108 Comparison with Standard Vector Space Model d1 (t1): (w 1,1,0,0)d11 (t1 t2) d2 (t3): (0,0,w 3,2 )d12 (t1 t3) d3 (t3): (0,0,w 3,3 )d13 (t1 t2) d4 (t1): (w 1,4,0,0)d14 (t1 t2) d5 (t2): (0,w 2,5,0)d15 (t1 t2 t3) d6 (t2): (0,w 2,6,0)d16 (t1 t2) d7 (t2 t3): (0,w 2,7,w 3,7 )d17 (t1 t2) d8 (t2 t3): (0,w 2,8,w 3,8 )d18 (t1 t2) d9 (t2): (0,w 2,9,0)d19 (t1 t2 t3) d10 (t2 t3): (0,w 2,10,w 3,10 )d20 (t1 t2)

110 Hsin-Hsi Chen109 Latent Semantic Indexing Model representation of documents and queries by index terms –problem 1: many unrelated documents might be included in the answer set –problem 2: relevant documents which are not indexed by any of the query keywords are not retrieved possible solution: concept matching instead of index term matching –application in cross-language information retrieval

111 Hsin-Hsi Chen110 basic idea Map each document and query vector into a lower dimensional space which is associated with concepts Retrieval in the reduced space may be superior to retrieval in the space of index terms

112 Hsin-Hsi Chen111 Definition t: the number of index terms in the collection N: the total number of documents M=(M ij ): a term-document association matrix with t rows and N columns M ij : a weight w i,j associated with the term- document pair [k i, d j ] (e.g., using tf-idf)

113 Hsin-Hsi Chen112 Singular Value Decomposition M = K S D^t, where K and D are orthogonal matrices and S = diag(σ 1, σ 2, …, σ n ) is a diagonal matrix with σ 1 ≥ σ 2 ≥ … ≥ σ n ≥ 0.

114 Hsin-Hsi Chen113 Since K and D are orthogonal and (AB) t = B t A t, it follows that M M t = K S D t D S K t = K S 2 K t and M t M = D S 2 D t.

115 Hsin-Hsi Chen114 For a symmetric matrix A, A = Q D Q t, where λ 1, λ 2, …, λ n are the eigenvalues of A, q k is the eigenvector of A corresponding to λ k, Q is the matrix whose columns are the q k, and D = diag(λ 1, …, λ n ).

116 Hsin-Hsi Chen115 Singular Value Decomposition According to the above, K is the matrix of eigenvectors of M M t and D is the matrix of eigenvectors of M t M.

117 Hsin-Hsi Chen116 Comparing with A = Q D Q t (Q the matrix of eigenvectors of A, D the diagonal matrix of eigenvalues), we obtain the decomposition M = K S D t of rank r; choosing s < r reduces the concept space.

118 Hsin-Hsi Chen117 Consider only the s largest singular values of S, σ 1 ≥ σ 2 ≥ … ≥ σ s, together with the corresponding columns of K and D, giving M s = K s S s D s t. The resultant M s matrix is the matrix of rank s which is closest to the original matrix M in the least-squares sense (s<<t, s<<N).

119 Hsin-Hsi Chen118 Ranking in LSI query: a pseudo-document in the original term-document matrix M –the query is modeled as the document with number 0 –the matrix M s t M s quantifies the relationship between every pair of documents in the reduced space; its first row then provides the ranks of all documents w.r.t. this query
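A minimal sketch of LSI ranking with numpy; the term-document matrix, the choice of s, and the query are illustrative, and the scoring step compares the raw query vector with the rank-s document representations (equivalent to folding the query into the concept space):

```python
import numpy as np

# toy term-document matrix M (t terms x N documents); the weights are illustrative
M = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])

K, svals, Dt = np.linalg.svd(M, full_matrices=False)   # M = K S D^t
s = 2                                                   # keep only the s largest singular values
Ks, Ss, Dts = K[:, :s], np.diag(svals[:s]), Dt[:s, :]
Ms = Ks @ Ss @ Dts                                      # rank-s approximation of M

q = np.array([1.0, 1.0, 0.0, 0.0])   # query as a pseudo-document over the terms
scores = Ms.T @ q                    # similarity of each document to the query
print(np.argsort(-scores))           # document indices in decreasing order of score
```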

120 Hsin-Hsi Chen119 Structured Text Retrieval Models Definition –Combine information on text content with information on the document structure –e.g., same-page(near(‘atomic holocaust’, Figure(label(‘earth’)))) Expressive power vs. evaluation efficiency –a model based on non-overlapping lists –a model based on proximal nodes Terminology –match point: position in the text of a sequence of words that matches the user query –region: a contiguous portion of the text –node: a structural component of the document (chap, sec, …)

121 Hsin-Hsi Chen120 Non-Overlapping Lists divide the whole text of each document into non-overlapping text regions (lists) Text regions from distinct lists might overlap indexing lists –L0 Chapters: a list of all chapters in the document –L1 Sections: a list of all sections in the document –L2 Subsections: a list of all subsections in the document –L3 Subsubsections: a list of all subsubsections in the document [Figure: character positions of the chapter, section, and subsection regions within the example text.]

122 Hsin-Hsi Chen121 Non-Overlapping Lists (Continued) Data structure –a single inverted file –each structural component stands as an entry –for each entry, there is a list of text regions as a list of occurrences Operations –Select a region which contains a given word –Select a region A which does not contain any other region B (where B belongs to a list distinct from the list for A) –Select a region not contained within any other region –… Recall that there is another inverted file for the words in the text

123 Hsin-Hsi Chen122 Inverted Files File is represented as an array of indexed records.

124 Hsin-Hsi Chen123 Inverted-file process The record-term array is inverted (transposed).

125 Hsin-Hsi Chen124 Inverted-file process (Continued) Take two or more rows of an inverted term-record array, and produce a single combined list of record identifiers. Query (term2 AND term3): term2 row 1100, term3 row 0111; their bitwise AND is 0100, i.e., record R2.
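A sketch of the inversion and of the AND query via posting-list intersection; the records and terms are illustrative, chosen to mirror the example above:

```python
from collections import defaultdict

# record-term assignments (illustrative)
records = {"R1": {"term1", "term2"},
           "R2": {"term2", "term3", "term4"},
           "R3": {"term1", "term3"},
           "R4": {"term3", "term4"}}

# invert the record-term array: for each term, the set of records containing it
inverted = defaultdict(set)
for rec, terms in records.items():
    for t in terms:
        inverted[t].add(rec)

# query: term2 AND term3 -> intersect the two posting lists
print(inverted["term2"] & inverted["term3"])   # {'R2'}
```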

126 Hsin-Hsi Chen125 Extensions of Inverted Index Operations (Distance Constraints) Distance Constraints –(A within sentence B) terms A and B must co-occur in a common sentence –(A adjacent B) terms A and B must occur adjacently in the text

127 Hsin-Hsi Chen126 Extensions of Inverted Index Operations (Distance Constraints) Implementation –include term-location in the inverted indexes information:{R345, R348, R350, …} retrieval:{R123, R128, R345, …} –include sentence-location in the indexes information: {R345, 25; R345, 37; R348, 10; R350, 8; …} retrieval: {R123, 5; R128, 25; R345, 37; R345, 40; …}

128 Hsin-Hsi Chen127 Extensions of Inverted Index Operations (Distance Constraints) –include paragraph numbers in the indexes sentence numbers within paragraphs word numbers within sentences information: {R345, 2, 3, 5; …} retrieval: {R345, 2, 3, 6; …} –query examples (information adjacent retrieval) (information within five words retrieval) –cost: the size of indexes

129 Hsin-Hsi Chen128 Model Based on Proximal Nodes hierarchical vs. flat indexing structures –hierarchical index: Chapters, Sections, Subsections, Subsubsections, paragraphs, pages, lines … (nodes: positions in the text) –flat index: an inverted list per word, e.g., an inverted list for ‘holocaust’ (entries: positions in the text)

130 Hsin-Hsi Chen129 Model Based on Proximal Nodes (Continued) query language –Specification of regular expressions –Reference to structural components by name –Combination –Example Search for sections, subsections, or subsubsections which contain the word ‘holocaust’ [(*section) with (‘holocaust’)]

131 Hsin-Hsi Chen130 Model Based on Proximal Nodes (Continued) Basic algorithm –Traverse the inverted list for the term ‘holocaust’ –For each entry in the list (i.e., an occurrence), search the hierarchical index looking for sections, subsections, and sub-subsections Revised algorithm –For the first entry, search as before –Let the last matching structural component be the innermost matching component –Verify the innermost matching component also matches the second entry. If it does, the larger structural components above it also do. nearby nodes

132 Hsin-Hsi Chen131 Models for Browsing Browsing vs. searching –The goal of a searching task is clearer in the mind of the user than the goal of a browsing task Models –Flat browsing –Structure guided browsing –The hypertext model

133 Hsin-Hsi Chen132 Models for Browsing Flat organization –Documents are represented as dots in a 2-D plane –Documents are represented as elements in a 1-D list, e.g., the results of a search engine Structure guided browsing –Documents are organized in a directory, which groups documents covering related topics Hypertext model –Navigating the hypertext: a traversal of a directed graph

134 Hsin-Hsi Chen133 Trends and Research Issues Library systems –Cognitive and behavioral issues, oriented particularly at a better understanding of which criteria the users adopt to judge relevance Specialized retrieval systems –e.g., legal and business documents –how to retrieve all relevant documents without retrieving a large number of unrelated documents The Web –The user does not know what he wants or has great difficulty in formulating his request –How the paradigm adopted for the user interface affects the ranking –The indexes maintained by various Web search engines are almost disjoint

