Under The Hood [Part II] Web-Based Information Architectures MSEC 20-760 Mini II Jaime Carbonell.

1 Under The Hood [Part II] Web-Based Information Architectures MSEC 20-760 Mini II Jaime Carbonell

2 Today's Topics
Term weighting in detail
Generalized Vector Space Model (GVSM)
Maximal Marginal Relevance
Summarization as Passage Retrieval

3 Term Weighting Revisited (1) Definitions
w_i, the "i-th term": a word, stemmed word, or indexed phrase.
D_j, the "j-th document": a unit of indexed text, e.g. a web page, a news report, an article, a patent, a legal case, a book, a chapter of a book, etc.

4 Term Weighting Revisited (2) Definitions
C, "the collection": the full set of indexed documents (e.g. the New York Times archive, the Web, ...).
Tf(w_i, D_j), "term frequency": the number of times w_i occurs in document D_j. Tf is sometimes normalized by dividing by the frequency of the most frequent non-stop term in the document [Tf_norm = Tf / Tf_max].
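As a minimal illustration of the Tf definitions above (not part of the original slides), the sketch below computes raw and normalized term frequencies in Python; the whitespace tokenization and the tiny stop-word list are simplifying assumptions.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and"}  # assumed toy stop-word list

def term_frequencies(document_text):
    """Raw Tf for every non-stop term in one document (toy whitespace tokenization)."""
    terms = [t.lower() for t in document_text.split() if t.lower() not in STOP_WORDS]
    return Counter(terms)

def normalized_tf(tf_counts):
    """Tf_norm = Tf / Tf_max, dividing by the count of the most frequent non-stop term."""
    tf_max = max(tf_counts.values())
    return {term: count / tf_max for term, count in tf_counts.items()}

tf = term_frequencies("the heart disease study and the stroke study")
print(tf)                 # Counter({'study': 2, 'heart': 1, 'disease': 1, 'stroke': 1})
print(normalized_tf(tf))  # 'study' -> 1.0, the rest -> 0.5
```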

5 Term Weighting Revisited (3) Definitions
Df(w_i, C), "document frequency": the number of documents from C in which w_i occurs. Df may be normalized by dividing it by the total number of documents in C.
IDf(w_i, C), "inverse document frequency": [Df(w_i, C) / size(C)]^-1. Most often log2(IDf) is used, rather than IDf directly.
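A companion sketch for Df and the logarithmic IDf, again illustrative only; representing each document as a set of terms is an assumption made for brevity.

```python
import math

def document_frequency(term, collection):
    """Df(term, C): how many documents in the collection contain the term."""
    return sum(1 for doc_terms in collection if term in doc_terms)

def log_idf(term, collection):
    """log2(IDf) = log2(size(C) / Df(term, C)), following the definition above."""
    df = document_frequency(term, collection)
    return math.log2(len(collection) / df) if df else 0.0

collection = [{"heart", "disease"}, {"stroke", "heart"}, {"patent", "law"}]
print(document_frequency("heart", collection))  # 2
print(log_idf("heart", collection))             # log2(3/2) ≈ 0.585
```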

6 Term Weighting Revisited (4) TfIDf Term Weights
In general: TfIDf(w_i, D_j, C) = F1(Tf(w_i, D_j)) * F2(IDf(w_i, C))
Usually F1 = 0.5 + log2(Tf), or Tf/Tf_max, or 0.5 + 0.5*Tf/Tf_max
Usually F2 = log2(IDf)
In the SMART IR system: TfIDf(w_i, D_j, C) = [0.5 + 0.5*Tf(w_i, D_j)/Tf_max(D_j)] * log2(IDf(w_i, C))
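Putting the pieces together, here is a hedged sketch of the SMART-style weight from this slide; the toy documents and pre-tokenized input are assumptions, and IDf is folded directly into the logarithm as log2(size(C)/Df).

```python
import math
from collections import Counter

def smart_tfidf(term, doc_tokens, collection_tokens):
    """TfIDf(w, D, C) = [0.5 + 0.5 * Tf/Tf_max] * log2(size(C)/Df), per the slide."""
    tf_counts = Counter(doc_tokens)
    tf = tf_counts[term]
    tf_max = max(tf_counts.values())
    df = sum(1 for doc in collection_tokens if term in doc)
    if tf == 0 or df == 0:
        return 0.0
    return (0.5 + 0.5 * tf / tf_max) * math.log2(len(collection_tokens) / df)

docs = [
    ["heart", "disease", "heart", "study"],
    ["stroke", "ventricular", "study"],
    ["patent", "law", "case"],
]
print(smart_tfidf("heart", docs[0], docs))  # (0.5 + 0.5*2/2) * log2(3/1) ≈ 1.585
```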

7 Term Weighting beyond TfIDf (1) Probabilistic Models
Old style (see textbooks): improves precision-recall slightly.
Full statistical language modeling (CMU): improves precision-recall more significantly; difficult to compute efficiently.

8 Term Weighting beyond TfIDf (2)
Neural Networks: theoretically attractive, but do not scale up at all, unfortunately.
Fuzzy Sets: not deeply researched; scaling difficulties.

9 Term Weighting beyond TfIDf (3) Natural Language Analysis
Analyze and understand the documents and queries first.
Ultimate IR method, in theory.
Generally, NL understanding is an unsolved problem.
Scale-up challenges, even if we could do it.
But: shown to improve IR for very limited domains.

10 Generalized Vector Space Model (1) Principles
Define terms by their occurrence patterns in documents.
Define query terms in the same way.
Compute similarity by document-pattern overlap for terms in D and Q.
Use standard cosine similarity and either binary or TfIDf weights.

11 Generalized Vector Space Model (2) Advantages
Automatically calculates partial similarity: if "heart disease", "stroke", and "ventricular" co-occur in many documents, then a query containing only one of these terms still gives partial credit to documents containing the others, proportional to their document co-occurrence ratio.
No need to do query expansion or relevance feedback.

12 Generalized Vector Space Model (3) Disadvantages
Computationally expensive.
Performance ≈ vector space model + query expansion.

13 GVSM, How it Works (1)
Represent the collection as a vector of documents: let C = [D_1, D_2, ..., D_m].
Represent each term by its distributional frequency: let t_i = [Tf(t_i, D_1), Tf(t_i, D_2), ..., Tf(t_i, D_m)].
Term-to-term similarity is computed as: Sim(t_i, t_j) = cos(vec(t_i), vec(t_j)).
Hence, highly co-occurring terms like "Arafat" and "PLO" will be treated as near-synonyms for retrieval.
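A minimal sketch of these term vectors, not from the original slides: each term is mapped to its Tf profile across the collection and compared with cosine similarity. The toy documents are assumptions chosen to mirror the "Arafat"/"PLO" example.

```python
import math
from collections import defaultdict

def term_vectors(collection):
    """Each term -> its Tf across documents: t_i = [Tf(t_i, D_1), ..., Tf(t_i, D_m)]."""
    vectors = defaultdict(lambda: [0] * len(collection))
    for j, doc_tokens in enumerate(collection):
        for token in doc_tokens:
            vectors[token][j] += 1
    return dict(vectors)

def cosine(u, v):
    """Cosine similarity between two term (or document) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [
    ["arafat", "plo", "peace"],
    ["arafat", "plo", "talks"],
    ["stroke", "heart", "disease"],
]
vecs = term_vectors(docs)
# "arafat" and "plo" occur in exactly the same documents, so their vectors coincide:
print(cosine(vecs["arafat"], vecs["plo"]))     # 1.0
print(cosine(vecs["arafat"], vecs["stroke"]))  # 0.0
```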

14 GVSM, How it Works (2)
And query-document similarity is computed as before: Sim(Q, D) = cos(vec(Q), vec(D)), except that instead of the dot-product calculation, we use a function of the term-to-term similarity computation above.
For instance: Sim(Q, D) = Σ_i [Max_j (sim(q_i, d_j))]
or, normalizing for document and query length: Sim_norm(Q, D) = Σ_i [Max_j (sim(q_i, d_j))] / (|Q| · |D|)
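Continuing the previous sketch (it reuses cosine, vecs, and docs defined there), this illustrative function computes Σ_i max_j sim(q_i, d_j), with a length-normalized variant as an option; dividing by |Q|·|D| is my reading of "normalizing for document & query length" and should be treated as an assumption.

```python
def gvsm_similarity(query_terms, doc_terms, term_vecs, normalize=False):
    """Sim(Q, D): for each query term, take its best term-to-term cosine against
    any document term, and sum these maxima over the query."""
    total = 0.0
    for q in query_terms:
        if q not in term_vecs:
            continue
        best = max(
            (cosine(term_vecs[q], term_vecs[d]) for d in doc_terms if d in term_vecs),
            default=0.0,
        )
        total += best
    if normalize:  # divide by query and document lengths (assumed reading of Sim_norm)
        total /= (len(query_terms) * len(doc_terms)) or 1
    return total

# A query containing only "plo" still matches the first "arafat" document via co-occurrence:
print(gvsm_similarity(["plo"], docs[0], vecs))                  # 1.0
print(gvsm_similarity(["plo"], docs[2], vecs))                  # 0.0
print(gvsm_similarity(["plo"], docs[0], vecs, normalize=True))  # 1.0 / (1*3) ≈ 0.333
```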

15 GVSM, How it Works (3)
Primary problem: more computation (sparse => dense).
Primary benefit: automatic term expansion by the corpus.

16 A Critique of Pure Relevance (1) IR Maximizes Relevance
Precision and recall are relevance measures.
Quality of the documents retrieved is ignored.

17 A Critique of Pure Relevance (2) Other Important Factors
What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, ...?
In IR, we really want to maximize: P(U(f_1, ..., f_n) | Q & {C} & U & H)
where Q = query, {C} = collection set, U = user profile, H = interaction history
...but we don’t yet know how. Darn.

18 Maximal Marginal Relevance (1)
A crude first approximation: novelty => minimal redundancy.
Weighted linear combination: redundancy = cost, relevance = benefit.
Free parameters: k and λ.

19 Maximal Marginal Relevance (2)
MMR(Q, C, R) = Argmax_k, d_i in C [ λ·S(Q, d_i) - (1-λ)·max_{d_j in R} S(d_i, d_j) ]
where S is the similarity function, C the candidate document set, and R the set of documents already ranked.

20 Maximal Marginal Relevance (MMR) (3) Computation of MMR Reranking
1. Standard IR retrieval of top-N docs: let D_r = IR(D, Q, N).
2. Rank the d_i ∈ D_r with max sim(d_i, Q) as the top doc, i.e. let Ranked = {d_i}.
3. Let D_r = D_r \ {d_i}.
4. While D_r is not empty, do:
a. Find d_i with max MMR(D_r, Q, Ranked).
b. Let Ranked = Ranked . d_i (append d_i to Ranked).
c. Let D_r = D_r \ {d_i}.
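The sketch below is one plausible Python rendering of this reranking loop, not the original implementation: it assumes the candidate pool is already the top-N output of a standard retrieval step and that a similarity function is supplied by the caller; the toy vectors and dot-product similarity are assumptions.

```python
def mmr_rerank(query_vec, doc_vecs, similarity, lam=0.7, k=10):
    """Greedy MMR reranking (slide 20): repeatedly pick the candidate that is most
    relevant to the query and least similar to anything already ranked."""
    remaining = dict(doc_vecs)   # D_r: candidate pool, e.g. the top-N docs from standard IR
    ranked = []                  # Ranked: output list, best first

    def mmr_score(doc_id):
        relevance = similarity(query_vec, remaining[doc_id])
        redundancy = max(
            (similarity(remaining[doc_id], doc_vecs[r]) for r in ranked),
            default=0.0,
        )
        return lam * relevance - (1 - lam) * redundancy

    while remaining and len(ranked) < k:
        best = max(remaining, key=mmr_score)
        ranked.append(best)
        del remaining[best]
    return ranked

def dot(u, v):
    """Toy similarity: plain dot product."""
    return sum(a * b for a, b in zip(u, v))

candidates = {"d1": [1, 0, 1], "d2": [1, 0, 1], "d3": [0, 1, 1]}
print(mmr_rerank([1, 1, 1], candidates, dot, lam=0.7, k=2))  # ['d1', 'd3'] -- d2 duplicates d1
```

With λ close to 1 the loop degenerates to pure relevance ranking; smaller λ trades relevance for diversity, which is the "spiral" behaviour sketched on the next slide.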

21 MMR Ranking vs Standard IR
[Diagram comparing MMR ranking with standard IR over a query and its retrieved documents; λ controls the spiral curl of the MMR ranking.]

22 Maximal Marginal Relevance (MMR) (4) Applications:
Ranking retrieved documents from an IR engine.
Ranking passages for inclusion in summaries.

23 Document Summarization in a Nutshell (1) Types of Summaries

Task                                             | Query-relevant (focused)                 | Query-free (generic)
INDICATIVE, for filtering ("Do I read further?") | To filter search engine results          | Short abstracts
CONTENTFUL, for reading in lieu of full doc.     | To solve problems for busy professionals | Executive summaries

24 Document Summarization in a Nutshell (2) Other Dimensions
Single vs. multi-document summarization
Genre-adaptive vs. one-size-fits-all
Single-language vs. translingual
Flat summary vs. hyperlinked pyramid
Text-only vs. multi-media
...

25 Summarization as Passage Retrieval (1) For Query-Driven Summaries
1. Divide the document into passages, e.g. sentences, paragraphs, FAQ pairs, ....
2. Use the query to retrieve the most relevant passages, or better, use MMR to avoid redundancy.
3. Assemble the retrieved passages into a summary.
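An illustrative end-to-end sketch of these three steps, under stated assumptions: sentences as passages via a naive regex splitter, word overlap as the relevance measure, and the MMR criterion from slide 19 for selection. None of these specific choices come from the slides themselves.

```python
import re

def split_into_passages(text):
    """Step 1: naive sentence splitting (a real system might use paragraphs or FAQ pairs)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def overlap_sim(a, b):
    """Toy relevance measure: Jaccard overlap between two token sets."""
    return len(a & b) / (len(a | b) or 1)

def summarize(text, query, lam=0.7, max_passages=2):
    """Steps 2-3: select passages with MMR against the query, then assemble them."""
    passages = split_into_passages(text)
    token_sets = [set(p.lower().split()) for p in passages]
    query_set = set(query.lower().split())
    selected = []
    while len(selected) < min(max_passages, len(passages)):
        def score(i):
            if i in selected:
                return float("-inf")
            relevance = overlap_sim(query_set, token_sets[i])
            redundancy = max((overlap_sim(token_sets[i], token_sets[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        selected.append(max(range(len(passages)), key=score))
    return " ".join(passages[i] for i in sorted(selected))

doc = ("Heart disease risk rises with age. "
       "Stroke shares many risk factors with heart disease. "
       "The museum opened a new wing.")
print(summarize(doc, "heart disease risk"))  # selects the two heart-related sentences
```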

26 Summarization as Passage Retrieval (2) For Generic Summaries
1. Use the title or the top-k TfIDf terms as the query.
2. Proceed as in query-driven summarization.
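A short sketch of step 1, assuming an IDf table is already available (e.g. computed as in the earlier term-weighting sketch); the pseudo-query it returns would then be passed to the query-driven procedure above. The IDf values shown are made up for illustration.

```python
from collections import Counter

def generic_query(doc_tokens, idf, k=5):
    """Use the document's own top-k TfIDf terms as a pseudo-query, then proceed
    exactly as in query-driven summarization."""
    tf = Counter(doc_tokens)
    weights = {term: tf[term] * idf.get(term, 0.0) for term in tf}
    return [term for term, _ in sorted(weights.items(), key=lambda kv: -kv[1])[:k]]

idf = {"heart": 1.6, "disease": 1.6, "stroke": 1.2, "the": 0.01}  # assumed precomputed values
print(generic_query(["the", "heart", "disease", "heart", "stroke", "the"], idf, k=3))
# ['heart', 'disease', 'stroke']
```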

27 Summarization as Passage Retrieval (3) For Multidocument Summaries
1. Cluster the documents into topically related groups.
2. For each group, divide each document into passages and keep track of the source of each passage.
3. Use MMR to retrieve the most relevant non-redundant passages (MMR is necessary for multiple docs).
4. Assemble a summary for each cluster.
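A brief sketch of step 1 only, assuming scikit-learn is available; the choice of TfIDf vectors with k-means and the toy documents are my assumptions, and steps 2-4 would reuse the passage splitting and MMR selection sketched earlier.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_documents(docs, n_clusters=2):
    """Step 1: group topically related documents by clustering their TfIDf vectors."""
    matrix = TfidfVectorizer().fit_transform(docs)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(matrix)
    clusters = {}
    for doc, label in zip(docs, labels):
        clusters.setdefault(label, []).append(doc)
    return clusters

docs = [
    "Heart disease risk rises with age.",
    "Stroke shares risk factors with heart disease.",
    "The central bank raised interest rates.",
    "Markets reacted to the interest rate decision.",
]
for label, group in cluster_documents(docs).items():
    # Steps 2-4 would split each group into passages (tracking sources) and run the
    # MMR passage selection from the query-driven sketch on every cluster.
    print(label, group)
```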

