Water’s getting aggressive

Administrative
- Homework is not due until the socket is closed
  - Typically one week after the socket closes
- Mailing list & blog are set up; invitations were sent to join the blog

8/28

Dear Sir, I am attaching the image of the 2D convergence graph plot I obtained for the last question in the matrices assignment. This I am sending as email because I am doubtful if I can take a printout of this graph before coming for today's session. I will anyway submit this graph in hard copy as soon as I take a print.

6/27/2015 7:20 PM. Copyright © 2001 S. Kambhampati

Finding “Sweet Spots” in computer-mediated cooperative work
- It is possible to get by with techniques blithely ignorant of semantics when you have humans in the loop
  - All you need is to find the right sweet spot, where the computer plays a pre-processing role and presents “potential solutions”
  - ...and the human very gratefully does the in-depth analysis on those few potential solutions
- Examples:
  - The incredible success of the “Bag of Words” model!
    - Bag of letters would be a disaster ;-)
    - Bag of sentences and/or NLP would be good
  - ...but only to your discriminating and irascible searchers ;-)

Collaborative Computing, AKA Brain Cycle Stealing, AKA Computizing Eyeballs
- A lot of exciting research related to the web currently involves “co-opting” the masses to help with large-scale tasks
  - It is like “cycle stealing”, except we are stealing “human brain cycles” (the most idle of the computers, if ever there was one ;-)
    - Remember the mice in The Hitchhiker's Guide to the Galaxy? (...who were running a mass-scale experiment on the humans to figure out the question...)
  - Collaborative knowledge compilation (Wikipedia!)
  - Collaborative curation
  - Collaborative tagging
  - Paid collaboration/contracting
- Many big open issues
  - How do you pose the problem such that it can be solved using collaborative computing?
  - How do you “incentivize” people into letting you steal their brain cycles?
    - Pay them! (Amazon mturk.com)
    - Make it fun (ESP game)

Tapping into the Collective Unconscious
- Another thread of exciting research is driven by the realization that the Web is not random at all!
  - It is written by humans
  - ...so analyzing its structure and content allows us to tap into the collective unconscious
    - Meaning can emerge from syntactic notions such as “co-occurrences” and “connectedness”
- Examples:
  - Analyzing term co-occurrences in web-scale corpora to capture semantic information (today’s paper)
  - Analyzing the link structure of the web graph to discover communities
    - DoD and NSA are very much into this as a way of breaking terrorist cells
  - Analyzing the transaction patterns of customers (collaborative filtering)

Web as a collection of information
- Web viewed as a large collection of __________
  - Text, structured data, semi-structured data
  - (connected) (dynamically changing) (user generated) content
  - (multi-media/updates/transactions etc. ignored for now)
- So what do we want to do with it?
  - Search, directed browsing, aggregation, integration, pattern finding
- How do we do it?
  - Depends on your model (text/structured/semi-structured)

Structure
- How will search and querying on these three types of data differ?
[Figure: three examples (a generic web page containing text, a movie review, an employee record), the label “Semi-Structured”, and the query styles [English], [SQL], [XML]]

Structure helps querying
- Expressive queries
  - Give me all pages that have the key words “Get Rich Quick”
  - Give me the social security numbers of all the employees who have stayed with the company for more than 5 years, and whose yearly salaries are three standard deviations away from the average salary [SQL]
  - Give me all mails from people from ASU written this year, which are relevant to the “get rich quick” keyword [XML]

Does the Web have structured data?
- Isn’t the web all text?
  - The invisible web
    - Most web servers have back-end database servers
    - They dynamically convert (wrap) the structured data into readable English
      - => The capital of India is New Delhi.
    - So, if we can “unwrap” the text, we have structured data!
      - (un)wrappers, learning wrappers etc.
    - Note also that such dynamic pages cannot be crawled...
  - The semi-structured web
    - Most pages are at least “semi”-structured
    - The XML standard is expected to ease the presentation/on-the-wire transfer of such pages. (BUT...)

How to get Structure?
- When the underlying data is already structured, do unwrapping
  - The Web already has a lot of structured data!
  - The invisible web...that disguises itself
- ...else extract structure
  - Go from text to structured data (using quasi-NLP techniques)
- ...or annotate metadata to add structure
  - Semantic web idea...

Adapting old disciplines for the Web age
- Information (text) retrieval
  - Scale of the web
  - Hypertext/link structure
  - Authority/hub computations
- Social network analysis
  - Ease of tracking/centrally representing social networks
- Databases
  - Multiple databases
    - Heterogeneous, access limited, partially overlapping
  - Network (un)reliability
- Data mining [Machine Learning/Statistics/Databases]
  - Learning patterns from large-scale data

Information Retrieval
- Traditional model
  - Given
    - A set of documents
    - A query expressed as a set of keywords
  - Return
    - A ranked set of documents most relevant to the query
  - Evaluation:
    - Precision: fraction of returned documents that are relevant
    - Recall: fraction of relevant documents that are returned
    - Efficiency
- Web-induced headaches
  - Scale (billions of documents)
  - Hypertext (inter-document connections)
- ...and simplifications
  - Easier to please “lay” users
- Consequently
  - Ranking that takes link structure into account
    - Authority/Hub
  - Indexing and retrieval algorithms that are ultra fast

Social Networks
- Traditional model
  - Given
    - A set of entities (humans)
    - And their relations (network)
  - Return
    - Measures of centrality and importance
    - Propagation of trust (paths through networks)
  - Many uses
    - Spread of diseases
    - Spread of rumours
    - Popularity of people
    - Friends circle of people
- Web-induced headaches
  - Scale (billions of entities)
  - Implicit vs. explicit links
    - Hypertext (inter-entity connections easier to track)
    - Interest-based links
- ...and simplifications
  - Global view of the social network possible...
- Consequently
  - Ranking that takes link structure into account
    - Authority/Hub
  - Recommendations (collaborative filtering; trust propagation)

Information Integration: Database-Style Retrieval
- Traditional model (relational)
  - Given:
    - A single relational database (schema, instances)
    - A relational (SQL) query
  - Return:
    - All tuples satisfying the query
- Evaluation
  - Soundness/completeness
  - Efficiency
- Web-induced headaches
  - Many databases
    - With differing schemas
    - All are partially complete
    - Overlapping
    - Heterogeneous schemas
    - Access limitations
  - Network (un)reliability
- Consequently
  - Newer models of DB
  - Newer notions of completeness
  - Newer approaches for query planning

Further headaches brought on by semi-structured retrieval
- If everyone puts their pages in XML
  - Introducing similarity-based retrieval into traditional databases
  - Standardizing on shared ontologies...

Learning Patterns (from web and users)
- Traditional classification learning (supervised)
  - Given a set of structured instances of a pattern (concept)
  - Induce the description of the pattern
- Evaluation:
  - Accuracy of classification on the test data
  - (Efficiency of learning)
- Mining headaches
  - Training data is not obvious (relevance)
  - Training data is massive
  - Training instances are noisy and incomplete
- Consequently
  - Primary emphasis on fast classification
    - Even at the expense of accuracy

Outline of IR topics
- Background
  - Definitions, etc.
- The Problem
  - 100,000+ pages
- The Solution
  - Ranking docs
  - Vector space
- Extensions
  - Relevance feedback, clustering, query expansion, etc.

What is Information Retrieval
- Given a large repository of documents, how do I get at the ones that I want?
  - Examples: Lexis/Nexis, medical reports, AltaVista
    - Keyword based
- Different from databases
  - Unstructured (or semi-structured) data
  - Information is (typically) text
  - Requests are (typically) word-based
- In principle, this requires NLP!
  - NLP is too hard as yet
  - IR tries to get by with syntactic methods
- Catch-22: since IR doesn’t do NLP, users tend to write cryptic keyword queries

Information vs. Data
- Data retrieval
  - Which docs contain a set of keywords?
  - Well-defined semantics
  - A single erroneous object implies failure! A single missed object implies failure too.
- Information retrieval
  - Information about a subject or topic
  - Semantics is frequently loose
  - Small errors are tolerated
- IR system:
  - Interprets contents of information items
  - Generates a ranking which reflects relevance
  - Notion of relevance is most important

How do you find out the relevance function?
- Learn
  - Active (utility elicitation)
  - Passive (learn from what the user does)
- Make up the user's mind
  - What you are “really” looking for is... (used car salespeople)
- Combination of the above
  - Saree shops ;-)
- Assume (impose) a relevance model
- R(d|Q,U): relevance of a document d to the user U under query Q

Difficulties in designing ranking methods
- We want a ranking algorithm that captures the user’s relevance metric
  - Only, the user’s relevance metric is not fully captured by the short keyword query
    - Worse when the query has a 10-word limit (as in most search engines)
- So, we hypothesize what might be underlying the user’s relevance judgment
  - Similarity of words
  - Similarity of co-citation
  - Popularity of the document
- ...and hope that our hypotheses are good

“We dance round in a ring and suppose, / But the Secret sits in the middle and knows.” -- Robert Frost

Marginal (Residual) Relevance
- It is clear that the first document returned should be the one most similar to the query
- How about the second...and the rest of the top-10 documents?
  - If we have near-duplicate documents, you would think the user wouldn’t want to see all copies!
  - If there seem to be different clusters of documents that are all close to the query, it is best to hedge your bets by returning one document from each cluster (e.g. given the query “bush”, you may want to return one page on the Republican Bush, one on Kalahari bushmen and one on rose bushes)
- Insight: If you are returning top-K documents, they should simultaneously satisfy two constraints:
  - They are as similar as possible to the query
  - They are as dissimilar as possible from each other
- Most search engines do care about this “result diversity”
  - They don’t necessarily do it by directly solving the optimization problem. One idea is to take the top-100 documents that are similar to the query and then cluster them. You can then give one representative document from each cluster
    - Example: Vivisimo.com
- So we need R(d|Q,U,{d1…di-1}), where d1…di-1 are the documents already shown to the user

Drunk searching for his keys...
- What we really want:
  - Relevance of doc D to user U, given query Q
  - Marginal/residual relevance of doc D’ to user U given query Q, and the fact that U has already seen docs {d1…dk}
- What we hope to get by with:
  - Similarity between doc D and query Q (to heck with the user and her relevance)
  - Document D’ that is most similar to Q while being most distant from the docs {d1…dk} already shown
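The “most similar to Q while most distant from docs already shown” heuristic can be sketched as a greedy loop. This is only an illustration under stated assumptions: documents and the query are plain term-weight vectors, cosine is used for both similarities, and the `trade_off` parameter and the names `pick_diverse` and `cosine` are made up for this example rather than taken from the slides.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pick_diverse(query, docs, k, trade_off=0.5):
    """Greedily pick k doc ids that are close to the query but far from docs already picked.

    trade_off is a hypothetical knob: 1.0 ignores diversity, 0.0 ignores the query.
    """
    shown = []
    remaining = dict(docs)
    while remaining and len(shown) < k:
        def score(doc_id):
            to_query = cosine(remaining[doc_id], query)
            # Similarity to the closest document the user has already seen.
            to_shown = max((cosine(remaining[doc_id], docs[s]) for s in shown), default=0.0)
            return trade_off * to_query - (1 - trade_off) * to_shown
        best = max(remaining, key=score)
        shown.append(best)
        del remaining[best]
    return shown

docs = {
    "d1": [1.0, 1.0, 0.0],
    "d2": [1.0, 1.0, 0.0],  # exact duplicate of d1
    "d3": [1.0, 0.0, 1.0],
}
print(pick_diverse([1.0, 0.6, 0.4], docs, k=2))
```

With these made-up vectors the duplicate d2 is skipped in favour of d3, even though d2 is closer to the query, because d2 adds nothing beyond d1.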

Measuring Performance
- Precision
  - Proportion of selected items that are correct: tp / (tp + fp)
- Recall
  - Proportion of target items that were selected: tp / (tp + fn)
- Precision-Recall curve
  - Shows the tradeoff
[Figure: the set the system returned overlapping the set of actual relevant docs, with regions tp, fp, fn and tn; a plot of precision against recall]
- Why don’t we use precision/recall measurements for databases?
  - 1.0 precision ~ soundness ~ nothing but the truth
  - 1.0 recall ~ completeness ~ the whole truth
  - Analogy: swearing in witnesses in courts
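A minimal sketch of the two measures, assuming the returned and relevant documents are given as collections of ids. The function name `precision_recall` and the document ids are illustrative only.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned ids against the relevant ids."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 docs returned, 5 docs actually relevant, 2 in common.
p, r = precision_recall(
    returned=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d7", "d8", "d9"],
)
print(f"precision={p:.2f} recall={r:.2f}")
```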

Why can’t search engines have 100% precision and 100% recall?
- Because relevance is in the eye of the beholder...
  - I think that a page on the culture of Kalahari bushmen is highly relevant to my query “bush”
  - The campus Republicans might find that it is a lousy answer

Measuring performance of retrieval system
- Why do courts ask witnesses to swear that “...I will tell the whole truth and nothing but the truth...”?
- Why not just ask them to swear “I will tell the truth”?

Measuring Performance
- Precision/recall studies involving real users...

Precision/Recall Curves
An 11-point recall-precision curve plots precision at recalls 0, .1, .2, .3, …, 1.0.
Example: Suppose for a given query, 10 documents are relevant. Suppose when all documents are ranked in descending similarities, we have d1, d2, d3, …, d31, …
- .2 recall happens at the third doc. Here the precision is 2/3 = .67
- .3 recall happens at the 6th doc. Here the precision is 3/6 = 0.5
[Figure: the ranked list with the relevant docs marked, and the recall-precision plot]
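The recall points in this example can be reproduced by walking down the ranking. The sketch below assumes the relevant documents sit at ranks 1, 3 and 6; the slide only states the recall points reached at ranks 3 and 6, so rank 1 is a guess, and the function name is made up for the illustration.

```python
def precision_at_recall_points(ranked_relevance, total_relevant):
    """Walk a ranked list of True/False relevance flags.

    Returns one (recall, precision) pair each time a relevant document is reached.
    """
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Assumption: relevant docs at ranks 1, 3 and 6; 10 relevant docs exist in total.
ranking = [True, False, True, False, False, True]
for recall, precision in precision_at_recall_points(ranking, total_relevant=10):
    print(f"recall {recall:.1f}: precision {precision:.2f}")
```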

Precision Recall Curves...
When evaluating the retrieval effectiveness of a text retrieval system or method, a large number of queries are used and their average 11-point recall-precision curve is plotted.
[Figure: average recall-precision curves for Method 1, Method 2 and Method 3]
- Methods 1 and 2 are better than method 3.
- Method 1 is better than method 2 for high recalls.
Note: We assume that all methods are using the same document corpus.

Combining precision and recall into a single measure
- We can consider a weighted summation of precision and recall into a single quantity
  - What is the best way to combine?
    - Arithmetic mean?
    - Geometric mean?
    - Harmonic mean?
- F-measure (aka F1-measure): the harmonic mean of precision and recall, f = 2pr / (p + r)
  - f = 0 if p = 0 or r = 0
  - f = 0.5 if p = r = 0.5
- If you travel at 40 mph on the way out and 60 mph on the return, what is your average speed?
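A short sketch of why the harmonic mean is the natural choice here: it is dragged down by whichever of precision or recall is small, and it is also the mean that answers the speed puzzle. The function names are illustrative.

```python
def f_measure(p, r):
    """F1: harmonic mean of precision p and recall r (0 when either is 0)."""
    if p == 0 or r == 0:
        return 0.0
    return 2 * p * r / (p + r)

def harmonic_mean(values):
    """Harmonic mean of a list of positive numbers."""
    return len(values) / sum(1 / v for v in values)

print(f_measure(0.5, 0.5))                # equal precision and recall
print(round(f_measure(0.9, 0.1), 2))      # lopsided: far below the arithmetic mean of 0.5
print(round(harmonic_mean([40, 60]), 2))  # average speed over equal distances
```

For the 40 mph / 60 mph trip the harmonic mean gives 48 mph, not the arithmetic 50 mph, because more time is spent on the slower leg.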

Mean Average Precision
- For each query, average the precision scores obtained after each relevant document is retrieved; then take the mean of these averages over all queries
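A minimal sketch of this measure, assuming each query's result list is given as True/False relevance flags in rank order and that the average is taken over the relevant documents actually retrieved. The function names and the two example rankings are made up for illustration.

```python
def average_precision(ranked_relevance):
    """Average of the precision values measured at each relevant document in the ranking."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two hypothetical queries.
query_1 = [True, False, True, False, False, True]  # precisions 1, 2/3, 1/2
query_2 = [False, True]                            # precision 1/2
print(round(average_precision(query_1), 3))
print(round(mean_average_precision([query_1, query_2]), 3))
```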

Sophie’s choice: Web version
- If you can have either precision or recall but not both, which would you rather keep?
  - If you are a medical doctor trying to find the right paper on a disease?
  - If you are Joe Schmoe surfing the web?

Evaluation: TREC
- How do you evaluate information retrieval algorithms?
- Need prior relevance judgements
- TREC: Text REtrieval Conference
  - Given
    - documents;
    - a set of queries;
    - and for each query, prior relevance judgements
  - Rank systems based on their precision/recall on the corpus of queries
- There are variants of TREC
  - TREC for bio-informatics; TREC for collection selection etc.
    - Very benchmark driven...

What is IR cont.
- IR: representation, storage, organization of, and access to information items
- Focus is on the user information need
- User information need:
  - Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament.
- Emphasis is on the retrieval of information (not data)

IR - Past and Present
- IR at the center of the stage
  - IR in the last 20 years:
    - classification and categorization
    - systems and languages
    - user interfaces and visualization
  - The Web has renewed focus on IR
    - universal repository of knowledge
    - free (low cost) universal access
    - no central editorial board
    - many problems though: IR seen as key to finding the solutions!

Classic IR Models - Basic Concepts
- Each document is represented by a set of representative keywords or index terms
  - Query is seen as a “mini” document
- An index term is a document word useful for remembering the document's main themes
  - Usually, index terms are nouns because nouns have meaning by themselves
    - [However, search engines assume that all words are index terms (full text representation)]

The Retrieval Process
[Figure: block diagram of the retrieval process. Components: User Interface; Text Operations (stemming, noun phrase detection etc.); Query Operations (elaboration, relevance feedback); Indexing; Searching (hash tables etc.); Ranking (vector models etc.); Index (inverted file); DB Manager Module; Text Database. Flow labels: user need, text, logical view, query, user feedback, retrieved docs, ranked docs]

A quick glimpse at inverted files
[Figure: an inverted file, with its dictionary and postings]
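As a rough sketch of the idea: the dictionary holds every distinct term, and each term points to its postings, the list of documents containing it. The code below assumes whitespace tokenization and no stemming or stop-word removal; the three tiny documents and the function names are illustrative only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term (the dictionary) to the sorted list of doc ids containing it (the postings)."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].lower().split())):
            postings[term].append(doc_id)
    return dict(postings)

def lookup_all(index, terms):
    """Doc ids containing every one of the given terms (intersection of postings)."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "graph minors a survey",
    2: "the intersection graph of paths in trees",
    3: "a survey of user opinion",
}
index = build_inverted_index(docs)
print(index["graph"])
print(lookup_all(index, ["graph", "survey"]))
```

Answering a keyword query then only touches the postings of the query terms, not the full text of every document.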

Generating keywords (index terms) in traditional IR
[Figure: pipeline from docs to index terms, with stages labelled structure, accents and spacing, stopwords, noun groups, stemming, and manual indexing; the document is viewed in turn as structure, full text, and index terms]
- Stop-word elimination
- Noun phrase detection
  - “data structure”, “computer architecture”
- Stemming (Porter Stemmer for English)
  - If the suffix of a word is “IZATION” and the prefix contains at least one vowel followed by a consonant, then replace the suffix with “IZE” (e.g. Binarization => Binarize)
- Generating index terms
- Improving quality of terms (e.g. synonyms, co-occurrence detection, latent semantic indexing...)
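The single stemming rule quoted above can be sketched as follows. This implements only that one rule as worded on the slide, not the full Porter stemmer, and it assumes that vowels are a, e, i, o, u and that every other character counts as a consonant; the function name is made up.

```python
import re

# One vowel immediately followed by one non-vowel character, anywhere in the prefix.
VOWEL_THEN_CONSONANT = re.compile(r"[aeiou][^aeiou]")

def stem_ization(word):
    """Rewrite a trailing 'ization' to 'ize' when the remaining prefix has a vowel-consonant pair."""
    lower = word.lower()
    if lower.endswith("ization"):
        prefix = lower[: -len("ization")]
        if VOWEL_THEN_CONSONANT.search(prefix):
            return prefix + "ize"
    return lower

for word in ["Binarization", "organization", "ization", "trees"]:
    print(word, "=>", stem_ization(word))
```

The prefix condition is what stops the rule from firing on a word that is nothing but the suffix.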

Example of Stemming and Stopword Elimination
- Input sentence: “The number of Web pages on the World Wide Web was estimated to be over 800 million in 1999.”
[Figure: the sentence after stop word elimination and after stemming]
- So does Google use stemming? All kinds of stemming? Stopword elimination? Any non-obvious stop-words?

Why don’t search engines do much text-ops?
- User population is too large and is easily impressed with reasonably relevant answers
  - We are not talking of medical doctors looking for the most relevant paper describing the cure for the symptoms of their patient
  - A search engine can do well even if all the doctors give it low marks
    - Corollary: All of these text-ops may well be relevant for “vertical” (topic-specific) search engines
- Some of the text-ops were put in place as a way of dealing with computational limitations
  - E.g. indexing in terms of only a few keywords
  - These are not as relevant in the era of current-day computers...

Ranking (the biggie)
- A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query
- A ranking is based on fundamental premises regarding the notion of relevance, such as:
  - common sets of index terms
  - sharing of weighted terms
  - likelihood of relevance
- Each set of premises leads to a distinct IR model

IR Models
- User task
  - Retrieval: ad hoc, filtering
  - Browsing
- Classic models: boolean, vector, probabilistic
  - Set theoretic: fuzzy, extended boolean
  - Algebraic: generalized vector, latent semantic indexing, neural networks
  - Probabilistic: inference network, belief network
- Structured models: non-overlapping lists, proximal nodes
- Browsing: flat, structure guided, hypertext

(Some) Desiderata for Ranking Metrics
- Partial matches should be allowed (Boolean is out; Vector/Jaccard are okay)
  - Can’t throw out a document just because it is missing one of the 20 words in the query
- Weighted matches should be allowed (reduce the importance of common words)
  - If the query is “Red Sponge”, a document that just has “red” should be seen as less relevant than a document that just has the word “sponge”
- Relevance (similarity) should not depend on the size! (normalize the document sizes)
  - Doubling the size of a document by concatenating it to itself should not increase its similarity

Digression: Similarity vs. Duplicate detection
- Duplicate detection (as used in plagiarism detection) is different from similarity computation
  - Highly similar documents may not necessarily be plagiarized versions of each other
  - Often, duplicate detection may require comparing documents at the level of “shingles”
    - A shingle is a contiguous chunk of text
    - A plagiarized document may have many of the shingles of the original document, but rearranged
- See http://www-db.stanford.edu/~shiva/Pubs/DlMag/dlmag.html

Terminology: Term Weights
- Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents
  - The importance of the index terms is represented by weights associated with them
  - ki is an index term
  - dj is a document
  - t is the total number of index terms
  - K = (k1, k2, …, kt) is the set of all index terms
  - wij >= 0 is a weight associated with (ki,dj)
    - wij = 0 indicates that the term does not belong to the doc
  - vec(dj) = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj

The Boolean Model
- Simple model based on set theory
  - Documents as sets of keywords
- Queries specified as boolean expressions
  - q = ka ∧ (kb ∨ ¬kc)
    - precise semantics
- Terms are either present or absent. Thus, wij ∈ {0,1}
- Consider
  - q = ka ∧ (kb ∨ ¬kc)
  - vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  - vec(qcc) = (1,1,0) is a conjunctive component
- AI folks: this is DNF, as against the CNF which you used in 471

The Boolean Model
- q = ka ∧ (kb ∨ ¬kc)
- sim(q,dj) = 1 if ∃ vec(qcc) | (vec(qcc) ∈ vec(qdnf)) ∧ (∀ ki, gi(vec(dj)) = gi(vec(qcc))); 0 otherwise
[Figure: Venn diagram of Ka, Kb and Kc with the regions (1,1,1), (1,1,0) and (1,0,0) marked]
- A document dj is a long conjunction of keywords
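A small sketch of this definition for the example query: a document matches exactly when its 0/1 pattern over (ka, kb, kc) equals one of the conjunctive components of the query's DNF. Documents are assumed to be plain sets of terms; the constant and function names are illustrative.

```python
# Conjunctive components of q = ka AND (kb OR NOT kc), as (ka, kb, kc) bit patterns.
QUERY_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}
INDEX_TERMS = ("ka", "kb", "kc")

def boolean_sim(doc_terms):
    """Return 1 if the document's bit pattern over the query terms is in the DNF, else 0."""
    pattern = tuple(1 if term in doc_terms else 0 for term in INDEX_TERMS)
    return 1 if pattern in QUERY_DNF else 0

print(boolean_sim({"ka"}))                 # ka present, kc absent
print(boolean_sim({"ka", "kc"}))           # kc present without kb
print(boolean_sim({"ka", "kb", "other"}))  # terms outside the query are ignored
```

There is no middle ground in the output: a document either satisfies the expression or it does not, which is the “no partial matching” drawback of the model.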

Boolean model is popular in legal search engines
- /s => same sentence
- /p => same para
- /k => within k words
- Notice the long queries and proximity ops

Drawbacks of the Boolean Model
- Retrieval based on binary decision criteria with no notion of partial matching
- No ranking of the documents is provided (absence of a grading scale)
- Information need has to be translated into a Boolean expression, which most users find awkward
  - The Boolean queries formulated by the users are most often too simplistic
    - As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
- Keyword (vector model) is not necessarily better; it just annoys the users somewhat less

Documents as bags of words
a: System and human system engineering testing of EPS
b: A survey of user opinion of computer system response time
c: The EPS user interface management system
d: Human machine interface for ABC computer applications
e: Relation of user perceived response time to error measurement
f: The generation of random, binary, ordered trees
g: The intersection graph of paths in trees
h: Graph minors IV: Widths of trees and well-quasi-ordering
i: Graph minors: A survey

9/4: Possible make-up class next Wednesday and Friday (so the teacher can go down under the week after)

Types of Web Queries...
Web queries can be classified into three categories
- Informational queries
  - Want to know about some topic
- Navigational queries
  - Want to find a particular site
- Transactional queries
  - Want to find a site so as to do some transaction on it
IR work focuses implicitly on informational queries

Documents as bags of keywords (another e.g.)
t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear
[Table: keyword counts for each document]

Jaccard Similarity Metric
- Although the vector similarity measure is used widely, another similarity measure with useful properties is the Jaccard similarity metric
  - Estimates the degree of overlap between sets (or bags)
  - For bags, intersection and union are defined in terms of min & max
    - If A has 5 oranges and 8 apples and B has 3 oranges and 12 apples
    - A intersection B is 3 oranges and 8 apples
    - A union B is 5 oranges and 12 apples
    - Jaccard similarity is (3+8)/(5+12) = 11/17
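The oranges-and-apples example maps directly onto Python's `collections.Counter`, whose `&` operator takes the per-item minimum and whose `|` operator takes the per-item maximum. A minimal sketch, with an illustrative function name:

```python
from collections import Counter

def jaccard(bag_a, bag_b):
    """Bag Jaccard similarity: size of the min-intersection over size of the max-union."""
    intersection = sum((bag_a & bag_b).values())  # per-item minimum
    union = sum((bag_a | bag_b).values())         # per-item maximum
    return intersection / union if union else 0.0

a = Counter(oranges=5, apples=8)
b = Counter(oranges=3, apples=12)
print(jaccard(a, b))      # (3 + 8) / (5 + 12)
print(jaccard(a, a + a))  # a bag against itself concatenated twice
```

The second call shows the size problem: a bag compared with its own doubled copy scores only 0.5 unless the bags are normalized first.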

Documents as bags of keywords (another e.g.)
t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear
- Similarity(d1,d2) = (24+10+5)/(32+21+9+3+3) = 0.57
- What about d1 and d1d1 (which is a twice-concatenated version of d1)?
  - Need to normalize the bags (e.g. divide coeffs by bag size)
  - Can also better differentiate the coeffs (tf/idf metrics)

The Vector Model
- Use of binary weights is too limiting
  - Non-binary weights provide consideration for partial matches
- These term weights are used to compute a degree of similarity between a query and each document
- Ranked set of documents provides for better matching

The Vector Model
- Document/query bags are seen as vectors over keyword space
  - vec(dj) = (w1j, w2j, ..., wtj)
  - vec(q) = (w1q, w2q, ..., wtq)
    - wiq >= 0 is associated with the pair (ki,q)
    - wij > 0 whenever ki ∈ dj
  - To each term ki is associated a unit vector vec(i)
    - The unit vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
      - Is this reasonable?
- The t unit vectors vec(i) form an orthonormal basis for a t-dimensional space
  - Each vector holds a place for every term in the collection
  - Therefore, most vectors are sparse

Vector Space Example
t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear

Similarity Function
The similarity or closeness of a document d = (w1, …, wi, …, wn) with respect to a query (or another document) q = (q1, …, qi, …, qn) is computed using a similarity (distance) function.
Many similarity functions exist:
- Euclidean distance, dot product, normalized dot product (cosine-theta)

Euclidean distance
- Given two document vectors d1 and d2
- dist(d1, d2) = sqrt( Σi (wi1 - wi2)^2 )

Dot Product distance
sim(q, d) = dot(q, d) = q1 * w1 + … + qn * wn
Example: Suppose d = (0.2, 0, 0.3, 1) and q = (0.75, 0.75, 0, 1), then sim(q, d) = 0.15 + 0 + 0 + 1 = 1.15
Observations on the dot product function:
- Documents having more terms in common with a query tend to have higher similarities with the query.
- For terms that appear in both q and d, those with higher weights contribute more to sim(q, d) than those with lower weights.
- It favors long documents over short documents.
- The computed similarities have no clear upper bound.

A normalized similarity metric
- Sim(q,dj) = cos(θ) = [vec(dj) • vec(q)] / (|dj| * |q|) = [Σi wij * wiq] / (|dj| * |q|)
- Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1
- A document is retrieved even if it matches the query terms only partially
[Figure: vectors dj and q drawn against term axes i and j, separated by the angle θ]
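A short sketch contrasting the plain dot product with the normalized (cosine) version, reusing the vectors d = (0.2, 0, 0.3, 1) and q = (0.75, 0.75, 0, 1) from the dot product example. The function names are illustrative.

```python
import math

def dot(q, d):
    """Plain dot product of two equal-length weight vectors."""
    return sum(qi * wi for qi, wi in zip(q, d))

def cosine(q, d):
    """Dot product divided by the product of the vector lengths."""
    norm = math.sqrt(dot(q, q)) * math.sqrt(dot(d, d))
    return dot(q, d) / norm if norm else 0.0

d = (0.2, 0, 0.3, 1)
q = (0.75, 0.75, 0, 1)
d_doubled = tuple(2 * w for w in d)  # the document concatenated with itself

print(round(dot(q, d), 2), round(dot(q, d_doubled), 2))
print(round(cosine(q, d), 3), round(cosine(q, d_doubled), 3))
```

Doubling the document doubles its dot product with the query but leaves the cosine unchanged, which is the size-independence asked for in the desiderata.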

Comparison of Euclidean and Cosine distance metrics
t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear
[Figure: two document-similarity matrices, one for Euclidean and one for Cosine; whiter => more similar]

Answering Queries
- Represent query as vector
- Compute distances to all documents
- Rank according to distance
- Example
  - “database index”
  - t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear
  - Given Q = {database, index} = {1,0,1,0,0,0}
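The three steps can be sketched as below for the query Q = (1,0,1,0,0,0). The document vectors are invented for the illustration, since the table of term counts from the slide is not available, and cosine similarity is assumed as the measure, so documents are ranked from highest to lowest similarity.

```python
import math

def cosine(q, d):
    """Normalized dot product of two equal-length weight vectors."""
    dot = sum(qi * wi for qi, wi in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Doc ids ordered by decreasing cosine similarity to the query."""
    return sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]), reverse=True)

# Order of coordinates: database, SQL, index, regression, likelihood, linear.
# Hypothetical term counts, not the ones from the slide.
docs = {
    "d1": (3, 2, 4, 0, 0, 0),  # database-flavoured
    "d2": (0, 0, 0, 5, 3, 4),  # statistics-flavoured
    "d3": (1, 0, 1, 2, 1, 1),  # a bit of both
}
Q = (1, 0, 1, 0, 0, 0)
print(rank(Q, docs))
```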

Term Weights in the Vector Model
- Sim(q,dj) = [Σi wij * wiq] / (|dj| * |q|)
- How to compute the weights wij and wiq?
  - Simple keyword frequencies tend to favor common words
    - E.g. Query: The Computer Tomography
- A good weight must take into account two effects:
  - quantification of intra-document contents (similarity)
    - tf factor, the term frequency within a document
  - quantification of inter-document separation (dissimilarity)
    - idf factor, the inverse document frequency
  - wij = tf(i,j) * idf(i)

Tf-IDF
- Let
  - N be the total number of docs in the collection
  - ni be the number of docs which contain ki
  - freq(i,j) be the raw frequency of ki within dj
- A normalized tf factor is given by
  - f(i,j) = freq(i,j) / max_l freq(l,j)
    - where the maximum is computed over all terms kl which occur within the document dj
- The idf factor is computed as
  - idf(i) = log(N/ni)
    - the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.

Document/Query Representation using TF-IDF
- The best term-weighting schemes use weights which are given by
  - wij = f(i,j) * log(N/ni)
  - the strategy is called a tf-idf weighting scheme
- For the query term weights, several possibilities:
  - wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N/ni)
    - Alternatively, just use the IDF weights (to give preference to rare words)
  - Let the user give the weights to the keywords to reflect her *real* preferences
    - Easier said than done... Users are often dunderheads...
    - => Help them with “relevance feedback” techniques.
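A minimal sketch of the document weights wij = f(i,j) * log(N/ni), assuming documents are already tokenized into lists of terms and using the natural logarithm (the slides do not fix the base). The three toy documents and the function name are made up.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs maps doc id -> list of terms; returns doc id -> {term: tf-idf weight}."""
    n_docs = len(docs)
    counts = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    # ni: number of documents containing each term.
    doc_freq = Counter(term for c in counts.values() for term in c)
    weights = {}
    for doc_id, c in counts.items():
        max_freq = max(c.values(), default=1)  # most frequent term in this document
        weights[doc_id] = {
            term: (freq / max_freq) * math.log(n_docs / doc_freq[term])
            for term, freq in c.items()
        }
    return weights

docs = {
    "d1": "the computer tomography the".split(),
    "d2": "the computer the network".split(),
    "d3": "the the the cat".split(),
}
weights = tf_idf_weights(docs)
for term, w in sorted(weights["d1"].items()):
    print(term, round(w, 3))
```

In this toy collection “the” occurs in every document, so its idf, and hence its weight, is zero, while the rare “tomography” outweighs the more common “computer”.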

t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear. Given Q = {database, index} = {1,0,1,0,0,0}.
Note: In this case, the weights used in the query were 1 for t1 and t3, and 0 for the rest.

The Vector Model: Summary
- The vector model with tf-idf weights is a good ranking strategy with general collections
  - The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.
- Advantages:
  - term-weighting improves quality of the answer set
  - partial matching allows retrieval of docs that approximate the query conditions
  - cosine ranking formula sorts documents according to degree of similarity to the query
- Disadvantages:
  - assumes independence of index terms
  - does not handle synonymy/polysemy
  - query weighting may not reflect user relevance criteria

So many ways things can go wrong…
Reasons that ideal effectiveness is hard to achieve:
1. Document representation loses information.
2. Users’ inability to describe queries precisely.
3. Similarity function used may not be good enough.
4. Importance/weight of a term in representing a document and query may be inaccurate.
5. Same term may have multiple meanings and different terms may have similar meanings.
Remedies: query expansion, relevance feedback, LSI, co-occurrence analysis

Making the document representation less lossy..
- Considering documents as bags of words is probably too coarse
  - Hey, it is less coarse than thinking of them as bags of letters
  - One idea is to consider documents as strings..
    - Strings of letters? But then you get stuck too closely with the low-level details/distinctions
    - Strings of words? Less stuck with low-level details, but still too costly..
  - A middle ground is to consider documents as bags of shingles
    - A k-shingle is a sequence of k contiguous words extracted by sliding a k-size window over the document.
  - ..a cheaper version of this idea is to do “adaptive” detection of frequently appearing shingles
    - E.g. noun-phrase detection (“computer-science” will be considered a new word distinct from “computer” and “science”)
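The sliding-window definition of a k-shingle can be sketched as below. Lower-casing and whitespace tokenization are simplifying assumptions of this sketch, and the function names are hypothetical.

```python
from collections import Counter

def k_shingles(text, k):
    """List every run of k contiguous words, in document order."""
    if k <= 0:
        raise ValueError("k must be positive")
    words = text.lower().split()
    # Slide a window of size k across the word list.
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]

def bag_of_shingles(text, k):
    """Multiset of k-shingles: the document's bag-of-shingles representation."""
    return Counter(k_shingles(text, k))

bag = bag_of_shingles("a rose is a rose", 2)
print(bag)
```

With k = 1 this reduces to the bag-of-words model. Larger k retains more word order at the cost of a much larger vocabulary.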

Digression: Plagiarism detection using similarity metrics
- Will bag similarity be sufficient for plagiarism detection..?
  - No. Students will be accused of plagiarism just because they have a similar (impoverished) vocabulary to the other students
- How about string similarity/identicality?
  - No. Teachers will miss plagiarised essays just because a couple of padding sentences are thrown in…
- A middle ground:
  - Similarity over bags of shingles..
    - A k-shingle is a sequence of k contiguous words extracted by sliding a k-size window over the document. A plagiarized document may have many of the shingles of the original document, but re-arranged. See http://www-db.stanford.edu/~shiva/Pubs/DlMag/dlmag.html
    - Too costly for normal retrieval since there are many more shingles than there are words!
- Second-order digression: this whole discussion can also be done in terms of strings (rather than documents)
  - In the context of strings, shingles are called “grams”. So a q-gram is a contiguous sequence of q letters from a string
  - Relevant for looking at similar strings (potentially misspelled) => also relevant for comparing genes… (since genes are but enormous strings over a small set of letters)
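A brief sketch of the string version of the idea: compare two strings by the overlap of their q-gram sets. The slides do not name a particular overlap measure; Jaccard overlap is used here as one common choice, and the function names are hypothetical.

```python
def q_grams(s, q):
    """Set of all contiguous length-q substrings of s."""
    if q <= 0:
        raise ValueError("q must be positive")
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a, b):
    """Size of the intersection over size of the union (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def q_gram_similarity(s, t, q=2):
    return jaccard(q_grams(s.lower(), q), q_grams(t.lower(), q))

# A misspelling shares many 2-grams with the intended word...
print(q_gram_similarity("retrieval", "retreival"))
# ...while an unrelated word shares none.
print(q_gram_similarity("retrieval", "database"))
```

The same functions work for word shingles if the strings are replaced by lists of words and the slices are converted to tuples.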

Some improvements
- Query expansion techniques (for 1)
  - relevance feedback
  - co-occurrence analysis (local and global thesauri)
- Improving the quality of terms [(2), (3) and (5)]
  - Latent Semantic Indexing
  - Phrase detection

Relevance Feedback
- Main idea:
  - Modify the existing query based on relevance judgements
    - Extract terms from relevant documents and add them to the query
    - and/or re-weight the terms already in the query
  - Two main approaches:
    - Users select relevant documents, directly or indirectly (by pawing/clicking/staring etc.)
    - Automatic (pseudo-relevance feedback): assume that the top-k documents are the most relevant documents..
  - Users/system select terms from an automatically generated list

Relevance Feedback
- Usually do both:
  - expand the query with new terms
  - re-weight terms in the query
- There are many variations
  - usually positive weights for terms from relevant docs
  - sometimes negative weights for terms from non-relevant docs
  - remove terms that occur ONLY in non-relevant documents

Relevance Feedback for Vector Model
In the “ideal” case where we know the relevant documents a priori:
- Cr = set of documents that are truly relevant to Q
- N = total number of documents
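The definitions of Cr and N match the standard optimal-query expression for the vector model. Assuming that is the formula this slide intended (it is not preserved in the transcript), it reads:

```latex
\vec{q}_{opt} \;=\; \frac{1}{|C_r|} \sum_{\vec{d_j} \in C_r} \vec{d_j}
\;-\; \frac{1}{N - |C_r|} \sum_{\vec{d_j} \notin C_r} \vec{d_j}
```

In words, the ideal query is the centroid of the relevant documents minus the centroid of all the other documents.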

Rocchio Method
- Q0 is the initial query; Q1 is the query after one iteration
- Dr is the set of relevant docs
- Dn is the set of irrelevant docs
- Typically alpha = 1, beta = 0.75, gamma = 0.25
- Other variations are possible, but performance is similar
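The update formula itself is not preserved in this transcript. The sketch below assumes the standard Rocchio form, Q1 = alpha * Q0 + beta * centroid(Dr) - gamma * centroid(Dn), with the default constants listed above. The `clip_negative` option is an addition for this illustration: it reflects the common practice of dropping negative weights and is not something the slides prescribe.

```python
def rocchio(q0, relevant, nonrelevant,
            alpha=1.0, beta=0.75, gamma=0.25, clip_negative=True):
    """One Rocchio iteration over dense weight vectors (lists of floats)."""
    dims = len(q0)

    def centroid(docs):
        # Mean vector of a document set; zero vector if the set is empty.
        if not docs:
            return [0.0] * dims
        return [sum(d[i] for d in docs) / len(docs) for i in range(dims)]

    rel_c = centroid(relevant)
    non_c = centroid(nonrelevant)
    q1 = [alpha * q0[i] + beta * rel_c[i] - gamma * non_c[i] for i in range(dims)]
    if clip_negative:
        # Assumption of this sketch: negative term weights are set to zero.
        q1 = [max(0.0, w) for w in q1]
    return q1

q1 = rocchio([1.0, 0.0, 1.0],
             relevant=[[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]],
             nonrelevant=[[0.0, 0.0, 8.0]])
print(q1)
```

Calling `rocchio([0.7, 0.3], [[0.2, 0.8]], [], alpha=0.5, beta=0.5, gamma=0.0)` reproduces, up to floating-point rounding, the point Q’ = (0.45, 0.55) from the illustration slide.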

Rocchio/Vector Illustration
[Figure: the vectors below plotted on “Retrieval” and “Information” axes, each running from 0 to 1.0]
- Coordinates are (retrieval, information)
- Q0 = retrieval of information = (0.7, 0.3)
- D1 = information science = (0.2, 0.8)
- D2 = retrieval systems = (0.9, 0.1)
- Q’ = ½ * Q0 + ½ * D1 = (0.45, 0.55)
- Q” = ½ * Q0 + ½ * D2 = (0.80, 0.20)

Example Rocchio Calculation
[Figure: worked example with panels labeled “Relevant docs”, “Non-rel doc”, “Original Query”, “Constants”, “Rocchio Calculation” and “Resulting feedback query”]

Rocchio Method
- Rocchio automatically
  - re-weights terms
  - adds in new terms (from relevant docs)
    - have to be careful when using negative terms
    - Rocchio is not a machine learning algorithm
- Most methods perform similarly
  - results heavily dependent on test collection
- Machine learning methods are proving to work better than standard IR approaches like Rocchio

Using Relevance Feedback
- Known to improve results
  - in TREC-like conditions (no user involved)
- What about with a user in the loop?
  - How might you measure this?
    - Precision/recall figures need to be computed for the unseen documents
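One way to read the last point is as a residual-collection evaluation: documents the user has already seen are removed from both the ranking and the relevant set before precision and recall are computed. The sketch below assumes that reading. The function name, the cutoff parameter `k` and the sample data are hypothetical.

```python
def residual_precision_recall(ranking, relevant, seen, k):
    """Precision and recall at cutoff k, computed over unseen documents only."""
    seen = set(seen)
    # Drop already-judged documents from the post-feedback ranking.
    residual_ranking = [d for d in ranking if d not in seen]
    residual_relevant = set(relevant) - seen
    top_k = residual_ranking[:k]
    hits = sum(1 for d in top_k if d in residual_relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(residual_relevant) if residual_relevant else 0.0
    return precision, recall

# Hypothetical ranking produced after feedback on d1 and d2.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d5", "d9"}
print(residual_precision_recall(ranking, relevant, seen={"d1", "d2"}, k=2))
```

Without this correction, the feedback query would get credit simply for re-ranking the documents the user already marked as relevant.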
