1/22 --Are you getting e-mails sent to class mailing list? (if not, send me email) Agenda: --Intro wrapup --Start on text retrieval Link du jour: vivisimo.com.


Web as a bow-tie
[Figure: bow-tie diagram of the Web; component shares 39%, 21%, 19%, 14%, 7%]
Probability that two pages are connected: (.21 + .39) * (.39 + .19) = .348
Reference: Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins, Eli Upfal: The Web as a Graph. PODS 2000: 1-10
Given two randomly chosen web pages p1 and p2, what is the probability that you can click your way from p1 to p2? 30%? >50%? ~100%? (answer at the end)
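
The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the names IN, SCC and OUT, and the assignment of 21%, 39% and 19% to them, are assumptions read off the slide's formula (the 39% share appears in both factors, so it is taken to be the strongly connected core).

```python
# Assumed labelling of the bow-tie components (not stated on the slide).
IN, SCC, OUT = 0.21, 0.39, 0.19

# p1 can reach p2 when p1 sits in IN or SCC and p2 sits in SCC or OUT.
p_source = IN + SCC
p_target = SCC + OUT
p_connected = p_source * p_target

print(f"P(path from p1 to p2) is about {p_connected:.3f}")
```

Under these assumptions the estimate treats the two pages as independently placed, which is why the two shares are simply multiplied.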

Structure helps querying --If there is structure, exploit it. If not, extract it (Information Extraction—mining; clustering; tagging)

Background Check
--34 forms turned in
--Java expertise: Low 3; High 9.5; Avg: ~7

Outline of IR topics
--Background
  --Definitions, etc.
--The Problem
  --100,000+ pages
--The Solution
  --Ranking docs
  --Vector space
--Extensions
  --Relevance feedback, clustering, query expansion, etc.

Information Retrieval
--Traditional Model
  --Given
    --A set of documents
    --A query expressed as a set of keywords
  --Return
    --A ranked set of documents most relevant to the query
  --Evaluation:
    --Precision: Fraction of returned documents that are relevant
    --Recall: Fraction of relevant documents that are returned
    --Efficiency
--Web-induced headaches
  --Scale (billions of documents)
  --Hypertext (inter-document connections)
--Consequently
  --Ranking that takes link structure into account
    --Authority/Hub
  --Indexing and retrieval algorithms that are ultra fast

What is Information Retrieval
--Given a large repository of documents, how do I get at the ones that I want?
  --Examples: Lexis/Nexis, medical reports, AltaVista
    --Keyword based [can’t handle synonymy, polysemy]
--Different from databases
  --Unstructured (or semi-structured) data
  --Information is (typically) text
  --Requests are (typically) word-based
In principle, this requires NLP!
--NLP too hard as yet
--IR tries to get by with syntactic methods

What is IR cont.
--IR: representation, storage, organization of, and access to information items
--Focus is on the user information need
--User information need:
  --Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament.
--Emphasis is on the retrieval of information (not data)

Information vs. Data
--Data retrieval
  --which docs contain a set of keywords?
  --Well defined semantics
  --a single erroneous object implies failure! A single missed object implies failure too.
--Information retrieval
  --information about a subject or topic
  --semantics is frequently loose
  --small errors are tolerated
--IR system:
  --interpret contents of information items
  --generate a ranking which reflects relevance
  --notion of relevance is most important

IR - Past and Present
--IR at the center of the stage
  --IR in the last 20 years:
    --classification and categorization
    --systems and languages
    --user interfaces and visualization
  --The Web has renewed focus on IR
    --universal repository of knowledge
    --free (low cost) universal access
    --no central editorial board
    --many problems though: IR seen as key to finding the solutions!

Classic IR Models - Basic Concepts
--Each document represented by a set of representative keywords or index terms
  --Query is seen as a “mini” document
--An index term is a document word useful for remembering the document's main themes
  --Usually, index terms are nouns because nouns have meaning by themselves
    --[However, search engines assume that all words are index terms (full text representation)]

Measuring Performance
--Precision
  --Proportion of selected items that are correct: tp / (tp + fp)
--Recall
  --Proportion of target items that were selected: tp / (tp + fn)
--Precision-Recall curve
  --Shows tradeoff
[Figure: the set the system returned overlapping the set of actual relevant docs, with regions tp, fp, fn, tn; plot of precision against recall]
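
As a small illustration of the two definitions, the sketch below computes precision and recall from a returned set and a relevant set. The function name and the document ids are made up for the example.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: ids the system selected; relevant: ids that are actually relevant.
    """
    returned, relevant = set(returned), set(relevant)
    tp = len(returned & relevant)  # true positives: selected and relevant
    precision = tp / len(returned) if returned else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 docs returned, 5 docs relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7", "d9", "d10"])
print(p, r)
```

Returning more documents can only keep recall the same or raise it, while precision usually drops, which is the tradeoff the curve shows.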

Precision/Recall Curves
11-point recall-precision curve
Example: Suppose for a given query, 10 documents are relevant. Suppose when all documents are ranked in descending similarities, we have d1, d2, d3, ..., d31, …
[Figure: recall-precision plot for this ranking; axes: recall, precision]
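
One way to compute an 11-point curve is sketched below. Two things are assumptions rather than content from the slide: the positions of the relevant documents in the ranking (ranks 1, 2, 4, 6 and 13 are hypothetical), and the use of the common interpolation rule, where the precision at recall level r is the highest precision seen at any recall of at least r.

```python
def eleven_point_curve(ranked_relevance, total_relevant):
    """ranked_relevance: booleans in rank order, True where the doc is relevant."""
    points = []  # (recall, precision) measured at each relevant doc
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))

    curve = []
    for level in range(11):  # recall levels 0.0, 0.1, ..., 1.0
        r = level / 10
        reachable = [prec for rec, prec in points if rec >= r - 1e-12]
        curve.append(max(reachable, default=0.0))
    return curve


# Hypothetical ranking of 15 docs; 10 docs are relevant overall.
relevant_ranks = {1, 2, 4, 6, 13}
ranked = [rank in relevant_ranks for rank in range(1, 16)]
curve = eleven_point_curve(ranked, total_relevant=10)
print([round(p, 2) for p in curve])
```

Recall levels that the ranking never reaches get precision 0 in this sketch.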

Precision Recall Curves…
When evaluating the retrieval effectiveness of a text retrieval system or method, a large number of queries are used and their average 11-point recall-precision curve is plotted.
--Methods 1 and 2 are better than method 3.
--Method 1 is better than method 2 for high recalls.
[Figure: recall-precision curves for Method 1, Method 2 and Method 3; axes: recall, precision]

1/27 Clarification re: homeworks TA issue Try “weapons of mass destruction” on Google..;-)


The Retrieval Process
[Figure: block diagram of the retrieval process. Components: User Interface; Text Operations (stemming, noun phrase detection, etc.); Query Operations (elaboration, relevance feedback); Indexing; Searching (hash tables, etc.); Ranking (vector models, ...); Index (inverted file); DB Manager Module; Text Database. Flows: user need, text, logical view, query, retrieved docs, ranked docs, user feedback.]

Generating keywords (index terms)
--Logical view of the documents
[Figure: pipeline from docs to index terms: structure, accents and spacing, stopwords, noun groups, stemming, manual indexing; resulting views: structure, full text, index terms]
--Stop-word elimination
--Noun phrase detection
--Stemming
--Generating index terms
--Improving quality of terms
  --Synonyms, co-occurrence detection, latent semantic indexing..

Example of Stemming and Stopword Elimination
The number of Web pages on the World Wide Web was estimated to be over 800 million in 1999.
[Figure: the sentence after stop word elimination and after stemming]
So does Google use stemming? All kinds of stemming? Stopword elimination? Any non-obvious stopwords?
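
To make the two steps concrete, here is a toy sketch applied to the sentence on this slide. The stopword list and the suffix rules are hypothetical and deliberately crude; a real system would use something like Porter's stemmer, which produces different stems.

```python
import re

# Hypothetical stopword list, just large enough for the example sentence.
STOPWORDS = {"the", "of", "on", "was", "to", "be", "over", "in"}
# Toy suffix-stripping rules (NOT Porter's algorithm).
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, tokenize, drop stopwords, then stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


sentence = ("The number of Web pages on the World Wide Web was estimated "
            "to be over 800 million in 1999.")
terms = index_terms(sentence)
print(terms)
```

Even this crude version shows the effect: "pages" and "page" would map to the same index term, and the function words disappear.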

A quick glimpse at inverted files
[Figure: an inverted file, with a Dictionary of terms pointing to their Postings lists]
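
A minimal sketch of the idea, assuming the dictionary is simply the set of distinct terms and each postings list holds the ids of the documents containing that term. The three documents are titles borrowed from the bags-of-words example in these slides; the function name is made up.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term (dictionary entry) to the sorted doc ids containing it (postings)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {
    "c": "The EPS user interface management system",
    "d": "Human machine interface for ABC computer applications",
    "i": "Graph minors: A survey",
}
index = build_inverted_index(docs)
print(index["interface"])
```

Answering a keyword query then only needs the postings of the query terms rather than a scan over every document.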

Documents as bags of keywords
Terms: t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear
Did long discussion on
1. the unintuitive nature of high-D space
2. the fact that the keywords (and thus the dimensions) can be highly correlated

Documents as bags of words (another example)
a: System and human system engineering testing of EPS
b: A survey of user opinion of computer system response time
c: The EPS user interface management system
d: Human machine interface for ABC computer applications
e: Relation of user perceived response time to error measurement
f: The generation of random, binary, ordered trees
g: The intersection graph of paths in trees
h: Graph minors IV: Widths of trees and well-quasi-ordering
i: Graph minors: A survey

Ranking
--A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query
--A ranking is based on fundamental premises regarding the notion of relevance, such as:
  --common sets of index terms
  --sharing of weighted terms
  --likelihood of relevance
--Each set of premises leads to a distinct IR model

IR Models
[Figure: taxonomy of IR models organized by user task]
User Task
--Retrieval: Adhoc, Filtering
  --Classic Models: boolean, vector, probabilistic
    --Set Theoretic: Fuzzy, Extended Boolean
    --Algebraic: Generalized Vector, Lat. Semantic Index, Neural Networks
    --Probabilistic: Inference Network, Belief Network
  --Structured Models: Non-Overlapping Lists, Proximal Nodes
--Browsing
  --Browsing: Flat, Structure Guided, Hypertext

Trivia
Things to come
--Next class: Latent Semantic Indexing.
--Read the online reading + background on PCA
--A matrix homework problem will be assigned before next class


Terminology: Term Weights
--Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents
  --The importance of the index terms is represented by weights associated with them
  --ki is an index term
  --dj is a document
  --t is the total number of index terms
  --K = (k1, k2, …, kt) is the set of all index terms
  --wij >= 0 is a weight associated with (ki,dj)
    --wij = 0 indicates that the term does not belong to the doc
  --vec(dj) = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj

The Boolean Model
--Simple model based on set theory
--Queries specified as boolean expressions
  --$q = k_a \wedge (k_b \vee \neg k_c)$
    --precise semantics
--Terms are either present or absent. Thus, $w_{ij} \in \{0,1\}$
--Consider
  --$q = k_a \wedge (k_b \vee \neg k_c)$
  --$vec(q_{dnf}) = (1,1,1) \vee (1,1,0) \vee (1,0,0)$
  --$vec(q_{cc}) = (1,1,0)$ is a conjunctive component
AI Folks: This is DNF as against CNF which you used in 471

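The Boolean Model
--$q = k_a \wedge (k_b \vee \neg k_c)$
--$sim(q, d_j) = 1$ if $\exists\, vec(q_{cc}) \mid \big(vec(q_{cc}) \in vec(q_{dnf})\big) \wedge \big(\forall k_i,\ g_i(vec(d_j)) = g_i(vec(q_{cc}))\big)$; $0$ otherwise
[Figure: Venn diagram of Ka, Kb and Kc with the regions (1,1,1), (1,1,0) and (1,0,0) marked]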
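
The matching rule can be sketched in a few lines, reading the query as ka AND (kb OR NOT kc), which is the reading consistent with the three conjunctive components listed on the slides. The function and variable names are illustrative only.

```python
# Conjunctive components of the query in DNF, as (ka, kb, kc) presence bits.
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}
INDEX_TERMS = ("ka", "kb", "kc")


def boolean_sim(doc_terms, q_dnf=Q_DNF, index_terms=INDEX_TERMS):
    """Return 1 if the doc's presence pattern equals some conjunctive component, else 0."""
    pattern = tuple(int(k in doc_terms) for k in index_terms)
    return 1 if pattern in q_dnf else 0


print(boolean_sim({"ka", "kb"}))  # pattern (1,1,0)
print(boolean_sim({"ka", "kc"}))  # pattern (1,0,1)
```

The result is all-or-nothing: a document either matches a component exactly or scores 0, with no notion of a partial match.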

Drawbacks of the Boolean Model
--Retrieval based on binary decision criteria with no notion of partial matching
--No ranking of the documents is provided (absence of a grading scale)
--Information need has to be translated into a Boolean expression which most users find awkward
  --The Boolean queries formulated by the users are most often too simplistic
    --As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

The Vector Model
--Use of binary weights is too limiting
  --Non-binary weights provide consideration for partial matches
--These term weights are used to compute a degree of similarity between a query and each document
--Ranked set of documents provides for better matching

The Vector Model
--Documents/Queries are seen as bags of words
  --Represented as vectors over keyword space
  --vec(dj) = (w1j, w2j, ..., wtj); vec(q) = (w1q, w2q, ..., wtq)
    --wiq >= 0 associated with the pair (ki,q)
    --wij > 0 whenever $k_i \in d_j$
  --To each term ki is associated a unitary vector vec(i)
    --The unitary vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
      --Is this reasonable?
--The t unitary vectors vec(i) form an orthonormal basis for a t-dimensional space
  --Each vector holds a place for every term in the collection
  --Therefore, most vectors are sparse

Vector Space Example
Terms: t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear


Similarity Function
The similarity or closeness of a document d = (w1, …, wi, …, wn) with respect to a query (or another document) q = (q1, …, qi, …, qn) is computed using a similarity (distance) function.
Many similarity functions exist: Euclidean distance, dot product, normalized dot product (cosine-theta)

Euclidean distance
--Given two document vectors d1 and d2
--$dist(d_1, d_2) = \sqrt{\sum_{i} (w_{i1} - w_{i2})^2}$

Dot Product distance
sim(q, d) = dot(q, d) = q1*w1 + … + qn*wn
Example: Suppose d = (0.2, 0, 0.3, 1) and q = (0.75, 0.75, 0, 1), then sim(q, d) = 0.15 + 0 + 0 + 1 = 1.15
Observations of the dot product function.
--Documents having more terms in common with a query tend to have higher similarities with the query.
--For terms that appear in both q and d, those with higher weights contribute more to sim(q, d) than those with lower weights.
--It favors long documents over short documents.
--The computed similarities have no clear upper bound.

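A normalized similarity metric
--$sim(q, d_j) = \cos(\theta) = \dfrac{vec(d_j) \cdot vec(q)}{|d_j| \, |q|} = \dfrac{\sum_{i} w_{ij} \, w_{iq}}{|d_j| \, |q|}$
--Since $w_{ij} \ge 0$ and $w_{iq} \ge 0$, $0 \le sim(q, d_j) \le 1$
--A document is retrieved even if it matches the query terms only partially
[Figure: vectors dj and q drawn against axes i and j, separated by the angle θ]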
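
The sketch below computes both the raw dot product and the cosine for the vectors d = (0.2, 0, 0.3, 1) and q = (0.75, 0.75, 0, 1) from the dot product slide. The function names are illustrative.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


d = (0.2, 0, 0.3, 1)
q = (0.75, 0.75, 0, 1)
print(round(dot(q, d), 2))     # matches the 1.15 worked out on the slide
print(round(cosine(q, d), 3))
```

Dividing by the vector lengths removes the dot product's bias toward long documents and bounds the score, at the cost of two extra norm computations per pair.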

Comparison of Euclidean and Cosine distance metrics
Terms: t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear
[Figure: Euclidean and cosine distances side by side]

Answering Queries
--Represent query as vector
--Compute distances to all documents
--Rank according to distance
--Example
  --“database index”
Terms: t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear
Given Q = {database, index} = {1,0,1,0,0,0}

Term Weights in the Vector Model
--$sim(q, d_j) = \dfrac{\sum_{i} w_{ij} \, w_{iq}}{|d_j| \, |q|}$
--How to compute the weights wij and wiq?
  --Simple keyword frequencies tend to favor common words
    --E.g. Query: The Computer Tomography
--A good weight must take into account two effects:
  --quantification of intra-document contents (similarity)
    --tf factor, the term frequency within a document
  --quantification of inter-document separation (dissimilarity)
    --idf factor, the inverse document frequency
  --wij = tf(i,j) * idf(i)

Tf-IDF
--Let
  --N be the total number of docs in the collection
  --ni be the number of docs which contain ki
  --freq(i,j) be the raw frequency of ki within dj
--A normalized tf factor is given by
  --f(i,j) = freq(i,j) / max_l freq(l,j)
    --where the maximum is computed over all terms kl which occur within the document dj
--The idf factor is computed as
  --idf(i) = log(N/ni)
    --the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.

Document/Query Representation using TF-IDF
--The best term-weighting schemes use weights which are given by
  --wij = f(i,j) * log(N/ni)
  --the strategy is called a tf-idf weighting scheme
--For the query term weights, several possibilities:
  --wiq = (0.5 + 0.5 * [freq(i,q) / max_l freq(l,q)]) * log(N/ni)
    --Alternatively, just use the IDF weights (to give preference to rare words)
  --Let the user give the weights to the keywords to reflect her *real* preferences
    --Easier said than done... Users are often dunderheads. Help them with “relevance feedback” techniques.
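
A minimal sketch of the document weights wij = f(i,j) * log(N/ni), with f(i,j) normalized by the most frequent term in the document. The three tiny documents are made up, and the natural logarithm is an assumption since the slides do not fix the base.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: {doc_id: list of terms (non-empty)}. Returns {doc_id: {term: weight}}."""
    N = len(docs)
    counts = {d: Counter(terms) for d, terms in docs.items()}
    # n[t] = number of docs that contain term t
    n = Counter(t for c in counts.values() for t in c)
    weights = {}
    for d, c in counts.items():
        max_freq = max(c.values())  # most frequent term in this doc
        weights[d] = {t: (f / max_freq) * math.log(N / n[t]) for t, f in c.items()}
    return weights


# Hypothetical collection over the slide's six terms.
docs = {
    "d1": ["database", "SQL", "database", "index"],
    "d2": ["regression", "likelihood", "linear"],
    "d3": ["database", "linear"],
}
w = tf_idf(docs)
print(round(w["d1"]["database"], 3), round(w["d1"]["SQL"], 3))
```

In this toy collection "SQL" outweighs "database" in d1 even though it occurs half as often, because it appears in fewer documents.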

Terms: t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear
Given Q = {database, index} = {1,0,1,0,0,0}
Note: In this case, the weights used in the query were 1 for t1 and t3, and 0 for the rest.
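
Ranking against this query could look like the sketch below. The query vector is the one given here; the three document vectors are hypothetical, since the original document table is not available, and cosine similarity is used as the ranking function.

```python
import math

TERMS = ["database", "SQL", "index", "regression", "likelihood", "linear"]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return doc ids ordered by decreasing cosine similarity to the query."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


query = [1, 0, 1, 0, 0, 0]  # Q = {database, index}
docs = {                    # hypothetical term weights over TERMS
    "d1": [3, 2, 4, 0, 0, 0],
    "d2": [0, 0, 0, 5, 4, 3],
    "d3": [1, 0, 0, 2, 1, 1],
}
ranking = rank(query, docs)
print(ranking)
```

Ties are broken by document id here purely to keep the output deterministic.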

The Vector Model: Summary
--The vector model with tf-idf weights is a good ranking strategy with general collections
  --The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.
--Advantages:
  --term-weighting improves quality of the answer set
  --partial matching allows retrieval of docs that approximate the query conditions
  --cosine ranking formula sorts documents according to degree of similarity to the query
--Disadvantages:
  --assumes independence of index terms
  --does not handle synonymy/polysemy
  --query weighting may not reflect user relevance criteria

What is missing?
Reasons that ideal effectiveness is hard to achieve:
1. Similarity function used may not be good enough.
2. Importance/weight of a term in representing a document and query may be inaccurate.
3. Document representation loses information.
4. Users’ inability to describe queries precisely.
5. Same term may have multiple meanings and different terms may have similar meanings.
[Callouts on slide: Query expansion; Relevance feedback; LSI; Co-occurrence analysis]

Some improvements
--Query expansion techniques (for 1)
  --relevance feedback
  --co-occurrence analysis (local and global thesauri)
--Improving the quality of terms [(2), (3) and (5)]
  --Latent Semantic Indexing
  --Phrase-detection

Relevance Feedback
--Main Idea:
  --Modify existing query based on relevance judgements
    --Extract terms from relevant documents and add them to the query
    --and/or re-weight the terms already in the query
  --Two main approaches:
    --Automatic (pseudo-relevance feedback)
    --Users select relevant documents
  --Users/system select terms from an automatically-generated list

Relevance Feedback
--Usually do both:
  --expand query with new terms
  --re-weight terms in query
--There are many variations
  --usually positive weights for terms from relevant docs
  --sometimes negative weights for terms from non-relevant docs
  --Remove terms ONLY in non-relevant documents

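Relevance Feedback for Vector Model
In the “ideal” case where we know the relevant documents a priori:
$vec(q_{opt}) = \dfrac{1}{|C_r|} \sum_{d_j \in C_r} vec(d_j) - \dfrac{1}{N - |C_r|} \sum_{d_j \notin C_r} vec(d_j)$
Cr = set of documents that are truly relevant to Q
N = total number of documents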

Rocchio Method
$Q_1 = \alpha \, Q_0 + \dfrac{\beta}{|D_r|} \sum_{d \in D_r} d - \dfrac{\gamma}{|D_n|} \sum_{d \in D_n} d$
Q0 is the initial query. Q1 is the query after one iteration.
Dr is the set of relevant docs
Dn is the set of irrelevant docs
Alpha = 1, Beta = .75, Gamma = .25 typically.
Other variations possible, but performance similar

Rocchio/Vector Illustration
[Figure: Q0, D1, D2, Q’ and Q” plotted in the plane; axes: Retrieval and Information, each from 0 to 1.0]
Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)
Q’ = ½*Q0 + ½*D1 = (0.45, 0.55)
Q” = ½*Q0 + ½*D2 = (0.80, 0.20)
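
The update can be sketched as follows, assuming the standard Rocchio form: the new query is alpha times the old query, plus beta times the centroid of the relevant docs, minus gamma times the centroid of the non-relevant docs. The function name and the default constants (the typical values quoted in these slides) are the only parameters; the check at the end reproduces Q’ and Q” from this illustration with alpha = beta = ½ and gamma = 0.

```python
def rocchio(q0, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """One Rocchio iteration over plain lists of term weights."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(q0)
        return [sum(column) / len(docs) for column in zip(*docs)]

    rel, non = centroid(relevant), centroid(non_relevant)
    return [alpha * q + beta * r - gamma * n for q, r, n in zip(q0, rel, non)]


Q0 = [0.7, 0.3]  # retrieval of information
D1 = [0.2, 0.8]  # information science
D2 = [0.9, 0.1]  # retrieval systems

q_prime = rocchio(Q0, [D1], [], alpha=0.5, beta=0.5, gamma=0.0)
q_double_prime = rocchio(Q0, [D2], [], alpha=0.5, beta=0.5, gamma=0.0)
print([round(x, 2) for x in q_prime], [round(x, 2) for x in q_double_prime])

# With the typical constants, D1 judged relevant and D2 judged non-relevant:
q1 = rocchio(Q0, [D1], [D2])
```

This sketch does not clip negative weights, so a large gamma can push a term's weight below zero; how to treat such terms is a separate design decision.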

Example Rocchio Calculation
[Figure: worked example showing the relevant docs, the non-relevant doc, the original query, the constants, the Rocchio calculation, and the resulting feedback query]

Rocchio Method
--Rocchio automatically
  --re-weights terms
  --adds in new terms (from relevant docs)
    --have to be careful when using negative terms
    --Rocchio is not a machine learning algorithm
--Most methods perform similarly
  --results heavily dependent on test collection
--Machine learning methods are proving to work better than standard IR approaches like Rocchio

Using Relevance Feedback
--Known to improve results
  --in TREC-like conditions (no user involved)
--What about with a user in the loop?
  --How might you measure this?
    --Precision/Recall figures for the unseen documents need to be computed
