C.Watters, csci6403

Slide 1: Classical IR Models
Slide 2: Goal
- Hit set of relevant documents
- Ranked set
- Best match
- Answer
Slide 3: Models
- Boolean (based on set theory)
  – fuzzy logic
  – extended Boolean
- Vector space (based on algebra)
  – latent semantic indexing
  – neural networks
- Probabilistic
  – inference networks
  – belief networks
- Hypertext
Slide 4: Retrieval
- Ad hoc
- Repeated
  – filtering
  – selective dissemination of information (SDI)
  – profiles
- Browsing
Slide 5: Index Terms
K = {k_1, …, k_n}

Example document, "Migration to Australia": This page introduces information about migrating to Australia (as a migrant or refugee), which means travelling to Australia with a visa that gives you the right to live permanently in Australia. Please note: if you plan to visit Australia (that is, not stay permanently) and you want to work, please read the information about temporary entry.
Slide 6: Index Term Weights
For each index term k_i in document d_j, a weight w_{i,j} in the range [0..1] is assigned. Weights are generally assumed to be independent of one another.
What does the choice of range tell us?
– {0,1}: binary weights (a term is either present or absent)
– [0..1]: continuous weights (a degree of importance)
Slide 7: Document as a Set of Terms
A document is represented by its vector of term weights:
d_j = (w_{1,j}, w_{2,j}, w_{3,j}, …, w_{n,j})
where w_{1,j} is the weight of term 1 in document j.
So what does it mean if:
– w_{1,j} = 0 (term 1 is absent from the document)
– w_{1,j} = 1 (term 1 carries the maximum weight)
– w_{1,j} = .2 (term 1 is present but weakly weighted)
Slide 8: Inverted File
Term -> {occurrences}
- Organized for fast access by term
- Plus any extra information you need for your retrieval algorithm
- Size?
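An inverted file like this can be sketched as a dictionary mapping each term to a sorted posting list of document IDs. A minimal Python sketch (the sample documents are illustrative, not from the slides):

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to a sorted posting list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {
    1: "migration to Australia",
    2: "Australia visa information",
    3: "temporary entry visa",
}
index = build_inverted_file(docs)
# index["australia"] == [1, 2], index["visa"] == [2, 3]
```

Any extra per-occurrence information the retrieval algorithm needs (term frequency, positions) would be stored in the posting entries instead of bare IDs.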
Slide 9: Boolean Model
Based on set theory using index terms:
- Term weights: w_{i,j} ∈ {0,1}
- Document vectors: d_j = (0,1,0,…)
- Boolean queries use AND, OR, NOT, e.g. Q = t_1 AND t_2 OR t_3
Examples:
- Australia AND work AND papers
- Australia AND visa
- Australia OR visa
Slide 10: Boolean Representation
Sim(d_j, q) ∈ {0,1}
Sample (t_1 = Australia, t_2 = visa, t_3 = outback):
- d_1 = (0,1,0)
- d_2 = (1,1,0)
- d_3 = (0,1,1)
Exercises:
- Australia AND visa: sim(d_1, q) = ?
- Australia OR visa: sim(d_2, q) = ?
- Australia NOT visa: sim(d_3, q) = ?
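Because the weights are binary, each connective reduces to a bitwise operation on the document vectors. A small Python sketch that works through the exercises above, under the term ordering t_1 = Australia, t_2 = visa, t_3 = outback:

```python
# Binary document vectors over (t1 = Australia, t2 = visa, t3 = outback)
d1, d2, d3 = (0, 1, 0), (1, 1, 0), (0, 1, 1)
AUSTRALIA, VISA = 0, 1  # positions of the query terms

def sim_and(d):  # Australia AND visa
    return d[AUSTRALIA] & d[VISA]

def sim_or(d):   # Australia OR visa
    return d[AUSTRALIA] | d[VISA]

def sim_not(d):  # Australia NOT visa (Australia AND not-visa)
    return d[AUSTRALIA] & (1 - d[VISA])

# sim_and(d1) == 0, sim_or(d2) == 1, sim_not(d3) == 0
```

Note that d_1, which contains "visa" but not "Australia", still satisfies the OR query: sim_or(d1) == 1.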
Slide 11: Index Structure
- Australia: 1, 4, 77
- Migrant: 1, 5, 87, 97, 123
- Visa: 4, 19, 55, 97
Algorithm?
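One standard answer to the "Algorithm?" prompt, for an AND query, is a linear merge of the two sorted posting lists. A sketch using the postings above:

```python
def intersect(p1, p2):
    """Merge-intersect two sorted posting lists in O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

australia = [1, 4, 77]
visa = [4, 19, 55, 97]
# intersect(australia, visa) == [4]: the one doc matching "Australia AND visa"
```

OR is the analogous merge that keeps every ID seen in either list; NOT keeps IDs from the first list that the second list skips.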
Slide 12: Complex Queries
(red OR blue) AND (sedan OR (suv AND ford))
Efficiency?
Slide 13: Problems with the Boolean Model
- Misinterpretation of the query by users (e.g. "mouse device")
- Binary weights used for index terms (e.g. "red BMW convertible")
- Elimination of partial results
- Binary results
  – a document either fits or it doesn't
  – too few or too many results
Slide 14: Dominance of This Model
- Simple to implement
- Simple to use
Examples?
Slide 15: Vector Space Model
- Relax the binary weight restriction
- Allow partial matches
- Provide a ranking of results
Goal: determine the degree of similarity between each document and the query.
Slide 16: Document and Query Vectors
Given n possible index terms:
- The ith term in the jth document has term weight w_{i,j} ∈ [0..1], giving d_j = (w_{1,j}, w_{2,j}, …, w_{n,j}).
- The kth query term has query weight w_{k,q} ∈ [0..1], giving q = (w_{1,q}, w_{2,q}, …, w_{n,q}).
Slide 18: Using the Similarity Values
- Calculate similarities
- Rank
- Use a threshold
Example: Q = Heat (.8), Film (.3), H'wood (.5)
- What is the result and its order?
- What would the Boolean result be?
Slide 19: Index Term Weights
Given a set of documents, the goals are to:
- find features that describe document X
- find features that differentiate document X from Y
IR treats documents as clusters (bags) of terms:
- intra-cluster similarity
- inter-cluster dissimilarity
Slide 20: Intra-Document Term Similarity
- Raw frequency of terms within the document: tf, the term frequency factor
- Problems: common words; size of the document
- Normalized tf: f_{i,j} = freq_{i,j} / max_k(freq_{k,j})
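As a sketch of the normalization above: divide each term's raw count by the largest count of any term in the same document, which controls for document length (the sample sentence is illustrative):

```python
from collections import Counter

def normalized_tf(text):
    """f_{i,j} = freq_{i,j} / max_k(freq_{k,j}) for one document."""
    freq = Counter(text.lower().split())
    max_freq = max(freq.values())
    return {term: count / max_freq for term, count in freq.items()}

tf = normalized_tf("to migrate to Australia you need a visa to migrate")
# "to" occurs 3 times (the document maximum), so tf["to"] == 1.0
# "migrate" occurs twice, so tf["migrate"] == 2/3
```

Note this still rewards common words like "to"; that is the job of the idf factor on the next slide.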
Slide 21: Inter-Document Dissimilarity
- Measure the frequency of terms across the document set: idf, the inverse document frequency
- idf_i = log(N / n_i), where N is the number of documents and n_i is the number of documents containing term k_i
- The log dampens the effect of increases in collection size
Slide 23: Combining tf and idf
- Term frequency: more is better
- Document frequency: less is better
- Together they accentuate the differences
Example: "migrate" occurs 3 times (and appears in 10 docs out of 500); "Australia" occurs 5 times (and appears in 400 docs out of 500).
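Running the example's numbers through tf × idf shows how the two factors accentuate the difference. Log base 10 is an assumption here (the slides do not fix a base), but the ranking is the same for any base:

```python
import math

N = 500  # documents in the collection

def tf_idf(tf, n_i):
    """tf * idf with idf_i = log10(N / n_i); base 10 assumed."""
    return tf * math.log10(N / n_i)

migrate   = tf_idf(3, 10)    # frequent in few docs:  3 * log10(50)   ≈ 5.10
australia = tf_idf(5, 400)   # frequent in many docs: 5 * log10(1.25) ≈ 0.48
```

Although "Australia" occurs more often in the document, the rare term "migrate" ends up with roughly ten times the weight.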
Slide 24: OK
Use term weights to calculate:
- document-to-document similarity (more high-weight terms in common), and
- query-to-document similarity (the query terms are high-weight terms in the document)
Slide 25: Document-Document Similarity
Slide 26: Example
Document 1: Australia sample document
– Australia, weight .05
– Migrate, weight .56
Document 2: Geese Migration
– Geese, weight .45
– Migrate, weight .55
Slide 27: Vector Structure
Doc1: .1, 0, 0, .4, 0, 0, 0, .8, .7, 0, .7, .7
Doc2: .1, .1, 0, .1, 0, .8, .7, .9, .7, .1, .2, .3
Doc3: .4, .1, 0, 0, .9, .5, .5, 0, 0, 0, .9, .7
Algorithm?
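One answer to the "Algorithm?" prompt is pairwise cosine similarity over these weight vectors (the normalized measure sketched on a later slide). A pure-Python sketch:

```python
import math

def cosine(a, b):
    """Cosine of the angle between weight vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

doc1 = [.1, 0, 0, .4, 0, 0, 0, .8, .7, 0, .7, .7]
doc2 = [.1, .1, 0, .1, 0, .8, .7, .9, .7, .1, .2, .3]
doc3 = [.4, .1, 0, 0, .9, .5, .5, 0, 0, 0, .9, .7]

# doc1 shares more high-weight positions with doc2 than with doc3,
# so cosine(doc1, doc2) > cosine(doc1, doc3)
ranked = sorted([("doc2", cosine(doc1, doc2)), ("doc3", cosine(doc1, doc3))],
                key=lambda p: p[1], reverse=True)
```

Normalizing by the vector lengths keeps long documents from winning merely by having many non-zero weights.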
Slide 28: Query-Document Similarity
Sim(D, Q) = Σ_i (w_{i,q} × w_{i,d})
So for query = Australia (.5), Geese (.8):
- Sim(doc1, Q) = ?
- Sim(doc2, Q) = ?
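Filling in the blanks with the weights from the Slide 26 example (Doc 1: Australia .05, Migrate .56; Doc 2: Geese .45, Migrate .55): only terms that appear in both the query and the document contribute to the sum.

```python
def sim(doc, query):
    """Sim(D, Q) = sum of w_{i,q} * w_{i,d} over the shared terms."""
    return sum(w_q * doc.get(term, 0.0) for term, w_q in query.items())

doc1 = {"australia": 0.05, "migrate": 0.56}   # Slide 26, Document 1
doc2 = {"geese": 0.45, "migrate": 0.55}       # Slide 26, Document 2
query = {"australia": 0.5, "geese": 0.8}

s1 = sim(doc1, query)  # 0.5 * 0.05 = 0.025 (only "australia" is shared)
s2 = sim(doc2, query)  # 0.8 * 0.45 = 0.36  (only "geese" is shared)
```

Doc 2 outranks Doc 1 even though neither document contains every query term, which is exactly the partial matching the Boolean model cannot express.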
Slide 29: Doing Better!
- Augmented weighting schemes
- Vector space similarity measures
Slide 30: Query Term Weights
Natural-language query: "I am doing a paper on shipping for my class at Dalhousie. Are there any reports from this university on deep sea shipping?"
Query term weights can be based on:
- frequency
- part of speech
Slide 31: Using Similarity: Partial Matches
Given w_{i,j} and w_{i,q}, sim(q, d_j) ∈ [0..1], so every document has a similarity value to every query.
E.g., "Dalhousie shipping":
- What does OR mean here?
- What does AND mean here?
- How do we manage this?
Slide 32: Using Similarity: Ranking
- Order the results by similarity value, computed from the query and document term weights
- E.g., Dalhousie shipping??
Slide 33: Similarity of Q to Docs (Normalized)
sim(d_j, q) = cos(θ) = (d_j · q) / (|d_j| |q|)
(Figure: document vector d_j and query vector q, with θ the angle between them.)
Slide 36: So why do we need a vector?
Slide 39: Other Similarity Measures
Slide 40: Cosine Similarity
sim(i, j) = C / √(A × B)
where C = number of terms in common, A = number of terms in i, and B = number of terms in j.
Slide 41: Dice Similarity Measure
sim(i, j) = 2C / (A + B)
where C = number of terms in common, A = number of terms in i, and B = number of terms in j.
Slide 42: Jaccard Similarity Measure
sim(i, j) = C / (A + B − C)
where C = number of terms in common, A = number of terms in i, and B = number of terms in j.
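Treating each document as a set of terms, the three coefficients differ only in how they normalize the common-term count C. A sketch with an illustrative pair of term sets:

```python
import math

def counts(terms_i, terms_j):
    """C = terms in common, A = terms in i, B = terms in j."""
    return len(terms_i & terms_j), len(terms_i), len(terms_j)

def cosine_coeff(ti, tj):
    C, A, B = counts(ti, tj)
    return C / math.sqrt(A * B)

def dice_coeff(ti, tj):
    C, A, B = counts(ti, tj)
    return 2 * C / (A + B)

def jaccard_coeff(ti, tj):
    C, A, B = counts(ti, tj)
    return C / (A + B - C)

d_i = {"australia", "migrate", "visa"}
d_j = {"australia", "visa", "work", "papers"}
# C = 2, A = 3, B = 4:
# cosine 2/sqrt(12) ≈ 0.577, Dice 4/7 ≈ 0.571, Jaccard 2/5 = 0.4
```

All three reach 1 only for identical term sets; Jaccard penalizes non-shared terms most heavily.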
Slide 43: Vector Space Model
Advantages:
- allows partial matches
- allows ranking
Disadvantages:
- needs the whole document set to determine the weights
- extra computation
- terms are assumed to be independent
Slide 44: Neoclassical Models
- Probabilistic model
- Boolean variations
  – fuzzy set model
  – extended Boolean
- Vector space variations
  – generalized vector space (term dependency)
  – latent semantic indexing
  – neural net models