
1 Information Retrieval, Search Engines and Libraries
Lesson 12, Digital Humanities, Ben-Gurion University, Yael Netzer

2 The Library of Nineveh

3 Information retrieval
Document: a 'coherent' text (it contains topics related to one another).
Document collection: a corpus (topical or general).
Information need: a topic the searcher wants to know more about (user-dependent! everyone knows different things).
Query: the representation of the information need.
Relevance: the value of a document relative to the information need (whether it holds value for the need; binary or graded).

4 The bag-of-words model (BOW)
Structure is ignored; a document is reduced to a list of its terms, drawn from the vocabulary.
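
A minimal Python sketch of the idea (the example sentence and variable names are illustrative, not from the slides):

    import re
    from collections import Counter

    doc = "The library of Nineveh held clay tablets; the tablets held many texts."
    # A bag of words keeps only term counts; word order and structure are discarded.
    bow = Counter(re.findall(r"[a-z]+", doc.lower()))
    print(bow)   # Counter({'the': 3, 'held': 2, 'tablets': 2, ...})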


6 The vector space model
The idea is that we represent a document as a vector of terms.
A term is a word or a phrase. Each word or phrase, that is, each term, is one dimension of the vector.
The size of the vocabulary, i.e. the number of terms, determines the dimension of the vector.

7 Dot product of vectors:
The projection of the vector A onto the vector B. From Wikipedia.
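
In coordinates, the dot product the figure illustrates is

    \vec{A} \cdot \vec{B} = \sum_{i=1}^{n} A_i B_i = \|\vec{A}\| \, \|\vec{B}\| \cos\theta

and dividing out the two vector lengths leaves cos(theta), the cosine similarity used below to compare documents and queries.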

8 http://blog.christianperone.com


11 What are the values of the vector? We consider three parameters:
the number of occurrences of a word in the document;
the number of documents in the collection in which the word appears;
the length of the document.
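
These three parameters are exactly the ingredients of the standard tf-idf weighting: the first is the term frequency tf, the second is the document frequency df behind the idf factor, and the third enters through length normalization. A common form of the weight is

    w_{t,d} = \mathrm{tf}_{t,d} \times \log_{10} \frac{N}{\mathrm{df}_t}

where N is the number of documents in the collection; document length is then handled by cosine (length) normalization.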

12 Weighting term frequency: tf
What is the relative importance of 0 vs. 1 occurrence of a term in a doc? 1 vs. 2 occurrences? 2 vs. 3 occurrences? … Unclear: while it seems that more is better, a lot isn't proportionally better than a few. (The Kandy-Kolored Tangerine-Flake Streamline Baby; You'd have to let me know!)
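
A common answer is to dampen raw counts with a logarithm, so that more occurrences do count for more, but not proportionally more:

    w_{t,d} = 1 + \log_{10} \mathrm{tf}_{t,d} \quad \text{if } \mathrm{tf}_{t,d} > 0, \qquad w_{t,d} = 0 \text{ otherwise}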


16 Effect of idf on ranking
Does idf have an effect on ranking for one-term queries, like "iPhone"? idf has no effect on the ranking for one-term queries. idf affects the ranking of documents only for queries with at least two terms. For the query "capricious person", idf weighting makes occurrences of "capricious" count for much more in the final document ranking than occurrences of "person".
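
The usual inverse document frequency weight is

    \mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t}

where N is the collection size and df_t the number of documents containing t. For a one-term query the same idf_t multiplies every matching document's score, so it cannot change their relative order; with two or more terms, rare terms (high idf, like "capricious") dominate common ones (low idf, like "person").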


21 Comments on Vector Space Models
Simple, mathematically based approach. Considers both local (tf) and global (idf) word occurrence frequencies. Provides partial matching and ranked results. Tends to work quite well in practice despite obvious weaknesses. Allows efficient implementation for large document collections.

22 Problems with Vector Space Model
Missing semantic information (e.g. word sense). Missing syntactic information (e.g. phrase structure, word order, proximity information). Assumption of term independence (e.g. ignores synonymy). Lacks the control of a Boolean model (e.g., requiring a term to appear in a document). Given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, but each less frequently.

23 Practical Implementation
Based on the observation that documents containing none of the query keywords do not affect the final ranking. Try to identify only those documents that contain at least one query keyword. In practice this is achieved with an inverted index.

24 Step 1: Preprocessing Implement the preprocessing functions:
for tokenization, for stop word removal, for stemming.
Input: documents that are read one by one from the collection.
Output: tokens to be added to the index; no punctuation, no stop words, stemmed.
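
A minimal sketch of such a preprocessing module in Python; the stop-word list and the suffix stripper are crude placeholders (a real system would use a full stop list and a proper stemmer, such as Porter's):

    import re

    STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "on", "the", "to"}

    def stem(token):
        # Very crude suffix stripping, standing in for a real stemmer.
        for suffix in ("ing", "ed", "es", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def preprocess(text):
        # Tokenize on letter runs (this drops punctuation), remove stop words, stem.
        tokens = re.findall(r"[a-z]+", text.lower())
        return [stem(t) for t in tokens if t not in STOP_WORDS]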

25 Step 2: Indexing Build an inverted index, with an entry for each word in the vocabulary.
Input: tokens obtained from the preprocessing module.
Output: an inverted index for fast access.

26 Step 2 (cont'd) We need: one entry for each word in the vocabulary.
For each such entry: keep a list of all the documents where the word appears, together with the corresponding frequency: TF.
For each such entry, also keep the number of documents in which the word appears (its document frequency): this is what IDF is computed from.
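
Continuing the sketch, the index can be built in one pass, keeping per-document term frequencies and each term's document frequency (preprocess is the function sketched above):

    from collections import defaultdict

    def build_index(docs):
        # docs: {doc_id: raw text}.
        # Returns {term: {doc_id: tf}} plus the document frequencies {term: df}.
        index = defaultdict(dict)
        for doc_id, text in docs.items():
            for token in preprocess(text):
                index[token][doc_id] = index[token].get(doc_id, 0) + 1
        df = {term: len(postings) for term, postings in index.items()}
        return index, df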

27 Step 2 (cont'd) Schematically, the index file maps each term to its document frequency and to a postings list of (document, term frequency) pairs:

    Index terms    df    (Dj, tfj)
    computer        3    D7, 4  · · ·
    database        2    D1, 3  · · ·
    science         4    D2, 4  · · ·
    system          1    D5, 2  · · ·

28 Step 2 (cont'd) TF and IDF for each token can be computed in one pass.
Cosine similarity also requires document lengths, so a second pass is needed to compute document vector lengths. Remember that the length of a document vector is the square root of the sum of the squares of the weights of its tokens, and that the weight of a token is TF * IDF. Therefore we must wait until the IDFs are known (and therefore until all documents are indexed) before document lengths can be determined. Do a second pass over all documents: keep a list or hashtable with all document ids, and for each document determine its length.
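
In the sketch, the second pass can iterate over the index rather than the raw documents, accumulating the squared tf-idf weight of every token per document:

    import math

    def doc_lengths(index, df, n_docs):
        # Vector length = sqrt of the sum of squared token weights (TF * IDF);
        # computable only once all IDFs, i.e. all documents, are in the index.
        sq = {}
        for term, postings in index.items():
            idf = math.log10(n_docs / df[term])
            for doc_id, tf in postings.items():
                sq[doc_id] = sq.get(doc_id, 0.0) + (tf * idf) ** 2
        return {doc_id: math.sqrt(s) for doc_id, s in sq.items()}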

29 Step 3: Retrieval Use the inverted index (from Step 2) to find the limited set of documents that contain at least one of the query words. Incrementally compute the cosine similarity of each indexed document as query words are processed one by one. To accumulate a total score for each retrieved document, store retrieved documents in a hashtable, where the document id is the key and the partial accumulated score is the value. Input: query and inverted index (from Step 2). Output: similarity values between query and documents.
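
A sketch of this step, processing query words one by one and accumulating partial scores in a dictionary keyed by document id. The query-side weight here is simply idf (a simplifying assumption); the query's own vector length is the same for every document and so does not affect the ranking:

    def retrieve(query, index, df, lengths, n_docs):
        scores = {}
        for term in preprocess(query):
            if term not in index:
                continue                      # term absent from the collection
            idf = math.log10(n_docs / df[term])
            for doc_id, tf in index[term].items():
                scores[doc_id] = scores.get(doc_id, 0.0) + idf * (tf * idf)
        # Divide by document length to complete the cosine.
        return {d: s / lengths[d] for d, s in scores.items() if lengths.get(d)}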

30 Queries as vectors (Sec. 6.3) Key idea 1: do the same for queries: represent them as vectors in the space. Key idea 2: rank documents according to their proximity to the query in this space. Proximity = similarity of vectors; proximity ≈ inverse of distance. Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model. Instead: rank more relevant documents higher than less relevant documents.

31 Step 4: Ranking Sort the hashtable of retrieved documents by their cosine similarity values:
    sort {$retrieved{$b} <=> $retrieved{$a}} keys %retrieved
Return the documents in descending order of their relevance. Input: similarity values between query and documents. Output: a ranked list of documents in descending order of relevance.
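
The equivalent of the Perl one-liner in the Python of the earlier sketches, where retrieved maps document ids to accumulated scores:

    ranked = sorted(retrieved.items(), key=lambda kv: kv[1], reverse=True)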

32 Summary – vector space ranking
Represent the query as a weighted tf-idf vector. Represent each document as a weighted tf-idf vector. Compute the (cosine) similarity score for the query vector and each document vector. Rank documents with respect to the query by score. Return the top K (e.g., K = 10) to the user.
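
Putting the sketched pieces together on a toy collection (documents and query invented for illustration):

    docs = {
        "D1": "Red wine and heart health",
        "D2": "White wine with dinner",
        "D3": "Heart attack risk factors",
    }
    index, df = build_index(docs)
    lengths = doc_lengths(index, df, n_docs=len(docs))
    scores = retrieve("red wine heart attack", index, df, lengths, len(docs))
    K = 10
    for doc_id, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:K]:
        print(doc_id, round(score, 3))   # D1 ranks first: it matches red, wine, heart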

33 Evaluation of search engines

34 Evaluating an information retrieval system
An information need is translated into a query, but relevance is judged relative to the information need, not relative to the query. That is:
Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
Query: wine red white heart attack effective
Relevance is assessed by asking whether the retrieved document meets the information need, not whether it contains the search terms.

35 Difficulties with gauging relevancy
Relevancy, from a human standpoint, is: Subjective: depends upon a specific user's judgment. Situational: relates to the user's current needs. Cognitive: depends on human perception and behavior. Dynamic: changes over time.
Examples: for the user need "number of words page limit", is the retrieved page "How to limit number of words in an MS Word page" equivalent? A searcher for IICAI/NLDB may opt to change the query to IJCAI/VLDB. Googling "Nano" returns Amazon results (polysemy/synonymy).

36 Unranked retrieval evaluation: Precision and Recall
Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved). Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant).
Precision P = tp/(tp + fp). Recall R = tp/(tp + fn).

                     Relevant    Not Relevant
    Retrieved          tp            fp
    Not Retrieved      fn            tn
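
The same computation as code, with retrieved and relevant given as sets of document ids (a small illustrative helper, not from the slides):

    def precision_recall(retrieved, relevant):
        tp = len(retrieved & relevant)                        # true positives
        precision = tp / len(retrieved) if retrieved else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        return precision, recall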

37 Another view
[Venn diagram over the space of all documents: the set of relevant documents and the set of retrieved documents overlap in the region "relevant + retrieved"; everything outside both sets is "not relevant + not retrieved".]

38 Trade-offs
[Precision-recall plot, both axes from 0 to 1: a system at high precision but low recall returns relevant documents but misses many useful ones too; a system at high recall but low precision returns most relevant documents but includes a lot of junk; the ideal is the corner where precision = 1 and recall = 1.]

39 Why not just use accuracy?
How to build a highly "accurate" search engine on a low budget: since the vast majority of documents are non-relevant, a system that retrieves nothing is correct almost all of the time. But people doing information retrieval want to find something, and have a certain tolerance for junk.
Snoogle.com. Search for: [anything]. 0 matching results found.

40 Precision/Recall You can get high recall (but low precision) by retrieving all docs for all queries! Recall is a non-decreasing function of the number of docs retrieved. In a good system, precision decreases as either the number of docs retrieved or recall increases; this is a fact with strong empirical confirmation.

41 Difficulties in using precision/recall
Should average over large corpus/query ensembles. Need human relevance assessments, and people aren't reliable assessors. Assessments have to be binary; what about nuanced assessments? Heavily skewed by corpus/authorship: results may not translate from one domain to another.

42 F-Measure Harmonic mean of recall and precision
The harmonic mean is "the reciprocal of the arithmetic mean of the reciprocals"; it tends to understate large values and give more weight to small ones. Beta controls the relative importance of precision and recall: with Beta = 1, precision and recall are equally important; with Beta = 5, recall is five times more important than precision.
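
In formula form (the standard weighted F-measure, matching the slide's description of beta):

    F_\beta = \frac{(1 + \beta^2) \, P \, R}{\beta^2 P + R}, \qquad F_1 = \frac{2PR}{P + R}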

43 F1 and other averages This graph throws light on various ways to combine precision and recall, and shows how the F-measure compares to these choices. The x-axis is precision (with recall fixed at 70%); the y-axis shows various combinations such as max(P,R), the arithmetic mean AM(P,R), the F-measure/harmonic mean HM(P,R), etc. All lines meet at (70%, 70%).


