Introduction to Information Retrieval. Artificial Intelligence. Avi Rosenfeld.

Hopkins IR Workshop 2005. Copyright © Victor Lavrenko.
What is information retrieval?
Searching for information on the Internet
– "Rabbi Dr. Google" (Bing?)
– a flourishing and important research field...
Searching for relevant information across many sources (websites, images, etc.)

What is the difference between database queries and information retrieval?
Aspect | Databases | Information retrieval
Data | Structured (key) | Unstructured (web)
Fields | Clear semantics (SSN, age) | No fields (other than text)
Queries | Defined (relational algebra, SQL) | Free text ("natural language"), Boolean
Matching | Exact (results are always "correct") | Imprecise (need to measure effectiveness)

The information retrieval system
IR System
– Inputs: query string, document corpus
– Output: ranked documents (1. Doc1, 2. Doc2, 3. Doc3)

Stages of a search engine
Building the document corpus
– Web crawler
– cleaning the data of duplicates, stemming
– word frequency
– word importance: TF-IDF
Building the indexes (indexing)
Building the answer
– processing the query (removing stop words)
– ranking the results (PageRank)
Analyzing the results
– false positives / false negatives
– recall / precision
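The last stage (false positives, false negatives, recall, precision) can be illustrated with a small sketch. The function name `precision_recall` and the document ids are made up for illustration; the slide itself gives no formula, so this uses the standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved vs. truly relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    false_positives = len(retrieved - relevant)  # returned but not relevant
    false_negatives = len(relevant - retrieved)  # relevant but missed
    precision = true_positives / (true_positives + false_positives) if retrieved else 0.0
    recall = true_positives / (true_positives + false_negatives) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments.
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d1", "d3", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four returned documents are relevant, and two of the three relevant documents were found.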

Building the index

Examples – note the time and the number of results!

Web crawler
– Identifies and acquires documents for the search engine
– http://en.wikipedia.org/wiki/Web_crawler
A web crawler is a kind of bot or program that scans the WWW automatically and systematically. Its behavior combines four policies:
– a selection policy, which defines which pages to download;
– a revisit policy, which defines when to check pages for changes;
– a politeness policy, which defines how to avoid overloading sites and bringing down their servers;
– a parallelization policy, which defines how to coordinate between the different crawlers.
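As a rough illustration of the selection policy, here is a minimal breadth-first crawler over a made-up in-memory link graph, so it runs without network access. `TOY_WEB`, `crawl` and the `*.example` URLs are hypothetical; the revisit, politeness and parallelization policies are deliberately left out.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> list of outgoing links.
TOY_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/1"],
    "http://c.example/": [],
}


def crawl(seed, allowed_hosts, max_pages=10):
    """Breadth-first crawl; returns pages in the order they were fetched."""
    frontier, seen, fetched = deque([seed]), {seed}, []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)  # stands in for downloading the page
        for link in TOY_WEB.get(url, []):
            host = urlparse(link).hostname
            # Selection policy: only follow links to hosts we were asked to cover,
            # and never queue the same URL twice.
            if host in allowed_hosts and link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched


order = crawl("http://a.example/", allowed_hosts={"a.example", "b.example"})
print(order)
```

A real crawler would also rate-limit requests per host (politeness) and re-queue pages for later re-checking (revisit).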

Data cleaning: compression
Text is highly redundant (or predictable).
Compression techniques exploit this redundancy to make files smaller without losing any of the content.
Compression of indexes is covered later.
Popular algorithms can compress HTML and XML text by 80%
– e.g., DEFLATE (zip, gzip) and LZW (UNIX compress, PDF)
– may compress large files in blocks to make access faster
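A quick way to see lossless compression at work is Python's standard `zlib` module, which implements DEFLATE. The HTML snippet below is invented and far more repetitive than a real page, so the ratio it prints is not representative of the 80% figure above.

```python
import zlib

# Artificially repetitive HTML, purely for illustration.
html = ("<ul>" + "<li class='item'>search engine</li>" * 200 + "</ul>").encode("utf-8")

compressed = zlib.compress(html, 9)    # 9 = strongest compression level
restored = zlib.decompress(compressed)

ratio = 1 - len(compressed) / len(html)
print(f"{len(html)} -> {len(compressed)} bytes ({ratio:.0%} smaller)")
print("lossless:", restored == html)
```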

Data cleaning: noise
Many web pages contain text, links, and pictures that are not directly related to the main content of the page.
This additional material is mostly noise that could negatively affect the ranking of the page.
Techniques have been developed to detect the content blocks in a web page
– Non-content material is either ignored or reduced in importance in the indexing process

Noise Example

Example Web Page

Data cleaning: finding the important information
Tokenizer recognizes "words" in the text
– Must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
Markup languages such as HTML, XML often used to specify structure and metatags
– Tags used to specify document elements
– E.g., Overview
– Document parser uses syntax of markup language (or other formatting) to identify structure
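To make the tokenizer issues concrete, here is one possible regex-based sketch. The choices it makes (lowercase everything, keep internal hyphens and apostrophes, split on periods) are assumptions for illustration; real tokenizers decide each of these differently.

```python
import re

# A token is a run of letters/digits, optionally joined by internal ' or -.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")


def tokenize(text):
    """Lowercase the text and return its tokens in order."""
    return TOKEN_RE.findall(text.lower())


tokens = tokenize("Don't over-think it: U.S. state-of-the-art IR, 2005!")
print(tokens)
```

Note how "U.S." breaks into "u" and "s" under these rules, which is exactly the kind of case the slide warns about.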

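Ranking the data
Statistical information about the documents
– the number of times words appear, and their position in the document
Weights for words
– tf.idf weight
– combines the frequency of a word in the document with its frequency in the whole corpus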

Zipf's law
Words in a language do not follow a normal distribution
– a small number of words occur very often (they even make up most of the text)
– stopwords = words that occur so often that they are not important for search
– e.g., the two most common words ("the", "of") make up about 10% of all word occurrences in text documents
Zipf's "law":
– the observation that the rank ($r$) of a word times its frequency ($f$) is approximately a constant ($k$), assuming words are ranked in order of decreasing frequency
– i.e., $r \cdot f \approx k$ or $r \cdot P_r \approx c$, where $P_r$ is the probability of word occurrence and $c \approx 0.1$ for English
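The rank-times-probability product can be tabulated with a few lines of Python. The helper `rank_frequency` and the sample sentence are illustrative only; a sentence this short cannot demonstrate the law, so the sketch shows the bookkeeping rather than the constant.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return rows (rank, word, frequency, rank * probability)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    rows = []
    # most_common() lists words by decreasing frequency, so rank 1 is the top word.
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq / total))
    return rows


rows = rank_frequency("the cat and the dog and the bird saw the fish".split())
for rank, word, freq, r_times_p in rows:
    print(rank, word, freq, round(r_times_p, 3))
```

Run on a large English corpus, the last column would be expected to stay roughly near 0.1 for most ranks.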

Zipf’s Law

tf*idf weighting schema
tf = term frequency
– frequency of a term/keyword in a document
The higher the tf, the higher the importance (weight) for the doc.
df = document frequency
– no. of documents containing the term
– distribution of the term
idf = inverse document frequency
– the unevenness of term distribution in the corpus
– the specificity of term to a document
The more evenly the term is distributed, the less specific it is to a document.
weight(t,D) = tf(t,D) * idf(t)

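Example
Suppose the word "example" appears 3 times in document A, and document A is the only one of 4 documents that contains it.
– tf = 3 (in this document)
– idf = 1/0.25 = 4 (the word occurs in a quarter of the documents)
– tf*idf = 12
– What happens if the word appears in every document? What will tf*idf be? (3)
There are several variants for computing tf and idf (for example, with a logarithm), but this is the simplest one, and it is the one we will use in the exercise.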
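The worked example can be checked with a short sketch that uses the same simple variant, idf = N / df with no logarithm. The function names and the toy corpora are assumptions made for illustration.

```python
def tf(term, doc_tokens):
    """Raw count of the term in one document."""
    return doc_tokens.count(term)


def idf(term, corpus):
    """Simple (non-log) idf: number of documents / documents containing the term."""
    df = sum(1 for doc in corpus if term in doc)
    return len(corpus) / df if df else 0.0


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


doc_a = "an example of an example within an example".split()

# Case 1: only document A (1 of 4) contains the word.
corpus_rare = [doc_a, ["web", "search"], ["crawler"], ["index"]]
# Case 2: every document contains the word.
corpus_common = [doc_a, ["example"], ["example", "two"], ["one", "example"]]

print(tf_idf("example", doc_a, corpus_rare))    # 3 * 4 = 12.0
print(tf_idf("example", doc_a, corpus_common))  # 3 * 1 = 3.0
```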

Stopwords / Stoplist
Function words do not bear useful information for IR: of, in, about, with, I, although, …
Stoplist: contains stopwords, not to be used as index terms
– Prepositions
– Articles
– Pronouns
– Some adverbs and adjectives
– Some frequent words (e.g. document)
The removal of stopwords usually improves IR effectiveness.
A few "standard" stoplists are commonly used.
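Stopword removal amounts to a set lookup per token. The stoplist below starts from the words on the slide and adds two articles ("the", "a") as an assumption; it is not one of the "standard" lists.

```python
# Words from the slide plus two articles added for illustration.
STOPLIST = {"of", "in", "about", "with", "i", "although", "the", "a"}


def remove_stopwords(tokens, stoplist=STOPLIST):
    """Keep only tokens whose lowercase form is not in the stoplist."""
    return [t for t in tokens if t.lower() not in stoplist]


query = "although I read about the history of search engines in libraries".split()
kept = remove_stopwords(query)
print(kept)
```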

Top 50 Words from AP89

Stemming
Reason:
– Different word forms may bear similar meaning (e.g. search, searching): create a "standard" representation for them
Stemming:
– Removing some endings of a word
– computer, compute, computes, computing, computed, computation → comput

Stemming
Generally a small but significant effectiveness improvement
– can be crucial for some languages
– e.g., 5-10% improvement for English, up to 50% in Arabic
[Figure: words with the Arabic root ktb]

Stemming
Two basic types
– Dictionary-based: uses lists of related words
– Algorithmic: uses program to determine related words
Algorithmic stemmers
– suffix-s: remove 's' endings assuming plural
– e.g., cats → cat, lakes → lake, wiis → wii
– Many false negatives: supplies → supplie
– Some false positives: ups → up
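The suffix-s stemmer is small enough to write out in full, which also makes its failure modes easy to reproduce. The function name is made up; the behavior follows the slide's examples.

```python
def suffix_s_stem(word):
    """Strip one trailing 's', assuming it marks a plural."""
    if len(word) > 1 and word.endswith("s"):
        return word[:-1]
    return word


for w in ["cats", "lakes", "wiis", "supplies", "ups"]:
    print(w, "->", suffix_s_stem(w))
```

"supplies" becomes "supplie", which will not match "supply" (a false negative), and "ups" collapses onto "up" (a false positive).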

Porter algorithm (Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3):130-137)
Step 1: plurals and past participles
– SSES -> SS: caresses -> caress
– (*v*) ING -> (removed): motoring -> motor
Step 2: adj->n, n->v, n->adj, …
– (m>0) OUSNESS -> OUS: callousness -> callous
– (m>0) ATIONAL -> ATE: relational -> relate
Step 3:
– (m>0) ICATE -> IC: triplicate -> triplic
Step 4:
– (m>1) AL -> (removed): revival -> reviv
– (m>1) ANCE -> (removed): allowance -> allow
Step 5:
– (m>1) E -> (removed): probate -> probat
– (m > 1 and *d and *L) -> single letter: controll -> control
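The conditions in these rules depend on the measure m, the number of vowel-consonant sequences in the stem. The sketch below computes m and applies only the rules listed on this slide, stopping at the first rule that fires. That is a deliberate simplification and an assumption of this example: the full Porter stemmer has many more rules and runs its steps in sequence (so "relational" would continue past "relate").

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' acts as a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """m = number of vowel-to-consonant transitions in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def has_vowel(stem):
    """The *v* condition: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


# (suffix, replacement, condition on the stem left after removing the suffix)
RULES = [
    ("sses", "ss", lambda stem: True),
    ("ing", "", has_vowel),
    ("ousness", "ous", lambda stem: measure(stem) > 0),
    ("ational", "ate", lambda stem: measure(stem) > 0),
    ("icate", "ic", lambda stem: measure(stem) > 0),
    ("al", "", lambda stem: measure(stem) > 1),
    ("ance", "", lambda stem: measure(stem) > 1),
    ("e", "", lambda stem: measure(stem) > 1),
]


def apply_one_rule(word):
    """Apply the first listed rule whose suffix and condition both match."""
    for suffix, replacement, condition in RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if condition(stem):
                return stem + replacement
    # (m > 1 and *d and *L): collapse a final double 'l'.
    if word.endswith("ll") and measure(word) > 1:
        return word[:-1]
    return word


for w in ["caresses", "motoring", "callousness", "relational", "triplicate",
          "revival", "allowance", "probate", "controll"]:
    print(w, "->", apply_one_rule(w))
```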

N-Grams
Frequent n-grams are more likely to be meaningful phrases.
N-grams form a Zipf distribution
– Better fit than words alone
Could index all n-grams up to a specified length
– Much faster than POS tagging
– Uses a lot of storage
– e.g., a document containing 1,000 words would contain 3,990 instances of word n-grams of length 2 ≤ n ≤ 5
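The 3,990 figure follows from counting sliding windows: a document of W words has W - n + 1 n-grams of length n, so 999 + 998 + 997 + 996 = 3,990 for n from 2 to 5. A short sketch (function names are illustrative) confirms it and shows how the n-grams themselves are extracted.

```python
def word_ngrams(tokens, n):
    """All contiguous word n-grams of length n, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(num_words, min_n=2, max_n=5):
    """Number of n-gram instances for min_n <= n <= max_n."""
    return sum(max(num_words - n + 1, 0) for n in range(min_n, max_n + 1))


bigrams = word_ngrams("tropical fish aquarium supplies".split(), 2)
print(bigrams)
print(count_ngrams(1000))
```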

More methods...
– Ranking by n-gram (a sequence of n words)
– Ranking by word type (verb, noun, etc.): Part of Speech (POS)

Ranking the results
Early IR focused on set-based retrieval
– Boolean queries, set of conditions to be satisfied
– document either matches the query or not, like classifying the collection into relevant / non-relevant sets
– still used by professional searchers
– "advanced search" in many systems
Modern IR: ranked retrieval
– free-form query expresses user's information need
– rank documents by decreasing likelihood of relevance
– many studies show it is superior
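The contrast between set-based and ranked retrieval can be sketched on a three-document toy corpus. Everything here is hypothetical: the documents, the function names, and the scoring, which simply sums tf * (N / df) over the query terms as a stand-in for "likelihood of relevance".

```python
# Hypothetical toy corpus: doc id -> tokens.
DOCS = {
    "d1": "web search engine ranking".split(),
    "d2": "database query engine".split(),
    "d3": "web crawler for a search engine search index".split(),
}


def boolean_and(query_terms, docs):
    """Set-based retrieval: a document matches only if it has every term."""
    return {doc_id for doc_id, toks in docs.items()
            if all(t in toks for t in query_terms)}


def ranked(query_terms, docs):
    """Ranked retrieval: order documents by summed tf * (N / df), best first."""
    n = len(docs)
    scores = {}
    for doc_id, toks in docs.items():
        score = 0.0
        for t in query_terms:
            df = sum(1 for d in docs.values() if t in d)
            if df:
                score += toks.count(t) * (n / df)
        scores[doc_id] = score
    ordered = sorted(scores, key=lambda d: scores[d], reverse=True)
    return [d for d in ordered if scores[d] > 0]  # drop non-matching docs


matches = boolean_and(["web", "search"], DOCS)
ranking = ranked(["web", "search"], DOCS)
print(sorted(matches))
print(ranking)
```

The Boolean query returns d1 and d3 as an unordered set, while the ranked version places d3 first because "search" occurs twice in it.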
