Introduction to Information Retrieval (Information Retrieval) – Artificial Intelligence – Avi Rosenfeld



Hopkins IR Workshop 2005, Copyright © Victor Lavrenko
What is information retrieval?
Searching for information on the Internet
– "Rabbi Dr. Google" (or Bing?)
– a flourishing and important field of research...
Searching for relevant information across many sources (web sites, images, etc.)

What is the difference between database queries and information retrieval?

            Databases                            Information retrieval
Data        Structured (keys)                    Unstructured (the web)
Fields      Clear semantics (SSN, age)           No fields (other than text)
Queries     Defined (relational algebra, SQL)    Free text ("natural language"), Boolean
Matching    Exact (results are always "correct") Imprecise (need to measure effectiveness)

The information retrieval system (diagram): a Document corpus and a Query String are fed into the IR System, which returns Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...)

Stages of a search engine
Building the document corpus (Document Corpus)
– Web crawler
– cleaning the data: removing duplicates, stemming
– word frequency
– word importance – TF/IDF
Building the indexes (Index)
Building the answer
– processing the query (removing stop words)
– ranking the results (PageRank)
Analyzing the results
– false positives / false negatives
– Recall / Precision

Building the INDEX
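The core data structure behind the index is an inverted list that maps each term to the documents containing it. A minimal sketch (whitespace tokenization and integer document ids are simplifying assumptions; real indexes also store positions and weights):

```python
# Minimal inverted index: term -> sorted list of document ids.
from collections import defaultdict

def build_index(docs):
    """Build an inverted index over a list of document strings."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["the cat sat", "the dog barked", "cat and dog"]
index = build_index(docs)
print(index["cat"])   # doc ids containing "cat": [0, 2]
```

A query is then answered by intersecting or merging these per-term lists instead of scanning every document.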

Examples – note the response time and the number of results!

Web Crawler
– Identifies and acquires documents for the search engine
– A web crawler is a kind of bot: software that scans the WWW automatically and systematically.
A selection policy that defines which pages to download.
A re-visit policy that defines when to check pages for changes.
A politeness policy that defines how to avoid overloading sites and crashing their servers.
A parallelization policy that defines how to coordinate the different crawlers.
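The selection and politeness policies above can be sketched as a URL frontier that skips duplicates and spaces out requests per host. `Frontier` and its fields are illustrative names for this sketch, not a real crawler API, and the actual HTTP fetching is omitted:

```python
# Crawler frontier sketch: selection policy (skip seen URLs) plus a
# politeness policy (minimum delay between requests to the same host).
import time
from collections import deque
from urllib.parse import urlparse

class Frontier:
    def __init__(self, delay=1.0):
        self.queue = deque()
        self.seen = set()        # selection: never enqueue a URL twice
        self.last_visit = {}     # politeness: last fetch time per host
        self.delay = delay       # minimum seconds between same-host fetches

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        """Return a URL whose host is ready to be visited, else None."""
        for _ in range(len(self.queue)):
            url = self.queue.popleft()
            host = urlparse(url).netloc
            if time.time() - self.last_visit.get(host, 0) >= self.delay:
                self.last_visit[host] = time.time()
                return url
            self.queue.append(url)   # host visited too recently: requeue
        return None
```

A real crawler would add robots.txt handling and the re-visit and parallelization policies on top of this loop.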

Cleaning the data – compression
Text is highly redundant (or predictable). Compression techniques exploit this redundancy to make files smaller without losing any of the content. (Compression of indexes is covered later.) Popular algorithms can compress HTML and XML text by 80%
– e.g., DEFLATE (zip, gzip) and LZW (UNIX compress, PDF)
– may compress large files in blocks to make access faster
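A quick way to see a ratio in that ballpark is to run DEFLATE (Python's `zlib` module) over redundant HTML-like text; the exact percentage depends on the input:

```python
# DEFLATE compression of redundant HTML-like text via zlib.
import zlib

html = "<p>hello world</p>\n" * 200
packed = zlib.compress(html.encode())
ratio = 1 - len(packed) / len(html.encode())
print(f"saved {ratio:.0%} of the original size")
assert zlib.decompress(packed).decode() == html   # lossless round trip
```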

Cleaning the data – noise
Many web pages contain text, links, and pictures that are not directly related to the main content of the page. This additional material is mostly noise that can negatively affect the ranking of the page. Techniques have been developed to detect the content blocks in a web page
– Non-content material is either ignored or reduced in importance in the indexing process

Noise Example

Example Web Page

Cleaning the data – finding the important content
The tokenizer recognizes "words" in the text
– It must handle issues like capitalization, hyphens, apostrophes, non-alpha characters, and separators
Markup languages such as HTML and XML are often used to specify structure and metatags
– Tags are used to specify document elements – e.g., <h2>Overview</h2>
– The document parser uses the syntax of the markup language (or other formatting) to identify structure
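A toy tokenizer illustrating the capitalization, hyphen, and apostrophe decisions. The specific choices here (fold case, split on hyphens, drop possessive 's) are one possible policy, not the only one:

```python
# Toy tokenizer: case folding, hyphen splitting, possessive stripping.
import re

def tokenize(text):
    text = text.lower()                  # capitalization: fold case
    text = re.sub(r"'s\b", "", text)     # apostrophes: system's -> system
    text = text.replace("-", " ")        # hyphens: on-line -> on line
    return re.findall(r"[a-z0-9]+", text)

print(tokenize("The state-of-the-art system's speed"))
# -> ['the', 'state', 'of', 'the', 'art', 'system', 'speed']
```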

Ranking the data
Statistical information about the documents
– how many times each word appears, and where in the document
A weight for each word – the tf.idf weight
– combines the frequency of the word in the document and in the corpus as a whole

Zipf's law
Word frequencies in a language are not normally distributed
– a few words appear very often (even most of the time)
– stopwords = words that appear so often that they are not useful in search
– e.g., the two most common words ("the", "of") make up about 10% of all word occurrences in text documents
Zipf's "law": the observation that the rank (r) of a word times its frequency (f) is approximately a constant (k), assuming words are ranked in order of decreasing frequency
– i.e., r·f ≈ k, or r·Pr ≈ c, where Pr is the probability of a word occurrence and c ≈ 0.1 for English
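The r·f ≈ k relation can be checked directly by counting words, ranking them by decreasing frequency, and multiplying rank by frequency. A sketch on a toy sample constructed to follow the law (on a real corpus the products are only approximately constant):

```python
# Check Zipf's law: rank * frequency should be roughly constant.
from collections import Counter

def rank_times_freq(text):
    """Return r*f for each word, with words ranked by decreasing frequency."""
    counts = Counter(text.lower().split())
    ranked = sorted(counts.values(), reverse=True)
    return [rank * freq for rank, freq in enumerate(ranked, start=1)]

# Toy sample whose frequencies (100, 50, 33, 25, 20) follow f ~ k/r:
sample = " ".join(["a"] * 100 + ["b"] * 50 + ["c"] * 33 + ["d"] * 25 + ["e"] * 20)
print(rank_times_freq(sample))   # roughly constant: [100, 100, 99, 100, 100]
```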

Zipf’s Law

tf*idf weighting scheme
tf = term frequency
– frequency of a term/keyword in a document; the higher the tf, the higher the importance (weight) of the term for that document
df = document frequency
– the number of documents containing the term; the distribution of the term
idf = inverse document frequency
– the unevenness of the term's distribution in the corpus; the specificity of the term to a document; the more evenly a term is distributed, the less specific it is to any one document
weight(t,D) = tf(t,D) * idf(t)

Example
Suppose the word "example" appears 3 times in document A, one of 4 documents:
– tf = 3 (in this document)
– idf = 1/0.25 = 4
– tf*idf = 12
– What happens if the word appears in every document? What is its tf*idf then? (3)
There are several variants of how to compute tf and idf (e.g., with a LOG), but this is the simplest, and it is the one we will use in the exercise.
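The worked example as code, using the slide's raw variant idf = N/df with no logarithm (library implementations usually log-scale the idf term):

```python
# Raw tf*idf as defined on the slide: weight = tf * (N / df).
def tf_idf(tf, df, n_docs):
    """tf: term count in the doc; df: docs containing the term; n_docs: corpus size."""
    return tf * (n_docs / df)

print(tf_idf(3, 1, 4))   # "example" in 1 of 4 docs: 3 * 4 = 12.0
print(tf_idf(3, 4, 4))   # same word in every doc:   3 * 1 = 3.0
```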

Stopwords / Stoplist
Function words do not carry useful information for IR: of, in, about, with, I, although, …
Stoplist: contains stopwords that should not be used as index terms
– prepositions
– articles
– pronouns
– some adverbs and adjectives
– some frequent words (e.g., "document")
The removal of stopwords usually improves IR effectiveness. A few "standard" stoplists are commonly used.
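Stoplist filtering is a simple set-membership test. The list below is a tiny illustrative sample, not one of the standard stoplists:

```python
# Stoplist filtering sketch; real stoplists hold a few hundred words.
STOPLIST = {"of", "in", "about", "with", "i", "although", "the", "a"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPLIST]

print(remove_stopwords(["the", "history", "of", "retrieval"]))
# -> ['history', 'retrieval']
```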

Top 50 Words from AP89

Stemming
Reason:
– different word forms may carry similar meaning (e.g., search, searching): create a "standard" representation for them
Stemming:
– removing some word endings, e.g. computer, compute, computes, computing, computed, computation → comput

Stemming Generally a small but significant effectiveness improvement – can be crucial for some languages – e.g., 5-10% improvement for English, up to 50% in Arabic Words with the Arabic root ktb

Stemming Two basic types – Dictionary-based: uses lists of related words – Algorithmic: uses program to determine related words Algorithmic stemmers – suffix-s: remove ‘s’ endings assuming plural e.g., cats → cat, lakes → lake, wiis → wii Many false negatives: supplies → supplie Some false positives: ups → up
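The suffix-s stemmer is essentially a one-line rule. The sketch below strips a final 's' (the 'ss' exception is an added assumption so that words like "caress" survive; the slide only specifies stripping a final s) and reproduces both failure cases mentioned:

```python
# suffix-s algorithmic stemmer: strip a trailing 's', assuming a plural.
def suffix_s(word):
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

print(suffix_s("cats"))      # cat
print(suffix_s("supplies"))  # supplie  (false negative: only one 's' removed)
print(suffix_s("ups"))       # up       (false positive: not a plural)
```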

Porter algorithm (Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3))
Step 1: plurals and past participles
– SSES -> SS (caresses -> caress)
– (*v*) ING -> (motoring -> motor)
Step 2: adj->n, n->v, n->adj, …
– (m>0) OUSNESS -> OUS (callousness -> callous)
– (m>0) ATIONAL -> ATE (relational -> relate)
Step 3:
– (m>0) ICATE -> IC (triplicate -> triplic)
Step 4:
– (m>1) AL -> (revival -> reviv)
– (m>1) ANCE -> (allowance -> allow)
Step 5:
– (m>1) E -> (probate -> probat)
– (m>1 and *d and *L) -> single letter (controll -> control)
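Porter's algorithm applies rules like these in order, each guarded by a condition on the stem (such as the measure m, or the vowel test *v*). A sketch of just the two step-1 rules above shows the pattern; it is not the complete algorithm:

```python
# Two Porter step-1 rules only: SSES -> SS, and (*v*) ING -> "".
VOWELS = set("aeiou")

def has_vowel(stem):
    """Crude version of the *v* condition: the stem contains a vowel."""
    return any(c in VOWELS for c in stem)

def porter_step1(word):
    if word.endswith("sses"):
        return word[:-2]                      # caresses -> caress
    if word.endswith("ing") and has_vowel(word[:-3]):
        return word[:-3]                      # motoring -> motor
    return word

print(porter_step1("caresses"))   # caress
print(porter_step1("motoring"))   # motor
print(porter_step1("sing"))       # sing  (stem "s" has no vowel, so unchanged)
```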

N-Grams Frequent n-grams are more likely to be meaningful phrases N-grams form a Zipf distribution – Better fit than words alone Could index all n-grams up to specified length – Much faster than POS tagging – Uses a lot of storage e.g., document containing 1,000 words would contain 3,990 instances of word n-grams of length 2 ≤ n ≤ 5

More methods...
Ranking by n-grams (sequences of n words)
Ranking by the type of word (verb, noun, etc.) – Part of Speech (POS)
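Word n-gram extraction is a sliding window over the token list. The sketch below also reproduces the storage estimate from the N-Grams slide: a 1,000-word document yields 3,990 word n-grams of length 2 to 5:

```python
# Extract all word n-grams of length n_min..n_max from a token list.
def word_ngrams(tokens, n_min=2, n_max=5):
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams

print(word_ngrams(["to", "be", "or", "not"], 2, 3))
print(len(word_ngrams([f"w{i}" for i in range(1000)])))   # 3990
```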

Google N-Grams Web search engines index n-grams Google sample: Most frequent trigram in English is “all rights reserved” – In Chinese, “limited liability corporation”

Ranking results
Early IR focused on set-based retrieval
– Boolean queries: a set of conditions to be satisfied
– a document either matches the query or it does not, like classifying the collection into relevant / non-relevant sets
– still used by professional searchers – "advanced search" in many systems
Modern IR: ranked retrieval
– a free-form query expresses the user's information need
– documents are ranked by decreasing likelihood of relevance
– many studies show it is superior
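The set-based vs ranked contrast, in code, on a toy corpus. The term-overlap score here is a deliberately simple placeholder for a real relevance model such as tf*idf:

```python
# Boolean (set-based) retrieval vs ranked retrieval on the same corpus.
def boolean_and(query, docs):
    """Return ids of docs containing ALL query terms (match or no match)."""
    terms = set(query.split())
    return [i for i, d in enumerate(docs) if terms <= set(d.split())]

def ranked(query, docs):
    """Return ids of docs with ANY query term, best overlap first."""
    terms = set(query.split())
    scores = [(len(terms & set(d.split())), i) for i, d in enumerate(docs)]
    return [i for s, i in sorted(scores, key=lambda x: (-x[0], x[1])) if s > 0]

docs = ["cheap flights to rome", "cheap hotels", "flights and trains"]
print(boolean_and("cheap flights", docs))   # only exact matches: [0]
print(ranked("cheap flights", docs))        # partial matches too: [0, 1, 2]
```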