Gertjan van Noord, based on the sheets by Leonoor van der Beek, 2013. Information Retrieval Lecture 1: introduction.



Agenda for today
- Who’s who
- Intro to the course
- Chapter 1 of Introduction to Information Retrieval
- Homework/lab assignment

Intro to the course
- What is IR?
- What will we study and how?
- Objectives of the course
- Exercises and lab sessions
- Final exam

What is IR?
Individuals, administrations and organizations have lots of digital information:
- how to organize and store it?
- how to retrieve documents?
- how to retrieve info inside them?
An IR system is a tool to facilitate retrieval of such information.

Book’s definition: Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

... finding material (usually documents)... What else can you think of?
- parts of documents
- facts, like the date of birth of Rembrandt
- a book in the library
- a work of art in a museum

... from within large collections (usually stored on computers)… WWW? What else?
- Specific collections, like legal information or scientific medical papers (Medline)
- Information on your own computer
- Information within a company
- Subparts of the WWW, like one domain

… of an unstructured nature (usually text) Can you explain this?
- Unstructured: differences between text and databases
- Is a text document really unstructured?
- How about XML?
- Beyond text: image, sound, video, …

Database search vs. IR
Database search:
- structured semantic info: fields, datatypes, validation, relations
- search of fields
- exact search for data
- order of found records: alphanumerical
IR:
- no semantic structure
- no fixed format, but text structure, metadata, XML
- full text search
- non-exact search for data or information
- order of found documents: often by similarity with query

Book’s definition: Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

that satisfies an information need...
What information needs can we discern? Try to formulate some different types of goals of a search:
- facts and question answering
- definitions
- information on a subject
- retrieving a known document
And in web search?

User needs in web search
- Navigational. The immediate intent is to reach a particular site.
- Informational. The intent is to acquire some information assumed to be present on one or more web pages.
- Transactional. The intent is to perform some web-mediated activity.
Broder, A. (2002). A taxonomy of Web search. SIGIR Forum 36, no. 2, 3-10.

Translation of info need
Each information need has to be translated into the "language" of the IR system.
[Diagram labels: reality, document, info need, query]

Translation of info need Query: Hilton, Paris

Translation of info need Query: champagne

Translation of info need Query: Rene Froger “Een eigen huis”

Translation of info need Information need: Query: ??

Translation of info need Information need: Query: ??

Are the results satisfying?
Search engines often produce a lot of results. When are you satisfied with the results? How can we evaluate a system?
- the most relevant results are easy to find (on top of the list, and/or sorted by subject, …)
- only few results are not relevant
- new information
- info corroborated (more sources)
- relevant documents that I know are presented

Precision and recall
Key statistics for evaluation with a test set (fixed questions, a set of documents, and relevance judgments of the documents for each query):
- Precision: what fraction of the results are relevant to the information need?
- Recall: what fraction of the relevant documents in the collection were returned by the system?
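
As a minimal sketch of the two definitions above, the fractions can be computed from two sets of docIDs. The function name `precision_recall` and the example judgments are made up for illustration; they are not from the course material.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: docIDs returned by the system
    relevant:  docIDs judged relevant in the test set
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: the system returns 4 documents, 3 of them relevant,
# while the collection holds 6 relevant documents in total.
p, r = precision_recall(retrieved=[1, 2, 3, 7], relevant=[1, 2, 3, 4, 5, 6])
print(p, r)  # 0.75 0.5
```

Returning 0.0 when a set is empty is a convention chosen here to avoid division by zero; other conventions are possible.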

Precision and recall
But how relevant are precision and recall if you search for, e.g., the date of birth of Vincent van Gogh?

Overview of an IR system (book: Baeza-Yates:Modern IR)

Web site Overview and exercises: * * Nestor

Course book
Introduction to Information Retrieval, C. D. Manning, P. Raghavan and H. Schütze. Online version
NB: the book is also used for the Information Retrieval course. The book is written for CS students; we will skip sections and exercises that are a bit too technical.

Schedule for this course
wk 1  ch 1   Boolean retrieval, posting lists
wk 2  ch 2   decoding, tokenization and normalization, sublinear posting list intersection
wk 3  ch 3   dictionaries, wild cards, spell correction
wk 4  ch 6   scoring and term weighting, term and document frequency weighting, vector space models
wk 5  ch 8   evaluation
wk 6  ch 21  link analysis, PageRank
wk 7  ch 9   relevance feedback and query analysis

HOW will we study the book?
- Homework: read the chapter thoroughly
- Lectures: overview of chapter
- Labtime/homework: do exercises
- Next lecture: remaining questions
Full slide presentations of the chapters by one of the authors are available as well (author's slides).

Labtime
1. Exercises (from the book + more)
2. Try out simple techniques in Python
3. More...
4. More...

Course objectives
- knowledge of IR terminology
- insight in IR models and IR processes
- knowledge of methods of indexing, querying, retrieving and ranking
- knowledge of methods of evaluation of IR systems
- practical experience with use, adaptation and testing of some of the basic IR algorithms and techniques

Chapter 1: Boolean retrieval
1. General introduction on IR
2. Boolean systems
3. Representation of information
4. Retrieving documents
5. Efficiency aspects

Boolean retrieval
The first IR systems were Boolean systems. Queries are formulated with the Boolean operators AND, OR and NOT:
- Brutus AND Caesar
- (Brutus OR Caesar) AND NOT Cleopatra
- Brutus OR (Caesar AND NOT Cleopatra)
- NOT Brutus
How about Google queries?

Information from documents
- Each document in the system needs a unique docID
- Tokenization is the process of splitting a text into separate tokens (not trivial!)
- For a simple Boolean system we just need to know which terms are present in which doc
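
A rough sketch of why tokenization is "not trivial": the naive rule below (lowercase, then take runs of word characters) already mangles apostrophes. The function `tokenize` and the two toy documents are hypothetical, chosen only for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive rule)."""
    return re.findall(r"\w+", text.lower())


# docID -> text; both documents are made up for this example.
docs = {
    1: "Brutus killed Caesar.",
    2: "Caesar's wife wasn't there.",
}

# For a simple Boolean system we only record which terms occur in which doc.
terms_per_doc = {doc_id: set(tokenize(text)) for doc_id, text in docs.items()}

print(tokenize(docs[2]))  # ['caesar', 's', 'wife', 'wasn', 't', 'there']
```

Whether "Caesar's" should become one token, two tokens, or just "caesar" is exactly the kind of decision a real tokenizer has to make.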

Term document incidence matrix
            Doc 1  Doc 2  Doc 3  Doc 4
Antony        1      1      0      0
Brutus        1      1      1      0
Caesar        1      1      0      1
Cleopatra     1      0      0      0
Antony AND Brutus AND NOT Cleopatra?
In huge collections > 99% of entries are 0: not a good representation, no efficient processing.
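
One way to answer the query on this matrix, sketched under the assumption that each row is stored as a bit vector: AND the rows and complement the row of the negated term. The dictionary `incidence` simply copies the four rows shown; the helper `row` is a name introduced here.

```python
# Rows of the incidence matrix above, one bit string per term (Doc 1..Doc 4).
incidence = {
    "Antony":    "1100",
    "Brutus":    "1110",
    "Caesar":    "1101",
    "Cleopatra": "1000",
}
n_docs = 4
mask = (1 << n_docs) - 1  # 0b1111, keeps NOT within the four documents


def row(term):
    """Return the term's row as an integer bit vector."""
    return int(incidence[term], 2)


# Antony AND Brutus AND NOT Cleopatra
result = row("Antony") & row("Brutus") & (~row("Cleopatra") & mask)

bits = format(result, f"0{n_docs}b")
matching = [i + 1 for i, bit in enumerate(bits) if bit == "1"]
print(bits, matching)  # 0100 [2]
```

The sketch works for four documents, but a matrix with one bit per term per document grows far too large for a real collection, which is the motivation for the inverted file.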

Building an inverted file
1. Give docIDs and tokenize the texts
2. Gather terms with their docID
3. Sort on terms and docID
4. Now list the unique terms with their document frequency and link to the postings list with docIDs
term, docfreq → postings list
[Caesar, 3] → [1,2,4]
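
The four steps above can be sketched in a few lines of Python. This is an illustration only: `build_index` is a hypothetical helper, whitespace splitting stands in for real tokenization, and the three short documents are invented so that "caesar" ends up with the postings list shown on the slide.

```python
from collections import defaultdict


def build_index(docs):
    """Build {term: (document frequency, postings list)} from {docID: text}."""
    # Steps 1-2: tokenize each text and gather (term, docID) pairs.
    # Whitespace splitting and lowercasing stand in for real tokenization.
    pairs = [(token, doc_id)
             for doc_id, text in docs.items()
             for token in text.lower().split()]

    # Step 3: sort on term, then docID (set() drops repeated pairs).
    pairs = sorted(set(pairs))

    # Step 4: group docIDs per term; the list length is the document frequency.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        postings[term].append(doc_id)
    return {term: (len(ids), ids) for term, ids in postings.items()}


# Hypothetical three-document collection.
docs = {
    1: "caesar was killed by brutus",
    2: "brutus and caesar",
    4: "caesar was ambitious",
}
index = build_index(docs)
print(index["caesar"])  # (3, [1, 2, 4])
print(index["brutus"])  # (2, [1, 2])
```

Because the pairs are sorted before grouping, every postings list comes out sorted on docID, which is what the merge algorithms rely on.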

Inverted file / index
Antony AND Brutus AND NOT Cleopatra?
- efficient processing if sorted on docID
- simple merging algorithms for AND / OR
(term)     (df)  (postings list)
Antony      2    → 1, 2, 6
Brutus      3    → 1, 2, 3
Caesar      3    → 1, 2, 4, 5, 6
Cleopatra   1    → 1
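
A minimal sketch of such merging, assuming postings lists sorted on docID. The function names `intersect` and `and_not` are chosen here for illustration; the postings are the ones listed above.

```python
def intersect(p1, p2):
    """AND: docIDs present in both sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def and_not(p1, p2):
    """AND NOT: docIDs in sorted list p1 that are absent from sorted list p2."""
    answer, i, j = [], 0, 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        elif p1[i] == p2[j]:
            i += 1
            j += 1
        else:
            j += 1
    return answer


postings = {
    "Antony": [1, 2, 6],
    "Brutus": [1, 2, 3],
    "Cleopatra": [1],
}
both = intersect(postings["Antony"], postings["Brutus"])
print(both)  # [1, 2]
print(and_not(both, postings["Cleopatra"]))  # [2]
```

Each loop only ever moves forward through the lists, so the work is proportional to the combined length of the two lists.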

Distributive laws
a AND (b OR c) = (a AND b) OR (a AND c)
(a OR b) AND (c OR d) = ??
NOT(a OR b) = NOT(a) AND NOT(b)
NOT(a AND b) = ??
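
Candidate answers to the open questions can be checked by brute force over all truth assignments. The helper `equivalent` below is a hypothetical name, and the sketch only verifies the two identities already stated above plus one deliberately wrong guess.

```python
from itertools import product


def equivalent(f, g, n_vars):
    """True if f and g agree on every assignment of n_vars Boolean variables."""
    return all(f(*values) == g(*values)
               for values in product([False, True], repeat=n_vars))


# a AND (b OR c) = (a AND b) OR (a AND c)
print(equivalent(lambda a, b, c: a and (b or c),
                 lambda a, b, c: (a and b) or (a and c), 3))  # True

# NOT(a OR b) = NOT(a) AND NOT(b)
print(equivalent(lambda a, b: not (a or b),
                 lambda a, b: (not a) and (not b), 2))  # True

# A wrong guess is rejected: NOT(a OR b) is not NOT(a) OR NOT(b)
print(equivalent(lambda a, b: not (a or b),
                 lambda a, b: (not a) or (not b), 2))  # False
```

To test your own answer for one of the "??" lines, pass it as the second function with the matching number of variables.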

Conjunctive and disjunctive queries
The outer level of processing can be either conjunctive (AND) or disjunctive (OR):
- Conjunctive normal form: a conjunction of disjunctions, e.g. (a OR NOT b) AND (c OR d) AND e
- Disjunctive normal form: a disjunction of conjunctions, e.g. (a AND NOT b) OR (c AND d) OR e

The order of the size
Example: $f(x) = 2x^3 + 5x^2 + x + 9$
This is a function of $O(x^3)$: if $x$ grows to infinity, the term $x^3$ is what really determines the size of the outcome; the rest can be neglected.
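
One way to make "the rest can be neglected" precise, as a short worked sketch: for $x \ge 1$ every lower-order term is at most its coefficient times $x^3$, so $f$ is bounded by a constant multiple of $x^3$, and the ratio $f(x)/x^3$ settles at the leading coefficient.

```latex
\begin{align*}
f(x) &= 2x^3 + 5x^2 + x + 9 \\
     &\le 2x^3 + 5x^3 + x^3 + 9x^3 = 17x^3 \quad \text{for } x \ge 1, \\
\lim_{x \to \infty} \frac{f(x)}{x^3}
     &= \lim_{x \to \infty} \left( 2 + \frac{5}{x} + \frac{1}{x^2} + \frac{9}{x^3} \right) = 2 .
\end{align*}
```

The constant 17 is just one convenient choice; any constant that makes the inequality hold from some point onward is enough for $f(x) = O(x^3)$.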

The order of time complexity
Example: to find the common elements in two lists, the number of steps depends on the size of both lists:
- two ordered lists can be walked in parallel: O(x + y) (linear)
- if all combinations need to be checked: O(x * y) (quadratic)
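
The difference can be made visible by counting comparisons. This is a sketch with invented helper names (`common_nested`, `common_merge`) and made-up input lists, not a benchmark.

```python
def common_nested(a, b):
    """Compare every pair: works on unsorted lists, len(a) * len(b) steps."""
    steps, common = 0, []
    for x in a:
        for y in b:
            steps += 1
            if x == y:
                common.append(x)
    return common, steps


def common_merge(a, b):
    """Walk two sorted lists in parallel: at most len(a) + len(b) steps."""
    steps, common, i, j = 0, [], 0, 0
    while i < len(a) and j < len(b):
        steps += 1
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common, steps


a = list(range(0, 2000, 2))  # 1000 even numbers
b = list(range(0, 3000, 3))  # 1000 multiples of 3

nested_common, nested_steps = common_nested(a, b)
merge_common, merge_steps = common_merge(a, b)
print(nested_steps)                      # 1000000
print(merge_steps <= len(a) + len(b))    # True
print(nested_common == merge_common)     # True
```

Both functions find the same common elements here; only the number of comparisons differs, and the gap widens as the lists grow.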

Big O notation
- Used to classify algorithms by how they respond (e.g. in their processing time or working space requirements) to changes in input size
- Best case, worst case, average case?
- Big O gives an upper bound on growth (typically quoted for the worst case)
- Other symbols are used for the lower bound (Ω), the tight bound (Θ), …

Guidance/questions on the text
- Write down and try to find explanations of terms you don’t know
- p4: do you know the KB, MB, GB, etc. sizes?
- p5, fig 1.3: look back to fig 1.1
- p7: what types of linguistic preprocessing do you see in the examples in step 3?
- p11/12: do you understand the algorithms?
- Are you able to explain now what an inverted file is and how it is constructed?

Homework …. is on the web site ….