Boolean Retrieval Term Vocabulary and Posting Lists Web Search Basics

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 1: Boolean Retrieval 1.
Vocabulary size and term distribution: tokenization, text normalization and stemming Lecture 2.
Information Retrieval Review
Basic IR: Queries Query is statement of user’s information need. Index is designed to map queries to likely to be relevant documents. Query type, content,
Searching The Web Search Engines are computer programs (variously called robots, crawlers, spiders, worms) that automatically visit Web sites and, starting.
Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University.
Search engines fdm 20c introduction to digital media lecture warren sack / film & digital media department / university of california, santa.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Text Analysis Everything Data CompSci Spring 2014.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 1 Boolean retrieval.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Information Retrieval Lecture 2: The term vocabulary and postings lists.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
1 CS 430: Information Discovery Lecture 3 Inverted Files.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Information retrieval 1 Boolean retrieval. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text)
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Information Retrieval Model Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Web Search Algorithms By Matt Richard and Kyle Krueger.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Information Retrieval Part 2 Sissi 11/17/2008. Information Retrieval cont..  Web-Based Document Search  Page Rank  Anchor Text  Document Matching.
1. L01: Corpuses, Terms and Search Basic terminology The need for unstructured text search Boolean Retrieval Model Algorithms for compressing data Algorithms.
1 Information Retrieval LECTURE 1 : Introduction.
Introduction to Information Retrieval Boolean Retrieval.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Document Parsing Paolo Ferragina Dipartimento di Informatica Università di Pisa.
CS315 Introduction to Information Retrieval Boolean Search 1.
3: Search & retrieval: Structures. The dog stopped attacking the cat, that lived in U.S.A. collection corpus database web d1…..d n docs processed term-doc.
Information Retrieval in Practice
Why indexing? For efficient searching of a document
Take-away Administrativa
Information Retrieval : Intro
Information Storage and Retrieval Fall Lecture 1: Introduction and History.
Large Scale Search: Inverted Index, etc.
Search Engine Architecture
CS122B: Projects in Databases and Web Applications Winter 2017
Lecture 1: Introduction and the Boolean Model Information Retrieval
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Slides from Book: Christopher D
Text Based Information Retrieval
Boolean Retrieval Term Vocabulary and Posting Lists Web Search Basics
IST 516 Fall 2011 Dongwon Lee, Ph.D.
Boolean Retrieval.
CSCE 561 Information Retrieval System Models
Document ingestion.
Information Retrieval
What is a Search Engine EIT, Author Gay Robertson, 2017.
Data Mining Chapter 6 Search Engines
Boolean Retrieval.
Information Retrieval and Web Search Lecture 1: Boolean retrieval
Lectures 4: Skip Pointers, Phrase Queries, Positional Indexing
PageRank GROUP 4.
Boolean Retrieval.
Introduction to Information Retrieval
Chapter 5: Information Retrieval and Web Search
Search Engine Architecture
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
CS276 Information Retrieval and Web Search
Junghoo “John” Cho UCLA
Basic Text Processing Word tokenization.
Information Retrieval and Web Design
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Introduction to Search Engines
Presentation transcript:

Boolean Retrieval Term Vocabulary and Posting Lists Web Search Basics

Information Retrieval Information Retrieval: is finding material of an unstructured nature that satisfies an information need from within large collections. Used to reduce information overload (aka Infoxication, infobesity, data smog, and info glut.) Used to retrieve information from a large collection of data. Think library system. Most visible use is Web Searches. Documents are typically transformed into a usable representation before I.R. Documents represented as words or phrases Documents and queries represented as vectors, matrices, or tuples etc.

Incident matrix An example of how documents can be represented: Can be used for a Boolean Retrieval Simply a boolean stating on whether a word has been used in a document. The statement “Brutus AND Ceasar AND NOT Calpurnia” becomes 110100 AND 110111 AND 101111 = 100100 Therefore “Anthony and Cleopatra” and “Hamlet” is returned

Inverted Index Too large to work with. An inverted Index is created to help search efficiency. Documents where each term occurs and places in posting list.

Boolean Retrieval Model Arguably the simplest model to base an I.R. system on. It is the first and very commonly adopted. Boolean Retrieval queries are boolean expressions and the search engine returns returns all documents that satisfy that boolean expression Boolean queries are queries that use AND, OR, or NOT. Consider the query: BRUTUS AND CALPURNIA The steps will be: Locate BRUTUS in the dictionary Retrieve its posting list from the postings file Locate CALPURNIA in the dictionary Retrieve its posting list from the Postings file Intersect the two posting lists.

Term Vocabulary Tokenization Stop words Normalization Stemming Lemmatization

Tokenization Tokens: Sequence of characters in a document Issues: Ex: Input: “Friends, Romans, Countrymen” Output: Friends Romans Countrymen Issues: Spacing Issues: San Francisco – one token or two? Language issues: in French : L'ensemble – one token or two? In Chinese and Japanese: No spaces in between words.

Stop Words Stop words - Dropping common terms which would appear to be have little value in helping selecting documents a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with Stop List: the members of which are then discarded during indexing

Normalization “normalize” words in indexed text as well as query words into the same form Ex: match U.S.A with USA (deleting periods) anti‐discriminatory to antidiscriminatory (deleting hyphens) Accents - résumé vs. resume. Date forms - 7月30日vs. 7/30 Case folding – Reduce all letters to lower case since every user will search using lower case letters instead of correct capitalization Ex: Searching for UNC Charlotte – unc charlotte

Stemming and Lemmatization Stemming – Reduce terms to their “roots” before indexing “Stemming” suggest crude affix chopping Language dependent e.g.: automate(s), automatic, automation all reduced to automat. Porter’s Algorithm is the most used algorithm for stemming Lemmatization - Reduce inflectional/variant forms to base form Ex: am, are, is → be car, cars, car's, cars’ → car

Posting Lists If the list lengths are m and n, the merge takes O(m+n)operations. Can we do it better? Yes Postings with Skip Pointers To skip postings that will not figure in the search results This makes intersecting postings lists more efficient Some postings contains several million entries, so efficiency can be an issue even if basic intersection is linear

Web Search Basics Anchor text is the clickable text in a hyperlink. Anchor text is relevant to the page the user is linking to, rather than generic text. The keywords in anchor text are one of the many signals search engines use to determine the topic of a web page. The words contained in the anchor text help determine the ranking that the page will receive by search engines such as Google or Yahoo and Bing.

Example of anchor text: The blue hyperlinked texts are the anchor text:

How web search would be without search engine? The World Wide Web contains more than ten million websites with thousands more being added daily from all over the world, and search engines are tasked with presenting the most relevant pages based on the search criteria entered. Without search engine, content is hard to find: relevant information retrieval becomes hard. Without search engine, there is no incentive to create content. Somebody needs to pay for the web. Search engines are a key enabler for interest aggregation: a small number of geographically dispersed people with similar interests can find each other.

IR in web vs IR in general The web is a chaotic und uncoordinated collection. → lots of duplicates – need to detect duplicates. No control / restrictions on who can author content → lots of spam – need to detect spam.

PageRank •PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) Why it is needed? •To rank the pages that retrieved and making sure the information retrieved is more relevant. •How to rank efficiently/ spam proof? •Google come with an algorithm called PageRank. •Every page is ranked based upon number of links available to the page and Page rank of the page in which the link is available. •for ex: Say we have a web page X that has n links including one link to Y, the contribution of this pageX to page rank of Y is PR(X)/n. •PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

PageRank (Cont..)