CSCI 5417 Information Retrieval Systems Jim Martin Lecture 9 9/20/2011.


Today 9/20
- Where we are
- MapReduce/Hadoop
- Probabilistic IR
- Language models
- LM for ad hoc retrieval

Where we are...
- Basics of ad hoc retrieval
- Indexing
- Term weighting/scoring
- Cosine
- Evaluation
- Document classification
- Clustering
- Information extraction
- Sentiment/opinion mining

But First: Back to Distributed Indexing
[Figure: the distributed indexing architecture. A master assigns document splits to parsers, which emit term/doc pairs partitioned into term ranges (a-f, g-p, q-z); inverters then collect each partition into postings.]

Huh? That was supposed to be an explanation of MapReduce (Hadoop)... Maybe not so much... Here’s another try

MapReduce
- MapReduce is a distributed programming framework that is intended to facilitate applications that are:
  - Data intensive
  - Parallelizable in a certain sense
  - In a commodity-cluster environment
- MapReduce is the original internal Google model; Hadoop is the open-source version.

Inspirations
MapReduce elegantly and efficiently combines inspirations from a variety of sources, including:
- Functional programming
- Key/value association lists
- Unix pipes

Functional Programming
- The focus is on side-effect-free specifications of input/output mappings.
- There are various idioms, but map and reduce are two central ones:
  - Mapping refers to applying an identical function to each of the elements of a list and constructing a list of the outputs.
  - Reducing refers to receiving the elements of a list and aggregating the elements according to some function.

Python Map/Reduce
Say you wanted to compute a simple sum of squares of a list of numbers:

>>> z
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> z2 = map(lambda x: x**2, z)
>>> z2
[1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> reduce(lambda x, y: x + y, z2)
285
>>> reduce(lambda x, y: x + y, map(lambda x: x**2, z))
285
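(A side note not on the original slide: the session above is Python 2, which was current when the slides were written. In Python 3, reduce has moved to functools and map returns a lazy iterator, so the same computation would look like this:)

from functools import reduce

z = [1, 2, 3, 4, 5, 6, 7, 8, 9]
z2 = list(map(lambda x: x ** 2, z))       # map() is lazy in Python 3, so materialize it to print
print(z2)                                  # [1, 4, 9, 16, 25, 36, 49, 64, 81]
print(reduce(lambda x, y: x + y, z2))      # 285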

Association Lists (key/value)
- The notion of association lists goes way back to early Lisp/AI programming.
- The basic idea is to try to view problems in terms of sets of key/value pairs.
- Most major languages now provide first-class support for this notion (usually via hashes on keys).
- We've seen this a lot this semester:
  - Tokens and term-ids
  - Terms and document ids
  - Terms and posting lists
  - Doc-ids and tf-idf values
  - Etc.
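(As a concrete illustration of my own, not from the slides, two of those pairings expressed as plain Python dicts; the terms, doc-ids, and weights are made up:)

# terms as keys, posting lists as values (each posting is a (doc-id, tf) pair)
postings = {
    "retrieval": [(1, 3), (4, 1), (7, 2)],
    "language":  [(2, 5), (4, 2)],
}

# doc-ids as keys, tf-idf weights (term -> weight) as values
doc_vectors = {
    1: {"retrieval": 0.42, "language": 0.0},
    4: {"retrieval": 0.11, "language": 0.23},
}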

MapReduce
MapReduce combines these ideas in the following way:
- There are two phases of processing: mapping and reducing. Each phase consists of multiple identical copies of map and reduce methods.
- Map methods take individual key/value pairs as input and return some function of those pairs, producing a new key/value pair.
- Reduce methods take a key and a list of values as input, and return some aggregate function of the values as an answer.

Map Phase
[Figure: each map task takes a key/value pair and emits a new key'/value' pair.]

Reduce Phase
[Figure: mapped key'/value' pairs are distributed by key, then sorted by key and their values collated; each reduce task then emits a key''/value'' pair for its key.]

Example
Simple example used in all the tutorials: get the counts of each word type across a bunch of docs. Let's assume each doc is a big long string.
For map:
- Input: filenames are keys; content strings are values
- Output: term tokens are keys; values are 1's
For reduce:
- Input: term tokens are keys; 1's are values
- Output: term types are keys; summed counts are values

Dumbo Example

def map(docid, contents):
    # key: docid, value: the document's content string
    for term in contents.split():
        yield term, 1              # emit (term, 1) for every token

def reduce(term, counts):
    # key: term, value: the collated list of 1's for that term
    sum = 0
    for count in counts:
        sum = sum + count
    yield term, sum                # emit (term, total count)

Hidden Infrastructure
- Partitioning the incoming data
  - Hadoop has default methods: by file, given a bunch of files; by line, given a file full of lines
- Sorting/collating the mapped key/values (see the sketch below)
- Moving the data among the nodes
  - Distributed file system
  - Don't move the data; just assign mappers/reducers to nodes
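(To make the sort/collate step concrete, here is a minimal single-machine sketch of my own, not Hadoop or Dumbo code, of what a framework does around map and reduce functions shaped like the Dumbo example: apply the mapper to every input pair, sort the emitted pairs by key, collate each key's values, and hand each group to the reducer. The wc_map/wc_reduce names are mine.)

from itertools import groupby
from operator import itemgetter

def wc_map(docid, contents):           # same shape as the Dumbo map above
    for term in contents.split():
        yield term, 1

def wc_reduce(term, counts):           # same shape as the Dumbo reduce above
    yield term, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # map phase: apply map_fn to every input key/value pair
    mapped = [pair for key, value in inputs for pair in map_fn(key, value)]
    # "shuffle": sort by key, then collate the values for each key
    mapped.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        values = [value for _, value in group]
        # reduce phase: aggregate the collated values for this key
        results.extend(reduce_fn(key, values))
    return results

docs = [("doc1.txt", "to be or not to be"), ("doc2.txt", "to seek the truth")]
print(run_mapreduce(docs, wc_map, wc_reduce))
# [('be', 2), ('not', 1), ('or', 1), ('seek', 1), ('the', 1), ('to', 3), ('truth', 1)]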

Example 2
Given our normal postings (term -> list of (doc-id, tf) tuples), generate the vector length normalization for each document in the index.
Map:
- Input: terms are keys; posting lists are values
- Output: doc-ids are keys; squared weights are values
Reduce:
- Input: doc-ids are keys; lists of squared weights are values
- Output: doc-ids are keys; the square root of the summed weights are the values

Example 2

import math

def map(term, postings):
    # key: term, value: its posting list
    for post in postings:
        yield post.docID(), post.weight() ** 2    # emit (doc-id, squared term weight)

def reduce(docID, sqWeights):
    # key: doc-id, value: the collated list of squared weights for that document
    sum = 0
    for weight in sqWeights:
        sum = sum + weight
    yield docID, math.sqrt(sum)                   # emit (doc-id, vector length)
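(A toy driver for Example 2, of my own invention: the Posting class is a hypothetical stand-in for whatever posting object a real index supplies, of which only the docID() and weight() accessors used on the slide matter, and the collating loop plays the role of the framework. It assumes the map and reduce definitions just above are in scope, shadowing the builtins.)

from collections import defaultdict

class Posting:
    # hypothetical stand-in for a real posting object
    def __init__(self, doc_id, weight):
        self._doc_id, self._weight = doc_id, weight
    def docID(self):
        return self._doc_id
    def weight(self):
        return self._weight

index = {                                  # toy index: term -> postings with per-doc weights
    "cat": [Posting(1, 0.5), Posting(2, 0.3)],
    "dog": [Posting(1, 0.2), Posting(3, 0.7)],
}

collated = defaultdict(list)
for term, postings in index.items():
    for doc_id, sq in map(term, postings):         # the slide's map()
        collated[doc_id].append(sq)

lengths = {}
for doc_id, sq_weights in collated.items():
    for d, length in reduce(doc_id, sq_weights):   # the slide's reduce()
        lengths[d] = length

print(lengths)   # {1: sqrt(0.25 + 0.04) ~ 0.539, 2: 0.3, 3: 0.7}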

Break
- Thursday we'll start discussion of projects, so come to class ready to say something about your project.
- Part 2 of the HW is due next Thursday. Feel free to mail me updated results (R-precisions) as you get them...

Probabilistic Approaches
The following is a mix of Chapters 12 and 13. Only the material from Chapter 12 will be on the quiz.

An Alternative
- The basic vector space model uses a geometric metaphor/framework for the ad hoc retrieval problem:
  - One dimension for each word in the vocab
  - Weights are usually tf-idf based
- An alternative is to use a probabilistic approach, so we'll take a short detour into probabilistic language modeling.

Using Language Models for ad hoc Retrieval
- Each document is treated as (the basis for) a language model.
- Given a query q, rank documents based on P(d|q), which by Bayes' rule is
  P(d|q) = P(q|d) P(d) / P(q)
  - P(q) is the same for all documents, so ignore it.
  - P(d) is the prior, often treated as the same for all d.
    - But we can give a higher prior to "high-quality" documents: PageRank, clickthrough, social tags, etc.
  - P(q|d) is the probability of q given d.
- So to rank documents according to relevance to q, rank according to P(q|d).

Where we are
- In the LM approach to IR, we attempt to model the query generation process.
- Think of a query as being generated from a model derived from a document (or documents).
- Then we rank documents by the probability that a query would be observed as a random sample from the respective document model.
- That is, we rank according to P(q|d).
- Next: how do we compute P(q|d)?

Stochastic Language Models
- Models the probability of generating strings (each word in turn) in the language (commonly, all strings over the vocabulary Σ).
- E.g., a unigram model M:

  the    0.2
  a      0.1
  man    0.01
  woman  0.01
  said   0.03
  likes  0.02
  ...

- For s = "the man likes the woman", multiply the word probabilities:
  P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008
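(A small sketch of mine, not from the slides, of that multiplication, using the unigram probabilities listed above:)

M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def p_string(s, model):
    # P(s | M): multiply the unigram probability of each word in turn
    p = 1.0
    for word in s.split():
        p *= model[word]
    return p

print(p_string("the man likes the woman", M))   # 0.2 * 0.01 * 0.02 * 0.2 * 0.01 = 8e-08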

Stochastic Language Models
Model the probability of generating any string (for example, a query).

  Model M1              Model M2
  the       0.2         the       0.2
  class     0.01        class     ...
  sayst     ...         sayst     0.03
  pleaseth  ...         pleaseth  0.02
  yon       ...         yon       0.1
  maiden    ...         maiden    0.01
  woman     0.01        woman     ...

For s = "the class pleaseth yon maiden": P(s|M2) > P(s|M1)

How to compute P(q|d)

P(q|d) = P(⟨t1, ..., t|q|⟩ | d) = ∏_{1 ≤ k ≤ |q|} P(t_k | d)

- This kind of conditional independence assumption is often called a Markov model.
  (|q|: the length of q; t_k: the token occurring at position k in q)
- This is equivalent to:

P(q|d) = ∏_{distinct term t in q} P(t | d)^(tf_t,q)

  (tf_t,q: term frequency, the number of occurrences, of t in q)

So... LMs for ad hoc Retrieval
- Use each individual document as the corpus for a language model.
- For a given query, assess P(q|d) for each document in the collection.
- Return docs in the ranked order of P(q|d).
- Think about how scoring works for this model... Term at a time? Doc at a time?
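(A minimal doc-at-a-time sketch of my own of the procedure just described, using maximum-likelihood estimates and no smoothing; smoothing is the topic flagged for next time, and without it any unseen query term zeroes out a document's score. The documents and query are made up.)

from collections import Counter

def mle_model(doc):
    # unigram model from a single document: P(t|d) = tf(t, d) / |d|
    tokens = doc.split()
    n = float(len(tokens))
    return {t: c / n for t, c in Counter(tokens).items()}

def score(query, model):
    # P(q|d): product over query tokens of P(t|d); 0.0 if a token never occurs in d
    p = 1.0
    for t in query.split():
        p *= model.get(t, 0.0)
    return p

docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the dog barks at the quick cat",
}
query = "quick dog"
ranked = sorted(docs, key=lambda d: score(query, mle_model(docs[d])), reverse=True)
print(ranked)   # ['d2', 'd1'] -- d2 is shorter, so its estimates for the query terms are larger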

Unigram and higher-order models
- Unigram language models: words are generated independently, so
  P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)
- Bigram (generally, n-gram) language models: each word is conditioned on the preceding word(s), so
  P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)
- Other language models: grammar-based models (PCFGs), etc.
  - Probably not the first thing to try in IR
- Unigram models: easy. Effective!

Next time
- Lots of practical issues
- Smoothing