
CIS 455/555: Internet and Web Systems
Information retrieval; Ranking; TF-IDF
March 28, 2016
University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives

Announcements

It is time to form project groups! A draft of the project handout is now available. One person from each team should send me an email by Friday with the names and SEAS logins of the team members. When recruiting members for your team, please discuss your expectations (work style, schedule, technologies, goals, ...) carefully to ensure that you are a good match.

Reading for next time:
Brin and Page: The PageRank Citation Ranking: Bringing Order to the Web
Brin and Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine
Kleinberg: Authoritative Sources in a Hyperlinked Environment [optional]

The team project

Task: Build a cloud-based search engine. It should consist of four components:
Crawler, based on your crawler (HW2) and MapReduce (HW3)
Indexer, based on MapReduce (HW3) and BerkeleyDB
PageRank, based on MapReduce (HW3)
Search engine and user interface (HW1)

Draft specs are available on the course web page. Deploy & evaluate on Amazon EC2; you need to evaluate performance and write a final report. You can use AWS Educate credits for this assignment. (Did the signup work? Did you get $100, or still only $35?)

Some example projects (slide shows screenshots; images not included in this transcript)

Logistics

Rough timeline (preliminary):
April 1st: Form project groups; initial planning
April 4th-8th: Check/review sessions
April 11th: Initial project plan due
April 27th: Official* code submission deadline
May 2nd - May 6th: Project demos
May 6th: Final report due (hard deadline!)

To do: Form project groups. Each team should have 4 members; there will be one 5-member group (this requires approval, and the group will need to do some extra-credit tasks). One person from each group should send me a list of group members by April 1st (and CC the other members!). I may have to split or merge some groups.

The Google award

The team with the best search engine will receive an award (sponsored by Google). Criteria: architecture/design, speed, reliability, quality of search results, user interface, written final report. The winning team gets four Nexus 7 tablets*.

* Winners will be added to the CIS455 'hall of fame'

Some 'lessons learned' from last year

The most common mistakes were:
Started too late; tried to do everything at the last minute. You need to leave enough time at the end to a) crawl a sufficiently large corpus, and b) tweak the ranking function to get good results.
Underestimated the amount of integration work. Suggestion: Define clean interfaces, build dummy components for testing, exchange code early and throughout the project.
Performance issues. Example: Congestion due to very large data transfers between nodes.
Underestimated EC2 deployment. Try your code on EC2 as early as possible.
Unbalanced team. You need to pick your teammates wisely, make sure everyone pulls their weight, keep everyone motivated, ...

Plan for today

Information retrieval (NEXT)
  Basics
  Precision and recall
  Taxonomy of IR models
Classic IR models
  Boolean model
  Vector model
  TF/IDF
HITS and PageRank

Web search

Goal is to find information relevant to a user's interests - and this is hard!

Challenge 1: Data quality. A significant amount of content on the web is not quality information: many pages contain nonsensical rants, etc.; the web is full of misspellings, multiple languages, etc.; and many pages are designed not to convey information but to get a high ranking (e.g., "search engine optimization").

Challenge 2: Scale. Billions of documents.

Challenge 3: Very little structure. No explicit schemata. However, hyperlinks encode information!

Our discussion of web search

Begin with traditional information retrieval:
  Document models
  Stemming and stop words
Then web-specific issues:
  Crawlers and robots.txt (already discussed)
  Scalability
  Models for exploiting hyperlinks in ranking: Google and PageRank; Latent Semantic Indexing

Information Retrieval

Traditional information retrieval is basically text search:
A corpus or body of text documents, e.g., in a document collection in a library or on a CD
Documents are generally high-quality and designed to convey information
Documents are assumed to have no structure beyond words
Searches are generally based on meaningful phrases, perhaps including predicates over categories, dates, etc.
The goal is to find the document(s) that best match the search phrase, according to a search model

Assumptions are typically different from the Web: quality text, limited-size corpus, no hyperlinks.

Motivation for Information Retrieval

Information Retrieval (IR) is about the representation, storage, organization of, and access to "information items".

The focus is on the user's information need rather than a precise query:
User enters: "March Madness"
Goal: Find information on college basketball teams which (1) are maintained by a US university and (2) participate in the NCAA tournament

The emphasis is on the retrieval of information (not data).

Data vs. Information Retrieval

Data retrieval, analogous to database querying: which docs contain a set of keywords?
Well-defined, precise logical semantics
Example: All documents with (('CIS455' OR 'CIS555') AND ('midterm'))
A single erroneous object implies failure!

Information retrieval: information about a subject or topic
Semantics is frequently loose; we want approximate matches
Small errors are tolerated (and in fact inevitable)

An IR system must interpret the contents of information items and generate a ranking which reflects relevance. The notion of relevance is most important - it needs a model.

Basic model (diagram): a user's information need is expressed as a query over index terms; documents are likewise represented by index terms, and a matching function produces a ranking of the documents against the query.

Information Retrieval as a field

IR addressed many issues in the last 30 years:
Classification and categorization of documents
Systems and languages for searching
User interfaces and visualization of results
The area was seen as of narrow interest - mainly to libraries.

And then came the advent of the web:
Universal "library"
Free (low-cost) universal access
No central editorial board
Many problems in finding information: IR is seen as key to finding the solutions!

The full Information Retrieval process (diagram): documents (from the Web or a DB) are fetched by a crawler / data access layer; text processing and modeling produce a logical view of each document, from which indexing builds an inverted index. A user's query, entered through the browser / UI, goes through query operations, searching, and ranking against the index; ranked docs are returned to the user, whose interest and feedback can refine the query.

Terminology

IR systems usually adopt index terms to process queries.
Index term: a keyword or group of selected words, or any word (more general)
Stemming might be used: connect: connecting, connection, connections
An inverted index is built for the chosen index terms.
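To make the inverted index concrete, here is a minimal sketch in Python (illustrative only; the whitespace tokenizer and the decision to index every word are simplifying assumptions, and a real system would also stem and remove stop words):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each index term to the set of document IDs that contain it.

    docs: dict of doc_id -> text.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "connecting flights", 2: "a connection was refused", 3: "network connections"}
index = build_inverted_index(docs)
print(index["connecting"])  # {1} -- without stemming, variants of 'connect' stay separate
```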

What is a meaningful result?

Matching at the index-term level is quite imprecise, and users are frequently dissatisfied. One problem: users are generally poor at formulating queries. Hence the frequent dissatisfaction of Web users (who often give single-keyword queries).

The issue of deciding relevance is critical for IR systems: ranking. Show more relevant documents first; documents with low relevance may be left out.

Precision and recall

How good is our IR system? Two common metrics:
Precision: What fraction of the returned documents is relevant?
Recall: What fraction of the relevant documents are returned?

How can you build trivial systems that optimize one of them? Tradeoff: Increasing precision will usually lower recall, and vice versa. Evaluate in a p-r graph, varying, e.g., the number of results returned. (Graph: precision vs. recall; the ideal system sits in the corner where both are 1, typical systems trace a curve trading one against the other; closer to ideal is better.) A small sketch follows.
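A tiny sketch of how the two metrics are computed, with hypothetical doc-ID sets. It also shows the trivial optimizations: returning everything maximizes recall at the cost of precision, while returning a single sure hit maximizes precision at the cost of recall.

```python
def precision_recall(returned, relevant):
    """Precision: fraction of returned docs that are relevant.
    Recall: fraction of relevant docs that were returned."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall

print(precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 7]))  # (0.5, 0.666...)
```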

Rankings

A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query. A ranking is based on fundamental premises regarding the notion of relevance, such as:
common sets of index terms
sharing of weighted terms
likelihood of relevance
Each set of premises leads to a distinct IR model.

Types of IR Models (taxonomy)

User task: retrieval (ad hoc, filtering) or browsing.
Classic models: Boolean, vector, probabilistic.
  Set-theoretic extensions: fuzzy, extended Boolean
  Algebraic extensions: generalized vector, latent semantic indexing, neural networks
  Probabilistic extensions: inference network, belief network
Structured models: non-overlapping lists, proximal nodes.
Browsing models: flat, structure guided, hypertext.

Classic IR models - Basic concepts

Each document is represented by a set of representative keywords or index terms. An index term is a document word useful for remembering the document's main themes. Traditionally, index terms were nouns, because nouns have meaning by themselves. Search engines assume that all words are index terms (full text representation).

Classic IR models - Weights

Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents. The importance of the index terms is represented by weights associated with them.

Let $k_i$ be an index term, $d_j$ be a document, and $w_{ij}$ be a weight associated with the pair $(k_i, d_j)$. The weight $w_{ij}$ quantifies the importance of the index term for describing the document contents.

Classic IR models - Notation

$k_i$: an index term (keyword)
$d_j$: a document
$t$: the total number of index terms
$K = (k_1, k_2, \ldots, k_t)$: the set of all index terms
$w_{ij} \ge 0$: a weight associated with $(k_i, d_j)$; $w_{ij} = 0$ indicates that the term does not belong to the doc
$\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{tj})$: the weighted vector associated with the document $d_j$
$g_i(d_j) = w_{ij}$: a function which returns the weight associated with the pair $(k_i, d_j)$

Plan for today

Information retrieval
  Basics
  Precision and recall
  Taxonomy of IR models
Classic IR models
  Boolean model (NEXT)
  Vector model
  TF/IDF
HITS and PageRank

Boolean model

A simple model based on set theory. Queries are specified as Boolean expressions:
precise semantics
neat formalism
Terms are either present or absent; thus $w_{ij} \in \{0, 1\}$.

An example query: $q = k_a \land (k_b \lor \lnot k_c)$ (evaluated in the sketch below)
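As a sketch, the example query can be evaluated directly with set operations over an inverted index; the term names and document contents here are hypothetical:

```python
def boolean_query(index, all_docs):
    """Evaluate q = ka AND (kb OR NOT kc) with set operations over an
    inverted index (term -> set of doc IDs). Answers are unranked."""
    ka = index.get("ka", set())
    kb = index.get("kb", set())
    kc = index.get("kc", set())
    return ka & (kb | (all_docs - kc))

all_docs = {1, 2, 3, 4}
index = {"ka": {1, 2, 3}, "kb": {2}, "kc": {3, 4}}
print(boolean_query(index, all_docs))  # {1, 2}: doc 1 lacks kc, doc 2 has kb
```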

Boolean model for similarity

Query: $q = k_a \land (k_b \lor \lnot k_c)$

In disjunctive normal form: $\vec{q}_{dnf} = (1,1,1) \lor (1,0,0) \lor (1,1,0)$, where each triple of weights over $(k_a, k_b, k_c)$ is a conjunctive component. (Diagram: Venn diagram of $K_a$, $K_b$, $K_c$ with the three components shaded.)

$sim(d_j, q) = 1$ if there exists a conjunctive component $\vec{q}_{cc} \in \vec{q}_{dnf}$ such that $g_i(\vec{d_j}) = g_i(\vec{q}_{cc})$ for all $k_i$; otherwise $sim(d_j, q) = 0$.

Drawbacks of the Boolean model

Retrieval is based on a binary decision criterion with no notion of partial matching, and no ranking of the documents is provided (absence of a grading scale). The information need has to be translated into a Boolean expression, which most users find awkward, and the Boolean queries formulated by users are most often too simplistic. As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query.

Plan for today

Information retrieval
  Basics
  Precision and recall
  Taxonomy of IR models
Classic IR models
  Boolean model
  Vector model (NEXT)
  TF/IDF
HITS and PageRank

Vector model

A refinement of the Boolean model which does not focus strictly on exact matching. Non-binary weights provide consideration for partial matches; these term weights are used to compute a degree of similarity between a query and each document. A ranked set of documents provides for better matching.

Vector model

Define:
$w_{ij} > 0$ whenever $k_i \in d_j$
$w_{iq} \ge 0$: the weight associated with the pair $(k_i, q)$
$\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{tj})$ and $\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$

With each term $k_i$, associate a unit vector $\vec{i}$. The unit vectors $\vec{i}$ and $\vec{j}$ are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents). Does this assumption (the "independence assumption") hold in practice? What influence do you think this has on performance?

The $t$ unit vectors form an orthonormal basis for a $t$-dimensional space, in which queries and documents are represented as weight vectors.

Bag of words

In this model, $w_{ij} > 0$ whenever $k_i \in d_j$; the exact ordering of terms in the document is ignored. This is called the "bag of words" model.

What will be the vectors for the following two documents? "Ape eats banana" / "Banana eats ape". What needs to be done to fix this? (See the sketch below.)
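A minimal bag-of-words sketch showing why the two documents above become indistinguishable; fixing this requires keeping information the bag discards, e.g., term positions for phrase queries:

```python
from collections import Counter

def bag_of_words(text):
    """Term-frequency vector; word order is discarded."""
    return Counter(text.lower().split())

# Both documents map to {ape: 1, eats: 1, banana: 1}:
print(bag_of_words("Ape eats banana") == bag_of_words("Banana eats ape"))  # True
```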

Similarity

In the vector model, queries may return documents that are not a 'perfect match'. Hence, we need a metric for the similarity between different documents, or between a document and a query. Could we simply subtract the vectors (L1 norm)? Could we use a dot product? Does normalization help?

Term counts in three novels (from: An Introduction to Information Retrieval, Cambridge UP):

Term       Sense and Sensibility   Pride and Prejudice   Wuthering Heights
affection  115                     58                    20
jealous    10                      7                     11
gossip     2                       0                     6

Cosine similarity

$sim(q, d_j) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \cos\theta$, where $\theta$ is the angle between the document and query vectors.

All weights are nonnegative; hence $0 \le sim(q, d_j) \le 1$.
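A small sketch of cosine similarity over sparse term-weight vectors, applied to the (reconstructed) novel counts from the previous slide, treated here as raw term-frequency weights:

```python
import math

def cosine_sim(q, d):
    """sim(q, d) = (q . d) / (|q| |d|); vectors are dicts of term -> weight."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

sas = {"affection": 115, "jealous": 10, "gossip": 2}
pap = {"affection": 58, "jealous": 7, "gossip": 0}
wh  = {"affection": 20, "jealous": 11, "gossip": 6}
print(cosine_sim(sas, pap), cosine_sim(sas, wh))  # ~0.999 vs ~0.889: SaS is closer to PaP
```

Note how normalization answers the question above: the raw counts differ wildly in magnitude, but the cosine compares only the direction of the vectors.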

Plan for today

Information retrieval
  Basics
  Precision and recall
  Taxonomy of IR models
Classic IR models
  Boolean model
  Vector model
  TF/IDF (NEXT)
HITS and PageRank

An example

What would be a good match for this query? "The University of Pennsylvania"

Weights in the vector model

How do we compute the weights $w_{ij}$ and $w_{iq}$? A good weight must take into account two effects:
quantification of intra-document contents (similarity): the tf factor, the term frequency within a document
quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency

$w_{ij} = tf(i,j) \times idf(i)$

TF and IDF factors

Let:
$N$ be the total number of docs in the collection
$n_i$ be the number of docs which contain $k_i$
$freq(i,j)$ be the raw frequency of $k_i$ within $d_j$

A normalized tf factor is given by $f(i,j) = a + (1-a) \cdot \frac{freq(i,j)}{\max_l freq(l,j)}$, where the maximum is computed over all terms which occur within the document $d_j$ ($a$ is usually set to 0.4 or 0.5).

The idf factor is computed as $idf(i) = \log(N / n_i)$. The log is used to make the values of tf and idf comparable; it can also be interpreted as the amount of information associated with the term $k_i$.
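These two formulas translate directly into code; a sketch with made-up numbers:

```python
import math

def tf_idf(freq, max_freq, N, n_i, a=0.5):
    """w_ij = f(i,j) * idf(i), with the normalized tf from this slide:
    f(i,j) = a + (1-a) * freq(i,j) / max_l freq(l,j), idf(i) = log(N / n_i)."""
    f = a + (1 - a) * freq / max_freq
    return f * math.log(N / n_i)

# A term appearing 3 times in a doc whose most frequent term appears 6 times,
# in a 10,000-doc collection where 100 docs contain the term:
print(tf_idf(freq=3, max_freq=6, N=10_000, n_i=100))  # 0.75 * ln(100) ~= 3.45
```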

Vector Model Example 1 (figure): seven documents d1-d7 positioned by terms k1, k2, k3, ranked for the query "k1 k2 k3" with no weights.

Vector Model Example 2 (figure): the same seven documents and query "k1 k2 k3", now with weights on the query terms.

Vector Model Example 3 (figure): the same setup, with weights on both the document and the query terms.

Putting it all together: Scoring

Example: The query is "best car insurance". For the document, use tf weighting without idf, but with Euclidean normalization; for the query, use idf. The corpus has $N$ = 1,000,000 documents; df and idf are computed across all documents, tf within the specific document being scored.

Term       df      idf   tf(doc)  w_{t,d}  tf(query)  w_{t,q}  product
auto       5000    2.3   1        0.41     0          0        0
best       50000   1.3   0        0        1          1.3      0
car        10000   2.0   1        0.41     1          2.0      0.82
insurance  1000    3.0   2        0.82     1          3.0      2.46

The net score for this document is the sum of $w_{t,d} \cdot w_{t,q}$: $0.41 \cdot 0 + 0 \cdot 1.3 + 0.41 \cdot 2.0 + 0.82 \cdot 3.0 = 3.28$.

From: An Introduction to Information Retrieval, Cambridge UP
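The same computation as a sketch, using the numbers from the table (note the exact sum is about 3.27; 3.28 arises from the table's rounded weights):

```python
import math

# idf across the whole corpus: N = 1,000,000 documents, log base 10
df  = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
idf = {t: math.log10(1_000_000 / n) for t, n in df.items()}   # 2.3, 1.3, 2.0, 3.0

# Document side: raw tf, Euclidean-normalized, no idf
doc_tf = {"auto": 1, "best": 0, "car": 1, "insurance": 2}
norm   = math.sqrt(sum(tf * tf for tf in doc_tf.values()))    # sqrt(6)
w_doc  = {t: tf / norm for t, tf in doc_tf.items()}           # 0.41, 0, 0.41, 0.82

# Query side: idf only (each query term occurs once)
w_query = {t: idf[t] for t in ["best", "car", "insurance"]}

score = sum(w_doc[t] * w for t, w in w_query.items())
print(round(score, 2))  # ~3.27 (the table's rounded weights give 3.28)
```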

Stop words

What do we do about very common words ('the', 'of', 'is', 'may', 'a', ...)? They do not appear to be very useful in general... though they may be in phrase searches: "President of the United States", "To be or not to be".

We can use a stop list to remove these entirely. Stop lists are typically small, and the ongoing trend is towards even smaller lists, or no list at all (web search engines generally do not use them). A small sketch follows.
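A minimal stop-list sketch (the stop set here is hypothetical) showing both the space savings and the phrase-search problem the slide mentions:

```python
STOP_WORDS = {"the", "of", "is", "may", "a", "to", "or", "not", "be"}

def remove_stop_words(tokens):
    """Drop stop words from a token list."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("president of the united states".split()))
# ['president', 'united', 'states'] -- the phrase structure is lost
print(remove_stop_words("to be or not to be".split()))
# [] -- the entire query vanishes!
```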

Stemming and lemmatization

What if the document contains many similar word forms? View, viewing, viewer, viewed, views, viewable, ...; democracy, democratization, ...

We can use stemming to 'normalize' words: a somewhat rough heuristic that chops off the ends of words, etc. The most common algorithm is the Porter stemmer. It is far from perfect; for example, operate, operating, operates, operation, operative, operatives, operational, ... are all stemmed to 'oper'.

Better: Use NLP tools (a lemmatizer), which may use a vocabulary (e.g., 'are/is/were' -> 'be').

Example: Porter stemmer

Input: "Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation"

Porter stemmer output: "such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret"

Above: a simple example of stemmer output. Example rules: SSES -> SS, IES -> I, ... The entire algorithm is fairly long and complex. (From: "An introduction to Information Retrieval", Cambridge University Press, page 34.)
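A toy sketch of just the two example rules above; the real Porter algorithm has several ordered rule phases, so in practice you would use an existing implementation (e.g., NLTK's PorterStemmer):

```python
def toy_stem(word):
    """Apply only the two example rules from this slide: SSES -> SS, IES -> I."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    return word            # everything else untouched (the real stemmer does far more)

print([toy_stem(w) for w in ["caresses", "ponies", "cats"]])
# ['caress', 'poni', 'cats']
```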

Summary: Vector Model

The best term-weighting schemes use tf-idf weights: $w_{ij} = f(i,j) \times \log\frac{N}{n_i}$. For the query term weights, a suggestion is $w_{iq} = \left(0.5 + \frac{0.5 \, freq(i,q)}{\max_l freq(l,q)}\right) \times \log\frac{N}{n_i}$.

This model is very good in practice: tf-idf works well with general collections, and it is simple and fast to compute. The vector model is usually as good as the known ranking alternatives.

Pros & Cons of the vector model

Advantages:
Term-weighting improves the quality of the answer set
Partial matching allows retrieval of docs that approximate the query conditions
The cosine ranking formula sorts documents according to their degree of similarity to the query

Disadvantages:
Assumes independence of index terms; it is not clear whether this is a good or bad assumption

Comparison of classic models

The Boolean model does not provide for partial matches and is considered to be the weakest classic model. Some experiments indicate that the vector model outperforms the third alternative, the probabilistic model, in general. IR research focused on improving probabilistic models for some time, but these haven't made their way to Web search. Generally, a variation of the vector model is used in most text search systems.

Further reading

"An Introduction to Information Retrieval", Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze; Cambridge University Press, 2009. Available online as a PDF. It contains more details on many of the topics covered in this lecture, for example: scoring, tokenization, lemmatization, ...

If you're the ranking expert in your final project team, you should have a look!... and possibly even if you're not (it's interesting!)