© 2016 A. Haeberlen, Z. Ives CIS 455/555: Internet and Web Systems 1 University of Pennsylvania Information retrieval; Ranking; TF-IDF March 28, 2016
© 2016 A. Haeberlen, Z. Ives Announcements It is time to form project groups! A draft of the project handout is now available. One person from each team should send me an email by Friday with the names and SEAS logins of the team members. When recruiting members for your team, please discuss your expectations (work style, schedule, technologies, goals, ...) carefully to ensure that you are a good match. Reading for next time: Brin and Page: The PageRank Citation Ranking: Bringing Order to the Web; Brin and Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine; Kleinberg: Authoritative Sources in a Hyperlinked Environment [optional] 2 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives The team project Task: Build a cloud-based search engine Should consist of four components: Crawler, based on your crawler (HW2) and MapReduce (HW3) Indexer, based on MapReduce (HW3) and BerkeleyDB PageRank, based on MapReduce (HW3) Search engine and user interface (HW1) Draft specs are available on the course web page Deploy & evaluate on Amazon EC2 Need to evaluate performance and write final report You can use AWS Educate credits for this assignment Did the signup work? Did you get $100, or still only $35? 3 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives Some example projects 4 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives Logistics Rough timeline (preliminary): April 1st: Form project groups; initial planning April 4th-8th: Check/review sessions April 11th: Initial project plan due April 27th: Official* code submission deadline May 2nd - May 6th: Project demos May 6th: Final report due (hard deadline!) Todo: Form project groups Each team should have 4 members There will be one 5-member group; this requires approval, and the group will need to do some extra credit tasks One person from each group should send me a list of group members by April 1st (and CC the other members!) I may have to split or merge some groups 5 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives The Google award The team with the best search engine will receive an award (sponsored by Google) Criteria: Architecture/design, speed, reliability, quality of search results, user interface, written final report Winning team gets four Nexus 7 tablets Winners will be added to the CIS455 'hall of fame' 6 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives Some 'lessons learned' from last year The most common mistakes were: Started too late; tried to do everything at the last minute You need to leave enough time at the end to a) crawl a sufficiently large corpus, and b) tweak the ranking function to get good results Underestimated amount of integration work Suggestion: Define clean interfaces, build dummy components for testing, exchange code early and throughout the project Performance issues Example: Congestion due to very large data transfers between nodes Underestimated EC2 deployment Try your code on EC2 as early as possible Unbalanced team You need to pick your teammates wisely, make sure everyone pulls their weight, keep everyone motivated,... 7 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives Plan for today Information retrieval Basics Precision and recall Taxonomy of IR models Classic IR models Boolean model Vector model TF/IDF HITS and PageRank 8 University of Pennsylvania NEXT
© 2016 A. Haeberlen, Z. Ives 9 Web search Goal is to find information relevant to a user’s interests - and this is hard! Challenge 1: Data quality A significant amount of content on the web is not quality information Many pages contain nonsensical rants, etc. The web is full of misspellings, multiple languages, etc. Many pages are designed not to convey information – but to get a high ranking (e.g., “search engine optimization”) Challenge 2: Scale Billions of documents Challenge 3: Very little structure No explicit schemata However, hyperlinks encode information!
© 2016 A. Haeberlen, Z. Ives 10 Our discussion of web search Begin with traditional information retrieval Document models Stemming and stop words Web-specific issues Crawlers and robots.txt (already discussed) Scalability Models for exploiting hyperlinks in ranking Google and PageRank Latent Semantic Indexing
© 2016 A. Haeberlen, Z. Ives 11 Information Retrieval Traditional information retrieval is basically text search A corpus or body of text documents, e.g., in a document collection in a library or on a CD Documents are generally high-quality and designed to convey information Documents are assumed to have no structure beyond words Searches are generally based on meaningful phrases, perhaps including predicates over categories, dates, etc. The goal is to find the document(s) that best match the search phrase, according to a search model Assumptions are typically different from Web: quality text, limited-size corpus, no hyperlinks
© 2016 A. Haeberlen, Z. Ives 12 Motivation for Information Retrieval Information Retrieval (IR) is about the representation, storage, and organization of, and access to, "information items" Focus is on the user's information need rather than a precise query: User enters: "March Madness" Goal: Find information on college basketball teams which (1) are maintained by a US university and (2) participate in the NCAA tournament Emphasis is on the retrieval of information (not data)
© 2016 A. Haeberlen, Z. Ives 13 Data vs. Information Retrieval Data retrieval, analogous to database querying: which docs contain a set of keywords? Well-defined, precise logical semantics Example: All documents with (('CIS455' OR 'CIS555') AND ('midterm')) A single erroneous object implies failure! Information retrieval: Information about a subject or topic Semantics is frequently loose; we want approximate matches Small errors are tolerated (and in fact inevitable) IR system: Interpret contents of information items Generate a ranking which reflects relevance Notion of relevance is most important – needs a model
© 2016 A. Haeberlen, Z. Ives 14 Basic model (figure): the user's information need is expressed as a query, documents are represented by index terms, the query is matched against those representations, and the matching documents are ranked
© 2016 A. Haeberlen, Z. Ives 15 Information Retrieval as a field IR addressed many issues in the last 30 years: Classification and categorization of documents Systems and languages for searching User interfaces and visualization of results Area was seen as of narrow interest – libraries, mainly And then – the advent of the web: Universal “library” Free (low cost) universal access No central editorial board Many problems in finding information: IR seen as key to finding the solutions!
© 2016 A. Haeberlen, Z. Ives 16 The full Information Retrieval process (figure): a crawler / data access component fetches documents (Web or DB); text processing and modeling turns the text into a logical view; indexing builds an inverted index; the user's text query (expressing a user interest) goes through query operations, searching, and ranking against the index; ranked docs are shown in the browser / UI, and user feedback can refine the retrieved docs
© 2016 A. Haeberlen, Z. Ives 17 Terminology IR systems usually adopt index terms to process queries Index term: a keyword or group of selected words; more generally, any word that appears in a document Stemming might be used: connect ← connecting, connection, connections An inverted index is built for the chosen index terms
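A minimal sketch of what such an inverted index might look like, assuming a naive tokenizer (lowercase, split on non-letters) and made-up documents; the project's indexer would persist its postings (e.g., in BerkeleyDB) rather than keep them in an in-memory map:

import java.util.*;

// Minimal inverted index: maps each index term to the set of document IDs
// that contain it. Tokenization is a naive lowercase split on non-letters.
public class InvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    public Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(1, "Connecting to the network");
        idx.addDocument(2, "A network connection");
        System.out.println(idx.lookup("network"));     // [1, 2]
        System.out.println(idx.lookup("connection"));  // [2] (no stemming applied here)
    }
}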
© 2016 A. Haeberlen, Z. Ives 18 What is a meaningful result? Matching at index term level is quite imprecise Users are frequently dissatisfied One problem: users are generally poor at formulating queries Frequent dissatisfaction of Web users (who often give single-keyword queries) Issue of deciding relevance is critical for IR systems: ranking Show more relevant documents first May leave out documents with low relevance
© 2016 A. Haeberlen, Z. Ives Precision and recall How good is our IR system? Two common metrics: Precision: What fraction of the returned documents is relevant? Recall: What fraction of the relevant documents are returned? How can you build trivial systems that optimize one of them? Tradeoff: Increasing precision will usually lower recall, and vice versa Evaluate with a precision-recall graph, varying, e.g., the number of results returned (figure: the ideal curve keeps precision at 1 across all recall levels; a typical system's curve trades precision for recall, and "better" means closer to the ideal) 19 University of Pennsylvania
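As a small illustration, here is a sketch that computes both metrics for a single query; the document IDs and relevance judgments are invented. The trivial strategies are visible here: returning every document drives recall to 1, while returning only a single surely-relevant document drives precision to 1.

import java.util.HashSet;
import java.util.Set;

public class PrecisionRecall {
    // Precision: fraction of the returned documents that are relevant
    static double precision(Set<Integer> returned, Set<Integer> relevant) {
        if (returned.isEmpty()) return 0.0;
        Set<Integer> hits = new HashSet<>(returned);
        hits.retainAll(relevant);
        return (double) hits.size() / returned.size();
    }

    // Recall: fraction of the relevant documents that were returned
    static double recall(Set<Integer> returned, Set<Integer> relevant) {
        if (relevant.isEmpty()) return 0.0;
        Set<Integer> hits = new HashSet<>(relevant);
        hits.retainAll(returned);
        return (double) hits.size() / relevant.size();
    }

    public static void main(String[] args) {
        Set<Integer> returned = new HashSet<>(Set.of(1, 2, 3, 4));
        Set<Integer> relevant = new HashSet<>(Set.of(2, 4, 7));
        System.out.println(precision(returned, relevant)); // 0.5   (2 of the 4 returned are relevant)
        System.out.println(recall(returned, relevant));    // 0.66… (2 of the 3 relevant were returned)
    }
}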
© 2016 A. Haeberlen, Z. Ives 20 Rankings A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query A ranking is based on fundamental premises regarding the notion of relevance, such as: common sets of index terms sharing of weighted terms likelihood of relevance Each set of premises leads to a distinct IR model
© 2016 A. Haeberlen, Z. Ives 21 Types of IR Models (taxonomy): User task: Retrieval (ad hoc, filtering) or Browsing Classic models: boolean, vector, probabilistic Set-theoretic models: Fuzzy, Extended Boolean Algebraic models: Generalized Vector, Latent Semantic Indexing, Neural Networks Probabilistic models: Inference Network, Belief Network Structured models: Non-Overlapping Lists, Proximal Nodes Browsing models: Flat, Structure Guided, Hypertext
© 2016 A. Haeberlen, Z. Ives 22 Classic IR models – Basic concepts Each document represented by a set of representative keywords or index terms An index term is a document word useful for remembering the document's main themes Traditionally, index terms were nouns because nouns have meaning by themselves Search engines assume that all words are index terms (full text representation)
© 2016 A. Haeberlen, Z. Ives 23 Classic IR Models – Weights Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents The importance of the index terms is represented by weights associated with them Let k_i be an index term, d_j be a document, and w_ij be a weight associated with the pair (k_i, d_j) The weight w_ij quantifies the importance of the index term for describing the document contents
© 2016 A. Haeberlen, Z. Ives 24 Classic IR Models – Notation k_i is an index term (keyword) d_j is a document t is the total number of index terms K = (k_1, k_2, ..., k_t) is the set of all index terms w_ij ≥ 0 is a weight associated with (k_i, d_j); w_ij = 0 indicates that the term does not belong to the doc d_j = (w_1j, w_2j, ..., w_tj) is a weighted vector associated with the document d_j g_i(d_j) = w_ij is a function which returns the weight associated with the pair (k_i, d_j)
© 2016 A. Haeberlen, Z. Ives Plan for today Information retrieval Basics Precision and recall Taxonomy of IR models Classic IR models Boolean model Vector model TF/IDF HITS and PageRank 25 University of Pennsylvania NEXT
© 2016 A. Haeberlen, Z. Ives 26 Boolean model Simple model based on set theory Queries specified as boolean expressions: precise semantics, neat formalism Terms are either present or absent; thus, w_ij ∈ {0,1} An example query: q = k_a ∧ (k_b ∨ ¬k_c)
© 2016 A. Haeberlen, Z. Ives 27 Boolean model for similarity Query: q = k_a ∧ (k_b ∨ ¬k_c) (figure: Venn diagram over the term sets K_a, K_b, K_c) In disjunctive normal form: q_dnf = (k_a ∧ k_b ∧ k_c) ∨ (k_a ∧ k_b ∧ ¬k_c) ∨ (k_a ∧ ¬k_b ∧ ¬k_c), i.e., the conjunctive components (1,1,1), (1,1,0), (1,0,0) over (k_a, k_b, k_c) sim(q, d_j) = 1 if some conjunctive component of q_dnf agrees with d_j on every index term, and 0 otherwise
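A tiny sketch of how this query could be evaluated in the boolean model, treating each document as the set of index terms it contains (the documents below are invented):

import java.util.Set;

// Boolean model: each document is just the set of terms it contains (weights in {0,1}).
// A document matches q = ka AND (kb OR NOT kc) iff one of the conjunctive
// components (1,1,1), (1,1,0), (1,0,0) over (ka, kb, kc) matches it exactly.
public class BooleanModel {
    static boolean matches(Set<String> doc) {
        boolean ka = doc.contains("ka"), kb = doc.contains("kb"), kc = doc.contains("kc");
        return ka && (kb || !kc);
    }

    public static void main(String[] args) {
        System.out.println(matches(Set.of("ka", "kb", "kc"))); // true  -> component (1,1,1)
        System.out.println(matches(Set.of("ka")));             // true  -> component (1,0,0)
        System.out.println(matches(Set.of("ka", "kc")));       // false -> (1,0,1) is not a component
        System.out.println(matches(Set.of("kb", "kc")));       // false -> ka is missing
    }
}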
© 2016 A. Haeberlen, Z. Ives 28 Drawbacks of boolean model Retrieval based on binary decision criteria with no notion of partial matching No ranking of the documents is provided (absence of a grading scale) Information need has to be translated into a Boolean expression, which most users find awkward The Boolean queries formulated by the users are most often too simplistic As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
© 2016 A. Haeberlen, Z. Ives Plan for today Information retrieval Basics Precision and recall Taxonomy of IR models Classic IR models Boolean model Vector model TF/IDF HITS and PageRank 29 University of Pennsylvania NEXT
© 2016 A. Haeberlen, Z. Ives 30 Vector model A refinement of the boolean model, which does not focus strictly on exact matching Non-binary weights provide consideration for partial matches These term weights are used to compute a degree of similarity between a query and each document Ranked set of documents provides for better matching
© 2016 A. Haeberlen, Z. Ives 31 Vector model Define: w_ij > 0 whenever k_i ∈ d_j; w_iq ≥ 0 is the weight associated with the pair (k_i, q) d_j = (w_1j, w_2j, ..., w_tj) and q = (w_1q, w_2q, ..., w_tq) With each term k_i, associate a unit vector vec(i) The unit vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents) Does this assumption ("independence assumption") hold in practice? What influence do you think this has on performance? The t unit vectors vec(i) form an orthonormal basis for a t-dimensional space In this space, queries and documents are represented as weight vectors
© 2016 A. Haeberlen, Z. Ives Bag of words In this model, w_ij > 0 whenever k_i ∈ d_j Exact ordering of terms in the document is ignored This is called the "bag of words" model What will be the vectors for the following two documents? "Ape eats banana" "Banana eats ape" What needs to be done to fix this? 32 University of Pennsylvania
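A quick sketch showing the effect: under a plain bag-of-words representation, the two example documents map to exactly the same term-frequency vector. Fixing this would require something beyond the model shown here, e.g., term positions or phrases in the index.

import java.util.Map;
import java.util.TreeMap;

// Bag of words: a document is reduced to term -> raw frequency; word order is lost.
public class BagOfWords {
    static Map<String, Integer> bag(String text) {
        Map<String, Integer> tf = new TreeMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = bag("Ape eats banana");
        Map<String, Integer> b = bag("Banana eats ape");
        System.out.println(a);            // {ape=1, banana=1, eats=1}
        System.out.println(a.equals(b));  // true: the two documents are indistinguishable
    }
}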
© 2016 A. Haeberlen, Z. Ives Similarity In the vector model, queries may return documents that are not a 'perfect match' Hence, we need a metric for the similarity between different documents, or between a document and a query Could we simply subtract the vectors? (L1 norm) Could we use a dot product? Does normalization help? Example term counts (from: An Introduction to Information Retrieval, Cambridge UP):
Term        Sense and Sensibility   Pride and Prejudice   Wuthering Heights
affection
jealous     10                      7                     11
gossip      2                       0                     6
33 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives 34 Cosine similarity The similarity is the cosine of the angle between the document vector d_j and the query vector q: sim(q, d_j) = (d_j · q) / (|d_j| |q|) = Σ_i w_ij w_iq / (sqrt(Σ_i w_ij²) · sqrt(Σ_i w_iq²)) All weights are nonnegative; hence, 0 ≤ sim(q, d_j) ≤ 1
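A minimal sketch of this formula over dense weight vectors; the four term weights below are made up purely for illustration:

// Cosine similarity between a query vector and a document vector of term weights:
// sim(q, d) = (q . d) / (|q| * |d|). With nonnegative weights, the result is in [0, 1].
public class CosineSimilarity {
    static double cosine(double[] q, double[] d) {
        double dot = 0, qNorm = 0, dNorm = 0;
        for (int i = 0; i < q.length; i++) {
            dot += q[i] * d[i];
            qNorm += q[i] * q[i];
            dNorm += d[i] * d[i];
        }
        if (qNorm == 0 || dNorm == 0) return 0.0;
        return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
    }

    public static void main(String[] args) {
        double[] query = {0.0, 1.3, 2.0, 3.0};   // toy weights for four index terms
        double[] doc   = {0.4, 0.0, 0.4, 0.8};
        System.out.printf("%.3f%n", cosine(query, doc)); // ~0.85
    }
}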
© 2016 A. Haeberlen, Z. Ives Plan for today Information retrieval Basics Precision and recall Taxonomy of IR models Classic IR models Boolean model Vector model TF/IDF HITS and PageRank 35 University of Pennsylvania NEXT
© 2016 A. Haeberlen, Z. Ives An example Query: "The University of Pennsylvania" What would be a good match for this query? 36 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives 37 Weights in the vector model How do we compute the weights w_ij and w_iq? A good weight must take into account two effects: quantification of intra-document contents (similarity): the tf factor, the term frequency within a document quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency w_ij = tf(i,j) * idf(i)
© 2016 A. Haeberlen, Z. Ives 38 TF and IDF Factors Let: N be the total number of docs in the collection n_i be the number of docs which contain k_i freq(i,j) be the raw frequency of k_i within d_j A normalized tf factor is given by f(i,j) = a + (1 - a) * freq(i,j) / max_l freq(l,j), where the maximum is computed over all terms l which occur within the document d_j (a is usually set to 0.4 or 0.5) The idf factor is computed as idf(i) = log(N / n_i) The log is used to make the values of tf and idf comparable; it can also be interpreted as the amount of information associated with the term k_i
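A small sketch of these two factors exactly as defined above, with a = 0.5 and invented corpus statistics (natural log here; any base works as long as it is used consistently):

public class TfIdf {
    // Normalized tf: f(i,j) = a + (1-a) * freq(i,j) / max_l freq(l,j), a typically 0.4-0.5
    static double tf(double a, int freq, int maxFreqInDoc) {
        if (freq == 0) return 0.0;                 // term absent from the document
        return a + (1 - a) * (double) freq / maxFreqInDoc;
    }

    // idf(i) = log(N / n_i), where N = total docs and n_i = docs containing term k_i
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        int N = 100_000;
        // w_ij = tf(i,j) * idf(i) for a term occurring 3 times in a document whose most
        // frequent term occurs 10 times, and which appears in 1,000 documents overall
        double w = tf(0.5, 3, 10) * idf(N, 1_000);
        System.out.printf("%.3f%n", w);   // 0.65 * ln(100) ≈ 2.99
    }
}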
© 2016 A. Haeberlen, Z. Ives 39 Vector Model Example 1 (figure: documents d1-d7 plotted over the index terms k1, k2, k3) No weights Query: k1 k2 k3
© 2016 A. Haeberlen, Z. Ives 40 Vector Model Example 2 (same figure) Query weights Query: k1 k2 k3
© 2016 A. Haeberlen, Z. Ives 41 Vector Model Example 3 (same figure) Document + query weights Query: k1 k2 k3
© 2016 A. Haeberlen, Z. Ives Putting it all together: Scoring Example: Query is 'best car insurance' over a corpus of N = 100000 documents Document side: use tf weighting without idf, but with Euclidean normalization Query side: use idf weighting df and idf are computed across all documents in the corpus; tf is specific to the document being scored Net score for this document is the sum of w_t,d * w_t,q over the query terms (worked table with df, idf, tf, and weight columns for the terms auto, best, car, insurance: from An Introduction to Information Retrieval, Cambridge UP)
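A sketch of this scoring scheme with invented df and tf values (not the numbers from the book's table); to keep it short, the document vector is Euclidean-normalized over the query terms only:

import java.util.Map;

public class ScoreExample {
    public static void main(String[] args) {
        int N = 100_000;                                   // corpus size
        // Hypothetical document frequencies for the query terms
        Map<String, Integer> df = Map.of("best", 5_000, "car", 1_000, "insurance", 500);
        // Raw term frequencies of the query terms in one particular document
        Map<String, Integer> docTf = Map.of("best", 0, "car", 1, "insurance", 2);

        // Document weights: raw tf, Euclidean-normalized (here: across the query terms only)
        double norm = Math.sqrt(docTf.values().stream().mapToDouble(f -> f * f).sum());

        double score = 0;
        for (String term : df.keySet()) {
            double wtd = norm == 0 ? 0 : docTf.get(term) / norm;   // document weight (tf, normalized)
            double wtq = Math.log10((double) N / df.get(term));    // query weight (idf)
            score += wtd * wtq;
        }
        System.out.printf("score = %.2f%n", score);
    }
}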
© 2016 A. Haeberlen, Z. Ives Stop words What do we do about very common words ('the', 'of', 'is', 'may', 'a', ...)? They do not appear to be very useful in general... though they may be in phrase searches "President of the United States" "To be or not to be" We can use a stop list to remove these entirely Such lists are typically small (a few hundred terms or less) Ongoing trend is towards even smaller lists, or even no list at all (web search engines generally do not use them) 43 University of Pennsylvania
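A trivial sketch of stop-word removal with a hand-picked list (a real stop list would be chosen more carefully); it also shows why the phrase queries above break:

import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWords {
    // Tiny illustrative stop list; real lists are larger (or omitted entirely)
    static final Set<String> STOP = Set.of("the", "of", "is", "may", "a", "to", "or", "not", "be");

    static List<String> filter(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z]+"))
                .filter(t -> !t.isEmpty() && !STOP.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filter("President of the United States"));
        // [president, united, states] -- the phrase is no longer searchable verbatim
        System.out.println(filter("To be or not to be"));
        // [] -- the whole query disappears, which is why web engines avoid stop lists
    }
}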
© 2016 A. Haeberlen, Z. Ives Stemming and lemmatization What if the document contains many similar word forms? View, viewing, viewer, viewed, views, viewable,... Democracy, democratization,... Can use stemming to 'normalize' words A somewhat rough heuristic; chops off ends of words etc. Most common algorithm: Porter stemmer Far from perfect Example: Operate, operating, operates, operation, operative, operatives, operational,... are all stemmed to 'oper' Better: Use NLP tools (lemmatizer) May use a vocabulary (e.g., 'are/is/were' -> 'be') 44 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives Example: Porter stemmer Original text: "Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation" After the Porter stemmer: "such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret" Example rules: SSES → SS, IES → I, ... Entire algorithm is fairly long and complex (From: "An introduction to Information Retrieval", Cambridge University Press, page 34) 45 University of Pennsylvania
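For illustration only, a toy stemmer implementing just the two rewrite rules quoted above; the real Porter algorithm has many more rules, applied in several phases with additional conditions on each rule:

public class TinyStemmer {
    // Applies only the two example rules from the slide (SSES -> SS, IES -> I)
    static String stem(String word) {
        if (word.endsWith("sses")) return word.substring(0, word.length() - 2); // SSES -> SS
        if (word.endsWith("ies"))  return word.substring(0, word.length() - 2); // IES  -> I
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("caresses")); // caress
        System.out.println(stem("ponies"));   // poni
        System.out.println(stem("operate"));  // operate (untouched by these two rules)
    }
}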
© 2016 A. Haeberlen, Z. Ives 46 Summary: Vector Model The best term-weighting schemes use tf-idf weights: w_ij = f(i,j) * log(N / n_i) For the query term weights, a suggestion is w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i) This model is very good in practice: tf-idf works well with general collections Simple and fast to compute Vector model is usually as good as the known ranking alternatives
© 2016 A. Haeberlen, Z. Ives 47 Advantages: Term-weighting improves quality of the answer set Partial matching allows retrieval of docs that approximate the query conditions Cosine ranking formula sorts documents according to degree of similarity to the query Disadvantages: Assumes independence of index terms; not clear if this is a good or bad assumption Pros & Cons of the vector model
© 2016 A. Haeberlen, Z. Ives 48 Comparison of classic models Boolean model does not provide for partial matches and is considered to be the weakest classic model Some experiments indicate that the vector model outperforms the third alternative, the probabilistic model, in general IR research has focused on improving probabilistic models for some time – but these haven’t made their way to Web search Generally we use a variation of the vector model in most text search systems
© 2016 A. Haeberlen, Z. Ives Further reading "An Introduction to Information Retrieval" Christopher D. Manning, Prabhakar Raghavan, Hinrich Schuetze; Cambridge University Press, 2009 Available online as a PDF Contains more details on many topics covered in this lecture Examples: Scoring, tokenization, lemmatization, ... If you're the ranking expert in your final project team, you should have a look! ... and possibly even if you're not (interesting!) 49 University of Pennsylvania