Quality of a search engine
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading 8

Is it good?
How fast does it index: number of documents/hour (at some average document size)
How fast does it search: latency as a function of index size
Expressiveness of the query language

Measures for a search engine
All of the preceding criteria are measurable.
The key measure, however, is user happiness: useless answers won't make a user happy.

Happiness: elusive to measure
The commonest approach is to measure the relevance of the search results.
How do we measure relevance? It requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment, Relevant or Irrelevant, for each query-doc pair

Evaluating an IR system
Standard benchmarks. TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years.
Other doc collections: marked by human experts, for each query and for each doc, as Relevant or Irrelevant.
On the Web everything is more complicated, since we cannot mark the entire corpus!!

General scenario
[figure: Venn diagram of the Relevant and the Retrieved document sets within the whole collection]

Precision vs. Recall
Precision: % of retrieved docs that are relevant [the issue: how much "junk" is returned]
Recall: % of relevant docs that are retrieved [the issue: how much of the relevant "info" is found]
[figure: the same Relevant/Retrieved Venn diagram]

How to compute them
Precision: fraction of retrieved docs that are relevant
Recall: fraction of relevant docs that are retrieved

                 Relevant              Not relevant
  Retrieved      tp (true positive)    fp (false positive)
  Not retrieved  fn (false negative)   tn (true negative)

Precision P = tp / (tp + fp)
Recall    R = tp / (tp + fn)
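A minimal sketch of the two formulas in Python; the function names and the example counts are illustrative, not from the slides:

    # Precision and recall from the confusion-matrix counts above.
    def precision(tp, fp):
        return tp / (tp + fp) if tp + fp else 0.0

    def recall(tp, fn):
        return tp / (tp + fn) if tp + fn else 0.0

    # Example: 30 relevant docs retrieved, 20 junk docs retrieved,
    # 10 relevant docs missed.
    tp, fp, fn = 30, 20, 10
    print(precision(tp, fp))  # 0.6
    print(recall(tp, fn))     # 0.75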

Some considerations
You can get high recall (but low precision) by retrieving all docs for all queries!
Recall is a non-decreasing function of the number of docs retrieved.
Precision, instead, usually decreases as more docs are retrieved.

Precision-Recall curve
We measure Precision at various levels of Recall.
Note: the curve is an AVERAGE over many queries.
[plot: precision vs. recall, with a few measured points]
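A sketch of how the per-query points of such a curve can be computed from a ranked result list; the relevance flags below are made up for illustration, and the points of many queries would then be averaged at fixed recall levels:

    # Collect (recall, precision) points while scanning a ranked list.
    def pr_points(ranked_rel_flags, total_relevant):
        points, tp = [], 0
        for k, rel in enumerate(ranked_rel_flags, start=1):
            if rel:
                tp += 1
                points.append((tp / total_relevant, tp / k))  # (recall, precision)
        return points

    # One query: 1 = relevant, 0 = irrelevant; 4 relevant docs exist overall.
    print(pr_points([1, 0, 1, 1, 0, 0, 1], total_relevant=4))
    # -> [(0.25, 1.0), (0.5, 0.67), (0.75, 0.75), (1.0, 0.57)] approximately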

A common picture
[plot: a typical averaged precision-recall curve, with precision decreasing as recall increases]

F measure
Combined measure (weighted harmonic mean): 1/F = α (1/P) + (1 − α) (1/R).
People usually use the balanced F1 measure, i.e., α = ½, thus 1/F = ½ (1/P + 1/R), that is F1 = 2PR / (P + R).
Use this if you need to optimize a single measure that balances precision and recall.
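The same formula as a small Python function; the sample values continue the precision/recall example above:

    # Weighted harmonic mean: 1/F = alpha * (1/P) + (1 - alpha) * (1/R).
    def f_measure(p, r, alpha=0.5):
        if p == 0 or r == 0:
            return 0.0
        return 1.0 / (alpha / p + (1 - alpha) / r)

    # Balanced F1 (alpha = 1/2) reduces to 2PR / (P + R).
    print(f_measure(0.6, 0.75))  # 0.667 approximately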

Recommendation systems
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa

Recommendations
We have a list of restaurants, with positive and negative ratings from some people.
Which restaurant(s) should I recommend to Dave?

Basic Algorithm
Recommend the most popular restaurants, say by # positive votes minus # negative votes.
But what if Dave does not like Spaghetti?
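A toy version of this popularity baseline; the ratings dictionary and the restaurant names are made up for illustration:

    from collections import Counter

    # ratings[person][restaurant] is +1 (liked) or -1 (disliked).
    ratings = {
        "Estie": {"Straits Cafe": 1, "Zao": -1},
        "Dahlia": {"Zao": 1, "Ming's": 1},
        "Dave": {"Ming's": -1},
    }

    score = Counter()  # popularity = #positive votes - #negative votes
    for person in ratings.values():
        for restaurant, vote in person.items():
            score[restaurant] += vote
    print(score.most_common())  # most popular first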

Smart Algorithm
Basic idea: find the person "most similar" to Dave according to cosine similarity (i.e., Estie), and then recommend something this person likes.
Perhaps recommend Straits Cafe to Dave.
But do you want to rely on one person's opinions?
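A minimal sketch of this collaborative step, again with made-up ratings (+1 = liked, -1 = disliked):

    import math

    ratings = {
        "Dave":   {"Zao": 1, "Ming's": -1},
        "Estie":  {"Zao": 1, "Ming's": -1, "Straits Cafe": 1},
        "Dahlia": {"Zao": -1, "Straits Cafe": 1},
    }

    def cosine(u, v):
        # dot product over commonly rated items, normalized by vector lengths
        dot = sum(u[k] * v[k] for k in set(u) & set(v))
        norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
        return dot / (norm(u) * norm(v))

    me = ratings["Dave"]
    best = max((p for p in ratings if p != "Dave"),
               key=lambda p: cosine(me, ratings[p]))
    print(best)  # Estie
    print([r for r, v in ratings[best].items() if v > 0 and r not in me])
    # ['Straits Cafe']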

Main idea
[figure: bipartite graph linking users U, V, W, Y to documents d1-d7]
What do we suggest to U?

A glimpse at XML retrieval (eXtensible Markup Language)
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading 10

XML vs HTML
HTML is a markup language for a specific purpose (display in browsers).
XML is a framework for defining markup languages.
HTML has a fixed set of markup tags; XML does not.
HTML can be formalized as an XML language (XHTML).

XML Example (visual)

XML Example (textual)
FileCab
This chapter describes the commands that manage the FileCabinet application.

Basic Structure
An XML doc is an ordered, labeled tree:
character data: leaf nodes contain the actual data (text strings)
element nodes: each labeled with a name (often called the element type) and a set of attributes, each consisting of a name and a value; they can have child nodes
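A sketch of this tree view in Python, re-creating the FileCab example above with plausible markup (the tag names and the id attribute are assumptions; the transcript lost the original tags):

    import xml.etree.ElementTree as ET

    doc = ET.fromstring(
        '<chapter id="cmds"><chaptitle>FileCab</chaptitle>'
        '<para>This chapter describes the commands that manage the '
        '<tt>FileCab</tt>inet application.</para></chapter>')

    def walk(elem, depth=0):
        # element node: a name (tag) plus name/value attributes
        print("  " * depth + elem.tag, elem.attrib)
        if elem.text and elem.text.strip():
            # character data: the actual text at the leaves
            print("  " * (depth + 1) + repr(elem.text.strip()))
        for child in elem:
            walk(child, depth + 1)

    walk(doc)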

XML: Design Goals
Separate syntax from semantics, to provide a framework for structuring information
Allow tailor-made markup for any imaginable application domain
Support internationalization (Unicode) and platform independence
Be the standard for (semi)structured information (do some of the work now done by databases)

Why Use XML?
To represent semi-structured data
XML is more flexible than DBs
XML is more structured than simple IR
You get a massive infrastructure for free

Data vs. Text-centric XML
Data-centric XML: used for messaging between enterprise applications; mainly a recasting of relational data.
Text-centric XML: used for annotating content; rich in text; demands good integration of text-retrieval functionality.
E.g., find me the ISBNs of Book elements with at least three Chapter elements discussing cocoa production, ranked by Price.

IR Challenges in XML
There is no document unit in XML: how do we compute tf and idf? What indexing granularity?
Need to go to the document for retrieving or displaying a fragment. E.g., give me the Abstract elements of Paper elements on existentialism.
Need to identify similar elements in different schemas. Example: employee.
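A tiny illustration of the tf/granularity problem, with made-up XML: if every element is a candidate retrieval unit, the same term occurrence is counted by the element and by all of its ancestors:

    import xml.etree.ElementTree as ET
    from collections import Counter

    doc = ET.fromstring("<book><chapter>cocoa cocoa beans</chapter>"
                        "<chapter>cocoa production</chapter></book>")

    for elem in doc.iter():
        # term frequency over the text of the element's whole subtree
        tf = Counter(" ".join(elem.itertext()).split())
        print(elem.tag, tf["cocoa"])
    # book 3, chapter 2, chapter 1: the same occurrences show up at
    # every ancestor, so tf/idf needs a notion of indexing unit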

XQuery: SQL for XML?
Simple attribute/value: /play/title contains "hamlet"
Path queries: title contains "hamlet"; /play//title contains "hamlet"
Complex graphs: employees with two managers
What about relevance ranking?
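The two path queries, sketched with the limited XPath support in Python's standard library (the toy play document is an assumption):

    import xml.etree.ElementTree as ET

    play = ET.fromstring(
        "<play><title>Hamlet</title>"
        "<act><scene><title>A hall in the castle</title></scene></act></play>")

    # /play/title : a title that is a direct child of play
    print([t.text for t in play.findall("title")])     # ['Hamlet']
    # /play//title : a title anywhere below play
    print([t.text for t in play.findall(".//title")])  # both titles
    # the 'contains' predicate is done in application code here:
    print([t.text for t in play.findall(".//title")
           if "hamlet" in t.text.lower()])             # ['Hamlet']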

Data structures for XML retrieval
Inverted index: give me all elements matching text query Q. We know how to do this: treat each element as a document.
Give me all elements below any instance of the Book element. (The parent/child relationship is not enough.)

Positional containment
[figure: the positional extents of a Play element and of a Verse element inside it, with the term droppeth at position 720]
Query: droppeth under Verse under Play.
Containment can be viewed as merging postings.
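A sketch of the merge, with element postings stored as (start, end) position ranges and term postings as positions; all numbers are illustrative:

    # "droppeth under Verse under Play": keep a position only if some
    # Verse range contains it and some Play range contains that Verse.
    play_ranges  = [(0, 1000)]
    verse_ranges = [(700, 740), (800, 850)]
    droppeth     = [720, 990]

    def under(inner, outer):
        # keep the inner ranges fully contained in some outer range
        return [(s, e) for (s, e) in inner
                if any(os <= s and e <= oe for (os, oe) in outer)]

    verses_in_play = under(verse_ranges, play_ranges)
    hits = [p for p in droppeth
            if any(s <= p <= e for (s, e) in verses_in_play)]
    print(hits)  # [720]; 990 is inside the Play but not inside any Verse

A real system would walk the sorted postings lists in lockstep, as in an ordinary postings merge, rather than scanning with any().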

Summary of data structures
Path containment etc. can essentially be solved by positional inverted indexes.
Retrieval consists of "merging" postings.
All the compression tricks are still applicable.
Complications arise from insertion/deletion of elements, and from text within elements: beyond the scope of this course.

Search Engines Advertising

Classic approach…
Socio-demographic targeting
Geographic targeting
Contextual targeting

Search Engines vs Advertisement
First generation: use only on-page, web-text data (word frequency and language).
Second generation: use off-page, web-graph data: link (or connectivity) analysis, anchor text (how people refer to a page).
Third generation: answer "the need behind the query": focus on the user need rather than on the query, integrate multiple data sources, use click-through data.
Pure search vs. paid search: ads shown with search results to whoever pays more (Goto/Overture); 2003: the Google/Yahoo new model. All players now have: SE, Adv platform + network.

The new scenario
SEs make possible: aggregation of interests, unlimited selection (Amazon, Netflix, ...).
Incentives for specialized niche players.
The biggest money is in the smallest sales!!

Two new approaches
Sponsored search: ads driven by the search keywords (and the profile of the user issuing them): AdWords.
Context match: ads driven by the content of a web page (and the profile of the user reaching that page): AdSense.

How does it work?
1) Match ads to the query or to the page content
2) Order the ads
3) Price on a click-through
The matching is an IR problem; the pricing is an economics problem.
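The slides do not spell out the auction; a common scheme in sponsored search ranks ads by bid times estimated click-through rate and charges, per click, roughly the least amount that keeps the ad in its slot (a generalized second-price flavor). A toy sketch with made-up numbers:

    # (ad id, bid per click, estimated CTR)
    ads = [("A", 0.50, 0.10), ("B", 0.40, 0.20), ("C", 1.00, 0.04)]

    # Order the ads by expected revenue: bid * CTR.
    ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)

    for (aid, bid, ctr), nxt in zip(ranked, ranked[1:] + [None]):
        # pay per click just enough to match the score of the next ad
        price = (nxt[1] * nxt[2]) / ctr if nxt else 0.0
        print(aid, round(price, 2))
    # B 0.25, A 0.4, C 0.0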

Web usage data!!!
Visited pages, clicked banners, web searches, clicks on search results.

Dictionary problem

A new game
For advertisers: what words to buy, and how much to pay. SPAM is an economic activity.
For search engine owners: how to price the words and how to find the right ad: keyword suggestion, geo-coding, business control, language restriction, proper ad display.
Similar to web searching, but: the Ad-DB is smaller, ad items are small pages, and ranking depends on clicks.