Quality of a search engine
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading 8
Is it good?
How fast does it index: number of documents per hour (at an average document size).
How fast does it search: latency as a function of index size.
Expressiveness of the query language.
Measures for a search engine
All of the preceding criteria are measurable. The key measure, though, is user happiness: useless answers won’t make a user happy.
Happiness: elusive to measure
The commonest approach measures the relevance of search results. How do we measure it? It requires three elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of Relevant or Irrelevant for each query-doc pair
Evaluating an IR system
Standard benchmarks exist. TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years. Other document collections are marked by human experts, who judge each query-doc pair Relevant or Irrelevant. On the Web everything is more complicated, since we cannot mark the entire corpus!
General scenario
[Figure: a Venn diagram of the Relevant and Retrieved document sets, overlapping inside the whole collection.]
Precision vs. Recall
Precision: the percentage of retrieved docs that are relevant (the issue: how much “junk” is found).
Recall: the percentage of relevant docs that are retrieved (the issue: how much of the “info” is found).
How to compute them
Precision: fraction of retrieved docs that are relevant. Recall: fraction of relevant docs that are retrieved.

                 Relevant                Not relevant
Retrieved        tp (true positive)      fp (false positive)
Not retrieved    fn (false negative)     tn (true negative)

Precision P = tp / (tp + fp)
Recall R = tp / (tp + fn)
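A minimal sketch in Python of these two formulas, under assumed inputs: `retrieved` and `relevant` are hypothetical sets of document IDs for one query.

```python
# A minimal sketch: precision and recall from binary judgments.
# `retrieved` and `relevant` are hypothetical sets of document IDs.

def precision_recall(retrieved, relevant):
    tp = len(retrieved & relevant)   # relevant docs we returned
    fp = len(retrieved - relevant)   # junk we returned
    fn = len(relevant - retrieved)   # relevant docs we missed
    p = tp / (tp + fp) if retrieved else 0.0
    r = tp / (tp + fn) if relevant else 0.0
    return p, r

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
print(precision_recall(retrieved, relevant))  # (0.5, 0.666...)
```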
Some considerations
You can get high recall (but low precision) by retrieving all docs for all queries! Recall is a non-decreasing function of the number of docs retrieved, whereas precision usually decreases with it.
Precision-Recall curve
We measure precision at various levels of recall. Note: the curve is an AVERAGE over many queries.
[Plot: precision on the y-axis versus recall on the x-axis, one point per recall level.]
A common picture
[Plot: the typical averaged precision-recall curve; precision falls as recall grows.]
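As a sketch of where such a curve comes from, the snippet below computes the precision-recall points of one ranked result list (the ranking and the judged set are made up); averaging these curves over many queries gives the picture above.

```python
# A minimal sketch of a precision-recall curve for one ranked list.
# `ranking` is a hypothetical ranked list of doc IDs; `relevant` is
# the judged relevant set for the query.

def pr_curve(ranking, relevant):
    points, hits = [], 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision = hits / k              # precision at cutoff k
            recall = hits / len(relevant)     # recall at cutoff k
            points.append((recall, precision))
    return points

print(pr_curve(["d3", "d1", "d9", "d2", "d7"], {"d1", "d2", "d7"}))
# [(0.333..., 0.5), (0.666..., 0.5), (1.0, 0.6)]
```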
F measure
Combined measure (weighted harmonic mean):
1/F = α (1/P) + (1 − α) (1/R)
People usually use the balanced F1 measure, i.e., the one with α = ½, thus 1/F1 = ½ (1/P + 1/R), that is F1 = 2PR / (P + R). Use this if you need to optimize a single measure that balances precision and recall.
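A small sketch completing the previous one: the balanced F1 from a precision and recall pair.

```python
# A minimal sketch: balanced F1 = 2PR / (P + R), the harmonic mean
# of the precision and recall values computed earlier.

def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.5, 2 / 3))  # ~0.571
```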
Recommendation systems
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Recommendations
We have a list of restaurants, with ratings from some users for some of them. Which restaurant(s) should we recommend to Dave?
Basic Algorithm
Recommend the most popular restaurants, say those maximizing (# positive votes − # negative votes). But what if Dave does not like Spaghetti?
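A sketch of this popularity baseline, with a made-up vote table (only Straits Cafe appears on the slides; the other name and all values are illustrative):

```python
# A minimal sketch of the popularity baseline: score each restaurant
# by (# positive votes - # negative votes). `votes` is a hypothetical
# table mapping restaurant -> list of +1/-1 ratings.

votes = {
    "Straits Cafe": [+1, +1, +1, -1],
    "Spaghetti House": [+1, -1, -1],
}

scores = {r: sum(v) for r, v in votes.items()}
best = max(scores, key=scores.get)
print(best, scores)  # Straits Cafe {'Straits Cafe': 2, 'Spaghetti House': -1}
```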
Smart Algorithm
Basic idea: find the person “most similar” to Dave according to cosine similarity (here, Estie), and then recommend something this person likes: perhaps recommend Straits Cafe to Dave. But do you want to rely on one person’s opinions?
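A minimal sketch of this idea with made-up ±1 ratings (the rating values and the restaurant "Il Fornaio" are assumptions, not from the slides): find the user with the highest cosine similarity to Dave, then suggest that user's liked, unseen items.

```python
# A minimal sketch of user-based recommendation via cosine similarity.
# `ratings` is a hypothetical user -> {restaurant: +1/-1} table.
import math

ratings = {
    "Dave":  {"Spaghetti House": -1, "Il Fornaio": +1},
    "Estie": {"Spaghetti House": -1, "Il Fornaio": +1, "Straits Cafe": +1},
    "Tim":   {"Spaghetti House": +1, "Straits Cafe": -1},
}

def cosine(u, v):
    common = set(u) & set(v)
    dot = sum(u[r] * v[r] for r in common)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Pick the user most similar to Dave, then suggest what they liked
# and Dave has not rated yet.
peer = max((p for p in ratings if p != "Dave"),
           key=lambda p: cosine(ratings["Dave"], ratings[p]))
suggest = [r for r, v in ratings[peer].items()
           if v > 0 and r not in ratings["Dave"]]
print(peer, suggest)  # Estie ['Straits Cafe']
```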
Main idea
[Figure: a bipartite graph connecting users U, V, W, Y to items d1–d7; users who link to overlapping items are considered similar.] What do we suggest to U?
A glimpse at XML retrieval (eXtensible Markup Language)
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading 10
XML vs HTML
HTML is a markup language for a specific purpose (display in browsers); XML is a framework for defining markup languages. HTML has a fixed set of markup tags, XML does not. HTML can be formalized as an XML language (XHTML).
XML Example (visual)
[Figure: a rendered, tree-shaped view of the XML example on the next slide.]
XML Example (textual)
<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
</chapter>
Basic Structure
An XML doc is an ordered, labeled tree:
- character data: leaf nodes contain the actual data (text strings);
- element nodes: each labeled with a name (often called the element type) and a set of attributes (each a name and a value); they can have child nodes.
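A short sketch that walks such a tree with Python's standard library, using the chapter example from the earlier slide:

```python
# A minimal sketch: traversing the ordered, labeled tree of an XML doc.
import xml.etree.ElementTree as ET

doc = """<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the
  <tm>FileCab</tm>inet application.</para>
</chapter>"""

def walk(elem, depth=0):
    # element node: a name (tag) plus name/value attributes
    print("  " * depth, elem.tag, elem.attrib)
    if elem.text and elem.text.strip():
        # character data directly under this element
        # (tail text after child elements is omitted for brevity)
        print("  " * (depth + 1), repr(elem.text.strip()))
    for child in elem:
        walk(child, depth + 1)

walk(ET.fromstring(doc))
```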
XML: Design Goals
- Separate syntax from semantics to provide a framework for structuring information
- Allow tailor-made markup for any imaginable application domain
- Support internationalization (Unicode) and platform independence
- Be the standard for (semi-)structured information (doing some of the work now done by databases)
Why Use XML?
It represents semi-structured data: XML is more flexible than DBs yet more structured than simple IR, and you get a massive infrastructure for free.
Data-centric vs. text-centric XML
Data-centric XML is used for messaging between enterprise applications; it is mainly a recasting of relational data. Text-centric XML is used for annotating content: it is rich in text and demands good integration of text-retrieval functionality. E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price.
IR Challenges in XML
- There is no document unit in XML: how do we compute tf and idf, and at what indexing granularity?
- We need to go back to the document for retrieving or displaying a fragment. E.g., give me the Abstracts of Papers on existentialism.
- We need to identify similar elements in different schemas. Example: employee.
XQuery: SQL for XML?
- Simple attribute/value: /play/title contains “hamlet”
- Path queries: title contains “hamlet”; /play//title contains “hamlet”
- Complex graphs: employees with two managers
But what about relevance ranking?
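A small sketch of the difference between the two path queries, using the limited XPath subset in Python's standard library (the play document is made up):

```python
# A minimal sketch: /play/title vs /play//title over a toy document.
import xml.etree.ElementTree as ET

play = ET.fromstring(
    "<play><title>Hamlet</title>"
    "<act><scene><title>Hamlet meets the Ghost</title></scene></act>"
    "</play>"
)

# /play/title contains "hamlet": only the direct child title
print([t.text for t in play.findall("title")
       if "hamlet" in t.text.lower()])
# ['Hamlet']

# /play//title contains "hamlet": titles at any depth under play
print([t.text for t in play.findall(".//title")
       if "hamlet" in t.text.lower()])
# ['Hamlet', 'Hamlet meets the Ghost']
```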
Data structures for XML retrieval
Inverted index: give me all elements matching text query Q. We know how to do this: treat each element as a document. Harder: give me all elements below any instance of the Book element (the parent/child relationship alone is not enough).
Positional containment
Doc:1: the Play element spans positions (27, 1122) and (2033, 5790); the Verse element spans (431, 867); the term droppeth occurs at position 720. So droppeth is under Verse, which is under Play. Containment can be viewed as merging postings.
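A sketch of that merge: element postings are (start, end) position ranges, term postings are positions, and containment keeps the occurrences that fall inside some range. The numbers are the ones on the slide.

```python
# A minimal sketch of containment as a postings merge. Both postings
# lists are assumed sorted by position, as in an inverted index.

def contained(term_positions, element_ranges):
    out, i = [], 0
    for pos in sorted(term_positions):
        # advance past element ranges that end before this position
        while i < len(element_ranges) and element_ranges[i][1] < pos:
            i += 1
        if i < len(element_ranges) and element_ranges[i][0] <= pos:
            out.append(pos)
    return out

play = [(27, 1122), (2033, 5790)]
verse = [(431, 867)]
droppeth = [720]

under_verse = contained(droppeth, verse)   # [720]
print(contained(under_verse, play))        # [720]: under Verse under Play
```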
Summary of data structures
Path containment and the like can essentially be solved by positional inverted indexes: retrieval consists of “merging” postings, and all the compression tricks are still applicable. Complications arise from the insertion/deletion of elements and of text within elements, which is beyond the scope of this course.
Search Engines: Advertising
Classic approach…
Socio-demographic, geographic, and contextual targeting.
Search engines vs advertisement
- First generation: use only on-page, web-text data (word frequency and language).
- Second generation: use off-page, web-graph data; link (or connectivity) analysis; anchor text (how people refer to a page).
- Third generation: answer “the need behind the query”; focus on the “user need” rather than on the query itself; integrate multiple data sources, such as click-through data.
Pure search vs paid search: ads are shown next to search results (to whoever pays more), as in Goto/Overture; around 2003 Google and Yahoo adopted the new model. All players now have a search engine plus an advertising platform and network.
The new scenario
Search engines make possible the aggregation of interests and an unlimited selection (Amazon, Netflix, ...), giving incentives to specialized niche players. The biggest money is in the smallest sales!!
Two new approaches
- Sponsored search: ads driven by the search keywords (and the profile of the user issuing them). This is AdWords.
- Context match: ads driven by the content of a web page (and the profile of the user reaching that page). This is AdSense.
How does it work?
1) Match ads to the query or page content (an IR problem)
2) Order the ads
3) Price on a click-through basis (an economics problem)
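As a hedged sketch of steps 2 and 3 (one common scheme, an assumption rather than necessarily what the slide refers to): order ads by bid times estimated click-through rate, and charge a clicked ad the least amount that would have kept its slot, in the spirit of a generalized second-price auction.

```python
# A hedged sketch, not a definitive implementation: rank ads by
# bid * estimated CTR, then charge each clicked ad the minimum bid
# that would still beat the next ad's score (second-price flavor).
# All bids and CTRs below are made-up numbers.

ads = [  # (advertiser, bid per click in $, estimated CTR)
    ("A", 2.00, 0.05),
    ("B", 1.50, 0.10),
    ("C", 1.00, 0.08),
]

ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)
for (name, bid, ctr), nxt in zip(ranked, ranked[1:] + [None]):
    # price: just enough to match the next ad's score, or a floor price
    price = nxt[1] * nxt[2] / ctr if nxt else 0.10
    print(f"{name}: pays ${price:.2f} per click (bid ${bid:.2f})")
# B: pays $1.00 per click (bid $1.50)
# A: pays $1.60 per click (bid $2.00)
# C: pays $0.10 per click (bid $1.00)
```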
Web usage data!!!
Visited pages, clicked banners, web searches, and clicks on search results.
Dictionary problem
A new game
For advertisers: which words to buy and how much to pay; SPAM becomes an economic activity.
For search-engine owners: how to price the words and how to find the right ad; keyword suggestion, geo-coding, business control, language restriction, proper ad display. This is similar to web searching, but the ad DB is smaller, ad items are small pages, and ranking depends on clicks.