Ranking CSCI 572: Information Retrieval and Search Engines Summer 2010.

May-20-10 CS572-Summer2010 CAM-2

Outline
- Information Retrieval
- Ranking Approaches
- Challenges

You’ve found some data… now what?
- What order should it be delivered back to the user?
- Using a database and SQL this is easy
  - You only include those rows (results) that exactly match your given query
  - Example: select first_name, last_name from Persons where last_name LIKE '%Mattmann%'
      first_name | last_name
      Chris      | Mattmann
      Joe        | Mattmann
  - Order is unspecified (effectively arbitrary) unless you specify an ORDER BY

Ordering your results
- Example: select first_name, last_name from Persons where last_name LIKE '%Mattmann%' ORDER BY first_name DESC
      first_name | last_name
      Joe        | Mattmann
      Chris      | Mattmann
- Problems
  - Rigidity: hard to control the ordering at a fine-grained level (only a coarse-grained ability to sort on attributes)
  - Boolean: ranking is defined only on those results that exactly match
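The two SQL examples above can be tried out with a small sketch like the following. It uses Python's built-in sqlite3 module and an in-memory table; the helper name fetch_people and the extra non-matching row ("Ann", "Smith") are assumptions added for illustration and are not part of the slides.

```python
import sqlite3


def fetch_people(order_by_first_name_desc: bool):
    """Run the slide's query against a throwaway in-memory Persons table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Persons (first_name TEXT, last_name TEXT)")
    conn.executemany(
        "INSERT INTO Persons VALUES (?, ?)",
        # The third row is a hypothetical non-matching record.
        [("Chris", "Mattmann"), ("Joe", "Mattmann"), ("Ann", "Smith")],
    )
    sql = (
        "SELECT first_name, last_name FROM Persons "
        "WHERE last_name LIKE '%Mattmann%'"
    )
    if order_by_first_name_desc:
        # Without this clause the database is free to return rows in any order.
        sql += " ORDER BY first_name DESC"
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows
```

Calling fetch_people(True) returns the two Mattmann rows sorted by first name in descending order. Calling fetch_people(False) returns the same two rows, but the order should not be relied upon.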

Information Retrieval
- Queries are a bit more flexible
  - Can specify terms to include (or exclude)
  - Often evaluation of keyword queries is OR-based for inclusion of more results, and refined via ranking
- Notion of relative importance
  - Partial matches, with lower score
  - Closer, more accurate matches with higher score
  - Everything else, in-between
- Effective in exploration of data rather than reporting or transaction-based querying
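One way to picture OR-based evaluation refined by ranking is the sketch below. The scoring rule (the fraction of distinct query terms found in a document) and the function name rank_or_query are assumptions chosen for illustration; real engines use richer scores.

```python
def rank_or_query(query, docs):
    """Return (doc_id, score) pairs for documents matching ANY query term.

    Score is the fraction of distinct query terms present in the document,
    so partial matches score lower and closer matches score higher.
    """
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        matched = terms & words
        if matched:  # OR semantics: one matching term is enough to be included
            scored.append((doc_id, len(matched) / len(terms)))
    # Best score first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored
```

For a hypothetical collection with one document containing both query terms, one containing a single term, and one containing neither, the first ranks at the top, the second below it, and the third is left out.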

The Notion of Score
- …means different things to different people
- How to score web pages?
  - Entirely based on link structure
  - Entirely based on page contents/structure
  - Hybrid mixes of the two, e.g., Saxena, Gupta et al.
- What types of pages do you care about seeing first?
  - It’s a difficult question to answer for you, let alone all the users of the Internet!

Link Structure Scoring Models
- Focus on the way that pages reference one another
- Inspired by academic research
- In large part ignore the internal structure of the page
- Most scoring techniques can be computed offline and are based on the value of the web graph collected

Web Ranking Models*
- PageRank
  - Popularized by Google
  - A page’s importance is derived from its incoming and outgoing links
    - Independent of query
    - Susceptible to trickery involving mocked-up page importance and citation trails
  - Compute rank at index time
- HITS (Hyperlink-Induced Topic Search)
  - Jon Kleinberg
  - Compute Hubs and Authorities
  - Compute rank at query time
*Great talk on this from Andrzej Bialecki, see: http://bit.ly/agiEu3
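As a rough illustration of the two models named above, here is a minimal sketch of PageRank by power iteration and of the HITS hub/authority updates. The graph representation (a dict from page to a list of pages it links to, with every link target also present as a key), the damping value 0.85, and the fixed iteration counts are assumptions for this example, not details from the slides.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Query-independent rank: each page splits its rank among its out-links."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            # A page without out-links spreads its rank over all pages.
            targets = out_links if out_links else nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def hits(graph, iterations=50):
    """Return (hub, authority) scores; normally run on a query-specific subgraph."""
    nodes = list(graph)
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking to a page.
        auth = {node: 0.0 for node in nodes}
        for node, out_links in graph.items():
            for target in out_links:
                auth[target] += hub[node]
        # Hub: sum of the authority scores of the pages a page links to.
        hub = {node: sum(auth[t] for t in graph[node]) for node in nodes}
        # Normalize both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for node in scores:
                scores[node] /= norm
    return hub, auth
```

PageRank can be computed once over the whole crawled graph, which is why it fits index time, while HITS is typically run on the subgraph around the results of a query, which is why it is computed at query time.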

Web Ranking Models
- OPIC (Online Page Importance Computation)
  - Distribute “cash” to each page or node in the web graph
  - See how cash changes
    - Since last crawl
    - Since we have run the algorithm
  - Constantly redistribute cash to outgoing pages and reduce each origin page’s cash to 0
  - Rewards maintenance of links to pages
- TrustRank
  - Start only from a set of seed pages trusted by experts, and stay within the outgoing links from those
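The cash idea behind OPIC can be sketched as follows. This is a simplified illustration under stated assumptions, not the published algorithm in full: pages are visited round-robin, every page is assumed to have at least one out-link, and importance is estimated as each page's share of all cash seen so far. The function name opic and the rounds parameter are hypothetical.

```python
def opic(graph, rounds=100):
    """Estimate page importance by repeatedly passing "cash" along out-links.

    graph maps each page to a non-empty list of pages it links to.
    """
    nodes = list(graph)
    cash = {node: 1.0 / len(nodes) for node in nodes}
    history = {node: 0.0 for node in nodes}
    for _ in range(rounds):
        for node in nodes:  # round-robin visiting order (an assumption)
            amount = cash[node]
            history[node] += amount  # remember how much cash passed through
            cash[node] = 0.0  # the origin page's cash drops to 0
            share = amount / len(graph[node])
            for target in graph[node]:
                cash[target] += share  # redistribute to the outgoing pages
    total = sum(history.values()) + sum(cash.values())
    return {node: (history[node] + cash[node]) / total for node in nodes}
```

Because the history only ever accumulates, the estimate can keep being refined as the crawl continues, without a separate offline pass over the whole graph.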

Page Structure Scoring Models
- Keyword optimization
  - Meta keywords can influence overall ranking
    - Not just limited to HTML anymore
  - Title
  - Other factors (domain name)
- Social factors
  - Who is being referenced and mentioned within the page
- TFIDF
  - Basic term frequency weighted by inverse document frequency
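A minimal sketch of the TFIDF weighting mentioned above, assuming whitespace tokenization, term frequency normalized by document length, and idf = log(N / df). Many variants of the formula exist; this is one common textbook form, and the function name tf_idf is chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} for a dict of doc_id -> text."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))
    weights = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        weights[doc_id] = {
            # Frequent in this document, rare in the collection => high weight.
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights
```

A term that appears in every document gets weight 0 under this form, since log(N / N) = 0, which is how very common words are kept from dominating the score.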

What does Google use?
- Combination of both approaches
- Can’t know for sure

What does Lucene/Solr use?
- Open source
- Exposes underlying ranking model of Lucene
  - Allows for boost values
    - Set at indexing time
    - Set at query time
  - Score is computed based on boosts and on the TFIDF model
  - Example: social_service:"Medicaid Applications"^200 AND zipcode:90042
    - Each time "Medicaid Applications" hits, the TFIDF score increases; coupled with the boost factor, this makes that term heavily weighted

What does Lucene/Solr use?
- Allowing for index-time scoring
  - Affords link-graph based ranking
  - Affords ranking based on content
- Query-time scoring allows users to indicate their relative emphasis on important fields
  - What happens if the text "medicaid applications" matches in the service name field AND ALSO in the service description field AND ALSO in the service aliases field?
    - A user can say that service name field matches are more important than matches in other fields
    - You can’t do this per se with Google
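The effect of per-field emphasis at query time can be illustrated with the toy scorer below. It is not Lucene's scoring formula: it simply counts phrase occurrences per field and multiplies by a user-supplied boost. The field names (service_name, service_description) and the function name boosted_score are hypothetical.

```python
def boosted_score(doc, query, boosts):
    """Score a document as sum(boost[field] * phrase hits in that field).

    doc maps field names to text; boosts maps field names to multipliers,
    with fields missing from boosts treated as having a boost of 1.0.
    """
    query = query.lower()
    score = 0.0
    for field, text in doc.items():
        hits = text.lower().count(query)
        score += boosts.get(field, 1.0) * hits
    return score
```

With a boost of, say, 10 on the service name field, a document matching in its name outranks one that matches only in its description, while with no boosts the two score the same.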

Challenges
- Link-graph PageRank is computationally intensive
  - Billions of links
  - …but typically fairly accurate
- Page-content-based mechanisms are computationally efficient
  - But suffer from local maxima
  - And are typically focused on a single user community and its definition of importance
  - Can be fooled with less effort
- Combining the two approaches leads to accuracy, but at computational cost
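One simple way to combine the two kinds of score is a weighted sum, sketched below. The linear form, the alpha parameter, and the assumption that both inputs are already normalized to the range 0 to 1 are illustrative choices, not something the slides prescribe.

```python
def hybrid_rank(link_scores, content_scores, alpha=0.5):
    """Order pages by alpha * link score + (1 - alpha) * content score.

    Both inputs map page -> score in [0, 1]; a page missing from one of
    them is treated as scoring 0.0 there.
    """
    pages = set(link_scores) | set(content_scores)
    combined = {
        page: alpha * link_scores.get(page, 0.0)
        + (1.0 - alpha) * content_scores.get(page, 0.0)
        for page in pages
    }
    # Highest combined score first; ties broken by page name.
    return sorted(combined, key=lambda page: (-combined[page], page))
```

Raising alpha favors pages that are well linked, while lowering it favors pages whose content matches well. The computational cost noted above comes from having to produce both input scores in the first place.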

Wrapup
- Ranking is extremely important
  - Will make or break the assessment of your search engine’s quality
- Models for ranking boil down to
  - Link-graph based
  - Content-based
  - Hybrid
- Best approach is usually to combine the two, and then refine
