Ranking CSCI 572: Information Retrieval and Search Engines Summer 2010.


May-20-10 CS572-Summer2010 CAM-2
Outline
Information Retrieval
Ranking Approaches
Challenges

You’ve found some data…now what?
In what order should it be delivered back to the user?
Using a database and SQL, this is easy
  – You only include those rows (results) that exactly match your given query
  – Example: select first_name, last_name from Persons where last_name LIKE '%Mattmann%'
      first_name | last_name
      Chris | Mattmann
      Joe | Mattmann
  – Order is unspecified unless you specify an ORDER BY

Ordering your results
Example: select first_name, last_name from Persons where last_name LIKE '%Mattmann%' ORDER BY first_name DESC
  – first_name | last_name
    Joe | Mattmann
    Chris | Mattmann
Problems
  – Rigidity – hard to control the ordering at a fine-grained level (only a coarse-grained ability to sort on attributes)
  – Boolean – ranking is defined only on those results that exactly match
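The ORDER BY behaviour above can be reproduced with a small sketch using Python's built-in sqlite3 module. The in-memory database and the extra "Ada Lovelace" row are assumptions added for illustration; only the query itself comes from the slide.

```python
import sqlite3

# In-memory database with a Persons table shaped like the slide's example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Persons (first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO Persons VALUES (?, ?)",
    [("Chris", "Mattmann"), ("Joe", "Mattmann"), ("Ada", "Lovelace")],  # sample rows (assumed)
)

# Only exact matches of the LIKE pattern are returned; ORDER BY fixes their order.
rows = conn.execute(
    "SELECT first_name, last_name FROM Persons "
    "WHERE last_name LIKE '%Mattmann%' ORDER BY first_name DESC"
).fetchall()
conn.close()

for first_name, last_name in rows:
    print(first_name, "|", last_name)
```

The non-matching row is simply absent from the result: there is no notion of a partial match that ranks lower.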

Information Retrieval
Queries are a bit more flexible
  – Can specify terms to include (or exclude)
  – Often evaluation of keyword queries is OR-based for inclusion of more results, and refined via ranking
Notion of relative importance
  – Partial matches, with lower score
  – Closer, more accurate matches with higher score
  – Everything else, in-between
Effective in exploration of data rather than reporting or transaction-based querying
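A minimal sketch of OR-based evaluation refined by ranking, assuming a deliberately simple score: the fraction of query terms a document contains. The function names and sample documents are hypothetical and are not taken from the slides.

```python
def or_score(query_terms, doc_terms):
    """Fraction of distinct query terms found in the document (0.0 to 1.0)."""
    query = set(query_terms)
    if not query:
        return 0.0
    return len(query & set(doc_terms)) / len(query)


def rank(query, docs):
    """Return (doc_id, score) pairs for documents matching ANY term, best first."""
    terms = query.lower().split()
    scored = [(doc_id, or_score(terms, text.lower().split()))
              for doc_id, text in docs.items()]
    hits = [(doc_id, score) for doc_id, score in scored if score > 0]
    # Higher score first; ties broken by document id for a stable order.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "d1": "medicaid applications in los angeles",  # matches both terms
    "d2": "medicaid office hours",                 # partial match
    "d3": "parking permits",                       # no match
}
results = rank("medicaid applications", docs)
print(results)
```

Unlike the SQL example, the partial match is still returned, just with a lower score.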

The Notion of Score
…means different things to different people
How to score web pages?
  – Entirely based on link structure
  – Entirely based on page contents/structure
  – Hybrid mixes of the two, e.g., Saxena, Gupta et al.
What types of pages do you care about seeing first?
  – It’s a difficult question to answer for you, let alone all the users of the Internet!

Link Structure Scoring Models
Focus on the way that pages reference one another
Inspired by academic research
In large part ignore the internal structure of the page
Most scoring techniques can be computed offline and are based on the value of the collected web graph

Web Ranking Models*
PageRank
  – Popularized by Google
  – A page’s importance is derived from its incoming and outgoing links
      – Independent of query
      – Susceptible to trickery involving mocked-up page importance and citation trail
  – Compute rank at index time
HITS (Hyperlink-Induced Topic Search)
  – Jon Kleinberg
  – Compute Hubs and Authorities
  – Compute rank at query time
*Great talk on this from Andrzej Bialecki, see:
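As an illustration of the query-independent, index-time computation described for PageRank, here is a minimal power-iteration sketch. The damping factor of 0.85, the fixed iteration count, and the four-page graph are assumptions for the example, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page: [pages it links to]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            # A page with no out-links spreads its rank over all pages.
            targets = out_links if out_links else nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "c" is linked to by three pages, "d" by none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Nothing in the computation depends on a query, which is why the scores can be produced offline and stored with the index.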

Web Ranking Models
OPIC (Online Page Importance Computation)
  – Distribute “cash” to each page or node in the web graph
  – See how cash changes
      – Since last crawl
      – Since we last ran the algorithm
  – Constantly redistribute cash to outgoing pages and reduce each origin page’s cash to 0
  – Rewards maintenance of links to pages
TrustRank
  – Only start out with set of seeds trusted by experts, and keep within the outgoing links from those
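The cash redistribution described for OPIC can be sketched roughly as follows. This is a simplified illustration, not the published algorithm in full: the round-robin visiting order, the handling of pages without out-links, and the importance estimate (share of accumulated cash history) are assumptions made for the example.

```python
def opic(graph, rounds=100):
    """Simplified OPIC over {page: [pages it links to]}."""
    cash = {node: 1.0 / len(graph) for node in graph}
    history = {node: 0.0 for node in graph}
    for _ in range(rounds):
        for node in graph:  # round-robin visiting policy (assumed)
            amount = cash[node]
            history[node] += amount  # remember how much cash passed through
            cash[node] = 0.0         # the origin page's cash drops to 0
            # A page with no out-links shares its cash with every page (assumed).
            targets = graph[node] or list(graph)
            for target in targets:
                cash[target] += amount / len(targets)
    total = sum(history.values())
    return {node: history[node] / total for node in graph}


# Hypothetical graph: nobody links to "d", so it never receives new cash.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
importance = opic(graph)
for page, score in sorted(importance.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Because cash only moves along links, the computation can run alongside a crawl instead of needing the whole graph in advance.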

Page Structure Scoring Models
Keyword optimization
  – Meta keywords can influence overall ranking
      – Not just limited to HTML anymore
  – Title
  – Other factors (domain name)
Social factors
  – Who is being referenced and mentioned within the page
TFIDF
  – Basic term frequency weighted by inverse document frequency
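A minimal TFIDF sketch, assuming one common textbook variant: term frequency normalized by document length, multiplied by the natural log of (number of documents / documents containing the term). Real engines use different variants; the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(query, docs):
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term]:
                tf = counts[term] / len(tokens)
                idf = math.log(n_docs / doc_freq[term])
                score += tf * idf
        scores[doc_id] = score
    return scores


docs = {
    "d1": "medicaid applications and medicaid forms",
    "d2": "medicaid office hours",
    "d3": "parking permits and forms",
}
scores = tfidf_scores("medicaid applications", docs)
for doc_id, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(doc_id, round(score, 4))
```

The rarer term ("applications") contributes more per occurrence than the common one ("medicaid"), which is the point of the inverse document frequency factor.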

What does Google use?
Combination of both approaches
Can’t know for sure

What does Lucene/Solr use?
Open source
Exposes underlying ranking model of Lucene
  – Allows for boost values
      – Set at indexing time
      – Set at query time
  – Score is computed based on boosts and on the TFIDF model
  – Example: social_service:"Medicaid Applications"^200 AND zipcode:90042
      – Each time "Medicaid Applications" matches, the TFIDF score increases; coupled with the boost factor, this makes that term heavily weighted
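To show how a boost such as ^200 interacts with a match count, here is a toy sketch. It is not Lucene's scoring formula: a raw occurrence count stands in for the TFIDF component, and the function name and sample documents are hypothetical.

```python
def boosted_score(doc, clauses):
    """Score a document against AND-ed (field, phrase, boost) clauses.

    Toy model: each clause contributes boost * number of phrase occurrences.
    With AND semantics, a single non-matching clause rejects the document.
    """
    total = 0.0
    for field, phrase, boost in clauses:
        occurrences = doc.get(field, "").lower().count(phrase.lower())
        if occurrences == 0:
            return 0.0
        total += boost * occurrences
    return total


# Mirrors: social_service:"Medicaid Applications"^200 AND zipcode:90042
clauses = [
    ("social_service", "Medicaid Applications", 200.0),
    ("zipcode", "90042", 1.0),
]
doc_a = {
    "social_service": "Medicaid Applications help; renew Medicaid Applications",
    "zipcode": "90042",
}
doc_b = {"social_service": "Medicaid Applications", "zipcode": "90041"}

score_a = boosted_score(doc_a, clauses)
score_b = boosted_score(doc_b, clauses)
print(score_a, score_b)
```

The boosted clause dominates the total, so extra matches on the service name move a document far more than the zipcode match does.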

What does Lucene/Solr use?
Allowing for index-time scoring
  – Affords link-graph based ranking
  – Affords ranking based on content
Query-time scoring allows users to indicate their relative emphasis on important fields
  – What happens if the text "medicaid applications" matches in the service name field AND ALSO in the service description field AND ALSO in the service aliases field?
      – A user can say that the service name field matches are more important than matches in other fields
      – You can’t do this per se with Google
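The per-field emphasis described above can be sketched as a weighted sum over the fields in which the text matches. The field names (service_name, service_description, service_aliases), the weights, and the sample records are assumptions chosen to mirror the slide's example; this is not Solr's API.

```python
def field_weighted_score(doc, text, field_boosts):
    """Sum the user-chosen boost of every field that contains the text."""
    needle = text.lower()
    return sum(
        boost
        for field, boost in field_boosts.items()
        if needle in doc.get(field, "").lower()
    )


# The user states that a match in the service name matters most (assumed weights).
field_boosts = {
    "service_name": 10.0,
    "service_description": 2.0,
    "service_aliases": 1.0,
}
name_match = {
    "service_name": "Medicaid Applications",
    "service_description": "Help with enrollment paperwork",
}
other_matches = {
    "service_name": "Family Benefits Office",
    "service_description": "Assistance with medicaid applications",
    "service_aliases": "medicaid applications desk",
}

score_name = field_weighted_score(name_match, "medicaid applications", field_boosts)
score_other = field_weighted_score(other_matches, "medicaid applications", field_boosts)
print(score_name, score_other)
```

Here a single match in the service name outranks matches in both of the other fields, because the user's weights say so.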

Challenges
Link-graph PageRank is computationally intensive
  – Billions of links
  – …but typically fairly accurate
Page content based mechanisms are computationally efficient
  – But suffer from local maxima
  – And are typically focused on a single user community and its definition of importance
      – Can be fooled with less effort
Combining the two approaches leads to accuracy, but at computational cost
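One simple way to combine the two approaches is a weighted average of normalized scores. This is only a sketch under that assumption; the slides do not say how the signals should be mixed, and the alpha parameter, function names, and sample scores are hypothetical.

```python
def normalize(scores):
    """Scale scores so the best one is 1.0 (all zeros stay zero)."""
    top = max(scores.values(), default=0.0)
    return {doc: (value / top if top > 0 else 0.0) for doc, value in scores.items()}


def hybrid_rank(link_scores, content_scores, alpha=0.5):
    """Blend link-based and content-based scores; alpha weights the link side."""
    link = normalize(link_scores)
    content = normalize(content_scores)
    combined = {
        doc: alpha * link.get(doc, 0.0) + (1.0 - alpha) * content.get(doc, 0.0)
        for doc in set(link) | set(content)
    }
    return sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))


link_scores = {"a": 0.4, "b": 0.2, "c": 0.1}     # e.g. precomputed offline
content_scores = {"a": 0.1, "b": 0.5, "c": 0.0}  # e.g. computed per query
ranked = hybrid_rank(link_scores, content_scores)
print(ranked)
```

Page "a" has the best link score but "b" wins overall because its content match is much stronger. The cost noted on the slide still applies: both signals have to be computed before they can be blended.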

Wrapup
Ranking is extremely important
  – Will make or break the assessment of your search engine’s quality
Models for ranking boil down to
  – Link-graph based
  – Content-based
  – Hybrid
Best approach is usually to combine the two, and then refine