CC5212-1 P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 8: Information Retrieval II Aidan Hogan

Slides:



Advertisements
Similar presentations
Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Advertisements

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Matrices, Digraphs, Markov Chains & Their Use by Google Leslie Hogben Iowa State University and American Institute of Mathematics Leslie Hogben Iowa State.
1 The PageRank Citation Ranking: Bring Order to the web Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd Presented by Fei Li.
Link Analysis: PageRank
Ranking models in IR Key idea: We wish to return in order the documents most likely to be useful to the searcher To do this, we want to know which documents.
Link Analysis David Kauchak cs160 Fall 2009 adapted from:
PageRank + Inverted Index. Un Motor de Búsqueda “obama”
How PageRank Works Ketan Mayer-Patel University of North Carolina January 31, 2011.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
CS345 Data Mining Web Spam Detection. Economic considerations  Search has become the default gateway to the web  Very high premium to appear on the.
The PageRank Citation Ranking “Bringing Order to the Web”
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.
Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.
Chapter 19: Information Retrieval
The Vector Space Model …and applications in Information Retrieval.
Information Retrieval
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
“ The Initiative's focus is to dramatically advance the means to collect,store,and organize information in digital forms,and make it available for searching,retrieval,and.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Presented By: - Chandrika B N
CS344: Introduction to Artificial Intelligence Vishal Vachhani M.Tech, CSE Lecture 34-35: CLIR and Ranking in IR.
Amy N. Langville Mathematics Department College of Charleston Math Meet 2/20/10.
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1 Ad indexes.
Google’s Billion Dollar Eigenvector Gerald Kruse, PhD. John ‘54 and Irene ‘58 Dale Professor of MA, CS and I T Interim Assistant Provost Juniata.
1 Announcements Research Paper due today Research Talks –Nov. 29 (Monday) Kayatana and Lance –Dec. 1 (Wednesday) Mark and Jeremy –Dec. 3 (Friday) Joe and.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Wrap-Up.
X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox Associate Dean for.
1 Chapter 19: Information Retrieval Chapter 19: Information Retrieval Relevance Ranking Using Terms Relevance Using Hyperlinks Synonyms., Homonyms,
The Technology Behind. The World Wide Web In July 2008, Google announced that they found 1 trillion unique webpages! Billions of new web pages appear.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
1 CS 430: Information Discovery Lecture 9 Term Weighting and Ranking.
Google & Document Retrieval Qing Li School of Computing and Informatics Arizona State University.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 11: Conclusion Aidan Hogan
Term Frequency. Term frequency Two factors: – A term that appears just once in a document is probably not as significant as a term that appears a number.
The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Web Search. Crawling Start from some root site e.g., Yahoo directories. Traverse the HREF links. Search(initialLink) fringe.Insert( initialLink ); loop.
Wiki Search. Un Motor de Búsqueda Apache Lucene Inverted Index – They built one so you don’t have to! Open Source in Java Super fast, super great, super.
Cassandra. Estudiantes Documentation
Ranking CSCI 572: Information Retrieval and Search Engines Summer 2010.
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture IX: 2014/05/05.
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
Information Retrieval Techniques MS(CS) Lecture 7 AIR UNIVERSITY MULTAN CAMPUS Most of the slides adapted from IIR book.
1 CS 430: Information Discovery Lecture 5 Ranking.
Ljiljana Rajačić. Page Rank Web as a directed graph  Nodes: Web pages  Edges: Hyperlinks 2 / 25 Ljiljana Rajačić.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
PageRank. Un Motor de Búsqueda “obama” PageRank Model: Final Version The Web: a directed graph Vertices (pages) Edges (links) fa eb dc.
IR Theory: Web Information Retrieval. Web IRFusion IR Search Engine 2.
CPS 49S Google: The Computer Science Within and its Impact on Society Shivnath Babu Spring 2007.
1 Ranking. 2 Boolean vs. Non-boolean Queries Until now, we assumed that satisfaction is a Boolean function of a query –it is easy to determine if a document.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Automated Information Retrieval
CC Procesamiento Masivo de Datos Otoño Lecture 12: Conclusion
Information Retrieval
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
Information Retrieval
CS 440 Database Management Systems
Junghoo “John” Cho UCLA
Junghoo “John” Cho UCLA
Chapter 31: Information Retrieval
Chapter 19: Information Retrieval
Presentation transcript:

CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 8: Information Retrieval II Aidan Hogan

How does Google crawl the Web?

Inverted Indexing Term ListPosting Lists a(1,[21,96,103,…]), (2,[…]), … american(1,[28,123]), (5,[…]), … and(1,[57,139,…]), (2,[…]), … by(1,[70,157,…]), (2,[…]), … directed(1,[61,212,…]), (4,[…]), … drama(1,[38,87,…]), (16,[…]), … …… Inverted index: 1 Fruitvale Station is a 2013 American drama film written and directed by Ryan Coogler.drama filmRyan Coogler

INFORMATION RETRIEVAL: RANKING

How Does Google Get Such Good Results?

Two Sides to Ranking: Relevance ≠

Two Sides to Ranking: Importance >

RANKING: RELEVANCE

Example Query

Matches in a Document freedom 7 occurrences

Matches in a Document freedom 7 occurrences movie 16 occurrences

Matches in a Document freedom 7 occurrences movie 16 occurrences wallace 88 occurrences

Usefulness of Words movie occurs very frequently freedom occurs frequently wallace occurs occassionally

Estimating Relevance Rare words more important than common words – wallace (49M) more important than freedom (198M) more important than movie (835M) Words occurring more frequently in a document indicate higher relevance – wallace (88) more matches than movie (16) more matches than freedom (7)

Relevance Measure: TF–IDF TF: Term Frequency – Measures occurrences of a term in a document – … various options Raw count of occurrences Logarithmically scaled Normalised by document length A combination / something else

Relevance Measure: TF–IDF IDF: Inverse Document Frequency – Measures how rare/common a term is across all documents – … Logarithmically scaled document occurrences

Relevance Measure: TF–IDF TF–IDF: Combine Term Frequency and Inverse Document Frequency: Score for a query – Let query – Score for a query:

Relevance Measure: TF–IDF Term FrequencyInverse Document Frequency

Relevance Measure: TF–IDF Term FrequencyInverse Document Frequency

Relevance Measure: TF–IDF Term FrequencyInverse Document Frequency

Relevance Measure: TF–IDF Term FrequencyInverse Document Frequency

Relevance Measure: TF–IDF Term FrequencyInverse Document Frequency

Relevance Measure: TF–IDF Term FrequencyInverse Document Frequency

Relevance Measure: TF–IDF Term FrequencyInverse Document Frequency

Vector Space Model (a mention)

Cosine Similarity Note:

Two Sides to Ranking: Relevance ≠

Field-Based Boosting Not all text is equal: titles, headers, etc.

Anchor Text See how the Web views/tags a page

Information Retrieval & Relevance

Apache to the rescue again! Lucene: An Inverted Index Engine Open Source Java Project Will play with it in the labs

RANKING: IMPORTANCE

Two Sides to Ranking: Importance >

Link Analysis Which will have more links: Barack Obama’s Wikipedia Page or Mount Obama’s Wikipedia Page?

Link Analysis Consider links as votes of confidence in a page A hyperlink is the open Web’s version of … (… even if the page is linked in a negative way.)

Link Analysis So if we just count the number of inlinks a web-page receives we know its importance, right?

Link Spamming

Link Importance Which is more “important”: a link from Barack Obama’s Wikipedia page or a link from buyv1agra.com?

PageRank

Not just a count of inlinks – A link from a more important page is more important – A link from a page with fewer links is more important ∴ A page with lots of inlinks from important pages (which have few outlinks) is more important

PageRank is Recursive

PageRank Model The Web: a directed graph Vertices (pages) Edges (links) fa eb dc Which is the most “important” vertex?

PageRank Model The Web: a directed graph Vertices (pages) Edges (links) fa eb dc

PageRank Model Vertices (pages) Edges (links) f eb d

PageRank Model Vertices (pages) Edges (links) f eb dc a

PageRank: Random Surfer Model fa eb dc = someone surfing the web, clicking links randomly What is the probability of being at page x after n hops?

PageRank: Random Surfer Model fa eb dc = someone surfing the web, clicking links randomly What is the probability of being at page x after n hops? Initial state: surfer equally likely to start at any node

PageRank: Random Surfer Model fa eb dc g What would happen with g over time? = someone surfing the web, clicking links randomly What is the probability of being at page x after n hops? Initial state: surfer equally likely to start at any node PageRank applied iteratively for each hop: score indicates probability of being at that page after than many hops

PageRank: Random Surfer Model fa eb dc g = someone surfing the web, clicking links randomly What is the probability of being at page x after n hops? Initial state: surfer equally likely to start at any node PageRank applied iteratively for each hop: score indicates probability of being at that page after than many hops If the surfer reaches a page without links, the surfer randomly jumps to another page

PageRank: Random Surfer Model fa eb dc gi = someone surfing the web, clicking links randomly What is the probability of being at page x after n hops? Initial state: surfer equally likely to start at any node PageRank applied iteratively for each hop: score indicates probability of being at that page after than many hops If the surfer reaches a page without links, the surfer randomly jumps to another page What would happen with g and i over time?

PageRank: Random Surfer Model fa eb dc gi What would happen with g and i over time? = someone surfing the web, clicking links randomly What is the probability of being at page x after n hops? Initial state: surfer equally likely to start at any node PageRank applied iteratively for each hop: score indicates probability of being at that page after than many hops If the surfer reaches a page without links, the surfer randomly jumps to another page

PageRank: Random Surfer Model = someone surfing the web, clicking links randomly fa eb dc What is the probability of being at page x after n hops? Initial state: surfer equally likely to start at any node PageRank applied iteratively for each hop: score indicates probability of being at that page after than many hops If the surfer reaches a page without links, the surfer randomly jumps to another page The surfer will jump to a random page at any time with a probability 1 – d … this avoids traps and ensures convergence! gi

PageRank Model: Final Version The Web: a directed graph Vertices (pages) Edges (links) fa eb dc

PageRank: Benefits More robust than a simple link count Scalable to approximate (for sparse graphs) Convergence guaranteed

Two Sides to Ranking: Importance >

INFORMATION RETRIEVAL: RECAP

How Does Google Get Such Good Results?

Ranking in Information Retrieval Relevance: Is the document relevant for the query? – Term Frequency * Inverse Document Frequency – Touched on Cosine similarity Importance: Is the document an important/prominent one? – Links analysis – PageRank

Ranking: Science or Art?

Information Retrieval & Relevance

CLASS PROJECTS

Course Marking 45% for Weekly Labs (~3% a lab!) 35% for Final Exam 20% for Small Class Project

Class Project Done in pairs (typically) Goal: Use what you’ve learned to do something cool (basically) Expected difficulty: A bit more than a lab’s worth – But without guidance (can extend lab code) Marked on: Difficulty, appropriateness, scale, good use of techniques, presentation, coolness – Ambition is appreciated, even if you don’t succeed: feel free to bite off more than you can chew! Process: – Pair up (default random) by Wednesday, the end of the lab – Start thinking up topics – If you need data or get stuck, I will (try to) help out Deliverables: 5 minute presentation & 3-page report

Datasets to play with Wikipedia information IMDb (including ratings, directors, etc.) ArnetMiner (CS research papers w/ citations) Wikidata (like Wikipedia for data!) Twitter World Bank Find others, e.g., at

Open Government Data Chile

Next Week (May 4 th, 6 th ) No official classes or labs next week but … Good opportunity to meet with your lab partner to explore project ideas! Deadline for finding a topic: May 13th

Questions