INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Slides:

Advertisements

Similar presentations

Lecture 18: Link analysis

Advertisements

“ The Anatomy of a Large-Scale Hypertextual Web Search Engine ” Presented by Ahmed Khaled Al-Shantout ICS

Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.

1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.

CS Lecture 9 Storeing and Querying Large Web Graphs.

Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.

Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:

CS728 Lecture 16 Web indexes II. Last Time Indexes for answering text queries –given term produce all URLs containing –Compact representations for postings.

Link Analysis, PageRank and Search Engines on the Web

ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.

Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα

PrasadL16Crawling1 Crawling and Web Indexes Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford) and Christopher Manning (Stanford)

Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.

Adversarial Information Retrieval The Manipulation of Web Content.

1 Announcements Research Paper due today Research Talks –Nov. 29 (Monday) Kayatana and Lance –Dec. 1 (Wednesday) Mark and Jeremy –Dec. 3 (Friday) Joe and.

ITCS 6265 Lecture 17 Link Analysis This lecture Anchor text Link analysis for ranking Pagerank and variants HITS.

Lecture 14: Link Analysis

Graph-based Algorithms in Large Scale Information Retrieval Fatemeh Kaveh-Yazdy Computer Engineering Department School of Electrical and Computer Engineering.

Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 20: Crawling and web indexes.

Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Chris Manning, Pandu Nayak and.

Introduction to Information Retrieval Introduction to Information Retrieval Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture.

Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 21: Link analysis.

Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.

1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:

When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.

CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.

Web Search and Tex Mining Lecture 9 Link Analysis.

Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University

CS349 – Link Analysis 1. Anchor text 2. Link analysis for ranking 2.1 Pagerank 2.2 Pagerank variants 2.3 HITS.

Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 

COMP4210 Information Retrieval and Search Engines Lecture 9: Link Analysis.

1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.

Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.

KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.

ITCS 6265 Lecture 11 Crawling and web indexes. This lecture Crawling Connectivity servers.

1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.

Modified by Dongwon Lee from slides by

Lecture 17 Crawling and web indexes

Quality of a search engine

Search Engines and Link Analysis on the Web

Link analysis and Page Rank Algorithm

Information Retrieval Christopher Manning and Prabhakar Raghavan

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

CS276 Information Retrieval and Web Search

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Presentation transcript:

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID Lecture # 36 Link Analysis

ACKNOWLEDGEMENTS The presentation of this lecture has been taken from the following sources “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze “Managing gigabytes” by Ian H. Witten, ‎Alistair Moffat, ‎Timothy C. Bell “Modern information retrieval” by Baeza-Yates Ricardo, ‎ “Web Information Retrieval” by Stefano Ceri, ‎Alessandro Bozzon, ‎Marco Brambilla

Outline Hypertext and Links The Web as a Directed Graph Anchor Text Indexing anchor text PageRank scoring

Hypertext and Links We look beyond the content of documents We begin to look at the hyperlinks between them Address questions like Do the links represent a conferral of authority to some pages? Is this useful for ranking? How likely is it that a page pointed to by the CERN home page is about high energy physics Big application areas The Web Email Social networks 00:00:50  00:01:10 00:06:30  00:06:46 00:09:40  00:09:53 00:11:15  00:11:30

Links are everywhere Powerful sources of authenticity and authority Mail spam – which email accounts are spammers? Host quality – which hosts are “bad”? Phone call logs The Good, The Bad and The Unknown ? Good Bad 00:15:10  00:15:30

Example 1: Good/Bad/Unknown The Good, The Bad and The Unknown Good nodes won’t point to Bad nodes All other combinations plausible ? 00:15:40  00:16:30 00:17:30  00:19:10 (Graph to slide NO#09) Good ? ? Bad ?

Simple iterative logic Good nodes won’t point to Bad nodes If you point to a Bad node, you’re Bad If a Good node points to you, you’re Good ? Good ? ? Bad ?

Simple iterative logic Good nodes won’t point to Bad nodes If you point to a Bad node, you’re Bad If a Good node points to you, you’re Good Good ? Bad ?

Simple iterative logic Good nodes won’t point to Bad nodes If you point to a Bad node, you’re Bad If a Good node points to you, you’re Good Good Bad Sometimes need probabilistic analogs – e.g., mail spam

Our primary interest in this course Link analysis for most IR functionality thus far based purely on text Scoring and ranking Link-based clustering – topical structure from links Links as features in classification – documents that link to one another are likely to be on the same subject Crawling Based on the links seen, where do we crawl next? 00:20:00  00:21:05 (layering)

The Web as a Directed Graph Page A hyperlink Page B Anchor Hypothesis 1: A hyperlink between pages denotes a conferral of authority (quality signal) 00:21:20  00:21:40 (figure) 00:24:00  00:24:20 (hap-01) 00:24:45  00:25:05 (hap-02) Hypothesis 2: The text in the anchor of the hyperlink on page A describes the target page B

Assumption 1: reputed sites 00:27:40  00:28:15

Assumption 2: annotation of target 00:28:20  00:29:00

Anchor Text WWW Worm - McBryan [Mcbr94] For ibm how to distinguish between: IBM’s home page (mostly graphical) IBM’s copyright page (high term freq. for ‘ibm’) Rival’s spam page (arbitrarily high term freq.) “IBM home page” “ibm.com” “ibm” 00:29:15  00:29:45 A million pieces of anchor text with “ibm” send a strong signal www.ibm.com

Indexing anchor text When indexing a document D, include (with some weight) anchor text from links pointing to D. Armonk, NY-based computer giant IBM announced today www.ibm.com 00:33:30  00:34:00 Joe’s computer hardware links Sun HP IBM Big Blue today announced record profits for the quarter

Adjacency lists The set of neighbors of a node Assume each URL represented by an integer E.g., for a 4 billion page web, need 32 bits per node Naively, this demands 64 bits to represent each hyperlink Boldi/Vigna get down to an average of ~3 bits/link Further work achieves 2 bits/link 00:35:30  00:35:50 00:36:15  00:36:45 00:36:50  00:38:00

Main ideas of Boldi/Vigna Consider lexicographically ordered list of all URLs, e.g., www.stanford.edu/alchemy www.stanford.edu/biology www.stanford.edu/biology/plant www.stanford.edu/biology/plant/copyright www.stanford.edu/biology/plant/people www.stanford.edu/chemistry 00:38:15  00:38:50

Encode as (-2), remove 9, add 8 Boldi/Vigna Each of these URLs has an adjacency list Main idea: due to templates, the adjacency list of a node is similar to one of the 7 preceding URLs in the lexicographic ordering Express adjacency list in terms of one of these E.g., consider these adjacency lists 1, 2, 4, 8, 16, 32, 64 1, 4, 9, 16, 25, 36, 49, 64 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144 1, 4, 8, 16, 25, 36, 49, 64 Why 7? 00:39:25  00:40:55 Encode as (-2), remove 9, add 8

Compress each integer using a code Gap encodings Given a sorted list of integers x, y, z, …, represent by x, y-x, z-y, … Compress each integer using a code  code - Number of bits = 1 + 2 lg x d code: … Information theoretic bound: 1 + lg x bits z code: Works well for integers from a power law Boldi Vigna DCC 2004 00:41:20  00:41:55

Citation Analysis Citation frequency Bibliographic coupling frequency Articles that co-cite the same articles are related Citation indexing Who is this author cited by? (Garfield 1972) Pagerank preview: Pinsker and Narin ’60s Asked: which journals are authoritative? 00:42:10  00:42:52

The web isn’t scholarly citation Millions of participants, each with self interests Spamming is widespread Once search engines began to use links for ranking (roughly 1998), link spam grew You can join a link farm – a group of websites that heavily link to one another 00:43:10  00:43:40

PageRank scoring Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably “In the long run” each page has a long-term visit rate - use this as the page’s score. 1/3 00:44:15  00:45:05 00:47:10  00:47:25

Not quite enough The web is full of dead-ends. Random walk can get stuck in dead-ends. Makes no sense to talk about long-term visit rates. 00:47:30  00:47:55 ??

Teleporting At a dead end, jump to a random web page. At any non-dead end, with probability 10%, jump to a random web page. With remaining probability (90%), go out on a random link. 10% - a parameter. 00:49:00  00:49:25

Result of teleporting Now cannot get stuck locally. There is a long-term rate at which any page is visited (not obvious, will show this). How do we compute this visit rate? 00:49:50  00:50:15

Resources IIR Chap 21 http://www2004.org/proceedings/docs/1p309.pdf http://www2004.org/proceedings/docs/1p595.pdf http://www2003.org/cdrom/papers/refereed/p270/kamvar-270- xhtml/index.html http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641- mccurley.html The WebGraph framework I: Compression techniques (Boldi et al. 2004)