Inf 723 Information & Computing

Presentation transcript:

Inf 723 Information & Computing
Jagdish S. Gangolly
Interdisciplinary PhD Program in Information Science
Department of Informatics, College of Computing & Information
State University of New York at Albany
11/24/2018 Inf 723 Information & Computing (Gangolly) Spring 2008

Entropy-Based Link Analysis for Mining Web Informative Structures
Abundance problem: the number of pages returned is too large for humans to handle.
One way to handle the problem is to filter the returned pages by how authoritative each page is.
There is no endogenous way to measure how authoritative a page is.
Authorities often do not use the relevant search terms themselves.

Entropy-Based Link Analysis for Mining Web Informative Structures
A common method used to analyse web page information is the Hypertext Induced Topic Selection (HITS) algorithm due to Jon Kleinberg (http://www.cs.cornell.edu/home/kleinber/auth.pdf).
The HITS algorithm ranks documents based on link information.

Entropy-based…..
Nodes can be of two types:
Authorities: documents that other documents point to
Hubs: documents that point to many other documents
A node may exhibit characteristics of both.
Authorities and hubs reinforce each other.

Entropy-based…..
Original graph
Base sub-graph: contains the t highest-ranked pages returned by querying the search engine.
Augmented base sub-graph: add to the base sub-graph the pages pointed to by its pages, and up to d pages that point to each of its pages.
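The augmentation step can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `augment_base_subgraph`, the dictionary-based link tables, and the choice to take the first d in-linking pages in sorted order are all assumptions made here. It also assumes the cap d applies per page, as in Kleinberg's paper.

```python
def augment_base_subgraph(base_pages, out_links, in_links, d):
    """Grow the base sub-graph by one link in each direction.

    base_pages : iterable of page ids (the t highest-ranked results)
    out_links  : dict mapping a page to the set of pages it points to
    in_links   : dict mapping a page to the set of pages pointing to it
    d          : assumed per-page cap on how many in-linking pages are added
    """
    augmented = set(base_pages)
    for page in base_pages:
        # Add every page that this base page points to.
        augmented.update(out_links.get(page, set()))
        # Add at most d pages that point to this base page.
        # Sorting only makes the (otherwise arbitrary) choice repeatable.
        pointing = sorted(in_links.get(page, set()))
        augmented.update(pointing[:d])
    return augmented


# Hypothetical toy link tables for illustration.
example_out = {"a": {"x", "y"}, "b": {"y"}}
example_in = {"a": {"p", "q", "r"}, "b": {"p"}}
example_result = augment_base_subgraph(["a", "b"], example_out, example_in, d=2)
```

The cap d keeps a very popular page from flooding the sub-graph with thousands of in-linking pages.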

Entropy-Based
Hubs and authorities exhibit what could be called a mutually reinforcing relationship: a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs.
HITS is an iterative algorithm that computes and updates authority and hub weights for each page.
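The mutually reinforcing updates can be sketched as follows. This is a minimal sketch under stated assumptions: the edge-list input, the fixed iteration count, and the names `hits` and `_normalise` are choices made here for illustration. Weights are rescaled to unit Euclidean length after each round.

```python
import math


def _normalise(weights):
    """Scale a weight dict to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return dict(weights)
    return {k: v / norm for k, v in weights.items()}


def hits(edges, iterations=50):
    """edges: list of (source, target) links. Returns (authority, hub) dicts."""
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority weight of a page: sum of hub weights of pages pointing to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub weight of a page: sum of the updated authority weights it points to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = _normalise(new_auth)
        hub = _normalise(new_hub)
    return auth, hub


# Hypothetical toy graph: h1 links to both a1 and a2, h2 links only to a1.
toy_auth, toy_hub = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
```

In the toy graph, a1 ends up with the larger authority weight because both hubs point to it, and h1 ends up the stronger hub because it points to both authorities.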

Deficiencies of HITS
Many links on a page do not endorse the page they point to, yet HITS counts them like any other link. Examples:
advertisement banners
browsing menus
catalogs of services
announcements of copyright and privacy policy
content tagged with hyperlinks for easy access to related information
navigational links

Topic selection vs. Informative structure
Topic selection (TS) distills authorities and hubs; informative structure (IS) mining finds table-of-contents (TOC) pages and article pages.
In TS, the augmented base sub-graph reflects the topic distilled; in IS, the entire set of web pages is mined.
TS ignores intra-links and nepotistic links; IS considers them.

LAMIS
tf_ij: frequency of term i in page j
tf is the term-document matrix
Weights:

LAMIS
Entropy of term i is:
The paper normalises entropy by taking the logarithm to the base n, where n is the number of documents.
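The weight and entropy formulas themselves are not in this transcript. The sketch below shows one reading consistent with the definitions given here, and it rests on two assumptions: the weight of term i in page j is its frequency divided by the term's total frequency over all n pages, w_ij = tf_ij / Σ_j tf_ij, and the entropy is E(i) = −Σ_j w_ij log_n w_ij. The function name `term_entropies` and the list-of-rows layout are hypothetical.

```python
import math


def term_entropies(tf):
    """tf[i][j] is the frequency of term i in page j (one row per term).

    Returns one entropy per term, in [0, 1] because the logarithm is
    taken to base n, the number of pages.
    """
    entropies = []
    for row in tf:
        n = len(row)
        total = sum(row)
        if total == 0 or n < 2:
            # No occurrences, or a single page: nothing to spread over.
            entropies.append(0.0)
            continue
        entropy = 0.0
        for count in row:
            if count > 0:
                w = count / total  # assumed weight w_ij
                entropy -= w * math.log(w, n)
        entropies.append(entropy)
    return entropies


# Hypothetical term-document counts for three terms over four pages.
example_entropies = term_entropies([[1, 1, 1, 1], [4, 0, 0, 0], [2, 2, 0, 0]])
```

Under these assumptions a term spread evenly over every page has entropy close to 1 and carries little information about any one page, while a term confined to a single page has entropy 0.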

LAMIS
Entropy of anchor text:
Let an anchor AN consist of terms T_1, T_2, ..., T_k.
Entropy of the anchor AN is:
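The anchor-entropy formula is also missing from this transcript. One natural reading, assumed here rather than taken from the slide, is that the entropy of an anchor is the mean of the entropies of its terms:

```latex
E(AN) = \frac{1}{k} \sum_{j=1}^{k} E(T_j)
```

Under that assumption, an anchor made up of terms that appear on nearly every page (for example, a navigation label) has entropy near 1, while an anchor made up of page-specific terms has entropy near 0.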

LAMIS
Uses entropy to adjust the authority and hub weights computed by algorithms such as HITS and SALSA, with the aim of yielding more useful results.

Belief Functions
Probability can be interpreted as:
the chance or propensity that some event will happen (Popper)
the degree of belief that some event will happen (Savage)
the degree of support that one proposition gives another proposition (Carnap)

Belief Functions
Murder suspects: Peter, Paul, Mary
Frame of discernment: {Peter}, {Paul}, {Mary}, {Peter, Paul}, {Peter, Mary}, {Paul, Mary}, {Peter, Paul, Mary}, {none of the three}

Belief Functions
Probabilists invoke the principle of insufficient reason to assign equal probabilities. E.g., if you know that the murderer is male, equal probabilities are assigned to Peter and Paul.
Belief functions do not invoke the principle of insufficient reason, and so assign the whole mass to the proposition “murderer is a male” without dividing it between Peter and Paul.
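The murder example can be worked through with a small sketch of the standard Dempster-Shafer definitions: belief in a proposition sums the mass of every focal set contained in it, and plausibility sums the mass of every focal set that overlaps it. The function names and the use of `frozenset` keys are illustrative choices made here, and the sketch assumes the suspects are exactly Peter, Paul and Mary.

```python
def belief(mass, proposition):
    """Total mass on focal sets that lie entirely inside the proposition."""
    return sum(m for focal, m in mass.items() if focal <= proposition)


def plausibility(mass, proposition):
    """Total mass on focal sets that share at least one element with the proposition."""
    return sum(m for focal, m in mass.items() if focal & proposition)


# All of the mass sits on "the murderer is male"; none is split between the men.
mass = {frozenset({"Peter", "Paul"}): 1.0}

peter = frozenset({"Peter"})
male = frozenset({"Peter", "Paul"})
mary = frozenset({"Mary"})

bel_peter = belief(mass, peter)        # no evidence singles Peter out
pl_peter = plausibility(mass, peter)   # but nothing rules him out either
bel_male = belief(mass, male)
pl_mary = plausibility(mass, mary)
```

Here the belief that Peter is the murderer is 0 while its plausibility is 1, which is how a belief function records "male, but no idea which one" without inventing a 50/50 split.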

Belief Functions (Glenn Shafer)
Suppose, for example, that Betty tells me a tree limb fell on my car. My subjective probability that Betty is reliable is 90%; my subjective probability that she is unreliable is 10%. Since they are probabilities, these numbers add to 100%.

Belief Functions (Glenn Shafer)
But Betty's statement, which must be true if she is reliable, is not necessarily false if she is unreliable. From her testimony alone, I can justify a 90% degree of belief that a limb fell on my car, but only a 0% (not 10%) degree of belief that no limb fell on my car.

Belief Functions (Glenn Shafer)
This 0% does not mean that I am sure that no limb fell on my car, as a 0% probability would; it merely means that Betty's testimony gives me no reason to believe that no limb fell on my car. The 90% and the 0%, which do not add to 100%, together constitute a “belief function.”
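Betty's testimony can be written out as a mass assignment. The notation below is the usual Dempster-Shafer notation and is an assumption added here, not something from the slides: Θ is the frame {limb, no limb}, m is the mass function, and Bel and Pl are belief and plausibility.

```latex
\begin{align*}
m(\{\text{limb}\}) &= 0.9 && \text{(Betty is reliable)} \\
m(\Theta) &= 0.1 && \text{(Betty is unreliable, so nothing is learned)} \\[4pt]
\mathrm{Bel}(\{\text{limb}\}) &= m(\{\text{limb}\}) = 0.9 \\
\mathrm{Bel}(\{\text{no limb}\}) &= 0 \\[4pt]
\mathrm{Pl}(\{\text{limb}\}) &= m(\{\text{limb}\}) + m(\Theta) = 1 \\
\mathrm{Pl}(\{\text{no limb}\}) &= m(\Theta) = 0.1
\end{align*}
```

The 10% attached to Betty being unreliable stays on the whole frame instead of moving to "no limb", which is why the two degrees of belief sum to 90% and not 100%.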

Belief Functions (Srivastava)
Under a probability framework, ignorance is represented by assigning equal probability to both outcomes: P(fraud) = 0.5, P(no fraud) = 0.5. These probability numbers, in general, represent uncertainty about the outcome of the event.
Under belief functions, the same situation, where there is no evidence, is represented by assigning zero belief to both outcomes: Bel(fraud) = 0 and Bel(no fraud) = 0.

Belief Functions (Srivastava)
However, under belief functions, the plausibility that fraud is present, or that it is not present, is one, i.e., Pl(fraud) = 1 and Pl(no fraud) = 1.
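These plausibility values follow from the standard relation between plausibility and belief. As a short worked step, assuming the usual notation in which Θ = {fraud, no fraud} and all mass sits on Θ when there is no evidence:

```latex
\begin{align*}
m(\Theta) &= 1 \\
\mathrm{Pl}(A) &= 1 - \mathrm{Bel}(\lnot A) \\
\mathrm{Pl}(\text{fraud}) &= 1 - \mathrm{Bel}(\text{no fraud}) = 1 - 0 = 1 \\
\mathrm{Pl}(\text{no fraud}) &= 1 - \mathrm{Bel}(\text{fraud}) = 1 - 0 = 1
\end{align*}
```

The interval [Bel, Pl] = [0, 1] for each outcome expresses complete ignorance, whereas P = 0.5 cannot distinguish ignorance from evidence that the two outcomes are equally likely.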