Inf 723 Information & Computing Jagdish S. Gangolly Interdisciplinary PhD Program in Information Science Department of Informatics, College of Computing & Information State University of New York at Albany 11/24/2018 Inf 723 Information & Computing (Gangolly) Spring 2008

Entropy-Based Link Analysis for Mining Web Informative Structures
- Abundance problem: the number of pages returned is too large for humans to handle
- One way to handle the problem is to filter the returned pages by how authoritative each page is
- There is no endogenous way to measure how authoritative a page is
- Authoritative pages often do not contain the relevant search terms themselves

Entropy-Based Link Analysis for Mining Web Informative Structures
- A common method used to analyse web page information is the Hypertext Induced Topic Selection (HITS) algorithm due to Jon Kleinberg (http://www.cs.cornell.edu/home/kleinber/auth.pdf)
- The HITS algorithm ranks documents based on link information

Entropy-based…..
Nodes can be of two types:
- Authorities: documents that other documents point to
- Hubs: documents that point to many other documents
A node may exhibit characteristics of both.
Authorities and hubs reinforce each other.

Entropy-based…..
- Original graph
- Base sub-graph: contains the t highest-ranked pages returned by the search engine for the query
- Augment the base sub-graph by adding the pages that its pages point to, and up to d pages that point to pages in the base sub-graph
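
The augmentation step can be sketched in a few lines. This is a minimal illustration, not the slides' own code: the function name `augment_base`, the dictionaries `out_links` and `in_links`, and the page names are hypothetical, and the sketch assumes that at most d in-linking pages are added per base page.

```python
def augment_base(base, out_links, in_links, d):
    """Return the augmented page set for a base sub-graph.

    base      -- pages ranked highest by the search engine (the t pages)
    out_links -- dict: page -> list of pages it points to (hypothetical data)
    in_links  -- dict: page -> list of pages that point to it (hypothetical data)
    d         -- assumed cap on in-linking pages added per base page
    """
    augmented = set(base)
    for page in base:
        # Add every page that this base page points to.
        augmented.update(out_links.get(page, []))
        # Add at most d pages that point to this base page.
        augmented.update(in_links.get(page, [])[:d])
    return augmented


base = ["p1", "p2"]
out_links = {"p1": ["x"], "p2": ["y", "p1"]}
in_links = {"p1": ["a", "b", "c"], "p2": ["e"]}
print(sorted(augment_base(base, out_links, in_links, d=2)))
```

The cap d keeps a very popular page from flooding the sub-graph with its in-links.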

Entropy-Based
- Hubs and authorities exhibit what could be called a mutually reinforcing relationship: a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs.
- HITS is an iterative algorithm that computes and updates authority and hub weights for each page
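
The mutual reinforcement can be made concrete with a short sketch of the iteration. The function name `hits`, the fixed iteration count, and the toy link dictionary are assumptions made for illustration; the update and normalisation steps follow the usual description of the algorithm rather than any code in these slides.

```python
import math


def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub weights of the pages pointing to a page.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # Hub update: sum the authority weights of the pages a page points to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise both weight vectors to unit Euclidean length.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub


# Toy graph (hypothetical): three hub-like pages pointing at two targets.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
auth, hub = hits(links)
print(sorted(auth, key=auth.get, reverse=True)[:2])
```

In this toy graph a1 is pointed to by all three hubs and so ends with a larger authority weight than a2, while h1 and h2 end with larger hub weights than h3.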

Deficiencies of HITS
HITS treats every hyperlink as an endorsement, but many links exist for other reasons:
- advertisement banners
- browsing menus
- catalogs of services
- announcements of copyright and privacy policy
- contents tagged with hyperlinks for easy access to related information
- navigational links

Topic selection vs. Informative structure
- Topic selection (TS) distills authorities and hubs; informative structure (IS) mining finds table-of-contents (TOC) pages and article pages
- In TS the augmented base sub-graph reflects the distilled topic; in IS the entire set of web pages is mined
- TS ignores intra-links and nepotistic links; IS considers them

LAMIS
- $tf_{ij}$: frequency of term i in page j
- tf is the term-document matrix
- Weights: $w_{ij} = \frac{tf_{ij}}{\sum_{k=1}^{n} tf_{ik}}$, where n is the number of pages

LAMIS
- Entropy of term i: $E(T_i) = -\sum_{j=1}^{n} w_{ij} \log_n w_{ij}$, where $w_{ij}$ is the weight of term i in page j
- The paper normalises entropy by taking the logarithm to the base n, the number of documents, so the entropy lies between 0 and 1
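
A small sketch may help show what the normalised entropy measures. It assumes that the weight of a term in a page is its frequency divided by the term's total frequency over all n pages, and that terms with zero weight contribute nothing; the function name `term_entropy` and the sample rows are hypothetical.

```python
import math


def term_entropy(tf_row):
    """Normalised entropy of one term, given its frequencies across n pages.

    Assumes weight = frequency / total frequency of the term (an assumption),
    and uses logarithms to the base n so the result lies in [0, 1].
    """
    n = len(tf_row)
    total = sum(tf_row)
    if n < 2 or total == 0:
        return 0.0
    entropy = 0.0
    for tf in tf_row:
        if tf > 0:  # a zero weight contributes nothing to the sum
            w = tf / total
            entropy -= w * math.log(w, n)
    return entropy


# A term spread evenly over all pages (e.g. a menu label) has entropy near 1.
print(term_entropy([5, 5, 5, 5]))
# A term concentrated in a single page has entropy 0.
print(term_entropy([20, 0, 0, 0]))
```

High entropy therefore flags terms that appear everywhere and carry little information about any particular page; low entropy flags terms specific to a few pages.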

LAMIS
Entropy of anchor text:
- Let an anchor AN consist of terms $T_1, T_2, \ldots, T_k$
- Entropy of the anchor AN is:

LAMIS
- Adjusts the authority and hub weights computed by link-analysis algorithms such as HITS and SALSA using entropy, to yield more appealing results
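
The slides do not state the adjustment formula, so the following is only an illustrative sketch of how entropy could enter a HITS-style iteration. It assumes, purely for illustration, that each link is weighted by one minus the entropy of its anchor, so that links with uninformative anchors contribute less. The function name `entropy_weighted_hits`, the weighting rule, and the sample data are all assumptions, not the paper's method.

```python
import math


def entropy_weighted_hits(link_entropy, iterations=50):
    """link_entropy: dict (source, target) -> anchor entropy in [0, 1].

    Assumed weighting (not taken from the slides): weight = 1 - entropy.
    """
    pages = {p for edge in link_entropy for p in edge}
    auth = dict.fromkeys(pages, 1.0)
    hub = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authority update, with each link scaled by its assumed weight.
        new_auth = dict.fromkeys(pages, 0.0)
        for (u, v), entropy in link_entropy.items():
            new_auth[v] += (1.0 - entropy) * hub[u]
        # Hub update, using the same link weights.
        new_hub = dict.fromkeys(pages, 0.0)
        for (u, v), entropy in link_entropy.items():
            new_hub[u] += (1.0 - entropy) * new_auth[v]
        # Normalise to unit Euclidean length.
        a_norm = math.sqrt(sum(x * x for x in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in new_hub.values())) or 1.0
        auth = {p: x / a_norm for p, x in new_auth.items()}
        hub = {p: x / h_norm for p, x in new_hub.items()}
    return auth, hub


# Hypothetical site: "about" is linked from every page by a high-entropy anchor.
link_entropy = {
    ("toc", "article1"): 0.1,
    ("toc", "article2"): 0.1,
    ("toc", "about"): 0.95,
    ("article1", "about"): 0.95,
    ("article2", "about"): 0.95,
}
auth, hub = entropy_weighted_hits(link_entropy)
print(max(hub, key=hub.get), auth["article1"] > auth["about"])
```

In this toy site the "about" page has the most in-links, yet under the assumed weighting the article pages receive more authority because their anchors are low-entropy.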

Belief Functions
Probability:
- chance or propensity that some event will happen (Popper)
- degree of belief that some event will happen (Savage)
- degree of support that one proposition gives another proposition (Carnap)

Belief Functions
- Murder suspects: Peter, Paul, Mary
- Frame of discernment: the set of suspects, whose subsets are the propositions that can receive belief: {Peter}, {Paul}, {Mary}, {Peter, Paul}, {Peter, Mary}, {Paul, Mary}, {Peter, Paul, Mary}, plus {none of the three}

Belief Functions
- Probabilists invoke the principle of insufficient reason to assign equal probabilities. E.g., if you know that the murderer is a male, equal probabilities are assigned to Peter and Paul.
- Belief functions do not invoke the principle of insufficient reason, and so assign the mass to the proposition “murderer is a male”, that is, to the set {Peter, Paul}
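
The contrast can be sketched with a tiny mass function. The helper names `belief` and `plausibility` and the dictionary layout are assumptions made for illustration; the definitions used are the standard ones (belief sums the mass of subsets of a proposition, plausibility sums the mass of sets that intersect it).

```python
def belief(mass, proposition):
    """Total mass committed to subsets of the proposition."""
    return sum(m for focal, m in mass.items() if focal <= proposition)


def plausibility(mass, proposition):
    """Total mass of sets that do not contradict the proposition."""
    return sum(m for focal, m in mass.items() if focal & proposition)


# Probabilistic treatment: insufficient reason splits the mass equally.
probability = {"Peter": 0.5, "Paul": 0.5, "Mary": 0.0}

# Belief-function treatment: all mass stays on "the murderer is a male".
mass = {frozenset({"Peter", "Paul"}): 1.0}

peter = frozenset({"Peter"})
male = frozenset({"Peter", "Paul"})
print(probability["Peter"])
print(belief(mass, peter), plausibility(mass, peter))
print(belief(mass, male))
```

The evidence fully supports "male" but gives no specific support to Peter, so the belief in {Peter} is zero while its plausibility is one.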

Belief Functions (Glenn Shafer)
Suppose, for example, that Betty tells me a tree limb fell on my car. My subjective probability that Betty is reliable is 90%; my subjective probability that she is unreliable is 10%. Since they are probabilities, these numbers add to 100%.

Belief Functions (Glenn Shafer)
But Betty's statement, which must be true if she is reliable, is not necessarily false if she is unreliable. From her testimony alone, I can justify a 90% degree of belief that a limb fell on my car, but only a 0% (not 10%) degree of belief that no limb fell on my car.

Belief Functions (Glenn Shafer)
(This 0% does not mean that I am sure that no limb fell on my car, as a 0% probability would; it merely means that Betty's testimony gives me no reason to believe that no limb fell on my car.) The 90% and the 0%, which do not add to 100%, together constitute a “belief function.”

Belief Functions (Srivastava)
- Ignorance is represented by assigning equal probability to both the outcomes under a probability framework: P(fraud) = 0.5, P(no fraud) = 0.5. These probability numbers, in general, represent uncertainty about the outcome of the event.
- Under belief functions, the same situation, where there is no evidence, is represented by assigning zero beliefs to both the outcomes: Bel(fraud) = 0, and Bel(no fraud) = 0.

Belief Functions (Srivastava)
- However, under belief functions, the plausibility that fraud is present or not present is one, i.e., Pl(fraud) = 1, Pl(no fraud) = 1
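
These values follow from the standard relation between belief and plausibility. The worked step below is a sketch: the mass notation $m$ and the frame symbol $\Theta$ are not in the slides and are introduced here as assumptions, with total ignorance modelled by placing all mass on the whole frame.

```latex
% Assumed notation: \Theta = \{\text{fraud}, \text{no fraud}\}, mass function m.
% Total ignorance: all mass on the whole frame.
m(\Theta) = 1

% Belief sums the mass of subsets; plausibility sums the mass of intersecting sets.
\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B), \qquad
\mathrm{Pl}(A) = \sum_{B \cap A \neq \emptyset} m(B) = 1 - \mathrm{Bel}(\neg A)

% Hence, with no evidence:
\mathrm{Bel}(\text{fraud}) = 0, \qquad \mathrm{Bel}(\text{no fraud}) = 0
\mathrm{Pl}(\text{fraud}) = 1 - \mathrm{Bel}(\text{no fraud}) = 1, \qquad
\mathrm{Pl}(\text{no fraud}) = 1 - \mathrm{Bel}(\text{fraud}) = 1
```

The interval from Bel to Pl is [0, 1] for both outcomes, which expresses ignorance without committing to P = 0.5.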