Mining social networks for knowledge management Prabhakar Raghavan.


2 Overview Sampling of social network studies The knowledge management challenge › Enterprise complications Extended social networks › Network and tensor-based models Power law behaviors › Why, and what they mean for mining A research agenda

3 Milgram’s experiments Began with volunteers from Omaha, NE. Asked to get a letter to a physician near Boston. Could only send to first-name acquaintance, to be forwarded etc. Median path length of successful deliveries was 6. Led to famous “6 degrees of separation” folklore.

4 eMail cliques: Schwartz/Wood Studied eMail (sub)graph. Proposed metrics for identifying groups of people who share interests; cluster analysis. Qualitatively “good” results. Raised issues of ethical use of data and privacy.

5 Various other projects PHOAKS › Extracting heavily cited resources in newsgroups, etc. Call graphs › Discerning home, business and fax lines › Calling circles. Recommendation systems › Input: users’ product endorsements. › Output: product recommendations to each user.

6 Trawling bipartite cliques Take the (directed) Web link graph. Enumerate all (small) bipartite cliques. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins (1998).

7 Insights from hubs Link-based hypothesis: Dense bipartite subgraph ⇒ Web community. [Figure labels: Hub, Authority, Fans, Centers]

8 Communities from cores What is a “dense bipartite subgraph”? Define (i,j)-core: complete bipartite subgraph with i nodes all of which point to each of j others. Enumerate (i,j)-cores for various small i,j. (2,3) core
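The (i,j)-core definition above can be made concrete with a brute-force enumeration. This is only a sketch for tiny graphs, not the trawling algorithm itself (which relies on pruning to scale); the function name `enumerate_cores` and the toy graph are assumptions made for illustration.

```python
from itertools import combinations

def enumerate_cores(out_links, i, j):
    """Return all (i, j)-cores as (fans, centers) pairs of sorted tuples.

    out_links maps each node to the set of nodes it points to.
    Brute force: suitable only for very small graphs.
    """
    cores = []
    # A fan must point to j centers, so it needs at least j out-links.
    candidates = sorted(n for n, outs in out_links.items() if len(outs) >= j)
    for fans in combinations(candidates, i):
        # Centers must be pointed to by every fan and be distinct from the fans.
        common = set.intersection(*(set(out_links[f]) for f in fans)) - set(fans)
        for centers in combinations(sorted(common), j):
            cores.append((fans, centers))
    return cores

# Hypothetical toy graph: pages a and b both link to x, y and z.
toy_links = {
    "a": {"x", "y", "z"},
    "b": {"x", "y", "z", "w"},
    "c": {"x"},
}
cores_2_3 = enumerate_cores(toy_links, 2, 3)
print(cores_2_3)
```

On this toy graph the only (2,3)-core has fans a and b and centers x, y and z.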

9 Results for cores

11 Yenta: Foner Analyzes documents “associated” with each user. Distils significant “interests” for each. Matches/clusters groups of users with overlapping interests. Decentralized; aims for privacy protection. Elements of peer-to-peer operation.

12 ReferralWeb: Kautz/Selman Establishes links between people, e.g., › co-authorship › colleagues in an organization Allows search through this social network, e.g., › find me someone within distance 2 who will referee a paper on xyz...
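A query such as "someone within distance 2 who will referee a paper on xyz" can be sketched as a bounded breadth-first search over the social graph. This is a minimal illustration, not ReferralWeb's actual implementation; the graph, the `expertise` table and the function name `within_distance` are all hypothetical.

```python
from collections import deque

def within_distance(graph, source, max_hops):
    """Map each person reachable within max_hops of source to a shortest path."""
    reached = {source: [source]}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        path = reached[person]
        if len(path) - 1 == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in sorted(graph.get(person, ())):
            if neighbor not in reached:
                reached[neighbor] = path + [neighbor]
                queue.append(neighbor)
    del reached[source]
    return reached

# Hypothetical co-authorship / colleague links (undirected, stored both ways).
graph = {
    "me": {"ann", "bob"},
    "ann": {"me", "cat"},
    "bob": {"me"},
    "cat": {"ann", "dan"},
    "dan": {"cat"},
}
# Hypothetical expertise labels.
expertise = {"cat": {"expander graphs"}, "dan": {"expander graphs"}}

referral_paths = {
    person: path
    for person, path in within_distance(graph, "me", 2).items()
    if "expander graphs" in expertise.get(person, ())
}
print(referral_paths)
```

The returned path doubles as the referral chain: it names the intermediary to ask for an introduction.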

13 Who can I ask to review a paper on “expander graphs”? Source: H. Kautz & B. Selman

14 Paths to Experts Source: H. Kautz & B. Selman

15 Observations (Source: H. Kautz & B. Selman) Official company hierarchy is only a sparse subset of the corporate social network. Shortest (and often best) paths involve a combination of official and unofficial links › Conditions for trust and evaluation may greatly differ › Global social network is the union of many different kinds of sub-networks Search greatly aided when user can choose different views of the network › types of edge › strength of edge

16 Knowledge management The big challenge: Increase productivity in knowledge workers by getting them the expertise they need at all times › the right information (documents?) › the right experts. Enterprise: group of people engaged in a collective endeavour, typically with proprietary content.

17 Enterprise knowledge mgmt. Examples - Schwartz/Wood, ReferralWeb, Yenta, PHOAKS all have some applicability. › ReferralWeb was originally devised and deployed at AT&T Labs. Enterprise knowledge management introduces some novel challenges.

18 Challenges in the enterprise Information resides in heterogeneous › formats (email, pdf, word, …) › repositories (Lotus, Exchange, Documentum, databases …) › applications (HR, ERP, Siebel, …) Need to combine structured relations (from applications) with learning.

19 Challenges in the enterprise Data security: information units have many different access classes. › e.g., compound documents have pieces, each with its own access lists. › My search should hit the doc only if it hits the pieces I can see. Knowledge security is the deeper issue. › Learning: consider class models learned from security-limited information. › Inferences (what does a recommendation tell me about confidential data?)
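The rule "my search should hit the doc only if it hits the pieces I can see" can be sketched as a per-piece access check applied before matching. The data layout (each document as a list of text pieces with allowed groups) and the name `visible_hits` are assumptions for illustration; real repositories expose access lists in their own ways.

```python
def visible_hits(documents, user_groups, term):
    """Return ids of documents where the term occurs in a piece the user may see."""
    hits = []
    for doc_id, pieces in documents.items():
        for text, allowed_groups in pieces:
            # Check access first, so hidden pieces never influence the result.
            if allowed_groups & user_groups and term in text.lower().split():
                hits.append(doc_id)
                break
    return hits

# Hypothetical compound documents: (piece text, groups allowed to read it).
documents = {
    "plan": [
        ("quarterly roadmap overview", {"all"}),
        ("merger target list", {"execs"}),
    ],
    "memo": [("merger announcement draft", {"all"})],
}

print(visible_hits(documents, {"all"}, "merger"))
print(visible_hits(documents, {"all", "execs"}, "merger"))
```

This addresses data security only. The deeper knowledge-security issue (what a learned model or a recommendation reveals about hidden pieces) is not handled by filtering of this kind.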

20 General formulation How do we combine different sources of content and context? › terms in docs › links between docs › users’ access patterns › users’ profile information.

21 General formulation Every item of interest - each term, query, doc, person, treated as a node. Impose similarity metric between pairs of nodes. Need to be able to measure proximity from sets of nodes (a person+a doc they’re viewing+a query they’ve issued) to nodes of a target type (a person).

22 Issues in formulation If a user is close to two docs d1 and d2, are the docs d1 and d2 close to each other? How do you measure proximity from a set of nodes? How do you capture collaborative (as opposed to content- and context-based) filtering? How do you succinctly represent and manipulate similarities?

23 Graph-based models Each node an entity, associated with a set of features. Pairwise similarities based on feature matches. Issues: › Not easy to do proximity from sets of nodes. › Have to maintain (quadratic) pairwise information. › Consistency.

24 Tensor-based models Turn every entity into a vector. Axes are terms, profile features, … Combination of user, context++ becomes a tensor. Measure proximity to tensors of a certain type (e.g., user, doc recommendation).

25 Context with content Docs’ content captured in term axes. Other attributes (user profile, etc.) captured in other axes. A probe consists of (1) a tensor t (say, a user vector plus a query) and (2) a type of vector to be retrieved (say, a user). Result = vectors of chosen type closest to t.
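A probe of this kind can be sketched with sparse vectors over named axes. The sketch below makes two simplifying assumptions that are not in the slides: the combination of user and query is a plain vector sum, and "closest" means highest cosine similarity. The axis names, entities and helper names are hypothetical.

```python
import math

def combine(*vectors):
    """Sum sparse vectors (dicts from axis name to weight)."""
    total = {}
    for vector in vectors:
        for axis, weight in vector.items():
            total[axis] = total.get(axis, 0.0) + weight
    return total

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(axis, 0.0) for axis, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def probe(entities, t, wanted_type, k=1):
    """Return the k entities of wanted_type closest to t."""
    scored = [
        (cosine(t, vector), name)
        for name, (entity_type, vector) in entities.items()
        if entity_type == wanted_type
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]

# Hypothetical entities: term axes and profile axes share one vector space.
entities = {
    "alice": ("user", {"term:graphs": 1.0, "term:expanders": 1.0, "dept:research": 1.0}),
    "bob": ("user", {"term:payroll": 1.0, "dept:hr": 1.0}),
    "doc1": ("doc", {"term:graphs": 1.0, "term:expanders": 1.0}),
}
me = {"dept:research": 1.0, "term:mining": 1.0}
query = {"term:expanders": 1.0}
t = combine(me, query)

print(probe(entities, t, "user"))
print(probe(entities, t, "doc"))
```

Because every entity type lives in the same space, the same probe can return people or documents simply by changing the requested type.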

26 “Standard” mining tricks Dimensionality reduction - for collaborative filtering. Hierarchical clustering - for fast near-neighbor search. Incremental indexes - real-time updates.

27 Upshot Verity social networks project. [Screenshots] Security issues remain thorny. What aspects of social behavior can we exploit in the algorithms? › Power laws

28 Power laws in mining

29 Recurring phenomena Many interesting distributions › term frequencies in a corpus › citations › in-links to web pages › document access frequencies … follow an inverse polynomial function.

30 Zipf versus power laws We call a distribution on the positive integers › a power law if it’s of the form p(i) ~ 1/i^β. › a Zipf law if p(i) ~ 1/j^β where j is the rank of i. Typically β > 1.
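A distribution of the form p(i) ~ 1/i^β (β is the name used here for the exponent) is a straight line of slope -β on a log-log plot, which suggests a quick numerical check. The sketch below fits that slope by ordinary least squares; this is a crude diagnostic chosen for brevity, not a rigorous estimator, and the function name `loglog_slope` is an assumption. Passing ranks instead of values gives the Zipf view.

```python
import math

def loglog_slope(xs, ps):
    """Least-squares slope of log(p) against log(x)."""
    log_x = [math.log(x) for x in xs]
    log_p = [math.log(p) for p in ps]
    mean_x = sum(log_x) / len(log_x)
    mean_p = sum(log_p) / len(log_p)
    covariance = sum((a - mean_x) * (b - mean_p) for a, b in zip(log_x, log_p))
    variance = sum((a - mean_x) ** 2 for a in log_x)
    return covariance / variance

# Synthetic example: an exact power law with exponent 2 on the integers 1..50.
values = list(range(1, 51))
weights = [1.0 / i ** 2 for i in values]
total = sum(weights)
probabilities = [w / total for w in weights]

slope = loglog_slope(values, probabilities)
print(round(-slope, 6))  # estimated exponent
```

On real data the fit is usually restricted to the tail and the low-count bins are noisy, so the estimate should be read as indicative only.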

31 Other Zipf/power laws Populations of US cities Degrees of internet nodes See

32 What leads to power laws? “Scale free” growth. “Highly optimized tolerance”. Behavioral models. › Model behavior of individuals in social network.

33 In-degrees on the Web graph Web in-degrees are distributed as p(i) ~ 1/i^2.1. › Consistently across many independent studies. Erdős–Rényi random graphs would not lead to such a power law. Need a new stochastic model for such graphs.

34 Random replication graphs Central thesis - random replication in an evolving graph. › Some page creators create content without regard to what exists on the web. › Many are inspired by pre-existing content. › i.e., some links are random, others are copied from pre-existing pages.

35 Model details Evolution: Nodes are created in a sequence of discrete time steps › e.g. at each time step, a new node is created with d=O(1) out-links Probabilistic copying › links go to random nodes with probability α › copy d links from a random “existing” node with probability 1 − α
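The evolution step can be simulated in a few lines. This is a sketch under stated assumptions: the random-link probability is called `alpha` here, the choice between random links and copying is made once per new node (as the slide wording suggests), and the seed graph is a single node with d self-loops. None of these details is confirmed by the slides.

```python
import random
from collections import Counter

def simulate_copying_graph(num_nodes, d, alpha, seed=0):
    """Grow a directed graph; return out_links[node] = list of d targets."""
    rng = random.Random(seed)
    # Assumed seed graph: node 0 with d self-loops, so there is always
    # something to link to or copy from.
    out_links = [[0] * d]
    for new_node in range(1, num_nodes):
        if rng.random() < alpha:
            # Links created without regard to existing structure.
            links = [rng.randrange(new_node) for _ in range(d)]
        else:
            # Copy all d links from a random existing node.
            prototype = rng.randrange(new_node)
            links = list(out_links[prototype])
        out_links.append(links)
    return out_links

def in_degree_histogram(out_links):
    """Map in-degree value to the number of nodes having that in-degree."""
    in_degree = Counter(target for links in out_links for target in links)
    return Counter(in_degree[node] for node in range(len(out_links)))

graph = simulate_copying_graph(num_nodes=2000, d=3, alpha=0.3)
histogram = in_degree_histogram(graph)
print(sorted(histogram.items())[:5])
```

Plotting the histogram on log-log axes for large runs is one way to compare the simulated in-degrees against the power-law behavior described in the theoretical results.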

36 Theoretical Results New model yields › convergence to power-law in-degrees; › number of bipartite cliques that grows with time; › evolution without copying would not yield these phenomena. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, E. Upfal (2000).

37 Compound structures “First order” structures (term frequencies, in-degrees, citations) exhibit power laws. What about “higher order” structures (pairs of terms, bipartite cliques, etc.)? Motivations: › Criteria for mining interesting higher order structures. › Tuning algorithms for higher order mining.

38 Pair frequencies for terms Analyzed several corpora of news items. Studied frequencies of k-tuples of terms (k=1, 2, …) in › corpus › documents › sentences › windows of width w. Ongoing work with P. Tsaparas.

39 Sentence log-rank vs. log-frequency

40 Pair distributions Based on term frequencies, compute pair frequencies under independence assumption. Measure actual pair frequencies. › Outliers under mutual information measure. Higher order outliers: useful for building clusters/concept maps in corpora.
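The comparison of expected and actual pair frequencies can be sketched as follows. The slide does not specify the exact mutual information measure, so pointwise mutual information at sentence level is used here as an assumption; the function name `pair_scores` and the toy sentences are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pair_scores(sentences):
    """Score each co-occurring term pair by log2(actual / expected under independence)."""
    n = len(sentences)
    term_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        terms = sorted(set(sentence))
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))
    scores = {}
    for (a, b), count in pair_counts.items():
        # Independence assumption: P(a and b) = P(a) * P(b).
        expected = (term_counts[a] / n) * (term_counts[b] / n)
        actual = count / n
        scores[(a, b)] = math.log2(actual / expected)
    return scores

# Hypothetical tokenized sentences.
sentences = [
    ["new", "york", "stocks"],
    ["new", "york", "weather"],
    ["new", "policy"],
    ["stocks", "fell"],
]
scores = pair_scores(sentences)
for pair, score in sorted(scores.items(), key=lambda item: -item[1])[:3]:
    print(pair, round(score, 3))
```

Pairs with strongly positive scores co-occur more often than independence predicts and are candidates for the outliers mentioned above. With very small counts the score is unstable, so a minimum-count filter is usually applied in practice.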

41 Pairs: independence vs. actual

42 Computational speedup Inspired by pruning algorithms in trawling. As higher order associations are built up, keep discarding obviated terms. › Docs keep getting shorter. › Fit in memory quickly. › Not an issue with relational tables?
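The pruning idea can be illustrated with a level-wise count in which terms that appear in no frequent k-tuple are discarded before (k+1)-tuples are counted, so documents shrink at every level. This is a generic Apriori-style sketch, not the authors' implementation; `min_support` and the function name are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_tuples_with_pruning(docs, min_support, max_k):
    """Return {k: {tuple: count}} for term k-tuples found in at least min_support docs."""
    docs = [set(doc) for doc in docs]
    results = {}
    for k in range(1, max_k + 1):
        counts = Counter()
        for doc in docs:
            counts.update(combinations(sorted(doc), k))
        frequent = {t: c for t, c in counts.items() if c >= min_support}
        results[k] = frequent
        # A term in no frequent k-tuple cannot occur in a frequent (k+1)-tuple,
        # so it is obviated and can be dropped from every document.
        surviving = {term for t in frequent for term in t}
        docs = [doc & surviving for doc in docs]
        # Documents with at most k terms cannot contain a (k+1)-tuple.
        docs = [doc for doc in docs if len(doc) > k]
    return results

# Hypothetical documents as bags of terms.
docs = [["a", "b", "c"], ["a", "b", "d"], ["a", "b"], ["c", "e"]]
results = frequent_tuples_with_pruning(docs, min_support=2, max_k=2)
print(results[2])
```

After the first level the terms d and e disappear and the last document is dropped entirely, which is the "docs keep getting shorter" effect.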

43 A research agenda New ways of combining content, context and collaboration in the social network. Analyze and model structures in social networks. › Tune algorithms on models. › Build on “standard” mining paradigms: associations, clustering... Incorporate enterprise constraints: › Roles and profile information from apps › Security and Privacy!

