Mining social networks for knowledge management Prabhakar Raghavan
Overview Sampling of social network studies The knowledge management challenge › Enterprise complications Extended social networks › Network and tensor-based models Power law behaviors › Why, and what they mean for mining A research agenda
Milgram’s experiments Began with volunteers from Omaha, NE. Asked to get a letter to a physician near Boston. Could only send to first-name acquaintance, to be forwarded etc. Median path length of successful deliveries was 6. Led to famous “6 degrees of separation” folklore.
cliques: Schwartz/Wood Studied (sub)graph. Proposed metrics for groups of people to share interests; cluster analysis. Qualitatively “good” results. Raised issues of ethical use of data and privacy.
Various other projects PHOAKS › Extracting heavily cited resources in newsgroups, etc. Call graphs › Discerning home, business and fax lines › Calling circles. Recommendation systems › Input: users’ product endorsements. › Output: product recommendations to each user.
Trawling bipartite cliques Take the (directed) Web link graph. Enumerate all (small) bipartite cliques. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins (1998).
Insights from hubs Link-based hypothesis: Dense bipartite subgraph ⇒ Web community. [Figure labels: Hub, Authority; Fans, Centers]
Communities from cores What is a “dense bipartite subgraph”? Define (i,j)-core: complete bipartite subgraph with i nodes all of which point to each of j others. Enumerate (i,j)-cores for various small i,j. [Figure: a (2,3)-core]
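The (i,j)-core definition can be illustrated with a brute-force enumeration. This is only a minimal sketch, not the pruning-based trawling procedure: it assumes the link graph fits in memory as a dictionary from each page to the set of pages it links to, and the function name `enumerate_cores` and the toy graph are hypothetical.

```python
from itertools import combinations

def enumerate_cores(out_links, i, j):
    """Return all (i, j)-cores as (fans, centers) pairs of sorted tuples.

    Brute force: try every set of i fans and collect the pages that all
    of them link to; any j of those pages form the centers of a core.
    """
    cores = []
    # A fan in an (i, j)-core needs at least j out-links.
    candidates = sorted(n for n, outs in out_links.items() if len(outs) >= j)
    for fans in combinations(candidates, i):
        common = set.intersection(*(set(out_links[f]) for f in fans))
        common -= set(fans)  # keep the fan side and center side disjoint
        for centers in combinations(sorted(common), j):
            cores.append((fans, centers))
    return cores

# Toy link graph (assumed data): "a" and "b" both point to x, y and z.
links = {
    "a": {"x", "y", "z"},
    "b": {"x", "y", "z", "w"},
    "c": {"x", "w"},
}
cores_23 = enumerate_cores(links, 2, 3)
print(cores_23)
```

The cost grows combinatorially with i and j, which is why only small values are practical without additional pruning.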
Results for cores
Japanese Elementary Schools (columns: Centers, Fans) The American School in Japan; The Link Page; Kids' Space; KEIMEI GAKUEN Home Page (Japanese); Shiranuma Home Page; fuzoku-es.fukui-u.ac.jp; welcome to Miasa E&J school; fukui haruyama-es HomePage; Torisu primary school; goo; Yakumo Elementary, Hokkaido, Japan; FUZOKU Home Page; Kamishibun Elementary School...; schools; LINK Page-13; 100 Schools Home Pages (English); K-12 from Japan 10/...rnet and Education ); Koulutus ja oppilaitokset; TOYODA HOMEPAGE; Education; Cay's Homepage(Japanese); UNIVERSITY; DRAGON97-TOP
Yenta: Foner Analyzes documents “associated” with each user. Distils significant “interests” for each. Matches/clusters groups of users with overlapping interests. Decentralized; aims for privacy protection. Elements of peer-to-peer operation.
ReferralWeb: Kautz/Selman Establishes links between people, e.g., › co-authorship › colleagues in an organization Allows search through this social network, e.g., › find me someone within distance 2 who will referee a paper on xyz...
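A query such as "someone within distance 2 who will referee a paper on xyz" can be sketched as a bounded breadth-first search. This is a hedged illustration rather than ReferralWeb's actual implementation: the adjacency-list graph, the `expertise` table and the function names are all assumptions made for the example.

```python
from collections import deque

def within_distance(graph, source, max_dist):
    """Return {person: distance} for everyone within max_dist hops of source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        if dist[person] == max_dist:
            continue  # do not expand beyond the distance bound
        for neighbor in graph.get(person, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[person] + 1
                queue.append(neighbor)
    del dist[source]  # the asker is not a candidate
    return dist

def find_experts(graph, expertise, source, topic, max_dist=2):
    """List people with the given topic, closest first, within max_dist hops."""
    reachable = within_distance(graph, source, max_dist)
    hits = [(d, p) for p, d in reachable.items() if topic in expertise.get(p, ())]
    return [p for _, p in sorted(hits)]

# Toy co-authorship chain (assumed data): ann - bob - cat - dan.
graph = {"ann": ["bob"], "bob": ["ann", "cat"], "cat": ["bob", "dan"], "dan": ["cat"]}
expertise = {"bob": {"databases"}, "cat": {"expander graphs"}, "dan": {"expander graphs"}}
print(find_experts(graph, expertise, "ann", "expander graphs"))
```

Here "dan" also knows the topic but sits three hops away, so the default bound of 2 excludes him.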
Who can I ask to review a paper on “expander graphs”? Source: H. Kautz & B. Selman
Paths to Experts Source: H. Kautz & B. Selman
Observations (Source: H. Kautz & B. Selman) The official company hierarchy is only a sparse subset of the corporate social network. Shortest (and often best) paths involve a combination of official and unofficial links. › Conditions for trust and evaluation may differ greatly. › The global social network is the union of many different kinds of sub-networks. Search is greatly aided when the user can choose different views of the network: › types of edge › strength of edge
Knowledge management The big challenge: Increase productivity of knowledge workers by getting them the expertise they need at all times › the right information (documents?) › the right experts. Enterprise: group of people engaged in a collective endeavour, typically with proprietary content.
Enterprise knowledge mgmt. Examples - Schwartz/Wood, ReferralWeb, Yenta, PHOAKS all have some applicability. › ReferralWeb was originally devised and deployed at AT&T Labs. Enterprise knowledge management introduces some novel challenges.
Challenges in the enterprise Information resides in heterogeneous › formats (pdf, Word, …) › repositories (Lotus, Exchange, Documentum, databases …) › applications (HR, ERP, Siebel, …) Need to combine structured relations (from applications) with learning.
Challenges in the enterprise Data security: information units have many different access classes. › e.g., compound documents have pieces, each with its own access lists. › My search should hit the doc only if it hits the pieces I can see. Knowledge security is the deeper issue. › Learning: consider class models learned from security-limited information. › Inferences (what does a recommendation tell me about confidential data?)
General formulation How do we combine different sources of content and context? › terms in docs › links between docs › users’ access patterns › users’ profile information.
General formulation Every item of interest - each term, query, doc, person, treated as a node. Impose similarity metric between pairs of nodes. Need to be able to measure proximity from sets of nodes (a person+a doc they’re viewing+a query they’ve issued) to nodes of a target type (a person).
Issues in formulation If a user is close to two docs d1 and d2, are the docs d1 and d2 close to each other? How do you measure proximity from a set of nodes? How do you capture collaborative (as opposed to content- and context-based) filtering? How do you succinctly represent and manipulate similarities?
Graph-based models Each node an entity, associated with a set of features. Pairwise similarities based on feature matches. Issues: › Not easy to do proximity from sets of nodes. › Have to maintain (quadratic) pairwise information. › Consistency.
Tensor-based models Turn every entity into a vector. Axes are terms, profile features, … Combination of user, context++ becomes a tensor. Measure proximity to tensors of a certain type (e.g., user, doc recommendation).
Context with content Docs’ content captured in term axes. Other attributes (user profile, etc.) captured in other axes. A probe consists of (1) a tensor t (say, a user vector plus a query) and (2) a type of vector to be retrieved (say, a user). Result = vectors of chosen type closest to t.
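The probe mechanism can be sketched with sparse vectors over shared axes. This is a simplified illustration under stated assumptions, not the model from the talk: the probe is formed by plain vector addition rather than a true tensor, proximity is cosine similarity, and the axis names, entity table and function names are hypothetical.

```python
import math

def add(u, v):
    """Add two sparse vectors stored as {axis: weight} dicts."""
    out = dict(u)
    for axis, w in v.items():
        out[axis] = out.get(axis, 0.0) + w
    return out

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(axis, 0.0) for axis, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def probe(entities, parts, target_type, k=1):
    """Combine the vectors in `parts` and return the k closest entities of target_type.

    entities maps name -> (type, vector).
    """
    t = {}
    for vec in parts:
        t = add(t, vec)
    scored = [(cosine(t, vec), name)
              for name, (typ, vec) in entities.items() if typ == target_type]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by name
    return [name for _, name in scored[:k]]

# Assumed toy data: term axes and profile axes share one vector space.
entities = {
    "alice": ("user", {"term:graphs": 1.0, "dept:research": 1.0}),
    "bob": ("user", {"term:sales": 1.0, "dept:marketing": 1.0}),
    "doc1": ("doc", {"term:graphs": 2.0, "term:expanders": 1.0}),
}
query = {"term:expanders": 1.0, "term:graphs": 1.0}
print(probe(entities, [entities["doc1"][1], query], "user"))
```

Restricting the candidates by type is what lets the same combined probe return either people or documents.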
“Standard” mining tricks Dimensionality reduction - for collaborative filtering. Hierarchical clustering - for fast near-neighbor search. Incremental indexes - real-time updates.
Upshot Verity social networks project. [Screenshots] Security issues remain thorny. What aspects of social behavior can we exploit in the algorithms? › Power laws
Power laws in mining
Recurring phenomena Many interesting distributions › term frequencies in a corpus › citations › in-links to web pages › document access frequencies … follow an inverse polynomial function.
Zipf versus power laws We call a distribution on the positive integers › a power law if it’s of the form $p(i) \sim 1/i^{\alpha}$. › a Zipf law if $p(i) \sim 1/j^{\alpha}$ where j is the rank of i. Typically $\alpha > 1$.
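One rough way to check for a Zipf-style law is to sort frequencies by rank and fit a straight line on log-log axes; the negated slope estimates the exponent (written α here). The sketch below is illustrative only: the function names are hypothetical, the data is synthetic, and a least-squares fit on log-log axes is a crude estimator rather than a rigorous test.

```python
import math
from collections import Counter

def rank_frequency(items):
    """Return [(rank, frequency), ...] with rank 1 for the most frequent item."""
    freqs = sorted(Counter(items).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def loglog_slope(points):
    """Least-squares slope of log(y) against log(x) for (x, y) points."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic rank-frequency data that follows 1/j^1.5 exactly (assumed example).
points = [(j, 1000.0 / j ** 1.5) for j in range(1, 51)]
alpha_hat = -loglog_slope(points)
print(round(alpha_hat, 6))
print(rank_frequency("aaabbc"))
```

On real corpora the fit is usually restricted to a range of ranks, since the head and tail of the distribution tend to deviate from a straight line.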
Other Zipf/power laws Populations of US cities Degrees of internet nodes See
What leads to power laws? “Scale free” growth. “Highly optimized tolerance”. Behavioral models. › Model behavior of individuals in social network.
In-degrees on the Web graph Web in-degrees are distributed as $p(i) \sim 1/i^{2.1}$. › Consistently across many independent studies. Erdős-Rényi random graphs would not lead to such a power law. Need a new stochastic model for such graphs.
Random replication graphs Central thesis - random replication in an evolving graph. › Some page creators create content without regard to what exists on the web. › Many are inspired by pre-existing content. › i.e., some links are random, others are copied from pre-existing pages.
Model details Evolution: Nodes are created in a sequence of discrete time steps › e.g. at each time step, a new node is created with d=O(1) out-links Probabilistic copying › links go to random nodes with probability α › copy d links from a random “existing” node with probability 1 − α
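The evolution step can be simulated in a few lines. This is a sketch under stated assumptions, not the exact model analyzed in the cited work: the parameter name `alpha` (the probability that a new node links at random), the fully connected seed graph of d+1 nodes, and the tolerance for repeated links are all choices made for the example.

```python
import random

def copying_graph(n, d, alpha, seed=0):
    """Grow a directed graph of n nodes, each with d out-links.

    With probability alpha a new node links to d uniformly random existing
    nodes; otherwise it copies the d out-links of a random existing node.
    """
    rng = random.Random(seed)
    # Assumed seed graph: d+1 nodes, each linking to all the others.
    out_links = [[v for v in range(d + 1) if v != u] for u in range(d + 1)]
    for new in range(d + 1, n):
        if rng.random() < alpha:
            links = [rng.randrange(new) for _ in range(d)]  # repeats allowed
        else:
            prototype = rng.randrange(new)
            links = list(out_links[prototype])
        out_links.append(links)
    return out_links

def in_degrees(out_links):
    """Count incoming links per node."""
    deg = [0] * len(out_links)
    for links in out_links:
        for v in links:
            deg[v] += 1
    return deg

g = copying_graph(2000, 3, alpha=0.3)
deg = in_degrees(g)
print(max(deg), sum(deg) / len(deg))
```

Copying is what lets already popular targets gain links faster; setting `alpha=1.0` removes it and gives a baseline for comparison.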
Theoretical Results New model yields › convergence to power-law in-degrees; › number of bipartite cliques that grows with time; › evolution without copying would not yield these phenomena. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, E. Upfal (2000).
Compound structures “First order” structures (term frequencies, in-degrees, citations) exhibit power laws. What about “higher order” structures (pairs of terms, bipartite cliques, etc.)? Motivations: › Criteria for mining interesting higher order structures. › Tuning algorithms for higher order mining.
Pair frequencies for terms Analyzed several corpora of news items. Studied frequencies of k-tuples of terms (k=1, 2, …) in › corpus › documents › sentences › windows of width w. Ongoing work with P. Tsaparas.
Sentence log-rank vs. log-frequency
Pair distributions Based on term frequencies, compute pair frequencies under independence assumption. Measure actual pair frequencies. › Outliers under mutual information measure. Higher order outliers: useful for building clusters/concept maps in corpora.
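The comparison of actual pair frequencies against the independence assumption can be sketched as follows. This is an illustrative example only: it scores pairs by pointwise mutual information, treats each input list as one unit (a document, sentence or window), and the function name, the `min_count` threshold and the toy data are assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def pair_outliers(units, min_count=2):
    """Score term pairs by pointwise mutual information, highest first.

    Returns (pmi, pair, observed_count, expected_count) tuples, where the
    expected count assumes the two terms occur independently.
    """
    n = len(units)
    term_count = Counter()
    pair_count = Counter()
    for unit in units:
        terms = sorted(set(unit))  # count each term once per unit
        term_count.update(terms)
        pair_count.update(combinations(terms, 2))
    scored = []
    for (a, b), observed in pair_count.items():
        if observed < min_count:
            continue  # rare pairs give unreliable scores
        expected = term_count[a] * term_count[b] / n
        scored.append((math.log2(observed / expected), (a, b), observed, expected))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return scored

# Toy units (assumed data): "new" and "york" co-occur more than chance predicts.
units = [
    ["new", "york", "times"],
    ["new", "york", "city"],
    ["new", "car"],
    ["the", "times"],
    ["the", "car"],
    ["the", "city"],
]
top = pair_outliers(units)[0]
print(top)
```

Pairs with a large positive score are the outliers of interest; the same counting extends to k-tuples at higher cost.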
Pairs: independence vs. actual
Computational speedup Inspired by pruning algorithms in trawling. As higher order associations are built up, keep discarding obviated terms. › Docs keep getting shorter. › Fit in memory quickly. › Not an issue with relational tables?
A research agenda New ways of combining content, context and collaboration in the social network. Analyze and model structures in social networks. › Tune algorithms on models. › Build on “standard” mining paradigms: associations, clustering... Incorporate enterprise constraints: › Roles and profile information from apps › Security and Privacy!