Concept Switching Azadeh Shakery
Concept Switching: Problem Definition
[Figure labels: C1, C2, …, Ck]
Past Work: A Programming Language for Mining Fuzzy ER Graphs
[Figure: example graph; node labels: Forager, Rover, bee, fly, g1, g2, g3, Behavior, Term, gene1, gene2, …]
Past Work: A Programming Language for Mining Fuzzy ER Graphs
Operators:
–Neighbor Finding: NBSet, WNBSet
–Path Finding: Shortestpath, Wpath
–Set Operators: Union, Intersect, Cardinality, topk
Added Features:
–Type Definition
–Function Definition
–Seq. Operators: Project, Reverse, Seq2Set, Aggregate
Past Work: High Level Scripts for Entity Comparison
Based on intersection and union of neighbors: |NB(e1) ∩ NB(e2)| / |NB(e1) ∪ NB(e2)|
–Tehran, Iran: 27/52
–Baghdad, Iran: 11/52
–Washington, Iran: 0
Based on the shortest path between the two entities
–gpcr__g_protein__plc__diacylglycerol
–bush__leader__khomeini
Based on the length of the shortest path to a base entity
Connection to a center node: |NB(e) ∩ NB(c)| / |NB(e) ∪ NB(c)|
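The two comparison scripts above (neighbor overlap and shortest path) can be sketched in plain Python. This is a minimal illustration, not the scripting language from the slides: the function names `build_graph`, `neighbor_similarity` and `shortest_path` are hypothetical, the graph is treated as unweighted and undirected, and the toy edges are invented and unrelated to the Tehran/Iran figures.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbors)} (hypothetical helper)."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def neighbor_similarity(graph, e1, e2):
    """|NB(e1) ∩ NB(e2)| / |NB(e1) ∪ NB(e2)|; 0.0 if both neighbor sets are empty."""
    nb1 = graph.get(e1, set())
    nb2 = graph.get(e2, set())
    union = nb1 | nb2
    if not union:
        return 0.0
    return len(nb1 & nb2) / len(union)


def shortest_path(graph, source, target):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nb in sorted(graph[node]):  # sorted only to make the result deterministic
            if nb not in parent:
                parent[nb] = node
                queue.append(nb)
    return None


# Invented toy data, for illustration only.
edges = [("e1", "a"), ("e1", "b"), ("e2", "b"), ("e2", "c"), ("c", "e3")]
graph = build_graph(edges)

print(neighbor_similarity(graph, "e1", "e2"))
print("__".join(shortest_path(graph, "e1", "e3")))
```

The path is joined with `__` only to mirror the way paths are written on the slide.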
Current Work: Topic/Concept Map
Alternative way of accessing information
Create an index of information which resides outside that information
The topic map describes the information in the documents and databases
Multi-Resolution Topic Maps
[Figure labels: WORDS, Word Net, High Level Concepts; Low resolution, High resolution]
Multi-Resolution Topic Map
Static
–Discrete Navigation
–Challenges:
  Define resolution
  Community finding algorithm
  Summarize communities
  Define distance between communities
  Between which communities do we allow the navigation?
Dynamic
–Continuous Navigation
–Challenge:
  Define resolution
  Online community finding algorithm
  Summarize communities
Challenges
Resolution definition
– : Resolution
–{C1, C2, …, Ck}: Communities at this level
–One way is to define the resolution as the link strength threshold
– 0 : all links, : No links
Community finding algorithm
Community distance:
–C1, C2, Similarity(C1, C2) =? |C1 ∩ C2| / |C1 ∪ C2|
  Works if communities are allowed to have intersection
Community summarization
Low resolution: Low threshold
High resolution: High threshold
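A minimal sketch of the two ideas on this slide: resolution as a link strength threshold, and community similarity as set overlap. The function names and the toy link weights are hypothetical, and keeping links with strength greater than or equal to the threshold is an assumption, chosen so that a threshold of 0 keeps all links as the slide states.

```python
def links_at_threshold(weighted_links, threshold):
    """Keep links whose strength is >= threshold (assumed comparison)."""
    return [(u, v, w) for (u, v, w) in weighted_links if w >= threshold]


def community_similarity(c1, c2):
    """|C1 ∩ C2| / |C1 ∪ C2|; only informative when communities may overlap."""
    c1, c2 = set(c1), set(c2)
    union = c1 | c2
    if not union:
        return 0.0
    return len(c1 & c2) / len(union)


# Invented link strengths, for illustration only.
weighted_links = [("a", "b", 0.08), ("b", "c", 0.04), ("c", "d", 0.02)]

for thr in (0.0, 0.03, 0.05):
    print(thr, len(links_at_threshold(weighted_links, thr)))

print(community_similarity({"a", "b", "c"}, {"b", "c", "d"}))
```

With disjoint communities the similarity is always 0, which is why the measure only works when communities are allowed to intersect.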
Community Summarization
Use the documents to do the summarization
Summarize based on the community nodes
–Define center nodes to do the summarization:
  Based on the average MI distance to the other nodes in the community
  –Slow on very large communities
  Based on the degree of the nodes
  –Counts all neighbors as equally important
  Based on a PageRank-like algorithm:
  –Each node has a centrality value
  –In each step, each node distributes its centrality to its neighbors in proportion to the strength of the link
  –Do this iteratively until the centrality values converge
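The PageRank-like option above can be sketched as follows. This is one possible reading of the slide, not the author's implementation: the function name `weighted_centrality` is hypothetical, and the damping factor (0.85), the uniform starting values and the convergence tolerance are assumptions added so that the iteration converges even on bipartite graphs such as the star used here.

```python
def weighted_centrality(graph, damping=0.85, tol=1e-10, max_iter=1000):
    """graph: {node: {neighbor: link_strength}}, symmetric for undirected links.

    Each step, every node passes its centrality to its neighbors in proportion
    to link strength. damping, tol and max_iter are assumed parameters.
    """
    nodes = list(graph)
    n = len(nodes)
    centrality = {v: 1.0 / n for v in nodes}
    strength = {v: sum(graph[v].values()) for v in nodes}
    for _ in range(max_iter):
        new = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            if strength[u] == 0:
                # Isolated node: spread its centrality evenly over all nodes.
                for v in nodes:
                    new[v] += damping * centrality[u] / n
                continue
            for v, w in graph[u].items():
                new[v] += damping * centrality[u] * w / strength[u]
        change = sum(abs(new[v] - centrality[v]) for v in nodes)
        centrality = new
        if change < tol:
            break
    return centrality


# Invented toy community: a hub linked to three leaves with equal strength.
graph = {
    "hub": {"a": 1.0, "b": 1.0, "c": 1.0},
    "a": {"hub": 1.0},
    "b": {"hub": 1.0},
    "c": {"hub": 1.0},
}
centrality = weighted_centrality(graph)
center = max(centrality, key=centrality.get)
print(center)
```

The node with the highest value would then serve as the center node used to summarize the community.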
Community Finding Algorithms: Newman’s Algorithm
Newman’s algorithm for detecting community structure in networks:
–Modularity: a measure of the quality of a particular division of a network
–Modularity is the fraction of the edges in the network that connect vertices of the same type (within a community) minus the expected value of the same quantity in the same network with random connections
–Consider different divisions of the graph into communities and find the division that maximizes the modularity
–The number of distinct community divisions grows exponentially in the number of nodes
–They use a greedy algorithm to approximate the optimum
–The algorithm runs in O((m + n)n) time for a network with m edges and n nodes
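To make the modularity measure concrete, the sketch below computes it for an unweighted, undirected graph and runs a naive greedy agglomeration that repeatedly merges the pair of communities giving the highest modularity. The function names are hypothetical, and this brute-force version recomputes modularity for every candidate merge, so it illustrates the idea only and does not achieve the O((m + n)n) running time of Newman's algorithm.

```python
def modularity(edges, community_of):
    """Fraction of within-community edges minus its expected value at random."""
    m = len(edges)
    if m == 0:
        return 0.0
    within = {}  # community -> number of edges inside it
    ends = {}    # community -> number of edge ends attached to it
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        ends[cu] = ends.get(cu, 0) + 1
        ends[cv] = ends.get(cv, 0) + 1
        if cu == cv:
            within[cu] = within.get(cu, 0) + 1
    return sum(within.get(c, 0) / m - (ends[c] / (2 * m)) ** 2 for c in ends)


def greedy_communities(edges):
    """Naive greedy agglomeration; returns the best division seen and its modularity."""
    nodes = sorted({x for e in edges for x in e})
    community_of = {v: v for v in nodes}  # start with one community per node
    best, best_q = dict(community_of), modularity(edges, community_of)
    while len(set(community_of.values())) > 1:
        labels = sorted(set(community_of.values()))
        top = None
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                # Try merging community b into community a.
                trial = {v: (a if c == b else c) for v, c in community_of.items()}
                q = modularity(edges, trial)
                if top is None or q > top[0]:
                    top = (q, trial)
        q, community_of = top
        if q > best_q:
            best, best_q = dict(community_of), q
    return best, best_q


# Invented toy graph: two triangles joined by a single edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
division, q = greedy_communities(edges)
print(division, q)
```

On this toy graph the greedy merges recover the two triangles, and putting all nodes in one community gives modularity 0.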
Newman’s Algorithm
Communities are of very different sizes
–A few very large communities and a lot of small communities
No overlapping communities
–Definition of neighbor communities is hard
Experiments on bee data:
–1200 records about Apis mellifera (honey bee)
–Thr = Results
Community Finding Algorithms: CPM
Clique Percolation Method (CPM)
–Locates the k-clique communities of unweighted, undirected networks.
–Observation: A typical member of a community is linked to many other members, but not necessarily to all other nodes.
–A community can be interpreted as a union of smaller complete subgraphs that share nodes.
–A k-clique community is defined as the union of all k-cliques that can be reached from each other through a series of adjacent k-cliques.
–Two k-cliques are said to be adjacent if they share k-1 nodes.
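The k-clique community definition above can be applied directly on a small graph. The sketch below enumerates all k-cliques by brute force and merges adjacent ones with a union-find structure; the function names are hypothetical, and the brute-force enumeration is an assumption made for brevity (it is only practical for small graphs and is not how a production CPM implementation finds cliques).

```python
from itertools import combinations


def build_graph(edges):
    """Undirected adjacency map {node: set(neighbors)} (hypothetical helper)."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def k_clique_communities(graph, k):
    """Union of all k-cliques reachable through adjacent k-cliques (sharing k-1 nodes)."""
    nodes = sorted(graph)
    # Brute force: every k-subset whose nodes are pairwise linked is a k-clique.
    cliques = [
        frozenset(c)
        for c in combinations(nodes, k)
        if all(v in graph[u] for u, v in combinations(c, 2))
    ]
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:  # adjacent k-cliques
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())


# Invented toy graph: triangles (1,2,3) and (2,3,4) are adjacent;
# triangle (4,5,6) shares only node 4 with them.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]
graph = build_graph(edges)
communities = k_clique_communities(graph, 3)
print(communities)
```

Node 4 ends up in both communities, which illustrates how the method allows overlaps.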
Properties of CPM
Not too restrictive (compared to cliques)
Based on the density of links
Local
Does not yield cut-nodes or cut-links (whose removal would disjoin the community)
Allows overlaps
Results
thr = 0.05
–228 nodes
–1197 edges
–CPM: 0 min sec; Newman: 0 min 0.11 sec
–16 communities of more than one node
thr = 0.04
–312 nodes
–1483 edges
–CPM: 0 min sec; Newman: 0 min 0.21 sec
–20 communities of more than one node
thr = 0.03
–507 nodes
–2924 edges
–CPM: 0 min sec; Newman: 0 min 0.49 sec
–29 communities of more than one node
thr = 0.01
–4349 nodes
–28595 edges
–CPM: 5 min sec; Newman: 1 min sec
–103 communities of more than one node
Sample of Resolution Change neural nervous coordination brain proboscis extension conditioning learning system mushroom Homeostasis olfactory juvenile hormone endocrine bodies antennal conditioned chemical reflex proboscis extension conditioning learning conditioned olfactory reflex neural Nervous coordination brain system mushroom bodies neurons homeostasis chemical coordination juvenile hormone jh endocrine
Concept Switching
Construct a topic map for each collection separately
Construct one universal topic map
Discussion
Better ideas for community summarization?
Dynamic via static topic maps?
Alternative ways of defining resolution?
Thank you Questions?