JJE: INEX XML Competition Bryan Clevenger, James Reed, Jon McElroy
Introduction Deal with the large size of the internet by using better categorization techniques. Goal: optimize search time by grouping pages into clusters. Wikipedia is the data source.
Problem Take the Wikipedia data and create an algorithm that clusters it. Clustering reduces the search space for related information.
Solution If documents contain several similar links, they likely contain similar data. We focused on the link data set.
Overall solution Determine sub-communities in the graph using Max-Flow/Min-Cut community discovery. Heuristics are used to find relevant seeds.
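Max Flow – Min Cut Edge capacity – similar to edge weight. Represents the “amount” of information that can be pushed along an edge. Flow – the amount pushed from one node to another. Each path carries at most the minimum capacity of its edges, and the maximum flow is the largest total that can be pushed along all paths at once without exceeding any edge's capacity.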
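To make the capacity and flow definitions concrete, here is a minimal sketch of a maximum-flow computation using the Edmonds-Karp variant of Ford-Fulkerson. The function name `max_flow`, the dict-of-dicts input, and the toy graph are assumptions for illustration, not the code used in the project.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest residual paths until none remain."""
    if source == sink:
        raise ValueError("source and sink must differ")
    # Residual capacities, with a zero-capacity reverse edge for every edge.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)
    total = 0
    while True:
        # Breadth-first search for a path that still has spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # The path can carry only as much as its minimum-capacity edge.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

# Hypothetical toy graph: node -> {neighbour: capacity}.
toy = {"s": {"a": 3, "b": 2}, "a": {"t": 2, "b": 1}, "b": {"t": 3}, "t": {}}
print(max_flow(toy, "s", "t"))
```

On a single path such as s → a → t with capacities 4 and 1, the function returns 1, the minimum capacity along that path.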
Max Flow – Min Cut (cont.) The flow between two nodes in the same cluster should be larger than flow between two nodes in separate clusters.
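The claim above can be checked on a small example. The sketch below assumes an undirected graph of two triangles joined by a single bridge edge, all with capacity 1; the helper names `bfs_parents` and `min_cut_side` and the node labels are hypothetical. It returns the flow value and the set of nodes left on the source side of a minimum cut, which is one way a cluster can be read off.

```python
from collections import defaultdict, deque

def bfs_parents(residual, source):
    """Parent map for every node reachable through edges with spare capacity."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, cap in residual[u].items():
            if cap > 0 and v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def min_cut_side(edges, source, sink):
    """Return (flow value, nodes on the source side of a minimum cut)."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, v, cap in edges:  # undirected edge: capacity in both directions
        residual[u][v] += cap
        residual[v][u] += cap
    flow = 0
    while True:
        parent = bfs_parents(residual, source)
        if sink not in parent:
            # No augmenting path left: the reachable set is the source side.
            return flow, set(parent)
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck

# Two tightly linked triangles joined by one bridge edge (a3, b1).
edges = [("a1", "a2", 1), ("a2", "a3", 1), ("a1", "a3", 1),
         ("b1", "b2", 1), ("b2", "b3", 1), ("b1", "b3", 1),
         ("a3", "b1", 1)]
across_flow, community = min_cut_side(edges, "a1", "b3")
within_flow, _ = min_cut_side(edges, "a1", "a2")
print(within_flow, across_flow, sorted(community))
```

Here the flow inside a triangle is larger than the flow across the bridge, and cutting the bridge separates the first triangle as the source-side community.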
Max Flow – Min Cut (cont.)
Max-Flow Community Discovery
Implementation
Implementation (Parsing) Links are parsed into a graph. Graph: a HashMap from document ID to a HashMap of link IDs to capacity. A Links structure was created:
Links[0] = 3244,2645,791
Links[1] = 10293,432,2,
Links[max] = 1012
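A minimal sketch of this structure in Python, using nested dicts in place of HashMaps. It assumes that the position in the Links list is the document ID, that each entry is a comma-separated string of link IDs, and that every link starts with capacity 1; the name `build_graph` and the `default_capacity` parameter are hypothetical.

```python
def build_graph(links, default_capacity=1):
    """Map document id -> {linked document id: capacity}."""
    graph = {}
    for doc_id, targets in enumerate(links):
        neighbours = graph.setdefault(doc_id, {})
        for token in targets.split(","):
            token = token.strip()
            if token:  # skip the empty piece left by a trailing comma
                neighbours[int(token)] = default_capacity
    return graph

# The first two entries shown on the slide.
links = ["3244,2645,791", "10293,432,2,"]
graph = build_graph(links)
print(graph[0])
```

A uniform starting capacity is only one possible choice; the nested mapping leaves room to assign per-link capacities later.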
Implementation (Initialization of Community Seeds) Using the Links structure, a percentage of the nodes with the most links is chosen as seeds.
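One way to sketch this seed heuristic, assuming "most links" means the highest number of outgoing links and that ties are broken by document ID. The function name `choose_seeds`, the `fraction` parameter, and the toy graph are assumptions for illustration.

```python
import math

def choose_seeds(graph, fraction=0.1):
    """Return the given fraction of document ids with the most outgoing links."""
    count = max(1, math.ceil(len(graph) * fraction))
    # Sort by descending link count, then by id so the result is deterministic.
    ranked = sorted(graph, key=lambda doc_id: (-len(graph[doc_id]), doc_id))
    return ranked[:count]

# Hypothetical toy graph: document id -> {linked id: capacity}.
toy_graph = {0: {1: 1, 2: 1, 3: 1}, 1: {0: 1}, 2: {0: 1, 1: 1}, 3: {}}
print(choose_seeds(toy_graph, fraction=0.5))
```

The fraction trades coverage against running time: more seeds mean more communities to extract.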
Implementation (Finding Communities) Idea, why it didn’t work? robots
Implementation (Visualization) Walrus is an interactive 3D visualization tool that works on large directed graphs. Involved input and output parsing. Clusters are grouped by color.
Results The INEX links data was composed of 54,000 nodes and 15 million links. Average running time to retrieve one cluster on a Dell dual-core 2.0 GHz Pentium laptop was 5.9 hours. Cluster size is between K
Results Visual Images of clusters
Conclusion It worked... kinda. Looks great! See pretty pictures.
Questions? O really?