Published by Jasmine Pearson. Modified over 9 years ago.
1
JJE: INEX XML Competition. Bryan Clevenger, James Reed, Jon McElroy
2
Introduction Deal with the large size of the internet by using better categorization techniques. Goal: optimize search time by grouping pages into clusters. Wikipedia is the data source.
3
Problem Take the Wikipedia data and create a clustering algorithm that groups related documents into clusters. This reduces the search space for related information.
4
Solution If documents contain several of the same links, they likely contain similar data. Focused on the link data set. Link data: 39484 2039 4952 1029 39 1920 10233 30197
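The link-overlap intuition can be illustrated with a small sketch. The Jaccard measure and the name `link_overlap` are assumptions made for illustration; the slides do not say how overlap was measured, and the grouping of ids into documents below is made up.

```python
def link_overlap(links_a, links_b):
    """Jaccard overlap of two documents' link ids (0.0 to 1.0)."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0  # two documents without links share nothing
    return len(a & b) / len(a | b)

# Hypothetical documents: doc_x and doc_y share three of their links.
doc_x = [39484, 2039, 4952, 1029]
doc_y = [39484, 2039, 4952, 30197]
doc_z = [39, 1920, 10233]

print(link_overlap(doc_x, doc_y))  # 3 shared / 5 distinct = 0.6
print(link_overlap(doc_x, doc_z))  # no shared links = 0.0
```

Under this measure, doc_x and doc_y would be candidates for the same cluster, while doc_z would not.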
5
Overall solution Determine sub-communities in the graph using Max-Flow/Min-Cut community discovery. Heuristics are used to find relevant seeds.
6
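Max Flow – Min Cut Edge capacity: similar to an edge weight; it represents the “amount” of information that can be pushed along the edge. Flow: a single path can carry at most the minimum capacity of its edges, and the maximum flow from one node to another is the largest total that all paths between them can carry together.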
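A minimal sketch of the maximum flow computation, assuming the Edmonds-Karp variant of Ford-Fulkerson (the slides do not say which variant was used). The function name `max_flow`, the dict-of-dicts layout, and the toy graph are illustrative assumptions.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: repeatedly push flow along shortest augmenting paths."""
    # Residual capacities, with a zero-capacity reverse edge for every edge.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)
    total = 0
    while True:
        # Breadth-first search for a path that still has spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left
        # A path can carry only as much as its smallest edge capacity.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push that amount along the path and update the residual graph.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

# Toy directed graph (hypothetical): node -> {neighbor: capacity}.
capacity = {
    "s": {"a": 3, "b": 2},
    "a": {"b": 1, "t": 2},
    "b": {"t": 3},
    "t": {},
}
print(max_flow(capacity, "s", "t"))  # 5
```

Here the paths s-a-t, s-b-t and s-a-b-t carry 2, 2 and 1 units, each limited by its smallest edge.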
7
Max Flow – Min Cut (cont.) The flow between two nodes in the same cluster should be larger than the flow between two nodes in separate clusters.
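The claim above can be checked on a toy graph. The sketch below is a simplified illustration rather than the authors' implementation: it assumes undirected links and hypothetical capacities, and the names `min_cut_side` and `undirected` are made up. After the maximum flow is found, the nodes still reachable from the source form its side of a minimum cut, which serves here as the community.

```python
from collections import deque

def min_cut_side(capacity, source, sink):
    """Return (flow value, nodes on the source side of a minimum cut)."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)

    def reachable():
        # Breadth-first search over edges that still have spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    parent = reachable()
    while sink in parent:
        # Collect the path, find its bottleneck, and push flow along it.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck
        parent = reachable()
    # No augmenting path is left: reachable nodes are the source side.
    return flow, set(parent)

def undirected(edges):
    """Build a symmetric capacity map from (u, v, capacity) triples."""
    graph = {}
    for u, v, cap in edges:
        graph.setdefault(u, {})[v] = cap
        graph.setdefault(v, {})[u] = cap
    return graph

graph = undirected([
    (1, 2, 3), (1, 3, 3), (2, 3, 3),  # tightly linked group A
    (4, 5, 3), (4, 6, 3), (5, 6, 3),  # tightly linked group B
    (3, 4, 1),                        # single weak link between the groups
])

print(min_cut_side(graph, 1, 2)[0])  # flow inside group A: 6
print(min_cut_side(graph, 1, 6))     # across groups: (1, {1, 2, 3})
```

The flow inside group A is six times the flow across the weak link, and cutting that link separates nodes 1, 2 and 3 from the rest.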
8
Max Flow – Min Cut (cont.)
9
Max-Flow Community Discovery
10
Implementation
11
Implementation (Parsing) Links are parsed into a Graph. Graph: a HashMap from Document Id to a HashMap of Link Ids to Capacity. A Links structure was also created: Links[0] = 3244,2645,791; Links[1] = 10293,432,2,1230...; Links[max] = 1012
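A minimal sketch of the Graph structure, with Python dicts standing in for the HashMaps. It assumes each input line holds a document id followed by the ids it links to, and that every link starts with capacity 1; neither detail is stated in the slides, and `parse_links` is a hypothetical name.

```python
def parse_links(lines, capacity=1):
    """Build {document id: {link id: capacity}} from whitespace-separated ids."""
    graph = {}
    for line in lines:
        ids = [int(tok) for tok in line.split()]
        if not ids:
            continue  # skip blank lines
        doc_id, link_ids = ids[0], ids[1:]
        edges = graph.setdefault(doc_id, {})
        for link_id in link_ids:
            edges[link_id] = capacity  # assumed uniform starting capacity
    return graph

# Hypothetical input lines; the split into documents is illustrative.
sample = [
    "39484 2039 4952 1029",
    "39 1920 10233 30197",
]
graph = parse_links(sample)
print(graph[39484])  # {2039: 1, 4952: 1, 1029: 1}
```

The nested map gives constant-time lookup of a link's capacity, which the flow computation reads and updates repeatedly.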
12
Implementation (Initialization of Community Seeds) Using the Links structure, a percentage of the nodes with the most links are chosen as seeds.
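Seed selection might look like the sketch below. It ranks nodes directly from a graph map instead of the Links structure, counts outgoing links only, and breaks ties by id; these choices, the `fraction` parameter, and the name `pick_seeds` are all assumptions.

```python
import math

def pick_seeds(graph, fraction):
    """Return the top `fraction` of nodes, ranked by number of links."""
    # Most links first; ties are broken by the smaller node id.
    ranked = sorted(graph, key=lambda node: (-len(graph[node]), node))
    if not ranked:
        return []
    count = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:count]

# Hypothetical graph: node -> {linked node: capacity}.
graph = {
    1: {2: 1, 3: 1, 4: 1},
    2: {1: 1},
    3: {1: 1, 4: 1},
    4: {},
}
print(pick_seeds(graph, 0.5))  # [1, 3]
```

Rounding up guarantees at least one seed even when the fraction is small relative to the graph size.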
13
Implementation (Finding Communities) Idea, why it didn’t work? robots
14
Implementation (Visualization) Walrus is an interactive 3D visualization tool that works on large directed graphs. Input and output Parsing. Grouped clusters by colors.
15
Results The INEX links data was composed of 54,000 nodes and 15 million links. Average running time to retrieve one cluster on a DELL Duo Core 2.0 GHz Pentium Laptop was 5.9 hours. Cluster size is between 2 K and 2.5 K.
16
Results Visual Images of clusters
17
Conclusion It worked... kinda. Looks great! See pretty pictures.
19
Questions? O really?