1
INEX 2009 XML Mining Track
James Reed, Jonathan McElroy, Brian Clevenger
2
Introduction
INEX is an initiative looking into the use of XML retrieval.
The clustering task draws on the fields of information retrieval, data mining, machine learning and XML.
Goal: to measure how well clustering methods work for retrieving collections from large sets of documents, and to measure performance specifically for XML IR.
3
Problem
Task: to test the Jardine hypothesis (the cluster hypothesis), which states: “documents that cluster together have a similar relevance to a given query.”
If true: only a small fraction of clusters needs to be searched, increasing the throughput of an IR system.
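To make the throughput argument concrete, here is a minimal sketch of cluster-based retrieval: clusters are ranked by how similar their centroid is to the query, and only documents in the top-ranked clusters are scored. The function names (`cosine`, `centroid`, `cluster_search`), the use of cosine similarity, and the toy documents are assumptions for illustration; the slides do not say how clusters are selected.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(docs):
    """Sum the term counts of all documents in a cluster."""
    total = Counter()
    for doc in docs:
        total.update(doc)
    return total

def cluster_search(query, clusters, top_clusters=1):
    """Score only the documents in the clusters closest to the query."""
    ranked = sorted(
        clusters,
        key=lambda c: cosine(query, centroid(c.values())),
        reverse=True,
    )
    hits = []
    for cluster in ranked[:top_clusters]:
        for doc_id, vec in cluster.items():
            hits.append((cosine(query, vec), doc_id))
    return [doc_id for _, doc_id in sorted(hits, reverse=True)]

# Hypothetical toy collection: two clusters of two documents each.
clusters = [
    {"d1": {"bracelet": 2, "gold": 1}, "d2": {"bracelet": 1, "silver": 1}},
    {"d3": {"depend": 3}, "d4": {"depend": 1, "inject": 2}},
]
print(cluster_search({"bracelet": 1}, clusters))  # ['d1', 'd2']
```

With `top_clusters=1`, documents `d3` and `d4` are never scored. If the hypothesis holds, skipping them costs little relevance while halving the work in this toy case.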
4
Data
Wikipedia is the source.
60 gigabytes, with about 2.7 million documents in XML format.
The complete meta-data and subsets of it are provided.
5
Data Files
Tags and trees: :... : 14052 0 0 15 1 2 3 -1 4 -1 5 -1 -1 6 7 -1 8 -1 -1
Links:...
Entities: :... :
Bag-of-Words (BOW...Wow!):
–BOW File: :... :
–Term Index File: 1472,bracelet 547,depend
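The two Term Index entries shown on the slide (1472,bracelet and 547,depend) suggest a simple id-to-term lookup. A minimal sketch of loading it follows, assuming one `id,term` pair per line; the function name `parse_term_index` is hypothetical, and the layouts of the other files are not recoverable from the slide, so they are not parsed here.

```python
def parse_term_index(lines):
    """Map integer term ids to terms from lines shaped like '1472,bracelet'."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first comma only, in case a term itself contains one.
        term_id, term = line.split(",", 1)
        index[int(term_id)] = term
    return index

# The two sample entries from the slide.
sample = ["1472,bracelet", "547,depend"]
print(parse_term_index(sample))  # {1472: 'bracelet', 547: 'depend'}
```

Such a lookup would let term ids in the BOW file be translated back into readable words.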
6
Solution: A Two-Pronged Approach
First Prong:
–Analyze links to discover maximum flow communities
–Using the Ford-Fulkerson algorithm
Second Prong:
–Use information from BOW and Entities to develop similarity measures between documents within clusters
–Attempt to refine and develop better clusters
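The first prong can be sketched as follows. This is not the presenters' implementation: it uses the Edmonds-Karp variant of Ford-Fulkerson (augmenting paths chosen by breadth-first search), and it assumes that a "community" is the set of nodes still reachable from the source once the flow is maximal, i.e. the source side of a minimum cut. The function name `max_flow_community`, the unit capacities, and the toy link graph are hypothetical.

```python
from collections import defaultdict, deque

def max_flow_community(edges, source, sink):
    """Return (max flow value, nodes on the source side of a minimum cut)."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, v, cap in edges:
        residual[u][v] += cap
        residual[v][u] += 0  # make sure the reverse edge exists
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
    # After the last search, parent holds exactly the nodes reachable from source.
    return flow, set(parent)

# Hypothetical link graph: two triangles joined by a single link c-d.
links = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
edges = [(u, v, 1) for u, v in links] + [(v, u, 1) for u, v in links]
flow, community = max_flow_community(edges, "a", "f")
print(flow, sorted(community))  # 1 ['a', 'b', 'c']
```

The single link between the triangles is the bottleneck, so the cut separates the two densely linked groups. The second prong could then compare documents inside each such group using BOW or entity vectors.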