SCATTER/GATHER : A CLUSTER BASED APPROACH FOR BROWSING LARGE DOCUMENT COLLECTIONS GROUPER : A DYNAMIC CLUSTERING INTERFACE TO WEB SEARCH RESULTS MINAL.


1 SCATTER/GATHER: A CLUSTER BASED APPROACH FOR BROWSING LARGE DOCUMENT COLLECTIONS
GROUPER: A DYNAMIC CLUSTERING INTERFACE TO WEB SEARCH RESULTS
MINAL PATANKAR, MADHURI WUDALI

2 DOCUMENT CLUSTERING
The process of grouping documents with similar content into a common cluster.

3 ADVANTAGES OF DOCUMENT CLUSTERING
If a collection is well clustered, we can search only the clusters likely to contain relevant documents.
Clustering also makes the document collection easier to browse.

4 [Diagram: overview of the two systems and their interfaces to the user. Scatter/Gather: a tool for browsing a document collection, built on traditional text-based clustering algorithms (Buckshot, Fractionation) with word-based similarity. Grouper: a tool for searching via a meta search engine, built on STC clustering with phrase-based similarity.]

5 SCATTER/GATHER INTERFACE

6 SCATTER/GATHER SESSION
The user is presented with short summaries of a small number of document groups.
The user selects one or more groups for further study.
The process repeats until the level of individual documents is reached.

7 [Diagram: Scatter/Gather components: Fractionation, Buckshot, Cluster Digest]

8 HOW IS SCATTER/GATHER DONE?
Static offline partitioning phase: Fractionation algorithm
Online reclustering phase: Buckshot algorithm
– Step 1: Group average agglomerative clustering
– Step 2: K-Means

9 Clustering
Partitional: K-Means
Hybrid: Buckshot, Fractionation
Hierarchical: Agglomerative, Divisive
– Agglomerative: Single Link, Complete Link, Group Average Link

10 HIERARCHICAL AGGLOMERATIVE CLUSTERING
Create an N×N document-document similarity matrix.
Each document starts as a cluster of size one.
Repeat until only one cluster remains:
– combine the two clusters with the greatest similarity
– update the similarity matrix

11 Example
     A    B    C    D    E
A    _    2    7    6    4
B    2    _    9   11   14
C    7    9    _    4    8
D    6   11    4    _    2
E    4   14    8    2    _
B and E have the greatest similarity (14), so they are combined first: A, B, C, D, E becomes A, BE, C, D.
SC(A,BE) = 4 if we are using single link (take max)
SC(A,BE) = 2 if we are using complete link (take min)
SC(A,BE) = 3 if we are using group average (take average)
Note: with group average, C-BE is now the highest link.

12 Example
      A    BE     C     D
A     _     3     7     6
BE    3     _   8.5   6.5
C     7   8.5     _     4
D     6   6.5     4     _
COMBINING: SC(C,B) = 9 and SC(C,E) = 8, so SC(C,BE) = 8.5. BE and C are combined into BEC, leaving BEC, A, D.

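13 Example
       A   BEC      D
A      _     5      6
BEC    5     _   5.25
D      6  5.25      _
The BEC entries average the BE and C entries of the previous table: SC(A,BEC) = (3 + 7)/2 = 5 and SC(D,BEC) = (6.5 + 4)/2 = 5.25.
COMBINING: A-D is now the highest link (6), so A and D are combined, leaving BEC and AD.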
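The merge sequence in the example can be reproduced with a short sketch. It is an illustration only, not the presenters' code: the names `SIM`, `cluster_sim` and `hac` are made up here, and the group-average variant averages over all original document pairs, so intermediate values after the first merges may differ slightly from a table that averages previously merged rows. For this matrix the merge order is the same.

```python
from itertools import combinations

# Document-document similarities from the example matrix (higher = more similar).
SIM = {("A", "B"): 2, ("A", "C"): 7, ("A", "D"): 6, ("A", "E"): 4,
       ("B", "C"): 9, ("B", "D"): 11, ("B", "E"): 14,
       ("C", "D"): 4, ("C", "E"): 8, ("D", "E"): 2}

def doc_sim(x, y):
    """Look up the symmetric similarity of two documents."""
    return SIM[(x, y)] if (x, y) in SIM else SIM[(y, x)]

def cluster_sim(c1, c2, linkage="average"):
    """Similarity of two clusters under single, complete or group-average link."""
    values = [doc_sim(x, y) for x in c1 for y in c2]
    if linkage == "single":      # most similar pair
        return max(values)
    if linkage == "complete":    # least similar pair
        return min(values)
    return sum(values) / len(values)  # average over all cross-cluster pairs

def hac(docs, linkage="average"):
    """Merge the two most similar clusters until one remains; return the merges."""
    clusters = [(d,) for d in docs]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: cluster_sim(pair[0], pair[1], linkage))
        clusters.remove(a)
        clusters.remove(b)
        merged = tuple(sorted(a + b))
        clusters.append(merged)
        merges.append(merged)
    return merges

print(hac("ABCDE"))
```

Calling `cluster_sim(("A",), ("B", "E"), linkage)` with each of the three linkage names gives the three SC(A,BE) values listed on slide 11.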

14 SCATTER/GATHER SESSION STAGE 1: FRACTIONATION
Corpus C of N documents is broken into N/m buckets of fixed size m > k.
Group average agglomerative clustering is applied to each bucket.
The resulting document groups are the input to the next iteration.
Repeat until k centers remain.
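The bucket-by-bucket loop can be sketched as follows. This is a minimal illustration under stated assumptions: the slide does not say how far each bucket is agglomerated, so the reduction factor `rho` (each bucket shrinks to about `rho` times its size) is an assumed parameter, documents are bucketed in their given order, and `sim`, `agglomerate` and `fractionation` are hypothetical names.

```python
def group_average(g1, g2, sim):
    """Average similarity over all pairs drawn from two groups."""
    return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))

def agglomerate(groups, sim, target):
    """Group-average agglomerative clustering until `target` groups remain."""
    groups = [list(g) for g in groups]
    while len(groups) > target:
        pairs = [(i, j) for i in range(len(groups))
                 for j in range(i + 1, len(groups))]
        i, j = max(pairs,
                   key=lambda p: group_average(groups[p[0]], groups[p[1]], sim))
        groups[i] = groups[i] + groups.pop(j)   # i < j, so index i stays valid
    return groups

def fractionation(docs, sim, k, m, rho=0.5):
    """Repeatedly cluster fixed-size buckets until k groups remain (requires m > k)."""
    groups = [[d] for d in docs]
    while len(groups) > k:
        if len(groups) <= m:
            # Everything fits in one bucket: finish by clustering down to k.
            return agglomerate(groups, sim, k)
        next_groups = []
        for start in range(0, len(groups), m):
            bucket = groups[start:start + m]
            target = max(1, round(rho * len(bucket)))  # ASSUMED reduction factor
            next_groups.extend(agglomerate(bucket, sim, target))
        groups = next_groups                    # groups feed the next iteration
    return groups

# Toy usage: "documents" are numbers, similarity is negative distance.
result = fractionation(list(range(16)), lambda a, b: -abs(a - b), k=2, m=4)
print(result)
```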

15 SCATTER/GATHER SESSION STAGE 2: BUCKSHOT STEP 1: HAC
Take a random sample of sqrt(kn) documents.
Apply group average agglomerative clustering to the sample until k clusters remain.
Return the resulting clusters.
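A minimal sketch of this sampling step, assuming documents are compared through a caller-supplied similarity function `sim`. The function names and the toy data are illustrative and not taken from the slides.

```python
import math
import random

def group_average_hac(items, sim, k):
    """Group-average agglomerative clustering until k clusters remain."""
    clusters = [[x] for x in items]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(sim(a, b) for a in clusters[i] for b in clusters[j])
                avg = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or avg > best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def buckshot_seed_clusters(docs, sim, k, rng=random):
    """Cluster a random sample of about sqrt(k*n) documents into k groups."""
    n = len(docs)
    sample_size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    sample = rng.sample(docs, sample_size)
    return group_average_hac(sample, sim, k)

# Toy usage: 100 numeric "documents", k = 4, so the sample has 20 documents.
seeds = buckshot_seed_clusters(list(range(100)), lambda a, b: -abs(a - b),
                               k=4, rng=random.Random(0))
print([len(c) for c in seeds])
```

Sampling sqrt(kn) documents keeps the quadratic cost of agglomerative clustering proportional to kn rather than n squared.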

16 SCATTER/GATHER STAGE 2: BUCKSHOT STEP 2: K-Means
1. Arbitrarily select K documents as seeds; they are the initial centroids of the clusters.
2. Assign all other documents to the closest centroid.
3. Recompute the centroid of each cluster.
Repeat steps 2 and 3 until the centroid of each cluster no longer changes.
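The assign-and-recompute loop can be sketched as below. This is an illustration under assumptions: documents are represented as small numeric vectors and "closest" means Euclidean distance, whereas a real document clusterer would more likely use term vectors with a cosine-style measure. The `max_iter` cap is an added safeguard.

```python
from math import dist

def kmeans(points, seeds, max_iter=100):
    """Assign points to the nearest centroid and recompute until stable."""
    centroids = [tuple(s) for s in seeds]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)   # keep the centroid of an empty cluster
        if new_centroids == centroids:      # stop when nothing changed
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans(points, seeds=[(0, 0), (10, 10)])
print(centroids)
```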

17 Fractionation Contd…
[Figure: eight documents A to H are split into Bucket 1 and Bucket 2; group average agglomerative clustering within the buckets forms groups such as AH, BG, CF and DE.]

18 Buckshot
[Figure: the documents in the sample (A, D, G, E) are grouped by group average agglomerative clustering into the clusters AG and DE.]
Assign remaining documents to these clusters using K-means.

19 GENESIS OF GROUPER

20 GROUPER
A dynamic web interface to the Husky Search meta-search engine.
Clusters the top retrieved results of the Husky Search meta-search engine.
Dynamically groups search results into clusters.
Uses the STC algorithm for clustering.

21 Grouper’s query interface.

22 Grouper Interface

23 STC (Suffix Tree Clustering)
A fast, incremental algorithm.
Operates on web document snippets.
Relies on a suffix tree to identify common phrases.
Uses this common information to create clusters.

24 WHAT IS A SUFFIX TREE?
A suffix tree of a string S is a rooted, directed tree.
Each internal node has at least 2 children.
Each edge is labeled with a non-empty substring of S.
The label of a node is the concatenation of the edge labels on the path from the root to that node.
No two edges out of the same node can have edge labels that begin with the same word.

25 STEPS OF STC
Step 1: Document “cleaning”
Step 2: Identifying base clusters
Step 3: Combining base clusters
Step 4: Scoring clusters

26 DOCUMENT CLEANING
Stemming
Stripping of HTML, punctuation and numbers
Example: “2 Cats ate cheese.” becomes “Cat ate cheese”

27 Identifying Base Clusters
Create an inverted index of strings from the web document collection using a suffix tree.
Each node of the suffix tree represents a group of documents and a string that is common to all of them.
The label of the node represents the common string.
Each node represents a base cluster.
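A minimal sketch of the cleaning step. The stemmer here is a deliberately naive stand-in that only strips a plural "s"; a real system would use a proper stemming algorithm. The function names are hypothetical, the sketch lowercases its output (an assumption), and it handles ASCII letters only.

```python
import re

def naive_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean(snippet):
    """Strip HTML tags, punctuation and numbers, then lowercase and stem."""
    text = re.sub(r"<[^>]+>", " ", snippet)    # remove HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # remove punctuation and digits
    return " ".join(naive_stem(w) for w in text.lower().split())

print(clean("2 Cats ate cheese."))   # cat ate cheese
```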

28 [Figure: suffix tree of the three documents 1. cat ate cheese, 2. mouse ate cheese too, 3. cat ate mouse too. Each node is annotated with the documents that contain its phrase.]

29 BASE CLUSTERS IDENTIFIED!!
Node  Phrase      Documents
a     cat ate     1,3
b     ate         1,2,3
c     cheese      1,2
d     mouse       2,3
e     too         2,3
f     ate cheese  1,2
Table 1: Six nodes and their corresponding base clusters
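The six base clusters can be reproduced without building an actual suffix tree. The sketch below is a brute-force stand-in, not the STC implementation: it enumerates every word n-gram and keeps a phrase only if it occurs in at least two documents and is followed by at least two different continuations (a next word or the end of a document), which is the condition for a phrase to sit at a branching node of the suffix tree. It assumes the three cleaned documents are "cat ate cheese", "mouse ate cheese too" and "cat ate mouse too".

```python
from collections import defaultdict

def base_clusters(docs):
    """docs maps a document id to its cleaned text; returns phrase -> doc ids."""
    phrase_docs = defaultdict(set)   # phrase -> documents containing it
    phrase_next = defaultdict(set)   # phrase -> distinct continuations seen
    for doc_id, text in docs.items():
        words = text.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase = " ".join(words[start:end])
                phrase_docs[phrase].add(doc_id)
                # End-of-document markers are unique per document.
                follower = words[end] if end < len(words) else ("END", doc_id)
                phrase_next[phrase].add(follower)
    return {p: ids for p, ids in phrase_docs.items()
            if len(ids) >= 2 and len(phrase_next[p]) >= 2}

docs = {1: "cat ate cheese", 2: "mouse ate cheese too", 3: "cat ate mouse too"}
for phrase, ids in sorted(base_clusters(docs).items()):
    print(phrase, sorted(ids))
```

This enumeration is quadratic in the length of each document, which is acceptable for short snippets but is exactly the cost a suffix tree avoids.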

30 SCORING BASE CLUSTERS
$S(B) = |B| \cdot f(|P|)$
$|B|$ is the number of documents in base cluster B
$|P|$ is the number of words in phrase P
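A small sketch of the score. The slide does not define f, so the shape used below is purely an assumption for illustration: single-word phrases are penalized, the weight grows linearly with phrase length, and it is capped at six words.

```python
def phrase_weight(num_words):
    """ASSUMED shape of f(|P|): penalize single words, linear up to 6 words."""
    if num_words <= 1:
        return 0.5
    return float(min(num_words, 6))

def score(doc_ids, phrase):
    """S(B) = |B| * f(|P|) for a base cluster with the given phrase."""
    return len(doc_ids) * phrase_weight(len(phrase.split()))

print(score({1, 2}, "ate cheese"))   # 2 documents, 2-word phrase
print(score({1, 2, 3}, "ate"))       # 3 documents, 1-word phrase
```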

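31 Combining Base Clusters
Binary similarity measure: base clusters $B_m$ and $B_n$ have similarity 1 if
$\frac{|B_m \cap B_n|}{|B_m|} > 0.5$ and $\frac{|B_m \cap B_n|}{|B_n|} > 0.5$
and similarity 0 otherwise.
$|B_m \cap B_n|$: documents which are in both clusters
$|B_m|$: documents in cluster m
$|B_n|$: documents in cluster n

32 COMBINING THE BASE CLUSTERS
[Figure: base cluster graph. The nodes are the base clusters (cat ate, ate, cheese, mouse, too, ate cheese) with their document sets; edges connect similar base clusters.]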
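The binary similarity test and the merging it drives can be sketched as follows. Treating each connected group of similar base clusters as one final cluster is an assumption made here for illustration, and the helper names are hypothetical. The input is the set of six base clusters a to f from Table 1.

```python
from itertools import combinations

def similar(b_m, b_n):
    """Binary similarity: the overlap must exceed half of each base cluster."""
    overlap = len(b_m & b_n)
    return overlap / len(b_m) > 0.5 and overlap / len(b_n) > 0.5

def merge_base_clusters(base):
    """Group base clusters that are linked by similarity (connected components)."""
    parent = {name: name for name in base}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for m, n in combinations(base, 2):
        if similar(base[m], base[n]):
            parent[find(m)] = find(n)          # join the two components
    groups = {}
    for name in base:
        groups.setdefault(find(name), []).append(name)
    # Each merged cluster holds the union of its base clusters' documents.
    return [(sorted(names), set().union(*(base[n] for n in names)))
            for names in groups.values()]

base = {"a": {1, 3}, "b": {1, 2, 3}, "c": {1, 2},
        "d": {2, 3}, "e": {2, 3}, "f": {1, 2}}
print(merge_base_clusters(base))
```

In this small example every base cluster is similar to b ("ate"), so all six end up connected in a single group covering documents 1, 2 and 3.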

33 STC is Incremental
As each document arrives from the web, we “clean” it and add it to the suffix tree.
Each node that is updated or created as a result is tagged.
Update the relevant base clusters and recalculate their similarity to the rest of the k highest-scoring base clusters.
Check for any changes to the final clusters.
Score and sort the final clusters, and choose the top 10.

34 STC allows cluster overlap…
Why is overlap reasonable? A document often has more than one topic.
STC allows a document to appear in more than one cluster, since a document may share more than one phrase with other documents.


