Classification and clustering methods development and implementation for unstructured documents collections, by Osipova Nataly, St. Petersburg State University.


1 Classification and clustering methods development and implementation for unstructured documents collections, by Osipova Nataly, St. Petersburg State University, Faculty of Applied Mathematics and Control Processes, Department of Programming Technology

2 Contents: Introduction; Methods description; Information Retrieval System; Experiments

3 Contextual Document Clustering was developed in a joint project of the Faculty of Applied Mathematics and Control Processes, St. Petersburg State University, and the Northern Ireland Knowledge Engineering Laboratory (NIKEL), University of Ulster.

4 Definitions: Document; Terms dictionary; Dictionary; Cluster; Word context; Context or document conditional probability distribution; Entropy

5 Document conditional probability distribution
Document x:
y        word1   word2   word3   …   wordn
tf(y)    5       10      6       …   16
p(y|x)   5/m     10/m    6/m     …   16/m
- y: words
- tf(y): frequency of y
- p(y|x): conditional probability of y in document x
- m: size of document x
- (5/m, 10/m, 6/m, …, 16/m): document conditional probability distribution
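
The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `document_distribution` and the toy document are hypothetical, and the document size m is assumed to be the total number of word occurrences in the document.

```python
from collections import Counter

def document_distribution(words):
    """Return {y: p(y|x)} for one document given as a list of words."""
    tf = Counter(words)        # tf(y): frequency of each word y
    m = sum(tf.values())       # m: document size (assumed: total word occurrences)
    return {y: count / m for y, count in tf.items()}

# Toy document reusing the frequencies from the slide (5, 10, 6, 16).
doc = ["word1"] * 5 + ["word2"] * 10 + ["word3"] * 6 + ["wordn"] * 16
p = document_distribution(doc)
print(p["word2"])  # 10/m with m = 37
```

The values of the returned dictionary sum to 1, so it can be used directly as a probability distribution.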

6 Word context
Word w; documents x1, x2, …, xk
Document x1:
y         word1   word2   …   wordn1
tf(y)     5       10      …   16
p(y|x1)   5/m1    10/m1   …   16/m1
Document x2:
y         word1   word3   …   wordn2
tf(y)     7       12      …   4
p(y|x2)   7/m2    12/m2   …   4/m2
Document xk:
y         word1   word4   …   wordnk
tf(y)     20      9       …   3
p(y|xk)   20/mk   9/mk    …   3/mk
Context conditional probability distribution:
y         word1       word2   word3   …   wordnk
tf(y)     5+7+20=32   10      12      …   3
p(y|w)    32/m        10/m    12/m    …   3/m
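
A minimal sketch of how a word context distribution might be pooled, under two assumptions that the slide does not state: the context of w is built from the documents that contain w, and m is the total size of those documents. The function name `context_distribution` and the toy data are hypothetical.

```python
from collections import Counter

def context_distribution(word, documents):
    """Return {y: p(y|w)}, pooling word counts over documents that contain `word`."""
    pooled = Counter()
    for doc in documents:
        if word in doc:          # assumption: only documents containing w contribute
            pooled.update(doc)   # add tf(y) of this document to the pooled counts
    m = sum(pooled.values())     # assumption: m is the total size of those documents
    if m == 0:
        return {}
    return {y: count / m for y, count in pooled.items()}

documents = [
    ["w"] + ["word1"] * 5 + ["word2"] * 10,
    ["w"] + ["word1"] * 7 + ["word3"] * 12,
    ["word1"] * 20 + ["word4"] * 9,      # does not contain w, so it is ignored
]
p_w = context_distribution("w", documents)
print(p_w["word1"])  # (5 + 7) / 36
```

In this sketch the word w itself is counted as part of its own context; whether the original method does so is not stated in the slides.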

7 Contents: Introduction; Methods description; Information Retrieval System; Experiments

8 Methods
- document clustering method
- dictionary build methods
- document classification method using training set
Information retrieval methods:
- keyword search method
- cluster based search method
- similar documents search method

9 Contextual Documents Clustering [Diagram elements: Documents; Dictionary; Narrow context words; Distances calculation; Clusters]

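10 Entropy
$H(y) = -\sum_{i=1}^{n} p_i \log p_i$, where $(p_1, p_2, \dots, p_n)$ is the conditional probability distribution of the context of word y and $p_1 + p_2 + \dots + p_n = 1$.
Entropy is a measure of uncertainty; here it is used to characterize the commonness (narrowness) of the word context.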

11 Contextual Document Clustering $\max H(y) = H(1/n, \dots, 1/n) = \log n$
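
As an illustration of how entropy could separate narrow contexts from common ones, here is a small sketch. The base-2 logarithm, the helper `narrow_context_words`, and its `threshold` parameter are assumptions made for the example; the slides do not state how the cut-off is chosen.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def narrow_context_words(contexts, threshold):
    """Keep words whose context entropy is below `threshold` (hypothetical cut-off)."""
    return sorted(w for w, dist in contexts.items()
                  if entropy(dist.values()) < threshold)

uniform = [0.25] * 4                  # maximal entropy for n = 4: log2(4) bits
peaked = [0.97, 0.01, 0.01, 0.01]     # mass concentrated on one word: low entropy
contexts = {
    "common": dict(zip("abcd", uniform)),
    "narrow": dict(zip("abcd", peaked)),
}
narrow = narrow_context_words(contexts, threshold=1.0)
print(narrow)
```

A uniform distribution over n words reaches the maximum entropy log n, so a word with such a context is as "common" as possible, while a peaked distribution marks a narrow context.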

12 Entropy [Plot: entropy H as a function of α, with axis ticks at 0, 0.5 and 1]

13 Word Context - Document Distance
- $p_1$: conditional probability distribution of the context of word y
- $p_2$: conditional probability distribution of document x
- $\bar{p} = (p_1 + p_2)/2$: average conditional probability distribution

14 Word Context - Document Distance $JS[p_1, p_2] = H\left(\frac{p_1 + p_2}{2}\right) - 0.5\,H(p_1) - 0.5\,H(p_2)$

15 Jensen-Shannon divergence
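
The Jensen-Shannon divergence between a word context and a document, JS[p1, p2] = H((p1 + p2)/2) - 0.5 H(p1) - 0.5 H(p2), can be sketched as follows. Distributions are represented as dictionaries from words to probabilities. The helper `nearest_context`, which assigns a document to the closest context, is an assumption about how the distances are used, and the toy data is invented for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits (base 2 is an assumption)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def js_divergence(p1, p2):
    """JS[p1, p2] = H((p1 + p2) / 2) - 0.5 * H(p1) - 0.5 * H(p2)."""
    words = set(p1) | set(p2)
    average = [0.5 * p1.get(w, 0.0) + 0.5 * p2.get(w, 0.0) for w in words]
    return entropy(average) - 0.5 * entropy(p1.values()) - 0.5 * entropy(p2.values())

def nearest_context(document, contexts):
    """Return the context word whose distribution is closest to the document."""
    return min(contexts, key=lambda w: js_divergence(contexts[w], document))

document = {"court": 0.5, "judge": 0.5}
contexts = {
    "law": {"court": 0.6, "judge": 0.4},
    "water": {"river": 0.7, "pipe": 0.3},
}
closest = nearest_context(document, contexts)
print(closest)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (distributions with no words in common).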

16 Dictionary construction
Why:
- big volumes: 60,000 documents, 50,000 words => 15,000 words in a context
- importance of narrow context words

17 Dictionary construction
Delete words with:
1. High or low frequency
2. High or low document frequency
3. Both 1 and 2
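
The pruning rules above might look like this in code. The function name `build_dictionary` and the four threshold parameters are hypothetical; the slides do not give the actual cut-off values.

```python
from collections import Counter

def build_dictionary(documents, min_tf, max_tf, min_df, max_df):
    """Keep words whose total frequency and document frequency are both in range."""
    tf = Counter()   # total frequency of each word in the collection
    df = Counter()   # number of documents that contain each word
    for doc in documents:
        tf.update(doc)
        df.update(set(doc))
    return {w for w in tf
            if min_tf <= tf[w] <= max_tf and min_df <= df[w] <= max_df}

documents = [["a", "b", "c"], ["a", "b"], ["a", "d"]]
# "a" is too frequent, "c" and "d" are too rare, so only "b" survives.
dictionary = build_dictionary(documents, min_tf=2, max_tf=2, min_df=2, max_df=2)
print(dictionary)
```

Setting only the frequency bounds or only the document frequency bounds to permissive values reproduces options 1 and 2 on their own.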

18 Retrieval algorithms
- keyword search method
- cluster based search method
- search by example method

19 Keyword search method
Document 1: word 1, word 2, word 3, …, word n1
Document 2: word 10, word 25, word 30, …, word n2
Document 3: word 15, word 2, word 32, …, word n3
Document 4: word 11, word 21, word 3, …, word n4
Request: word 2
Result set: document 1, document 3
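
The keyword search example can be reproduced with a simple inverted index. This is a sketch in Python, not the system's actual implementation (which the slides say runs on MS SQL Server); `build_index` and `keyword_search` are hypothetical names.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return index

def keyword_search(index, word):
    """Return the sorted ids of documents containing `word`."""
    return sorted(index.get(word, set()))

documents = {
    1: ["word 1", "word 2", "word 3"],
    2: ["word 10", "word 25", "word 30"],
    3: ["word 15", "word 2", "word 32"],
    4: ["word 11", "word 21", "word 3"],
}
index = build_index(documents)
print(keyword_search(index, "word 2"))  # documents 1 and 3, as on the slide
```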

20 Cluster based search method
Cluster context words:
Cluster 1: word 1, word 2, …, word n1
Cluster 2: word 12, word 26, …, word n2
Cluster 3: word 1, word 23, …, word n3
[Diagram also shows Documents boxes attached to the clusters]
Request: word 1
Result set: Cluster 1, Cluster 3
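
A sketch of the cluster based search, assuming that a request word is matched against each cluster's context words and that the matching clusters are returned together with their documents. The function name `cluster_search` and the document ids are hypothetical.

```python
def cluster_search(cluster_words, cluster_docs, word):
    """Return {cluster id: documents} for clusters whose context words include `word`."""
    hits = sorted(c for c, words in cluster_words.items() if word in words)
    return {c: cluster_docs.get(c, []) for c in hits}

cluster_words = {
    1: {"word 1", "word 2"},
    2: {"word 12", "word 26"},
    3: {"word 1", "word 23"},
}
cluster_docs = {   # hypothetical cluster membership
    1: ["document 1", "document 4"],
    2: ["document 2"],
    3: ["document 3"],
}
result = cluster_search(cluster_words, cluster_docs, "word 1")
print(list(result))  # clusters 1 and 3, as on the slide
```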

21 Similar documents search
[Diagram: Cluster Minimal Spanning Tree with nodes Cluster name, document 1, document 2, document 3, document 4, document 5, document 6, document 7]
Request: document 3
Result set: document 6, document 7
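
One way to read this slide is that the documents of a cluster are connected by a minimal spanning tree and that the documents adjacent to the requested one in the tree form the result set. The sketch below follows that reading, which is an assumption; it uses Prim's algorithm, and the names `minimum_spanning_tree`, `similar_documents` and the one-dimensional toy distances are hypothetical.

```python
def minimum_spanning_tree(nodes, dist):
    """Prim's algorithm; `dist(a, b)` is a symmetric distance. Returns an adjacency dict."""
    nodes = list(nodes)
    tree = {n: set() for n in nodes}
    if not nodes:
        return tree
    in_tree = {nodes[0]}
    while len(in_tree) < len(nodes):
        # Pick the cheapest edge that connects the tree to a node outside it.
        a, b = min(
            ((u, v) for u in in_tree for v in nodes if v not in in_tree),
            key=lambda edge: dist(edge[0], edge[1]),
        )
        tree[a].add(b)
        tree[b].add(a)
        in_tree.add(b)
    return tree

def similar_documents(tree, doc):
    """Return the tree neighbours of `doc` (assumed to be the 'similar' documents)."""
    return sorted(tree.get(doc, ()))

# Toy documents placed on a line; the distance is the gap between positions.
position = {"d1": 0, "d2": 1, "d3": 3, "d4": 7}
tree = minimum_spanning_tree(position, lambda a, b: abs(position[a] - position[b]))
print(similar_documents(tree, "d3"))  # neighbours of d3 in the tree
```

In the real system the distance would presumably be the divergence between document distributions rather than a gap on a line.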

22 Document classification: method 1
[Diagram elements: Clusters; List of topics; Training set; Topics contexts; Distances between topics and clusters contexts; Test documents]
Classification result: cluster 1 – topic 10, cluster 2 – topic 3, …, cluster n – topic 30
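
A possible reading of method 1 is: build one context distribution per topic from the training set, then label each cluster with the topic whose context is closest to the cluster context. The sketch below follows that reading and uses the Jensen-Shannon divergence as the distance; both choices are assumptions, and all names and data are hypothetical.

```python
import math
from collections import Counter

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def js_divergence(p1, p2):
    words = set(p1) | set(p2)
    average = [0.5 * p1.get(w, 0.0) + 0.5 * p2.get(w, 0.0) for w in words]
    return entropy(average) - 0.5 * entropy(p1.values()) - 0.5 * entropy(p2.values())

def topic_contexts(training_set):
    """training_set: list of (topic, words) pairs -> {topic: word distribution}."""
    pooled = {}
    for topic, words in training_set:
        pooled.setdefault(topic, Counter()).update(words)
    return {t: {w: c / sum(counts.values()) for w, c in counts.items()}
            for t, counts in pooled.items()}

def classify_clusters(cluster_contexts, topics):
    """Assign each cluster the topic with the closest context."""
    return {c: min(topics, key=lambda t: js_divergence(ctx, topics[t]))
            for c, ctx in cluster_contexts.items()}

training_set = [("law", ["court", "judge", "court"]), ("water", ["river", "pipe"])]
cluster_contexts = {1: {"court": 0.5, "judge": 0.5}, 2: {"river": 1.0}}
labels = classify_clusters(cluster_contexts, topic_contexts(training_set))
print(labels)
```

Test documents would then inherit the topic of the cluster they fall into.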

23 Document classification: method 2
[Diagram elements: Clusters; Topics list; Training set; Test documents; All documents set]
Classification result: cluster 1 – topic 10, cluster 2 – topic 3, …, cluster n – topic 30

24 Contents: Introduction; Methods description; Information Retrieval System; Experiments

25 Information Retrieval System: Architecture; Features; Use

26 Information Retrieval System architecture. data base server client

27 IRS architecture [Diagram elements: Data Base; Data Base Server (MS SQL Server 2000); Local Area Network; “thick” client (C#)]

28 IRS architecture
DBMS MS SQL Server 2000:
- High-performance
- Scalable
- Secure
- Handles huge volumes of data
- T-SQL
- Stored procedures

29 IRS features
In the IRS the following problems are solved:
- document clustering
- keyword search method
- cluster based search method
- similar documents search method
- document classification with the use of a training set

30 DB structure
The Data Base of the IRS consists of the following tables:
- documents
- all words dictionary
- dictionary
- table of relations between documents and words: document-word
- words contexts
- words with narrow contexts
- clusters
- intermediate tables used to build the main tables and to implement retrieval

31 Algorithms implementation [Diagram elements: Documents; Dictionary; Table “document-word”; All words dictionary; Words contexts; Words with narrow contexts; Clusters; Centroid; Keyword search; Cluster based search; Similar documents search]

32 Similar documents search [Diagram: graph over Cluster, document1, document2, document3, document4 and document5, with edge distances 0.16285, 0.98154, 0.57231, 0.23851, 0.26967, 0.211, 0.8731, 0.7231, 0.1011]

33 Minimal Spanning Tree [Diagram nodes: Cluster name; Cluster; document 1; document 2; document 3; document 4; document 5]

34 Similar documents search [Diagram elements: Clusters table; Tree table; Distances table]

35 IRS use

41 Contents: Introduction; Methods description; Information Retrieval System; Experiments

42 Test goals were:
- testing the algorithm accuracy
- comparing different classification methods
- evaluating the algorithm efficiency

43 Experiments
- 60,000 documents
- 100 topics
- Training set volume = 5% of the collection size

44 Experiments

45 Result analysis
- Russian Information Retrieval Evaluation Seminar
- Measures such as macro-averaged recall, precision, and F-measure were calculated.

46 Recall

47 Precision

48 F-measure
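
For reference, macro-averaged recall, precision and F-measure can be computed as sketched below. This follows one common convention (per-category scores averaged with equal weight, F-measure averaged per category); the exact variant used in the evaluation is not stated in the slides, and the function name and toy labels are hypothetical.

```python
def macro_scores(true_labels, predicted_labels):
    """Return (macro precision, macro recall, macro F-measure)."""
    categories = sorted(set(true_labels) | set(predicted_labels))
    if not categories:
        return 0.0, 0.0, 0.0
    pairs = list(zip(true_labels, predicted_labels))
    precisions, recalls, f_scores = [], [], []
    for c in categories:
        tp = sum(1 for t, p in pairs if t == c and p == c)   # correctly assigned to c
        fp = sum(1 for t, p in pairs if t != c and p == c)   # wrongly assigned to c
        fn = sum(1 for t, p in pairs if t == c and p != c)   # missed members of c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f)
    n = len(categories)
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n

true_labels = ["a", "a", "b", "b"]
predicted_labels = ["a", "b", "b", "b"]
macro_p, macro_r, macro_f = macro_scores(true_labels, predicted_labels)
print(macro_p, macro_r, macro_f)
```

Macro-averaging gives every category the same weight, so small categories influence the score as much as large ones.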

49 Result analysis
List of some of the topics into which test documents were classified:
No.   Category
1     Family law
2     Inheritance law
3     Water industry
4     Catering
5     Inhabitants’ consumer services
6     Rent truck
7     International law of the space
8     Territory in international law
9     Off-economic relations fellows
10    Off-economic dealerships
11    Economy free trade zones. Customs unions.

50 Result analysis
Recall results for every category, in percent. The best result for each category is shown in bold type.
С V      1 2 3 4 5 6 7 8 9 10 11
textan   33 34 35 60 46 26 27 98 75 25 100
xxxx     100.23400.90302
xxxx     004.32.3050.98300.8
xxxx     55867519595180041820
xxxx     213922215601.4050
xxxx     404316112523101.41.250
xxxx     2342.51.11870.901.2100
xxxx     2.70001.5000000
xxxx     2.20001.5000000
xxxx     372112221827510000

51 Thank you for your attention!

