Slide 1: An On-line Document Clustering Method Based on Forgetting Factors. Yoshiharu Ishikawa, Yibing Chen, Hiroyuki Kitagawa. University of Tsukuba, Japan. ECDL 2001, Sept. 7, 2001.

Slide 2: Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work

Slide 3: Background
- The Internet enabled on-line document delivery services: newsfeed services over the network, periodically issued on-line journals
- Important technologies (and applications) for on-line documents: information filtering; document summarization and information extraction; topic detection and tracking (TDT)
- Clustering works as a core technique for these applications

Slide 4: Our Objectives (1)
- Development of an on-line clustering method that considers the novelty of each document
- Presents a snapshot of clusters in an up-to-date manner
- Example (figure): articles from a sports news feed arrive over time: Soccer World Cup, Formula 1 & M. Schumacher, U.S. Open Tennis, other articles

Slide 5: Our Objectives (2)
- Development of a novelty-based clustering method for on-line documents
- Features:
  - Weights newer documents more heavily than older ones and forgets obsolete ones: introduction of a new document similarity measure that considers the novelty and obsolescence of documents
  - Incremental clustering processing: low processing cost to generate a new clustering result
  - Automatic maintenance of target documents: obsolete documents are automatically deleted from the clustering target

Slide 6: The Incremental Clustering Process (1), when t = 0 (initial state)
1. Arrival of new documents
2. Store the new documents in the repository
3. Calculate and store statistics
4. Cluster the documents (clusters 1, ..., k) and present the result

Slide 7: The Incremental Clustering Process (2), when t = 1
1. Arrival of new documents
2. Store the new documents in the repository
3. Update statistics
4. Cluster the documents and present the result

Slide 8: The Incremental Clustering Process (3), when t = τ + 1
1. Arrival of new documents
2. Store the new documents in the repository
3. Update statistics
4. Delete old documents
5. Cluster the documents and present the result

Slide 9: Outline (section divider): Background and Objectives; F2ICM Incremental Document Clustering Method (C2ICM Clustering Method, F2ICM Clustering Method); Document Similarity Based on Forgetting Factor; Updating Statistics and Probabilities; Document Expiration and Parameter Setting; Experimental Results; Conclusions and Future Work.

Slide 10: C2ICM Clustering Method
- Cover-Coefficient-based Incremental Clustering Methodology, proposed by F. Can (ACM TOIS, 1993) [3]
- An incremental clustering method with low update cost
- A seed-based clustering method:
  - Based on the concept of seed powers
  - Seed powers are defined probabilistically
  - The documents with the highest seed powers are selected as cluster seeds

Slide 11: Decoupling/Coupling Coefficients
- Two important notions in the C2ICM method, used to calculate seed powers
- Decoupling coefficient δ_i of document d_i: the probability that document d_i is obtained when document d_i itself is given; an index that measures the independence of d_i
- Coupling coefficient ψ_i of document d_i: an index that measures the dependence of d_i

Slide 12: Seed Power
- The seed power sp_i of document d_i measures the appropriateness (moderate dependence) of d_i as a cluster seed
- freq(d_i, t_j): the occurrence frequency of term t_j within document d_i
- δ'_j: decoupling coefficient for term t_j
- ψ'_j: coupling coefficient for term t_j
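The seed-power formula itself was a figure on the slide; what survives is that it combines the document-level coefficients with a frequency-weighted sum over the term-level ones. A minimal Python sketch, assuming the product form sp_i = δ_i · ψ_i · Σ_j freq(d_i, t_j) · δ'_j · ψ'_j from Can's C2ICM (all names here are illustrative):

```python
def seed_power(delta_i, psi_i, freqs, term_delta, term_psi):
    """Seed power of one document: its decoupling (delta_i) and coupling
    (psi_i) coefficients times a frequency-weighted sum of the term-level
    coefficients term_delta[j] and term_psi[j]."""
    cross = sum(f * dj * pj for f, dj, pj in zip(freqs, term_delta, term_psi))
    return delta_i * psi_i * cross
```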

Slide 13: C2ICM Clustering Algorithm (1), initial phase
1. Select new seeds based on the seed powers
2. Assign every other document to the cluster with the most similar seed
(Example clusters in the figure: red: F1 & Schumacher; green: U.S. Open Tennis)
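The two steps above (take the top-seed-power documents as seeds, then assign each remaining document to its most similar seed) can be sketched as follows. `sim` is any pairwise similarity function; the names are illustrative, not the paper's code:

```python
def cluster_with_seeds(docs, seed_powers, k, sim):
    """Pick the k documents with the highest seed power as cluster seeds,
    then assign each remaining document to the most similar seed."""
    order = sorted(range(len(docs)), key=lambda i: seed_powers[i], reverse=True)
    seeds = order[:k]
    clusters = {s: [s] for s in seeds}  # each seed starts its own cluster
    for i in range(len(docs)):
        if i in clusters:
            continue
        best = max(seeds, key=lambda s: sim(docs[i], docs[s]))
        clusters[best].append(i)
    return clusters
```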

Slide 14: C2ICM Clustering Algorithm (2), incremental update phase
1. Select new seeds based on the seed powers
2. Assign every other document to the cluster with the most similar seed
(Example clusters in the figure: red: F1 & Schumacher; green: U.S. Open Tennis; orange: Soccer World Cup)

Slide 15: Outline (section divider): Background and Objectives; F2ICM Incremental Document Clustering Method (C2ICM Clustering Method, F2ICM Clustering Method); Document Similarity Based on Forgetting Factor; Updating Statistics and Probabilities; Document Expiration and Parameter Setting; Experimental Results; Conclusions and Future Work.

Slide 16: F2ICM Clustering Method
- An extension of the C2ICM method
- Main differences:
  - Introduction of a new document similarity measure based on the notion of a forgetting factor: it weights newer documents more heavily when generating clusters
  - Incremental maintenance of statistics
  - Automatic deletion of obsolete old documents

Slide 17: Outline (section divider): Background and Objectives; F2ICM Incremental Document Clustering Method; Document Similarity Based on Forgetting Factor (document forgetting model, derivation of the document similarity measure); Updating Statistics and Probabilities; Document Expiration and Parameter Setting; Experimental Results; Conclusions and Future Work.

Slide 18: Document Similarity Based on Forgetting Factor
- A new document similarity measure based on a document forgetting model
- Assumption: each delivered document gradually loses its value (weight) as time passes
- A document similarity measure is derived from this assumption: it puts high weights on new documents and low weights on old ones, so old documents have little effect on clustering
- Using the derived similarity measure, we can achieve novelty-based clustering

Slide 19: Document Forgetting Model (1)
- T_i: acquisition time of document d_i
- The information value (weight) of d_i at current time t is defined as dw_i = λ^(t − T_i)
- The document weight decreases exponentially as time passes; the forgetting factor λ (0 < λ < 1) determines the forgetting speed
- (Figure: dw_i starts at 1 at the acquisition time T_i and decays toward 0 as t grows)
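The weight function on this slide is a one-liner; a sketch with illustrative names:

```python
def doc_weight(lam, t, t_acq):
    """dw_i = lam ** (t - t_acq): weight 1 at the acquisition time,
    decaying exponentially toward 0 as time passes (0 < lam < 1)."""
    return lam ** (t - t_acq)
```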

Slide 20: Document Forgetting Model (2). Why do we use the exponential forgetting model?
- It inherits ideas from the behavioral laws of human memory. The Power Law of Forgetting [1]: human memory decays as time passes
- Relationship with citation analysis: the obsolescence (aging) of citations can be measured through citation rates, and some simple obsolescence models take exponential forms
- Efficiency: based on this model, we can obtain an efficient statistics maintenance procedure
- Simplicity: we can control the forgetting speed with the single parameter λ

Slide 21: Outline (section divider): Background and Objectives; F2ICM Incremental Document Clustering Method; Document Similarity Based on Forgetting Factor (document forgetting model, derivation of the document similarity measure); Updating Statistics and Probabilities; Document Expiration and Parameter Setting; Experimental Results; Conclusions and Future Work.

Slide 22: Our Approach for Document Similarity Derivation
- A probabilistic derivation based on the document forgetting model
- Let Pr(d_i, d_j) be the probability of selecting the document pair (d_i, d_j) from the document repository
- We regard the co-occurrence probability Pr(d_i, d_j) as the similarity sim(d_i, d_j)

Slide 23: Derivation of Similarity Formula (1)
- tdw: the total weight of all m documents, the simple sum of the document weights: tdw = Σ_{i=1}^{m} dw_i
- Pr(d_i): the subjective probability of selecting document d_i from the repository: Pr(d_i) = dw_i / tdw
- Since old documents have small weights, their selection probabilities are small
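A minimal sketch of the selection probability above (illustrative names); the probabilities over all documents sum to 1 by construction:

```python
def selection_probability(weights, i):
    """Pr(d_i) = dw_i / tdw, where tdw is the sum of all document weights.
    Old documents have small dw_i, hence small selection probability."""
    tdw = sum(weights)
    return weights[i] / tdw
```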

Slide 24: Derivation of Similarity Formula (2)
- Pr(t_k | d_i): the selection probability of term t_k from document d_i
- freq(d_i, t_k): the number of occurrences of t_k in d_i
- This probability corresponds to term frequency: Pr(t_k | d_i) = freq(d_i, t_k) / doclen_i, where doclen_i is the length of d_i
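The term-frequency probability can be sketched from a document's raw term counts (illustrative names; the slide's doclen_i is the sum of the counts):

```python
def term_probabilities(freqs):
    """Pr(t_k | d_i) = freq(d_i, t_k) / doclen_i for every term of one
    document, where doclen_i is the document's total term count."""
    doclen = sum(freqs.values())
    return {t: f / doclen for t, f in freqs.items()}
```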

Slide 25: Derivation of Similarity Formula (3)
- Pr(t_k): the occurrence probability of term t_k
- This probability corresponds to the document frequency df(t_k) of term t_k
- The reciprocal of df(t_k) plays the role of IDF (inverse document frequency)

Slide 26: Derivation of Similarity Formula (4)
- Using Bayes' theorem, Pr(d_i | t_k) = Pr(t_k | d_i) Pr(d_i) / Pr(t_k)
- Then we get Pr(d_i, d_j) = Σ_k Pr(t_k) Pr(d_i | t_k) Pr(d_j | t_k) = Pr(d_i) Pr(d_j) Σ_k Pr(t_k | d_i) Pr(t_k | d_j) / Pr(t_k)

Slide 27: Derivation of Similarity Formula (5)
- Therefore, the co-occurrence probability of d_i and d_j is sim(d_i, d_j) = Pr(d_i) Pr(d_j) Σ_k Pr(t_k | d_i) Pr(t_k | d_j) / Pr(t_k)
- The summation is essentially an inner product of document vectors under TF-IDF weighting
- The older a document d_i becomes, the smaller its similarity scores with other documents, because old documents have low Pr(d_i) values
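Putting the pieces of the derivation together, a hedged sketch of the resulting similarity (illustrative names; `tp_i`/`tp_j` are the per-document term probabilities from slide 24, `pr_term` the term occurrence probabilities from slide 25):

```python
def cooccurrence_similarity(pr_i, pr_j, tp_i, tp_j, pr_term):
    """sim(d_i, d_j) = Pr(d_i) * Pr(d_j)
                       * sum_k Pr(t_k|d_i) * Pr(t_k|d_j) / Pr(t_k).
    Old documents have small Pr(d_i), so their similarities shrink."""
    s = sum(tp_i.get(t, 0.0) * tp_j.get(t, 0.0) / p for t, p in pr_term.items())
    return pr_i * pr_j * s
```

Note how dividing by Pr(t_k) plays the IDF role: rare terms shared by both documents contribute more to the score.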

Slide 28: Outline (section divider): Background and Objectives; F2ICM Incremental Document Clustering Method; Document Similarity Based on Forgetting Factor (document forgetting model, derivation of the document similarity measure); Updating Statistics and Probabilities; Document Expiration and Parameter Setting; Experimental Results; Conclusions and Future Work.

Slide 29: Updating Statistics and Probabilities, when t = τ + 1
1. Arrival of new documents
2. Store the new documents in the repository
3. Update statistics
4. Delete old documents
5. Present the new clustering result

Slide 30: Approach to Update Processing (1)
- In every incremental clustering step, we have to calculate document similarities
- To compute the similarities, we need the document statistics and probabilities beforehand
- It is inefficient to compute the statistics from scratch every time
- Instead, store the calculated statistics and probabilities and reuse them in later computations: incremental update processing

Slide 31: Approach to Update Processing (2)
- Formulation:
  - d_1, ..., d_m: a document set consisting of m documents
  - t_1, ..., t_n: the index terms that appear in d_1, ..., d_m
  - t = τ: the latest update time of the document set
- Assumptions:
  - At t = τ + 1, new documents d_{m+1}, ..., d_{m+m'} are appended to the document set
  - The new documents introduce additional terms t_{n+1}, ..., t_{n+n'}
  - m >> m' and n >> n' are satisfied

Slide 32: Update Processing Method (1)
- Update of the document weights dw_i:
  - Since one unit of time has passed since the previous update time t = τ, the weight of each existing document decreases by the factor λ: dw_i|_{τ+1} = λ · dw_i|_τ
  - Each new document is assigned the initial weight 1
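The per-step weight update can be sketched directly (illustrative names): decay every stored weight once, then append weight 1 for each newly arrived document.

```python
def update_document_weights(old_weights, n_new, lam):
    """One time unit has passed: dw_i <- lam * dw_i for existing
    documents; each of the n_new arrivals starts with weight 1."""
    return [w * lam for w in old_weights] + [1.0] * n_new
```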

Slide 33: Update Processing Method (2)
- Example of incremental update processing: updating from tdw|_τ to tdw|_{τ+1}
- Naive approach: compute tdw|_{τ+1} from scratch as the sum of all current document weights; this is time consuming!

Slide 34: Update Processing Method (3)
- Smart approach: compute tdw|_{τ+1} incrementally as tdw|_{τ+1} = λ · tdw|_τ + m'
- Exponential weighting is what enables this efficient incremental computation
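The incremental total-weight update is one multiply and one add; the sketch below (illustrative names) also lets us check that it matches the naive from-scratch sum. This works only because every existing weight is scaled by the same factor λ, so the stored total can be scaled once.

```python
def tdw_incremental(tdw_old, n_new, lam):
    """tdw|_{tau+1} = lam * tdw|_tau + m': decay the stored total once,
    then add one unit of weight per newly arrived document."""
    return lam * tdw_old + n_new
```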

Slide 35: Update Processing Method (4)
- The occurrence probability of each document, Pr(d_i) = dw_i / tdw, can then be recalculated easily
- We need to compute the term frequencies freq(d_i, t_k) only for the new documents d_{m+1}, ..., d_{m+m'}

Slide 36: Update Processing Method (5)
- Update formulas for the document frequency df(t_k) of each term
- We expand the formula of df(t_k) into components, store each component persistently, and then update df(t_k) incrementally using these formulas

Slide 37: Update Processing Method (6)
- Calculation of the new decoupling coefficient δ_i is easy
- Update formulas for the decoupling coefficients of terms: incremental update is also possible; the details are shown in the paper

Slide 38: Summary of Update Processing
- The following statistics are maintained persistently (m: number of documents, n: number of terms):
  - dw_i: weight of document d_i (1 <= i <= m)
  - tdw: total weight of the documents
  - freq(d_i, t_k): term occurrence frequency (1 <= i <= m, 1 <= k <= n)
  - doclen_i: document length (1 <= i <= m)
  - statistics to compute df(t_k) (1 <= k <= n)
  - statistics to compute the decoupling coefficients (1 <= i <= m)
- Incremental statistics update cost: O(m + m'n) = O(m + n), with storage cost O(m + n): linear cost
- cf. the naive (non-incremental) method costs O(mn)

Slide 39: Outline (section divider): Background and Objectives; F2ICM Incremental Document Clustering Method; Document Similarity Based on Forgetting Factor; Updating Statistics and Probabilities; Document Expiration and Parameter Setting; Experimental Results; Conclusions and Future Work.

Slide 40: Expiration of Old Documents (1), when t = τ + 1
1. Arrival of new documents
2. Store the new documents in the repository
3. Update statistics
4. Delete old documents
5. Present the new clustering result

Slide 41: Expiration of Old Documents (2)
- Two reasons to delete old documents: it reduces the storage area, and old documents have only a tiny effect on the resulting clustering structure
- Our approach: if dw_i < ε (where ε is a small parameter constant), delete document d_i
- When we delete d_i, the related statistics values, e.g., freq(d_i, t_k), are deleted as well; the details are in the proceedings and [6]
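The expiration rule is a simple threshold filter; a sketch with illustrative names:

```python
def surviving_documents(weights, eps):
    """Return the indices of documents kept in the repository: document
    d_i is deleted as soon as dw_i < eps."""
    return [i for i, w in enumerate(weights) if w >= eps]
```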

Slide 42: Parameter Setting Methods
- F2ICM uses two parameters in its algorithms:
  - forgetting factor λ (0 < λ < 1): specifies the forgetting speed
  - expiration parameter ε (0 < ε < 1): threshold value for document deletion
- We use the following metaphors to set them:
  - β: the half-life span of the value of a document; λ^β = 1/2 is satisfied, namely λ = 2^(−1/β)
  - γ: the life span of a document; ε is determined by ε = λ^γ
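The two metaphors above determine both parameters from human-readable time spans. A sketch (illustrative names), solving λ^β = 1/2 for λ and then setting ε = λ^γ:

```python
def forgetting_factor(beta):
    """Solve lam ** beta = 1/2 for the forgetting factor lam,
    where beta is the half-life span in time units."""
    return 2.0 ** (-1.0 / beta)

def expiration_threshold(lam, gamma):
    """eps = lam ** gamma: a document's weight falls below eps after
    gamma time units, so it is deleted at that age."""
    return lam ** gamma
```

With the experiment's settings (β = 7 days, γ = 30 days), an article's weight halves each week and the article expires after a month.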

Slide 43: Outline (section divider): Background and Objectives; F2ICM Incremental Document Clustering Method; Document Similarity Based on Forgetting Factor; Updating Statistics and Probabilities; Document Expiration and Parameter Setting; Experimental Results; Conclusions and Future Work.

Slide 44: Dataset and Parameter Settings
- Dataset: Mainichi Daily Newspaper articles
- Each article consists of the following information: issue date; subject area (e.g., economy, sports); keyword list (50 to 150 words in Japanese)
- The articles used in the experiment: issue dates from January 1994 to February 1994; subject area: international affairs
- Parameter settings:
  - n_c (number of clusters) = 10
  - β (half-life span) = 7: the value of an article reduces to 1/2 in one week
  - γ (life span) = 30: every document is deleted after 30 days

Slide 45: Computational Cost for Clustering Sequences
- Plot of the CPU time and response time for the clustering performed each day
- Costs increase linearly until the 30th day, then become almost constant, once documents begin to expire

Slide 46: Overview of Clustering Result (1). Summarization of the 10 clusters after 30 days (on January 31, 1994):
1. East Europe, NATO, Russia, Ukraine
2. Clinton (Whitewater/politics), military issues (Korea/Myanmar/Mexico/Indonesia)
3. China (import and export/U.S.)
4. U.S. politics (economic sanctions and Vietnam/elections)
5. Clinton (Syria/South East issue/visiting Europe), Europe (France/Italy/Switzerland)
6. South Africa (ANC/human rights), East Europe (Bosnia-Herzegovina, Croatia), Russia (Zhirinovsky/ruble/Ukraine)
7. Russia (economy/Moscow/U.S.), North Korea (IAEA/nuclear)
8. China (Patriot missiles/South Korea/Russia/Taiwan/economics)
9. Mexico (indigenous peoples/riot), Israel
10. South East Asia (Indonesia/Cambodia/Thailand), China (Taiwan/France), South Korea (politics)

Slide 47: Overview of Clustering Result (2). Summarization of the 10 clusters after 57 days (on March 1, 1994):
1. Bosnia-Herzegovina (NATO/PKO/UN/Serbia), China (diplomacy)
2. U.S. issues (Japan/economy/New Zealand/Bosnia/Washington)
3. Myanmar, Russia, Mexico
4. Bosnia-Herzegovina (Sarajevo/Serbia), U.S. (North Korea/economy/military)
5. North Korea (IAEA/U.S./nuclear)
6. East Asia (Hebron/random shooting/PLO), Myanmar, Bosnia-Herzegovina
7. U.S. (society/crime/North Korea/IAEA)
8. U.N. (PKO/Bosnia-Herzegovina/EU), China
9. Bosnia-Herzegovina (U.N./PKO/Sarajevo), Russia (Moscow/Serbia)
10. Sarajevo (Bosnia-Herzegovina), China (Taiwan/Tibet)

Slide 48: Summary of the Experiment
- Brief observations:
  - F2ICM groups similar articles into a cluster as long as an appropriate seed is selected
  - However, a cluster obtained in the experiment usually contains multiple topics, and different clusters contain similar topics: the clusters are not well separated
- Reasons for the observed phenomena:
  - The selected seeds are not well separated by topic: a more sophisticated seed selection method is required
  - The number of keywords per article is rather small (50 to 150 words)

Slide 49: Outline (section divider): Background and Objectives; F2ICM Incremental Document Clustering Method; Document Similarity Based on Forgetting Factor; Updating Statistics and Probabilities; Document Expiration and Parameter Setting; Experimental Results; Conclusions and Future Work.

Slide 50: Conclusions and Future Work
- Conclusions:
  - Development of an on-line clustering method that considers the novelty of documents
  - Introduction of the document forgetting model
  - F2ICM: Forgetting-Factor-based Incremental Clustering Method
  - Incremental statistics update method (linear update cost)
  - Automatic document expiration and parameter setting methods
  - A preliminary report of the experiments
- Current and future work:
  - Revision of the clustering algorithms based on the Scatter/Gather approach [4]
  - More detailed experiments and their evaluation
  - Development of automatic parameter tuning methods

