A Topic Detection and Tracking method combining NLP with Suffix Tree Clustering Author : Yaohong JIN Source : International Conference on Computer Science.


1 A Topic Detection and Tracking method combining NLP with Suffix Tree Clustering. Author: Yaohong JIN. Source: International Conference on Computer Science and Electronics Engineering (ICCSEE). Date: 2013/10/7. Presenter: 曹昌林

2 Outline: Introduction; CLUSTERING ALGORITHM; TOPIC DETECTION AND TRACKING ALGORITHM; Conclusion

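3 TDT (Topic Detection and Tracking): An information-processing technique. It can be used to identify major topics and to track the follow-up topics that grow out of them. It is applied to news mining, where the focus of a topic shifts over time.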

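4 Suffix tree: A suffix tree T for a string S of m words is a tree with exactly m leaf nodes. Every edge is labeled with a non-empty substring of S, and no two edges leaving the same node may have labels that begin with the same word. Example: bananas.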

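5 Suffix tree clustering (1): Putting n strings into a single suffix tree gives a suffix tree group (commonly known as a generalized suffix tree). Each leaf node is labeled (j, i); concatenating the edge labels along the path from the root to that leaf gives the suffix of string j (0 < j ≤ n) that starts at position i.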

6 Suffix tree clustering (2): Example: S = { "cat ate cheese", "mouse ate cheese too", "cat ate mouse too" }
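The example on slide 6 can be explored with a small sketch. It is an illustration under stated assumptions, not the author's implementation: it builds an uncompressed word-level suffix trie (simpler than a true suffix tree, and quadratic rather than linear in sentence length), and the names `build_suffix_trie` and `shared_phrases` are hypothetical. Phrases that occur in at least two documents play the role of candidate clusters.

```python
def build_suffix_trie(docs):
    """Insert every word-level suffix of every document into a nested-dict trie.

    Each node stores the set of document indices whose suffixes pass through it.
    This is an uncompressed trie, used here only for illustration.
    """
    root = {"children": {}, "docs": set()}
    for doc_id, text in enumerate(docs):
        words = text.split()
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node["children"].setdefault(
                    word, {"children": {}, "docs": set()}
                )
                node["docs"].add(doc_id)
    return root


def shared_phrases(root, min_docs=2):
    """Return {phrase: sorted doc ids} for phrases found in >= min_docs documents."""
    found = {}
    stack = [((), root)]
    while stack:
        prefix, node = stack.pop()
        for word, child in node["children"].items():
            # A longer phrase can only occur in a subset of these documents,
            # so branches below the threshold are not explored further.
            if len(child["docs"]) >= min_docs:
                phrase = prefix + (word,)
                found[" ".join(phrase)] = sorted(child["docs"])
                stack.append((phrase, child))
    return found


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
base_clusters = shared_phrases(build_suffix_trie(docs))
```

Here `base_clusters` maps shared phrases such as "ate", "cat ate", and "ate cheese" to the documents that contain them, which also shows how one document can belong to several clusters.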

7 Outline: Introduction; CLUSTERING ALGORITHM; TOPIC DETECTION AND TRACKING ALGORITHM; Conclusion

8 CLUSTERING ALGORITHM

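9 Feature Selection(1) For clustering, NLP algorithms are used to select the more meaningful words. A stop word table filters out high-frequency words (such as "the", "I", "a"). TF-IDF is used to compute the weight of each word and to filter out commonly used words.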
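A minimal sketch of the two filtering steps on slide 9 follows. The stop word table, the TF-IDF variant (term frequency times log of inverse document frequency), the cutoff `min_weight`, and all function names are assumptions chosen for illustration; the slides do not say which variant or threshold was used.

```python
import math
from collections import Counter

# Tiny illustrative stop word table; a real table would be far larger.
STOP_WORDS = {"the", "i", "a"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tf_idf(docs):
    """Return one {word: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))  # count each word once per document
    weights = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words)
        weights.append(
            {w: (c / total) * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return weights


def select_features(docs, min_weight=0.0):
    """Keep only words whose TF-IDF weight exceeds min_weight (assumed cutoff)."""
    return [[w for w, s in doc.items() if s > min_weight] for doc in tf_idf(docs)]


docs = ["the cat ate a cheese", "the mouse ate cheese too", "a cat ate the mouse too"]
features = select_features(docs)
```

In this toy corpus "ate" occurs in every document, so its weight is zero and it is filtered out together with the stop words.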

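10 Feature Selection(2) STC is initialized so that it can track words of any length. Every word is tagged with its part of speech and its meaning. Nouns, verbs, and meanings are selected as the key words of the document.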

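11 Suffix Tree Clustering The output of feature selection is fed into STC. Punctuation marks in the text and their positional relationships are retained. The advantages are that a document can appear in multiple clusters and that inserting any sentence into the tree takes only linear time.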

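12 Scoring Clusters(1) Each day's news headlines are distributed across a series of clusters. The importance of a cluster depends on how many articles contain the topic and how many media outlets cover it; a cluster that is high on both receives the most attention. Using the formula on the next slide, the 50 highest-scoring clusters are selected as the source for TDT.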

13 Scoring Clusters(2) [Formula and symbols not preserved in the transcript.] The terms of the formula are: the importance of the topic; the number of articles in the topic; the total number of articles in the day; the number of media outlets in which the topic is involved; and the total number of media outlets in the corpus.
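The transcript does not preserve the scoring formula, so the following is only a sketch of how the four quantities could be combined. The product of the article share and the media share is an assumption made here for illustration, not the formula from the paper; the function names and the dictionary keys `articles` and `media` are likewise hypothetical.

```python
def cluster_score(n_articles, total_articles, n_media, total_media):
    """Assumed importance score: article share multiplied by media share.

    NOTE: the product is an illustrative assumption; the original formula
    is not available in the transcript.
    """
    if total_articles <= 0 or total_media <= 0:
        raise ValueError("totals must be positive")
    return (n_articles / total_articles) * (n_media / total_media)


def top_clusters(clusters, total_articles, total_media, m=50):
    """Return the m highest-scoring clusters (m = 50 on the slides)."""
    ranked = sorted(
        clusters,
        key=lambda c: cluster_score(
            c["articles"], total_articles, c["media"], total_media
        ),
        reverse=True,
    )
    return ranked[:m]


clusters = [
    {"label": "local story", "articles": 3, "media": 1},
    {"label": "major story", "articles": 20, "media": 8},
    {"label": "niche story", "articles": 10, "media": 2},
]
ranked = top_clusters(clusters, total_articles=100, total_media=10, m=2)
```

Under this assumed score, a cluster ranks highly only when both the number of articles and the number of media outlets are high, which matches the description on slide 12.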

14 Outline: Introduction; CLUSTERING ALGORITHM; TOPIC DETECTION AND TRACKING ALGORITHM; Conclusion

15 TOPIC DETECTION AND TRACKING ALGORITHM(1) Suppose A = {a1, a2, …, an} is the set of topics in one period of time. Initially A is an empty set. B = {bi1, bi2, …, bim} is the set of clusters in one day, where i denotes the ith day and m is 50. Step 1: initialize the topic set A. Step 2: if A is empty, add all the elements of B to A.

16 TOPIC DETECTION AND TRACKING ALGORITHM(2) Step 3: compute the similarity of each pair (ak, bij). Step 4: if a cluster bij is similar to ak, link bij to ak (this procedure is tracking); bij is then called a sub-topic of ak. Step 5: if bij is not similar to any element of A, bij is a new topic and is added to A (this procedure is detection). Step 6: generate a description for each topic.
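Steps 1 to 5 can be sketched as a single function that is called once per day. This is a hedged illustration, not the author's code: the topic record layout (`seed`, `subtopics`), the function name `detect_and_track`, and the choice to compare each day's clusters only against topics that existed before that day are all assumptions. The similarity test is passed in as a parameter, and Step 6 (description generation) is left out.

```python
def detect_and_track(topics, day_clusters, is_similar):
    """Update the topic list A with one day's clusters B and return it.

    topics       -- list of {"seed": cluster, "subtopics": [clusters]} (set A)
    day_clusters -- the clusters of one day (set B)
    is_similar   -- callable(cluster_a, cluster_b) -> bool, supplied by the caller
    """
    # Step 2: an empty topic set is seeded with all of the day's clusters.
    if not topics:
        topics.extend({"seed": b, "subtopics": []} for b in day_clusters)
        return topics

    new_topics = []
    for b in day_clusters:
        linked = False
        # Step 3: compare the cluster with every existing topic.
        for topic in topics:
            if is_similar(topic["seed"], b):
                # Step 4 (tracking): link the cluster as a sub-topic.
                topic["subtopics"].append(b)
                linked = True
        if not linked:
            # Step 5 (detection): the cluster starts a new topic.
            new_topics.append({"seed": b, "subtopics": []})
    topics.extend(new_topics)
    return topics


# Toy similarity: two clusters match when they share at least two words.
share_two_words = lambda a, b: len(a & b) >= 2

topics = []  # Step 1: A starts empty.
day1 = [{"quake", "japan", "tsunami"}, {"election", "vote", "poll"}]
day2 = [{"quake", "japan", "rescue"}, {"football", "cup", "final"}]
detect_and_track(topics, day1, share_two_words)
detect_and_track(topics, day2, share_two_words)
```

After the second day the earthquake topic has gained one sub-topic (tracking) and the football cluster has become a third topic (detection).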

17 TOPIC DETECTION AND TRACKING ALGORITHM(3) The difficulty of the TDT algorithm above is computing the similarity of clusters, because the focus of a topic gradually shifts over time. The similarity computation has to take this shifting into account. If a topic is linked by other topics, a new description has to be generated from the list of topics.

18 Similarity of two Clusters(1) The Vector Space Model (VSM) is used to represent the content of a cluster. In addition to the label of the cluster, the top K words are added to the vector. The K words are extracted from the nodes of the suffix tree by the Mutual Information algorithm. K is set to 50.

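19 Similarity of two Clusters(2) The Jaccard coefficient (called "Jaccard distance" on the slide) is used to measure the correlation of the two cluster vectors. It relates the number of words that appear in both clusters to the total number of words in the two clusters. [Formula and symbols not preserved in the transcript.]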

20 Similarity of two Clusters(3) [Condition not preserved in the transcript] means the two clusters are similar and can be linked. [Condition not preserved in the transcript] means they are not similar, and a new topic has to be added.
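The similarity test on slides 19 and 20 can be sketched as follows. Two points are assumptions: "the total number of words in two clusters" is read as the size of the union of the two word sets, which is the standard Jaccard definition, and the threshold of 0.5 is a placeholder because the actual value is not available in the transcript. The function names are hypothetical.

```python
def jaccard(words_a, words_b):
    """Jaccard coefficient: shared words divided by all distinct words."""
    a, b = set(words_a), set(words_b)
    union = a | b
    if not union:
        return 0.0  # two empty clusters are treated as dissimilar
    return len(a & b) / len(union)


def is_similar(words_a, words_b, threshold=0.5):
    """Decide whether two clusters can be linked.

    NOTE: threshold=0.5 is an assumed placeholder, not the paper's value.
    """
    return jaccard(words_a, words_b) >= threshold


score = jaccard({"quake", "japan", "tsunami"}, {"japan", "tsunami", "rescue"})
```

Here the two clusters share two of four distinct words, so `score` is 0.5. In practice each word set would hold the cluster label plus its top K words, as described on slide 18.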

21 Description Generation Semantic analysis based on the Hierarchical Network of Concepts theory (HNC theory) is used to extract the description from the labels. Words with the same meaning or in a hyponymy relation are filtered, with nouns given priority for retention in the list. The common phrase is then extracted from the remaining word list.

22 Outline: Introduction; CLUSTERING ALGORITHM; TOPIC DETECTION AND TRACKING ALGORITHM; Conclusion

23 Conclusion Advantage: the method can track topics effectively. Drawbacks: the different aspects of a topic were revealed correctly but not linked with each other, and the ambiguity of topic detection and tracking was not handled very well. Semantic analysis technology is to be combined with TDT to deal with this ambiguity.
