Published by Lucinda Stanley. Modified over 9 years ago.
AUTOMATIC KEYPHRASE EXTRACTION VIA TOPIC DECOMPOSITION
Published in: EMNLP '10, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Reporter: Ying-Ying, Chen
OUTLINE
- Introduction
- Building Topic Interpreters
- Topical PageRank for Keyphrase Extraction
- Experiments
- Related work
- Conclusion
INTRODUCTION
- Keyphrases are a set of terms in a document that give readers a brief summary of its content.
- They are widely used in information retrieval and digital libraries.
- Keyphrase extraction is also an essential step in document categorization, clustering and summarization.
- Two principal approaches: supervised and unsupervised.
- Supervised methods
  - regard keyphrase extraction as a classification task
  - require a document set with human-assigned keyphrases
INTRODUCTION
- Unsupervised methods: graph-based ranking
- Process:
  1. Build a word graph from word co-occurrences within the document.
  2. Use random-walk techniques to measure word importance.
  3. Select the top-ranked words as keyphrases.
- Problems: graph-based ranking alone does not ensure that
  - keyphrases are relevant to the major topics of the given document
  - keyphrases have good coverage of the document's major topics
INTRODUCTION
- To address these problems, it is intuitive to consider the topics of words and documents in the random walk for keyphrase extraction:
  - decompose traditional PageRank into multiple PageRanks, each specific to one topic
  - obtain the importance scores of words under different topics
- We call the topic-decomposed PageRank Topical PageRank (TPR).
- TPR is unsupervised and language independent.
- TPR for keyphrase extraction is a two-stage process:
  1. Build a topic interpreter to acquire the topics of words and documents.
  2. Perform TPR to extract keyphrases for documents.
BUILDING TOPIC INTERPRETERS
- There are two ways to acquire the topic distributions of words:
  - Use manually annotated knowledge bases, e.g. WordNet.
  - Use unsupervised machine learning techniques to obtain word topics from a large-scale document collection: LSA (Latent Semantic Analysis), pLSA (probabilistic LSA), LDA (Latent Dirichlet Allocation).
BUILDING TOPIC INTERPRETERS
- LDA
  - Each word w of a document d is regarded as generated by first sampling a topic z from d's topic distribution θ(d), and then sampling a word from the distribution over words φ(z) that characterizes topic z.
  - In LDA, θ(d) and φ(z) are drawn from conjugate Dirichlet priors α and β, respectively.
  - Therefore θ and φ are integrated out, and the probability of word w given document d and the priors is:
    $pr(w \mid d, \alpha, \beta) = \sum_{z=1}^{K} pr(w \mid z, \beta)\, pr(z \mid d, \alpha)$
    where K is the number of topics.
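The topic mixture described on this slide can be illustrated with a small sketch. It assumes the word probability is the sum over topics of pr(w|z) times pr(z|d); the function name `word_given_document` and the toy distributions are made up for illustration and do not come from the paper.

```python
def word_given_document(word, doc_topics, topic_words):
    """Return pr(w|d) as the sum over topics z of pr(w|z) * pr(z|d).

    doc_topics[z]  : pr(z|d), the document's topic distribution
    topic_words[z] : dict mapping word -> pr(w|z) for topic z
    """
    return sum(
        p_z * topic_words[z].get(word, 0.0)
        for z, p_z in enumerate(doc_topics)
    )


# Toy two-topic model (hypothetical numbers).
topic_words = [
    {"bank": 0.5, "money": 0.5},   # a "finance" topic
    {"bank": 0.2, "river": 0.8},   # a "nature" topic
]
doc_topics = [0.75, 0.25]          # the document is mostly about finance

p_bank = word_given_document("bank", doc_topics, topic_words)
```

Here "bank" receives probability mass from both topics, weighted by how much the document is about each of them.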
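LDA (LATENT DIRICHLET ALLOCATION)
- Dirichlet distribution
  - The Dirichlet distribution is the conjugate prior of the multinomial distribution.
  - If the prior is a Dirichlet distribution and the likelihood function is multinomial, then the posterior is again a Dirichlet distribution.
  - P(Y|X): posterior probability; P(Y): prior probability; P(X|Y): likelihood function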
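The conjugacy stated on this slide can be written out as a short worked derivation. The symbols are assumptions chosen for illustration: θ is a K-dimensional probability vector, α_k are the Dirichlet parameters, and n_k is the number of times outcome k was observed in N multinomial draws.

```latex
\begin{align*}
\theta &\sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K), \qquad
(n_1, \dots, n_K) \mid \theta \sim \mathrm{Multinomial}(N, \theta) \\
p(\theta \mid n_1, \dots, n_K)
  &\propto \underbrace{\prod_{k=1}^{K} \theta_k^{\,n_k}}_{\text{likelihood}}
     \; \underbrace{\prod_{k=1}^{K} \theta_k^{\,\alpha_k - 1}}_{\text{prior}}
   = \prod_{k=1}^{K} \theta_k^{\,\alpha_k + n_k - 1} \\
\theta \mid n_1, \dots, n_K &\sim \mathrm{Dirichlet}(\alpha_1 + n_1, \dots, \alpha_K + n_K)
\end{align*}
```

The posterior keeps the Dirichlet form, with each parameter increased by the corresponding observed count.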
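LDA (LATENT DIRICHLET ALLOCATION)
- LDA maps documents into a topic space: it assumes that each document is a random mixture of many topics, and it obtains the relationships between documents through their topics.
- LDA, LSA and pLSA share the same premise, bag of words, so none of them considers syntax or word order.
- Differences between LDA and pLSA:
  - pLSA's document parameters are learned only for documents that appear in the training corpus.
  - LDA gives documents that do not appear in the training corpus a probabilistic representation, so it needs fewer parameters.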
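LDA (LATENT DIRICHLET ALLOCATION)
- LDA is a generative model: it can randomly generate observable data, that is, a document composed of multiple topics. Modeling works in reverse, building the generative model from a collection of texts. The generative steps are:
  1. Choose N ~ Poisson(ξ), where N is the document length (number of words).
  2. Choose θ ~ Dirichlet(α), where θ gives the probability of each topic and α is the parameter of the Dirichlet distribution.
  3. For each of the N words:
     1. Choose a topic z_n ~ Multinomial(θ); z_n is the currently selected topic.
     2. Choose a word w_n from p(w_n | z_n; β), a multinomial distribution conditioned on z_n. β is a K×V matrix with β_ij = P(w^j = 1 | z^i = 1).
- In LDA each document has its own θ, and θ can be used to judge the similarity of documents.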
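The generative steps on this slide can be sketched with the Python standard library alone. This is an illustration, not a training procedure: the helper names, the toy vocabulary and the β matrix are assumptions, the Poisson sampler uses Knuth's multiplication method (suitable only for small means), and the Dirichlet draw is obtained by normalizing gamma variates.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw N ~ Poisson(mean) with Knuth's multiplication method."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_dirichlet(rng, alpha):
    """Draw theta ~ Dirichlet(alpha) by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(rng, xi, alpha, beta, vocab):
    """Follow the slide: pick N, pick theta, then a topic and a word per position."""
    n = sample_poisson(rng, xi)                      # step 1: document length
    theta = sample_dirichlet(rng, alpha)             # step 2: topic proportions
    words = []
    for _ in range(n):                               # step 3: one word at a time
        z = rng.choices(range(len(alpha)), weights=theta)[0]   # topic z_n
        words.append(rng.choices(vocab, weights=beta[z])[0])   # word w_n given z_n
    return theta, words


# Toy setup (hypothetical): K = 2 topics, V = 3 words, beta is K x V.
vocab = ["river", "bank", "money"]
beta = [[0.6, 0.4, 0.0],
        [0.0, 0.4, 0.6]]
theta, words = generate_document(random.Random(0), xi=8, alpha=[1.0, 1.0],
                                 beta=beta, vocab=vocab)
```

Each call produces a different θ, which mirrors the slide's remark that every document has its own topic proportions.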
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION
- Given a document d, keyphrase extraction using TPR consists of the following four steps:
  1. Construct a word graph for d according to word co-occurrences within d.
  2. Perform TPR to calculate the importance score of each word with respect to different topics.
  3. Using the topic-specific importance scores of words, rank candidate keyphrases with respect to each topic separately.
  4. Given the topics of document d, integrate the topic-specific rankings of candidate keyphrases into a final ranking, and select the top-ranked ones as keyphrases.
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION
- We construct a word graph according to word co-occurrences within the given document.
- Link weight: the co-occurrence count of two words within a sliding window of at most W words in the word sequence.
- Direction: when sliding a W-width window, at each position we add links from the first word pointing to the other words within the window.
- Vertices: only adjectives and nouns are added to the word graph.
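The graph construction on this slide can be sketched as follows. The sketch assumes the input list has already been filtered to adjectives and nouns, and it skips self-links when a word repeats inside the window; both choices, like the function name `build_word_graph`, are assumptions rather than details given in the slides.

```python
from collections import defaultdict


def build_word_graph(words, window=10):
    """Directed, weighted co-occurrence graph as nested dicts.

    graph[u][v] counts how often v appears among the (window - 1) words
    that follow an occurrence of u, i.e. inside a window that starts at u.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, first in enumerate(words):
        # The window starting at position i covers words[i : i + window].
        for other in words[i + 1 : i + window]:
            if other != first:          # assumption: no self-links
                graph[first][other] += 1.0
    return {u: dict(nbrs) for u, nbrs in graph.items()}


# Toy word sequence (already filtered to nouns and adjectives).
graph = build_word_graph(["topic", "model", "topic", "graph"], window=2)
```

With `window=2` every word links only to its immediate successor, so the example yields the links topic→model, model→topic and topic→graph, each with weight 1.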
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION
- PageRank
  - The basic idea of PageRank is that a vertex is important if other important vertices point to it. This can be regarded as voting or recommendation among vertices.
  - Let G = (V, E) be the graph of a document, with vertex set V = {w_1, w_2, ..., w_N}.
  - (w_i, w_j) ∈ E if there is a link from w_i to w_j; the weight of link (w_i, w_j) is e(w_i, w_j).
  - The out-degree of vertex w_i is $O(w_i) = \sum_{j : w_i \to w_j} e(w_i, w_j)$.
  - The PageRank score of w_i is $R(w_i) = \lambda \sum_{j : w_j \to w_i} \frac{e(w_j, w_i)}{O(w_j)} R(w_j) + (1 - \lambda) \frac{1}{|V|}$.
  - λ is a damping factor ranging from 0 to 1; |V| is the number of vertices.
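A minimal power-iteration sketch of the PageRank defined on this slide. It assumes the update R(w_i) = λ · Σ_j e(w_j, w_i) / O(w_j) · R(w_j) + (1 − λ) / |V| and represents the graph as nested dicts `graph[u][v] = weight`. The default λ = 0.3 is the value the experiments section reports as the default for TPR; the iteration limit and tolerance follow the termination conditions listed on a later slide. The function name and the handling of vertices without out-links are assumptions.

```python
def pagerank(graph, damping=0.3, max_iter=100, tol=1e-3):
    """Weighted PageRank by power iteration over a nested-dict graph."""
    vertices = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    out_degree = {u: sum(nbrs.values()) for u, nbrs in graph.items()}
    scores = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(max_iter):
        # Random-jump term: every vertex gets the same share.
        new = {v: (1.0 - damping) / len(vertices) for v in vertices}
        # Graph-walk term: each vertex passes its score along its out-links.
        # Note: vertices without out-links pass nothing on (no redistribution).
        for u, nbrs in graph.items():
            for v, weight in nbrs.items():
                new[v] += damping * weight / out_degree[u] * scores[u]
        converged = all(abs(new[v] - scores[v]) < tol for v in vertices)
        scores = new
        if converged:
            break
    return scores


# Toy graph: "b" is pointed to by both "a" and "c".
hub = {"a": {"b": 1.0}, "c": {"b": 1.0}, "b": {"a": 1.0}}
hub_scores = pagerank(hub, tol=1e-9, max_iter=1000)
```

In the toy graph "b" collects votes from two vertices and ends up with the highest score, while "c", which nobody points to, keeps only its random-jump share.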
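TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION
- Topical PageRank (TPR)
  - Each topic-specific PageRank prefers words that are highly relevant to the corresponding topic.
  - In the PageRank of a specific topic z, we assign a topic-specific preference value p_z(w) to each word w as its random-jump probability, with $\sum_{w \in V} p_z(w) = 1$.
  - The topic-specific score of w_i is $R_z(w_i) = \lambda \sum_{j : w_j \to w_i} \frac{e(w_j, w_i)}{O(w_j)} R_z(w_j) + (1 - \lambda)\, p_z(w_i)$.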
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION
- Topical PageRank (TPR)
- We use three measures to set the preference values for TPR:
  - p_z(w) = pr(w|z): indicates how much topic z focuses on word w.
  - p_z(w) = pr(z|w): indicates how much word w focuses on topic z.
  - p_z(w) = pr(w|z) * pr(z|w): this measure is inspired by the work in (Cohn and Chang, 2000).
- Termination conditions: iteration stops when
  - the number of iterations reaches 100, or
  - the difference of each vertex between two neighboring iterations is less than 0.001.
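The three preference measures and the termination conditions can be combined in one sketch. It assumes the topic-specific update R_z(w_i) = λ · Σ_j e(w_j, w_i) / O(w_j) · R_z(w_j) + (1 − λ) · p_z(w_i), obtains pr(z|w) from pr(w|z) and a topic prior pr(z) by Bayes' rule, and normalizes the preference values over the document's vertices so they sum to 1. The function names, the topic prior and the toy numbers are hypothetical.

```python
def preference(word, topic, p_w_given_z, p_z, measure="z|w"):
    """Raw preference value of a word for one topic under the three measures."""
    pwz = p_w_given_z[topic].get(word, 0.0)
    evidence = sum(p_w_given_z[k].get(word, 0.0) * p_z[k] for k in range(len(p_z)))
    pzw = pwz * p_z[topic] / evidence if evidence > 0 else 0.0   # Bayes' rule
    if measure == "w|z":
        return pwz
    if measure == "z|w":
        return pzw
    if measure == "product":
        return pwz * pzw
    raise ValueError(f"unknown measure: {measure}")


def topical_pagerank(graph, raw_pref, damping=0.3, max_iter=100, tol=1e-3):
    """PageRank for one topic; raw_pref maps word -> unnormalized p_z(w)."""
    vertices = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    total = sum(raw_pref.get(v, 0.0) for v in vertices)
    if total > 0:
        pref = {v: raw_pref.get(v, 0.0) / total for v in vertices}
    else:                                   # assumption: fall back to uniform jumps
        pref = {v: 1.0 / len(vertices) for v in vertices}
    out_degree = {u: sum(nbrs.values()) for u, nbrs in graph.items()}
    scores = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(max_iter):               # stop after at most 100 iterations
        new = {v: (1.0 - damping) * pref[v] for v in vertices}
        for u, nbrs in graph.items():
            for v, weight in nbrs.items():
                new[v] += damping * weight / out_degree[u] * scores[u]
        done = max(abs(new[v] - scores[v]) for v in vertices) < tol
        scores = new
        if done:                            # or when every vertex changed < 0.001
            break
    return scores


# Toy example: a three-word cycle where topic 0 only cares about "a".
cycle = {"a": {"b": 1.0}, "b": {"c": 1.0}, "c": {"a": 1.0}}
topic_scores = topical_pagerank(cycle, {"a": 1.0})
```

In plain PageRank the three words of the cycle would tie; with all of the random-jump mass placed on "a", the topic-specific ranking puts "a" first.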
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION
- Extract Keyphrases Using Ranking Scores
  - We select noun phrases from a document as candidate keyphrases for ranking:
    1. The document is tokenized.
    2. The document is annotated with part-of-speech (POS) tags.
    3. Noun phrases are extracted with the pattern (adjective)*(noun)+.
  - We regard these noun phrases as candidate keyphrases.
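The pattern (adjective)*(noun)+ can be matched with a regular expression once each tag is reduced to one character. The sketch assumes the input is already tokenized and tagged with Penn Treebank style tags (`JJ*` for adjectives, `NN*` for nouns); the tag set, the function name and the example sentence are assumptions, since the slides do not name a tagger.

```python
import re


def candidate_phrases(tagged_tokens):
    """Extract maximal (adjective)*(noun)+ runs from (token, POS tag) pairs."""
    # Encode each token as A (adjective), N (noun) or O (other).
    symbols = "".join(
        "A" if tag.startswith("JJ") else "N" if tag.startswith("NN") else "O"
        for _, tag in tagged_tokens
    )
    return [
        " ".join(token for token, _ in tagged_tokens[m.start():m.end()])
        for m in re.finditer(r"A*N+", symbols)
    ]


tagged = [
    ("automatic", "JJ"), ("keyphrase", "NN"), ("extraction", "NN"),
    ("is", "VBZ"), ("a", "DT"), ("useful", "JJ"), ("task", "NN"),
]
phrases = candidate_phrases(tagged)
```

For the example sentence the candidates are "automatic keyphrase extraction" and "useful task"; adjectives that are not followed by a noun produce no candidate.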
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION
- Extract Keyphrases Using Ranking Scores
  - We rank the candidate keyphrases using the ranking scores obtained by TPR.
  - By considering the topic distribution of the document, we further integrate the topic-specific rankings of candidate keyphrases into a final ranking.
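The integration step can be sketched under two assumptions that the slide text does not spell out: a phrase's score under topic z is the sum of the topic-specific scores of its words, and the final score weights each topic-specific phrase score by pr(z|d). The function name `rank_phrases` and the toy scores are hypothetical.

```python
def rank_phrases(candidates, topic_word_scores, doc_topics):
    """Combine topic-specific word scores into one ranking of candidate phrases.

    topic_word_scores[z] : dict word -> R_z(word) from the topic-z PageRank
    doc_topics[z]        : pr(z|d), the document's topic distribution
    """
    final = {}
    for phrase in candidates:
        final[phrase] = sum(
            p_z * sum(scores.get(word, 0.0) for word in phrase.split())
            for scores, p_z in zip(topic_word_scores, doc_topics)
        )
    return sorted(final.items(), key=lambda item: item[1], reverse=True)


# Toy scores for two topics (hypothetical numbers).
topic_word_scores = [
    {"topic": 0.5, "model": 0.3, "graph": 0.2},
    {"topic": 0.1, "model": 0.2, "graph": 0.7},
]
ranking = rank_phrases(["graph", "topic model"], topic_word_scores, [0.8, 0.2])
```

Because the document is mostly about the first topic, "topic model" outranks "graph" even though "graph" dominates the second topic.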
EXPERIMENTS
- Datasets
  - One dataset was built by Wan and Xiao and used in (Wan and Xiao, 2008b).
    - It contains 308 news articles from DUC2001 (Over et al., 2001) with 2,488 manually annotated keyphrases.
    - There are at most 10 keyphrases per document.
    - In the experiments we refer to this dataset as NEWS.
  - The other dataset was built by Hulth and used in (Hulth, 2003).
    - It contains 2,000 abstracts of research articles with 19,254 manually annotated keyphrases.
    - In the experiments we refer to this dataset as RESEARCH.
EXPERIMENTS
- Dataset for topic interpreters
  - We use the Wikipedia snapshot of March 2008 to build topic interpreters with LDA.
  - 2,122,618 articles were collected.
  - The vocabulary was built by selecting 20,000 words according to their document frequency.
  - Several models were learned with different numbers of topics, from 50 to 1,500.
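EXPERIMENTS
- Evaluation Metrics: we use three evaluation metrics in the experiments.
  - Precision / recall / F-measure
  - Binary preference measure (Bpref)
    - $Bpref = \frac{1}{R} \sum_{r \in R} \left(1 - \frac{|n \text{ ranked higher than } r|}{M}\right)$
    - R: correct keyphrases; M: extracted keyphrases; r: a correct keyphrase; n: an incorrect keyphrase
  - Mean reciprocal rank (MRR)
    - $MRR = \frac{1}{|D|} \sum_{d \in D} \frac{1}{rank_d}$
    - D: the set of documents; d: a document; rank_d: the rank of the first correct keyphrase among all extracted keyphrases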
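The three metrics can be sketched as below. The sketch assumes Bpref averages, over the correct keyphrases among the M extracted ones, one minus the fraction of incorrect keyphrases ranked above each, and that a document with no correct keyphrase contributes 0 to MRR; the function names and the toy data are illustrative.

```python
def precision_recall_f(extracted, gold):
    """Precision, recall and F-measure of extracted keyphrases against gold ones."""
    correct = len(set(extracted) & set(gold))
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def bpref(extracted, gold):
    """Binary preference: penalize correct phrases ranked below incorrect ones."""
    gold = set(gold)
    m = len(extracted)
    incorrect_above, hits, total = 0, 0, 0.0
    for phrase in extracted:                 # extracted is ordered by rank
        if phrase in gold:
            hits += 1
            total += 1.0 - incorrect_above / m
        else:
            incorrect_above += 1
    return total / hits if hits else 0.0


def mean_reciprocal_rank(results):
    """results: list of (ranked extracted phrases, gold phrases), one per document."""
    total = 0.0
    for extracted, gold in results:
        gold = set(gold)
        rank = next((i for i, p in enumerate(extracted, 1) if p in gold), None)
        total += 1.0 / rank if rank else 0.0  # assumption: no hit contributes 0
    return total / len(results) if results else 0.0


extracted, gold = ["a", "x", "b", "y"], ["a", "b", "c"]
p, r, f = precision_recall_f(extracted, gold)
b = bpref(extracted, gold)
mrr = mean_reciprocal_rank([(extracted, gold), (["x", "b"], ["b"])])
```

Precision, recall and F-measure ignore the order of the extracted phrases, whereas Bpref and MRR reward methods that place correct keyphrases near the top.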
EXPERIMENTS
- Influence of Parameters on TPR
  - Four parameters in TPR may influence the performance of keyphrase extraction:
    1. window size W for constructing the word graph
    2. the number of topics K learned by LDA
    3. different settings of the preference values p_z(w)
    4. damping factor λ of TPR
  - Except for the parameter under investigation, parameters are set to the following values: W = 10, K = 1000, λ = 0.3 and p_z(w) = pr(z|w).
EXPERIMENTS
- Window Size W
  - On NEWS, W ranges from 5 to 20, as shown in Table 1.
  - Similarly, when W ranges from 2 to 10 the performance on RESEARCH does not change much, but it becomes poor when W = 20.
  - Reason: RESEARCH documents (121 words) are much shorter than NEWS documents (704 words), so with a large window
    - the graph becomes fully connected
    - the weights of links tend to be equal
EXPERIMENTS
- The Number of Topics K
  - Table 2 shows the influence of the number of topics K of the LDA models.
  - The influence is similar on RESEARCH.
  - This indicates that LDA is appropriate for obtaining the topics of words and documents for TPR to extract keyphrases.
EXPERIMENTS
- Damping Factor λ
  - The damping factor λ of TPR reconciles the influences of graph walks and preference values on the topic-specific PageRank scores.
EXPERIMENTS
- Preference Values
  - Table 3 shows the influence of the preference values when the number of keyphrases M = 10 on NEWS.
  - pr(w|z) assigns preference values according to how frequently words appear in the given topic.
  - pr(z|w) prefers words that are focused on the given topic.
EXPERIMENTS
- Comparison with Baseline Methods
  - We select three baseline methods to compare with TPR:
    - TFIDF
    - PageRank
    - LDA
  - TFIDF and PageRank do not use topic information.
  - LDA computes the ranking score of each word using the topical similarity between the word and the document.
  - The LDA baseline is calculated using cosine similarity, which performs best.
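The LDA baseline on this slide ranks words by the cosine similarity between a word's topic distribution and the document's topic distribution. A minimal sketch follows; the function names and the toy distributions are assumptions, and how the distributions are obtained from a trained LDA model is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lda_baseline_rank(word_topics, doc_topics):
    """Rank words by topical similarity to the document, most similar first.

    word_topics : dict word -> topic distribution of the word, e.g. pr(z|w)
    doc_topics  : topic distribution of the document, pr(z|d)
    """
    return sorted(
        word_topics,
        key=lambda word: cosine(word_topics[word], doc_topics),
        reverse=True,
    )


# Toy distributions over two topics (hypothetical numbers).
word_topics = {"weather": [0.1, 0.9], "plo": [0.9, 0.1]}
order = lda_baseline_rank(word_topics, [0.8, 0.2])
```

The baseline looks only at topical similarity, so a word's frequency in the document plays no role in its rank.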
EXPERIMENTS
- Tables 4 and 5 compare the four methods on NEWS and RESEARCH.
- The improvements of TPR are all statistically significant, tested with bootstrap re-sampling at 95% confidence.
- LDA performs equal to or better than TFIDF and PageRank under precision/recall/F-measure.
- The performance of LDA under MRR is much worse than that of TFIDF and PageRank.
EXPERIMENTS
- Figures 3 and 4 show the precision-recall relations of the four methods on NEWS and RESEARCH.
- Each point on a precision-recall curve is evaluated at a different number of extracted keyphrases M.
EXPERIMENTS
- Table 6 shows an example of keyphrases extracted using TPR from a news article titled “Arafat Says U.S. Threatening to Kill PLO Officials”.
- Top 3 topics: Palestine, Israel, terrorism
EXPERIMENTS
- TFIDF considers only frequency:
  - it ranks highly the phrases containing “PLO”, which appeared about 16 times in this article.
- LDA does not consider frequency:
  - it failed to extract the keyphrase “political assassination”, in which the word “assassination” occurred 8 times in this article.
RELATED WORK
1. Supervised methods regard keyphrase extraction as a classification task (Turney, 1999).
   - They need a manually annotated training set, which is time-consuming to build.
2. Clustering techniques on word graphs for keyphrase extraction (Grineva et al., 2009; Liu et al., 2009).
   - They performed well on short abstracts but poorly on long articles.
3. Topical PageRank with random jumps between topics (Nie et al., 2006).
   - Random jumps did not help improve the performance of keyphrase extraction.

References
- Peter D. Turney. 1999. Learning to extract keyphrases from text. National Research Council Canada, Institute for Information Technology, Technical Report ERB-1057.
- M. Grineva, M. Grinev, and D. Lizorkin. 2009. Extracting key terms from noisy and multi-theme documents. In Proceedings of WWW, pages 661–670.
- Lan Nie, Brian D. Davison, and Xiaoguang Qi. 2006. Topical link analysis for web search. In Proceedings of SIGIR, pages 91–98.
CONCLUSION
- We propose a new graph-based framework, Topical PageRank.
- We investigate the influence of various parameters on TPR.
- Future work
  - We plan to obtain topics using other machine learning methods and from other knowledge bases.
  - We will consider topic information in other graph-based ranking algorithms such as HITS (Kleinberg, 1999).
  - We will investigate the influence of corpus selection in training LDA for keyphrase extraction using TPR.
RELATED WORK
- Topical link analysis for web search (Nie et al., 2006)
  - When a surfer follows a graph link from vertex w_i to w_j, the ranking score of w_i on topic z has a higher probability of passing to the same topic of w_j and a lower probability of passing to a different topic of w_j.