Published by Maryann Boyd. Modified over 8 years ago.
Keyword Extraction from a Single Document Using Word Co-occurrence Statistical Information
Authors: Yutaka Matsuo & Mitsuru Ishizuka. Designed by CProDM Team.
Outline
- Introduction
- Study Algorithm
- Algorithm Implementation
- Evaluation
Introduction
The algorithm pipeline: discard stop words, stem, extract frequency, select frequent terms, clustering, expected probability, calculate the χ′² value, output keywords.
Study Algorithm: Preprocessing
Goal:
- Remove unnecessary words from the document.
- Obtain terms that are candidate keywords.
Stop words: function words such as "and", "the", and "of", or other words with minimal lexical meaning.
Stemming: remove suffixes from words.
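The preprocessing step above can be sketched as follows. The stop-word list and suffix-stripping rules here are simplified stand-ins for illustration; a real implementation would use a full stop-word list and the Porter stemmer.

```python
import re

# Minimal stop-word list for illustration only; a real system uses a far
# fuller list of function words.
STOP_WORDS = {"the", "and", "of", "it", "is", "that", "to", "in", "a", "be",
              "may", "by", "this", "there", "any", "no", "will", "would", "i"}

def simple_stem(word):
    # Crude suffix stripping for illustration (not the full Porter algorithm).
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, discard stop words, and stem the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("The machine is playing the imitation game")` keeps only the content words and strips the "-ing" suffix from "playing".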
Study Algorithm: Preprocessing (discard stop words)
Before:
"It might be urged that when playing the “imitation game” the best strategy for the machine may possibly be something other than imitation of the behaviour of a man. This may be, but I think it is unlikely that there is any great effect of this kind. In any case there is no intention to investigate here the theory of the game, and it will be assumed that the best strategy is to try to provide answers that would naturally be given by a man."
After:
urged playing “imitation game” best strategy machine possibly imitation behaviour man think unlikely great effect kind. case intention investigate theory game, assumed best strategy try provide answers naturally given man.
Study Algorithm: Preprocessing (stem)
Before stemming:
urged playing “imitation game” best strategy machine possibly imitation behaviour man think unlikely great effect kind. case intention investigate theory game, assumed best strategy try provide answers naturally given man.
After stemming:
urge play “imitation game” best strategi machine possible imitation behaviour man think unlike great effect kind. case intention investigate theory game, assum best strategi try provide answers natural give man.
Study Algorithm: Preprocessing (extract frequency)
From the stemmed text:
urge play “imitation game” best strategi machine possible imitation behaviour man think unlike great effect kind. case intention investigate theory game, assum best strategi try provide answers natural give man.
count how often each term occurs; for example, “imitation”, “best strategi”, and “man” each appear twice.
Study Algorithm: Term Co-occurrence and Importance
Take the top ten frequent terms (denoted as G) and their probabilities of occurrence, normalized so that the sum is 1.
Study Algorithm: Term Co-occurrence and Importance
Two terms appearing in the same sentence are considered to co-occur once.
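This sentence-level co-occurrence rule can be sketched as below: each sentence contributes at most one co-occurrence per term pair, no matter how often the terms repeat within it (function and variable names are ours, for illustration).

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count sentence-level co-occurrences: two terms appearing in the
    same sentence co-occur once, regardless of repetitions."""
    counts = Counter()
    for sentence in sentences:
        # set() collapses repeats; sorted() gives a canonical pair order.
        for w1, w2 in combinations(sorted(set(sentence)), 2):
            counts[(w1, w2)] += 1
    return counts
```

For example, with sentences `[["best", "strategi", "machine"], ["best", "strategi", "man"]]`, the pair ("best", "strategi") is counted twice, once per sentence.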
Study Algorithm: Term Co-occurrence and Importance
The co-occurrence probability distribution of some terms with the frequent terms.
Study Algorithm: Term Co-occurrence and Importance
The statistical value of χ² is defined as
χ²(w) = Σ_{g ∈ G} (freq(w, g) − n_w p_g)² / (n_w p_g)
where:
- p_g: unconditional probability of a frequent term g ∈ G (the expected probability)
- n_w: the total number of co-occurrences of term w with the frequent terms G
- freq(w, g): frequency of co-occurrence of term w and term g
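A minimal sketch of this χ² computation, using sentence-level co-occurrence counts and taking p_g as g's normalized share of all frequent-term occurrences (the counting conventions and names here are our reading of the slide, not the authors' code):

```python
def chi_squared(w, sentences, frequent_terms):
    """Basic chi-squared bias of term w toward the frequent terms
    (before the sentence-length refinement on the later slide)."""
    # freq(w, g): number of sentences in which w and g co-occur.
    freq = {g: sum(1 for s in sentences if w in s and g in s)
            for g in frequent_terms if g != w}
    n_w = sum(freq.values())  # total co-occurrences of w with frequent terms
    if n_w == 0:
        return 0.0
    # p_g: g's share of all frequent-term occurrences, normalized to sum to 1.
    occurrences = {g: sum(s.count(g) for s in sentences)
                   for g in frequent_terms}
    total = sum(occurrences.values())
    chi2 = 0.0
    for g, f in freq.items():
        expected = n_w * occurrences[g] / total  # n_w * p_g
        if expected > 0:
            chi2 += (f - expected) ** 2 / expected
    return chi2
```

A term whose co-occurrences are biased toward one frequent term scores higher than a term spread evenly across them.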
Study Algorithm: Algorithm Improvement
If a term appears in a long sentence, it is likely to co-occur with many terms; if a term appears in a short sentence, it is less likely to co-occur with other terms. We therefore take the length of each sentence into account and revise our definitions:
- p_g: (the sum of the total number of terms in sentences where g appears) divided by (the total number of terms in the document)
- n_w: the total number of terms in the sentences where w appears, including w itself
Study Algorithm: Algorithm Improvement
To measure the robustness of bias values, we subtract the largest single term from χ²(w):
χ′²(w) = χ²(w) − max_{g ∈ G} (freq(w, g) − n_w p_g)² / (n_w p_g)
so that a bias supported by co-occurrence with only one frequent term is discounted.
Study Algorithm: Algorithm Improvement (clustering)
To improve the quality of extracted keywords, we cluster terms. Two major approaches (Hofmann & Puzicha 1998) are:
- Similarity-based clustering: if terms w1 and w2 have similar distributions of co-occurrence with other terms, w1 and w2 are placed in the same cluster.
- Pairwise clustering: if terms w1 and w2 co-occur frequently, w1 and w2 are placed in the same cluster.
Study Algorithm: Algorithm Improvement
In the figure, similarity-based clustering centers on the red circles, while pairwise clustering focuses on the yellow circles.
Study Algorithm: Algorithm Improvement (similarity-based clustering)
Cluster a pair of terms whose Jensen-Shannon divergence shows their co-occurrence distributions to be sufficiently similar, i.e., the divergence falls within a fixed threshold.
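The Jensen-Shannon divergence between two co-occurrence distributions can be computed as below; distributions are passed as term-to-probability dicts. This is a sketch of the standard JS divergence, not necessarily the slide's exact formulation (whose formulas were lost).

```python
from math import log

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete distributions,
    given as dicts mapping term -> probability (each summing to 1)."""
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(a[t] * log(a[t] / b[t]) for t in a if a[t] > 0)
    keys = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in keys}
    pa = {t: p.get(t, 0.0) for t in keys}
    qa = {t: q.get(t, 0.0) for t in keys}
    return 0.5 * kl(pa, m) + 0.5 * kl(qa, m)
```

Identical distributions give 0; disjoint ones give the maximum, log 2 (in nats).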
Study Algorithm: Algorithm Improvement (pairwise clustering)
Cluster a pair of terms whose mutual information exceeds a fixed threshold, where the mutual information is computed from the terms' co-occurrence probability relative to their individual occurrence probabilities.
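A sketch of this pairwise criterion using pointwise mutual information with sentence-level probabilities (each sentence counts once per term); the formula images on the slide were lost, so this is the standard PMI form, log P(w1, w2) / (P(w1) P(w2)):

```python
from math import log

def mutual_information(w1, w2, sentences):
    """Pointwise mutual information of w1 and w2, using sentence-level
    probabilities: a sentence counts once per term it contains."""
    n = len(sentences)
    p1 = sum(1 for s in sentences if w1 in s) / n
    p2 = sum(1 for s in sentences if w2 in s) / n
    p12 = sum(1 for s in sentences if w1 in s and w2 in s) / n
    if p12 == 0:
        return float("-inf")  # never co-occur: no evidence to cluster
    return log(p12 / (p1 * p2))
```

Terms that always appear together score log 2 or more above independence; a positive threshold then selects pairs to merge.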
Algorithm Implementation
Algorithm Implementation, Step 1: Preprocessing
Discard stop words, stem, and extract frequency.
Algorithm Implementation, Step 2: Selection of frequent terms
Count the number of terms in the document (N_total). Select the top frequent terms, up to 30% of the number of running terms, as a standard set of terms.
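This selection step can be sketched as follows; the 30% cap comes from the slide, while the function name and tie-breaking (by first appearance) are our choices.

```python
from collections import Counter

def select_frequent_terms(terms, ratio=0.3):
    """Select the most frequent terms, capped at `ratio` of the number
    of running terms (30% as on the slide)."""
    counts = Counter(terms)
    limit = max(1, int(len(terms) * ratio))
    selected = []
    for term, _ in counts.most_common():
        if len(selected) >= limit:
            break
        selected.append(term)
    return selected
```

For a 10-term document the cap is 3 terms, so only the three most frequent stems survive as the standard set.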
Algorithm Implementation, Step 3: Clustering frequent terms
- Similarity-based clustering
- Pairwise clustering
Algorithm Implementation, Step 4: Calculate the expected probability
Count the number of terms co-occurring with each cluster c ∈ C, denoted n_c, to yield the expected probability p_c = n_c / N_total.
Algorithm Implementation, Step 5: Calculate the χ′² value
χ′²(w) = Σ_{c ∈ C} (freq(w, c) − n_w p_c)² / (n_w p_c) − max_{c ∈ C} (freq(w, c) − n_w p_c)² / (n_w p_c)
where:
- freq(w, c): the co-occurrence frequency of term w with cluster c ∈ C
- n_w: the total number of terms in the sentences including w
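Putting steps 4 and 5 together, a sketch of the cluster-level χ′² computation with the sentence-length-aware definitions (n_w counts all terms in sentences containing w; p_c = n_c / N_total). The exact tallying conventions, e.g. how freq(w, c) is counted, are our reading of the slides, not the authors' code.

```python
def chi_squared_robust(w, sentences, clusters, n_total):
    """χ′²(w): chi-squared bias toward clusters of frequent terms,
    minus the largest single contribution (robustness correction)."""
    # n_w: total number of terms in the sentences that include w.
    n_w = sum(len(s) for s in sentences if w in s)
    if n_w == 0:
        return 0.0
    contributions = []
    for cluster in clusters:
        # n_c: total number of terms in sentences containing a term of c.
        n_c = sum(len(s) for s in sentences if any(t in s for t in cluster))
        p_c = n_c / n_total
        # freq(w, c): cluster-term tokens inside the sentences containing w.
        f = sum(s.count(t) for s in sentences if w in s for t in cluster)
        expected = n_w * p_c
        if expected > 0:
            contributions.append((f - expected) ** 2 / expected)
    if not contributions:
        return 0.0
    # Subtract the largest single contribution so a bias backed by only
    # one cluster is discounted.
    return sum(contributions) - max(contributions)
```

The result is always non-negative, since every contribution is non-negative and only the maximum is removed.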
Algorithm Implementation, Step 6: Output keywords
Sort terms by their χ′² values in descending order and output the top-ranked terms as keywords.
Evaluation
In this paper, we developed an algorithm to extract keywords from a single document. The main advantages of our method are its simplicity, requiring no corpus, and its high performance, comparable to the tf-idf algorithm. As more electronic documents become available, we believe our method will be useful in many applications, especially for domain-independent keyword extraction.
Thank you for your attention.