Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge
Xia Hu, Nan Sun, Chao Zhang, Tat-Seng Chua
Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), 2009
Speaker: Chien-Liang Wu (2016/3/11)
Outline
- Motivation
- Framework for feature construction
- Hierarchical Resolution
- Feature Generation
- Feature Selection
- Evaluation
Motivation
- Aggregated search gathers search results from various resources and presents them in a succinct format.
- A key issue in aggregation technology: "How should the information be presented to the user?"
- Traditionally, search results are browsed as a ranked list, which is inconvenient for users trying to locate their interests effectively.
Motivation (contd.)
- Some commercial aggregated search systems, such as DIGEST and Clusty, cluster relevant search results, making the information more systematic and manageable.
- Short texts (snippets, product descriptions, QA passages, and image captions) play important roles in Web and IR applications; successfully processing short texts is essential for aggregated search systems.
- Short texts consist of a few phrases or 2–3 sentences, which presents great challenges in clustering.
Framework
(Framework figure with three stages.)
Example query: "The Dark Knight"
Google snippet: "Jul 18, 2008 ... It is the best American film of the year so far and likely to remain that way. Christopher Nolan's The Dark Knight is revelatory, visceral..."
Hierarchical Resolution
- Exploits the internal semantics of short texts, preserving contextual information and avoiding data sparsity.
- NLP is used to construct a syntax tree and decompose the original text into a three-level top-down hierarchical structure: segment level, phrase level, and word level.
Three Levels
- Segment level: the text is split into segments with the help of punctuation. Segments are generally ambiguous and often fail to convey the exact information representing the short text.
- Phrase level: shallow parsing divides the segments into a series of chunks; after stemming and removal of stop words, the NP and VP chunks are employed as phrase-level features.
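The segment-level split above can be sketched as follows; this is a minimal illustration assuming segments are delimited by common punctuation marks (the slides do not spell out the exact splitting rules).

```python
import re

def split_segments(text: str) -> list[str]:
    # Split on common punctuation and drop empty fragments; a rough
    # approximation of the paper's punctuation-based segmentation.
    parts = re.split(r"[.,;:!?]+", text)
    return [p.strip() for p in parts if p.strip()]

snippet = ("It is the best American film of the year so far. "
           "Christopher Nolan's The Dark Knight is revelatory, visceral")
print(split_segments(snippet))
```

Each resulting segment would then be handed to a shallow parser to extract NP and VP chunks.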
Three Levels (contd.)
- Word level: the phrase-level features are decomposed directly into word-level features by choosing the non-stop words contained in the NP and VP chunks and further removing meaningless words from the short texts.
- Original feature set: features are selected at the phrase level and the word level; phrase-level features retain the original contextual information, while word-level features avoid the problem of data sparseness.
Feature Generation
- Goal: build semantic relationships with other relevant concepts, e.g. "The Dark Knight" and "Batman".
- Two steps: (1) select seed phrases from the internal semantics; (2) generate external features from the seed phrases.
Seed Phrase Selection
- The features at the segment level and phrase level are used to construct the seed phrases.
- Redundancy problem: the segment-level feature "Christopher Nolan's The Dark Knight is revelatory visceral" yields three phrase-level features: [NP Christopher Nolan's], [NP The Dark Knight], and [NP revelatory visceral].
- To eliminate information redundancy, the semantic similarity between each phrase-level feature and its parent segment-level feature is measured.
Semantic Similarity Measure
- A phrase-phrase semantic similarity algorithm uses a co-occurrence double check in Wikipedia to reduce semantic duplicates.
- Preparation: download the XML corpus of Wikipedia and build a Solr index of all XML articles.
- Let P = {p1, p2, ..., pn}, where P is a segment-level feature and each pi is a phrase-level feature; InfoScore(pi) is the semantic similarity between pi and {p1, p2, ..., pn}.
Semantic Similarity Measure (contd.)
- Given two phrases pi and pj, each is used separately as a query to retrieve the top C Wikipedia pages.
- f(pi): total occurrences of pi in the top C Wikipedia pages retrieved by query pi.
- f(pi|pj): total occurrences of pi in the top C Wikipedia pages retrieved by query pj.
- Variants of three popular co-occurrence measures are applied to these counts.
Semantic Similarity Measure (contd.)
- The similarity scores are normalized into the [0, 1] range.
- The final score is a linear combination of the three normalized measures (the evaluation later sets the combination weights to α = β = ⅓).
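The combination step might look like the sketch below. The slides name neither the three measure variants nor their exact formulas, so the Jaccard-, Dice-, and overlap-style forms here are assumptions built from the counts f(pi), f(pj), f(pi|pj), and f(pj|pi); only the clipping to [0, 1] and the weighted linear combination come from the slides.

```python
def combined_similarity(f_i, f_j, f_ij, f_ji, alpha=1/3, beta=1/3):
    """Linear combination of three co-occurrence variants.

    f_i / f_j: occurrences of each phrase in its own result pages.
    f_ij / f_ji: cross occurrences from the double check.
    The three variant formulas are hypothetical stand-ins.
    """
    co = (f_ij + f_ji) / 2.0                      # symmetric co-occurrence
    jac = co / (f_i + f_j - co) if f_i + f_j - co > 0 else 0.0
    dice = 2 * co / (f_i + f_j) if f_i + f_j > 0 else 0.0
    ovl = co / min(f_i, f_j) if min(f_i, f_j) > 0 else 0.0
    # normalize each variant into [0, 1] before combining
    jac, dice, ovl = (min(max(s, 0.0), 1.0) for s in (jac, dice, ovl))
    return alpha * jac + beta * dice + (1 - alpha - beta) * ovl

print(combined_similarity(10, 10, 5, 5))  # equal weights alpha = beta = 1/3
```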
Semantic Similarity Measure (contd.)
- For each segment-level feature, the information scores of its child phrase-level features are ranked.
- The phrase-level feature p* whose information most duplicates the segment-level feature P is removed.
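The ranking-and-removal step reduces to dropping the highest-scoring child, as in this minimal sketch (the phrase names and InfoScore values below are illustrative, not from the paper):

```python
def drop_most_redundant(phrases, info_scores):
    """Remove the phrase p* with the highest InfoScore, i.e. the one
    whose information is most duplicated by its parent segment."""
    p_star = max(phrases, key=lambda p: info_scores[p])
    return [p for p in phrases if p != p_star]

phrases = ["Christopher Nolan's", "The Dark Knight", "revelatory visceral"]
scores = {"Christopher Nolan's": 0.41,        # hypothetical InfoScores
          "The Dark Knight": 0.72,
          "revelatory visceral": 0.18}
print(drop_most_redundant(phrases, scores))
```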
Feature Generator
- For each seed phrase, the top w Wikipedia articles are retrieved.
- External features: (1) titles and bold terms (links) in the retrieved Wikipedia pages; (2) key phrases extracted from the Wikipedia pages by the Lingo algorithm.
- Example: for "in his car", the WordNet synsets give "auto", "automobile", "autocar".
Feature Selection
- Overzealous external features can adversely impact effectiveness by diluting the influence of the valuable original information.
- Empirical rules refine the unstructured features obtained from Wikipedia pages:
  - Remove seed phrases that are too general (more than 10,000 occurrences).
  - Transform features used for Wikipedia management or administration, e.g. "List of hotels" → "hotels", "List of twins" → "twins".
  - Apply phrase-sense stemming using the Porter stemmer.
  - Remove features related to chronology, e.g. "year", "decade", and "centuries".
External Feature Collection
- An (n1 + n2)-dimensional feature space is constructed, with n1 original features and n2 external features; a parameter θ controls the balance (θ = 0: no external features; θ = 1: no original features).
- One seed phrase si (0 < i ≤ m) may generate k external features {fi1, fi2, ..., fik}.
- One feature fi* is selected for each seed phrase; the top n2 − m features are then extracted from the remaining external features based on their frequency.
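Assembling the n2 external features might look like the sketch below. Taking θ as the fraction of external features in the space (so n2 = θ/(1−θ) · n1) is an assumption consistent with the θ = 0 and θ = 1 boundary cases on the slide; picking fi* as the most frequent feature of each seed is likewise assumed.

```python
def select_external(seed_to_feats, feat_freq, n1, theta=0.5):
    """seed_to_feats: seed phrase -> its candidate external features.
    feat_freq: feature -> frequency. Returns the n2 chosen features."""
    n2 = round(theta / (1 - theta) * n1)   # external-feature budget (assumed)
    chosen = []
    # one best feature f_i* per seed phrase (here: highest frequency)
    for seed, feats in seed_to_feats.items():
        chosen.append(max(feats, key=lambda f: feat_freq[f]))
    # top (n2 - m) of the remaining features by frequency
    remaining = [f for feats in seed_to_feats.values()
                 for f in feats if f not in chosen]
    remaining.sort(key=lambda f: feat_freq[f], reverse=True)
    chosen.extend(remaining[:max(0, n2 - len(seed_to_feats))])
    return chosen[:n2]

print(select_external({"s1": ["a", "b"], "s2": ["c", "d"]},
                      {"a": 5, "b": 3, "c": 4, "d": 1}, n1=4, theta=0.5))
```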
Evaluation Datasets
- Reuters-21578: texts containing more than 50 words are removed, and clusters with fewer than 5 or more than 500 texts are filtered out, leaving 19 clusters comprising 879 texts.
- Web dataset: ten hot queries are taken from Google Trends, and the top 100 snippets for each query are retrieved, building a 10-category Web dataset with 1000 texts.
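The Reuters-21578 preprocessing described above can be sketched like this, assuming the corpus is available as (cluster_label, text) pairs (the representation is illustrative):

```python
from collections import defaultdict

def filter_corpus(docs, max_words=50, min_size=5, max_size=500):
    """Keep texts of at most max_words words, then keep only clusters
    whose size falls in [min_size, max_size]."""
    by_cluster = defaultdict(list)
    for label, text in docs:
        if len(text.split()) <= max_words:   # drop texts over 50 words
            by_cluster[label].append(text)
    return {label: texts for label, texts in by_cluster.items()
            if min_size <= len(texts) <= max_size}
```

On the real corpus, these thresholds leave the 19 clusters and 879 texts reported on the slide.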
Clustering Methods
- Two clustering algorithms: K-means and EM.
- Six text representation methods:
  - BOW (baseline 1): traditional bag-of-words model with the tf-idf weighting schema.
  - BOW+WN (baseline 2): BOW integrated with additional features from WordNet.
  - BOW+Wiki (baseline 3): BOW integrated with additional features from Wikipedia.
  - BOW+Know (baseline 4): BOW integrated with additional features from Wikipedia and WordNet.
  - BOF: the bag of original features extracted with the hierarchical view.
  - SemKnow: the proposed framework.
Evaluation Criteria
- F1 measure: F1 = 2 · (precision · recall) / (precision + recall)
- Average accuracy
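The F1 measure from the slide, directly as code:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, precision 0.8 with recall 0.5 gives F1 = 0.8/1.3 ≈ 0.615, lower than the arithmetic mean because F1 penalizes imbalance between the two.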
Performance Evaluation
- Parameter settings: C = 100, w = 20, α = β = ⅓, θ = 0.5.
- Results are reported with both the K-means and the EM algorithm.
Effect of External Features
- With a small external feature set (θ = 0.2 or θ = 0.3), SemKnow achieves the best performance.
- (Figures: Reuters and the Web dataset, both clustered with the K-means algorithm.)
Optimal Results
(Table: optimal results using the two clustering algorithms.)
Detailed Analysis
(Figure: the feature space for the example snippet.)