1
Gimme’ The Context: Context-driven Automatic Semantic Annotation with C-PANKOW. Philipp Cimiano et al.
2
Introduction
Goal: develop the Semantic Web through automated annotation. PANKOW (Pattern-based Annotation through Knowledge On the Web): identify each instance in a page, issue Google queries to find pages about these instances, and use the pages Google finds to annotate each instance with an appropriate concept.
3
Context
‘Niger’: is a country / is a state / is a river / is a region.
[Figure: statistical distribution of ‘is a’ patterns for ‘Niger’]
4
C-PANKOW (context-driven)
1. Instances are extracted from a web page.
2. For each discovered instance, queries are made to Google, and the abstracts of the first n hits are downloaded.
3. The similarity between each abstract and the web page is calculated; only abstracts with a similarity above a threshold t are analyzed.
4. The concept label counts for each instance are updated, weighted by the calculated similarity.
5. Each instance is annotated with the concept that has the largest number of contextually relevant hits.
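The selection logic of steps 3-5 can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: the names `annotate`, `similarity`, `find_concepts`, and the pre-fetched `abstracts_by_instance` mapping are mine, and instance extraction and Google querying (steps 1-2) are assumed to have already happened.

```python
from collections import defaultdict

def annotate(page_text, abstracts_by_instance, similarity, find_concepts, t):
    """Sketch of steps 3-5: weight each concept hit by the abstract's
    similarity to the page and keep the best-scoring concept per instance."""
    annotations = {}
    for instance, abstracts in abstracts_by_instance.items():
        scores = defaultdict(float)
        for abstract in abstracts:
            sim = similarity(abstract, page_text)
            if sim > t:                          # step 3: contextual filter
                for concept in find_concepts(instance, abstract):
                    scores[concept] += sim       # step 4: similarity-weighted vote
        if scores:
            annotations[instance] = max(scores, key=scores.get)  # step 5
    return annotations
```

Passing the similarity and concept-lookup functions in as parameters keeps the skeleton independent of how the later slides implement them.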
5
Instance Recognition
Instances are detected via the following regular expression (over POS-tagged tokens):
INSTANCE := (\w+{DT})? ([a-z]+{JJ})? PRE (MID POST)?
PRE := POST := (([A-Z][a-z]*){NNS|NNP|NN|NP|JJ|UH})+
MID := the{DT} | of{IN} | -{-} | ‘{POS} | (de|la|los|las|del){FW} | [a-z]+{NP|NPS|NN|NNS}
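A rough plain-text approximation of this pattern can be written as an ordinary regex. Note the real pattern matches over POS-tagged tokens (the {DT}, {NN}, ... constraints), which a plain regex cannot enforce, so this sketch simply treats PRE/POST as capitalized word sequences and MID as one of the listed connectors:

```python
import re

# Simplified stand-in for the slide's pattern: no POS tags, so some noise
# (e.g. sentence-initial words) will match too.
PRE = POST = r"(?:[A-Z][a-z]*\s?)+"
MID = r"(?:the|of|-|'|de|la|los|las|del)\s"
INSTANCE = re.compile(rf"{PRE}(?:{MID}{POST})?")

text = "We flew from Lagos to the United States of America yesterday."
candidates = [m.group().strip() for m in INSTANCE.finditer(text)]
print(candidates)
```

The MID alternative lets multi-word names like "United States of America" come out as a single candidate instead of two separate ones.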
6
Downloading Google Abstracts
For each instance i, seven queries are made to Google:
1. such as i
2. especially i
3. including i
4. i and other
5. i or other
6. the i
7. i is
7
Similarity Assessment
Remove stopwords from the documents. Adopt a bag-of-words model to create vectors of word counts. Similarity is the cosine of the angle between the two vectors. Only abstracts with a similarity above the threshold t are considered.
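A minimal sketch of this bag-of-words cosine similarity (the stopword list here is an illustrative subset, not the one the authors used):

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "is", "in", "and", "to"}  # illustrative subset

def bag_of_words(text):
    # Tokenize, lowercase, and drop stopwords, as described on the slide.
    return Counter(w for w in re.findall(r"[a-z']+", text.lower())
                   if w not in STOPWORDS)

def cosine(a, b):
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Identical texts score 1.0, texts with no shared content words score 0.0, and everything else falls in between, which is what the threshold t is compared against.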
8
Updating Concept Labels Search abstracts for concepts via the following patterns:
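The pattern table itself did not survive extraction, but the lookup step can be sketched with a few Hearst-style patterns of the kind the queries above are built from. This is a hypothetical reconstruction (the `concept_hits` name, the exact patterns, and the `(abstract, similarity)` input shape are mine):

```python
import re

def concept_hits(instance, abstracts, ontology_concepts):
    """Scan each abstract for patterns around the instance and tally
    similarity-weighted hits for concepts that exist in the ontology."""
    patterns = [
        rf"(\w+)\s+such as\s+{re.escape(instance)}",   # "<concept>s such as <i>"
        rf"{re.escape(instance)}\s+is an?\s+(\w+)",    # "<i> is a <concept>"
        rf"(\w+),?\s+especially\s+{re.escape(instance)}",
    ]
    hits = {}
    for abstract, sim in abstracts:            # (text, similarity) pairs
        for pat in patterns:
            for m in re.finditer(pat, abstract, re.IGNORECASE):
                concept = m.group(1).lower()
                if concept in ontology_concepts:
                    hits[concept] = hits.get(concept, 0.0) + sim
    return hits
```

The instance is then annotated with the highest-scoring concept, as in the pipeline overview.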
10
C-PANKOW
11
Run Time and Query Size Complexity
Runtime: O(|I| · |P| · n), where |I| is the total number of instances, |P| is the number of patterns, and n is the maximum number of pages downloaded. |P| and n are constants. The Google API allows the retrieval of 10 documents per query.
12
Evaluation
Corpus: destination descriptions from http://www.lonelyplanet.com/destinations
Ontology: a pruned version of the GETESS tourism ontology, consisting of 682 concepts.
Two human annotators annotated 30 texts from the destination descriptions.
13
Instance Detection
Precision: P = 43.75%; Recall: R = 57.20%; F-measure: F = 48.39%.
14
Instance Classification
Metrics reported: accuracy Acc’, accuracy Acc”, learning accuracy LA, precision P, recall R, and F-measure F.
[Results table not reproduced]
15
Threshold
A threshold of 0.05 was chosen.
[Figure: results of varying the threshold (no weighting, n=100)]
16
Similarity
Similarity weighting is used.
[Figure: impact of using the similarity measure (n=100)]
17
Number of Pages
n = 100 pages was chosen.
[Figure: results of varying the number of pages (t=0.05)]
18
A posteriori evaluation
307 news stories from http://news.kmi.open.ac.uk/rostral/ yielded 1270 annotations in total. One annotator analyzed the annotations a posteriori, ranking each annotation from 0 (incorrect) to 3 (totally correct).
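Under the assumption that Pk denotes the fraction of annotations rated at least k on this 0-3 scale (the slides do not define P1/P2/P3, so this is my reading, consistent with P1 > P2 > P3 in the reported numbers), the scores can be computed as:

```python
def scores(ratings):
    """Average rating and Pk = fraction of annotations rated >= k,
    for ratings on the 0-3 scale. Pk definition is an assumption."""
    n = len(ratings)
    avg = sum(ratings) / n
    pk = {k: sum(r >= k for r in ratings) / n for k in (1, 2, 3)}
    return avg, pk
```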
19
A posteriori evaluation
News stories dataset: average score 1.81; P3 = 54.88%; P2 = 57.95%; P1 = 68.66%.
Lonely Planet dataset: average score 2.1; P3 = 58.14%; P2 = 71.1%; P1 = 76.8%.
20
WordNet as Ontology
Used WordNet as a general-purpose ontology and implemented a simple word sense disambiguation algorithm. Results on the 307 news stories: P3 = 27.91%; P2 = 33.47%; P1 = 43.43%.
21
Related Work
22
Conclusions
By linguistically analyzing and normalizing pages, the recall of the pattern-matching process is improved. The number of queries to the Google API is reduced (it is now constant per annotated instance). Contextualization yields more accurate annotations.
23
Future Work
Learn to annotate conceptual relations between discovered instances. Learn new patterns indicating a given relation via rule induction.