Population
- Population of an ontology: finding instances of relations as well as of concepts
  - Requires full understanding of natural language
  - More modest target: the extraction of a set of predefined relations
- In this chapter:
  - No acquisition of instances of relations
  - Only the detection of instances of concepts
Population
- Common Approaches
- Corpus-based Population
  - A standard similarity-based approach
- Learning by Googling
  - Semi-supervised approach
  - PANKOW
  - C-PANKOW
Common Approaches
- Lexico-syntactic patterns
  - Hearst patterns
- Similarity-based classification
  - Algorithm 12
  - Data sparseness problem
- Supervised approaches
  - Predict the category of an instance with a trained model
  - Require thousands of training examples to train the model
  - Not feasible when hundreds of concepts are possible tags
Similarity-based Classification of Named Entities
- Different similarity measures: cosine, Jaccard, L1 norm, Jensen-Shannon divergence, skew divergence
- Different feature weighting measures: conditional probability, PMI, Resnik
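The similarity measures listed above can be sketched on sparse context vectors as follows. This is a minimal illustration under assumptions, not the implementation used in the experiments: vectors are plain dicts mapping features to counts, the function names are made up for this sketch, and the skew parameter alpha=0.99 is an assumed default.

```python
import math

def cosine(p, q):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(v * q.get(f, 0.0) for f, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def jaccard(p, q):
    """Overlap of the feature sets, ignoring the counts."""
    a, b = set(p), set(q)
    return len(a & b) / len(a | b) if a | b else 0.0

def normalize(p):
    """Turn counts into a probability distribution over features."""
    total = sum(p.values())
    return {f: v / total for f, v in p.items()} if total else {}

def l1_distance(p, q):
    """L1 norm of the difference of the two distributions (0 = identical, 2 = disjoint)."""
    p, q = normalize(p), normalize(q)
    return sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in set(p) | set(q))

def kl(p, q):
    """Kullback-Leibler divergence in bits; q must be positive wherever p is."""
    return sum(v * math.log2(v / q[f]) for f, v in p.items() if v > 0)

def jensen_shannon(p, q):
    """Symmetric divergence against the average distribution (0 = identical)."""
    p, q = normalize(p), normalize(q)
    m = {f: 0.5 * (p.get(f, 0.0) + q.get(f, 0.0)) for f in set(p) | set(q)}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def skew_divergence(p, q, alpha=0.99):
    """KL divergence of p from q smoothed with p; alpha is an assumed default."""
    p, q = normalize(p), normalize(q)
    mix = {f: alpha * q.get(f, 0.0) + (1 - alpha) * p.get(f, 0.0) for f in set(p) | set(q)}
    return kl(p, mix)
```

Cosine and Jaccard are similarities (higher means more similar), while L1, Jensen-Shannon and skew are distances or divergences (lower means more similar), so a classifier has to rank candidates accordingly.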
Evaluation
- Goal: learn a function f_S that assigns a concept to each named entity
- f_A and f_B: the assignments specified by two annotators
- Functions are treated as sets
- Measures: precision, recall, F-measure, learning accuracy
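Treating the functions as sets of (instance, concept) pairs, precision, recall and F-measure can be computed by set intersection. The sketch below assumes the standard set-based definitions, since the formulas on the slide were not recovered; the function name and the example assignments are hypothetical. Learning accuracy is not covered because it additionally needs the taxonomy.

```python
def precision_recall_f(system, reference):
    """Compare two instance -> concept assignments as sets of pairs."""
    s = set(system.items())
    r = set(reference.items())
    correct = len(s & r)  # pairs on which system and annotator agree
    precision = correct / len(s) if s else 0.0
    recall = correct / len(r) if r else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical assignments: f_S from the system, f_A from annotator A.
f_S = {"Mopti": "city", "Niger": "river", "Gao": "hotel"}
f_A = {"Mopti": "city", "Niger": "river", "Gao": "city", "San": "city"}
p, r, f = precision_recall_f(f_S, f_A)
```

The same call with f_B gives the scores against the second annotator.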
Experiments: Using Word Windows
- Context: the n words to the left and right of a word of interest
- Stopwords are excluded, and windows do not cross sentence boundaries
- Example text:
  Mopti is the biggest city along the Niger with one of the most vibrant ports and a large bustling market. Mopti has a traditional ambience that other towns seem to have lost. It is also the center of the local tourist industry and suffers from hard-sell overload. The nearby junction towns of Gao and San offer nice views over the Niger's delta.
- Resulting context features (counts in parentheses):
  Mopti: traditional(1), biggest(1)
  Niger: city(1), delta(1), view(1)
  Gao: San(1), offer(1), town(1), junction(1)
  San: offer(1), view(1), Gao(1), nice(1)
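A word-window extractor along these lines can be sketched as below. It is an illustration under assumptions: the stopword list is a small hypothetical one, stopwords are removed before the window is taken, and no lemmatization is applied, so it will not reproduce the counts on the slide exactly (for example it yields "views" rather than "view").

```python
import re
from collections import Counter, defaultdict

# Hypothetical stopword list; a real experiment would use a fuller one.
STOPWORDS = {"is", "the", "a", "an", "of", "and", "with", "to", "that", "it", "has", "over"}

def window_features(text, targets, n=2, stopwords=STOPWORDS):
    """Count the words within n positions of each target, sentence by sentence."""
    features = defaultdict(Counter)
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Tokenize and drop stopwords first, so windows span content words only.
        tokens = [t for t in re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
                  if t.lower() not in stopwords]
        for i, token in enumerate(tokens):
            if token in targets:
                window = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
                for context_word in window:
                    features[token][context_word.lower()] += 1
    return features

example = "Mopti is the biggest city. Gao and San offer nice views."
feats = window_features(example, {"Mopti", "Gao", "San"}, n=2)
```

Because each sentence is processed separately, no context word from the second sentence ends up in the vector for Mopti.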
Experiments Result:
Experiments: Using Pseudo-syntactic Dependencies
- Features are object-attribute pairs
- Example text:
  Mopti is the biggest city along the Niger with one of the most vibrant ports and a large bustling market. Mopti has a traditional ambience that other towns seem to have lost. It is also the center of the local tourist industry and suffers from hard-sell overload. The nearby junction towns of Gao and San offer nice views over the Niger's delta.
- Extracted features (counts in parentheses):
  Mopti: is_city(1), has_ambience(1)
  Niger: has_delta(1)
  Gao: junction_of(1)
  San: offer_subj(1)
- Result:
Experiments: Dealing with Data Sparseness
- Using conjunctions
  - Applies when two named entities are linked by a conjunction
- Result:
Experiments: Dealing with Data Sparseness
- Exploiting the taxonomy
  - Compute the context vector of a term by also considering the context vectors of its subconcepts
  - Only the context vectors of direct subconcepts are taken into account
  - Normalizing the aggregated vectors:
    - Standard normalization of the vector
    - Calculating its centroid
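The aggregation over direct subconcepts and the two normalization options can be sketched as follows. The data layout (dicts of counts, a `children` map of direct subconcepts) and all function names are assumptions made for this illustration; unit-length scaling is taken as one reading of "standard normalization".

```python
import math
from collections import Counter

def aggregate(concept, vectors, children):
    """Sum the concept's own context vector with those of its direct subconcepts."""
    total = Counter(vectors.get(concept, {}))
    for child in children.get(concept, []):  # direct subconcepts only
        total.update(vectors.get(child, {}))
    return total

def unit_normalize(vector):
    """Option 1 (assumed reading): scale the aggregated vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector.values()))
    return {f: v / norm for f, v in vector.items()} if norm else {}

def centroid(concept, vectors, children):
    """Option 2: component-wise mean of the concept's and its subconcepts' vectors."""
    members = [vectors.get(concept, {})] + [vectors.get(c, {}) for c in children.get(concept, [])]
    features = set().union(*members)
    return {f: sum(m.get(f, 0) for m in members) / len(members) for f in features}

# Hypothetical toy data.
vectors = {"city": {"visit": 2}, "capital": {"visit": 1, "government": 3}, "town": {"market": 1}}
children = {"city": ["capital", "town"]}
agg = aggregate("city", vectors, children)
unit = unit_normalize(agg)
cen = centroid("city", vectors, children)
```

Either normalization keeps a concept with many subconcepts from dominating the similarity computation purely through larger counts.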
Experiments Dealing with Data Sparseness Exploiting the Taxonomy Result:
Experiments: Dealing with Data Sparseness
- Anaphora resolution
  - Replace each anaphoric reference with its antecedent
  - Before: The port capital of Vathy is dominated by its fortified Venetian harbor.
  - After: The port capital of Vathy is dominated by Vathy's fortified Venetian harbor.
- Result:
Experiments: Dealing with Data Sparseness
- Downloading documents from the Web
  - Download 20 additional documents D_i for each named entity i
  - Keep only those documents d in D_i whose similarity is above a threshold of 0.2
- Result:
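The filtering step can be sketched as below. The slide does not say what the downloaded documents are compared against or which similarity is used, so this sketch assumes a bag-of-words cosine against a reference text for the entity; `bow`, `cosine` and `filter_documents` are hypothetical helper names, and only the 0.2 threshold comes from the slide.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector of a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(p, q):
    """Cosine similarity of two Counters (missing features count as 0)."""
    dot = sum(v * q[f] for f, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def filter_documents(reference_text, documents, threshold=0.2):
    """Keep only the downloaded documents that are similar enough to the reference."""
    reference = bow(reference_text)
    return [d for d in documents if cosine(reference, bow(d)) > threshold]

# Hypothetical data: one on-topic and one off-topic download for the entity Mopti.
reference = "Mopti is a city on the Niger river"
downloads = ["Mopti city lies on the Niger", "Stock prices fell sharply today"]
kept = filter_documents(reference, downloads)
```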
Experiments: Dealing with Data Sparseness
- Post-processing
  - The k best answers of the system are checked for their statistical plausibility on the Web
- Result:
PANKOW
- Pattern-based Annotation through Knowledge on the Web
- Certain lexico-syntactic patterns, as defined by Hearst, can be matched not only in a corpus but also on the World Wide Web
PANKOW: The Process of PANKOW
- Step 1: Iterate over the set of entities to be classified and generate pattern instances, one for each concept in the ontology.
  - Example: for the instance South Africa and the concepts country and hotel, the resulting pattern instances are "South Africa is a country" and "South Africa is a hotel", or "countries such as South Africa" and "hotels such as South Africa".
  - Result 1: a set of pattern instances
- Step 2: Query Google for the pattern instances through its Web service API.
  - Result 2: the counts for each pattern instance
- Step 3: Sum the query results to a total for each concept.
  - Result: the statistical web fingerprint of each entity, that is, the Google counts aggregated per entity over all pattern instances conveying the relation of interest
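The three steps can be sketched as follows. This is an illustration under assumptions: only two patterns are used, plural forms are supplied by hand, and `count_hits` is a hypothetical callable standing in for the web search API (here backed by made-up counts). Picking the concept with the highest total is also an assumption of this sketch; the slide only describes the fingerprint.

```python
from collections import defaultdict

# Two Hearst-style patterns taken from the example; the real pattern set is larger.
PATTERNS = ["{instance} is a {concept}", "{plural} such as {instance}"]

def pattern_instances(instance, concepts):
    """Step 1: one instantiated pattern per (concept, pattern) combination."""
    for concept, plural in concepts.items():
        for pattern in PATTERNS:
            yield concept, pattern.format(instance=instance, concept=concept, plural=plural)

def web_fingerprint(instance, concepts, count_hits):
    """Steps 2 and 3: query each pattern instance and sum the counts per concept."""
    totals = defaultdict(int)
    for concept, query in pattern_instances(instance, concepts):
        totals[concept] += count_hits(query)
    return dict(totals)

# Made-up hit counts replacing real web queries.
FAKE_COUNTS = {
    "South Africa is a country": 100,
    "countries such as South Africa": 50,
    "South Africa is a hotel": 2,
}
concepts = {"country": "countries", "hotel": "hotels"}  # concept -> plural form
fingerprint = web_fingerprint("South Africa", concepts, lambda q: FAKE_COUNTS.get(q, 0))
best = max(fingerprint, key=fingerprint.get)  # assumed decision rule
```

The number of queries is the number of concepts times the number of patterns for every entity, which is the cost that motivates C-PANKOW.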
PANKOW: Evaluation
- Reference standards from the two annotators, subject A and subject B
- Measures: precision, recall, and F-measure
- The results for both annotators are averaged
PANKOW Result:
C-PANKOW: Shortcomings of PANKOW
- Many actual instances of the pattern schema are not found
- A large number of queries are sent to the Google Web API
- Does not scale to larger ontologies
C-PANKOW: The C-PANKOW Process
- The web page to be annotated is scanned for candidate instances.
- For each instance i discovered and for each clue-pattern pair in the pattern library P, an automatically generated query is issued to Google, and the abstracts (snippets) of the first n hits are downloaded.
- The similarity between the document to be annotated and each downloaded abstract is calculated.
- A pattern matched in a Google abstract is considered only if this similarity is above a given threshold t. The matched pattern then reveals a phrase that may describe the concept the instance belongs to in the context in question. In this way the pattern-matching process is contextualized.
- Finally, the instance i is annotated with the concept c that has the largest number of hits as well as the most contextually relevant ones.
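The contextualized matching can be sketched as below. The sketch rests on several assumptions: the snippets are passed in rather than downloaded, a single "is a" pattern stands in for the pattern library, similarity is a bag-of-words cosine, each accepted match is weighted by that similarity, and the threshold value 0.2 is arbitrary. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector of a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(p, q):
    """Cosine similarity of two Counters (missing features count as 0)."""
    dot = sum(v * q[f] for f, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def contextual_scores(page_text, instance, snippets, threshold=0.2):
    """Score candidate concepts using only snippets similar enough to the page."""
    pattern = re.compile(re.escape(instance) + r" is an? (\w+)", re.IGNORECASE)
    page_vector = bow(page_text)
    scores = Counter()
    for snippet in snippets:
        similarity = cosine(page_vector, bow(snippet))
        if similarity <= threshold:
            continue  # snippet is out of context for this page
        for match in pattern.finditer(snippet):
            scores[match.group(1).lower()] += similarity  # assumed weighting
    return scores

def annotate(page_text, instance, snippets, threshold=0.2):
    """Return the best-scoring concept, or None if no pattern was accepted."""
    scores = contextual_scores(page_text, instance, snippets, threshold)
    return scores.most_common(1)[0][0] if scores else None

# Hypothetical page and snippets for the ambiguous instance "Niger".
page = "Travel guide: Niger river cruise, boats on the river and hotels near the river delta"
snippets = [
    "Niger is a river in West Africa with a large delta",
    "Niger is a country whose economy depends on uranium exports",
]
concept = annotate(page, "Niger", snippets)
```

On this toy input the second snippet falls below the threshold, so only the reading that fits the page's context contributes to the decision.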
C-PANKOW: Evaluation
- Same dataset and evaluation measures as PANKOW
- But C-PANKOW uses the 682 concepts of the pruned Tourism ontology as possible tags
- Learning accuracy added as a measure
C-PANKOW Result: