Presented By- Shahina Ferdous, Student ID – , Spring 2010
SemTag is an application built on the Seeker platform that adds semantic tags to the existing HTML body of the web. Example: “The Chicago Bulls announced yesterday that Michael Jordan will…” becomes the same sentence with “Chicago Bulls” and “Michael Jordan” each annotated with a reference to a TAP entity under http://tap.stanford.edu/ (references ending in Team_Bulls and AthleteJordan_Michael, respectively). Creating semantic tags automatically at this scale will accelerate the creation of the Semantic Web.
The Semantic Web is a vision of transforming all documents on the web into a machine-understandable format so that applications can operate on them without human intervention. All entities in documents will be canonically annotated, so programs can easily determine what a document is about.
To accomplish the Semantic Web vision, we need:
◦ Ontological support in the form of web-available services, which maintain metadata about entities and provide it whenever needed.
◦ Large-scale availability of annotations within documents, encoding canonical references to the entities.
We also need to break a circular dependency:
◦ Applications that make extensive use of semantically tagged data are needed.
◦ Enough tagged data must exist on the web for these applications to be useful.
Tagging is a way to classify entities in written or spoken text. Any tagging process generally consists of two steps:
◦ Step 1: Identify the entities that should be classified.
◦ Step 2: Classify these instances according to their categories.
In semantic tagging, the categories used to classify the entities are derived from their meaning (what is being said rather than how it is said).
Sense tagging:
◦ “He runs the company” (run1 = control)
◦ “He runs the marathon” (run2 = run by foot)
Feature tagging (human vs. non-human):
◦ “The speaker coughed” (human)
◦ “The speaker was disconnected” (non-human)
Ambiguities in a natural-language corpus such as the web must be resolved. Maintaining and updating a large-scale corpus requires a scalable infrastructure, which most tagging applications cannot support. A platform that multiple tagging applications can share is required.
◦ Designed the Seeker platform, which provides highly scalable core functionality to support SemTag and other tagging algorithms.
◦ SemTag uses a new disambiguation algorithm called TBD (Taxonomy-Based Disambiguation).
◦ Applied SemTag to a collection of approximately 264 million web pages and generated 434 million automatically disambiguated semantic tags.
◦ Published metadata about the annotations to the web as a label bureau.
SemTag runs in three phases:
◦ Spotting pass: generate a window of context surrounding each label (10 words before the label and 10 words after it).
◦ Learning pass: use a representative sample to determine the distribution of terms in the taxonomy.
◦ Tagging pass: disambiguate references using the TBD algorithm.
Two kinds of ambiguity arise:
◦ The same label appears at multiple locations in the TAP ontology.
◦ Some labels occur in contexts that are missing from the taxonomy.
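The spotting pass can be illustrated with a minimal sketch. It assumes whitespace tokenization and exact, case-sensitive label matching, which are simplifications chosen here for illustration; the function name `spot_windows` is hypothetical and is not taken from SemTag itself.

```python
def spot_windows(tokens, label_tokens, width=10):
    """Yield (left_context, label, right_context) for each occurrence of the label.

    `width` is the number of context words kept on each side of the label
    (10 in the slides; smaller in the toy example below).
    """
    n = len(label_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == label_tokens:
            left = tokens[max(0, i - width):i]    # up to `width` words before the label
            right = tokens[i + n:i + n + width]   # up to `width` words after the label
            yield left, label_tokens, right


# Toy example with a 3-word window so the result is easy to inspect.
text = "The Chicago Bulls announced yesterday that Michael Jordan will retire"
tokens = text.split()
windows = list(spot_windows(tokens, ["Michael", "Jordan"], width=3))
```

Here `windows` holds a single spot whose left context is the three words before "Michael Jordan" and whose right context is the two words that remain after it. The later passes would consume windows of this shape.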
TBD makes use of two classes of training information:
◦ Automatic metadata: helps determine whether the context around a label belongs within a subtree of the taxonomy.
◦ Manual metadata: indicates, for nodes of the taxonomy, whether they contain highly ambiguous or unambiguous labels.
An ontology in TBD is defined by four elements:
◦ A set of classes, C
◦ A subclass relation, s(c1, c2)
◦ A set of instances, I
◦ A type relation, t(i, c)
A taxonomy T is defined by three elements:
◦ A set of nodes, V
◦ A root node, r
◦ A parent function, p
An ontology describes relationships in an N-dimensional manner, whereas a taxonomy describes hierarchical relationships.
Each node in the taxonomy has a set of labels. For example, the nodes Musician, Singer, and Band Members can all contain the label Mark Knopfler. An ancestry chain is the path from a node to the root of the taxonomy, obtained by following the parent relation. A spot, spot(l, c), e.g. spot(Mark Knopfler, Singer), is a label in a context.
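The taxonomy definition (nodes V, root r, parent function p), the per-node label sets, and the ancestry chain can be sketched as a small data structure. The class and method names are hypothetical, and the Root > Music > Musician > Singer hierarchy is an assumed toy example, not the actual TAP taxonomy.

```python
class Taxonomy:
    """A taxonomy as a set of nodes, a root node, and a parent function."""

    def __init__(self, root):
        self.root = root
        self.parent = {root: None}    # parent function p; the root has no parent
        self.labels = {root: set()}   # each node carries a set of labels

    def add_node(self, node, parent, labels=()):
        if parent not in self.parent:
            raise KeyError(f"unknown parent node: {parent!r}")
        self.parent[node] = parent
        self.labels[node] = set(labels)

    def ancestry_chain(self, node):
        """Path from `node` up to the root, following the parent function."""
        chain = []
        while node is not None:
            chain.append(node)
            node = self.parent[node]
        return chain

    def nodes_with_label(self, label):
        """All nodes whose label set contains `label` (the source of ambiguity)."""
        return sorted(n for n, ls in self.labels.items() if label in ls)


# Assumed toy hierarchy for illustration only.
tax = Taxonomy("Root")
tax.add_node("Music", "Root")
tax.add_node("Musician", "Music", labels={"Mark Knopfler"})
tax.add_node("Singer", "Musician", labels={"Mark Knopfler"})

chain = tax.ancestry_chain("Singer")
candidates = tax.nodes_with_label("Mark Knopfler")
```

In this toy setup the label Mark Knopfler appears at two nodes, so a spot containing it has two candidate nodes, and each candidate's ancestry chain gives the internal nodes whose similarity functions could be consulted.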
Each internal node in TAP is associated with a similarity function that determines whether a particular context is similar to that node. A good similarity function has the property that the higher the similarity, the more likely it is that the spot contains a reference to an entity belonging to the subtree rooted at that node.
[Figure: example of a subtree in the taxonomy, with nodes Music, Musician, and Singer and the label Mark Knopfler shown twice. For Spot(Mark Knopfler, Singer), with context c and node u, the similarity value should be higher.]
TBD determines whether a particular context is appropriate to a particular node in the taxonomy.
TBD uses the manually generated metadata as a training set to calculate $m^a_u$ and $m^s_u$, where $m^a_u$ is the probability, as measured by human judgment, that spots for the subtree rooted at u are on topic, and $m^s_u$ is the probability that Sim correctly judges whether spots for the subtree rooted at u are on topic.
Lexicon generation:
◦ Built a collection of 1.4 million unique words occurring in a random subset of windows containing approximately 90 million total words.
◦ Kept the 200,100 most frequent words.
◦ Removed the 100 most frequent of those words.
◦ Further computations are performed in the 200,000-dimensional vector space defined by the remaining words.
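The lexicon steps above can be sketched as follows. This is an illustrative version under stated assumptions: windows are already tokenized into word lists, and the function name `build_lexicon` and the toy data are hypothetical. With the slide's parameters (`keep=200_100`, `drop_top=100`), the result would have 200,000 entries.

```python
from collections import Counter


def build_lexicon(windows, keep=200_100, drop_top=100):
    """Keep the `keep` most frequent words, then discard the `drop_top` most frequent."""
    counts = Counter(word for window in windows for word in window)
    ranked = [word for word, _ in counts.most_common(keep)]  # most frequent first
    return ranked[drop_top:]


# Toy data: counts are the=4, bulls=3, jordan=2, retire=1.
toy_windows = [
    ["the", "bulls", "the", "jordan"],
    ["the", "bulls", "jordan", "the"],
    ["bulls", "retire"],
]
lexicon = build_lexicon(toy_windows, keep=3, drop_top=1)

# Each lexicon word becomes one dimension of the vector space.
word_index = {word: i for i, word in enumerate(lexicon)}
```

Dropping the very top of the frequency ranking removes words that occur almost everywhere and so carry little information about which taxonomy node a context belongs to.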
Each node is associated with a 200,000-dimensional vector. Four standard candidates for similarity functions were evaluated, combining two weighting schemes with two algorithms:
◦ Scheme ‘Prob’
◦ Scheme ‘TF-IDF’
◦ Algorithm ‘IR’
◦ Algorithm ‘Bayes’
According to their results, the IR algorithm with the TF-IDF scheme gives the best accuracy (82%), which is a significant improvement.
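To make the best-performing combination concrete, here is a minimal sketch of TF-IDF weighting with a cosine (IR-style) similarity between a spot's context and a node's vector. The slides do not give the exact weighting or scoring formulas used, so the textbook forms below (raw term frequency times log inverse document frequency, cosine similarity) are assumptions, and the node names and word lists are invented toy data.

```python
import math
from collections import Counter


def tfidf_vector(words, doc_freq, num_docs):
    """Sparse TF-IDF vector; words unseen in training are dropped."""
    tf = Counter(words)
    return {
        w: tf[w] * math.log(num_docs / doc_freq[w])
        for w in tf
        if doc_freq.get(w, 0) > 0
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy "training" contexts per taxonomy node (invented for illustration).
node_words = {
    "Singer": ["guitar", "album", "band", "tour"],
    "Athlete": ["basketball", "team", "season", "tour"],
}
num_docs = len(node_words)
doc_freq = Counter(w for words in node_words.values() for w in set(words))
node_vectors = {n: tfidf_vector(ws, doc_freq, num_docs) for n, ws in node_words.items()}

# Context of a spot to be disambiguated.
context = tfidf_vector(["guitar", "band", "concert"], doc_freq, num_docs)
sim_singer = cosine(context, node_vectors["Singer"])
sim_athlete = cosine(context, node_vectors["Athlete"])
```

In this toy example the context shares weighted terms only with the Singer node, so `sim_singer` exceeds `sim_athlete`. The word "tour" occurs at both nodes and therefore receives zero IDF weight.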
Seeker is a platform developed to support SemTag and other sophisticated text-analytics applications. It is designed to achieve the following goals:
◦ Composability
◦ Modularity
◦ Extensibility
◦ Scalability
◦ Robustness
Seeker is a service-oriented architecture (SOA): a local-area, loosely coupled, pull-based distributed computation system. To address scalability and robustness, Seeker includes a component called Infrastructure, which contains a small set of critical services. Analysis agents process web pages to generate annotations.
Automatic semantic tagging is essential to bootstrap the Semantic Web. It’s possible to achieve good accuracy even with simple disambiguation approaches.
Questions?