Semantic Web Course - Semantic Annotation. Sadegh Aliakbary, Mohammad Amin Badiezadegan, Mahdy Khayyamian. Spring 2007
Presentation Outline: Semantic annotation overview; KIM as a semantic annotation tool; A semantic annotation paper review.
Need for Semantic Annotation. Gartner reported in 2002 that 95% of human-to-computer information input involves textual language, and that taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications by 2012. There is therefore a large gap between these two forms of information representation, which automatic semantic annotation should bridge.
Need for Semantic Annotation (cont.). The semantic web aims to add a machine-readable layer to complement the existing web. To realize this vision, the creation of semantic annotations, that is, the linking of web pages to ontologies, must become an automatic or semi-automatic process.
Definitions. Semantic annotation is the process of tying semantic models and natural language together. It assigns to the entities and relations in a text links to their semantic descriptions in ontologies.
Information Extraction (IE). The semantic annotation process involves information extraction. Information extraction is a technology based on analyzing natural language in order to extract snippets of information. The process takes text as input and produces fixed-format, unambiguous data as output. The data may be displayed directly to users, stored in a database, or used for indexing purposes in information retrieval systems such as internet search engines.
IE vs. IR. An IR system finds relevant texts and presents them to the user; an IE application analyses texts and presents only the specific information from them that the user is interested in. IE systems are more difficult and knowledge-intensive to build. IE systems are domain dependent, but IR systems are not. IE is more computationally intensive than IR. IE is more efficient than IR when large volumes of text are available, because it reduces the amount of time people spend reading. IE is more suitable than IR where results need to be presented in a structured, unambiguous format.
Information types in IE. Entities: things in the text, for example people, places, organizations, amounts of money, dates, etc. Mentions: all the places where particular entities are referred to in the text. Descriptions of the entities present. Relations between entities. Events involving the entities.
IE Example. Consider the text: ‘Ryanair announced yesterday that it will make Shannon its next European base, expanding its route network to 14 in an investment worth around 180m. The airline says it will deliver 1.3 million passengers in the first year of the agreement, rising to two million by the fifth year’.
IE Example (cont.). IE will discover ‘Shannon’ and ‘Ryanair’ as entities. IE will discover that ‘it’ and ‘its’ in the first sentence refer to Ryanair, via a process of reference resolution. IE will discover descriptive information like ‘Shannon is a European base’. IE will discover relations like ‘Shannon will be a base of Ryanair’. IE will discover events like ‘Ryanair will invest 180 million euro in Shannon’.
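The "fixed-format, unambiguous data" that IE produces for the Ryanair text can be pictured as a handful of records. The following is a minimal sketch with hand-filled values taken from the example; the class names, field names, and the `base_of` predicate are illustrative assumptions, not the output format of any particular IE system.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    type: str
    # Surface strings in the text that refer to this entity.
    mentions: list = field(default_factory=list)


@dataclass
class Relation:
    subject: str
    predicate: str
    object: str


# Hand-filled records; a real IE system would derive them from the text.
ryanair = Entity("Ryanair", "Organization", mentions=["Ryanair", "it", "its"])
shannon = Entity("Shannon", "Location", mentions=["Shannon"])

# Relation: 'Shannon will be a base of Ryanair'.
relations = [Relation("Shannon", "base_of", "Ryanair")]

# Event: 'Ryanair will invest 180 million euro in Shannon'.
events = [{"type": "Investment", "agent": "Ryanair",
           "location": "Shannon", "amount": "180 million euro"}]
```

Keeping mentions inside the entity record reflects the result of reference resolution: several surface strings point at one entity.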
Entity extraction. Entity recognition is the simplest and most reliable IE technology. It can be performed at up to around 95% accuracy. Human annotators do not reach 100% either, so entity recognition functions at human performance levels.
Finding the mentions of entities. Coreference resolution (CO) identifies identity relations between entities in texts. These entities are both those identified by entity recognition and anaphoric references to those entities. This process is less relevant to end users than other IE tasks, but it is a basis for other IE tasks such as relation and event extraction. It breaks down into two sub-problems: anaphoric resolution (e.g., ‘I’ with Ali) and proper-noun resolution (e.g., ‘IBM’, ‘IBM Europe’, ‘International Business Machines Ltd’). CO resolution is an imprecise process (about 50-60%), particularly when applied to anaphoric reference.
Description Extraction. Description extraction builds on entity recognition and coreference resolution by associating descriptive information with the entities. For example, in a news article the ‘Bush administration’ can also be referred to as ‘government official’. Good scores for description extraction systems are around 80%; on similar tasks humans can achieve results in the mid 90s. It is weakly domain dependent.
Relation Extraction. Relation extraction requires the identification of a small number of possible relations between the elements. Extraction of relations among entities is a central feature of almost any information extraction task. In general, relation extraction (TR) systems score around 75%. Relation extraction is weakly domain dependent.
Event Extraction. Event extraction represents information relating to events. It is the prototypical output of IE systems, being the original task for which the term was coined. It is a difficult IE task: the best systems score around 60%, and the human score is around 80%. It is possible to increase precision at the expense of recall.
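The precision and recall trade-off can be made concrete with the standard definitions: precision is the fraction of extracted items that are correct, and recall is the fraction of correct items that were extracted. The sketch below uses made-up event identifiers; the two hypothetical systems are illustrations, not measurements from any evaluation.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for a set of predictions against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up identifiers of the events a human annotator marked in a text.
gold = {"e1", "e2", "e3", "e4"}

# A cautious system outputs only what it is sure about.
cautious = {"e1", "e2"}
# A liberal system outputs more, finding more events but also making mistakes.
liberal = {"e1", "e2", "e3", "e9", "e10"}

print(precision_recall(cautious, gold))
print(precision_recall(liberal, gold))
```

The cautious system never errs but misses half the events; the liberal system finds more of them at the cost of wrong extractions.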
Event Extraction example. In a news article, description extraction may have identified Mr. Smith and Mr. Jones as person entities, along with a company. Relation extraction would identify that these people work for the company. Event extraction identifies facts such as that they signed a contract on behalf of the company with another supplier company.
Realizing the semantic web vision. Formally annotate and hyperlink (references to) entities and relations. Index and retrieve documents with respect to entities and relations.
Applications. Highlighting; semantic search; categorization; generation of more advanced metadata; smooth traversal between unstructured text and formal knowledge.
Ontology-based Information Extraction (OBIE). OBIE uses a formal ontology as one of the system’s resources and may involve reasoning. Each recognized entity is linked to its semantic description in the instance base (through a URI mechanism), which allows entity tracking and description enrichment through the IE process.
OBIE subtasks. Identification of instances from the ontology; automatic ontology population.
Presentation Outline: Semantic annotation overview; KIM as a semantic annotation tool; A semantic annotation paper review.
Annotation Tools. Tools are categorized as traditional information extraction (IE) or ontology-based IE (OBIE). The differences: OBIE uses ontologies as a resource; OBIE may also involve reasoning; OBIE assigns each term its semantics using a hyperlink.
Annotation Tools (cont.). Traditional IE: AeroDAML, Amilcare, MnM, S-Cream. Ontology-based IE: Magpie, Pankow, SemTag, KIM.
KIM Introduction. KIM is the Knowledge and Information Management system, a software platform for: automatic semantic annotation, indexing, and retrieval of unstructured and semi-structured content; query and exploration of formal knowledge; co-occurrence tracking and ranking of entities; entity popularity timeline analysis. Applications: generation of metadata for the Semantic Web; knowledge management.
KIM Platform. Based on GATE, Sesame, OWLIM, and Lucene. The KIM Platform includes: the KIM Ontology (KIMO); the KIM World KB; the KIM Server, with an API for remote access and integration; clients: the KIM Web UI and a plug-in for Internet Explorer.
World Knowledge. Common knowledge is based on social, cultural, historical, and educational context: common sense, common culture, events, famous people, films, companies, and so on. KIM tries to provide common knowledge for the most popular entities, such as the ones that appear in the news. KIM knows locations, organizations, and specific people.
KIM Features. Annotations are kept separate from the content. An API is used for management. The knowledge base is populated with 200,000 frequently used entities, mainly locations with their aliases, geographic coordinates, and co-positioning relations.
KIM Annotating Process. KIM analyzes texts and recognizes references to entities (such as persons, organizations, locations, and dates). It matches each reference with a known entity that has a unique URI and description; if there is no match, a new URI and description are generated automatically. Finally, the reference in the document is annotated with the URI of the entity. For each term, KIM identifies its class and alias.
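The matching step of this process can be sketched as a lookup against a small knowledge base. This is a toy illustration under loud assumptions: it is not the KIM API, the `annotate` function and the `example.org` URIs are hypothetical, and the recognition of references (which a real system performs with an IE pipeline) is taken as given input.

```python
import re

# Toy knowledge base mapping a lowercased alias to (URI, class).
# The URIs and the example.org namespace are made up for illustration.
KB = {
    "ryanair": ("http://example.org/entity/Ryanair", "Organization"),
    "shannon": ("http://example.org/entity/Shannon", "Location"),
}


def annotate(text, references, kb):
    """Return stand-off annotations for already-recognized references.

    `references` is a list of (surface form, class) pairs.
    """
    annotations = []
    for surface, entity_class in references:
        key = surface.lower()
        if key not in kb:
            # Unknown entity: generate a new URI and a minimal description.
            new_uri = "http://example.org/entity/" + re.sub(r"\W+", "_", surface)
            kb[key] = (new_uri, entity_class)
        uri, known_class = kb[key]
        # Annotate every occurrence with character offsets and the entity URI.
        for match in re.finditer(re.escape(surface), text):
            annotations.append({"start": match.start(), "end": match.end(),
                                "uri": uri, "class": known_class})
    return annotations


text = "Ryanair will make Shannon its next base, says Acme Air."
references = [("Ryanair", "Organization"), ("Shannon", "Location"),
              ("Acme Air", "Organization")]
# Work on a copy so the toy KB above is left unchanged.
annotations = annotate(text, references, dict(KB))
```

The annotations carry offsets rather than modifying the text, in line with keeping annotations separate from content.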
KIM Clients. KIM Plug-in for Internet Explorer: used for semantic annotation; highlights instances in colors. KIM Web UI: a powerful semantic search interface. Address: http://www.ontotext.com/kim/
[Figure: KIM IE Plug-in]
[Figure: Annotated Page]
[Figure: Entity Description]
[Figure: KIM Web UI, http://62.213.161.156/KIM/screen/KWUIMain.jsp]
[Figure: Entity Pattern Search]
Presentation Outline: Semantic annotation overview; KIM as a semantic annotation tool; A semantic annotation paper review.
Annotation Methods. Manual; rule-based; machine learning based.
Two-dimensional Content. Annotation methods typically convert the web page into a sequence of ‘objects’ and then apply techniques to identify the subsequence to be annotated. However, information on a web page is usually laid out in two dimensions and should not be described simply as a sequence.
[Figure: Original Content]
[Figure: One-dimensional Context]
[Figure: Two-dimensional Context]
Horizontal and vertical context. By context, we mean the information surrounding the targeted instance. By horizontal context, we mean the information to the left and right of the targeted instance, e.g., the previous tokens and the next tokens. By vertical context, we mean the information above and below the targeted instance, e.g., the previous lines and the next lines.
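The two kinds of context can be illustrated on a page represented as a list of lines, each line being a list of tokens. This is a minimal sketch; the function names, the default window sizes, and the sample page are assumptions made for illustration and are not taken from the reviewed paper.

```python
def horizontal_context(lines, row, col, width=2):
    """Tokens to the left and right of the token at (row, col) on the same line."""
    tokens = lines[row]
    left = tokens[max(0, col - width):col]
    right = tokens[col + 1:col + 1 + width]
    return left, right


def vertical_context(lines, row, height=1):
    """Lines above and below the line at index `row`."""
    above = lines[max(0, row - height):row]
    below = lines[row + 1:row + 1 + height]
    return above, below


# Made-up page fragment, already split into lines and tokens.
page = [
    ["Company-Chinese-Name", ":", "Example", "Co"],
    ["Legal-Representative", ":", "Zhang", "San"],
    ["Office-Address", ":", "12", "Example", "Road"],
]

# Context of the token "Zhang" (line 1, token 2).
left, right = horizontal_context(page, 1, 2)
above, below = vertical_context(page, 1)
```

A purely sequential representation would only expose `left` and `right`; the vertical context adds the neighbouring lines.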
The Annotation Process. Annotation is done in two stages: block detection, then text annotation.
Block Detection. A block is a specific informative unit in a document. It can be defined at different granularities, e.g., text line, section, or paragraph. We assign one or more labels to each block, where each label corresponds to a concept in the ontology. A block can also have no label.
Block Detection (cont.). We define a block as a text line, because in our experiments statistics show that 99.6% of the targeted instances lie within a single text line. In block detection, we detect the label of each block using a classification model; an SVM is used for this purpose.
Company Annual Reports. Each report has fourteen sections, including “Introduction to Company”, “Company Financial Report”, etc. We describe only the annotation of the first section, “Introduction to Company”, which contains company information such as Company-Chinese-Name, Legal-Representative, and Office-Address.
Learning to Detect Blocks. We view block detection as classification. For each concept, we train an SVM model to detect whether a block contains one or more instances of that concept.
Block Detection Features. We define features at the token level and the line level. The main features in block detection are: positive word features, negative word features, special pattern features, the line position feature, and the number of words feature.
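The five feature groups can be sketched as a function that turns one text line into a feature dictionary, which a classifier such as an SVM could then consume after vectorization. The slides do not give the exact feature definitions, so everything below is an assumption: the function name, the substring test for word features, the use of regular expressions for special patterns, and the sample word lists.

```python
import re


def block_features(line, line_index, positive_words, negative_words, patterns):
    """Build a feature dictionary for one text line (one block)."""
    lowered = line.lower()
    features = {}
    # Positive word features: words that suggest the concept is present.
    for word in positive_words:
        features["pos:" + word] = int(word in lowered)
    # Negative word features: words that suggest the concept is absent.
    for word in negative_words:
        features["neg:" + word] = int(word in lowered)
    # Special pattern features, expressed here as regular expressions.
    for name, pattern in patterns.items():
        features["pat:" + name] = int(re.search(pattern, line) is not None)
    # Line position and number of words.
    features["line_position"] = line_index
    features["num_words"] = len(line.split())
    return features


# Hypothetical configuration for the concept Office-Address.
example = block_features(
    "Office Address: 12 Example Road",
    line_index=3,
    positive_words=["address"],
    negative_words=["report"],
    patterns={"has_digit": r"\d"},
)
```

One such function call per line and per concept yields the input for that concept's block classifier.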
Text Annotation. An identified block contains at least one instance. We then try to identify the start position and the end position of the targeted instance. Two SVMs are employed for this purpose.
Text Annotation Features. Token features: the tokens in the previous four positions, the current position, and the next two positions; more previous tokens are used because they seem more important in our annotation tasks. Special pattern features.
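The asymmetric token window (four tokens before, the current token, two tokens after) can be sketched as follows. The function name, the feature key format, and the padding symbol are assumptions for illustration; the slides specify only the window sizes.

```python
def window_features(tokens, i, before=4, after=2, pad="<PAD>"):
    """Token features for position i: previous four, current, and next two tokens."""
    features = {}
    for offset in range(-before, after + 1):
        j = i + offset
        # Positions outside the line are filled with a padding symbol.
        features[f"tok[{offset:+d}]"] = tokens[j] if 0 <= j < len(tokens) else pad
    return features


# Made-up line, already tokenized; position 3 is the candidate boundary token.
tokens = ["Legal", "Representative", ":", "Zhang", "San"]
features = window_features(tokens, 3)
```

The same feature dictionary can be fed to both boundary classifiers, one deciding whether the position starts an instance and the other whether it ends one.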
References.
John Davies, Rudi Studer, Paul Warren. Semantic Web Technologies: Trends and Research in Ontology-based Systems. John Wiley & Sons, Ltd, 2006.
http://annotation.semanticweb.org
http://www.ontotext.com/kim/
Mingcai Hong, Jie Tang, and Juanzi Li. Semantic Annotation using Horizontal and Vertical Contexts.
Joachims T. Making Large-Scale SVM Learning Practical. In: Schölkopf B., Burges C., and Smola A. (eds.), Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
Thanks. Any questions?