Presented By- Shahina Ferdous, Student ID – 1000630375, Spring 2010.



• SemTag is an application built on the Seeker platform that adds semantic tags to the existing HTML body of the web.
• Example: “The Chicago Bulls announced yesterday that Michael Jordan will...” becomes the same sentence with “Chicago Bulls” and “Michael Jordan” each wrapped in a tag that references a TAP entity under http://tap.stanford.edu/ (references ending in Team_Bulls and AthleteJordan_Michael, respectively).
• Creating semantic tags automatically at this scale will accelerate the creation of the Semantic Web.
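As a rough illustration of what such tagging produces, the sketch below wraps known labels in an annotation element. The `resource` element name, the `REFS` map, and the `example.org` identifiers are assumptions made for this example only; they are not taken from TAP or from SemTag's actual output format.

```python
import re

# Hypothetical label -> reference map; these identifiers are placeholders,
# not real TAP entries.
REFS = {
    "Chicago Bulls": "http://example.org/kb/Team_Bulls",
    "Michael Jordan": "http://example.org/kb/Athlete_Jordan_Michael",
}

def annotate(text, refs):
    """Wrap every known label in a <resource ref="..."> element (assumed tag name)."""
    # Longest labels first, so a longer label wins over a shorter one it contains.
    labels = sorted(refs, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(label) for label in labels))

    def wrap(match):
        label = match.group(0)
        return '<resource ref="' + refs[label] + '">' + label + "</resource>"

    return pattern.sub(wrap, text)

tagged = annotate("The Chicago Bulls announced yesterday that Michael Jordan will...", REFS)
```

This only performs the mechanical wrapping step; it does no disambiguation, which is the hard part the rest of the talk addresses.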

• The Semantic Web is a vision of transforming all documents on the web into a machine-understandable format, so that applications or programs can act on them without human intervention.
• All entities in documents will be canonically annotated, so programs can easily determine what a document is about.

• To accomplish the Semantic Web vision, we need:
  ◦ Ontological support in the form of web-available services that maintain metadata about entities and provide it whenever needed.
  ◦ Large-scale availability of annotations within documents, encoding canonical references to the entities.
• We need to break a circular dependency:
  ◦ Applications that make extensive use of semantically tagged data are needed.
  ◦ Those applications are only useful once there is enough tagged data on the web.

• Tagging is a way to classify entities in written or spoken text.
• Any tagging process generally consists of two steps:
  ◦ Step 1: Identify the entities that should be classified.
  ◦ Step 2: Classify these instances according to their categories.
• In semantic tagging, the categories used to classify the entities are derived from their intent or meaning (what is being said rather than how it is said).

Sense tagging:
  “He runs the company”: run1 = control
  “He runs the marathon”: run2 = run by foot
Feature tagging:
  “The speaker coughed”: Human
  “The speaker was disconnected”: Non-Human

• Ambiguities must be resolved in a natural-language corpus such as the web.
• Maintaining and updating a large-scale corpus requires a scalable infrastructure, which most tagging applications cannot support.
• A platform is required that multiple tagging applications can share.

• Designed the Seeker platform, which provides highly scalable core functionality to support SemTag and other tagging algorithms.
• SemTag uses a new disambiguation algorithm called TBD (Taxonomy-Based Disambiguation).
• Applied SemTag to a collection of approximately 264 million web pages and generated 434 million automatically disambiguated semantic tags.
• Published metadata about the annotations to the web as a label bureau.

• SemTag runs in three phases:
  ◦ Spotting pass: generate a window of context surrounding each label (10 words, the label, 10 words).
  ◦ Learning pass: use a representative sample to determine the distribution of terms in the taxonomy.
  ◦ Tagging pass: disambiguate references using the TBD algorithm.
• There are two kinds of ambiguity:
  ◦ The same label appears at multiple locations in the TAP ontology.
  ◦ Some labels occur in contexts that are missing from the taxonomy.
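The spotting pass can be sketched as a sliding search over a token list. The function name `spot_windows`, whitespace tokenization, and exact case-sensitive matching are assumptions of this sketch; the slides only specify the window shape (10 words on each side of the label).

```python
def spot_windows(tokens, label_tokens, width=10):
    """Yield (left, right) context windows around each occurrence of the label."""
    n = len(label_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == label_tokens:
            left = tokens[max(0, i - width):i]      # up to `width` words before
            right = tokens[i + n:i + n + width]     # up to `width` words after
            yield left, right

# Small demonstration with width=2 so the window is easy to read.
tokens = "the band's singer Mark Knopfler played guitar on stage".split()
windows = list(spot_windows(tokens, ["Mark", "Knopfler"], width=2))
```

Windows are truncated at the start or end of the text, so a label near a document boundary simply gets fewer context words.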

• TBD makes use of two classes of training information:
  ◦ Automatic metadata: helps determine whether the context around a label appears within a subtree of the taxonomy.
  ◦ Manual metadata: indicates, for nodes of the taxonomy, whether they contain highly ambiguous or unambiguous labels.

• An ontology in TBD is defined by four elements:
  ◦ A set of classes, C
  ◦ A subclass relation, s(c1, c2)
  ◦ A set of instances, I
  ◦ A type relation, t(i, c)
• A taxonomy T is defined by three elements:
  ◦ A set of nodes, V
  ◦ A root node, r
  ◦ A parent function, p
• An ontology describes relationships in an N-dimensional manner, whereas a taxonomy describes hierarchical relationships.

• Each node in the taxonomy has a set of labels. For example, the nodes Musician, Singer, and Band Members can all contain the label Mark Knopfler.
• An ancestry chain is the path from a node to the root of the taxonomy, obtained by following the parent relationship.
• A spot, spot(l, c), e.g. spot(Mark Knopfler, Singer), is a label in a context.
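The taxonomy definitions above (nodes, root, parent function, labels, ancestry chain) map naturally onto a small data structure. The class below is a minimal sketch; the class and method names are invented here, and the particular parent links (Singer under Musician under Music) are a hypothetical hierarchy chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Taxonomy:
    root: str
    parent: dict = field(default_factory=dict)   # parent function p: child -> parent
    labels: dict = field(default_factory=dict)   # node -> set of labels

    def add(self, node, parent, labels=()):
        self.parent[node] = parent
        self.labels[node] = set(labels)

    def ancestry_chain(self, node):
        """Path from `node` up to the root, following the parent function."""
        chain = [node]
        while chain[-1] != self.root:
            chain.append(self.parent[chain[-1]])
        return chain

    def nodes_with_label(self, label):
        """All nodes that carry `label`; more than one means the label is ambiguous."""
        return sorted(n for n, ls in self.labels.items() if label in ls)

tax = Taxonomy(root="Root")
tax.add("Music", "Root")
tax.add("Musician", "Music", {"Mark Knopfler"})
tax.add("Singer", "Musician", {"Mark Knopfler"})
```

Here `tax.ancestry_chain("Singer")` walks Singer, Musician, Music, Root, and `tax.nodes_with_label("Mark Knopfler")` returns both candidate nodes, which is exactly the situation the disambiguation step has to resolve.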

• Each internal node in TAP is associated with a similarity function that determines whether a particular context is similar to that node.
• A good similarity function has the property that the higher the similarity, the more likely it is that a spot contains a reference to an entity belonging to the subtree rooted at that node.
[Figure: example of a subtree in the taxonomy. Under Music, the nodes Musician and Singer each carry the label Mark Knopfler. For spot(Mark Knopfler, Singer), with context c and node u, the matching node should have the higher similarity value.]

TBD determines whether a particular context is appropriate for a particular node in the taxonomy.

TBD uses the manually generated metadata as its training set to calculate $m^a_u$ and $m^s_u$, where $m^a_u$ is the probability, as measured by human judgment, that spots for the subtree rooted at u are on topic, and $m^s_u$ is the probability that Sim correctly judges whether spots for the subtree rooted at u are on topic.
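The slides do not spell out how the two probabilities are combined. One plausible rule, shown here purely as an assumption and not as the published TBD algorithm, is to trust whichever signal is further from a coin flip: if the human-judged prior for the subtree is more decisive than the similarity function is reliable, use the prior; otherwise use the similarity function's verdict. The function and parameter names are hypothetical.

```python
def tbd_decision(m_a, m_s, sim_says_on_topic):
    """Assumed combination rule: follow the signal that is further from 0.5.

    m_a: probability (human-judged) that spots under node u are on topic.
    m_s: probability that Sim judges spots under node u correctly.
    sim_says_on_topic: the similarity function's verdict for this spot.
    """
    if abs(m_a - 0.5) > abs(m_s - 0.5):
        # The prior is more decisive than Sim is reliable: follow the prior.
        return m_a > 0.5
    # Otherwise defer to the similarity function.
    return sim_says_on_topic
```

Under this rule a node with nearly unambiguous labels (say `m_a = 0.95`) is accepted even when Sim disagrees, while a highly ambiguous node (`m_a` near 0.5) is decided by Sim alone.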

• Lexicon generation:
  ◦ Built a collection of 1.4 million unique words occurring in a random subset of windows containing approximately 90 million words in total.
  ◦ Kept the 200,100 most frequent words.
  ◦ Removed the 100 most frequent of those.
  ◦ Further computations are performed in the 200,000-dimensional vector space defined by the remaining words.
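The lexicon steps above amount to a frequency count followed by two cuts. The sketch below follows those steps with the slide's numbers as defaults; the function name `build_lexicon` and the representation of windows as lists of words are assumptions of this example.

```python
from collections import Counter

def build_lexicon(windows, keep=200_100, drop_top=100):
    """Map each retained word to a vector dimension.

    Keeps the `keep` most frequent words, then drops the `drop_top` most
    frequent of those, leaving keep - drop_top dimensions.
    """
    counts = Counter(word for window in windows for word in window)
    ranked = [word for word, _ in counts.most_common(keep)]
    return {word: index for index, word in enumerate(ranked[drop_top:])}

# Tiny demonstration: keep the top 3 words, then drop the single most frequent.
windows = [
    ["the", "the", "the", "guitar", "guitar", "band"],
    ["the", "guitar", "band", "solo"],
]
lexicon = build_lexicon(windows, keep=3, drop_top=1)
```

Dropping the very top of the frequency list acts like a stop-word filter: words such as "the" appear in almost every window and carry little information about which node a context belongs to.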

• Each node is associated with a 200,000-dimensional vector.
• Evaluated four standard candidates for similarity functions:
  ◦ Scheme ‘Prob’
  ◦ Scheme ‘TF-IDF’
  ◦ Algorithm ‘IR’
  ◦ Algorithm ‘Bayes’
• According to their results, the IR algorithm with the TF-IDF scheme gives the best accuracy (82%), which is reported as a significant improvement.
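To make the best-performing combination concrete, the sketch below weights words by TF-IDF and compares a context with node vectors using cosine similarity. It assumes that "IR" refers to the usual cosine measure from information retrieval; the document frequencies, node names, and helper names are invented for this example.

```python
import math
from collections import Counter

def tfidf_vector(words, doc_freq, n_docs):
    """Sparse TF-IDF vector: term count times log(n_docs / document frequency)."""
    tf = Counter(words)
    return {w: c * math.log(n_docs / doc_freq[w]) for w, c in tf.items() if w in doc_freq}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical document frequencies over 4 training windows.
doc_freq = {"guitar": 2, "album": 2, "goal": 1, "match": 1}
singer = tfidf_vector(["guitar", "album", "album"], doc_freq, 4)
athlete = tfidf_vector(["goal", "match"], doc_freq, 4)
context = tfidf_vector(["guitar", "album"], doc_freq, 4)
```

With these toy numbers the context scores higher against the `singer` vector than against the `athlete` vector, so a spot with this context would be assigned to the Singer side of the taxonomy.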

• Seeker is a platform developed to support SemTag and other sophisticated text-analytics applications.
• It is designed to achieve the following goals:
  ◦ Composability
  ◦ Modularity
  ◦ Extensibility
  ◦ Scalability
  ◦ Robustness

• Seeker is a service-oriented architecture (SOA): a local-area, loosely coupled, pull-based distributed computation system.
• To address scalability and robustness, Seeker incorporates a component called Infrastructure, which contains a small set of critical services.
• Analysis agents process web pages to generate annotations.
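"Pull-based" means that an agent asks for work when it is ready, rather than having work pushed to it. The single-process sketch below illustrates only that idea with an in-memory queue; it is not Seeker's actual interface, and `run_agent` and the toy analysis function are hypothetical.

```python
import queue

def run_agent(pages, analyze, results):
    """Pull pages until none are left; the agent requests work rather than receiving it."""
    while True:
        try:
            page = pages.get_nowait()
        except queue.Empty:
            return
        results.append(analyze(page))

def count_words(page):
    # Stand-in for a real analysis agent that would generate annotations.
    return {"page": page, "n_words": len(page.split())}

pages = queue.Queue()
for text in ["page one text", "page two"]:
    pages.put(text)

results = []
run_agent(pages, count_words, results)
```

Because each agent only depends on the queue, agents can be added or can fail independently, which is one way loose coupling supports the scalability and robustness goals listed above.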

• Automatic semantic tagging is essential to bootstrap the Semantic Web.
• It’s possible to achieve good accuracy even with simple disambiguation approaches.

Questions?