1
Data on the (Semantic) Web
2
Agenda (75 min)
Data on the Web
  Extracting data
  Publishing data
    Linked Data
    Metadata in HTML
    SPARQL endpoints
Crawling and extraction
Indexing RDF data
  Database-style indexing
  IR-style indexing
3
IR view of the Web
Web accessible resources
  Documents (typically HTML)
  Multimedia
Search engines index NL text
  Most of the structure in HTML is discarded
  Multimedia is indexed by surrounding text
  Additional information on web graph, usage
See Manning, Raghavan, Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
4
Data on the Web
Most web pages on the Web are generated from structured data
  Data is stored in relational databases (typically)
  Queried through web forms
  Presented as tables or simply as unstructured text
The structure and semantics (meaning) of the data are not directly accessible to search engines
Two solutions
  Extraction using Information Extraction (IE) techniques (implicit metadata)
  Relying on publishers to expose structured data using standard Semantic Web formats (explicit metadata)
5
Information Extraction methods
Named Entity Recognition (NER) and disambiguation
  OpenCalais, Zemanta
Extraction of triples
  TextRunner, NELL
  Suchanek et al. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. WWW 2007.
  Wu and Weld. Autonomously Semantifying Wikipedia. CIKM 2007.
Filling web forms automatically (form-filling)
  Madhavan et al. Google's Deep-Web Crawl. VLDB 2008.
Extraction from HTML tables
  Cafarella et al. WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
Wrapper induction
  Kushmerick et al. Wrapper Induction for Information Extraction. IJCAI 1997.
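To make the "extraction from HTML tables" item concrete, here is a minimal sketch of the simplest case: a well-formed table whose first row is a header. It is not the WebTables method, which also has to decide which tables hold relational data at all. The class name, function name and sample table are made up for illustration.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every table row as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html):
    """Treat the first row as the header and return one dict per data row."""
    parser = TableExtractor()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample page fragment.
SAMPLE = """
<table>
  <tr><th>City</th><th>Country</th></tr>
  <tr><td>Barcelona</td><td>Spain</td></tr>
  <tr><td>Lyon</td><td>France</td></tr>
</table>
"""

records = table_to_records(SAMPLE)
for record in records:
    print(record)
```

Each record is a set of attribute-value pairs, which is the kind of structure the rest of the talk expects publishers to expose explicitly.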
6
Information Extraction
A tale of many trade-offs
  Less or no training data, lower quality
  More complex the model to learn, more training data needed
  Deeper the analysis, slower the processing
  The more narrowly trained, the more likely to break
Populating a Knowledge Base is easier than ad-hoc extraction
However, a complete and correct semantic representation of the content may not be needed for all tasks
7
Publishing data on the Web
Pre-Semantic Web technologies have been inadequate
  Existing formats are not appropriate for serendipitous reuse
    HTML: structure is lost due to a mix of presentation and content
    XML: captures structure, but not semantics
  Lack of protocols to talk to databases over the Web
Motivation has been lacking
  Publishers are interested to the extent that they benefit from sharing data, e.g. because it drives traffic back to their site
8
What the Semantic Web provides
Data format: RDF
  Designed for object-relationship data
  Identification of objects by URIs
  Multiple serializations: RDF/XML, Turtle, N3, N-Triples, TriX etc.
Schema language: OWL
  Description Logic based
  Extensible using rule languages such as RIF
Query language and protocol: SPARQL
The principles of Linked Data
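As a rough illustration of the RDF data model, the sketch below represents statements as (subject, predicate, object) triples and writes them in an N-Triples-like form. The `URI` and `Literal` types and the example.org identifiers are assumptions made for this example. Only the FOAF property URIs are real vocabulary terms. Blank nodes, datatypes, language tags and full character escaping are left out.

```python
from typing import NamedTuple


class URI(NamedTuple):
    """A resource identified by a URI."""
    value: str


class Literal(NamedTuple):
    """A plain string literal (no datatype or language tag in this sketch)."""
    value: str


def term_to_nt(term):
    """Serialize one term: URIs in angle brackets, literals in double quotes."""
    if isinstance(term, URI):
        return f"<{term.value}>"
    escaped = (
        term.value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def to_ntriples(triples):
    """Write one statement per line, each terminated by ' .'."""
    return "\n".join(
        f"{term_to_nt(s)} {term_to_nt(p)} {term_to_nt(o)} ." for s, p, o in triples
    )


# Hypothetical data: two statements about the same subject.
ALICE = URI("http://example.org/alice")
BOB = URI("http://example.org/bob")
FOAF_NAME = URI("http://xmlns.com/foaf/0.1/name")
FOAF_KNOWS = URI("http://xmlns.com/foaf/0.1/knows")

triples = [
    (ALICE, FOAF_NAME, Literal("Alice")),
    (ALICE, FOAF_KNOWS, BOB),
]

ntriples = to_ntriples(triples)
print(ntriples)
```

The second statement is the "object-relationship" case: its object is another URI, so a consumer can follow it to more data.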
9
Methods for publishing RDF data
Multiple ways of publishing RDF data
  SPARQL endpoints
  Linked Data
  Metadata in HTML documents
  Data feeds
  GRDDL
  Automated tools
Each requires different treatment in crawling and extraction
10
SPARQL endpoints
SPARQL is a standard query language and protocol for accessing RDF stores via HTTP
  Also possible to expose a traditional RDBMS via a wrapper
Advantages:
  Most flexible and best performing access from a consumer perspective
Disadvantages:
  Higher maintenance
  Discovery is problematic
Tools:
  Triple stores (Oracle, Virtuoso, Sesame, Jena, OWLIM etc.)
  RDB-to-RDF mappers such as D2RQ and Triplify
  SPARQL query builders
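A minimal sketch of what "query language and protocol via HTTP" means in practice: the query text travels as the `query` parameter of an ordinary HTTP GET, and the `Accept` header asks for a result format. The endpoint URL is hypothetical, and the request is only built here, not sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Hypothetical endpoint; substitute the address of a real SPARQL service.
ENDPOINT = "http://example.org/sparql"

QUERY = """
SELECT ?name WHERE {
  ?person <http://xmlns.com/foaf/0.1/name> ?name .
} LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)

# Round trip: the endpoint sees exactly the query text we wrote.
decoded = parse_qs(urlsplit(request.full_url).query)["query"][0]
```

Passing `request` to `urllib.request.urlopen` would send it. This also shows why discovery is a problem: nothing in a web page tells a crawler that such an endpoint exists or what it contains.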
11
Linked Data
A web of interlinked RDF documents
  Each document describes the characteristics of a single object, and links to related objects
  Most important: links to the same object in different data sets (sameAs)
Guidelines for proper configuration of web servers to serve such documents
Rapidly growing community
  Focus on public datasets (government, scientific)
  See linkeddata.org
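One way a consumer might use sameAs links is to group URIs from different datasets that denote the same object. The sketch below does this with a small union-find structure. The URIs are invented, and treating sameAs as strictly symmetric and transitive is a simplifying assumption that real data does not always justify.

```python
def same_as_clusters(links):
    """Group URIs connected by sameAs links into sorted clusters."""
    parent = {}

    def find(uri):
        # Return the representative of the cluster containing uri.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical sameAs statements harvested from three datasets.
links = [
    ("http://a.example/berlin", "http://b.example/city/42"),
    ("http://b.example/city/42", "http://c.example/Berlin"),
    ("http://a.example/paris", "http://b.example/city/7"),
]

clusters = same_as_clusters(links)
for cluster in clusters:
    print(cluster)
```

Every URI in a cluster can then be treated as one object when merging descriptions from several documents.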
12
The even larger picture: entire datasets connected
13
Linked Data
Advantages:
  No change to the publishing of the HTML documents
  Data can be published by a third party (e.g. DBpedia)
Disadvantages:
  Web servers need to be configured to properly handle URIs that identify concepts instead of documents
  Search engines need to be extended to crawl linked data
  Data is not always linked to documents
Tools:
  Linked Data browsers (Tabulator, Marbles etc.)
  RDB-to-RDF mappers (D2RQ, Triplify)
14
Metadata in HTML
Microformats, RDFa, Microdata
Advantages:
  Data and document are always in sync
  Browser plug-in friendly
  Search engine friendly
  Copy-paste friendly
Tools:
  XML editors (e.g. Oxygen)
  RDFa Distiller
  RDFa bookmarklet, Ubiquity RDFa plugin
  Optimus microformat parser
Examples: many, including SlideShare, YouTube, LinkedIn, Digg, Myspace, Facebook…
15
Microformats (μf)
Agreements on the way to encode certain kinds of data in HTML
  Reuse of semantic-bearing HTML elements
  Based on existing standards
  Minimality: designed to solve particular problems
Microformats exist for a limited set of objects
  hCard, hResume, hProduct, hRecipe
Varying degrees of support and stability
  hCard and rel-tag are widely supported
Community centered around microformats.org
  Specifications and discussions are hosted there
Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards. Instead of throwing away what works today, microformats intend to solve simpler problems first by adapting to current behaviors and usage patterns.
16
Example: the hCard microformat
<div class="vcard">
  <a class="fn" href="…">… Friday</a>
  <div class="tel">…</div>
  <div class="title">Area Administrator, Assistant</div>
</div>

<cite class="vcard">
  <a class="fn url" rel="friend colleague met" href="…">… Meyer</a>
</cite>
wrote a post (<cite><a href="…">… Tax Relief</a></cite>) about an
unintentionally humorous letter he received from the
<span class="vcard">
  <a class="fn org url" href="…">Internal Revenue Service</a>
</span>.
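To show how a consumer might read such markup, here is a minimal sketch that collects the text of a few hCard properties from elements inside a `vcard` root. It is not a conforming hCard parser. It assumes well-formed HTML, handles only text-valued properties (so `url`, whose value is the `href`, is skipped), and ignores nested cards. The sample card uses made-up contact details.

```python
from html.parser import HTMLParser

# Text-valued hCard properties handled by this sketch.
HCARD_PROPERTIES = {"fn", "tel", "title", "org"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HCardParser(HTMLParser):
    """Collect text-valued hCard properties from each class="vcard" element."""

    def __init__(self):
        super().__init__()
        self.cards = []
        self._card = None   # card being read, or None outside a vcard
        self._stack = []    # one (is_card_root, property_names) per open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        is_root = "vcard" in classes and self._card is None
        if is_root:
            self._card = {}
        props = []
        if self._card is not None:
            props = [name for name in classes if name in HCARD_PROPERTIES]
            for name in props:
                self._card.setdefault(name, "")
        self._stack.append((is_root, props))

    def handle_data(self, data):
        if self._card is None:
            return
        # Text belongs to every open element that carries a property class.
        for _, props in self._stack:
            for name in props:
                self._card[name] += data

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        is_root, _ = self._stack.pop()
        if is_root:
            card = {k: " ".join(v.split()) for k, v in self._card.items()}
            self.cards.append(card)
            self._card = None


# Hypothetical card; the name and number are invented.
SAMPLE = """
<div class="vcard">
  <div class="fn">Jane Example</div>
  <div class="tel">+1-555-0100</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
"""

parser = HCardParser()
parser.feed(SAMPLE)
cards = parser.cards
print(cards)
```

Note that the property names are hard-coded in the parser, since a microformat fixes its vocabulary in its specification.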
17
Microformats: limitations
No shared syntax
  Each microformat has a separate syntax tailored to the vocabulary
No formal schemas
  Limited reuse, extensibility of schemas
  Unclear which combinations are allowed
No datatypes
No namespaces, unique identifiers (URIs)
  No interlinking
  Mapping between instances is required
18
RDFa
W3C standard for embedding RDF data in HTML documents
  A set of new HTML attributes
    Despite the extension of HTML, RDFa does not require XHTML
  A specification of how to extract the data from these attributes
RDFa can be used to embed data in HTML headers or to annotate parts of the body of HTML documents
RDFa is just a syntax; you have to choose a vocabulary separately
19
Differences in usage
Microformats are the first choice for most publishers because they are simple
If you find none that perfectly fits your needs, then you need RDFa
  Microformats have a fixed schema: you cannot add your own attributes
Example: a social networking site with user profiles
  VCard is a good candidate, but it has no way to express, for example, the user’s social connections
  You either live without this, or go with RDFa
20
Example: Facebook’s Open Graph Protocol
RDF vocabulary to be used in conjunction with RDFa
  Simplify the work of developers by restricting the freedom in RDFa
  Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment
  Only HTML <head> accepted
Facebook as consumer
  Facebook indexes OGP data whenever someone ‘likes’ a page with OGP data
  Social recommendation (‘like’ button) provides publishers with a way to promote their content on Facebook
  Shows up in profiles and news feed; the user is subscribing to a channel of future feeds from the web page they liked
  Facebook Graph API allows 3rd party developers to access the data
21
Example: Facebook’s Open Graph Protocol
<html xmlns:og="…">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="…" />
  <meta property="og:image" content="…" />
  …
</head>
...
</html>
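Because OGP restricts RDFa to `<meta>` elements in the `<head>`, extraction can be very simple. The sketch below reads `og:` properties from the head only and ignores any in the body. It is an illustration, not Facebook's parser, and the URLs in the sample page are placeholders.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect og:* properties from <meta> elements inside <head>."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
        elif tag == "meta" and self._in_head:
            attributes = dict(attrs)
            name = attributes.get("property")
            content = attributes.get("content")
            if name and name.startswith("og:") and content is not None:
                # Keep the first value if a property is repeated.
                self.properties.setdefault(name[len("og:"):], content)

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False


# Hypothetical page; the og:url and og:image values are placeholders.
SAMPLE = """
<html>
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://example.org/the-rock" />
  <meta property="og:image" content="http://example.org/rock.jpg" />
</head>
<body>
  <meta property="og:title" content="ignored: not in head" />
</body>
</html>
"""

parser = OpenGraphParser()
parser.feed(SAMPLE)
og = parser.properties
print(og)
```

A general RDFa processor needs much more than this (prefix resolution, subjects, nesting), which is the freedom OGP gives up in exchange for simplicity.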
22
Microdata
HTML5 is currently under standardization at the W3C
Introduces Microdata
  Similar to microformats
    Some predefined vocabularies with central registration
  Some of the flexibility of RDFa
    Introduce new terms using reverse domain names or full URIs
Semantic HTML elements such as <time>, <video>, <article>…
23
Microdata example
<div itemscope itemid="…">
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>.
  I was born on <time itemprop="birthday" datetime="…">May 10th 2009</time>.
  <img itemprop="image" src="me.png" alt="me">
  </p>
</div>
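A minimal sketch of reading a flat Microdata item like the one above. A property's value is taken from an attribute for some elements (`datetime` on `<time>`, `src` on `<img>`, `href` on `<a>`) and from the text content otherwise. Nested items, `itemref` and `itemtype` are not handled, well-formed HTML is assumed, and the `datetime` value in the sample is an assumed one, since the original value is missing from the slide text.

```python
from html.parser import HTMLParser

# Elements whose property value comes from an attribute rather than from text.
ATTRIBUTE_VALUES = {
    "a": "href", "img": "src", "link": "href", "meta": "content", "time": "datetime",
}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class MicrodataParser(HTMLParser):
    """Collect itemprop values of top-level itemscope elements."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._item = None   # item being read, or None outside an itemscope
        self._stack = []    # one (is_item_root, text_property) per open element

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_root = "itemscope" in attributes and self._item is None
        if is_root:
            self._item = {}
        text_property = None
        name = attributes.get("itemprop") if self._item is not None else None
        if name:
            source = ATTRIBUTE_VALUES.get(tag)
            if source and attributes.get(source) is not None:
                self._item[name] = attributes[source]
            else:
                self._item.setdefault(name, "")
                text_property = name
        if tag not in VOID_TAGS:
            self._stack.append((is_root, text_property))

    def handle_data(self, data):
        if self._item is None:
            return
        for _, text_property in self._stack:
            if text_property:
                self._item[text_property] += data

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        is_root, _ = self._stack.pop()
        if is_root:
            item = {k: " ".join(v.split()) for k, v in self._item.items()}
            self.items.append(item)
            self._item = None


# The datetime value is an assumption made for this example.
SAMPLE = """
<div itemscope>
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>.
  I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
  <img itemprop="image" src="me.png" alt="me">
  </p>
</div>
"""

parser = MicrodataParser()
parser.feed(SAMPLE)
items = parser.items
print(items)
```

Unlike the hCard case, the parser needs no built-in list of property names: whatever appears in `itemprop` becomes a key, which is where Microdata borrows some of RDFa's flexibility.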
24
The state of metadata in HTML
5-10% of webpages contain some explicit metadata
  Depending on how you count…
Too many competing approaches
  Too many formats: microformats vs RDFa vs Microdata
  Too many schemas: publishers may need to use multiple different vocabularies or microformats to satisfy everyone