XML, RDF and Advanced Search (Semantic Web – Web3.0)

Thanks to Jim Hendler, Carl Lagoze, Jayavel Shanmugasundaram, Sara Cohen, Jonathan Mamou, Yaron Kanza, Mark Sapossnek, Yehoshua Sagiv, Frank van Harmelen

What we have covered
What is IR; Evaluation; Tokenization and properties of text; Web crawling; Query models; Vector methods; Measures of similarity; Indexing; Inverted files; Basics of internet and web; Spam and SEO; Search engine design; Google and Link Analysis
This week: metadata, XML, RDF; advanced search, Semantic Web

The importance of data and their rules
Tim Berners-Lee inventor of the world wide web Founder of the W3C Presentation at Ted

Metadata and Markup Languages
“Metadata is data about data”
Why is metadata important?
Makes data easier to search
It’s the foundation of the semantic web (Web 3.0)
Metadata is often written in XML

Metadata is semi-structured data conforming to commonly agreed upon models, providing operational interoperability in a heterogeneous environment

What is metadata? Some simple definitions
‘Structured data about data’. Dublin Core Metadata Initiative FAQ, 2005
‘Machine-understandable information about Web resources or other things’. Tim Berners-Lee, W3C, 1997
‘Structured data about data’ is both too narrow and too broad: too narrow because metadata can describe things other than data, and too broad because it allows something like an image of a document to be considered metadata.
In the second definition, "understandable" is better read as "processable".
Note the phrase in the second definition:

"Web resources or other things"
Metadata might be "about"… anything! HTML documents digital images databases books museum objects archival records metadata records Web sites collections services physical places people organizations “works” formats concepts events

What might metadata "say"?
What is this called? What is this about? Who made this? When was this made? Where do I get (a copy of) this? When does this expire? What format does this use? Who is this intended for? What does this cost? Can I copy this? Can I modify this? What are the component parts of this? What else refers to this? What did "users" think of this? (etc!)

What operations/functions?
resource disclosure & discovery resource retrieval, use resource management, including preservation verification of authenticity intellectual property rights management commerce content-rating authentication and authorization personalization and localization of services (etc!)

What operations/functions?
Different functions for different metadata Metadata (and metadata standards) sometimes classified according to function Descriptive: primarily for discovery, retrieval Administrative: primarily for management Structural: relationships between component parts of resources Contextual: relationships between resources No “one size fits all solution”! Descriptive metadata may vary with resource type

Metadata of a report? What metadata would you associate with a report or memo? Descriptive metadata may vary with resource type

Types of Metadata Descriptive Structural Administrative
Discovery / description of objects Title, author, abstract, etc. Structural Storage & presentation of objects 1 pdf file, 1 ppt file, 1 LaTeX file, etc. Administrative Managing and preservation of objects Access control lists, terms and conditions, format descriptions, “meta-metadata” LOC - Library of Congress

Which View is Correct? figure 1 from:

Approaches to Metadata
from Ng, Park and Burnett, 1997 (also JASIS, 50(13)) library science: bibliographic control “organizing the physical containers of information, by means of bibliographical description, subject analysis, and classification notation construction, so that the container can be efficiently described, identified, located and retrieved” computer and information science: data management “not only to store, access and utilize data effectively, but also to provide data security, data sharing, and data integrity” Domains/areas define their own

Metadata Formats and Implementation
Use markup languages Interoperable Extensible Robust Permits advance search features When online, the beginning of a semantic web!

What is a markup language?
Textual (i.e. person-readable) language where significant elements are indicated by markers: <TITLE>XML</TITLE>
Examples are RTF, HTML, XML, TeX, etc.
Easy to process and can be manipulated by a variety of application programs
XML stands for Extensible Markup Language. A number of markup languages are in general use; all Web users will be aware of HTML (HyperText Markup Language). A markup language is in essence a language which consists of plain text and tags. The tags give meaning to the text. For example, in HTML the tag <h1> indicates that the text enclosed between <h1> and </h1> is a top-level header. Text is easy to process and is also, to a certain extent, human-readable. Unlike proprietary binary formats, it may be processed by more than one application.

Standard Generalized Markup Language (SGML)
Based on GML (Generalized Markup Language), developed by IBM in the 1960s
An international standard (ISO 8879:1986) that defines how descriptive markup should be embedded in a document
Can define any document format of any complexity
Enables extensibility, structure and validation
Too many optional features for the Web
Gave birth to the Extensible Markup Language (XML), a W3C Recommendation in 1998

The Purpose of SGML SGML is designed to make your information last longer than the systems that created it. Such longevity also implies immunity to short-term changes -- such as a change from one application program to another -- so SGML is also inherently designed for re-purposing and portability.

What is SGML?
SGML (and its derivatives, HTML and XML) are ASCII character-based representations of electronic data
Remember, it's all bits: meaning is derived from how they are organized…
Think of SGML docs as strings that must be parsed: a web browser parses an HTML doc and uses the markup codes to display the data contained
Since it's all ASCII, these docs can also be handled by non-parsing tools (such as vi, emacs, perl, etc.)

SGMLXMLHTML SGML is the “mother tongue” – but is overkill for most common applications. XML is an abbreviated version of SGML easier to define own document types easier for programmers to write programs to handle documents (and data) omits all the options (and most of more complex and less-used parts) of SGML) HTML is just one of many SGML or XML “applications” – most frequently used on the Web

SGML Components SGML documents have three parts:
Declaration: specifies which characters and delimiters may appear in the application
DTD (document type definition) / style sheet: defines the syntax of markup constructs
Document instance: actual text (with the tags) of the documents
More info could be found:

World Wide Web (W3C) Consortium

What is XML? XML – eXtensible Markup Language
designed to improve the functionality of the Web by providing more flexible and adaptable information and identification “extensible” because not a fixed format like HTML a language for describing other languages (a meta-language) design your own customised markup language

The HTML World
<body>
  <h1> XML and Information Retrieval: A SIGIR 2000 Workshop </h1>
  <p> The workshop was held on 28 July. The editors of the workshop were David Carmel, Yoelle Maarek, and Aya Soffer </p>
  <h2> XQL and Proximal Nodes </h2>
  <p> The paper was authored by Ricardo Baeza-Yates and Gonzalo Navarro. The abstract of this paper is given below. </p>
  <p> We consider the recently proposed language … </p>
  <p> The paper references the following papers: <a href=" … </a> … </p>

The XML World
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <subsection name="Related Work"> The XQL language … </subsection>
      </section>
      …
      <cite xmlns:xlink=" … </cite>
    </paper>

XML
XML is written in SGML, the Standard Generalized Markup Language, an international standard (ISO 8879)
XML = very simple dialect of SGML
goal = enable generic SGML to be served, received and processed on the Web in ways not possible with HTML

Why use XML? XML is not just for Web pages Data management:
store any kind of structured document enclose/encapsulate information in order to pass it between different computing systems that are otherwise unable to communicate

Key feature of XML An application is free to use XML tagged data in many different ways, e.g. produce an image generate a formatted text listing display the XML document’s markup in pretty colors restructure the data into a format for storing in a database, transmission over a network, input to another program.

XML Software? many programs are “XML ready” already today.
xml.coverpages.org covers news of new additions to XML Find Penn State pages with XML

How do I run or execute an XML file?
You can’t and you don’t ! XML is not a programming language XML is a markup specification language XML files are just data (unicode) (waiting for a program to do something with them) XML files can be viewed with an XML editor or XML-compatible browser

Things to Remember
XML does not replace HTML – it provides an alternative which allows you to define your own set of markup elements to a published standard:
<?xml version="1.0" standalone="yes"?>
<conversation>
  <greeting>Hello, world!</greeting>
  <response>Stop the planet, I want to get off!</response>
</conversation>
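As a minimal sketch of "a program doing something" with XML data, the snippet below feeds the conversation document to Python's standard-library parser and lists each child element. The names DOC, read_conversation and pairs are illustrative and not part of the original slides.

```python
import xml.etree.ElementTree as ET

# The XML document from the slide, held as a plain string: XML is just data.
DOC = """<?xml version="1.0" standalone="yes"?>
<conversation>
  <greeting>Hello, world!</greeting>
  <response>Stop the planet, I want to get off!</response>
</conversation>"""


def read_conversation(xml_text):
    """Parse the document and return (tag, text) pairs for each child element."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]


pairs = read_conversation(DOC)
for tag, text in pairs:
    print(f"{tag}: {text}")
```

The same parsed tree could just as well be rendered as HTML, loaded into a database, or passed to another program, which is the point of the "Key feature of XML" slide.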

Things to Remember All parts of an XML document are case sEnSiTiVe
Element type names are case sensitive, so <BODY> … </body> is not allowed.
Attribute names are case sensitive: <PIC width="7cm"/> and <PIC WIDTH="6cm"/> describe different attributes, not just different values for the attribute “PIC width”.
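Both rules can be checked with a short sketch using Python's standard-library parser. The helper is_well_formed is a hypothetical name introduced here, and putting both attributes on a single PIC element is an illustration rather than something from the slide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Start and end tags must match exactly, including case.
mismatched = is_well_formed("<BODY>text</body>")
matched = is_well_formed("<BODY>text</BODY>")

# width and WIDTH are two different attributes, so both may appear together.
pic = ET.fromstring('<PIC width="7cm" WIDTH="6cm"/>')
attrs = dict(pic.attrib)

print(mismatched, matched, attrs)
```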

What is XQuery? XQuery is the language for querying XML data
The best way to explain XQuery is to say that XQuery is to XML what SQL is to database tables. XQuery uses XPath expressions to extract XML data. XPath is a language for finding information in an XML document. XPath is used to navigate through elements and attributes in an XML document. XQuery is defined by the W3C. XQuery is supported by all the major database engines (IBM, Oracle, Microsoft, etc.) XQuery 1.0 W3C Recommendation
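To give a feel for XPath-style navigation, here is a small sketch against a cut-down version of the workshop document shown earlier. It assumes Python's xml.etree.ElementTree, which supports only a limited subset of XPath and does not implement XQuery; the WORKSHOP string is a shortened, hypothetical copy of the slide's example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the workshop example, completed so that it is well-formed.
WORKSHOP = """<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <proceedings>
    <paper id="1">
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
      <author>Gonzalo Navarro</author>
    </paper>
  </proceedings>
</workshop>"""

root = ET.fromstring(WORKSHOP)

# Navigate through elements: authors of the paper whose id attribute is "1".
authors = [a.text for a in root.findall(".//paper[@id='1']/author")]

# Navigate by explicit path from the root: titles of all papers.
paper_titles = [t.text for t in root.findall("./proceedings/paper/title")]

# Attributes are reachable too.
workshop_date = root.get("date")

print(authors, paper_titles, workshop_date)
```

Note that the query has to name the tags and know how they nest, which is the dependence on document structure discussed under "Problems with XQuery".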

Motivation for XML Search
It is becoming increasingly popular to publish data on the Web in the form of XML documents. XML on the web?
Current search engines, which are an indispensable tool for finding HTML documents, have two main drawbacks when it comes to searching for XML documents:
It is not possible to pose queries that explicitly refer to XML tags.
Search engines return references (i.e. links) to documents and not specific fragments thereof. This is problematic, since large XML documents may contain thousands of elements storing many pieces of information that are not necessarily related to each other.

Problems with XQuery
A query language for XML, such as XQuery, can be used to extract data from XML documents. However, such a query language is not an alternative to an XML search engine, for several reasons:
The syntax of XQuery is more complicated than the syntax of a standard search query. Hence, it is not appropriate for a naive user.
Extensive knowledge of the document structure is required in order to correctly formulate a query. Thus, queries must be formulated on a per-document basis.
XQuery lacks any mechanism for ranking answers.
Solution: XML search engine

XML Search Tool Design Features?
A simple syntax that can be used by naive users Search results should include XML fragments and not necessarily full documents The XML fragments in an answer, should be semantically related For example, a paper and an author should be in an answer only if the paper was written by this author Search results should be ranked Search results should be returned in “reasonable” time
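A toy sketch of two of these features, keyword syntax and ranked fragments, might look like the following. It is not how any particular XML search engine works: the scoring (a raw count of query terms in an element's own text) is a deliberately naive assumption, it ignores the "semantically related" requirement, and DOC and search_fragments are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-paper collection used only for this illustration.
DOC = """<proceedings>
  <paper>
    <title>XQL and Proximal Nodes</title>
    <abstract>Searching structured text with XML</abstract>
  </paper>
  <paper>
    <title>XML search engines</title>
    <abstract>Ranking XML fragments for XML keyword search</abstract>
  </paper>
</proceedings>"""


def search_fragments(xml_text, query):
    """Return (score, tag, text) for matching elements, best match first."""
    terms = query.lower().split()
    hits = []
    for elem in ET.fromstring(xml_text).iter():
        text = (elem.text or "").strip()
        words = text.lower().split()
        # Naive score: how often the query terms occur in this element's own text.
        score = sum(words.count(term) for term in terms)
        if score > 0:
            hits.append((score, elem.tag, text))
    # Rank fragments, not whole documents, by descending score.
    return sorted(hits, key=lambda hit: -hit[0])


results = search_fragments(DOC, "xml search")
for score, tag, text in results:
    print(score, tag, text)
```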

XML Search Engines Summary of XML engines
Open source ones starting to emerge Or just use web search engine with filetype:xml Try Google Many for commercial use and some in design Active research area Web XML is a step in the direction of the semantic web!

XML for Search Engines - Sitemaps
The Sitemaps protocol allows a website to inform search engines about URLs on the site that are available for crawling.
A Sitemap is an XML file that lists the URLs for a site.
It includes additional information about each URL: when it was last updated, how often it changes, and how important it is relative to other URLs on the site.
This allows search engines to crawl the site more intelligently.
Sitemaps are a URL inclusion protocol and complement robots.txt, a URL exclusion protocol.
Sitemaps are particularly beneficial on websites where:
some areas of the website are not available through the browsable interface
the site uses rich Ajax, Silverlight, or Flash content that is not normally processed by search engines
the site is very large, or has a large number of pages that are isolated or not well linked together
the website has few external links
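A Sitemap file can be generated with a few lines of code. The sketch below uses the element names defined by the Sitemaps protocol (urlset, url, loc, lastmod, changefreq, priority); the example.com URLs, the dates, and the build_sitemap helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
FIELDS = ("loc", "lastmod", "changefreq", "priority")


def build_sitemap(entries):
    """Serialize a list of dicts (one per URL) into a Sitemap XML string."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for field in FIELDS:
            if field in entry:  # only loc is required; the rest are optional
                ET.SubElement(url, field).text = entry[field]
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical site: a frequently updated home page and a rarely changing page.
sitemap_xml = build_sitemap([
    {"loc": "https://www.example.com/", "lastmod": "2014-03-12",
     "changefreq": "daily", "priority": "1.0"},
    {"loc": "https://www.example.com/archive/page42.html",
     "changefreq": "yearly", "priority": "0.2"},
])
print(sitemap_xml)
```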

Open Source XML Search Engine

What is Web 2.0?
Web 2.0 describes World Wide Web sites that emphasize user-generated content, usability, and interoperability.
Term popularized by Tim O’Reilly and MediaLive International, following a 2004 conference brainstorming session about the future of the web
Also may be called the Live Web or Living Web
Refers to more interactive technologies that engage, facilitate and empower users – us!
Companies utilizing interactive technologies are the hot investments
Companies are starting to embrace these technologies for business value
Tim’s Def (Video); Schmidt’s (Video); The Machine (Video)

Critics of Web 2.0 "Web 2.0" does not represent a new version of the World Wide Web at all, but merely continues to use so-called "Web 1.0" technologies and concepts. Techniques such as Ajax do not replace underlying protocols like HTTP, but add an additional layer of abstraction on top of them. Many of the ideas of Web 2.0 were already featured in implementations on networked systems well before the term "Web 2.0" emerged. Amazon.com, for instance, has allowed users to write reviews and consumer guides since its launch in 1995, in a form of self-publishing. Amazon also opened its API to outside developers in 2002. Previous developments also came from research in computer-supported collaborative learning and computer supported cooperative work (CSCW) and from established products like Lotus Notes and Lotus Domino, all phenomena that preceded Web 2.0.

Web 1.0 vs 2.0 (Some Examples) - 2005
Web 1.0 --> Web 2.0
DoubleClick --> Google AdSense
Ofoto --> Flickr
Akamai --> BitTorrent
mp3.com --> Napster
Britannica Online --> Wikipedia
personal websites --> blogging
domain name speculation --> search engine optimization
page views --> cost per click
screen scraping --> web services
publishing --> participation
content management systems --> wikis
directories (taxonomy) --> tagging ("folksonomy")
stickiness --> syndication
Source: “What is web 2.0: Design Patterns and Business Models for the next Generation of Software”, 9/30/2005

Web 3.0 This will be the INTELLIGENT Web!
The Semantic Web!

How will we get the semantic web?
Now... that should clear up a few things around here NL annotations (possibly with rendering annotation) already associated with images (only way google can find them) Augment NL with semantic annotation.

Web 2.0 vs Web 3.0 The Web and Web 2.0 were designed with us in mind.
(Human Understanding) The Web 3.0 will anticipate our needs! Whether it is State Department information when traveling, foreign embassy contacts, airline schedules, hotel reservations, area taxis, or famous restaurants: the information. The new Web will be designed for computers. (Machine Understanding) The Web 3.0 will be designed to anticipate the meaning of the search.

How do we get to the semantic web, really
The next stage for the Web will be making data accessible to artificial intelligence agents. The Web 3.0 uses new languages beyond HTML or XML. That is the case of RDF or Resource Description Framework. The Web 3.0 will need data delivered in computer-readable form (RDF).

General idea of Semantic Web
Make current web more machine accessible and intelligent! (currently all the intelligence is in the user) Motivating use-cases Search engines concepts, not keywords semantic narrowing/widening of queries Shopbots semantic interchange, not screenscraping E-commerce Negotiation, catalogue mapping, personalisation Web Services Need semantic characterisations to find them Navigation by semantic proximity, not hardwired links .....

Example
Try these queries with Google:
Distance between Paris and Madrid
Distance between Paris and New York
(The) Largest city of France
(The) Largest city of Spain
Now, try these with Google:
Distance between largest city of France and largest city of Spain
Distance between “largest city of France” and “largest city of Spain”
And worse:
Distance between “the largest city of France and that of Spain”
What state is south of Texas

Examples What other queries does Google not understand?
START YOUR ENGINES

Web Search Semantics Another approach – let Google do it.
So, what’s wrong with Google? Nothing. The problem is with the World Wide Web: The Web contains unstructured information and Google is a keyword- and phrase-based search engine Initiative to make the contents on the Web structured information/represented knowledge the Semantic Web Another approach – let Google do it.

Google Knowledge Graph
First step towards providing users with answers, rather than a collection of search results Attempts to turn ambiguous words into actual concepts understandable by search engine Distinguishes between homophones (eg –two, to) The knowledge graph only appears when you are searching for data already in its continuously growing database

What is it?
“A huge knowledge graph of interconnected entities and their attributes”. Amit Singhal, Senior Vice President at Google
“A knowledge base used by Google to enhance its search engine’s results with semantic-search information gathered from a wide variety of sources”

Sources Based on information derived from many sources including Freebase, CIA World Factbook, Wikipedia Contains 570 million objects and more than 18 billion facts about and relationships between these different objects

Enhancements - ambiguity
GKG enhances Google Search in three main ways: Find the right thing deals with the ambiguity of the language

Enhancements - summaries
GKG enhances Google Search in three main ways: Summaries summarize relevant content around that topic, including key facts about the entity

Deeper information GKG enhances Google Search in three main ways:
Deeper and broader information: reveal new facts; anticipate what the next questions will be and provide the information beforehand (based on what other users asked before)

Search for a person, place, or thing
How it is used? Search for a person, place, or thing Facts about entities are displayed in a knowledge box on the right side

How it is used? Explore your search

Data sources CIA World Factbook Freebase Wikipedia and many others …

GKG and CIA World Factbook
CIA World Factbook is a reference resource produced by the Central Intelligence Agency of the United States with almanac-style information about the countries of the world. GKG integrates information about geography, government, economy, etc. from CIA World Factbook

GKG and Freebase
Freebase is a large collaborative knowledge base, developed by Metaweb and acquired by Google in 2010.
GKG uses UIDs directly from Freebase; detective work by Andreas Thalhammer shows how to get from GKG UIDs to Freebase UIDs using base64 and gzip
Check the “Knowledge Graph links to Freebase” thread on the W3C mailing list

GKG and Wikipedia
For most search results, the first sentences come from Wikipedia

Other sources GKG also considers the information Google retrieves from the volume of queries done by the users and the links those users have clicked on the results presented for those queries

GKG and other Google products
GKG is integrated with other Google products e.g. Google+

[Figure: the road to the Web of Data, from Hypertext (“As We May Think”, 1945) through Hypermedia, the Web, and Semantic Annotations to the Semantic Web and the Web of Data. Pictures from http://www.theatlantic.com/doc/194507/bush and [4]]

Web of Data
Characteristics:
Links between arbitrary “things” (e.g., persons, locations, events, buildings)
Structure of data on Web pages is made explicit
Things described on Web pages are named and get URIs
Links between things are made explicit and are typed (typed links)

GKG and the Web of Data
A closed implementation of Web of Data principles
is not about documents, but about objects such as people, places and things
objects are interlinked in the GKG
objects have structured information which is obtained from the web
The Google Knowledge Graph is the basis for transforming Google’s core search product from an information engine to a knowledge engine (entity search engine)

References

Semantic Search Engines
Do they exist? Some claim that they do Try these out (some no longer around): Lexxe iGlue Hakia Exalead Kosmix Swoogle WolframAlpha Bing

Expressed using the W3C stack

What it’s like to be a machine on the Web

Required are: Explicit meta-data Shared domain descriptions
Machine-processable content Machine-support for interoperability

machine accessible meaning (What it’s like to be a machine)
name CV education work private

XML ≠ machine accessible meaning
[Figure: the same CV page with its fields wrapped in tags: <CV>, <name>, <education>, <work>, <private>. To a machine, the tag names themselves carry no meaning.]

So why not just use XML?
No agreement on:
structure: is country an object? a class? an attribute? a relation? something else? what does nesting mean?
vocabulary: is country the same as nation?

<country name="Netherlands">
  <capital name="Amsterdam">
    <areacode>020</areacode>
  </capital>
</country>

<nation>
  <name>Netherlands</name>
  <capital>Amsterdam</capital>
  <capital_areacode> 020 </capital_areacode>
</nation>

Are the above XML documents the same? Do they convey the same information? Is that information machine-accessible?
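The practical consequence can be sketched in a few lines of Python: an extractor written for one of the two documents silently fails on the other, even though a human reads the same fact in both. The function names capital_from_a and capital_from_b are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC_A = """<country name="Netherlands">
  <capital name="Amsterdam">
    <areacode>020</areacode>
  </capital>
</country>"""

DOC_B = """<nation>
  <name>Netherlands</name>
  <capital>Amsterdam</capital>
  <capital_areacode>020</capital_areacode>
</nation>"""


def capital_from_a(root):
    """Assumes the capital's name is stored in a 'name' attribute (style A)."""
    capital = root.find("capital")
    return capital.get("name") if capital is not None else None


def capital_from_b(root):
    """Assumes the capital's name is the element's text content (style B)."""
    capital = root.find("capital")
    text = (capital.text or "").strip() if capital is not None else ""
    return text or None


root_a, root_b = ET.fromstring(DOC_A), ET.fromstring(DOC_B)
print(capital_from_a(root_a), capital_from_a(root_b))
print(capital_from_b(root_b), capital_from_b(root_a))
```

Each extractor encodes an agreement about structure and vocabulary that exists only in the programmer's head, not in the XML itself.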

“2nd aim of Semantic Web”: Data integration
Unstructured and semi-structured sources (document collections, message traffic, web pages, ...)
Structured data without an explicit data schema (non-local databases, data tables, charts and reports, ...)
Non-text collections (image, video, sound, ...)
Streams of data from sensors, programs, services
Must specify the structure of data resources ...

2nd aim of Semantic Web: Data integration
... so a processor can tell how the "attributes" and "values" are related What is required vs. optional? How many values for a particular attribute? What attributes are keys for other attributes? Which attributes are necessarily related to other attributes and in what way?? How do the attributes (and values) in one data source map to attributes and values describing another source?

Stack of languages
XML: Surface syntax, no semantics
XML Schema: Describes structure of XML documents
RDF: Datamodel for “relations” between “things”
RDF Schema (RDFS): RDF Vocabulary Definition Language
OWL: A more expressive Vocabulary Definition Language

Semantic web languages today
Today there are three semantic web languages
RDF – Resource Description Framework and variations
DAML+OIL – DARPA Agent Markup Language (deprecated)
OWL – Web Ontology Language
OWL Lite
OWL DL
OWL Full

RDF is the first Semantic Web language
RDF is a simple language for building graph-based representations
The RDF Data Model has three equivalent views:
Graph: good for human viewing
XML encoding: good for machine processing
<rdf:RDF ……..> <….> </rdf:RDF>
Triples: good for reasoning
stmt(docInst, rdf_type, Document)
stmt(personInst, rdf_type, Person)
stmt(inroomInst, rdf_type, InRoom)
stmt(personInst, holding, docInst)
stmt(inroomInst, person, personInst)

The RDF Data Model An RDF document is an unordered collection of statements, each with a subject, predicate and object (aka triples) A triple can be thought of as a labelled arc in a graph Statements describe properties of web resources A resource is any object that can be pointed to by a URI: a document, a picture, a paragraph on the Web, … E.g., a book in the library, a real person (?) isbn:// … Properties themselves are also resources (URIs)
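The data model is small enough to sketch without any RDF library: a document is an unordered set of (subject, predicate, object) tuples, and a query is a pattern with wildcards. The five statements are the stmt(...) triples from the earlier slide; the match helper is a hypothetical name, and real RDF would use full URIs rather than these short labels.

```python
# An RDF document as an unordered collection of (subject, predicate, object).
triples = {
    ("docInst", "rdf_type", "Document"),
    ("personInst", "rdf_type", "Person"),
    ("inroomInst", "rdf_type", "InRoom"),
    ("personInst", "holding", "docInst"),
    ("inroomInst", "person", "personInst"),
}


def match(store, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Each triple is a labelled arc, so following arcs walks the graph.
held = match(triples, p="holding")            # who holds what?
about_person = match(triples, s="personInst") # every arc leaving personInst

print(held)
print(about_person)
```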

RDF without a Schema Object ->Attribute-> Value triples
objects are web resources
Value is again an Object: triples can be linked
data model = graph
pers05 --Author-of--> ISBN...
ISBN... --Publ-by--> MIT

What does RDF Schema add?
Defines vocabulary for RDF
Organizes this vocabulary in a typed hierarchy
Class, subClassOf, type
Property, subPropertyOf
domain, range
Example graph:
Author subClassOf Person
Reader subClassOf Person
communicatesTo: domain Author, range Reader
Frank type Author
Lynda type Reader
Frank communicatesTo Lynda
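What the schema buys can be shown with a small inference sketch. It assumes the example graph reads as: Author and Reader are subclasses of Person, and communicatesTo has domain Author and range Reader. The explicit type arcs for Frank and Lynda are left out on purpose so that the rules have to derive them. Only three RDFS-style rules are implemented (subClassOf transitivity, type inheritance, domain/range typing), and rdfs_closure is a hypothetical name, not a library function.

```python
def rdfs_closure(triples):
    """Apply three RDFS-style rules until no new triples appear."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if p == "subClassOf" and p2 == "subClassOf" and s2 == o:
                    new.add((s, "subClassOf", o2))   # subClassOf is transitive
                if p == "subClassOf" and p2 == "type" and o2 == s:
                    new.add((s2, "type", o))         # instances inherit superclasses
                if p == "domain" and p2 == s:
                    new.add((s2, "type", o))         # the subject gets the domain class
                if p == "range" and p2 == s:
                    new.add((o2, "type", o))         # the object gets the range class
        if new <= inferred:
            return inferred
        inferred |= new


schema_and_data = {
    ("Author", "subClassOf", "Person"),
    ("Reader", "subClassOf", "Person"),
    ("communicatesTo", "domain", "Author"),
    ("communicatesTo", "range", "Reader"),
    ("Frank", "communicatesTo", "Lynda"),
}

closure = rdfs_closure(schema_and_data)
derived_types = sorted(t for t in closure if t[1] == "type")
print(derived_types)
```

From the single fact that Frank communicates to Lynda, the schema lets a machine conclude that Frank is an Author, Lynda is a Reader, and both are Persons.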

Which Semantic Web?
Version 1: "Semantic Web as Web of Data" (TBL)
recipe: expose databases on the web, use XML, RDF, integrate
metadata from: expressing DB schema semantics in machine interpretable ways
enable integration and unexpected re-use

Which Semantic Web? Version 2: “Enrichment of the current Web”
recipe: Annotate, classify, index metadata from: automatically producing markup: named-entity recognition, concept extraction, tagging, etc. enable personalization, search, browse,..

Which Semantic Web?
Version 1: “Semantic Web as Web of Data”
Version 2: “Enrichment of the current Web”
Different use-cases
Different techniques
Different users

Four popular fallacies about the Semantic Web
Semantic Web research Four popular fallacies about the Semantic Web

First: clear up some popular misunderstandings
False statement: “Semantic Web people try to enforce meaning from the top”
They only “enforce” a language. They don’t enforce what is said in that language. Compare: HTML was “enforced” from the top, but content is entirely free.

False statement: “The Semantic Web people will require everybody to subscribe to a single predefined "meaning" for the terms we use.”
Of course, meaning is fluid, contextual, etc. Lots of work on (semi-)automatically bridging between different vocabularies.

False statement: “The Semantic Web will require users to understand the complicated details of formalized knowledge representation.”
All of this is “under the hood”.

False statement: “The Semantic Web people will require us to manually markup all the existing web-pages.”
Lots of work on automatically producing semantic markup: named-entity recognition, concept extraction, etc.

The current state of Semantic Web
Semantic Web research The current state of Semantic Web

Advanced Search Metadata and semantic web will make advanced search much easier Growth of web metadata. Folksonomies! Tools that automatically generate metadata Crowdsourcing TREC

Search for Web 3.0 Natural language queries Search agent (avatar) understands and anticipates your needs Personal life search with avatar

The Evolving Web
[Figure: timeline from 1990 to 2010, moving from DOCUMENTS to DATA/PROGRAMS and toward a Web of Knowledge. Labels: HyperText Markup Language and HyperText Transfer Protocol (Foundation of the Current Web); eXtensible Markup Language (Self-Describing Documents); Resource Description Framework; Proof, Logic and Ontology Languages; Shared terms/terminology; Machine-Machine communication. Source: Berners-Lee, Hendler; Nature, 2001]

Semantic Web ? (Jim Hendler - internal talk, Microsoft Labs, July 2008)

DBpedia – starting the semantic web
CiteSeer Extracted structured content from information created as part of the Wikipedia project

Semantic Web Companies
List of companies Wikipedia

How much RDF is out there on the web?

25th Anniversary of the WWW – 12 March 2014
A Bill of Rights for the World Wide Web

Web 4.0 :-?)

The next 5000 days of the web Kevin Kelly Founder of WIRED magazine
Video

Web 4.0 Evolution

Web 4.0 Machines talk back!

Search for Web 4.0 We get real help when we search!
Terminator: the Sarah Connor Chronicles Cameron’s on our side!

Web 2.0 vs Web 3.0 Web 3.0 applications
Everything on the web will be different – same impact as natural language processing. Web 4.0 will be the intelligent web with agents doing a lot of the work.

What we covered
Metadata: XML
The web of data: XML, RDF, others
Web 2.0: the social web
Web 3.0: the semantic web, Google Knowledge Graph
Future of the web and web search
