Finding knowledge, data and answers on the Semantic Web

Tim Finin
University of Maryland, Baltimore County
Joint work with Li Ding, Anupam Joshi, Cynthia Parr, Joel Sachs, Andriy Parafiynyk and Lushan Han
This work was partially supported by DARPA contract F , NSF grants CCR and IIS

This talk
Motivation
Semantic Web background
Swoogle Semantic Web search engine
Use cases and applications
Social Semantic Web
Conclusions

Google has made us smarter
Software agents will need something similar to maximize the use of information on the semantic web.

But what about our agents?
Software agents will need something similar to maximize the use of information on the semantic web. Agents still have a very minimal understanding of text and images.

But what about our agents?
Software agents will need something similar to maximize the use of information on the semantic web. A Google for knowledge on the Semantic Web is needed by software agents and programs.

Brief history of the Semantic Web
Tim Berners-Lee’s original 1989 WWW proposal described a web of relationships among named objects, unifying many information management tasks.
Guha’s MCF (~94)
XML+MCF=>RDF (~96)
Semantic Web coined (~97)
RDF+OO=>RDFS (~99)
RDFS+KR=>DAML+OIL (00)
W3C’s SW activity (01)
W3C’s OWL (03)
SPARQL (06)
Rules, RDFa, ….

Interest is high
Interest in industry, government and VCs is high
RDF is in Adobe’s products, Oracle 10g and 11g, Microsoft Vista, and Yahoo’s food portal
Several high-visibility startups use RDF: Joost (internet TV), Teranode (bioinformatics), Garlik (personal info monitoring)
And, if you want more evidence that interest is high …

$1795 $695 CD Only

What do we mean by “Semantic Web”?
[Diagram labels: “a smarter Google”; “NLP”; PowerSet; explicit semantics; topic maps; ad hoc approaches; Microformats; Tags; Folksonomies; XML; KR based; other structured; Freebase; Google Base; RDF+OWL]

RDF is the first SW language
RDF is a simple language for building graph based representations
Grounded in web standards
With terms to support ontologies, description logic, rules and much of first order logic
The RDF data model has three views:
Graph: good for human viewing
XML encoding: good for machine processing
<rdf:RDF ……..> <….> </rdf:RDF>
Triples: good for reasoning
stmt(docInst, rdf_type, Document)
stmt(personInst, rdf_type, Person)
stmt(inroomInst, rdf_type, InRoom)
stmt(personInst, holding, docInst)
stmt(inroomInst, person, personInst)
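The triples view is easy to try out. A minimal sketch, assuming Python and plain tuples rather than any RDF library; `match` is a hypothetical helper introduced only for illustration, with None acting as a wildcard.

```python
# Triples from the slide, written as (subject, predicate, object) tuples.
TRIPLES = [
    ("docInst", "rdf_type", "Document"),
    ("personInst", "rdf_type", "Person"),
    ("inroomInst", "rdf_type", "InRoom"),
    ("personInst", "holding", "docInst"),
    ("inroomInst", "person", "personInst"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every non-None position.

    Hypothetical helper for illustration; None means "any value".
    """
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Everything asserted with personInst as the subject.
about_person = match(TRIPLES, s="personInst")

# Every instance mapped to its type.
types = {s: o for s, _, o in match(TRIPLES, p="rdf_type")}
```

Because every statement has the same three-part shape, one generic pattern matcher covers lookups by subject, predicate or object.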

IMHO
Better NLP will help search engines, but it’s a long-term, incremental project
We need a well-defined and extensible representation system for explicit knowledge
It should be backed by open, non-proprietary standards supported by industry, government and other interested parties
The W3C approach is not perfect, but “the perfect is the enemy of the good.”
“Semantic Web” vs. “semantic web”

Running since summer 2004 2.1M RDF docs, 420M triples, 10K ontologies, 15K namespaces, 1.5M classes, 185K properties, 49M instances, 800 registered users

Swoogle Architecture
[Architecture diagram. Layers: Discovery, Index, Analysis, Search Services. Components: Bounded Web Crawler, Google Crawler, SwoogleBot, Candidate URLs, document cache, SWD classifier, SWD Indexer, IR Indexer, Ranking, Semantic Web metadata, Web Service Server, Swoogle’s web interface. Legend: information flow; human and machine users; html and rdf/xml output; the Web]

A Hybrid Harvesting Framework
[Diagram labels: Swoogle Sample Dataset; Submissions & pings; Inductive learner; Seeds M; Seeds H; Seeds R; Meta crawling (Google API call); Bounded HTML crawling; RDF crawling; the Web]

Performance – Site Coverage
SW06MAR - Basic statistics (Mar 31, 2006)
1.3M SWDs from 157K websites
268M triples
61K SWOs, including >10K of high quality
1.4M SWTs using 12K namespaces
Significance
Compare with existing works (DAML crawler, scutter)
Compare SW06MAR with Google’s estimated SWDs
[Chart: SWDs per website, by website]

Performance – crawlers’ contribution
High SWD ratio: 42% of URLs are confirmed as SWDs
Consistent growth rate: SWDs per day
RDF crawler: best harvesting method
HTML crawler: best accuracy
Meta crawler: best in detecting websites
[Chart axis: # of documents]

Applications and use cases
1 Supporting Semantic Web developers
Ontology designers, vocabulary discovery, who’s using my ontologies or data?, use analysis, errors, statistics, etc.
2 Searching specialized collections
Spire: aggregating observations and data from biologists
InferenceWeb: searching over and enhancing proofs
SemNews: Text Meaning of news stories
3 Supporting SW tools
Triple shop: finding data for SPARQL queries

80 ontologies were found that had these three terms
By default, ontologies are ordered by their ‘popularity’, but they can also be ordered by recency or size. Let’s look at this one

Basic Metadata
hasDateDiscovered:
hasDatePing:
hasPingState: PingModified
type: SemanticWebDocument
isEmbedded: false
hasGrammar: RDFXML
hasParseState: ParseSuccess
hasDateLastmodified:
hasDateCache:
hasEncoding: ISO
hasLength: 18K
hasCntTriple: 311.00
hasOntoRatio: 0.98
hasCntSwt: 94.00
hasCntSwtDef: 72.00
hasCntInstance: 8.00

Who uses this ontology and how do they access it?

rdfs:range was used 41 times to assert a value.
owl:ObjectProperty was instantiated 28 times time:Cal… defined once and used 24 times (e.g., as range)

All of this is available in RDF form for the agents among us.
These are the namespaces this ontology uses. Clicking on one shows all of the documents using the namespace. All of this is available in RDF form for the agents among us.

Here’s what the agent sees
Here’s what the agent sees. Note the swoogle and wob (web of belief) ontologies.

We can also search for terms (classes, properties) like terms for “person”.

10K terms associated with “person”! Ordered by use.
Let’s look at foaf:Person’s metadata

Metadata stored for a term is information about its definition: both what was defined and by whom

10K terms associated with “person”! Ordered by use.

How do other terms use foaf:Person
How do other terms use foaf:Person? 100 documents assert that foaf:publication is a property of a foaf:Person

87K documents used foaf:gender with a foaf:Person instance as the subject

3K documents used dc:creator with a foaf:Person instance as the object

Swoogle’s archive saves every version of a SWD it’s seen.

2 An NSF ITR collaborative project with
University of Maryland, Baltimore County
University of Maryland, College Park
U. of California, Davis
Rocky Mountain Biological Laboratory

An invasive species scenario
Nile Tilapia fish have been found in a California lake. Can this invasive species thrive in this environment? If so, what will be the likely consequences for the ecology? So…we need to understand the effects of introducing this fish into the food web of a typical California lake

Food Webs
A food web models the trophic (feeding) relationships between organisms in an ecology
Food web simulators are used to explore the consequences of changes in the ecology, such as the introduction or removal of a species
A location’s food web is usually constructed from studies of the frequencies of the species found there and the known trophic relations among them.
Goal: automatically construct a food web for a new location using existing data and knowledge
ELVIS: Ecosystem Location Visualization and Information System

East River Valley Trophic Web
The web structure in the image is organized vertically, with node color representing trophic level. Red nodes represent basal species, such as plants and detritus, orange nodes represent intermediate species, and yellow nodes represent top species or primary predators. Links characterize the interaction between two nodes, and the width of the link attenuates down the trophic cascade (i.e. a link is thicker at the predator end and thinner at the prey end).

Species List Constructor
Click a county, get a species list

The problem
We have data on what species are known to be in the location and can further restrict and fill in with other ecological models
But we don’t know which of these the Nile Tilapia eats or who might eat it.
We can reason from taxonomic data (similar species) and known natural history data (size, mass, habitat, etc.) to fill in the gaps.
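The taxonomic part of that reasoning can be sketched in a few lines. This is an illustration only, assuming Python; the family table, the known links and the function `predict_links` are hypothetical toy stand-ins, not ELVIS data or the ELVIS algorithm, and natural history attributes such as size or habitat are ignored.

```python
# Hypothetical toy data: species -> family, and known (predator, prey) links.
FAMILY = {
    "Nile tilapia": "Cichlidae",
    "Mozambique tilapia": "Cichlidae",
    "Largemouth bass": "Centrarchidae",
}
KNOWN_LINKS = [
    ("Mozambique tilapia", "algae"),
    ("Largemouth bass", "Mozambique tilapia"),
]


def predict_links(species, local_species):
    """Guess feeding links for `species` by copying links of same-family relatives.

    Only links whose other end is present at the location are proposed.
    """
    relatives = {
        s for s, family in FAMILY.items()
        if family == FAMILY[species] and s != species
    }
    predicted = set()
    for predator, prey in KNOWN_LINKS:
        if predator in relatives and prey in local_species:
            predicted.add((species, prey))      # may eat what a relative eats
        if prey in relatives and predator in local_species:
            predicted.add((predator, species))  # may be eaten by a relative's predator
    return predicted


links = predict_links("Nile tilapia", {"algae", "Largemouth bass", "ostracods"})
```

Each predicted link keeps a trace back to the relative it was copied from only implicitly here; a fuller version would record that evidence so it can be examined later.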

Predict food web links using database and taxonomic reasoning.
Food Web Constructor Predict food web links using database and taxonomic reasoning. In a new estuary, Nile Tilapia could compete with ostracods (green) to eat algae. Predators (red) and prey (blue) of ostracods may be affected.

Examine evidence for predicted links.
Evidence Provider Examine evidence for predicted links.

Status
ELVIS (Ecosystem Location Visualization and Information System) is an integrated set of web services for constructing food webs for a given location.
Background ontologies
SpireEcoConcepts: concepts and properties to represent food webs, and ELVIS-related tasks, inputs and outputs
ETHAN (Evolutionary Trees and Natural History): concepts and properties for ‘natural history’ information on species, derived from data in the Animal Diversity Web and other taxonomic sources; 250K classes on plants and animals
Under development
Connect to visualization software
Connect to triple shop to discover more data

3 Supporting SW Tools
Semantic Web applications can access Swoogle through a REST-based Web interface or via SQL.
Two examples:
A system to help scientists construct datasets from RDF documents on the Web
Tools to manage Semantic Web data in blogs and other forms of social media

UMBC Triple Shop http://sparql.cs.umbc.edu/
Online SPARQL RDF query processing with several interesting features
Automatically finds SWDs for given queries using the Swoogle backend database
Datasets, queries and results can be saved, tagged, annotated, shared, searched for, etc.
RDF datasets as first class objects
Can be stored on our server or downloaded
Can be materialized in a database or (soon) as a Jena model

What’s SPARQL?
SPARQL is the standard language (& protocol) for querying RDF graphs
Think: SQL for RDF
PREFIX rdf: <
PREFIX foaf: <
SELECT ?person ?name ?
FROM <
WHERE {
  ?person a foaf:Person .
  ?person foaf:name ?name .
  OPTIONAL { ?person foaf:mbox ? } .
}
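To see what a query of this shape does, here is a minimal sketch in Python of basic graph pattern matching with an OPTIONAL part. It is not a SPARQL engine and uses no RDF library; the toy graph, the variable name `?mbox` and the helpers `match_pattern`, `solve` and `optional` are all assumptions made for illustration.

```python
# Hypothetical toy graph; the people and mailboxes are made up.
GRAPH = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
]


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):               # a variable
            if new.setdefault(p_term, t_term) != t_term:
                return None                      # clashes with an earlier binding
        elif p_term != t_term:                   # a constant that does not match
            return None
    return new


def solve(patterns, graph, bindings=None):
    """Join the triple patterns against the graph, one pattern at a time."""
    bindings = [{}] if bindings is None else bindings
    for pattern in patterns:
        bindings = [
            extended
            for b in bindings
            for triple in graph
            if (extended := match_pattern(pattern, triple, b)) is not None
        ]
    return bindings


def optional(patterns, graph, bindings):
    """Keep every binding; extend it only where the optional patterns match."""
    result = []
    for b in bindings:
        result.extend(solve(patterns, graph, [b]) or [b])
    return result


required = solve(
    [("?person", "rdf:type", "foaf:Person"), ("?person", "foaf:name", "?name")],
    GRAPH,
)
rows = optional([("?person", "foaf:mbox", "?mbox")], GRAPH, required)
```

Both people come back as rows; only the one with a mailbox has `?mbox` bound, which is the point of OPTIONAL.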

The Fractal nature of SW systems
A SPARQL endpoint can make any Web data source look like an RDF graph that can be queried. Give a graph as a query, get a graph as a result.

Web-scale semantic web data access
[Diagram: an agent, a data access service, and the Web; the service indexes RDF data from the Web]
Agent: ask (“person”)
Service: search vocabulary (search URIrefs in SW vocabulary), then inform (“foaf:Person”)
Agent: compose query, then ask (“?x rdf:type foaf:Person”)
Service: search URLs in SWD index, then inform (doc URLs)
Agent: fetch docs, populate RDF database, query local RDF database
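The exchange in this diagram can be walked through with in-memory stand-ins. A sketch only, assuming Python: the three dictionaries and the functions `search_vocabulary`, `search_documents`, `fetch` and `find_instances` are hypothetical, and nothing here calls Swoogle or the network.

```python
# Hypothetical stand-ins for the data access service and the Web.
VOCABULARY = {"person": ["foaf:Person"]}          # keyword -> matching terms
SWD_INDEX = {                                     # term -> document URLs
    "foaf:Person": ["http://example.org/a.rdf", "http://example.org/b.rdf"],
}
DOCUMENTS = {                                     # URL -> triples in that document
    "http://example.org/a.rdf": [("ex:alice", "rdf:type", "foaf:Person")],
    "http://example.org/b.rdf": [
        ("ex:bob", "rdf:type", "foaf:Person"),
        ("ex:bob", "foaf:name", "Bob"),
    ],
}


def search_vocabulary(keyword):
    """ask("person") -> inform("foaf:Person")"""
    return VOCABULARY.get(keyword, [])


def search_documents(term):
    """ask("?x rdf:type foaf:Person") -> inform(doc URLs)"""
    return SWD_INDEX.get(term, [])


def fetch(url):
    """Stand-in for an HTTP GET plus RDF parsing."""
    return DOCUMENTS[url]


def find_instances(keyword):
    """Run the whole exchange: vocabulary, documents, local database, query."""
    terms = search_vocabulary(keyword)
    urls = {url for term in terms for url in search_documents(term)}
    local_db = [triple for url in sorted(urls) for triple in fetch(url)]
    return sorted({s for s, p, o in local_db if p == "rdf:type" and o in terms})


people = find_instances("person")
```

The service only answers "which terms" and "which documents"; the agent does the fetching and the final query against its own local store.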

Who knows Anupam Joshi? Show me their names, address and pictures

The UMBC ebiquity site publishes lots of RDF data, including FOAF profiles

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?p2name ?p2mbox ?p2pix
FROM ???
WHERE {
  ?p1 foaf:surname "Joshi" .
  ?p1 foaf:firstName "Anupam" .
  ?p1 foaf:mbox ?p1mbox .
  ?p2 foaf:knows ?p3 .
  ?p3 foaf:mbox ?p1mbox .
  ?p2 foaf:name ?p2name .
  ?p2 foaf:mbox ?p2mbox .
  OPTIONAL { ?p2 foaf:depiction ?p2pix } .
}
ORDER BY ?p2name
No FROM clause!

Log in, specify a dataset, and enter the query w/o a FROM clause!

We want to create a reusable dataset

Find RDF data using terms found in the query
That also satisfy some simple constraints (e.g., for trust)
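One way to get search keys out of a query is to expand its prefixed names into full URIs. A rough sketch, assuming Python and regular expressions; `query_terms` is a hypothetical helper, not how Triple Shop is implemented, and it ignores full IRIs, literals containing colons and other SPARQL syntax a real parser would handle.

```python
import re

QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?p2name ?p2mbox ?p2pix
WHERE {
  ?p1 foaf:surname "Joshi" .
  ?p1 foaf:firstName "Anupam" .
  ?p1 foaf:mbox ?p1mbox .
  ?p2 foaf:knows ?p3 .
  ?p3 foaf:mbox ?p1mbox .
  ?p2 foaf:name ?p2name .
  ?p2 foaf:mbox ?p2mbox .
  OPTIONAL { ?p2 foaf:depiction ?p2pix } .
}
ORDER BY ?p2name
"""

PREFIX_RE = re.compile(r"PREFIX\s+(\w+):\s*<([^>]*)>", re.IGNORECASE)
QNAME_RE = re.compile(r"\b(\w+):(\w+)\b")


def query_terms(query):
    """Return the sorted, de-duplicated full URIs of prefixed names in the query."""
    prefixes = dict(PREFIX_RE.findall(query))   # e.g. {"foaf": "http://xmlns.com/foaf/0.1/"}
    body = PREFIX_RE.sub("", query)             # drop declarations before scanning
    return sorted({
        prefixes[prefix] + local
        for prefix, local in QNAME_RE.findall(body)
        if prefix in prefixes                   # skip anything that is not a declared prefix
    })


terms = query_terms(QUERY)
```

Each resulting URI could then be looked up in a term-to-document index to propose candidate documents for the dataset.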

302 RDF documents were found that might have useful data.

We’ll select them all and add them to the current dataset.

We’ll run the query against this dataset to see if the results are as expected.

The results can be produced in any of several formats

Looks like a useful dataset
Looks like a useful dataset. Let’s save it and also materialize it in the TS triple store. An extension will let us ask that it be automatically updated when constituents change.

We can also annotate, save and share queries.

Social media sites have become the biggest source of new content on the Web
Blogs, Wikis, Photo sites, forums, etc. Accounting for ~1/3 of new Web content

It’s a global phenomenon
Japanese is now the most common language

Social media sites have embraced new ways of letting users add semantic information
Showing users the potential of semantics

Social Media and the Semantic Web
Many are exploring how Semantic Web technology can work with social media
Social media like blogs are typically temporally organized and valued for their timely and dynamic information!
If static pages form the Web’s long term memory, then the Blogosphere is its stream of consciousness
Maybe we can (1) help people publish data in RDF on their blogs and (2) mine social media sites for useful information

The OWL icon links to the data in RDF
A BioBlitz involves going out to an area and recording every organism you see. The OWL icon links to the data in RDF.

A good Semantic Web opportunity
We want to make it easy for scientists to enter and collect information from social media
Professionals, students and amateurs!
Two early examples
SPOTter: a tool to add Semantic Web data to blogs
Splickr: a system to mine Flickr for images of organisms

SPOTter: SPire Observation Tool
We’ve developed some simple components to help people add RDF data to blogs and ping Swoogle to get it indexed.
SPOTter is an initial prototype that uses the ETHAN ontology and is being used in some BioBlitz activities with students.
We’re working toward a version that uses Twitter so that people can make blog entries from their cell phones via SMS
The SPOTter agent will get the entries (via RSS) and index the data

SPOTter button Once entered, the data is embedded into the blog post and Swoogle is pinged to index it

Prototype SPOTter Search engine
We can draw a bounding box on the map and find observations
An RSS feed is provided for each query

Flickr The Flickr “photo sharing” site has millions of photographs
Many are of plants and animals
Most of them have descriptions, timestamps, tags and even geo-tags
Flickr has even introduced “machine tags” that can be mapped into RDF
Any Flickr user (human or bot) can add comments and annotations
There’s a good API
It could be a good source of ecological information
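Machine tags have the form namespace:predicate=value, which already looks like two thirds of a triple. A minimal sketch of the mapping, assuming Python; the namespace table, the example.org URIs and the function `machine_tag_to_triple` are hypothetical and chosen only for illustration.

```python
import re

# namespace:predicate=value
MACHINE_TAG_RE = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):([A-Za-z][A-Za-z0-9_]*)=(.*)$")

# Hypothetical mapping from machine-tag namespaces to RDF namespace URIs.
NAMESPACES = {"taxonomy": "http://example.org/taxonomy#"}


def machine_tag_to_triple(photo_uri, tag):
    """Turn one machine tag on a photo into a (subject, predicate, object) triple.

    Returns None for ordinary tags and for namespaces missing from the table.
    """
    found = MACHINE_TAG_RE.match(tag)
    if found is None:
        return None
    namespace, predicate, value = found.groups()
    base = NAMESPACES.get(namespace)
    if base is None:
        return None
    return (photo_uri, base + predicate, value)


triple = machine_tag_to_triple(
    "http://example.org/photo/1",            # hypothetical photo URI
    "taxonomy:binomial=Oreochromis niloticus",
)
```

The photo supplies the subject, the tag supplies predicate and object, and ordinary free-text tags simply fall through.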

Splickr is an AJAX-based application that uses the Flickr API to query Flickr’s database of publicly available pictures
Pictures have tags (e.g. names of animals) and geographical coordinates, so we can determine the locations where invasive species have been photographed
Results can be delivered in forms for people and machines
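Combining a tag with coordinates comes down to a filter over photo records. A sketch under stated assumptions, in Python: the `Photo` records and the function `sightings` are hypothetical, no Flickr API call is made, and the bounding box is assumed not to cross the 180° meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Photo:
    title: str
    tags: frozenset
    lat: float
    lon: float


# Hypothetical records standing in for photo search results.
PHOTOS = [
    Photo("lake shore", frozenset({"tilapia", "fish"}), 33.5, -115.9),
    Photo("river market", frozenset({"tilapia"}), 30.0, 31.2),
    Photo("morning catch", frozenset({"bass", "fish"}), 38.9, -120.0),
]


def sightings(photos, tag, south, west, north, east):
    """Photos carrying `tag` whose coordinates fall inside the bounding box."""
    return [
        p for p in photos
        if tag in p.tags
        and south <= p.lat <= north
        and west <= p.lon <= east
    ]


# A rough box around California (approximate, for illustration only).
hits = sightings(PHOTOS, "tilapia", south=32.0, west=-124.5, north=42.0, east=-114.0)
```

The same filter serves a map-drawn bounding box; the matching records can then be rendered as a page for people or serialized for machines.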

Results for people and machines

Conclusion
The web will contain the world’s knowledge in forms accessible to people and computers
We need better ways to discover, index, search and reason over SW knowledge
SW search engines address different tasks than HTML search engines, so they require different techniques and APIs
Swoogle-like systems can help create consensus ontologies and foster best practices
Social media provide new challenges and opportunities for the Semantic Web

For more information Annotated in OWL
