A Semantic Data Federation Engine: Design, Implementation, and Applications in Educational Information Management Mathew Cherian 02/04/2011.

1 A Semantic Data Federation Engine: Design, Implementation, and Applications in Educational Information Management Mathew Cherian 02/04/2011

2 Vision A way to easily integrate information from distributed, heterogeneous Semantic Web data sources

3 Why? ● Querying data sources independently and then mashing up the results is inefficient. ● It is unreasonable to expect users to know the details of integration (e.g., ontologies). ● If efficient integration is feasible, more entities would convert to RDF sources (solving one half of the chicken-and-egg problem). ● Kumbaya – more RDF, more Linked Data, no more need to explain what the Semantic Web is.

4 Solution

5 Contributions ● Designed and implemented an end-to-end SPARQL Federation Engine ● Developed a query optimizer that facilitates efficient query execution ● Initiated the use and integration of SWIG in the SWObjects project

6 Architecture

7 Architecture ● Validator – verifies that: – Query meets SPARQL 1.0 specs – Prefixes used were declared – Predicates are bound ● Mapper – Rewrites the query into several sub-queries directed to various SERVICEs – Rewriting relies on mapping rules generated out of band from source descriptions ● Optimizer – Reorders SERVICE clauses using an algorithm that draws on the sub-queries' structures and the source descriptions
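
Two of the validator checks listed above (declared prefixes and bound predicates) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triple patterns are represented as plain `(subject, predicate, object)` string tuples, the function name `validate` is hypothetical, and no SPARQL 1.0 grammar check is attempted. It is not the engine's actual code.

```python
def validate(declared_prefixes, patterns):
    """Return a list of problems; an empty list means both checks pass.

    Assumed representation: each pattern is a (subject, predicate, object)
    tuple of strings; variables start with "?", prefixed names look like
    "prefix:local".
    """
    problems = []
    for s, p, o in patterns:
        # Check 1: the predicate must be bound (not a variable).
        if p.startswith("?"):
            problems.append(f"unbound predicate in pattern: {s} {p} {o}")
        # Check 2: every prefix used must have been declared.
        for term in (s, p, o):
            if term.startswith(("?", "<", '"')):
                continue  # variables, full IRIs and literals carry no prefix
            if ":" in term:
                prefix = term.split(":", 1)[0]
                if prefix not in declared_prefixes:
                    problems.append(f"undeclared prefix '{prefix}' in term {term}")
    return problems


example_patterns = [
    ("?student", "dese:id", "?sasid"),
    ("?teacher", "?p", "?o"),            # predicate is a variable
    ("?teacher", "foo:name", "?name"),   # "foo" was never declared
]
example_problems = validate({"dese", "teacher"}, example_patterns)
```

Rejecting variable predicates early matters because the mapper routes each pattern to an endpoint by its predicate.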

8 Architecture ● Orchestrator – Sends sub-queries, in the optimized order, to endpoints ● If a result set contains subject(s)/object(s) for subsequent queries, those are bound before the sub-queries are sent out. – Combines all result sets and returns the result to the user ● Proof Generator – Using endpoints' policy descriptions and user-provided credentials, attempts to generate a proof to allow querying of secure data sources
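
The orchestrator's "bind, then send" behaviour can be sketched as follows. This is a simplified illustration, not the engine's code: endpoints are stood in for by in-memory lists of triples, the function names are hypothetical, the sample data is made up, and OPTIONAL clauses are not handled.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; None on conflict."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable either gets bound now or must agree with its value.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def run_subquery(endpoint_triples, patterns, binding):
    """Evaluate a basic graph pattern on one endpoint, starting from `binding`."""
    solutions = [binding]
    for pattern in patterns:
        extended = []
        for b in solutions:
            for triple in endpoint_triples:
                m = match(pattern, triple, b)
                if m is not None:
                    extended.append(m)
        solutions = extended
    return solutions


def orchestrate(ordered_subqueries, endpoints):
    """Run sub-queries in the given order, carrying earlier bindings forward."""
    solutions = [{}]
    for endpoint_name, patterns in ordered_subqueries:
        triples = endpoints[endpoint_name]
        solutions = [s for b in solutions
                     for s in run_subquery(triples, patterns, b)]
    return solutions


# Made-up sample data standing in for two remote endpoints.
endpoints = {
    "students": [("s1", "dese:has_math_teacher", "t1"),
                 ("s2", "dese:has_math_teacher", "t2")],
    "teachers": [("t1", "teacher:cert_status", "Licensed")],
}
plan = [
    ("students", [("?student", "dese:has_math_teacher", "?teacher")]),
    ("teachers", [("?teacher", "teacher:cert_status", "?status")]),
]
combined = orchestrate(plan, endpoints)
```

Because `?teacher` is already bound when the second sub-query runs, only matching teachers are looked up, which is why the sub-query order matters.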

9 Architecture ● Source Descriptions: – Contains information on endpoints, such as total number of triples, list of predicates, etc. – Used by optimizer to reorder sub-queries – http://dig.csail.mit.edu/2009/AFOSR/service_description.n3 :EducatorsDB a :Endpoint; void:sparqlEndpoint ; dc:title "MA Educator Database"; policy ; readLatency "4.7"^^xsd:float; totalTriples "27931373"^^xsd:integer; numPredicates "3"^^xsd:integer; predStat [ predicate teacher:has_name; numTriples "77546"^^xsd:integer; numObjects "77536"^^xsd:integer ]; predStat [ predicate teacher:math_cert_score; numTriples "234636"^^xsd:integer; numObjects "100"^^xsd:integer ]; predStat [ predicate teacher:cert_status; numTriples "156450"^^xsd:integer; numObjects "5"^^xsd:integer ].
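
To show how such statistics can feed an optimizer, the sketch below loads the numbers from the source description above into plain Python objects and derives one simple figure per predicate: the average number of triples per distinct object. The class and field names are hypothetical, and the derived figure is an illustrative choice, not necessarily the quantity the engine computes.

```python
from dataclasses import dataclass, field


@dataclass
class PredicateStat:
    predicate: str
    num_triples: int
    num_objects: int

    def triples_per_object(self):
        """Average number of triples sharing one object value."""
        return self.num_triples / self.num_objects


@dataclass
class SourceDescription:
    title: str
    read_latency: float
    total_triples: int
    pred_stats: dict = field(default_factory=dict)

    def add(self, stat):
        self.pred_stats[stat.predicate] = stat


# Values copied from the EducatorsDB description on the slide.
educators = SourceDescription("MA Educator Database", 4.7, 27931373)
educators.add(PredicateStat("teacher:has_name", 77546, 77536))
educators.add(PredicateStat("teacher:math_cert_score", 234636, 100))
educators.add(PredicateStat("teacher:cert_status", 156450, 5))

for stat in educators.pred_stats.values():
    print(stat.predicate, round(stat.triples_per_object(), 2))
```

A pattern with a fixed object on `teacher:has_name` is far more selective (about one triple per name) than one on `teacher:cert_status` (tens of thousands of triples per status value).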

10 Architecture ● Map Generator ● Uses source descriptions to generate mapping rules for the federation engine ● Maps are generated out-of-band LABEL 'http://www.mass.gov/dese#id' CONSTRUCT {?rs ?ro} {SERVICE {?rs ?ro}} LABEL 'http://www.mass.gov/dese#school' CONSTRUCT {?rs ?ro} {SERVICE {?rs ?ro}}
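
The idea behind predicate-based mapping can be sketched as follows: build an index from predicate to endpoint out of the source descriptions, then group a query's triple patterns into one SERVICE block per endpoint. The endpoint URLs under `example.org` and the function names are assumptions made for illustration; the engine's actual rules use the LABEL/CONSTRUCT form shown above.

```python
def build_predicate_index(source_descriptions):
    """Map each advertised predicate to the first endpoint that lists it."""
    index = {}
    for endpoint, predicates in source_descriptions.items():
        for predicate in predicates:
            index.setdefault(predicate, endpoint)
    return index


def rewrite(patterns, index):
    """Group triple patterns by endpoint and emit one SERVICE block each.

    Raises KeyError if a predicate is not advertised by any endpoint.
    """
    groups = {}
    for s, p, o in patterns:
        groups.setdefault(index[p], []).append((s, p, o))
    blocks = []
    for endpoint, group in groups.items():
        body = " ".join(f"{s} {p} {o} ." for s, p, o in group)
        blocks.append(f"SERVICE <{endpoint}> {{ {body} }}")
    return "\n".join(blocks)


# Hypothetical endpoint URLs; predicates taken from the example query.
descriptions = {
    "http://example.org/dese/sparql": ["dese:id", "dese:has_math_teacher"],
    "http://example.org/teachers/sparql": ["teacher:cert_status"],
}
query_patterns = [
    ("?student", "dese:id", "?sasid"),
    ("?student", "dese:has_math_teacher", "?teacher"),
    ("?teacher", "teacher:cert_status", "?status"),
]
rewritten = rewrite(query_patterns, build_predicate_index(descriptions))
print(rewritten)
```

Building the index ahead of time mirrors the out-of-band generation of maps: the per-query work is then a dictionary lookup per pattern.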

11 Optimization Algorithm ● Two Steps ● Step 1: Estimate result sizes of each sub-query using statistical information in source descriptions ● Step 2: Order sub-queries based on each sub-query's ability to reduce the result set size of other sub-queries
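
The two steps can be illustrated with a small greedy sketch. The slide does not give the exact cost model, so the one used here is an assumption: a pattern with a variable object is estimated at `numTriples`, a pattern with a fixed object at `numTriples / numObjects`, and a sub-query takes the minimum over its patterns. Ordering then prefers sub-queries that share a variable with those already placed. The teacher statistics come from the source description slide; the `ex:district_name` entry and the literal `"Licensed"` are made up.

```python
def estimate_size(patterns, stats):
    """Step 1: rough result-size estimate for one sub-query (assumed model)."""
    estimates = []
    for s, p, o in patterns:
        num_triples, num_objects = stats[p]
        if o.startswith("?"):
            estimates.append(num_triples)                # object is free
        else:
            estimates.append(num_triples / num_objects)  # object is fixed
    return min(estimates)


def variables(patterns):
    """All variables mentioned in a list of triple patterns."""
    return {t for pattern in patterns for t in pattern if t.startswith("?")}


def order_subqueries(subqueries, stats):
    """Step 2: greedy order, favouring sub-queries joined to earlier ones."""
    remaining = dict(subqueries)
    ordered, bound = [], set()
    while remaining:
        def cost(name):
            shares = bool(variables(remaining[name]) & bound)
            # Sub-queries that can reuse bound variables sort first.
            return (0 if shares else 1, estimate_size(remaining[name], stats))
        best = min(remaining, key=cost)
        ordered.append(best)
        bound |= variables(remaining.pop(best))
    return ordered


stats = {  # predicate -> (numTriples, numObjects)
    "teacher:has_name": (77546, 77536),
    "teacher:math_cert_score": (234636, 100),
    "teacher:cert_status": (156450, 5),
    "ex:district_name": (50000, 50000),  # made-up figures for illustration
}
subqueries = {
    "scores": [("?teacher", "teacher:math_cert_score", "?score")],
    "names": [("?teacher", "teacher:has_name", "?name")],
    "status": [("?teacher", "teacher:cert_status", '"Licensed"')],
    "districts": [("?district", "ex:district_name", "?dname")],
}
order = order_subqueries(subqueries, stats)
print(order)
```

Here `districts` has a smaller estimate than `names` but is placed last, because it shares no variable with the other sub-queries and so cannot shrink their result sets.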

12 Implementation ● Python, C/C++ ● Tools used – SWIG – Fyzz – RDFLib

13 Implementation

14 Use Case – Massachusetts Department of Elementary and Secondary Education (DESE) ● Responsible for the education of approximately 550,000 students in 391 school districts ● Tasked with providing useful, timely information to stakeholders ● Required to report student performance on MCAS, down to specific subgroups, and educator qualifications to the federal government every year to meet NCLB regulations

15 Status Quo ● Districts collect student demographic data and send CSV files to DESE at the end of the year. ● DESE aggregates information from districts, MCAS results, and teacher data using a Cognos system – painfully slow! ● Huge delays (up to two years) between data structure changes and data updates ● The Growth Model

16 SPARQL Federation [Diagram: DESE Federator connected to endpoints MCAS, Teachers, Cambridge, Boston, …, Springfield]

17 Secure SPARQL Federation PREFIX dese: PREFIX teacher: SELECT ?student ?mscore ?name ?cert_score ?status WHERE { ?student dese:id ?sasid ; dese:3rd_math_score ?mscore ; dese:has_math_teacher ?teacher . ?teacher teacher:cert_status ?status ; teacher:math_score ?score. OPTIONAL { ?teacher teacher:has_name ?name } }

18 Secure SPARQL Federation SELECT ?student ?mscore ?name ?cert_score ?status WHERE { SERVICE { ?student ?sasid. ?student ?mscore. ?student ?teacher. } SERVICE { ?teacher ?status. ?teacher ?score. OPTIONAL { SERVICE { ?teacher ?name. }

19 Benefits ● Storing data in RDF would make data collection and storage more efficient ● No need to transfer large amounts of data between various databases ● Ideal for the state's plans to link education data to wage and other post-school indicators

20 Challenges ● SPARQL optimization is in its infancy ● Until better ontology mappings are available, districts would have to use a DESE-specified ontology ● Easy access to information means more scrutiny for public servants – a big disincentive to switch to RDF, SPARQL, etc. ● Absence of legal frameworks that facilitate inter-agency data sharing ● Fear of Big Brother

21 Demo

22 Evaluation ● Test Optimizer and End-to-End performance ● Performance measured by time taken for various tasks: – Validation Time (t1), Mapping Time (t2), Optimization Time (t3), Transformation Time (t2 + t3), Execution Time (t4), End-to-End Time (t1 + t2 + t3 + t4) ● Federation Engine hosted on 64-bit VM running Ubuntu 10.04 server (4GB, 512 MB RAM)
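
The timing breakdown above can be collected with a small harness like the following. It is a sketch only: the four stage functions are passed in as arguments, and the placeholder stages in the usage example are stand-ins rather than the engine's components.

```python
import time


def timed(stage, *args):
    """Run one stage and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = stage(*args)
    return result, time.perf_counter() - start


def measure(query, validate, map_query, optimize, execute):
    """Time each pipeline stage and derive the two composite figures."""
    _, t1 = timed(validate, query)
    mapped, t2 = timed(map_query, query)
    plan, t3 = timed(optimize, mapped)
    results, t4 = timed(execute, plan)
    times = {
        "validation": t1,
        "mapping": t2,
        "optimization": t3,
        "transformation": t2 + t3,        # mapping + optimization
        "execution": t4,
        "end_to_end": t1 + t2 + t3 + t4,  # all four stages
    }
    return times, results


# Placeholder stages that pass their input through unchanged.
times, results = measure(
    "SELECT * WHERE { ?s ?p ?o }",
    validate=lambda q: True,
    map_query=lambda q: q,
    optimize=lambda q: q,
    execute=lambda q: [],
)
print(times)
```

Reporting transformation time separately from execution time makes it possible to tell whether the optimizer's own overhead outweighs the execution time it saves.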

23 Optimizer Tests ● Generated random source descriptions for three endpoints. ● No query execution against endpoints ● Tests

24 Evaluation Number of Triples/predicates vs. Number of Predicates (with Constant Total Number of Triples)

25 End to End Tests ● Four endpoints, each containing a subset of DBpedia – Person Data (1.7M), Article Categories (12M), Category Labels (632K), and Infoboxes (13.8M). ● Each hosted on a separate VM (64-bit, 256 MB RAM, Ubuntu 10.04 server) using 4store ● Manually created source descriptions ● Ran 4 queries and compared performance of the Federation Engine with and without the optimizer.

26 End to End Tests - Q1 #German Musicians who were born in Berlin PREFIX dbpedia: PREFIX dbp_resource: PREFIX dbp_category: PREFIX dc_terms: PREFIX foaf: PREFIX rdfs: SELECT ?person ?name ?birthday WHERE { ?person foaf:name ?name. ?person dbpedia:birthDate ?birthday. ?person dbpedia:birthPlace dbp_resource:Berlin. ?person dc_terms:subject dbp_category:German_musicians. OPTIONAL { dbp_category:German_musicians rdfs:label ?label. } }

27 End to End Tests - Q2 #People who played professional baseball and basketball PREFIX dbpedia: PREFIX dbpedia_category: PREFIX dc: PREFIX dc_terms: PREFIX foaf: PREFIX rdf: PREFIX rdfs: SELECT ?name WHERE { ?p dc_terms:subject dbpedia_category:Minor_league_baseball_players. ?p foaf:name ?name. ?p dc_terms:subject dbpedia_category:American_basketball_players. }

28 End to End Tests - Q3 #Paris born Movie Stars PREFIX dbpedia: PREFIX dbp_resource: PREFIX dc: PREFIX foaf: PREFIX rdf: PREFIX rdfs: SELECT ?name ?m WHERE { ?m dbpedia:starring ?p. ?p dbpedia:birthPlace dbp_resource:Paris. ?p foaf:name ?name. FILTER (?city = dbp_resource:Paris) }

29 End to End Tests - Q4 # Presidents of the United States and their vocations PREFIX dbpedia: PREFIX dc: PREFIX dc_terms: PREFIX foaf: PREFIX rdf: PREFIX rdfs: SELECT ?name ?job WHERE { ?p foaf:name ?name. ?p dbpedia:occupation ?job. ?p dc_terms:subject. }

30 Evaluation Transformation Times

31 Evaluation Total (End-to-End) Times

32 Limitations ● Requires users to have knowledge of ontologies and structure of endpoints ● Optimizer only uses a simple cost model ● Requiring bound predicates can limit the extent of queries ● Assumes one ontology across many endpoints

33 Related Work

34 Future Work ● Incorporate Proof Generator/Checker to facilitate secure SPARQL federation ● Automate the generation of source description files ● Standards for descriptions of/interfaces to SPARQL endpoints ● Mapper as subcomponent of Optimizer ● Better SPARQL optimizations ● NLP NLP...

35 Contributions ● Designed and implemented an end-to-end SPARQL Federation Engine ● Developed a query optimizer that facilitates efficient query execution ● Initiated the use and integration of SWIG in the SWObjects project

