1
Storing RDF Data in Hadoop and Retrieval
Pankil Doshi, Asif Mohammed, Mohammad Farhan Husain, Dr. Latifur Khan, Dr. Bhavani Thuraisingham
2
Goal
To build efficient storage using Hadoop for petabytes of data
To build an efficient query mechanism
Possible outcomes:
– An open-source framework for RDF
– Integration with Jena
3
Possible Approaches
Store RDF data in HDFS and query through MapReduce programming
– Our current approach
Store RDF data in HDFS and process queries outside of Hadoop
– Done in the BIOMANTA [1] project; however, no details are published
HBase
– Currently being worked on by another team in the Semantic Web lab
4
Dataset and Queries
LUBM [2]
– Dataset generator
– 14 benchmark queries
– Generates synthetic data about imaginary universities
– Used for query execution performance comparisons by many researchers
5
Our Clusters
4-node cluster in the Semantic Web lab
10-node cluster in the SAIAL lab; each node has:
– 4 GB main memory
– Intel Pentium IV 3.0 GHz processor
– 640 GB hard drive
OpenCirrus HP Labs test bed
– Sponsor: Andy Seaborne, HP Labs
6
Tasks Completed/In Progress
Set up Hadoop clusters
Generate, preprocess & insert data
Devise an algorithm to produce MapReduce code for a SPARQL query
Code for the 14 benchmark queries
Cascade the output of one job into another job as input without using the hard disk
7
Two Storage Approaches
1. Multiple-file approach: dump the files as generated by the LUBM generator, possibly merging some
– Each line of a file contains subject, predicate and object
2. Predicate-based approach: divide the files based on predicate
– The file name is the predicate name
– Each line then contains only subject and object
– On average there are about 20 different types of predicates
Common preprocessing: adding prefixes, e.g. http://www.University10Department5:.... becomes U10D5:....
8
Example of predicate-based file division:

Original triples:
D0U0:Graduate20 ub:type lehigh:GraduateStudent
D0U0:Graduate20 ub:memberOf lehigh:University0

Filename: type
D0U0:Graduate20 lehigh:GraduateStudent
…
Filename: memberOf
D0U0:Graduate20 lehigh:University0
…
Filename: type_GraduateStudent
D0U0:Graduate20
…
Filename: memberOf_University
D0U0:Graduate20 lehigh:University0
…
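As an illustration, here is a minimal local-filesystem sketch of this predicate-based split; the class name, tab-separated layout and file handling are our assumptions (the actual preprocessor would write into HDFS, and the prefix substitution is omitted):

import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Reads one "subject predicate object" triple per line from args[0]
// and writes "subject<TAB>object" into one output file per predicate.
public class PredicateSplitter {
    public static void main(String[] args) throws IOException {
        Map<String, PrintWriter> files = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] spo = line.trim().split("\\s+", 3);
                if (spo.length < 3) continue;                  // skip malformed lines
                String predicate = spo[1].replace(':', '_');   // file-system-safe name
                PrintWriter out = files.computeIfAbsent(predicate, p -> {
                    try { return new PrintWriter(new FileWriter(p)); }
                    catch (IOException e) { throw new UncheckedIOException(e); }
                });
                out.println(spo[0] + "\t" + spo[2]);           // keep only subject, object
            }
        }
        files.values().forEach(PrintWriter::close);
    }
}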
9
Sample Query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT ?X WHERE {
?X rdf:type ub:Publication .
?X ub:publicationAuthor D0U0:AssistantProfessor0 }

Map function: look at which file (key) the data (value) comes from and filter it according to the conditions. For example:
– If the data is from file "type_Publication", output the pair of the subject and a tag marking that pattern
– If the data is from a "publicationAuthor_*" file, look for D0U0:AssistantProfessor0 as the object and output the subject with its own tag
Reduce function: collect all the values for a subject according to the conditions and output the key as the result, i.e. keep only those subjects having both ub:Publication and D0U0:AssistantProfessor0.
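A minimal sketch of what the MapReduce code for this sample query could look like, against the standard Hadoop Java API and the tab-separated predicate files sketched earlier; the class names and the "T"/"A" tags are our assumptions, not the authors' generated code, and the job driver is omitted:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SampleQueryJob {

    // Map: tag each subject with the predicate file it came from.
    public static class SelectMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
            String[] so = line.toString().split("\t");   // subject [object]
            if (file.equals("type_Publication")) {
                ctx.write(new Text(so[0]), new Text("T"));
            } else if (file.startsWith("publicationAuthor")
                    && so.length > 1 && so[1].equals("D0U0:AssistantProfessor0")) {
                ctx.write(new Text(so[0]), new Text("A"));
            }
        }
    }

    // Reduce: keep only subjects that matched both triple patterns.
    public static class JoinReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text subject, Iterable<Text> tags, Context ctx)
                throws IOException, InterruptedException {
            boolean hasType = false, hasAuthor = false;
            for (Text t : tags) {
                if (t.toString().equals("T")) hasType = true;
                else if (t.toString().equals("A")) hasAuthor = true;
            }
            if (hasType && hasAuthor) ctx.write(subject, NullWritable.get());
        }
    }
}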
10
Algorithm
SELECT ?X, ?Y WHERE {
1. ?X rdf:type ub:Chair .
2. ?Y rdf:type ub:Department .
3. ?X ub:worksFor ?Y .
4. ?Y ub:subOrganizationOf <...> }

[Query graph: the four triple patterns are nodes 1-4, labeled with their variables (1: X, 2: Y, 3: X,Y, 4: Y); an edge connects two patterns sharing a variable; |E| = 4]

Variable  Nodes    Joins
X         1, 3     1-3
Y         2, 3, 4  2-3, 3-4, 4-2

Job 1 map output keys:
1. Y – 2, 3, 4 (3 joins)
Job 1 joins: 3
1 join left, so another job is needed
11
Algorithm (contd.)
[Query graph after Job 1: node A = merged patterns (2, 3, 4) with variables X, Y; node B = pattern (1) with variable X; one edge A-B on X]

Variable  Nodes  Joins
X         A, B   A-B

Job 2 map output key:
1. X – A, B (1 join)
Job 2 joins: 1
No joins left, no more jobs needed
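The planning idea on these two slides can be sketched as a small greedy loop: in each round, join on the variable shared by the most remaining triple patterns and merge those patterns into one node. A toy version under our own representation, not the authors' exact algorithm:

import java.util.*;

public class JoinPlanner {
    // Count MapReduce jobs for a query: each round greedily joins on the
    // variable occurring in the most remaining patterns, merging those
    // patterns into a single node, until only one node is left.
    public static int countJobs(List<Set<String>> patterns) {
        int jobs = 0;
        while (patterns.size() > 1) {
            Map<String, Integer> freq = new HashMap<>();
            for (Set<String> p : patterns)
                for (String v : p) freq.merge(v, 1, Integer::sum);
            String best = Collections.max(freq.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            if (freq.get(best) < 2) break;            // nothing left to join on
            Set<String> merged = new HashSet<>();
            List<Set<String>> rest = new ArrayList<>();
            for (Set<String> p : patterns)
                if (p.contains(best)) merged.addAll(p);
                else rest.add(p);
            rest.add(merged);
            patterns = rest;
            jobs++;
        }
        return jobs;
    }

    public static void main(String[] args) {
        // the four patterns from the slide, reduced to their variable sets
        List<Set<String>> query = List.of(
                Set.of("X"),        // 1: ?X rdf:type ub:Chair
                Set.of("Y"),        // 2: ?Y rdf:type ub:Department
                Set.of("X", "Y"),   // 3: ?X ub:worksFor ?Y
                Set.of("Y"));       // 4: ?Y ub:subOrganizationOf <...>
        System.out.println(countJobs(query));         // prints 2, as on the slides
    }
}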
12
Some Query Results
[Chart: horizontal axis: number of triples; vertical axis: time in milliseconds]
13
Query Preprocessing
Original query 2:
?X rdf:type ub:GraduateStudent .
?Y rdf:type ub:University .
?Z rdf:type ub:Department .
?X ub:memberOf ?Z .
?Z ub:subOrganizationOf ?Y .
?X ub:undergraduateDegreeFrom ?Y
Rewritten:
?X rdf:type ub:GraduateStudent .
?X ub:memberOf_Department ?Z .
?Z ub:subOrganizationOf_University ?Y .
?X ub:undergraduateDegreeFrom_University ?Y
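A toy sketch of this rewrite step, using our own representation (each pattern as a subject/predicate/object string array): an rdf:type pattern whose variable appears as the object of another pattern is folded into that pattern's predicate as a _Class suffix, and rdf:type patterns survive only for variables that were never folded.

import java.util.*;

public class QueryRewriter {
    /** Each pattern is {subject, predicate, object}. */
    public static List<String[]> rewrite(List<String[]> in) {
        Map<String, String> classOf = new HashMap<>();        // var -> class name
        for (String[] p : in)
            if (p[1].equals("rdf:type") && p[0].startsWith("?"))
                classOf.put(p[0], p[2].substring(p[2].indexOf(':') + 1));

        List<String[]> out = new ArrayList<>();
        Set<String> folded = new HashSet<>();
        for (String[] p : in) {
            if (p[1].equals("rdf:type")) continue;            // decide below
            String cls = classOf.get(p[2]);
            if (cls != null) {                                // object's class known
                out.add(new String[]{p[0], p[1] + "_" + cls, p[2]});
                folded.add(p[2]);
            } else out.add(p);
        }
        // keep rdf:type patterns only for variables not folded anywhere
        for (String[] p : in)
            if (p[1].equals("rdf:type") && !folded.contains(p[0]))
                out.add(0, p);
        return out;
    }

    public static void main(String[] args) {
        List<String[]> q2 = List.of(                          // original query 2
                new String[]{"?X", "rdf:type", "ub:GraduateStudent"},
                new String[]{"?Y", "rdf:type", "ub:University"},
                new String[]{"?Z", "rdf:type", "ub:Department"},
                new String[]{"?X", "ub:memberOf", "?Z"},
                new String[]{"?Z", "ub:subOrganizationOf", "?Y"},
                new String[]{"?X", "ub:undergraduateDegreeFrom", "?Y"});
        for (String[] p : rewrite(q2))                        // prints the rewritten
            System.out.println(String.join(" ", p));          // query from the slide
    }
}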
14
Parallel Experiment with Pig
Script for query 2:

/* Load statements */
GS = LOAD 'type_GraduateStudent' AS (gs_subject:chararray);
MO = LOAD 'memberOf_Department' AS (mo_subject:chararray, mo_object:chararray);
SOF = LOAD 'subOrganizationOf_University' AS (sof_subject:chararray, sof_object:chararray);
UDF = LOAD 'undergraduateDegreeFrom_University' AS (udf_subject:chararray, udf_object:chararray);

/* Joins */
MO_UDF_GS = JOIN GS BY gs_subject, UDF BY udf_subject, MO BY mo_subject PARALLEL 8;
MO_UDF_GS = FOREACH MO_UDF_GS GENERATE mo_subject, udf_object, mo_object;
MO_UDF_GS_SOF = JOIN SOF BY (sof_subject, sof_object), MO_UDF_GS BY (mo_object, udf_object);
MO_UDF_GS_SOF = FOREACH MO_UDF_GS_SOF GENERATE mo_subject, udf_object, mo_object;

/* Store query answer */
STORE MO_UDF_GS_SOF INTO 'Query2' USING PigStorage('\t');
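Assuming the script above were saved as query2.pig (the file name is our choice), it could be launched on the cluster with the standard Pig command line:

pig -x mapreduce query2.pig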
15
Parallel Experiment with Pig
2 jobs are created for query 2
For 330 million triples, Pig answers in 20 minutes
– The direct MapReduce approach takes 10 minutes
16
Future Work
Run all 14 queries for 100 million, 200 million, ..., 1 billion triples and compare with the Jena In-Memory, RDB, SDB and TDB models
Cascade the output of one job into another job as input without using the hard disk
Generic MapReduce code
Proof of the algorithm
Modification of the algorithm for queries with optional triple patterns
Indexing and summary statistics
17
References
[1] BIOMANTA: http://www.biomanta.org/
[2] LUBM: http://swat.cse.lehigh.edu/projects/lubm/