FlexTable: Using a Dynamic Relation Model to Store RDF Data
2010. 7. 14
IDS Lab. Seungseok Kang


Copyright  2008 by CEBT Outline  Introduction  Preliminary  Schema Evolution Similarity Measurement Lattice-Based Algorithm Control Parameter  Modification of Physical Storage  Experiment and Analysis

Copyright  2008 by CEBT Introduction  Resource Description Framework (RDF) Flexible model for representing information about resources  Solutions to store and query RDF data TripleStore – Storing predicate as values in table VertPart – Statistics of predicate correlation are lost

Copyright  2008 by CEBT Introduction  Requirement for reducing scan and join cost Triple should be organized as triple groups – How to group the triples to reduce query cost? All triples sharing same subject should be stored in one page – How to support this process dynamically?  FlexTable Dynamic relation model Contributions of FlexTable – A method based on lattice-structure to design evolving triple groups – A new data page for reducing cost of schema evolution

Copyright  2008 by CEBT Preliminaries  Triple (s,p,v) ∈ (U ∪ B)XUX(U ∪ B ∪ L) U: a set of URLs, B: a set of blank node, L: a set of literals  RDF tuple A tuple coalesced with a set of triples having a same subject  RDF schema A set of RDF tuples stored as a table in FlexTable

Copyright  2008 by CEBT Schema Evolution  Classification of triples When triples are considered as a whole, the correlation of all predicates are difficult to compute (e.g. queries with join) Predicates could be clustered into several classes – Join order and predicate correlation statistics would have a great effect on query performance  Schema evolution Extract RDF schema from RDF tuple Similar schemas are merged automatically according to their similarity – Similarity measurement – Lattice-based algorithm (LBA) – Control parameter

Copyright  2008 by CEBT Similarity Measurement  Two schemas with maximum similarity value will be merged While a new RDF tuple is inserted  Cosine-distance measure Compute the importance of an attribute in one schema – Example: if attribute “a 1 ” exists in less schemas than “a 2 ”, two schemas sharing attribute “a 1 ” are more similar than those only sharing “a 2 ” (e.g. “inUniversity” vs. “name”) Cosine-distance which denotes the similarity of two schemas A ratio of RDF tuples which have values in attribute a j to all RDF tuples containted in s i

Copyright  2008 by CEBT Lattice-Based Algorithm  A straightforward method Compute every similarity pairs, pick up the most similar pair – O(n) time complexity / O(n 2 ) space complexity  Lattice-Based algorithm (LBA) Each RDF schema is corresponded to a node in the lattice With all the attribute of schema A is contained in attribute set of schema B, A is an ancestor (parent) of B – Upper node is parent node / Dashed line is brother node Only the similarities between parent-child schema or brother schema pair are computed

Copyright  2008 by CEBT Lattice-Based Approach Algorithm EvolutionLattice(tuple, lattice) Input: tuple – An RDF tuple lattice – An RDF schema lattice Output: lattice 1: schema <- ExtractSchema(tuple); 2: AddSchema(schema, lattice); 3: schemaPair,<-GetMaxSimPair(lattice); 4: if(NeedMerge(schemaPair)) 5: newSchema=MergeSchema(schemaPair); 6: AddSchema(newSchema,lattice) 7: InsertTuple(tuple); 8: return lattice; Algorithm AddSchema(schema, lattice) Input: schema - A new schema lattice – An RDF schema lattice Output: lattice 1: bottom <- getBottomNode(lattice); 2: stack <- new Stack(bottom); 3: while(!isEmpty(stack)) 4: temp <- pop(stack); 5: if (schema is ancestor of temp) 6: push all parents of temp into stack; 7: else 8: AddChildren(temp’s children, schema); 9: compute similarity between temp’s children and schema; 10: top<-getTopNode(lattice); 11: push top in stack; 12: while(!isEmpty(stack)) 13: temp<-pop(stack); 14: if (temp is ancestor of schema) 15: push all children of temp into stack; 16: else 17: AddParents(temp’s parents, schema); 18: compute similarity between temp’s parents and schema; 19: compute similarity between temp and schema; 20: compute similarity between temp’s brothers and schema; 21: return lattice; AddSchema

Copyright  2008 by CEBT Control Parameter  Problem of schema evolution Stop merge: to compute the storage gain evolution – If storage cost of a new schema is smaller than existing two schemas, merge these two schemas into the new one – Otherwise, no need for action Storage cost of a schema Storage gain for schema merging – While C gain >0, NeedMerge is T, otherwise F  Summary Compute similarity between two schemas Lattice-Based algorithm for dynamic relational schemas A formula to determine when to merge two schemas a: Storage cost of schema information b: Storage cost of each attribute in one schema |A|: Number of attributes |N|: Number of RDF tuples r: Storage cost of each bitmap C val : storage cost of actual values

Copyright  2008 by CEBT Physical Storage  A tuple’s values are stored in the same order as order as attributes in schema (traditional databases) Benefit to reduce storage space Inefficient when schema evolution happens frequently – {name,age,univ}{Kate,53}(110)+{name,sex,univ}{Jim,MEN,UCLA}(111) -> {name,age,univ,sex}(1100)(1011) Problems – The cost of schema merging is prohibitively high Solutions – System must “interpret” the attribute names and values for each tuple at query access time – Page-interpret to divide data page into three region Page header, attribute interpreted area, data value area

Copyright  2008 by CEBT Physical Storagae  Physical storage design of FlexTable

Copyright  2008 by CEBT Experiment and Analysis  Setting T2390@1.86GHz, 1GB Ram, 160GB SATA T2390@1.86GHz FreeToGovCyc with 45,823 triples, 10,905 instances Yago with 1,000,000 triples, 152,362 instances  Analysis Analysis of triples import Analysis of storage cost Analysis of query performance

Copyright  2008 by CEBT Experiment and Analysis  Analysis of triples import  Analysis of Storage Cost

Copyright  2008 by CEBT Experiment and Analysis  Analysis of query performance Test queries – search all instances having predicates in the query – “SELECT ?x WHERE {?x pred1 ?val1. {?x pred2 ?val2} … {?x predN ?valN} } – Add predicates to the query pattern one by one Number of joins is increased by predicate sequence

Copyright  2008 by CEBT Conclusion  FlexTable RDF storage system using dynamic relation model Support efficient storage and query for DF data Features of the paper – Mechanism to support dynamic schema evolution – Novel page layout to avoid physical data rewritten – Comprehensive experiments Advantage of FlexTable – Less storage cost than state-of-the-art – Better time for triple import, storage, and query performance  Future work Extending FlexTable to column-oriented database
