Extracting Schema from Semistructured Data

Nestorov, Abiteboul, and Motwani at Stanford

Perspective
This paper is new work. More than the details, look at the issues: What are their goals? What does this contribute? Do they attain their goals? Why do we need this?

Sample Database
[Figure: sample database graph. Objects 7 and 1 have edges labeled Name, Entree, Manager, and Hours; a Company edge leads to an object with Name and Phone edges. Atomic values shown: 24, “The Keg”, “Steak”, “Jim”, “Burger King”, “Fries”, “AA+ Management”.]
Schema = Types

Where does semistructured data come from?
- Document collections
- Biological data
- HTML
- Bibtex, etc.

Who Needs Structure?
For the user:
- To know what queries are possible
- Browsing the database
- Type checking
Storage:
- Data layout to facilitate querying, e.g. place similar objects on the same page
- Indexes

Who Needs Structure? (2)
Query optimization:
- All the relational query optimization tricks
- Maintaining statistics per data type: cardinality, # of pages, index cardinality, etc.
- Estimating the cost/size of the result of query plans
- Efficient processing of path expressions
Other?

Their Goals
Approximate typing (schema extraction) of semistructured data.
Example (little lie) typing program:
Restaurant(X) :- Link(X,A,B,C) & Name-atom(A) & Entrée-atom(B) & Manager-atom(C)

Outline of the Algorithm
Given a database:
1. Find the perfect typing program. This typing might be too large, so we:
2. Coalesce similar types into k types.
3. Assign a type to objects in the database.
4. Deduce meaningful names for the types.

Typing
The two base relations:
- link(FromObj, ToObj, Label)
- atomic(Obj, Value)
These are the only two EDBs of the typing program.
[Figure: object 7 with Name, Entree, and Manager edges to atomic objects 8, 9, and 10, holding “The Keg”, “Steak”, and “Jim”.]
Restaurant(X) :- link(X,A,Name) & atomic(A, Ap) & link(X,B,Entrée) & atomic(B, Bp) & link(X,C,Manager) & atomic(C,Cp)

Typing (2)
Restaurant(X) :- link(X,A,Name) & atomic(A, Ap) & link(X,B,Entrée) & atomic(B, Bp) & link(X,C,Manager) & atomic(C,Cp)
EDB: link(7, 8, Name), atomic(8, “The Keg”)
IDB (intensional relations): defined by the typing program
Extension of an IDB: Restaurant(1)
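As a rough illustration of how the Restaurant rule reads against the two base relations, here is a minimal Python sketch. Only link(7, 8, Name) and atomic(8, “The Keg”) come from the slide; the remaining facts, object 20, and the function name `restaurant` are assumptions made for the example.

```python
# EDB facts in the slide's shapes: link(FromObj, ToObj, Label) and atomic(Obj, Value).
# Only link(7, 8, Name) and atomic(8, "The Keg") are on the slide; the rest are assumed.
link = {
    (7, 8, "Name"), (7, 9, "Entree"), (7, 10, "Manager"),
    (20, 21, "Name"),  # hypothetical object with no Entree or Manager link
}
atomic = {(8, "The Keg"), (9, "Steak"), (10, "Jim"), (21, "McD")}


def restaurant(link, atomic):
    """Extension of Restaurant(X): objects with Name, Entree and Manager links to atoms."""
    atoms = {obj for obj, _ in atomic}
    required = {"Name", "Entree", "Manager"}
    labels_to_atoms = {}
    for src, dst, label in link:
        if dst in atoms:  # the rule requires atomic(A, Ap) for each link target
            labels_to_atoms.setdefault(src, set()).add(label)
    # X qualifies when every required label is present among its links to atoms
    return {x for x, labels in labels_to_atoms.items() if required <= labels}


print(restaurant(link, atomic))  # {7}
```

Object 20 is left out because the rule body is a conjunction: a missing Entree or Manager link is enough to fail it.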

Restriction on Types
Arbitrary type programs are not allowed. Rules for type_i(X) can only be built from the following:
1. link(Y, X, c) & type_j(Y)
2. link(X, Y, c) & type_j(Y)
3. link(X, Y, c) & atomic(Y, Z)
Types can only express local characteristics.
The collection of typed links is a set (2 entrées = 1 entrée).

Semantics of Type Program
The greatest fixpoint of a datalog program on a database defines the semantics of the typing.
Fixpoint = extensions of IDBs + EDBs.
Least fixpoint: start with a model of only the EDBs; at each step, union into the model anything new.

Greatest Fixpoint
1. Start with a model of the EDBs and all possible extensions.
2. At each step, remove any extensions not derived by applying the rules.
Least fixpoint doesn’t work here: person and firm are defined in terms of each other, so starting from empty extensions neither rule can ever fire.
person(X) :- link(X, Y, is-manager-of) & firm(Y) & link(X, Yp, name) & atomic(Yp, Z)
firm(X) :- link(X, Y, is-managed-by) & person(Y) &
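The difference between the two fixpoints can be seen on a tiny example. The sketch below is an illustration only: the facts and object names (p1, f1, n1) are made up, and because the firm rule is cut off on the slide, the code uses just its visible conjuncts.

```python
# Hypothetical EDB: one person managing one firm, which is managed by that person.
link = {
    ("p1", "f1", "is-manager-of"),
    ("p1", "n1", "name"),
    ("f1", "p1", "is-managed-by"),
}
atomic = {("n1", "Jim")}
objects = {"p1", "f1", "n1"}


def step(person, firm):
    """Apply both rules once to the current candidate extensions."""
    atoms = {o for o, _ in atomic}
    new_person = {
        x for x in objects
        if any(s == x and l == "is-manager-of" and d in firm for s, d, l in link)
        and any(s == x and l == "name" and d in atoms for s, d, l in link)
    }
    # Only the visible part of the slide's firm rule is modeled here.
    new_firm = {
        x for x in objects
        if any(s == x and l == "is-managed-by" and d in person for s, d, l in link)
    }
    return new_person, new_firm


def fixpoint(person, firm):
    """Iterate the rules until the extensions stop changing."""
    while True:
        nxt = step(person, firm)
        if nxt == (person, firm):
            return person, firm
        person, firm = nxt


least = fixpoint(set(), set())                    # start from nothing, add what is derived
greatest = fixpoint(set(objects), set(objects))   # start from everything, drop what is not derived

print(least)     # (set(), set())
print(greatest)  # ({'p1'}, {'f1'})
```

Starting from empty extensions, person needs a firm and firm needs a person, so nothing is ever derived. Starting from all objects and pruning leaves exactly the mutually supporting pair.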

Imperfect Types
Defect: a measure of how well an object fits a given type. Defect = excess + deficit.
[Figure: type1 drawn with typed links manager0, name0, entree0. Object 7 has Manager, Name, and Entree edges to objects 4, 5, 6 (“The Keg”, “Steak”, “Jim”). Object 11 has # seats, Name, and Entree edges to objects 8, 9, 10 (“McD”, “biscuit”, 53).]
Defect is 2 for assigning 11 to type1.

Imperfect Types (2)
Excess: # of EDB facts not used to validate any object’s type.
Deficit: minimum # of ground facts that need to be added to make all type derivations possible.
[Figure: the same objects 7 and 11 with their Manager, Name, Entree, and # seats edges.]
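To make the count of 2 concrete, here is a simplified sketch. It assumes a type can be summarized as a set of outgoing edge labels and computes excess and deficit for a single object, whereas the slide's definitions are stated over the whole database. The function name `defect` and the label sets are illustrative.

```python
def defect(object_labels, type_labels):
    """Return (defect, excess, deficit) for one object against one type.

    Simplifying assumption: both are plain sets of outgoing edge labels.
    """
    excess = len(object_labels - type_labels)   # links the type does not account for
    deficit = len(type_labels - object_labels)  # links that would have to be added
    return excess + deficit, excess, deficit


type1 = {"Manager", "Name", "Entree"}
obj7 = {"Manager", "Name", "Entree"}
obj11 = {"# seats", "Name", "Entree"}

print(defect(obj7, type1))   # (0, 0, 0)
print(defect(obj11, type1))  # (2, 1, 1)
```

Under this reading, object 11 has one unused link (# seats) and one missing link (Manager), which matches the defect of 2 quoted for assigning 11 to type1.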

Perfect Typing Program (Stage 1)
Gore.

Multiple Roles
[Figure: objects with Name, Country, Team, and Movie edges. Atomic values shown: France, Rocky Horror, Scholes, Man Utd, Bleu, Star Trek, Binoche, Cantona, England.]
How hard is it to choose the types for the cover? How do you quantify atomization?

Clustering (Stage 2)
Define a distance function between two types. A first approximation is the difference between the bodies of their rule definitions.
t1 :- a0, b2
t2 :- a0, b1
t3 :- b2, b1, b3
d(t1, t2) = 2
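One reading of "difference between the bodies" that reproduces d(t1, t2) = 2 is the size of the symmetric difference of the two rule bodies. The sketch below assumes that reading; the slide does not spell out the exact formula.

```python
# Rule bodies from the slide, each treated as a set of typed links.
types = {
    "t1": {"a0", "b2"},
    "t2": {"a0", "b1"},
    "t3": {"b2", "b1", "b3"},
}


def d(a, b):
    """Distance = number of typed links that appear in exactly one of the two bodies."""
    return len(types[a] ^ types[b])


print(d("t1", "t2"))  # 2
print(d("t1", "t3"))  # 3
print(d("t2", "t3"))  # 3
```

With this measure t1 and t2 are the closest pair, so they would be the first candidates to coalesce.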

A Better Function
Include some measure of the weight of a type (# of objects of that type): t2 ~> t1
Some desirable properties:
- increasing in d: coalesce similar types
- decreasing in w1: compensate for ‘expected noise’
- increasing in w2: maintain types with large extents
Choosing what to coalesce is hard!

Recasting (Stage 3)
Assign each object to types within the k types formed in stage 2.
(Optional) Choose a better value of k and rerun step 2.
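The slide does not say how the assignment is made. One plausible sketch, assuming each type is summarized by a set of edge labels and each object goes to whichever types it fits with the smallest defect, is below. The selection rule, the type names, and the object ids (r1, r2, c1) are all assumptions for illustration.

```python
def defect(object_labels, type_labels):
    """Simplified defect: labels present on only one side (excess + deficit)."""
    return len(object_labels ^ type_labels)


def recast(objects, types):
    """Map each object id to the names of the types with the smallest defect."""
    assignment = {}
    for obj, labels in objects.items():
        scores = {name: defect(labels, body) for name, body in types.items()}
        best = min(scores.values())
        # Ties are kept, so an object may end up with more than one type.
        assignment[obj] = sorted(n for n, s in scores.items() if s == best)
    return assignment


# Hypothetical k = 2 types left after clustering, and three objects to recast.
types = {"restaurant": {"Name", "Entree", "Manager"}, "company": {"Name", "Phone"}}
objects = {
    "r1": {"Name", "Entree", "Manager"},
    "r2": {"# seats", "Name", "Entree"},
    "c1": {"Name", "Phone"},
}

print(recast(objects, types))
# {'r1': ['restaurant'], 'r2': ['restaurant'], 'c1': ['company']}
```

Keeping ties rather than forcing a single winner leaves room for objects that play multiple roles.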

Results
Heavy use of synthetic data: create a type definition and generate instances that are perturbed randomly in some way.
What do the graphs show? Are the data sets realistic?

Conclusions
Paper problems:
- The algorithm isn’t completely explained.
- Many comments are not elaborated.
But it’s an important problem and a good first approach.
