QUERY-BASED DATA PRICING
Paraschos Koutris, Prasang Upadhyaya, Magdalena Balazinska, Bill Howe, Dan Suciu
University of Washington
PODS 2012

MOTIVATION
Data is increasingly bought and sold on the web.
Websites that sell data:
– AggData
– Xignite (financial data)
– Gnip (social media)
Data marketplace services:
– Windows Azure Marketplace (100+ datasets) [datamarket.azure.com]
– Infochimps (15,000 datasets)
Query-based pricing customized for buyers

CURRENT PRICING (1)
A fixed price for the whole dataset or for a specific set of views.
Example: CustomLists
– USA Business Database for $399
– addresses for $299
– Businesses in WA for $199
Limitations:
– Restaurants in WA?
– Businesses in cities with population >100,000?

CURRENT PRICING (2)
API subscriptions (Azure Marketplace, Infochimps)
– Allow queries over the data
– Pay by number of transactions (one transaction = one page of results)

ISSUES WITH PRICING
Buyers today need to buy a superset of the data they are interested in.
Sellers can't easily anticipate all possible queries that buyers might ask.
Solution: we need a more flexible pricing scheme, parameterized by queries.

OUTLINE
1. The Pricing Framework
2. The Pricing Formula
3. The Complexity of Pricing
4. Dichotomy and Algorithms for Selections

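THE PRICING FRAMEWORK
The seller defines price points (view-price pairs): \( S = \{(V_1, p_1), (V_2, p_2), \ldots\} \)
A buyer can buy any query \( Q \).
The system computes \( \mathrm{price}_D^S(Q) \).
[Diagram: the seller gives the price points \( (V_1, p_1), (V_2, p_2), \ldots \) to the pricing system, which holds the database \( D \); the buyer asks for \( Q(D) \) and is charged \( \mathrm{price}_D^S(Q) \).]
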
INSTANCE-BASED DETERMINACY
Definition. \( \mathbf{V} = V_1, \ldots, V_k \) determine \( Q \) given \( D \), denoted \( D \vdash \mathbf{V} \twoheadrightarrow Q \), if for all \( D' \): \( \mathbf{V}(D) = \mathbf{V}(D') \) implies \( Q(D) = Q(D') \).
Intuitively, "\( V_1, \ldots, V_k \) determine \( Q \)" means that \( Q(D) \) can be computed from \( V_1(D), \ldots, V_k(D) \) alone, without accessing the database instance \( D \).

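The definition can be tested by brute force on a toy schema. The sketch below assumes a single binary relation over a two-element domain, so that every candidate instance \( D' \) can be enumerated. The names `determines`, `view_y0` and `query_proj_x` are hypothetical, chosen only for illustration.

```python
from itertools import combinations, product

def all_instances(domain):
    """Enumerate every instance of one binary relation over `domain`."""
    tuples = list(product(domain, repeat=2))
    for r in range(len(tuples) + 1):
        for subset in combinations(tuples, r):
            yield frozenset(subset)

def determines(views, query, D, domain):
    """Check D |- V ->> Q: every D' with V(D') = V(D) must give Q(D') = Q(D)."""
    target = [v(D) for v in views]
    return all(
        query(Dp) == query(D)
        for Dp in all_instances(domain)
        if [v(Dp) for v in views] == target
    )

def view_y0(R):
    """Selection view: tuples of R whose second column equals 0."""
    return frozenset(t for t in R if t[1] == 0)

def query_proj_x(R):
    """Projection query Q(x) :- R(x, y)."""
    return frozenset(x for x, _ in R)

domain = (0, 1)
D1 = frozenset({(0, 0), (1, 0)})  # the view already reveals both x-values
D2 = frozenset({(0, 0)})          # x = 1 could still hide behind y = 1

print(determines([view_y0], query_proj_x, D1, domain))  # True
print(determines([view_y0], query_proj_x, D2, domain))  # False
```

The same view and query give different answers on `D1` and `D2`, which is why this notion of determinacy is tied to the instance.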
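ARBITRAGE-FREE
Suppose \( V \) determines \( Q \) and \( \mathrm{price}_D(Q) > \mathrm{price}_D(V) \). Then a buyer can:
1. buy \( V(D) \) for \( \mathrm{price}_D(V) \)
2. compute \( Q(D) \) from \( V(D) \)
3. thereby answer \( Q \) at some price \( p < \mathrm{price}_D(Q) \)
Axiom 1. Given \( D \), the pricing function \( \mathrm{price}_D(Q) \) is arbitrage-free if for all views \( V_1, \ldots, V_k \) and every query \( Q \) with \( D \vdash V_1, \ldots, V_k \twoheadrightarrow Q \):
\( \mathrm{price}_D(Q) \le \mathrm{price}_D(V_1) + \cdots + \mathrm{price}_D(V_k) \)
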
DISCOUNT-FREE
The intuition is that the price points represent discounts that the seller offers relative to the price of the whole database.
A pricing function is discount-free if it is maximal.
Axiom 2. The pricing function \( \mathrm{price}_D(Q) \) should not offer any additional discounts beyond the explicit price points defined by the seller.

EXAMPLE: ORIGAMI DATABASE
Database S:
Shape    Color    Picture
Swan     White    .....
Swan     Yellow   .....
Dragon   Yellow   .....
Car      Yellow   .....
Fish     White    .....
Price points:
View                                   Price
V1(x,y,z) :- S(x,y,z), x='Swan'        $2
V2(x,y,z) :- S(x,y,z), x='Dragon'      $2
V3(x,y,z) :- S(x,y,z), x='Car'         $2
V4(x,y,z) :- S(x,y,z), x='Fish'        $2
W1(x,y,z) :- S(x,y,z), y='White'       $3
W2(x,y,z) :- S(x,y,z), y='Yellow'      $3
W3(x,y,z) :- S(x,y,z), y='Red'         $3
– Get all dragon origami for $2
– Get all red origami for $3
What is the price of the entire database? Q(x,y,z) :- S(x,y,z)
Exhausts the active domain:
– V1, V2, V3, V4 determine Q: price(Q) ≤ $8
– W1, W2, W3 determine Q: price(Q) ≤ $9
price(Q) = $8

EXAMPLE: ORIGAMI DATABASE
R:
Shape    Instructions
Swan     fold, cut, fold…
Dragon   cut, fold, cut,…
S:
Shape    Color    Picture
Swan     White    .....
Swan     Yellow   .....
Dragon   Yellow   .....
Car      Yellow   .....
Fish     White    .....
T:
Color    PaperSpecs
White    15g/100, $10
Black    20g/100, $15
Selection prices shown on the slide: \( p(\sigma_{\text{shape}}) = \$99 \), \( p(\sigma_{\text{color}}) = \$50 \), \( p(\sigma_{\text{color}}) = \$5 \), \( p(\sigma_{\text{shape}}) = \$2 \)
What is the price of the full join? Q(x,y,z,u,v) :- R(x,u), S(x,y,z), T(y,v)

THE QUERY PRICING FORMULA
Given:
1. Price points \( S = \{(V_1, p_1), \ldots, (V_k, p_k)\} \)
2. Database instance \( D \)
3. Query \( Q \)
Compute: \( \mathrm{price}_D^S(Q) \)
Properties: (a) arbitrage-free, (b) discount-free, (c) \( \mathrm{price}_D^S(V_i) = p_i \)
If a pricing function with these properties exists, we say that the price points are consistent.
Method: consider all subsets of \( \mathbf{V} = \{V_1, \ldots, V_k\} \) that determine \( Q \) given \( D \), let \( C \) be the subset with the minimum total price, and define
\( p_D(Q) = \sum_{V_i \in C} p_i = \min \{ \sum_{V_i \in C'} p_i : C' \subseteq \mathbf{V},\ D \vdash C' \twoheadrightarrow Q \} \)
Theorem.
(a) The price points are consistent iff \( p_D(V_i) = p_i \) for every price point \( i = 1, \ldots, k \).
(b) \( \mathrm{price}_D^S(Q) = p_D(Q) \) is the unique arbitrage-free, discount-free pricing function that agrees with the price points.

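As an illustration of the method, the sketch below enumerates all subsets of the origami price points and keeps the cheapest one that determines Q(x,y,z) :- S(x,y,z). The determinacy test `determines_full_table` is a hypothetical stand-in. It assumes that each column's domain is exactly the set of values with a price point, and that the whole table is determined once the bought selections cover one full column. The function name `price` is likewise illustrative.

```python
from itertools import combinations

# Price points from the origami example: (column, value) -> price in dollars.
PRICE_POINTS = {
    ("Shape", "Swan"): 2, ("Shape", "Dragon"): 2,
    ("Shape", "Car"): 2, ("Shape", "Fish"): 2,
    ("Color", "White"): 3, ("Color", "Yellow"): 3, ("Color", "Red"): 3,
}

# Assumption: each column's domain is the set of values that have a price point.
DOMAINS = {
    "Shape": {"Swan", "Dragon", "Car", "Fish"},
    "Color": {"White", "Yellow", "Red"},
}

def determines_full_table(views):
    """Assumed oracle: the bought selections cover the whole domain of some column."""
    for column, domain in DOMAINS.items():
        bought = {value for col, value in views if col == column}
        if bought >= domain:
            return True
    return False

def price(price_points, determines):
    """Minimum total price over all subsets of views that determine the query."""
    views = list(price_points)
    best = None
    for r in range(len(views) + 1):
        for subset in combinations(views, r):
            if determines(subset):
                cost = sum(price_points[v] for v in subset)
                if best is None or cost < best:
                    best = cost
    return best  # None means that no subset determines the query

print(price(PRICE_POINTS, determines_full_table))  # 8
```

The search is exponential in the number of price points, so it is only meant to make the definition concrete on small inputs.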
DISCUSSION
If the result of \( Q_1 \) is always a subset of the result of \( Q_2 \), should \( Q_1 \) be priced lower than \( Q_2 \)? No!
Example:
– V(x,y) :- Fortune500(x,y)
– Q(x,y) :- Fortune500(x,y), StrongBuyRec(x)
– price(Q) >> price(V)
We ignore computation costs in our framework:
– Cost of computing query \( Q \)
– \( Q(D) = f(V(D)) \), but \( f \) can be hard to compute

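DETERMINACY
Definition. [Instance-independent] \( \mathbf{V} \) determines \( Q \), denoted \( \mathbf{V} \twoheadrightarrow Q \), if for all \( D, D' \): \( \mathbf{V}(D) = \mathbf{V}(D') \) implies \( Q(D) = Q(D') \). [Nash, Segoufin, Vianu '07]
\( \mathbf{V} \twoheadrightarrow Q \)
iff there exists a function \( f \) such that \( Q(D) = f(\mathbf{V}(D)) \) for all \( D \)
iff for every \( D \), we have \( D \vdash \mathbf{V} \twoheadrightarrow Q \)
Definition. [Instance-dependent] \( \mathbf{V} \) determines \( Q \) given \( D \), denoted \( D \vdash \mathbf{V} \twoheadrightarrow Q \), if for all \( D' \): \( \mathbf{V}(D') = \mathbf{V}(D) \) implies \( Q(D) = Q(D') \).
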
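The characterization through a function \( f \) suggests a simple finite check: group instances by their view image and verify that the query answer is constant within each group. The sketch below does this for one binary relation over a fixed two-element domain. That restriction is an assumption made so the enumeration terminates, since the actual definition quantifies over all instances. All function names are hypothetical.

```python
from itertools import combinations, product

def all_instances(domain):
    """Enumerate every instance of one binary relation over `domain`."""
    tuples = list(product(domain, repeat=2))
    for r in range(len(tuples) + 1):
        for subset in combinations(tuples, r):
            yield frozenset(subset)

def determines_on_all_instances(views, query, domain):
    """True iff Q(D) depends only on V(D) for every instance over `domain`."""
    answer_for_image = {}  # view image -> the query answer first seen for it
    for D in all_instances(domain):
        image = tuple(v(D) for v in views)
        answer = query(D)
        if answer_for_image.setdefault(image, answer) != answer:
            return False  # two instances agree on the views but not on Q
    return True

def sel_x(value):
    """Selection view sigma_{x = value} over the binary relation."""
    return lambda R: frozenset(t for t in R if t[0] == value)

def whole_relation(R):
    """Query Q(x, y) :- R(x, y)."""
    return frozenset(R)

domain = (0, 1)
print(determines_on_all_instances([sel_x(0), sel_x(1)], whole_relation, domain))  # True
print(determines_on_all_instances([sel_x(0)], whole_relation, domain))            # False
```

When the check succeeds, the dictionary `answer_for_image` is a tabulated version of the function \( f \) for this bounded setting.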
COMPLEXITY OF DETERMINACY
                                                                        V, Q are UCQ                 V, Q are CQ
Instance-independent, \( \mathbf{V} \twoheadrightarrow Q \)             Undecidable [NSV '07]        ?
Instance-dependent, \( D \vdash \mathbf{V} \twoheadrightarrow Q \):
  data complexity                                                       coNP-complete [this paper]   coNP-complete [this paper]
  combined complexity                                                   \( \Pi_2^p \) [this paper]   \( \Pi_2^p \) [this paper]
Open question: is the bound on the combined complexity tight?

COMPLEXITY OF PRICING
Corollary. Deciding whether \( \mathrm{price}_D^S(Q) \le k \) is:
– Combined complexity [input S, D]: \( \Sigma_2^p \)
– Data complexity [input D]: coNP-hard
Proposition. Pricing is at least as hard as determinacy.
How do we deal with the hardness of computation?

RESTRICTING PRICE POINTS TO SELECTIONS
A seller can specify only the prices of selection queries of the form \( \sigma_{R.X=a} \), that is, prices on column values.
The domain of each column is finite and known to buyers and sellers.
Price points on selections are how prices are set in most cases today.

DICHOTOMY THEOREM
Theorem. Assuming selection views only, for any conjunctive query \( Q \) without self-joins, one of the following holds (data complexity):
(a) computing \( \mathrm{price}_D^S(Q) \) is in PTIME
(b) checking whether \( \mathrm{price}_D^S(Q) \le k \) is NP-complete
PTIME:
– Q(x,y,z,u,v) :- R(x,u), S(x,y,z), T(y,v) [Chains]
– Q(x_1,…,x_k) :- R_1(x_1,x_2), …, R_k(x_k,x_1) [Cycles]
NP-complete:
– Q(x) :- R(x,y) [Projections]
– Q(x,y,z) :- R(x,y,z), S(x), T(y), U(z)

ALGORITHM FOR PTIME CASES
The algorithm uses a reduction to maximum flow:
– Edges of finite capacity represent price points
– A set of finite-capacity edges is a cut iff the corresponding views determine the query
Example: chain query Q(x,y) :- R(x), S(x,y), T(y)
Dom(X) = {a_1, a_2, a_3, a_4}, Dom(Y) = {b_1, b_2, b_3}
R:
X
a_1
a_2
S:
X     Y
a_1   b_1
a_2   b_2
a_2   b_2
a_3   b_2
a_4   b_1
T:
Y
b_1
b_3

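For an instance this small, the price can be cross-checked by exhaustive search, with no flow graph at all. The sketch below is not the max-flow reduction: it tries every subset of selection views on the example instance. Two assumptions are made and should be read as such. First, every selection view costs 1, since the slide gives no prices. Second, a view set determines the chain query when, for each pair (a, b), either all three atoms are true and revealed, or some false atom is revealed. All names are hypothetical.

```python
from itertools import combinations, product

DOM_X = ["a1", "a2", "a3", "a4"]
DOM_Y = ["b1", "b2", "b3"]
R = {"a1", "a2"}
# The slide lists (a2, b2) twice; as a set it appears once.
S = {("a1", "b1"), ("a2", "b2"), ("a3", "b2"), ("a4", "b1")}
T = {"b1", "b3"}

# Selection views as (relation, column, value).
VIEWS = ([("R", "X", a) for a in DOM_X] + [("S", "X", a) for a in DOM_X]
         + [("S", "Y", b) for b in DOM_Y] + [("T", "Y", b) for b in DOM_Y])
PRICES = {view: 1 for view in VIEWS}  # assumption: unit prices

def determines(bought):
    """Assumed determinacy test for Q(x,y) :- R(x), S(x,y), T(y)."""
    bought = set(bought)
    for a, b in product(DOM_X, DOM_Y):
        # Each entry: (atom is true in D, atom's truth value is revealed).
        atoms = [
            (a in R, ("R", "X", a) in bought),
            ((a, b) in S, ("S", "X", a) in bought or ("S", "Y", b) in bought),
            (b in T, ("T", "Y", b) in bought),
        ]
        if all(true for true, _ in atoms):
            # (a, b) is an answer: every atom must be revealed.
            if not all(known for _, known in atoms):
                return False
        elif not any(known and not true for true, known in atoms):
            # (a, b) is not an answer: some false atom must be revealed.
            return False
    return True

def brute_force_price():
    """Cheapest determining subset of views, found by trying all 2^14 subsets."""
    best = None
    for r in range(len(VIEWS) + 1):
        for subset in combinations(VIEWS, r):
            if determines(subset):
                cost = sum(PRICES[v] for v in subset)
                if best is None or cost < best:
                    best = cost
    return best

print(brute_force_price())  # 6
```

Under these assumptions one cheapest set consists of the selections R.X=a1, R.X=a4, T.Y=b1, T.Y=b2, S.Y=b1 and S.Y=b3. A flow-based algorithm would be expected to return the same total without enumerating subsets.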
FLOW GRAPH
[Figure: flow graph for the chain query Q(x,y) :- R(x), S(x,y), T(y) on the example instance, with nodes labeled a_1 to a_4 and b_1 to b_3 and edge groups labeled R, S and T, shown next to the tables R, S, T.]
A set of finite-capacity edges is a cut iff the corresponding views determine the query.

CONCLUSIONS
Summary:
– The seller sets prices for some views, while the system computes the price of any query
– Interesting application of query determinacy
– Complexity: dichotomy for CQs without self-joins
Future work:
– Pricing in the presence of updates
– How do we handle pricing for intractable queries?
– Connection between pricing and privacy
