Published by Chris Foulger. Modified over 9 years ago.
1
Data Generation for Application-Specific Benchmarking
Y.C. Tay
National University of Singapore
2
Background
  benchmarks help research and development; the dominant database benchmark is TPC
  SIGMOD Conference 2011
    research track: 87 papers, 17 use TPC (20%)
    industry track: 14 papers, 6 use TPC (43%)
  Problem: a few TPC benchmarks but many, many applications
  TPC becoming irrelevant?
3
Vision
  a paradigm shift in database benchmark development
  from
    top-down
    committee consensus
    domain-specific package (data generator + queries)
  to
    bottom-up
    community collaboration
    application-specific tools (dataset scaling)
      synthetically scale up/down application data
      application already has queries
4
Challenge
  Dataset Scaling Problem: Given a set of relational tables D and a scale factor s, generate a database state D’ that is similar to D but s times its size.
  E.g. What would DBLP look like in 2020?
  s > 1
    why: scalability testing
    difficulty: copying doesn’t work (e.g. social network data)
  s < 1
    why: application testing
    difficulty: sampling not straightforward (similar to web crawling)
  s = 1
    why: privacy/proprietary reasons
    difficulty: encryption is risky
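To make the s < 1 difficulty concrete, here is a minimal Python sketch of why sampling is not straightforward. It assumes a hypothetical two-table schema (photos and comments) and toy data that are not from the slides, and it samples each table independently. The function names are illustrative only.

```python
import random

def naive_sample(rows, s, rng):
    """Pick round(s * len(rows)) rows uniformly at random, one table at a time."""
    return rng.sample(rows, round(s * len(rows)))

def dangling(child_rows, parent_rows, fk, pk):
    """Return child rows whose foreign key has no matching parent row."""
    parent_keys = {p[pk] for p in parent_rows}
    return [c for c in child_rows if c[fk] not in parent_keys]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical toy data (an assumption): 100 photos, 5 comments per photo.
photos = [{"photo_id": i, "owner_id": i % 10} for i in range(100)]
comments = [{"comment_id": j, "photo_id": j % 100} for j in range(500)]

s = 0.5
small_photos = naive_sample(photos, s, rng)
small_comments = naive_sample(comments, s, rng)

# Comments that now point at photos missing from the sample.
orphans = dangling(small_comments, small_photos, "photo_id", "photo_id")

print(len(small_photos), len(small_comments), len(orphans))
```

Both sampled tables have the right size, yet some sampled comments reference photos that were not sampled, so the result is not a valid database state. A scaling tool has to follow the foreign keys rather than treat tables independently.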
5
Challenge
  Dataset Scaling Problem: Given a set of relational tables D and a scale factor s, generate a database state D’ that is similar to D but s times its size.
  similar: by query results
  difficulty: data correlation
  E.g. database = {photos, owners, comments, tags}
  inter-column correlation
    foreign keys
    age and gender
    user likely to comment on own photos
    gardener likely to tag photos of flowers
  inter-row correlation
    photo dimensions (same camera)
    tags used by gardener (“rose”, “bee”, “beetle”)
  inter-column + inter-row
    2 users comment on each other’s photos (social network)
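One way to read "similar by query results" is to run the same query on D and on D’ and compare the answers. The slides do not define the measure, so the sketch below is an assumption: it uses Python's sqlite3 module, a hypothetical photos/comments schema, and one illustrative query (the fraction of comments that users leave on their own photos). The second dataset has twice the rows (s = 2) but its commenters were assigned without regard to ownership.

```python
import sqlite3

def build(photos, comments):
    """Load toy rows into an in-memory SQLite database (hypothetical schema)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE photos (photo_id INTEGER PRIMARY KEY, owner_id INTEGER)")
    con.execute("CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, "
                "photo_id INTEGER, user_id INTEGER)")
    con.executemany("INSERT INTO photos VALUES (?, ?)", photos)
    con.executemany("INSERT INTO comments VALUES (?, ?, ?)", comments)
    return con

# Fraction of comments written by the owner of the commented photo.
OWN_PHOTO_FRACTION = """
SELECT AVG(c.user_id = p.owner_id)
FROM comments c JOIN photos p ON c.photo_id = p.photo_id
"""

def own_photo_fraction(con):
    return con.execute(OWN_PHOTO_FRACTION).fetchone()[0]

# D: users mostly comment on their own photos.
d = build(
    photos=[(1, 10), (2, 20)],
    comments=[(1, 1, 10), (2, 1, 20), (3, 2, 20), (4, 2, 20)],
)

# D': twice the size, but commenters ignore photo ownership.
d_prime = build(
    photos=[(1, 10), (2, 20), (3, 30), (4, 40)],
    comments=[(1, 1, 20), (2, 1, 30), (3, 2, 30), (4, 2, 40),
              (5, 3, 40), (6, 3, 10), (7, 4, 10), (8, 4, 40)],
)

frac_d = own_photo_fraction(d)
frac_d_prime = own_photo_fraction(d_prime)
print(frac_d, frac_d_prime, abs(frac_d - frac_d_prime))
```

The table sizes scale correctly, but the query answer drifts because the correlation between commenter and owner was lost. That gap is the kind of dissimilarity a dataset scaling tool would need to keep small.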
6
Challenge
  scaling a social network:
    D (empirical dataset)
      extract: use join query
    G (empirical social graph)
      scale by s: use graph theory (#edges? #triangles? path lengths?)
    ~G (synthetic social graph)
      inject
    ~D (synthetic dataset)
  E.g. how to inject into ~D
    * correlation from ~G indicating X and Y comment on each other’s photos
    * correlation between Alice’s birthday and wall posts by her classmates
    * correlation among tags used by bird watchers
  any database theory?
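The extract step can be sketched as a join query followed by simple graph statistics. This is an illustration under assumptions, not the method of any particular tool: the schema, the toy rows, and the edge definition (X and Y are linked when each has commented on a photo owned by the other) are hypothetical, and the code uses Python's sqlite3 module.

```python
import sqlite3

# An edge (x, y) with x < y exists when x commented on a photo owned by y
# and y commented on a photo owned by x.
RECIPROCAL_EDGES = """
SELECT DISTINCT c1.user_id AS x, p1.owner_id AS y
FROM comments c1
JOIN photos p1 ON p1.photo_id = c1.photo_id
JOIN comments c2 ON c2.user_id = p1.owner_id
JOIN photos p2 ON p2.photo_id = c2.photo_id AND p2.owner_id = c1.user_id
WHERE c1.user_id < p1.owner_id
ORDER BY x, y
"""

def extract_graph(con):
    """Return the social graph G as a sorted list of (x, y) edges."""
    return con.execute(RECIPROCAL_EDGES).fetchall()

def count_triangles(edges):
    """Count triangles by intersecting neighbour sets; each is counted once."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    for u, v in edges:  # u < v by construction
        total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE photos (photo_id INTEGER PRIMARY KEY, owner_id INTEGER)")
con.execute("CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, "
            "photo_id INTEGER, user_id INTEGER)")
# Hypothetical toy data: user i owns photo i.
con.executemany("INSERT INTO photos VALUES (?, ?)",
                [(1, 1), (2, 2), (3, 3), (4, 4)])
con.executemany("INSERT INTO comments VALUES (?, ?, ?)", [
    (1, 2, 1), (2, 1, 2),   # users 1 and 2 comment on each other's photos
    (3, 3, 2), (4, 2, 3),   # users 2 and 3
    (5, 3, 1), (6, 1, 3),   # users 1 and 3
    (7, 1, 4),              # user 4 comments on user 1's photo, not reciprocated
])

edges = extract_graph(con)
print(edges, len(edges), count_triangles(edges))
```

Edge and triangle counts are two of the statistics a scaled graph ~G would be expected to reproduce at the new size. Path lengths could be added with a breadth-first search over the same adjacency sets.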
7
Challenge
  Attribute Value Correlation Problem for Social Networks: Suppose a dataset D records data from a social network. How do the social interactions affect the correlation among attribute values in D?
    * online social networks are here to stay
    * their datasets can be huge
    * their datasets have commercial value
  where is the database theory?
8
Vision (for the next 25 years):
  a paradigm shift
    from a top-down design of domain-specific benchmarks by committee consensus
    to a bottom-up collaborative development of tools for application-specific dataset scaling
Challenges:
  Dataset Scaling Problem
  Attribute Value Correlation Problem for Social Networks
Payoff:
  commercial value in dataset scaling tools
  new database research areas (social network data, schema design, vertical/horizontal partition, query optimization, business intelligence, …)
Start:
  UpSizeR (http://www.comp.nus.edu.sg/~upsizer)
    single-server version
    Hadoop version