Data Generation for Application-Specific Benchmarking
Y.C. Tay
National University of Singapore


Background
benchmarks help research and development; the dominant database benchmark is TPC
SIGMOD Conference 2011
  research track: 87 papers, 17 use TPC (20%)
  industry track: 14 papers, 6 use TPC (43%)
Problem: a few TPC benchmarks, but many, many applications
TPC becoming irrelevant?

Vision
a paradigm shift in database benchmark development
from: top-down, committee consensus, domain-specific package (data generator + queries)
to: bottom-up, community collaboration, application-specific tools (dataset scaling)
  synthetically scale up/down application data
  application already has queries

Challenge
Dataset Scaling Problem: Given a set of relational tables D and a scale factor s, generate a database state D’ that is similar to D but s times its size.
E.g. What would DBLP look like in 2020?
s > 1
  why: scalability testing
  difficulty: copying doesn’t work (e.g. social network data)
s < 1
  why: application testing
  difficulty: sampling is not straightforward (similar to web crawling)
s = 1
  why: privacy/proprietary reasons
  difficulty: encryption is risky
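To illustrate why copying does not work for s > 1, here is a minimal sketch, not taken from the talk. It assumes a toy friendship edge list with users numbered from 0; the function names `naive_copy_scale` and `count_components` are hypothetical. Scaling by copying produces s disjoint replicas, so a structural property such as the number of connected components changes even though every table is exactly s times larger.

```python
def naive_copy_scale(edges, num_users, s):
    """Scale an edge list by an integer factor s using s disjoint copies.

    Each copy re-labels user u as u + k * num_users, so no edge ever
    links two different copies.
    """
    scaled = []
    for k in range(s):
        offset = k * num_users
        scaled.extend((u + offset, v + offset) for u, v in edges)
    return scaled


def count_components(edges, num_nodes):
    """Count connected components with a small union-find."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(x) for x in range(num_nodes)})


# Toy data (assumed): 4 users forming one connected chain.
original = [(0, 1), (1, 2), (2, 3)]
scaled = naive_copy_scale(original, num_users=4, s=3)

original_components = count_components(original, 4)
scaled_components = count_components(scaled, 12)
```

In this toy case the original network is one community, while the copied version is three isolated communities, which a real network three times the size would be unlikely to resemble.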

Challenge
Dataset Scaling Problem: Given a set of relational tables D and a scale factor s, generate a database state D’ that is similar to D but s times its size.
similar: by query results
difficulty: data correlation
E.g. database = {photos, owners, comments, tags}
inter-column correlation
  foreign keys
  age and gender
  user likely to comment on own photos
  gardener likely to tag photos of flowers
inter-row correlation
  photo dimensions (same camera)
  tags used by gardener (“rose”, “bee”, “beetle”)
inter-column + inter-row
  2 users comment on each other’s photos (social network)
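One way to read "similar by query results" is to run the same query on D and on D’ and compare the answers. The sketch below is an assumption-laden illustration, not the talk's method: the two-table schema, the column names, and the helper names `build_toy_db` and `self_comment_rate` are all hypothetical. It measures one inter-column correlation from the list above, namely how often a comment is written by the owner of the photo.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with an assumed photos/comments schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE photos (
            photo_id INTEGER PRIMARY KEY,
            owner_id INTEGER
        );
        CREATE TABLE comments (
            comment_id INTEGER PRIMARY KEY,
            photo_id   INTEGER REFERENCES photos(photo_id),
            user_id    INTEGER
        );
        """
    )
    conn.executemany("INSERT INTO photos VALUES (?, ?)", [(1, 10), (2, 20)])
    conn.executemany(
        "INSERT INTO comments VALUES (?, ?, ?)",
        [(1, 1, 10), (2, 1, 20), (3, 2, 20), (4, 2, 30)],
    )
    return conn


def self_comment_rate(conn):
    """Fraction of comments whose author also owns the commented photo."""
    row = conn.execute(
        """
        SELECT AVG(CASE WHEN c.user_id = p.owner_id THEN 1.0 ELSE 0.0 END)
        FROM comments AS c
        JOIN photos AS p ON c.photo_id = p.photo_id
        """
    ).fetchone()
    return row[0]


conn = build_toy_db()
rate = self_comment_rate(conn)
```

A scaled dataset D’ that assigned comment authors independently of photo owners would typically return a much lower rate for this query, which is the sense in which correlation makes scaling difficult.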

Challenge
scaling a social network:
  D (empirical dataset)
    extract: use join query
  G (empirical social graph)
    scale by s: use graph theory (#edges? #triangles? path lengths?)
  G~ (synthetic social graph)
    inject: any database theory?
  D~ (synthetic dataset)
E.g. how to inject into D~
  * correlation from G~ indicating X and Y comment on each other’s photos
  * correlation between Alice’s birthday and wall posts by her classmates
  * correlation among tags used by bird watchers
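The extract step and the graph statistics named above can be sketched as follows. This is an illustration under stated assumptions, not the talk's implementation: the input is assumed to be the result of joining comments with photos, given as (commenter, photo owner) pairs, and the function names `mutual_comment_graph` and `graph_stats` are hypothetical. An undirected edge is kept only when two users comment on each other's photos. How to scale G by s and inject the result back into the tables is the open problem and is not attempted here.

```python
from itertools import combinations


def mutual_comment_graph(comment_pairs):
    """Build undirected edges {X, Y} where X and Y comment on each other's photos.

    comment_pairs: iterable of (commenter_id, photo_owner_id) tuples,
    assumed to come from a comments-photos join.
    """
    directed = {(a, b) for a, b in comment_pairs if a != b}  # drop self-comments
    return {frozenset((a, b)) for a, b in directed if (b, a) in directed}


def graph_stats(edges):
    """Return the number of edges and triangles of an undirected graph."""
    neighbours = {}
    for edge in edges:
        a, b = tuple(edge)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    triangles = 0
    # Brute force over node triples: fine for a toy graph, too slow for real data.
    for a, b, c in combinations(sorted(neighbours), 3):
        if b in neighbours[a] and c in neighbours[a] and c in neighbours[b]:
            triangles += 1
    return {"edges": len(edges), "triangles": triangles}


# Toy join result (assumed): users 1, 2, 3 all comment on each other's photos;
# user 4 comments on user 1's photo without a reply; user 1 also self-comments.
pairs = [(1, 2), (2, 1), (2, 3), (3, 2), (1, 3), (3, 1), (4, 1), (1, 1)]
G = mutual_comment_graph(pairs)
stats = graph_stats(G)
```

Statistics like these give targets that a synthetic graph G~ could be checked against after scaling.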

Challenge
Attribute Value Correlation Problem for Social Networks: Suppose a dataset D records data from a social network. How do the social interactions affect the correlation among attribute values in D?
  * online social networks are here to stay
  * their datasets can be huge
  * their datasets have commercial value
where is the database theory?

Vision (for the next 25 years): a paradigm shift from a top-down design of domain-specific benchmarks by committee consensus to a bottom-up collaborative development of tools for application-specific dataset scaling
Challenges:
  Dataset Scaling Problem
  Attribute Value Correlation Problem for Social Networks
Payoff:
  commercial value in dataset scaling tools
  new database research areas (social network data, schema design, vertical/horizontal partition, query optimization, business intelligence, …)
Start:
  UpSizeR (http://www.comp.nus.edu.sg/~upsizer)
    single-server version
    Hadoop version
