Y.C. Tay National University of Singapore Data Generation for Application-Specific Benchmarking.


Background
  benchmarks help research and development
  the dominant database benchmark is TPC
  SIGMOD Conference 2011
    research track: 87 papers, 17 use TPC (20%)
    industry track: 14 papers, 6 use TPC (43%)
  Problem: a few TPC benchmarks but many, many applications
  TPC becoming irrelevant?

Vision
  a paradigm shift in database benchmark development
  from
    top-down
    committee consensus
    domain-specific package (data generator + queries)
  to
    bottom-up
    community collaboration
    application-specific tools (dataset scaling)
      synthetically scale up/down application data
      application already has queries

Challenge
  Dataset Scaling Problem: Given a set of relational tables D and a scale factor s, generate a database state D’ that is similar to D but s times its size.
  E.g. What would DBLP look like in 2020?
  s > 1
    why: scalability testing
    difficulty: copying doesn’t work (e.g. social network data)
  s < 1
    why: application testing
    difficulty: sampling not straightforward (similar to web crawling)
  s = 1
    why: privacy/proprietary reasons
    difficulty: encryption is risky
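
One way to see why copying fails for s > 1 on social network data is the sketch below. It is only an illustration, not the speaker's method: the edge list `friends` and the helpers `replicate` and `count_components` are hypothetical names. Copying multiplies the number of rows by s, but the result is s disconnected networks rather than one larger one.

```python
from collections import deque

def replicate(edges, s):
    """Naively scale an edge list by making s disjoint copies (integer s >= 1)."""
    scaled = []
    for copy in range(s):
        for u, v in edges:
            # Rename users per copy so the copies share no users.
            scaled.append((f"{u}#{copy}", f"{v}#{copy}"))
    return scaled

def count_components(edges):
    """Count connected components of an undirected graph with breadth-first search."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen = set()
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return components

# Hypothetical toy friendship data: one connected network of four users.
friends = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("carol", "dave")]
scaled = replicate(friends, 3)

original_components = count_components(friends)
scaled_components = count_components(scaled)
```

The scaled edge list is three times the size, yet no user in one copy can reach a user in another, so path lengths and community structure do not resemble those of a network that actually grew.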

Challenge
  Dataset Scaling Problem: Given a set of relational tables D and a scale factor s, generate a database state D’ that is similar to D but s times its size.
  similar: by query results
  difficulty: data correlation
  E.g. database = {photos, owners, comments, tags}
  inter-column correlation
    foreign keys
    age and gender
    user likely to comment on own photos
    gardener likely to tag photos of flowers
  inter-row correlation
    photo dimensions (same camera)
    tags used by gardener (“rose”, “bee”, “beetle”)
  inter-column + inter-row
    2 users comment on each other’s photos (social network)
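
Judging similarity by query results can be sketched as follows, using Python's built-in sqlite3 module. The schema, the toy rows, and the choice of query (the fraction of comments made on the commenter's own photo) are assumptions made for illustration; the slides do not prescribe them. Two candidate datasets of twice the size are compared against the original: one preserves the correlation, the other does not.

```python
import sqlite3

# Hypothetical probe query: fraction of comments made on the commenter's own photo.
SELF_COMMENT_RATE = """
SELECT 1.0 * SUM(c.user_id = p.owner_id) / COUNT(*)
FROM comments AS c
JOIN photos AS p ON c.photo_id = p.photo_id
"""

def build(photos, comments):
    """Load (photo_id, owner_id) and (user_id, photo_id) rows into an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE photos (photo_id INTEGER PRIMARY KEY, owner_id INTEGER)")
    db.execute("CREATE TABLE comments (user_id INTEGER, photo_id INTEGER)")
    db.executemany("INSERT INTO photos VALUES (?, ?)", photos)
    db.executemany("INSERT INTO comments VALUES (?, ?)", comments)
    return db

def self_comment_rate(db):
    return db.execute(SELF_COMMENT_RATE).fetchone()[0]

# D: empirical dataset (toy data); 3 of 4 comments are on the commenter's own photo.
d = build([(1, 10), (2, 20)],
          [(10, 1), (10, 2), (20, 2), (20, 2)])

# Two candidates for D' with s = 2: same sizes, different correlation.
photos_x2 = [(1, 10), (2, 20), (3, 30), (4, 40)]
d_good = build(photos_x2, [(10, 1), (10, 2), (20, 2), (20, 2),
                           (30, 3), (30, 4), (40, 4), (40, 4)])
d_bad = build(photos_x2, [(10, 2), (20, 1), (30, 4), (40, 3),
                          (10, 3), (20, 4), (30, 3), (40, 4)])

rate_d = self_comment_rate(d)
rate_good = self_comment_rate(d_good)
rate_bad = self_comment_rate(d_bad)
```

Both candidates have exactly twice as many photos and comments as D, but only the first returns the same answer to the probe query, which is the sense in which row counts alone do not make D’ similar to D.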

Challenge
  scaling a social network:
    D (empirical dataset)
      extract: use join query
    G (empirical social graph)
      scale by s: use graph theory (#edges? #triangles? path lengths?)
    ~G (synthetic social graph)
      inject
    ~D (synthetic dataset)
  E.g. how to inject into ~D
    * correlation from ~G indicating X and Y comment on each other’s photos
    * correlation between Alice’s birthday and wall posts by her classmates
    * correlation among tags used by bird watchers
  any database theory?
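
The extract step and two of the graph statistics named above can be sketched as follows. This is a minimal illustration under assumed names: the photos/comments schema, the query `MUTUAL_COMMENTERS`, and the helper `count_triangles` are hypothetical. It takes "X and Y comment on each other’s photos" as the definition of an edge in G, and it does not attempt the scaling or injection steps.

```python
import sqlite3

# Join query: X comments on a photo owned by Y, and Y comments on a photo owned by X.
# The condition c1.user_id < c2.user_id reports each pair once and drops self-pairs.
MUTUAL_COMMENTERS = """
SELECT DISTINCT c1.user_id, c2.user_id
FROM comments AS c1
JOIN photos   AS p1 ON c1.photo_id = p1.photo_id
JOIN comments AS c2 ON c2.user_id = p1.owner_id
JOIN photos   AS p2 ON c2.photo_id = p2.photo_id
WHERE p2.owner_id = c1.user_id AND c1.user_id < c2.user_id
"""

def count_triangles(edges):
    """Count triangles: each one is seen once per edge, so divide the total by 3."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE photos (photo_id INTEGER PRIMARY KEY, owner_id INTEGER)")
db.execute("CREATE TABLE comments (user_id INTEGER, photo_id INTEGER)")
# Toy data: users 1, 2, 3 all comment on each other's photos; user 4 comments one way only.
db.executemany("INSERT INTO photos VALUES (?, ?)",
               [(101, 1), (102, 2), (103, 3), (104, 4)])
db.executemany("INSERT INTO comments VALUES (?, ?)",
               [(1, 102), (2, 101), (2, 103), (3, 102), (1, 103), (3, 101), (4, 101)])

edges = sorted(db.execute(MUTUAL_COMMENTERS).fetchall())
num_edges = len(edges)
num_triangles = count_triangles(edges)
```

User 4 never receives a comment back, so the join produces no edge for that user; statistics such as these are what a scaled graph ~G would presumably need to reproduce at s times the size.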

Challenge
  Attribute Value Correlation Problem for Social Networks: Suppose a dataset D records data from a social network. How do the social interactions affect the correlation among attribute values in D?
  * online social networks are here to stay
  * their datasets can be huge
  * their datasets have commercial value
  where is the database theory?

Vision (for the next 25 years):
  a paradigm shift from a top-down design of domain-specific benchmarks by committee consensus to a bottom-up collaborative development of tools for application-specific dataset scaling
Challenges:
  Dataset Scaling Problem
  Attribute Value Correlation Problem for Social Networks
Payoff:
  commercial value in dataset scaling tools
  new database research areas (social network data, schema design, vertical/horizontal partition, query optimization, business intelligence, …)
Start:
  UpSizeR ( )
    single-server version
    Hadoop version