Published by James Dalton. Modified over 9 years ago.
1
Authors: Anant Bhardwaj, Amol Deshpande, Aaron J. Elmore, David Karger, Sam Madden, Aditya Parameswaran, Harihar Subramanyam, Eugene Wu, Rebecca Zhang.
Type: Demonstration paper.
Presented by: Dardan Xhymshiti, Fall 2015.
2
Organizations and companies collect data from various sources, such as financial transactions, server logs and sensor data. Teams and individuals inside the company want to extract knowledge from these datasets using their home-grown tools, company tools and different programming languages. In doing so they modify the datasets (normalization, cleaning) and exchange them back and forth.
Problem: collaborative data analysis is hampered by heterogeneity of tools, diversity in the skill sets of individuals and teams, and difficulties in storing, retrieving and versioning the exchanged datasets.
3
The authors motivate their work with two examples.
Example 1: Expert analysis. Members of a web advertising team want to extract knowledge from unstructured ad-click data. They write a script that extracts the task-relevant information from the data and store the result as a separate dataset, which is shared across the team.
Problems: Different team members may be more comfortable with a particular tool (R, Python, Awk) and use that tool to clean, normalize and summarize the dataset. More proficient members use multiple languages for different purposes: modeling in R, visualization in JavaScript, string extraction in Awk, etc.
4
The team members track dataset versions by encoding them in file names: table_v1, table_v1.1 …. This scheme becomes difficult to manage once there are a hundred dataset versions. The final result…:
6
Example 2: Novice analysis. The coach and players of a football team want to study, query and visualize their performance over the last season. They will probably store their dataset in a tool like Excel, which has limited support for querying, cleaning, analysis and versioning.
Query example: the coach wants to find all the games in which a star player was absent.
Most of the team's players are not proficient with data analysis tools such as SQL or scripting languages.
Solution: point-and-click apps. These apps let users load, query and visualize data, and share the results with other users, without much effort.
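To make the coach's query concrete, here is a minimal sketch of what it could look like in SQL, run from Python with the standard-library sqlite3 module. The tables `games` and `appearances`, their columns, and the sample rows are all hypothetical and not taken from the paper. The sketch also assumes that "absent" means the player has no appearance row for that game.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a made-up schema and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE games (game_id INTEGER PRIMARY KEY, opponent TEXT);
        CREATE TABLE appearances (game_id INTEGER, player TEXT);
        INSERT INTO games VALUES (1, 'Lions'), (2, 'Bears'), (3, 'Hawks');
        INSERT INTO appearances VALUES
            (1, 'Star'), (1, 'Ann'), (2, 'Ann'), (3, 'Star');
        """
    )
    return conn


def games_without_player(conn, player):
    """Return ids of games for which `player` has no appearance row."""
    rows = conn.execute(
        """
        SELECT g.game_id
        FROM games AS g
        WHERE NOT EXISTS (
            SELECT 1
            FROM appearances AS a
            WHERE a.game_id = g.game_id AND a.player = ?
        )
        ORDER BY g.game_id
        """,
        (player,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(games_without_player(conn, "Star"))
```

Even this small query needs an anti-join (NOT EXISTS), which illustrates why a point-and-click query builder is attractive for users who do not know SQL.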
7
These teams are unable to perform collaborative data analysis because they lack:
1. Flexible data sharing and versioning support.
2. Point-and-click apps that help novice users do collaborative data analysis.
3. Support for a number of data analysis languages and tools.
A tool for collaborative analysis could be used, for example, by geneticists who want to share and collaborate on genome data with other research groups.
8
To address these problems, the paper presents DataHub, a unified data management and collaboration platform for hosting, sharing, combining and collaboratively analyzing datasets. DataHub has three main components:
1. Flexible data storage, sharing and versioning capabilities.
a) Keeps track of all versions of a dataset.
b) Enables collaborative analysis while allowing datasets to be stored and retrieved at various stages of analysis.
2. App ecosystem for easy querying, cleaning and visualization.
a) Distill: a data-cleaning-by-example tool.
b) DataQ: a query builder that lets users build SQL queries by direct manipulation in a graphical user interface. The interface is suitable for non-technical users.
c) Dviz: a data visualization tool.
9
3. Language-agnostic hooks for external data analysis. For team members who are proficient in different languages and libraries, such as Python, R, Scala and Octave, DataHub enables collaborative data analysis by using Apache Thrift to translate between these languages and the datasets in DataHub.
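The first component listed above (storage, sharing and versioning) can be illustrated with a toy sketch. This is not DataHub's storage model or API. The `VersionStore` class, its method names and the sample rows are hypothetical, chosen only to show how recording a parent for each version replaces file names like table_v1 and table_v1.1.

```python
import hashlib


class VersionStore:
    """Toy in-memory store: every version is an immutable snapshot with a parent."""

    def __init__(self):
        # version id -> (parent id or None, snapshot of rows, commit message)
        self._versions = {}

    def commit(self, rows, message, parent=None):
        """Store a snapshot of `rows` and return its version id."""
        if parent is not None and parent not in self._versions:
            raise KeyError(f"unknown parent version: {parent}")
        snapshot = tuple(tuple(row) for row in rows)
        # The id is derived from the parent and the content (hypothetical scheme).
        payload = repr((parent, snapshot)).encode("utf-8")
        version_id = hashlib.sha256(payload).hexdigest()[:12]
        self._versions[version_id] = (parent, snapshot, message)
        return version_id

    def checkout(self, version_id):
        """Return a fresh copy of the rows stored for `version_id`."""
        return [list(row) for row in self._versions[version_id][1]]

    def history(self, version_id):
        """Return commit messages from `version_id` back to the root version."""
        messages = []
        current = version_id
        while current is not None:
            parent, _, message = self._versions[current]
            messages.append(message)
            current = parent
        return messages


if __name__ == "__main__":
    store = VersionStore()
    raw = store.commit([["ad1", " 3 "], ["ad2", "5"]], "raw import")
    cleaned = store.commit([["ad1", 3], ["ad2", 5]], "cleaned click counts", parent=raw)
    print(store.history(cleaned))
    print(store.checkout(raw))
```

Because each version names its parent, any stage of the analysis can be retrieved later and its lineage reconstructed, without relying on a naming convention.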