Fusion Tables
Takeaways: A relatively “light” paper on a real-world, public-facing system. Clearly useful to some people and organizations, with many users… A companion paper in SOCC covers implementation details; most are standard adaptations of existing techniques.
Target Users are Data Enthusiasts: also called “factivists”, these are people who know nothing about DBMSs. They need to find good data, do meaningful data integration, and tell compelling stories. Allow them to upload, collaborate on, and visualize data, and to combine it with existing datasets. This requires understanding the semantics of the datasets.
Goals of Fusion Tables: Goal 1: An easy-to-use database system integrated with the web, supporting common workflows (easy upload, sharing, visualizations, publishing). Goal 2: Fusion with other datasets; find others and combine them with yours.
Any thoughts about the first goal? Who are the target users for tools like this? We saw some examples… People who want to store and study small datasets, but need something more powerful than Excel: joins, selects, aggregates (visualizations). E.g., journalists, scientists, governments, non-profits.
Let’s talk about each of these steps. Data Acquisition: Fusion Tables primarily works on CSVs. It doesn’t require a schema in advance; schemas are inferred automatically. Is this sufficient in practice?
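To make "automatically infer schemas" concrete, here is a minimal sketch of one way to guess column types from a CSV. It is not the Fusion Tables implementation; the function names `infer_type` and `infer_schema`, the three-type lattice (int, float, text), and the sample data are all assumptions made for illustration.

```python
import csv
import io

def infer_type(values):
    """Return the narrowest of "int", "float", "text" that fits every non-empty value."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "text"  # nothing to go on, fall back to text
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue  # this type does not fit, try the next wider one
    return "text"

def infer_schema(csv_text):
    """Map each header name to an inferred column type (assumes a header row)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # Transpose rows into columns; with no data rows every column is empty.
    columns = list(zip(*data)) if data else [()] * len(header)
    return {name: infer_type(col) for name, col in zip(header, columns)}

# Made-up sample data, for illustration only.
sample = "name,count,ratio\nalpha,3,0.5\nbeta,10,1.25\n"
schema = infer_schema(sample)
print(schema)
```

Even this toy version hints at why inference alone may not be sufficient: dates, units, ragged rows, and missing headers all need decisions that a type guess cannot make.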
Not really! Studies state that data acquisition accounts for 80% of the development time and cost in data science. What if the data is not in CSV, but in JSON or XML? How would you clean it then? Thoughts?
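One possible answer for the JSON case, sketched under assumptions: flatten nested records into dotted column names so the result can be treated like a CSV. The helpers `flatten` and `json_to_rows` and the sample document are hypothetical, and the sketch assumes the input is a JSON array of objects with no nested lists.

```python
import json

def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

def json_to_rows(json_text):
    """Turn a JSON array of objects into (columns, rows), filling gaps with ""."""
    records = [flatten(r) for r in json.loads(json_text)]
    columns = []
    for r in records:
        for k in r:
            if k not in columns:
                columns.append(k)  # keep first-seen column order
    rows = [[r.get(c, "") for c in columns] for r in records]
    return columns, rows

# Made-up sample document; the second record lacks "geo.lon".
doc = '[{"name": "a", "geo": {"lat": 1.5, "lon": 2.5}}, {"name": "b", "geo": {"lat": 3.5}}]'
columns, rows = json_to_rows(doc)
print(columns)
print(rows)
```

The missing `geo.lon` value in the second row shows that format conversion only moves the cleaning problem; it does not remove it.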
Other Recent Work: There has been some recent work on cleaning data automatically, with humans in the loop (there’s other work on this as well): http://vimeo.com/19185801
Drawbacks? If you make a mistake, it is hard to go back. Requires expertise on the part of users. Can you think of other ways to do this acquisition without programming? Examples? Highlight regions vertically? Use semantic knowledge?
Sharing: Fusion Tables supports sharing and collaboration on tables. What issues come up when multiple users are collaborating on tables? Need for coordination: conflicts. What issues come up when there are visualizations that derive from tables?
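As an illustration of what "conflicts" means here, the following is a minimal sketch of a row-level three-way merge: two collaborators edit copies of the same table, and a row is flagged when both changed it differently. This is a generic technique, not something the paper describes; the function name `three_way_merge`, the dict-of-rows table representation, and the sample values are assumptions.

```python
def three_way_merge(base, mine, theirs):
    """Merge two edited copies of a table against their common base.

    Each table is a dict mapping a row key to a row dict.
    Returns (merged, conflicts), where conflicts lists the conflicting row keys.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(mine) | set(theirs)):
        b, m, t = base.get(key), mine.get(key), theirs.get(key)
        if m == t:
            result = m          # both sides agree (includes both deleting)
        elif m == b:
            result = t          # only "theirs" changed this row
        elif t == b:
            result = m          # only "mine" changed this row
        else:
            conflicts.append(key)
            result = b          # keep the base row until someone resolves it
        if result is not None:
            merged[key] = result
    return merged, conflicts

# Made-up tables: both users edit row 1, only "theirs" edits row 2 and adds row 3.
base = {1: {"name": "a", "value": 10}, 2: {"name": "b", "value": 20}}
mine = {1: {"name": "a", "value": 11}, 2: {"name": "b", "value": 20}}
theirs = {1: {"name": "a", "value": 12}, 2: {"name": "b", "value": 25},
          3: {"name": "c", "value": 30}}
merged, conflicts = three_way_merge(base, mine, theirs)
print(conflicts)
```

Detecting conflicts per row is a design choice; per-cell detection would report fewer conflicts but needs more bookkeeping, and neither says anything about visualizations derived from the table.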
Other issues: What if users make mistakes while collaborating? What can we do then?
Our recent work: DataHub, a “GitHub for data”
Eventually… hopefully: a powerful versioning query language
Users of Fusion Tables: Examples…
Goals of Fusion Tables: Goal 1: An easy-to-use database system integrated with the web, supporting common workflows (easy upload, sharing, visualizations, publishing). Goal 2: Fusion with other datasets; find others and combine them with yours.
Fusion Tables Implementation: Simple search and suggestion based on existing data. What other use cases could you see for web-data integration (i.e., finding data on the web to “mesh” with your data)?
Other Use Cases: Row or column augmentation; join augmentation; missing-value augmentation; accuracy augmentation.
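A small sketch of two of these use cases, column augmentation and missing-value augmentation, assuming a matching dataset has already been found on the web. The function `augment`, the list-of-dicts table format, and all values below are hypothetical and only illustrate the idea.

```python
def augment(table, reference, key, columns):
    """Fill in `columns` for each row of `table` from the matching row of `reference`.

    Rows are matched on `key`. A value is copied only when the row's own
    value is missing (None or ""), so existing data is never overwritten.
    """
    index = {row[key]: row for row in reference}
    result = []
    for row in table:
        match = index.get(row[key], {})
        new_row = dict(row)  # leave the input table untouched
        for col in columns:
            if new_row.get(col) in (None, ""):
                new_row[col] = match.get(col)
        result.append(new_row)
    return result

# Made-up data: "mine" has a missing value and no "group" column at all.
mine = [{"id": "a", "value": 10}, {"id": "b", "value": None}, {"id": "c", "value": 7}]
found = [{"id": "a", "value": 99, "group": "x"}, {"id": "b", "value": 20, "group": "y"}]
augmented = augment(mine, found, key="id", columns=["value", "group"])
print(augmented)
```

Row `c` has no match in the found dataset, so its new column stays empty; in practice, the hard part is finding the reference data and deciding that its key really means the same thing as yours.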
For example
First source of data: Many open data initiatives: governments, non-profits, collaborations and academic institutions (e.g., the UCI repository).