Trust Me, I’m Partially Right: Incremental Visualization Lets Analysts Explore Large Datasets Faster Shengliang Dai.


Background: Queries over large-scale (petabyte) databases often mean waiting overnight for a result to come back. Scale costs time. Potential avenues of exploration are ignored because the costs of running, or even proposing, them are perceived to be too high. SampleAction: accelerate and open up the query process with incremental visualizations.

Problems: Analysts trade off speed of exploration and richness of questions against time and resources when running queries over vast amounts of data. The number and types of queries are still restricted. Incremental queries: analysts are accustomed to seeing precise figures rather than probabilistic results.

Goals: Make incremental analysis a viable technique. Complement the technical aspects of the back-end with an investigation of the interaction design. Visualize estimates on incremental data.

METHOD Hypotheses: Users working with incremental visualizations will be able to interpret the confidence intervals comfortably. This will allow them to act rapidly on their queries. Incremental results will allow users to carry out exploratory queries. SampleAction: simulates the experience of using a very large dataset by incrementally displaying results based on ever-larger portions of the dataset.

SampleAction simulates the effects of interacting with very large datasets while supporting iterative query interaction over large aggregates. Error bars show the confidence bounds around each estimate.
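The idea of refining an estimate as more of the dataset is read can be sketched in a few lines of Python. This is a minimal illustration, not the presented system's implementation: the class `IncrementalMean`, the function `incremental_estimates`, the batch size, and the synthetic data are all hypothetical, and the sketch assumes rows arrive in random order so that each prefix behaves like a uniform sample. It keeps a running mean and variance (Welford's method) and reports a normal-approximation error bar after each batch.

```python
import math
import random


class IncrementalMean:
    """Running mean with a normal-approximation confidence interval."""

    def __init__(self, z=1.96):
        self.z = z          # 1.96 corresponds to roughly 95% confidence
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0      # running sum of squared deviations (Welford)

    def update(self, value):
        """Fold one more sampled value into the running statistics."""
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)

    def half_width(self):
        """Half the width of the error bar; infinite until two values are seen."""
        if self.n < 2:
            return math.inf
        variance = self._m2 / (self.n - 1)
        return self.z * math.sqrt(variance / self.n)


def incremental_estimates(rows, batch_size):
    """Yield (rows_seen, estimate, half_width) after each full batch of rows."""
    est = IncrementalMean()
    for i, value in enumerate(rows, start=1):
        est.update(value)
        if i % batch_size == 0:
            yield est.n, est.mean, est.half_width()


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for a large table column (an assumption for the demo).
    data = [rng.gauss(100.0, 15.0) for _ in range(10_000)]
    for n, mean, hw in incremental_estimates(data, 2_000):
        print(f"{n:>6} rows: {mean:.2f} +/- {hw:.2f}")
```

Each yielded tuple corresponds to one redraw of a bar and its error bar; a front-end could stop consuming the generator as soon as the analyst judges the interval narrow enough.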

SampleAction shows how the bounds are changing over time

Bounded uncertainty based on samples
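One common way to obtain such bounds, offered here as an assumption rather than as the formula the system necessarily uses, is the normal-approximation confidence interval for a mean estimated from a uniform random sample of n rows. Here \bar{x}_n is the sample mean, s_n the sample standard deviation, and z_{\alpha/2} the standard normal quantile (about 1.96 for 95% confidence):

```latex
\[
\bar{x}_n \;\pm\; z_{\alpha/2}\,\frac{s_n}{\sqrt{n}}
\]
```

Because the half-width scales with 1/\sqrt{n}, quadrupling the number of sampled rows roughly halves the error bar.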

The Back-End Database: Industrial DBMSs do not currently support incremental queries of the type required. This initial evaluation was therefore constrained to deploying sampleAction on a database small enough to query interactively.

USER STUDY Participants: Bob (server operations), Allan (online game reporting), Sam (Twitter analytics).

ANALYSIS The value of seeing a first record fast: Users found value in getting a quick response to their queries. Sam and Allan realized they had entered an incorrect query and were able to repair it quickly by adding appropriate filters.

ANALYSIS New Behaviors around Data: Users moved from viewing data in a static, non-interactive form to real exploration of the dataset. If the first few samples had not converged, users would decide whether it was worth the trade-off of waiting longer, sometimes checking the convergence view to decide.

ANALYSIS Difficulties with Error Bar Convergence. Big variance: Past literature on visualizing uncertainty has emphasized visualizations that fit the entire uncertainty range on screen; these were not sufficient for some of the bounds. Noisy values: Incremental systems can be slowed by datasets that are not clean. Solution: use additional domain knowledge during execution, such as discarding values that fall outside meaningful constraints.
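Discarding values that fall outside meaningful constraints can be sketched as a filter applied to rows before they reach the estimator. This is a hypothetical illustration: the function names, the row-as-dictionary representation, and the `latency_ms` field with its range are assumptions, not details from the presentation.

```python
def within_constraints(row, constraints):
    """Return True if every constrained field lies inside its allowed range."""
    for field, (low, high) in constraints.items():
        value = row.get(field)
        # Missing values and values outside [low, high] are treated as noise.
        if value is None or not (low <= value <= high):
            return False
    return True


def clean_stream(rows, constraints):
    """Yield only the rows that satisfy the domain constraints."""
    for row in rows:
        if within_constraints(row, constraints):
            yield row


if __name__ == "__main__":
    # Assumed domain knowledge: a latency is between 0 and 60 seconds.
    constraints = {"latency_ms": (0, 60_000)}
    rows = [{"latency_ms": 120}, {"latency_ms": -5}, {"latency_ms": 10**9}]
    print(list(clean_stream(rows, constraints)))
```

Filtering before aggregation keeps a handful of corrupt values from inflating the sample variance, which is what would otherwise keep the error bars from converging.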

ANALYSIS Non-Expert Views of Confidence Intervals: Error bars are sometimes confusing for users. For example, the interval would shrink toward a converged value. Two very different adjacent columns might have identical confidence intervals.

Implications Users seem to be able to interpret confidence intervals, which opens opportunities for using uncertainty visualization tied to probabilistic datasets.

Limitations of Incremental Visualization: There are some kinds of queries that are structurally difficult. Outlier values: for example, there is no probabilistic answer to “which item has the highest value”. Table joins: when joining against a rare or unique key, using samples from the joined tables may not work at all.
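The join limitation can be illustrated with a small simulation. This is a sketch under stated assumptions: both tables share the same set of unique keys, each table is sampled independently with the same Bernoulli rate, and the function names are hypothetical. A key survives the join only if it is drawn in both samples, so with rate r only about r² of the true matches remain.

```python
import random


def sample_keys(keys, rate, rng):
    """Keep each key independently with probability `rate` (Bernoulli sampling)."""
    return {k for k in keys if rng.random() < rate}


def sampled_join_size(left_keys, right_keys, rate, seed=0):
    """Join two independently sampled tables on a unique key and count matches."""
    rng = random.Random(seed)
    left = sample_keys(left_keys, rate, rng)
    right = sample_keys(right_keys, rate, rng)
    return len(left & right)


if __name__ == "__main__":
    keys = range(100_000)   # every key appears exactly once in each table
    full = len(set(keys) & set(keys))
    sampled = sampled_join_size(keys, keys, rate=0.01)
    print(f"full join: {full} rows; join of two 1% samples: {sampled} rows")
```

With 1% samples the expected number of surviving matches is around 0.01% of the full join, far too few rows to support a stable estimate.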

Future Work: Representations of confidence that eliminate the downsides of error bars. More types of visualizations. More types of data analysis.

Conclusion While the concept of approximate queries has been known for some time, the visualization implications have not been explored with users. Showing the utility of these approximations will encourage further research on both the front- and back-ends of these systems. HCI researchers have also been limited in their ability to explore these concepts. Simulating large data systems may help them explore realistic front-ends without needing to build full-scale computation back-ends.