Remco Chang | Tufts University
Big Data Visual Analytics: A User-Centric Approach
Remco Chang, Assistant Professor, Computer Science, Tufts University

Presentation transcript:

1/38 Big Data Visual Analytics: A User-Centric Approach
Remco Chang, Assistant Professor, Computer Science, Tufts University

2/38 Financial Fraud – A Case for Visual Analytics
- Financial institutions like Bank of America have legal responsibilities to report all suspicious wire transaction activities
  - Money laundering, supporting terrorist activities, etc.
- Data size: approximately 200,000 transactions per day (73 million transactions per year)

3/38 Financial Fraud – A Case Study for Visual Analytics
- Problems:
  - Automated approaches can only detect known patterns
  - Bad guys are smart: patterns are constantly changing
- Previous methods:
  - 10 analysts monitoring and analyzing all transactions
  - Using SQL queries and spreadsheet-like interfaces
  - Limited time scale (2 weeks)

4/38 WireVis: A Visual Analytics Approach
- Heatmap View (Accounts to Keywords Relationship)
- Multiple Temporal View (Relationships over Time)
- Search by Example (Find Similar Accounts)
- Keyword Network (Keyword Relationships)

5/38 to 9/38 Interactive Visualization Systems (one system highlighted per slide)
- 5/38 Political Simulation – Agent-based analysis. Jordan Crouser. R. Chang et al., Two Visualization Tools for Analysis of Agent-Based Simulations in Political Science. IEEE CG&A, 2012.
- 6/38 Bridge Maintenance – Exploring inspection reports. R. Chang et al., An Interactive Visual Analytics System for Bridge Management, EuroVis, 2010.
- 7/38 Biomechanical Motion – Interactive motion comparison. R. Chang et al., Interactive Coordinated Multiple-View Visualization of Biomechanical Motion Data, IEEE Vis (TVCG) 2009.
- 8/38 Interactive Metric Learning – DisFunction: learn a model from projection. Eli Brown. R. Chang et al., Dis-function: Learning Distance Functions Interactively, IEEE VAST, 2012.
- 9/38 High-D Data Exploration – iPCA: Interactive PCA. R. Chang et al., iPCA: An Interactive System for PCA-based Visual Analytics, EuroVis 2009.

10/38 Good Lessons Learned
- Analyst behavior
  - 90% of time on Exploratory Data Analysis (EDA)
  - 10% on confirmation (CDA)
- Big data analysis == fast hypothesis testing
- High interactivity is key
- Users can wait to find the exact answer

11/38 “Tough” Lessons Learned
- Careful engineering is not enough…
- A new paradigm is necessary to support this type of interactive analysis.

12/38 Problem Statement
- Visualization on commodity hardware
- Large data in a data warehouse

13/38 Related Work
- (See the DSIA workshop proceedings)
  - Organized with Carlos Scheidegger (Arizona), Jeff Heer (UW), Danyel Fisher (Microsoft Research)
- Specialized distributed or parallel databases
  - Tableau, Spotfire, Vertica, MonetDB, HadoopDB, etc.
- Pre-compiled data structures
  - Nanocube (Scheidegger), imMens** (Liu, Heer), Map-D** (Mostak)
- Sampling and approximate queries
  - BlinkDB (Agarwal, Berkeley), DICE (Kamat, Nandi), Ordering guarantees (Kim et al.)
- Pre-fetching
  - Xmdv (Doshi, Ward), Time-series (Chan, Hanrahan), Query prediction (Cetintemel, Zdonik)
- Others
  - Streaming (Fisher), Optimization (Wu)

14/38 Problem Statement
- Problem: the data is too big to fit into the memory of a personal computer
  - Note: ignoring various database technologies (OLAP, column-store, NoSQL, array-based, etc.)
- Goal: guarantee a result set for a user's query within X seconds
  - Based on HCI research, the upper bound for X is 10 seconds
  - Ideally, we would like to get it down to 1 second or less
- Method: trade accuracy and storage (caching) to minimize latency (user wait time)

15/38 Our Approach: Predictive Pre-Fetching
- In collaboration with MIT (Leilani Battle, Mike Stonebraker)
- ForeCache: three-tiered architecture
  - Thin client (visualization)
  - Backend (array-based database)
  - Fat middleware
    - Prediction algorithms
    - Storage architecture
    - Cache management (eviction strategies)
R. Chang et al., Dynamic Prefetching of Data Tiles for Interactive Visualization. To appear in SIGMOD 2016.

16/38 Example of Prediction Algorithm
- Two-tiered approach using Markov models
  - First tier: predict which “phase” of analysis the user is in
  - Second tier: given a “phase”, use a phase-specific Markov model to predict the user's next actions
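The two-tiered idea on this slide can be sketched as below. This is a minimal illustration, not the ForeCache implementation: it assumes first-order Markov models built from transition counts, and the phase names (`foraging`, `sensemaking`), action names, and class names are hypothetical.

```python
from collections import Counter, defaultdict


class MarkovModel:
    """First-order Markov model built from transition counts."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, sequence):
        # Count every (previous item -> next item) transition.
        for prev, nxt in zip(sequence, sequence[1:]):
            self.counts[prev][nxt] += 1

    def predict(self, current):
        # Most frequent follower of `current`, or None if never seen.
        followers = self.counts.get(current)
        if not followers:
            return None
        return followers.most_common(1)[0][0]


class TwoTierPredictor:
    """Tier 1 predicts the analysis phase; tier 2 predicts the next action
    with a Markov model specific to that phase."""

    def __init__(self):
        self.phase_model = MarkovModel()
        self.action_models = defaultdict(MarkovModel)

    def train(self, trace):
        """trace: list of (phase, action) pairs in the order they occurred."""
        self.phase_model.train([phase for phase, _ in trace])
        # Train each phase-specific model on uninterrupted runs of that phase.
        run_phase, run = None, []
        for phase, action in trace:
            if phase != run_phase and run:
                self.action_models[run_phase].train(run)
                run = []
            run_phase = phase
            run.append(action)
        if run:
            self.action_models[run_phase].train(run)

    def predict(self, current_phase, current_action):
        # Fall back to the current phase if the phase model has no data.
        phase = self.phase_model.predict(current_phase) or current_phase
        return phase, self.action_models[phase].predict(current_action)


# Hypothetical interaction trace.
trace = [
    ("foraging", "pan_right"), ("foraging", "pan_right"), ("foraging", "zoom_in"),
    ("sensemaking", "pan_left"), ("sensemaking", "pan_left"), ("sensemaking", "zoom_out"),
    ("foraging", "pan_right"), ("foraging", "pan_right"),
]
predictor = TwoTierPredictor()
predictor.train(trace)
prediction = predictor.predict("foraging", "pan_right")
```

The predicted action would then tell the middleware which neighboring data tile to fetch ahead of the user's request.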

18/38 Prediction Algorithms
- General idea:
  - Lots of “experts”
    - Represent different prediction algorithms
    - Image based
    - Statistics based
    - Interaction based
    - etc.
  - One “manager”
    - Chooses which expert to listen to
  - Iterate
    - Manager builds “trust” in the experts
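A small sketch of the expert/manager loop follows. The slides do not say how trust is updated, so this assumes a simple multiplicative penalty for wrong predictions; the two toy experts, the one-dimensional block numbering, and all names are hypothetical.

```python
class Manager:
    """Follows the currently most trusted expert and penalizes wrong experts."""

    def __init__(self, experts, penalty=0.5):
        self.experts = experts                       # name -> callable(history) -> block id
        self.trust = {name: 1.0 for name in experts}
        self.penalty = penalty                       # assumed update rule, not from the slides
        self.last = {}

    def predict(self, history):
        # Ask every expert, then listen to the one with the highest trust.
        self.last = {name: expert(history) for name, expert in self.experts.items()}
        best = max(self.trust, key=self.trust.get)
        return self.last[best]

    def observe(self, actual_block):
        # Reduce trust in every expert whose last prediction was wrong.
        for name, guess in self.last.items():
            if guess != actual_block:
                self.trust[name] *= self.penalty


# Two toy "interaction based" experts over block ids laid out on a line.
experts = {
    "stay_put": lambda history: history[-1],
    "momentum": lambda history: history[-1] + (history[-1] - history[-2]),
}

manager = Manager(experts)
history = [10, 11]
for actual in [12, 13]:            # the user keeps moving one block to the right
    manager.predict(history)       # this is the block the middleware would prefetch
    manager.observe(actual)
    history.append(actual)

next_block = manager.predict(history)
```

After two iterations the manager trusts `momentum` more than `stay_put`, so it prefetches the block that continues the user's movement.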

19/38 to 24/38 (animation frames; figures not reproduced)
- Iteration 0: user requests data block 13
- Iteration 1

25/38 Study Results
- Using a simple Google Maps-like interface
- 18 users explored the NASA MODIS dataset
- Tasks include “find 4 areas in Europe that have a snow coverage index above 0.5”

26/38 Summary
- Big data visual analytics requires fast interactive data systems.
  - A growing subfield in DB, VIS, and ML
- Our approach:
  1. Predictive pre-fetching
  2. Three-tiered system
  3. Pre-fetching based on the “expert-manager” approach
  4. Use the “explain” trick to handle cache misses
  5. Guarantees response time, but not data quality
- Backbone (invisible) to data analysts

27/38 Questions? cs.tufts.edu

28/38 Worst Case Scenario: Cache Miss
- User requests data block 52

29/38 Cache Miss
- How to guarantee response time when there's a cache miss?
- Trick: the ‘EXPLAIN’ command
  - Usage: explain select * from myTable;
  - Returns the query plan and a cost estimate for running the query.
R. Chang et al., Dynamic Reduction of Result Sets for Interactive Visualization, IEEE Big Data Workshop on Visualization.
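To make the EXPLAIN trick concrete, the sketch below extracts the cost estimate from a plan line written in PostgreSQL's text EXPLAIN format (`cost=startup..total rows=N width=W`). The sample line and its numbers are made up for illustration, and the helper names and byte budget are hypothetical; a real middleware would obtain the plan text by running EXPLAIN against the database.

```python
import re

# One plan node in PostgreSQL's text EXPLAIN format:
#   <node description>  (cost=<startup>..<total> rows=<rows> width=<bytes per row>)
PLAN_NODE = re.compile(r"\(cost=(\d+\.\d+)\.\.(\d+\.\d+) rows=(\d+) width=(\d+)\)")


def estimate_from_plan(plan_text):
    """Return the planner's estimates for the first (top-level) plan node."""
    match = PLAN_NODE.search(plan_text)
    if match is None:
        raise ValueError("no cost estimate found in plan text")
    startup, total, rows, width = match.groups()
    return {
        "startup_cost": float(startup),
        "total_cost": float(total),
        "rows": int(rows),
        # Estimated result size: row count times average row width in bytes.
        "est_bytes": int(rows) * int(width),
    }


def within_budget(estimate, max_bytes):
    """True if the estimated result is small enough to run the query unmodified."""
    return estimate["est_bytes"] <= max_bytes


# Illustrative plan line (values invented for this example).
SAMPLE_PLAN = "Seq Scan on mytable  (cost=0.00..458.00 rows=10000 width=244)"
estimate = estimate_from_plan(SAMPLE_PLAN)
run_as_is = within_budget(estimate, max_bytes=1_000_000)
```

Because EXPLAIN returns without executing the query, the middleware can decide whether to run the query unmodified before paying for it.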

30/38 Example EXPLAIN Output from SciDB
- Example SciDB output of (a query similar to) EXPLAIN SELECT * FROM earthquake:
    [("[pPlan]: schema earthquake <datetime:datetime NULL DEFAULT null, magnitude:double NULL DEFAULT null, latitude:double NULL DEFAULT null, longitude:double NULL DEFAULT null> [x=1:6381,6381,0,y=1:6543,6543,0] bound start {1, 1} end {6381, 6543} density 1 cells chunks 1 est_bytes e+09 ")]
- Annotations:
  - The schema lists the four attributes in the table ‘earthquake’ (datetime, magnitude, latitude, longitude)
  - Note that the dimensions of this array (table) are 6381x6543
  - This query will touch data elements from (1, 1) to (6381, 6543), totaling 41,750,883 cells
  - Estimated size of the returned data is e+09 bytes (~8GB)
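The number of cells a query touches follows directly from the `bound start {...} end {...}` part of the plan. A minimal sketch, assuming the bounds are inclusive and that the plan text has the shape shown on the slide (the function name is hypothetical):

```python
import re
from math import prod


def cells_touched(plan_text):
    """Count the cells inside the inclusive bounding box reported by the plan."""
    match = re.search(r"bound start \{([^}]*)\} end \{([^}]*)\}", plan_text)
    if match is None:
        raise ValueError("no bounding box found in plan text")
    start = [int(value) for value in match.group(1).split(",")]
    end = [int(value) for value in match.group(2).split(",")]
    # Each dimension spans (end - start + 1) cells; multiply across dimensions.
    return prod(e - s + 1 for s, e in zip(start, end))


plan_fragment = "bound start {1, 1} end {6381, 6543}"
cell_count = cells_touched(plan_fragment)
```

Multiplying the cell count by the per-cell attribute size would give a rough byte estimate to compare with the plan's own `est_bytes` field.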

31/38 to 33/38 Other Examples (EXPLAIN output screenshots not reproduced)
- Oracle 11g Release 1 (11.1)
- MySQL 5.0
- PostgreSQL 7.3.4

34/38 Reduction Strategies
- If the query is estimated to be too expensive to execute, the middleware dynamically “modifies” the query by using:
  - Aggregation
    - In SciDB, this operation is carried out as regrid (scale_factorX, scale_factorY)
  - Sampling
    - In SciDB, uniform sampling is carried out as bernoulli (query, percentage, randseed)
  - Filtering
    - Currently, the filtering criteria are user specified: where (clause)
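The three reduction strategies can be illustrated on an in-memory grid. The sketch below is plain Python, not SciDB's `regrid` or `bernoulli` operators: the function names, the mean aggregate, and the rule for picking a scale factor from a byte budget are assumptions made for this example.

```python
import math
import random


def choose_scale_factor(est_bytes, budget_bytes):
    """Smallest integer f such that aggregating f-by-f blocks fits the budget."""
    if est_bytes <= budget_bytes:
        return 1
    # Aggregating f-by-f blocks shrinks the result by roughly a factor of f * f.
    return math.ceil(math.sqrt(est_bytes / budget_bytes))


def regrid_mean(grid, scale_x, scale_y):
    """Aggregation: average each scale_x-by-scale_y block of a 2-D list."""
    rows, cols = len(grid), len(grid[0])
    reduced = []
    for r in range(0, rows, scale_y):
        reduced_row = []
        for c in range(0, cols, scale_x):
            block = [
                grid[i][j]
                for i in range(r, min(r + scale_y, rows))
                for j in range(c, min(c + scale_x, cols))
            ]
            reduced_row.append(sum(block) / len(block))
        reduced.append(reduced_row)
    return reduced


def bernoulli_sample(cells, probability, seed):
    """Sampling: keep each cell independently with the given probability."""
    rng = random.Random(seed)
    return [cell for cell in cells if rng.random() < probability]


def filter_cells(cells, predicate):
    """Filtering: keep only cells that satisfy a user-specified predicate."""
    return [cell for cell in cells if predicate(cell)]


factor = choose_scale_factor(est_bytes=8e9, budget_bytes=5e8)
coarse = regrid_mean([[1, 2, 3, 4], [5, 6, 7, 8]], 2, 2)
sample = bernoulli_sample(list(range(1000)), probability=0.1, seed=42)
snowy = filter_cells([0.2, 0.7, 0.9], lambda value: value > 0.5)
```

Aggregation and sampling trade accuracy for a bounded result size, which matches the summary's point that the system guarantees response time but not data quality.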