TimeCleanser: A Visual Analytics Approach for Data Cleansing of Time-Oriented Data Theresia Gschwandtner, Wolfgang Aigner, Silvia Miksch, Johannes Gärtner,

Presentation transcript:

TimeCleanser: A Visual Analytics Approach for Data Cleansing of Time-Oriented Data Theresia Gschwandtner, Wolfgang Aigner, Silvia Miksch, Johannes Gärtner, Simone Kriglstein, Margit Pohl, Nik Suchy

Motivation

Overview: TimeCleanser (special quality checks for time-induced problems); Evaluation of TimeCleanser; Results; Derived design principles; Conclusion

TimeCleanser – Special Focus on Time
Time: point in time; interval
Time-Oriented Values: values that change over time (e.g., sales per hour)
Multiple Data Sets

TimeCleanser – Design and Development: Tight collaboration with a software company; Working at the company site for over 3 years; Problem analysis, design, and implementation; Frequent feedback sessions; Evaluation

Requirement Analysis: taxonomy of time-oriented quality problems [Gschwandtner et al., 2012]; real-life experience of the partner company

TimeCleanser

Time Checks – Examples (intervals): Same durations; Minimum and maximum duration; Obligatory gaps, e.g., a break in the night. [Figure: time axis labeled 8pm and 7am]
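
A minimal sketch of the interval checks named on this slide, assuming intervals are given as (start, end) pairs of Python datetime objects. The function names, the sample data, and the 8pm/7am defaults for the nightly break are illustrative assumptions, not TimeCleanser's implementation.

```python
from datetime import datetime, time, timedelta

def duration_violations(intervals, min_duration, max_duration):
    """Return indices of (start, end) intervals whose duration is out of bounds."""
    return [i for i, (start, end) in enumerate(intervals)
            if not (min_duration <= end - start <= max_duration)]

def gap_violations(intervals, gap_start=time(20, 0), gap_end=time(7, 0)):
    """Return indices of intervals that reach into the obligatory nightly gap.

    An interval is accepted only if it starts and ends on the same day,
    starts at or after gap_end, and ends at or before gap_start.
    """
    flagged = []
    for i, (start, end) in enumerate(intervals):
        same_day = start.date() == end.date()
        in_daytime = gap_end <= start.time() and end.time() <= gap_start
        if not (same_day and in_daytime):
            flagged.append(i)
    return flagged

# Hypothetical working-hours entries (assumed data, for illustration only)
intervals = [
    (datetime(2014, 1, 6, 8, 0), datetime(2014, 1, 6, 16, 0)),   # plausible shift
    (datetime(2014, 1, 6, 19, 0), datetime(2014, 1, 6, 21, 0)),  # runs into the night
    (datetime(2014, 1, 7, 9, 0), datetime(2014, 1, 7, 9, 5)),    # suspiciously short
]
too_short_or_long = duration_violations(
    intervals, timedelta(minutes=30), timedelta(hours=10))
in_night_break = gap_violations(intervals)
```

Both functions only flag entries; whether a flagged interval is an error is left to the analyst.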

Time Checks – Examples (temporal range): IDs should cover the same temporal range (with some tolerance), e.g., different departments. [Figure: entries for IDs A and B along the time axis]
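
The per-ID range check could be sketched as follows, assuming records are (id, timestamp) pairs. Comparing each ID against the overall earliest and latest time stamp is one possible reading of "same temporal range"; the names and sample data are hypothetical.

```python
from datetime import datetime, timedelta

def range_mismatches(records, tolerance):
    """Return IDs whose covered range deviates from the overall range.

    records: iterable of (id, timestamp) pairs.
    An ID is flagged if its first entry is later than the overall start,
    or its last entry earlier than the overall end, by more than tolerance.
    """
    ranges = {}
    for key, ts in records:
        lo, hi = ranges.get(key, (ts, ts))
        ranges[key] = (min(lo, ts), max(hi, ts))
    if not ranges:
        return []
    overall_start = min(lo for lo, _ in ranges.values())
    overall_end = max(hi for _, hi in ranges.values())
    return sorted(key for key, (lo, hi) in ranges.items()
                  if lo - overall_start > tolerance
                  or overall_end - hi > tolerance)

# Hypothetical departments A and B (assumed data)
records = [
    ("A", datetime(2014, 1, 1)), ("A", datetime(2014, 1, 31)),
    ("B", datetime(2014, 1, 1)), ("B", datetime(2014, 1, 20)),
]
mismatched_ids = range_mismatches(records, tolerance=timedelta(days=1))
```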

Time-Oriented Value Checks – Examples: Valid minimum and maximum values within a given temporal range, e.g., sales of one hour. [Figure: value over time]

Time-Oriented Value Checks – Examples: Valid minimum and maximum values within a given temporal range; Valid value sequences, e.g., ready – start – operate – end. [Figure: value over time with states X, Y, Z]
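
A small sketch of a value-sequence check, using the slide's example states. It assumes that the sequence repeats, so that "ready" follows "end"; that assumption and the function name are not from the slides.

```python
EXPECTED_CYCLE = ["ready", "start", "operate", "end"]

def sequence_violations(values, cycle=EXPECTED_CYCLE):
    """Return positions i where values[i] is not the expected successor
    of values[i - 1]. Assumes the cycle restarts after its last state."""
    successor = {state: cycle[(k + 1) % len(cycle)]
                 for k, state in enumerate(cycle)}
    return [i for i in range(1, len(values))
            if successor.get(values[i - 1]) != values[i]]

# Hypothetical log: the last entry skips the "start" state
observed = ["ready", "start", "operate", "end", "ready", "operate"]
bad_positions = sequence_violations(observed)
```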

Multiple Data Sets Checks – Examples: Data should cover the same temporal range (with some tolerance); Contain intervals of the same length; Contain time stamps of the same precision. [Figure: data sets A and B over time, with time stamps 8:02 and 9:01 versus 8:00 and 9:00]
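
One way to compare time stamp precision across data sets is to infer, per data set, the coarsest unit in which all of its time stamps are exact. This is a heuristic sketch with hypothetical names: a data set recorded to the minute could look hour-precise by coincidence.

```python
from datetime import datetime

def inferred_precision(timestamps):
    """Return the coarsest unit in which every time stamp is exact."""
    if any(ts.second or ts.microsecond for ts in timestamps):
        return "second or finer"
    if any(ts.minute for ts in timestamps):
        return "minute"
    if any(ts.hour for ts in timestamps):
        return "hour"
    return "day"

# Hypothetical data sets using the time stamps shown on the slide
set_one = [datetime(2014, 1, 6, 8, 2), datetime(2014, 1, 6, 9, 1)]
set_two = [datetime(2014, 1, 6, 8, 0), datetime(2014, 1, 6, 9, 0)]
same_precision = inferred_precision(set_one) == inferred_precision(set_two)
```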

Summary - Checks
Syntax Checks
Time Checks: Valid overall temporal range; Durations/interval length; Missing time point or interval; Entries for different IDs cover same temporal range
Time-Oriented Value Checks: Valid minimum and maximum values within a given temporal range; Values which do not change for too long; Valid timing of values; Valid value sequences; Valid intervals between subsequent values
Multiple Data Sets: Cover same temporal range; Contain intervals of equal length; Contain time stamps of same precision

Visualizations: Overview of values over time

Visualizations: Difference plot of subsequent data values
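
The quantity behind such a difference plot can be sketched in a few lines. The threshold-based flagging and the sample velocities are assumptions added for illustration; the slides only name the plot.

```python
def successive_differences(values):
    """Difference between each value and its predecessor."""
    return [b - a for a, b in zip(values, values[1:])]

def jump_positions(values, max_jump):
    """Indices i where abs(values[i] - values[i - 1]) exceeds max_jump."""
    return [i + 1 for i, d in enumerate(successive_differences(values))
            if abs(d) > max_jump]

# Hypothetical velocity readings with one implausible spike
speeds = [50, 52, 51, 140, 53]
diffs = successive_differences(speeds)
jumps = jump_positions(speeds, max_jump=30)
```

A single spike shows up as two large differences of opposite sign, which is why both the spike and the following value are flagged.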

Visualizations: Heatmap of interval lengths and data values

Evaluation – Focus Group
Participants: 2 data analysts of our partner company (target users); 2 HCI experts
Session: 2 scenarios (GPS data and working hours)
Tasks:
1. Remove syntax errors
2. Check interval lengths
3. Check plausibility of velocity values (GPS data set only)
4. Check validity of working hours and of weekly profiles (working hours data set only)
Audio and video recording

Design Principle 1: Data cleansing is a sequential task with loops. Steps: correct syntax; assign semantic roles (From, To, Value, Differentiator)

Design Principle 1: Data cleansing is a sequential task with loops. Steps: correct syntax; assign semantic roles; overview; zoom & filter; analysis & details on demand. Targets: time values; data values; sequences; multiple data sets

Design Principle 2: Complex quality problems are best spotted with visualizations ‘You get a picture of the data set, not only of erroneous entries, but also of how the data looks like and how it should look like.’

Design Principle 3: Visualizations and raw data tables are complementary

Design Principle 4: Algorithmic means are suited to identify precisely definable errors

Design Principle 5: Original data needs to be preserved. Correct data right away for further processing; Confer with customers later; Quickly undo changes
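
This principle can be illustrated with a small sketch that keeps an untouched copy of the data and records every correction so it can be undone. The class and method names are hypothetical, and rows are assumed to be flat dictionaries; nothing here reflects how TimeCleanser stores its data.

```python
class CleansingSession:
    """Applies corrections to a working copy and keeps the original intact."""

    def __init__(self, rows):
        self.original = tuple(dict(row) for row in rows)  # never modified
        self.current = [dict(row) for row in rows]        # working copy
        self._history = []                                # (index, field, old value)

    def correct(self, index, field, new_value):
        """Change one cell and remember the previous value."""
        self._history.append((index, field, self.current[index][field]))
        self.current[index][field] = new_value

    def undo(self):
        """Revert the most recent correction; return False if none is left."""
        if not self._history:
            return False
        index, field, old_value = self._history.pop()
        self.current[index][field] = old_value
        return True

# Hypothetical entry with a decimal comma that needs correcting
session = CleansingSession([{"hours": "8,5"}])
session.correct(0, "hours", "8.5")
corrected = session.current[0]["hours"]
preserved = session.original[0]["hours"]
```

Keeping the history as a list of single-cell changes allows corrections to be applied right away while still allowing a later rollback, for example after conferring with the customer.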

Design Principles – Summary
1. Data cleansing is a sequential task with loops
2. Complex quality problems are best spotted with visualizations
3. Visualizations and raw data tables are complementary
4. Algorithmic means are suited to identify precisely definable errors
5. Original data needs to be preserved

Negative Points and Possible New Features: More interactive features would be necessary (HCI experts); Synchronized zooming for multiple visualizations; Linking and brushing between visualizations and data tables; Statistics about string lengths to support the detection of outliers; Use of wildcards and regular expressions for filter functionality; A one-page statistical summary of the data set (e.g., minimum, maximum, average, distribution)
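
Two of the requested features, a statistical summary and string-length statistics, could look roughly like the sketch below. The function names are hypothetical, and treating every length other than the most common one as unusual is a simplifying assumption.

```python
from collections import Counter
from statistics import mean

def summarize(values):
    """Basic summary of a numeric column."""
    return {"count": len(values), "min": min(values),
            "max": max(values), "mean": mean(values)}

def unusual_lengths(strings):
    """Return strings whose length differs from the most common length."""
    if not strings:
        return []
    counts = Counter(len(s) for s in strings)
    common_length, _ = counts.most_common(1)[0]
    return [s for s in strings if len(s) != common_length]

# Hypothetical columns (assumed data)
summary = summarize([2, 4, 6])
odd_entries = unusual_lengths(["08:00", "09:00", "9:00", "10:00"])
```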

Conclusion: Very close collaboration with target users; Systematic list of data quality checks; Sequence of cleansing steps; Design principles for data cleansing support (with special focus on time-oriented data); Need for visualizations in complex error detection and cleansing tasks

Results – Topics which were discussed: Traditional methods; Workflow; Advantages of TimeCleanser; Attitudes towards visualizations; Intertwinedness of analytical and visual methods; Negative points and possible new features → Design principles

Syntax Checks

Time-Oriented Value Checks

Evaluation – Questions
(1) Does the prototype help the target users to perform data cleansing tasks?
(2) Is an integration of visualization methods useful?
(3) What are the advantages and disadvantages in comparison with the data cleansing methods they have used so far?
(4) For which tasks are visualization methods, common data cleansing analysis methods, and a combination of both suitable?
(5) Which interaction methods for the visualizations are useful to support users' working steps to perform data cleansing tasks?

TimeCleanser

Lessons Learned
1. Automatic methods are preferred in cases which are easily defined
2. Visualizations are superior when judging plausibility
3. Analysts appreciated the use of visualizations as an interactive analysis tool
4. Efficient connection of visualizations to raw data and a side-by-side display is important

TimeCleanser – Special Focus on Time
Time: point in time; interval
Time-Oriented Values: values that change over time; values aggregated in a time span; values that sum up vs. values that do not sum up (e.g., sales per hour vs. employees present)
Multiple Data Sets: cover same temporal range; contain intervals of equal length; contain time stamps of same precision

TimeCleanser – Design and Development: Tight collaboration with a software company; Working at the company site for over 3 years; Problem analysis, design, and implementation (CEO, data analysts, software developers, VA experts); Frequent feedback sessions (CEO, VA experts, software developers); Evaluation (data analysts, HCI experts)

TimeCleanser

Intervals: start – end; end – start; start – start; end – end. [Figure: intervals along the time axis]

Time Checks – Intervals: Same durations; Minimum and maximum duration; No gaps

Time Checks – Points in time: Evenly spaced; Minimum and maximum intervals between points; Missing values; Obligatory gaps. [Figure: time axis labeled 8pm and 7am]
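
For evenly spaced points in time, missing values can be located by walking from each observation to the next in steps of the expected spacing. The sketch assumes sorted time stamps and a known step; the function name and data are hypothetical.

```python
from datetime import datetime, timedelta

def missing_points(timestamps, step):
    """Return expected time stamps that are absent between observations.

    Assumes timestamps are sorted in ascending order.
    """
    missing = []
    for previous, current in zip(timestamps, timestamps[1:]):
        expected = previous + step
        while expected < current:
            missing.append(expected)
            expected += step
    return missing

# Hypothetical series sampled every 20 minutes, with 8:40 missing
observed = [datetime(2014, 1, 6, 8, 0),
            datetime(2014, 1, 6, 8, 20),
            datetime(2014, 1, 6, 9, 0)]
gaps = missing_points(observed, timedelta(minutes=20))
```

An obligatory gap, such as a nightly break, would have to be excluded from the result before reporting missing values.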

Time Checks – Temporal range: Valid overall temporal range (with some tolerance), e.g., no data from 1980 or from the future. [Figure: time axis]
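
A check of the overall temporal range reduces to comparing each time stamp with an accepted window. The bounds below are hypothetical; in practice they would come from the analyst's knowledge of the data set.

```python
from datetime import datetime

def outside_range(timestamps, earliest, latest):
    """Return indices of time stamps outside [earliest, latest]."""
    return [i for i, ts in enumerate(timestamps)
            if not (earliest <= ts <= latest)]

# Hypothetical entries: one far in the past, one in the future
stamps = [datetime(1980, 1, 1),
          datetime(2014, 3, 5, 8, 0),
          datetime(2031, 1, 1)]
implausible = outside_range(stamps, datetime(2013, 1, 1), datetime(2014, 12, 31))
```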

Time-Oriented Value Checks: Valid minimum and maximum values within a given temporal range, e.g., sales of one hour vs. sales of one day. [Figure: value over time]

Time-Oriented Value Checks – Examples: Valid minimum and maximum values within a given temporal range; Values which do not change for too long, e.g., error values. [Figure: value over time]
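
Values that stay constant for too long can be found by scanning for runs of identical consecutive values. This sketch measures "too long" in number of samples rather than elapsed time, which is a simplifying assumption; the names and data are hypothetical.

```python
def stuck_runs(values, max_run):
    """Return (start_index, length) for each run of identical consecutive
    values that is longer than max_run samples."""
    runs = []
    start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the data or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start > max_run:
                runs.append((start, i - start))
            start = i
    return runs

# Hypothetical sensor readings where -1 is repeated as an error value
readings = [3, 5, -1, -1, -1, -1, 4]
suspicious = stuck_runs(readings, max_run=3)
```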

Time-Oriented Value Checks: Valid minimum and maximum values within a given temporal range; Values which do not change for too long; Valid timing of values, e.g., values sent by a server. [Figure: value over time, with time stamps 8:05, 9:05, 10:05 and 8:00, 8:20, 8:40, 9:00, 9:20, 9:40, 10:00]

Time-Oriented Value Checks: Valid minimum and maximum values within a given temporal range; Values which do not change for too long; Valid timing of values; Valid value sequences; Valid intervals between subsequent values, e.g., start to end: 1 to 10 minutes. [Figure: value over time]
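
The last check on this slide, valid intervals between subsequent values, can be sketched with the slide's own example of 1 to 10 minutes from start to end. Events are assumed to be sorted (timestamp, value) pairs; the function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta

def transition_time_violations(events, first="start", second="end",
                               shortest=timedelta(minutes=1),
                               longest=timedelta(minutes=10)):
    """Return indices of `second` events whose delay after the directly
    preceding `first` event lies outside [shortest, longest].

    events: list of (timestamp, value) pairs sorted by timestamp.
    """
    flagged = []
    for i in range(1, len(events)):
        (t0, v0), (t1, v1) = events[i - 1], events[i]
        if v0 == first and v1 == second and not (shortest <= t1 - t0 <= longest):
            flagged.append(i)
    return flagged

# Hypothetical event log: the second start-to-end span takes 30 minutes
events = [
    (datetime(2014, 1, 6, 8, 0), "start"),
    (datetime(2014, 1, 6, 8, 5), "end"),
    (datetime(2014, 1, 6, 9, 0), "start"),
    (datetime(2014, 1, 6, 9, 30), "end"),
]
late_ends = transition_time_violations(events)
```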

Multiple Data Sets: Data should cover the same temporal range (with some tolerance), e.g., combine working hours and sales data of last month; Contain time stamps of the same precision. [Figure: data sets A and B over time]

Visualizations: Interval length as bars over time

Design Principle 1: Data cleansing is a sequential task with loops. Steps: correct syntax; assign semantic roles; overview; zoom & filter; analysis & details on demand. Additions to Shneiderman's Visual Information Seeking Mantra [Shneiderman, 1996]: ‘correct syntax first, assign semantic roles, overview, zoom and filter, then analysis and details-on-demand’

Design Principle 1: Data cleansing is a sequential task with loops. Steps: correct syntax; assign semantic roles; overview; zoom & filter; analysis & details on demand. Additions to Keim's Visual Analytics mantra [Keim et al., 2008]: ‘correct syntax first – assign semantic roles – overview – analyse – show the important – zoom, filter and analyse further – details on demand’

Design Principle 4: Algorithmic means are suited to identify precisely definable errors

‘The means for automatic corrections are very useful and allow for an immediate correction of typical errors.’
