Published by Britton Lambert. Modified over 9 years ago.
1
TimeCleanser: A Visual Analytics Approach for Data Cleansing of Time-Oriented Data
Theresia Gschwandtner, Wolfgang Aigner, Silvia Miksch, Johannes Gärtner, Simone Kriglstein, Margit Pohl, Nik Suchy
2
Motivation
6
Overview: TimeCleanser (special quality checks for time-induced problems); Evaluation of TimeCleanser; Results; Derived design principles; Conclusion
7
TimeCleanser – Special Focus on Time. Time: point in time, interval. Time-Oriented Values: values that change over time. Multiple Data Sets.
8
TimeCleanser – Special Focus on Time. Time: point in time, interval. Time-Oriented Values: values that change over time. Multiple Data Sets. Example: sales per hour.
9
TimeCleanser – Design and Development: tight collaboration with a software company; working at the company site for over 3 years; problem analysis, design, and implementation; frequent feedback sessions; evaluation
10
Requirement Analysis: taxonomy of time-oriented quality problems [Gschwandtner et al., 2012]; real-life experience of partner company. [Figure: pages 1 to 3 of the taxonomy]
11
TimeCleanser
12
Time Checks – Examples. Intervals: same durations; minimum and maximum duration; obligatory gaps, e.g., a break in the night. [Figure: timeline with labels 8pm and 7am]
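The interval checks on this slide can be sketched in a few lines. This is a minimal illustration, not TimeCleanser's implementation: the function names, the duration bounds, and the reading of the timeline labels as a nightly gap from 8pm to 7am are all assumptions made for the example.

```python
from datetime import datetime, time, timedelta

def duration_violations(intervals, min_len, max_len):
    """Return the (start, end) intervals whose duration is outside [min_len, max_len]."""
    return [(s, e) for s, e in intervals if not (min_len <= e - s <= max_len)]

def gap_violations(intervals, day_start=time(7, 0), day_end=time(20, 0)):
    """Return intervals that reach into the obligatory nightly gap.

    Assumption: the gap lasts from day_end (8pm) to day_start (7am), so a
    valid interval starts and ends on the same day within 07:00..20:00.
    """
    bad = []
    for s, e in intervals:
        same_day = s.date() == e.date()
        if not (same_day and s.time() >= day_start and e.time() <= day_end):
            bad.append((s, e))
    return bad
```

Both functions only report suspicious intervals and leave the input untouched, so the analyst decides what to correct.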
13
Time Checks – Examples. Temporal range: entries for different IDs should cover the same temporal range (with some tolerance), e.g., different departments. [Figure: timelines of IDs A and B over time]
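One possible way to express the temporal range check per ID is shown below. It is a sketch under assumptions: records are taken to be (id, timestamp) pairs, and an ID is flagged when its first or last time stamp is further than the tolerance from the overall range. The function name is hypothetical.

```python
from datetime import datetime, timedelta

def range_mismatches(records, tolerance):
    """Return the IDs whose covered range deviates from the overall range.

    records: iterable of (id, timestamp) pairs (assumed input shape).
    """
    spans = {}
    for key, ts in records:
        lo, hi = spans.get(key, (ts, ts))
        spans[key] = (min(lo, ts), max(hi, ts))
    overall_lo = min(lo for lo, _ in spans.values())
    overall_hi = max(hi for _, hi in spans.values())
    return sorted(
        key
        for key, (lo, hi) in spans.items()
        if lo - overall_lo > tolerance or overall_hi - hi > tolerance
    )
```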
14
Time-Oriented Value Checks – Examples. Valid minimum and maximum values within a given temporal range, e.g., sales of one hour. [Figure: value over time]
15
Time-Oriented Value Checks – Examples. Valid minimum and maximum values within a given temporal range; valid value sequences, e.g., ready – start – operate – end. [Figure: value over time with the value sequence X, X, Y, Y, Z, Z]
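A value sequence check such as ready – start – operate – end can be modeled as a table of allowed transitions. The sketch below is illustrative only: the transition table (including the assumptions that "operate" may repeat and that "end" is followed by "ready") is invented for the example and not taken from the slides.

```python
# Hypothetical transition table for the slide's example sequence.
ALLOWED_NEXT = {
    "ready": {"start"},
    "start": {"operate"},
    "operate": {"operate", "end"},  # assumption: operate may repeat
    "end": {"ready"},               # assumption: the cycle restarts at ready
}

def invalid_transitions(states, allowed=ALLOWED_NEXT):
    """Return the indices i where states[i] may not follow states[i - 1]."""
    return [
        i
        for i in range(1, len(states))
        if states[i] not in allowed.get(states[i - 1], set())
    ]
```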
16
Multiple Data Sets Checks – Examples. Data should cover the same temporal range (with some tolerance); contain intervals of the same length; contain time stamps of the same precision. [Figure: data sets A and B over time, with time stamps 8:02 and 9:01 vs. 8:00 and 9:00]
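The precision comparison in the figure (8:02 and 9:01 against 8:00 and 9:00) could be approximated with a simple heuristic: look at the finest time unit that is actually used in each data set. This is a sketch with a hypothetical function name; it cannot distinguish truly hourly data from minute-precision data that happens to fall on full hours.

```python
from datetime import datetime

def precision(timestamps):
    """Return the finest unit used by any time stamp: 'second', 'minute', or 'hour'."""
    if any(t.second or t.microsecond for t in timestamps):
        return "second"
    if any(t.minute for t in timestamps):
        return "minute"
    return "hour"
```

Two data sets with different results are candidates for rounding or resampling before they are combined.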
17
Summary – Checks
Syntax Checks
Time Checks: valid overall temporal range; durations/interval length; missing time point or interval; entries for different IDs cover same temporal range
Time-Oriented Value Checks: valid minimum and maximum values within a given temporal range; values which do not change for too long; valid timing of values; valid value sequences; valid intervals between subsequent values
Multiple Data Sets: cover same temporal range; contain intervals of equal length; contain time stamps of same precision
18
Visualizations Overview of values over time
19
Visualizations Difference plot of subsequent data values
20
Visualizations Heatmap of interval lengths and data values
21
Evaluation – Focus Group. Participants: 2 data analysts of our partner company (target users), 2 HCI experts. Session: 2 scenarios (GPS data and working hours). Tasks: 1. Remove syntax errors; 2. Check interval lengths; 3. Check plausibility of velocity values (GPS data set only); 4. Check validity of working hours and of weekly profiles (working hours data set only). Audio and video recording.
22
Design Principle 1: Data cleansing is a sequential task with loops. Workflow: correct syntax
23
Design Principle 1: Data cleansing is a sequential task with loops. Workflow: correct syntax → assign semantic roles (From, To, Value, Differentiator)
24
Design Principle 1: Data cleansing is a sequential task with loops. Workflow: correct syntax → assign semantic roles → overview → zoom & filter → analysis & details on demand
25
Design Principle 1: Data cleansing is a sequential task with loops. Workflow: correct syntax → assign semantic roles → overview → zoom & filter → analysis & details on demand (for time values, data values, sequences, and multiple data sets)
27
Design Principle 2: Complex quality problems are best spotted with visualizations ‘You get a picture of the data set, not only of erroneous entries, but also of how the data looks like and how it should look like.’
28
Design Principle 3: Visualizations and raw data tables are complementary
29
Design Principle 4: Algorithmic means are suited to identify precisely definable errors
30
Design Principle 5: Original data needs to be preserved. Correct data right away for further processing; confer with customers later; quickly undo changes.
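One way to honor this principle is to never edit the source rows and to keep corrections as a separate, reversible change log. The class below is a hypothetical sketch of that idea, not a description of how TimeCleanser stores its data.

```python
class CleansingSession:
    """Keeps the original rows untouched and applies corrections on demand."""

    def __init__(self, rows):
        self._original = tuple(rows)  # never modified after construction
        self._changes = []            # ordered list of (index, new_value)

    def correct(self, index, new_value):
        """Record a correction without touching the original data."""
        self._changes.append((index, new_value))

    def undo(self):
        """Drop the most recent correction, if any."""
        if self._changes:
            self._changes.pop()

    def current(self):
        """Return the cleansed view: original rows with all corrections applied."""
        rows = list(self._original)
        for index, value in self._changes:
            rows[index] = value
        return rows

    @property
    def original(self):
        return list(self._original)
```

Because the cleansed view is derived from the log, the analyst can continue with corrected data right away and still show the customer the unmodified input later.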
31
Design Principles – Summary 1. Data cleansing is a sequential task with loops 2. Complex quality problems are best spotted with visualizations 3. Visualizations and raw data tables are complementary 4. Algorithmic means are suited to identify precisely definable errors 5. Original data needs to be preserved
32
Negative Points and Possible New Features: more interactive features would be necessary (HCI experts); synchronized zooming for multiple visualizations; linking and brushing between visualizations and data tables; statistics about string lengths to support the detection of outliers; use of wildcards and regular expressions for filter functionality; a one-page statistical summary of the data set (e.g., minimum, maximum, average, distribution)
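Two of the requested features, string length statistics and regular expression filters, are easy to prototype. The following is a rough sketch with hypothetical function names; treating every value that deviates from the most common length as an outlier is a deliberately simple assumption.

```python
import re
from collections import Counter

def length_outliers(values):
    """Return the values whose string length differs from the most common length."""
    lengths = Counter(len(v) for v in values)
    common_length, _ = lengths.most_common(1)[0]
    return [v for v in values if len(v) != common_length]

def regex_filter(values, pattern):
    """Return the values that match the regular expression completely."""
    rx = re.compile(pattern)
    return [v for v in values if rx.fullmatch(v)]
```

For a column of time strings, `length_outliers` would surface an entry such as "9:00" among "HH:MM" values, while `regex_filter(values, r"\d{2}:\d{2}")` keeps only well-formed entries.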
33
Conclusion: very close collaboration with target users; systematic list of data quality checks; sequence of cleansing steps; design principles for data cleansing support (with special focus on time-oriented data); need for visualizations in complex error detection and cleansing tasks
35
Results – Topics which were discussed: traditional methods; workflow; advantages of TimeCleanser; attitudes towards visualizations; intertwinedness of analytical and visual methods; negative points and possible new features
36
Results – Topics which were discussed: traditional methods; workflow; advantages of TimeCleanser; attitudes towards visualizations; intertwinedness of analytical and visual methods; negative points and possible new features. Outcome: design principles
37
Syntax Checks
38
Time-Oriented Value Checks
41
Evaluation – Questions: (1) Does the prototype help the target users to perform data cleansing tasks? (2) Is an integration of visualization methods useful? (3) What are the advantages and disadvantages in comparison with the data cleansing methods they have used so far? (4) For which tasks are visualization methods, common data cleansing analysis methods, and a combination of both suitable? (5) Which interaction methods for the visualizations are useful to support users’ working steps when performing data cleansing tasks?
42
TimeCleanser
44
Design Principle 1: Data cleansing is a sequential task with loops. Workflow: correct syntax → assign semantic roles → overview
45
Design Principle 1: Data cleansing is a sequential task with loops. Workflow: correct syntax → assign semantic roles → overview → zoom & filter
46
Design Principle 1: Data cleansing is a sequential task with loops. Workflow: correct syntax → assign semantic roles → overview → zoom & filter → analysis & details on demand (for time values, data values, sequences, and multiple data sets). Additions to Shneiderman's Visual Information Seeking Mantra: ‘correct syntax first, assign semantic roles, overview, zoom and filter, then analysis and details-on-demand’
47
Design Principle 1: Data cleansing is a sequential task with loops. Workflow: correct syntax → assign semantic roles → overview → zoom & filter → analysis & details on demand (for time values, data values, sequences, and multiple data sets). Additions to Keim's Visual Analytics mantra: ‘correct syntax first – assign semantic roles – overview – analyse – show the important – zoom, filter and analyse further – details on demand’
48
Lessons Learned 1. Automatic methods are preferred in cases which are easily defined 2. Visualizations are superior when judging plausibility 3. Analysts appreciated the use of visualizations as an interactive analysis tool 4. Efficient connection of visualizations to raw data and a side by side display is important
49
TimeCleanser – Special Focus on Time. Time: point in time, interval. Time-Oriented Values: values that change over time; values aggregated in a time span; values that sum up vs. values that do not sum up. Multiple Data Sets: cover same temporal range; contain intervals of equal length; contain time stamps of same precision.
51
TimeCleanser – Special Focus on Time. Time: point in time, interval. Time-Oriented Values: values that change over time; values aggregated in a time span; values that sum up vs. values that do not sum up. Multiple Data Sets: cover same temporal range; contain intervals of equal length; contain time stamps of same precision. Example: sales per hour vs. employees present.
54
TimeCleanser – Design and Development: tight collaboration with a software company; working at the company site for over 3 years; problem analysis, design, and implementation (CEO, data analysts, software developers, VA experts); frequent feedback sessions (CEO, VA experts, software developers); evaluation (data analysts, HCI experts)
55
TimeCleanser
56
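Intervals: start – end, end – start. [Figure: intervals on a timeline]
57
Intervals: start – end, end – start. [Figure: intervals on a timeline]
58
Intervals: start – end, end – start, start – start. [Figure: intervals on a timeline]
59
Intervals: start – end, end – start, start – start, end – end. [Figure: intervals on a timeline]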
60
Time Checks. Intervals: same durations
61
Time Checks – Examples. Intervals: same durations; minimum and maximum duration
62
Time Checks. Intervals: same durations; minimum and maximum duration; no gaps
63
Time Checks. Points in time
64
Time Checks. Points in time: evenly spaced; minimum and maximum intervals between points; no gaps; obligatory gaps
65
Time Checks. Points in time: evenly spaced; minimum and maximum intervals between points; no gaps; obligatory gaps
66
Time Checks. Points in time: evenly spaced; minimum and maximum intervals between points; missing values; obligatory gaps
67
Time Checks. Points in time: evenly spaced; minimum and maximum intervals between points; missing values; obligatory gaps. [Figure: timeline with labels 8pm and 7am]
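For evenly spaced points in time, missing values can be found by walking the sorted time stamps and listing every expected point that is absent. The sketch below assumes a fixed step and uses a hypothetical function name; points that fall into an obligatory gap (such as the nightly break) would also be reported and would need to be excluded separately.

```python
from datetime import datetime, timedelta

def missing_points(timestamps, step):
    """Return the expected time stamps that are absent from an evenly spaced series."""
    ordered = sorted(timestamps)
    missing = []
    for previous, following in zip(ordered, ordered[1:]):
        expected = previous + step
        while expected < following:
            missing.append(expected)
            expected += step
    return missing
```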
68
Time Checks. Temporal range: valid overall temporal range (with some tolerance), e.g., no data from 1980 or from the future. [Figure: timeline]
69
Time-Oriented Value Checks. Valid minimum and maximum values within a given temporal range, e.g., sales of one hour vs. sales of one day. [Figure: value over time]
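The "one hour vs. one day" example suggests that value bounds depend on the granularity at which values are aggregated. A minimal sketch of that idea follows, assuming values that can be summed (such as sales) and records given as (timestamp, value) pairs; the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def by_hour(ts):
    """Bucket key for hourly aggregation."""
    return ts.replace(minute=0, second=0, microsecond=0)

def by_day(ts):
    """Bucket key for daily aggregation."""
    return ts.date()

def out_of_range_buckets(records, bucket, low, high):
    """Sum values per bucket and return the buckets whose total is outside [low, high]."""
    totals = defaultdict(float)
    for ts, value in records:
        totals[bucket(ts)] += value
    return {key: total for key, total in totals.items() if not (low <= total <= high)}
```

The same limits that are plausible for one hour will usually flag every daily total, so each granularity needs its own bounds.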
70
Time-Oriented Value Checks – Examples. Valid minimum and maximum values within a given temporal range; values which do not change for too long, e.g., error values. [Figure: value over time]
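Values that do not change for too long can be detected by scanning for runs of identical values. This is a sketch with a hypothetical function name; it measures run length in number of entries, which assumes evenly spaced data, whereas a real check might measure elapsed time instead.

```python
def constant_runs(values, max_run):
    """Return (start_index, length) for every run of identical values longer than max_run."""
    runs = []
    start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the data or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start > max_run:
                runs.append((start, i - start))
            start = i
    return runs
```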
71
Time-Oriented Value Checks. Valid minimum and maximum values within a given temporal range; values which do not change for too long; valid timing of values, e.g., values sent by a server. [Figure: value over time; time stamps 8:05, 9:05, 10:05 shown against axis ticks 8:00, 8:20, 8:40, 9:00, 9:20, 9:40, 10:00]
72
Time-Oriented Value Checks. Valid minimum and maximum values within a given temporal range; values which do not change for too long; valid timing of values; valid value sequences; valid intervals between subsequent values, e.g., start to end: 1 to 10 minutes. [Figure: value over time]
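The last check, valid intervals between subsequent values, can be illustrated with the slide's own example of 1 to 10 minutes from start to end. The sketch assumes events are (timestamp, label) pairs already sorted by time; the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

def bad_event_gaps(events, first, second, min_gap, max_gap):
    """Return (t1, t2) for each adjacent first→second pair whose gap is outside [min_gap, max_gap]."""
    bad = []
    for (t1, label1), (t2, label2) in zip(events, events[1:]):
        if label1 == first and label2 == second and not (min_gap <= t2 - t1 <= max_gap):
            bad.append((t1, t2))
    return bad
```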
73
Multiple Data Sets. Data should cover the same temporal range (with some tolerance), e.g., when combining working hours and sales data of last month; contain time stamps of the same precision. [Figure: data sets A and B over time]
74
Multiple Data Sets. Data should cover the same temporal range (with some tolerance); contain intervals of the same length; contain time stamps of the same precision. [Figure: data sets A and B over time]
75
Visualizations Interval length as bars over time
76
Design Principle 1: Data cleansing is a sequential task with loops. Workflow: correct syntax → assign semantic roles → overview → zoom & filter → analysis & details on demand. Additions to Shneiderman's Visual Information Seeking Mantra [Shneiderman, 1996]: ‘correct syntax first, assign semantic roles, overview, zoom and filter, then analysis and details-on-demand’
77
Design Principle 1: Data cleansing is a sequential task with loops. Workflow: correct syntax → assign semantic roles → overview → zoom & filter → analysis & details on demand. Additions to Keim's Visual Analytics mantra [Keim et al., 2008]: ‘correct syntax first – assign semantic roles – overview – analyse – show the important – zoom, filter and analyse further – details on demand’
78
Design Principle 4: Algorithmic means are suited to identify precisely definable errors
79
‘The means for automatic corrections are very useful and allow for an immediate correction of typical errors.’