1 TimeCleanser: A Visual Analytics Approach for Data Cleansing of Time-Oriented Data Theresia Gschwandtner, Wolfgang Aigner, Silvia Miksch, Johannes Gärtner, Simone Kriglstein, Margit Pohl, Nik Suchy

2 Motivation

6 Overview TimeCleanser: special quality checks for time-induced problems; evaluation of TimeCleanser; results; derived design principles; conclusion

7 TimeCleanser – Special Focus on Time Time: point in time, interval. Time-Oriented Values: values that change over time. Multiple Data Sets.

8 TimeCleanser – Special Focus on Time Time: point in time, interval. Time-Oriented Values: values that change over time. Multiple Data Sets. Example: sales per hour.

9 TimeCleanser – Design and Development Tight collaboration with software company Working at the company site for over 3 years Problem analysis, design, and implementation Frequent feedback sessions Evaluation

10 Requirement Analysis Taxonomy of time-oriented quality problems [Gschwandtner et al., 2012]; real-life experience of partner company

11 TimeCleanser

12 Time Checks – Examples Intervals: same durations; minimum and maximum duration; obligatory gaps, e.g., break in the night (8pm to 7am)
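The interval checks on this slide could be sketched roughly as follows. This is an illustrative Python sketch, not TimeCleanser's implementation: the function names, the problem labels, and the assumption that the obligatory gap is a nightly window crossing midnight are all hypothetical.

```python
from datetime import datetime, time, timedelta

def in_nightly_gap(t, gap_start, gap_end, is_end=False):
    """True if time-of-day t lies inside a gap window that crosses midnight.

    An interval may end exactly at gap_start and may start exactly at gap_end.
    """
    if is_end:
        return t > gap_start or t <= gap_end
    return t >= gap_start or t < gap_end

def check_intervals(intervals, min_dur, max_dur, gap_start, gap_end):
    """Return (index, problem) pairs for (start, end) intervals that fail a check."""
    problems = []
    for i, (start, end) in enumerate(intervals):
        duration = end - start
        if duration < min_dur:
            problems.append((i, "too short"))
        elif duration > max_dur:
            problems.append((i, "too long"))
        # Simplification: only the two endpoints are tested against the gap.
        if (in_nightly_gap(start.time(), gap_start, gap_end)
                or in_nightly_gap(end.time(), gap_start, gap_end, is_end=True)):
            problems.append((i, "overlaps obligatory gap"))
    return problems

shifts = [
    (datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 1, 16, 0)),
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 10)),
    (datetime(2024, 1, 1, 18, 0), datetime(2024, 1, 1, 23, 0)),
]
found = check_intervals(shifts, timedelta(hours=1), timedelta(hours=10),
                        time(20, 0), time(7, 0))
```

Because only endpoints are compared with the gap window, an interval that spans the whole night would be caught by the maximum-duration check rather than by the gap check.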

13 Time Checks – Examples Temporal range: IDs should cover same temporal range (with some tolerance), e.g., different departments
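One way to read this check: compute the first and last time stamp per ID and flag IDs whose range falls short of the overall range by more than a tolerance. The sketch below assumes records are (id, timestamp) pairs; the function name and the record layout are hypothetical.

```python
from datetime import datetime, timedelta

def ids_with_deviating_range(records, tolerance):
    """Return sorted IDs whose first or last time stamp deviates from the
    overall temporal range by more than the given tolerance."""
    ranges = {}
    for rec_id, ts in records:
        lo, hi = ranges.get(rec_id, (ts, ts))
        ranges[rec_id] = (min(lo, ts), max(hi, ts))
    if not ranges:
        return []
    overall_start = min(lo for lo, _ in ranges.values())
    overall_end = max(hi for _, hi in ranges.values())
    return sorted(
        rec_id
        for rec_id, (lo, hi) in ranges.items()
        if lo - overall_start > tolerance or overall_end - hi > tolerance
    )

records = [
    ("A", datetime(2024, 1, 1)), ("A", datetime(2024, 1, 31)),
    ("B", datetime(2024, 1, 2)), ("B", datetime(2024, 1, 31)),
    ("C", datetime(2024, 1, 10)), ("C", datetime(2024, 1, 20)),
]
deviating = ids_with_deviating_range(records, timedelta(days=2))
```

Here department C covers only the middle of the month and is flagged, while B starts one day late and stays within the tolerance.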

14 Time-Oriented Value Checks – Examples Valid minimum and maximum values within a given temporal range, e.g., sales of one hour
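A minimal sketch of this range check, assuming samples are (timestamp, value) pairs and that the valid bounds are supplied by the analyst. The function name and the sample numbers are made up for illustration.

```python
from datetime import datetime

def values_out_of_range(samples, range_start, range_end, min_value, max_value):
    """Return samples inside [range_start, range_end) whose value lies
    outside the closed interval [min_value, max_value]."""
    return [
        (ts, value)
        for ts, value in samples
        if range_start <= ts < range_end and not (min_value <= value <= max_value)
    ]

hourly_sales = [
    (datetime(2024, 1, 1, 9), 120.0),
    (datetime(2024, 1, 1, 10), -5.0),      # negative sales: implausible
    (datetime(2024, 1, 1, 11), 98000.0),   # far above a plausible hourly figure
    (datetime(2024, 1, 2, 9), 99999.0),    # outside the checked temporal range
]
suspicious = values_out_of_range(
    hourly_sales, datetime(2024, 1, 1), datetime(2024, 1, 2), 0.0, 5000.0
)
```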

15 Time-Oriented Value Checks – Examples Valid minimum and maximum values within a given temporal range; valid value sequences, e.g., ready – start – operate – end
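A value-sequence check like "ready – start – operate – end" can be expressed as a table of allowed successors. The sketch below is an assumption about how such a rule might be encoded; in particular, allowing "end" to be followed by "ready" is not stated on the slide.

```python
# Allowed successor states; the "end" -> "ready" loop is an assumption.
ALLOWED_NEXT = {
    "ready": {"start"},
    "start": {"operate"},
    "operate": {"end"},
    "end": {"ready"},
}

def invalid_transitions(states, allowed_next=ALLOWED_NEXT):
    """Return (index, previous, current) for every state that is not an
    allowed successor of the state before it."""
    return [
        (i, prev, cur)
        for i, (prev, cur) in enumerate(zip(states, states[1:]), start=1)
        if cur not in allowed_next.get(prev, set())
    ]

log = ["ready", "start", "operate", "end", "ready", "operate"]
violations = invalid_transitions(log)
```

The reported index is the position of the offending entry, which makes it easy to jump to the matching row in a raw data table.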

16 Multiple Data Sets Checks – Examples Data should cover same temporal range (with some tolerance); contain intervals of same length; contain time stamps of same precision (figure: data sets A and B on a time axis, with time stamps 8:02 and 9:01 vs. 8:00 and 9:00)
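The precision check could be approximated by looking at which time fields are ever non-zero in a data set. This is a heuristic sketch with a hypothetical function name; a data set recorded to the minute whose minutes all happen to be zero would be misclassified.

```python
from datetime import datetime

def timestamp_precision(timestamps):
    """Return the coarsest unit that represents all time stamps without loss."""
    if any(ts.microsecond for ts in timestamps):
        return "microsecond"
    if any(ts.second for ts in timestamps):
        return "second"
    if any(ts.minute for ts in timestamps):
        return "minute"
    return "hour"

data_set_a = [datetime(2024, 1, 1, 8, 2), datetime(2024, 1, 1, 9, 1)]
data_set_b = [datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 1, 9, 0)]
same_precision = timestamp_precision(data_set_a) == timestamp_precision(data_set_b)
```

With the time stamps from the slide, one data set resolves to minutes and the other to full hours, so the two would be reported as having different precision.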

17 Summary – Checks Syntax Checks. Time Checks: valid overall temporal range; durations/interval length; missing time point or interval; entries for different IDs cover same temporal range. Time-Oriented Value Checks: valid minimum and maximum values within a given temporal range; values which do not change for too long; valid timing of values; valid value sequences; valid intervals between subsequent values. Multiple Data Sets: cover same temporal range; contain intervals of equal length; contain time stamps of same precision.

18 Visualizations Overview of values over time

19 Visualizations Difference plot of subsequent data values

20 Visualizations Heatmap of interval lengths and data values

21 Evaluation – Focus Group Participants: 2 data analysts of our partner company (target users), 2 HCI experts. Session: 2 scenarios (GPS data and working hours). Tasks: 1. Remove syntax errors 2. Check interval lengths 3. Check plausibility of velocity values (GPS data set only) 4. Check validity of working hours and of weekly profiles (working hours data set only). Audio and video recording.

22 Design Principle 1: Data cleansing is a sequential task with loops: correct syntax

23 Design Principle 1: Data cleansing is a sequential task with loops: correct syntax, assign semantic roles (From, To, Value, Differentiator)

24 Design Principle 1: Data cleansing is a sequential task with loops: correct syntax, assign semantic roles, overview, zoom & filter, analysis & details on demand

25 Design Principle 1: Data cleansing is a sequential task with loops: correct syntax, assign semantic roles, overview, zoom & filter, analysis & details on demand; time values, data values, sequences, multiple data sets

27 Design Principle 2: Complex quality problems are best spotted with visualizations ‘You get a picture of the data set, not only of erroneous entries, but also of how the data looks like and how it should look like.’

28 Design Principle 3: Visualizations and raw data tables are complementary

29 Design Principle 4: Algorithmic means are suited to identify precisely definable errors

30 Design Principle 5: Original data needs to be preserved Correct data right away for further processing Confer with customers later Quickly undo changes
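One simple way to satisfy this principle is to keep the imported rows immutable and record corrections as a separate, replayable list. The class below is a hypothetical sketch of that idea, not a description of how TimeCleanser stores changes.

```python
class CleansingSession:
    """Keeps the original rows untouched and applies corrections on demand."""

    def __init__(self, rows):
        self._original = tuple(rows)  # never modified after import
        self._changes = []            # list of (row_index, new_value)

    def correct(self, index, new_value):
        """Record a correction without touching the original data."""
        self._changes.append((index, new_value))

    def undo(self):
        """Drop the most recent correction, if any."""
        if self._changes:
            self._changes.pop()

    def current(self):
        """Return the data with all recorded corrections applied in order."""
        rows = list(self._original)
        for index, new_value in self._changes:
            rows[index] = new_value
        return rows

    def original(self):
        return list(self._original)

session = CleansingSession(["8:00", "9:0o", "10:00"])
session.correct(1, "9:00")
cleaned = session.current()
```

The corrected view can be used for further processing right away, while the original values stay available for a later discussion with the customer.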

31 Design Principles – Summary 1. Data cleansing is a sequential task with loops 2. Complex quality problems are best spotted with visualizations 3. Visualizations and raw data tables are complementary 4. Algorithmic means are suited to identify precisely definable errors 5. Original data needs to be preserved

32 Negative Points and Possible New Features More interactive features would be necessary (HCI experts): synchronized zooming for multiple visualizations; linking and brushing between visualizations and data tables. Statistics about string lengths to support the detection of outliers. Use of wildcards and regular expressions for filter functionality. A one-page statistical summary of the data set (e.g., minimum, maximum, average, distribution).

33 Conclusion Very close collaboration with target users Systematic list of data quality checks Sequence of cleansing steps Design principles for data cleansing support (with special focus on time-oriented data) Need of visualizations for complex error detection and cleansing tasks

35 Results – Topics which were discussed Traditional methods Workflow Advantages of TimeCleanser Attitudes towards visualizations Intertwinedness of analytical and visual methods Negative points and possible new features

36 Results – Topics which were discussed Traditional methods Workflow Advantages of TimeCleanser Attitudes towards visualizations Intertwinedness of analytical and visual methods Negative points and possible new features → Design principles

37 Syntax Checks

38 Time-Oriented Value Checks

41 Evaluation – Questions (1) Does the prototype help the target users to perform data cleansing tasks? (2) Is an integration of visualization methods useful? (3) What are the advantages and disadvantages in comparison with the data cleansing methods they have used so far? (4) For which tasks are visualization methods, common data cleansing analysis methods, and a combination of both suitable? (5) Which interaction methods for the visualizations are useful to support users’ working steps to perform data cleansing tasks?

42 TimeCleanser

44 Design Principle 1: Data cleansing is a sequential task with loops: correct syntax, assign semantic roles, overview

45 Design Principle 1: Data cleansing is a sequential task with loops: correct syntax, assign semantic roles, overview, zoom & filter

46 Design Principle 1: Data cleansing is a sequential task with loops: correct syntax, assign semantic roles, overview, zoom & filter, analysis & details on demand; time values, data values, sequences, multiple data sets. Additions to Shneiderman's Visual Information Seeking Mantra: ‘correct syntax first, assign semantic roles, overview, zoom and filter, then analysis and details-on-demand’

47 Design Principle 1: Data cleansing is a sequential task with loops: correct syntax, assign semantic roles, overview, zoom & filter, analysis & details on demand; time values, data values, sequences, multiple data sets. Additions to Keim's Visual Analytics mantra: ‘correct syntax first – assign semantic roles – overview – analyse – show the important – zoom, filter and analyse further – details on demand’

48 Lessons Learned 1. Automatic methods are preferred in cases which are easily defined 2. Visualizations are superior when judging plausibility 3. Analysts appreciated the use of visualizations as an interactive analysis tool 4. Efficient connection of visualizations to raw data and a side by side display is important

49 TimeCleanser – Special Focus on Time Time: point in time, interval. Time-Oriented Values: values that change over time; values aggregated in a time span; values that sum up vs. values that do not sum up. Multiple Data Sets: cover same temporal range; contain intervals of equal length; contain time stamps of same precision.

51 TimeCleanser – Special Focus on Time Time: point in time, interval. Time-Oriented Values: values that change over time; values aggregated in a time span; values that sum up vs. values that do not sum up. Multiple Data Sets: cover same temporal range; contain intervals of equal length; contain time stamps of same precision. Example: sales per hour vs. employees present.

54 TimeCleanser – Design and Development Tight collaboration with software company Working at the company site for over 3 years Problem analysis, design, and implementation – CEO, data analysts, software developers, VA experts Frequent feedback sessions – CEO, VA experts, software developers Evaluation – data analysts, HCI experts

55 TimeCleanser

56 Intervals: start – end, end – start

57 Intervals: start – end, end – start

58 Intervals: start – end, end – start, start – start

59 Intervals: start – end, end – start, start – start, end – end

60 Time Checks Intervals: same durations

61 Time Checks – Examples Intervals: same durations; minimum and maximum duration

62 Time Checks Intervals: same durations; minimum and maximum duration; no gaps

63 Time Checks Points in time

64 Time Checks Points in time: evenly spaced; minimum and maximum intervals between points; no gaps; obligatory gaps

65 Time Checks Points in time: evenly spaced; minimum and maximum intervals between points; no gaps; obligatory gaps

66 Time Checks Points in time: evenly spaced; minimum and maximum intervals between points; missing values; obligatory gaps

67 Time Checks Points in time: evenly spaced; minimum and maximum intervals between points; missing values; obligatory gaps (e.g., 8pm to 7am)
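For evenly spaced points in time, missing values can be found by walking the expected grid between the first and last time stamp. The sketch below assumes a fixed, known step and does not model obligatory gaps; the function name is hypothetical.

```python
from datetime import datetime, timedelta

def missing_points(timestamps, step):
    """Return the expected time stamps that are absent from a series that
    should be evenly spaced by `step`, starting at its earliest entry."""
    if not timestamps or step <= timedelta(0):
        return []
    present = set(timestamps)
    missing = []
    current, last = min(timestamps), max(timestamps)
    while current <= last:
        if current not in present:
            missing.append(current)
        current += step
    return missing

series = [
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 8, 20),
    datetime(2024, 1, 1, 9, 0),
]
gaps = missing_points(series, timedelta(minutes=20))
```

To respect an obligatory gap such as a nightly break, the expected grid would additionally need to skip time stamps that fall inside that window.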

68 Time Checks Temporal range: valid overall temporal range (with some tolerance), e.g., no data from 1980 or from the future
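The overall range check amounts to comparing each time stamp with an earliest and latest plausible date. In this sketch the bounds and the tolerance are passed in explicitly, and the function name is an assumption.

```python
from datetime import datetime, timedelta

def outside_valid_range(timestamps, earliest, latest, tolerance=timedelta(0)):
    """Return time stamps earlier than earliest - tolerance or later than
    latest + tolerance."""
    return [
        ts for ts in timestamps
        if ts < earliest - tolerance or ts > latest + tolerance
    ]

stamps = [
    datetime(1980, 1, 1),         # typical default or placeholder date
    datetime(2024, 3, 1, 12, 0),
    datetime(2024, 4, 1, 0, 30),  # half an hour past the upper bound
    datetime(2031, 1, 1),         # in the future relative to the upper bound
]
implausible = outside_valid_range(
    stamps, datetime(2024, 1, 1), datetime(2024, 4, 1), timedelta(hours=1)
)
```

In practice the upper bound would usually be the import time, so that entries dated in the future are flagged.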

69 Time-Oriented Value Checks Valid minimum and maximum values within a given temporal range, e.g., sales of one hour vs. sales of one day

70 Time-Oriented Value Checks – Examples Valid minimum and maximum values within a given temporal range; values which do not change for too long, e.g., error values
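Values that stay constant for suspiciously long can be found by scanning for runs of identical values and measuring how long each run lasts. The sketch assumes samples are (timestamp, value) pairs already sorted by time; the function name and the threshold are illustrative.

```python
from datetime import datetime, timedelta

def constant_runs(samples, max_unchanged):
    """Return (run_start, run_end, value) for every run in which the value
    stays the same for longer than max_unchanged."""
    runs = []
    if not samples:
        return runs
    run_start, run_value = samples[0]
    prev_ts = run_start
    for ts, value in samples[1:]:
        if value != run_value:
            # The run ended at the previous sample; report it if too long.
            if prev_ts - run_start > max_unchanged:
                runs.append((run_start, prev_ts, run_value))
            run_start, run_value = ts, value
        prev_ts = ts
    if prev_ts - run_start > max_unchanged:
        runs.append((run_start, prev_ts, run_value))
    return runs

readings = [
    (datetime(2024, 1, 1, 8), 5),
    (datetime(2024, 1, 1, 9), -1),
    (datetime(2024, 1, 1, 10), -1),
    (datetime(2024, 1, 1, 11), -1),
    (datetime(2024, 1, 1, 12), -1),
    (datetime(2024, 1, 1, 13), 7),
]
stuck = constant_runs(readings, timedelta(hours=2))
```

A sensor that keeps reporting an error code such as -1 shows up as one long run, which an analyst can then inspect in a plot.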

71 Time-Oriented Value Checks Valid minimum and maximum values within a given temporal range; values which do not change for too long; valid timing of values, e.g., values sent by a server (figure: values at 8:05, 9:05, and 10:05 on a time axis from 8:00 to 10:00)
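If values are expected at a fixed offset within each hour, as the 8:05, 9:05, 10:05 example suggests, the timing check reduces to comparing each time stamp's offset within the hour against that schedule. The hourly schedule and the function name are assumptions made for this sketch.

```python
from datetime import datetime

def off_schedule(timestamps, expected_minute, tolerance_seconds=0):
    """Return time stamps whose offset within the hour differs from
    expected_minute by more than tolerance_seconds."""
    late_or_early = []
    expected = expected_minute * 60
    for ts in timestamps:
        offset = ts.minute * 60 + ts.second
        distance = abs(offset - expected)
        # Distance is measured around the hour, so 59:30 is close to 00:00.
        distance = min(distance, 3600 - distance)
        if distance > tolerance_seconds:
            late_or_early.append(ts)
    return late_or_early

arrivals = [
    datetime(2024, 1, 1, 8, 5),
    datetime(2024, 1, 1, 9, 5, 30),
    datetime(2024, 1, 1, 10, 7),
]
mistimed = off_schedule(arrivals, expected_minute=5, tolerance_seconds=60)
```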

72 Time-Oriented Value Checks Valid minimum and maximum values within a given temporal range; values which do not change for too long; valid timing of values; valid value sequences; valid intervals between subsequent values, e.g., start to end: 1 to 10 minutes

73 Multiple Data Sets Data should cover same temporal range (with some tolerance), e.g., combine working hours and sales data of last month; contain time stamps of same precision

74 Multiple Data Sets Data should cover same temporal range (with some tolerance); contain intervals of same length; contain time stamps of same precision

75 Visualizations Interval length as bars over time

76 Design Principle 1: Data cleansing is a sequential task with loops: correct syntax, assign semantic roles, overview, zoom & filter, analysis & details on demand. Additions to Shneiderman's Visual Information Seeking Mantra [Shneiderman, 1996]: ‘correct syntax first, assign semantic roles, overview, zoom and filter, then analysis and details-on-demand’

77 Design Principle 1: Data cleansing is a sequential task with loops: correct syntax, assign semantic roles, overview, zoom & filter, analysis & details on demand. Additions to Keim's Visual Analytics mantra [Keim et al., 2008]: ‘correct syntax first – assign semantic roles – overview – analyse – show the important – zoom, filter and analyse further – details on demand’

78 Design Principle 4: Algorithmic means are suited to identify precisely definable errors

79 ‘The means for automatic corrections are very useful and allow for an immediate correction of typical errors.’

80 Design Principle 4: Algorithmic means are suited to identify precisely definable errors

